There are a number of reasons why the R programming language is such a popular choice for working with large statistical data sets. The most obvious is R’s built-in support for data structures that fit naturally into data science work. But R is also notable for how it pairs complex procedures with simple interfaces. There are times when you’ll want to fine-tune a procedure for lower-level control. Other situations let you run through data in a fairly standardized way that doesn’t need fine-tuning. R makes it easy to do the latter without removing options for the former. Not only does R have powerful functions like lapply, it also gives you wrapper functions like sapply. You’ll soon see how sapply gives you the power of lapply with some extra ease of use.
The Apply Family of Functions
Sapply is part of a larger group of functions usually called the apply family. Apply and its variants all follow one central principle: they apply a function to each element of a collection without requiring you to write an explicit loop. You essentially get the power of a loop without needing to create one. Using apply rather than a manual loop is typically faster to write, more concise to read, and usually no slower to run, which makes it a good default for working through large data sets.
However, there’s one big problem any data-focused language would face when implementing something like apply. A lot of R’s power comes from its data types and collections, and each data type in R has unique features and benefits that differentiate it from the rest. That type of system makes R powerful. But it also precludes a one-size-fits-all solution for something like apply. And that’s why we see a number of apply variants. For example, lapply is specially tailored around lists. It’s literally “list apply”. And we even have functions like sapply, which wraps lapply and tries to simplify the resulting list into a vector or matrix. This gives us some additional flexibility when working with a list.
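The difference between the two return types can be sketched with a small made-up list. The names `scores`, `as_list`, and `as_vector` are illustrative only and are not used elsewhere in this article.

```r
# Hypothetical example data: a named list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20))

# lapply always returns a list, with one element per input element
as_list <- lapply(scores, sum)

# sapply runs the same computation, then simplifies the result
# to a named numeric vector where it can
as_vector <- sapply(scores, sum)

print(as_list)
print(as_vector)
```

Both calls compute the same sums. The only difference is the container that holds the results.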
Starting Out With the Basics
We can start to see what makes sapply so special by first looking at how many steps are needed to perform some fairly simple calculations without it. Take a look at the following example.
ourSumVar <- 0
ourCol <- list(1,4,6,9,5,6,7,1,2,55,56,57)
for (x in ourCol) {
  ourSumVar <- ourSumVar + x
}
print(ourSumVar)
In this example we start by defining a variable, ourSumVar, to hold our progress as we work over a collection of numbers. This collection is in turn defined as ourCol in the following line. Next, we create a for loop to iterate over the values in ourCol. Each number is added to ourSumVar. And, finally, we print out the results on the last line. If you’re familiar with R’s sum you might wonder why we wouldn’t just use that. But try adding this line to the end of the script.
sum(ourCol)
Instead of a number, you’ll just see an error message stating that we’re using an incompatible data type. This harkens back to the earlier discussion of data types in R. Each data type has unique properties. And because data types are unique, we don’t see full compatibility between every function and every data type. But this is also where we can start to see the power of apply, sapply, and the lapply function.
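As a rough illustration of that incompatibility, the sketch below captures the error rather than letting it halt the script. It then shows one possible workaround, flattening the list with unlist, which is not part of the original example and is included only for comparison.

```r
ourCol <- list(1, 4, 6, 9, 5, 6, 7, 1, 2, 55, 56, 57)

# sum() rejects a list; capture the error message instead of stopping
result <- tryCatch(sum(ourCol), error = function(e) conditionMessage(e))
print(result)

# Assumed workaround: flatten the list into a numeric vector first
total <- sum(unlist(ourCol))
print(total)
```

The flattened total matches what the for loop produced by adding the values one at a time.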
Diving Into Sapply
We can take the earlier example up a level of complexity and still work through it with just a single line of code using the sapply function.
ourCol <- list(first = c(1,4,6), second = c(9,NA,6), third = c(7,1,2), fourth = c(55,56,57))
sapply(ourCol, sum)
In this script we begin by creating ourCol with a variant of our prior data. This time around each element of the list is a named numeric vector rather than a single number. But instead of using a for loop we simply call sapply and pass our collection and the function we want to perform on the individual elements. In this case, that means running sum on the first, second, third, and fourth elements and returning the results. This mostly goes as we’d expect from a standard loop. There are two main differences though.
The most important difference is that we don’t need to reformat or unpack ourCol before passing it to sum. However, also take note of the fact that the call to sapply returns an NA value for second. This is because any arithmetic that involves NA produces NA, so the single NA in second ends up defining that element’s entire sum. It’s somewhat akin to how one 0 in a chain of multiplications turns the whole product into 0. Getting around this with sapply is extremely simple though. Just change the sapply call to the following.
sapply(ourCol, sum, na.rm = TRUE)
As the name suggests, the na.rm argument removes (rm) NA (na) values before the sum is computed. sapply passes any extra arguments straight through to the function it calls, so in keeping with its moniker of “simple apply” we’re able to keep things simple. With this change in place sapply works through all of the vectors to give us the sum of each element.
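To make the effect concrete, here is a short sketch that runs both versions of the call side by side. The variable names `with_na` and `without_na` are added here for illustration.

```r
ourCol <- list(first = c(1, 4, 6), second = c(9, NA, 6),
               third = c(7, 1, 2), fourth = c(55, 56, 57))

# Without na.rm, the NA in "second" propagates into its sum
with_na <- sapply(ourCol, sum)

# Arguments after the function name are passed on to sum() itself
without_na <- sapply(ourCol, sum, na.rm = TRUE)

print(with_na)
print(without_na)
```

Only the second element changes between the two results, since it is the only vector that contains an NA.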
One interesting aspect of the apply family is that we’re technically still using a loop when we call it. And we can take advantage of that to pass individual values to a function during the loop. For example, consider a case where we might want to get the square root of every item within a list. We can accomplish this through a for loop with the following code.
for (x in ourCol) {
  print(sqrt(x))
}
But if you replace the prior sapply call with the following code we can perform this process in a more compact way.
sapply(ourCol, function(x) sqrt(x))
In this example, we’re still calling a function within sapply. But the main difference is that we’re using one that we create inside the sapply call itself, known as an anonymous function. The argument named in “function” receives each element of the collection in turn. So when we declare x as our function’s argument we can then pass it to sqrt to get the square roots. The x is supplied by the loop over ourCol that sapply performs.
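One detail worth seeing in practice is the shape of the result. The sketch below assumes the same ourCol as before and stores the output in a variable named `roots`, which is introduced here for illustration. Because every element yields a vector of the same length, sapply simplifies the result into a matrix with one column per list element.

```r
ourCol <- list(first = c(1, 4, 6), second = c(9, NA, 6),
               third = c(7, 1, 2), fourth = c(55, 56, 57))

# Each element yields a length-3 vector, so sapply binds them as columns
roots <- sapply(ourCol, function(x) sqrt(x))

print(dim(roots))        # number of rows, then number of columns
print(roots[, "first"])  # square roots of 1, 4, and 6

# Passing sqrt directly gives the same result as the anonymous wrapper
same <- identical(roots, sapply(ourCol, sqrt))
print(same)
```

The anonymous function form becomes more useful once the body does more than call a single existing function.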
Need more options? Check out the articles below.
- How To Use the apply function (matrix or data frame)
- How To Use the sapply function (simplified version of lapply)
- How To Use the lapply function (list or vector)
- How To Use the mapply function (applying a function to multiple lists or vectors)
- How To Use the tapply function (levels of a factor)
- While Loops
- For Loops
- Creating Anonymous Functions in R