How To Use tapply in R To Apply a Function to the Levels of a Factor

R is well suited to statistics and data science. It has the features you’d expect from a modern programming language, but much of its real power lies in its tools for manipulating data objects. For example, the apply function lets you run a function over the rows or columns of a matrix or data frame. R also offers a family of apply variants, such as lapply, sapply, and tapply, that provide more specialized behavior for other data structures. As you’ll soon see, you can use tapply to apply a function to each level of a factor.

Defining Our Terms

The first step to using tapply is defining what it does. The function is flexible, but it’s best described as a way to summarize a data collection by applying a supplied function separately to each group defined by one or more factors.

Tapply’s main parameters are the data to summarize, an index that supplies the factor levels, and a function to run on each group. R first splits the data into subgroups according to the index levels. Only then does it call the function, once per subgroup rather than once per element. The process can be incredibly helpful when putting together descriptive statistics for a project. But to fully understand what the function is capable of, we can jump into a simple example.
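
The split-then-apply sequence described above can be sketched by hand with base R’s split and sapply. This is only an illustration of the idea, not a description of how tapply is implemented internally, and the weights and groups vectors are made-up values used purely for the example.

```r
# Hypothetical toy data: six measurements and the group each one belongs to
weights <- c(10, 12, 20, 22, 24, 30)
groups  <- factor(c("a", "a", "b", "b", "b", "c"))

# Step 1: split the data into one subgroup per factor level
pieces <- split(weights, groups)

# Step 2: run the function once per subgroup
manual <- sapply(pieces, mean)

# tapply performs both steps in a single call
viaTapply <- tapply(weights, groups, mean)

print(manual)
print(viaTapply)
```

Both results should hold the same three group means, which is why tapply is usually the shorter way to write this pattern.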

Starting Out With Tapply’s Functionality

Tapply’s functionality can be best understood with a large data set. But we’ll need to start out with the basics before working through multiple factor levels. To begin with we’ll create a simple data set with information about some of the planets in our solar system.

ourDataFrame <- data.frame(
  # Inside data.frame(), columns are named with = rather than <-
  Planet = c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn"),
  Type = factor(c("Terrestrial", "Terrestrial", "Terrestrial",
                  "Terrestrial", "Gas", "Gas")),
  Distance = c(35000000, 67000000, 93000000,
               142000000, 484000000, 889000000),
  Circumference = c(9522, 23617, 24889, 13256, 278985, 235185)
)
# Mean distance for each level of the Type factor
ourResults <- tapply(ourDataFrame$Distance, ourDataFrame$Type, mean)
print(ourResults)

We begin with a standard data frame assignment to a variable called ourDataFrame. Note that the frame holds several data types: a character vector, a factor, and two numeric vectors. But the most important part is how the data is organized. Because every row carries a Type value, the rows can be sorted into groups by that factor. And this is exactly what happens when tapply is called after the data frame is built.

If you run this code and print ourResults, you’ll see that the output is split into two named groups: Gas and Terrestrial. Note the arguments we use. The first is the data frame’s Distance column, which holds the values to summarize. The second is the Type column, whose two factor levels account for the two groups in the output. The third argument is R’s mean function. And, indeed, the output gives us the mean distance for each planet type: 686,500,000 for Gas and 84,250,000 for Terrestrial.
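
The three arguments can also be written with the names that tapply’s documentation gives them: X, INDEX, and FUN. The following is a minimal sketch that rebuilds only the two columns it needs in a data frame called planets, a name chosen here just for illustration.

```r
# Rebuild only the two columns this example needs (planets is a hypothetical name)
planets <- data.frame(
  Type = factor(c("Terrestrial", "Terrestrial", "Terrestrial",
                  "Terrestrial", "Gas", "Gas")),
  Distance = c(35000000, 67000000, 93000000,
               142000000, 484000000, 889000000)
)

# The same call as before, with the argument names spelled out
meanDistance <- tapply(X = planets$Distance, INDEX = planets$Type, FUN = mean)

# Results are labeled with the factor levels, so they can be looked up by name
print(meanDistance["Gas"])
print(meanDistance["Terrestrial"])
```

Spelling out the names is optional, but it can make longer calls easier to read.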

One of the most impressive parts of this procedure is its sheer simplicity. A single line of code accomplished a considerable amount of program logic without any need to write a loop by hand. And we can create variations on the theme just as easily. Try replacing the tapply call with the following.

ourResults <- tapply(ourDataFrame$Circumference, ourDataFrame$Type, sum, simplify = FALSE)

This is fairly similar to the initial call. But this time we’re using Circumference as the source of our numeric data, and we’re calling sum to act on it. Take note of the additional argument as well. The simplify argument is optional and defaults to TRUE unless we explicitly set it to FALSE. With the change in place, tapply returns the results as a list, with one element per factor level, instead of the more compact and human-readable array of values. Setting simplify back to TRUE, or removing the argument entirely, returns us to the more neatly formatted style.
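
To see the difference between the two settings side by side, here is a small self-contained sketch. It uses standalone vectors named circumference and type (illustrative names, holding the same values as the data frame columns) so it can run on its own.

```r
# Same values as the Circumference and Type columns, as standalone vectors
circumference <- c(9522, 23617, 24889, 13256, 278985, 235185)
type <- factor(c("Terrestrial", "Terrestrial", "Terrestrial",
                 "Terrestrial", "Gas", "Gas"))

# Default behavior (simplify = TRUE): a plain numeric result
simplified <- tapply(circumference, type, sum)

# simplify = FALSE: a list with one element per factor level
unsimplified <- tapply(circumference, type, sum, simplify = FALSE)

print(is.list(simplified))    # FALSE
print(is.list(unsimplified))  # TRUE

# Single values are pulled out with [ ] in one case and [[ ]] in the other
print(simplified["Gas"])
print(unsimplified[["Gas"]])
```

The list form tends to be more useful when the applied function returns more than one value per group, since those results can’t be collapsed into a single vector.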

Moving Into More Advanced Functionality

You now have a solid grasp of the function’s basics. But a central question remains. How can we apply a function across the levels of more than one factor at once? And how would this work if we were using larger data sets rather than the small sample we’ve been working with up until this point? Take a look at the following code for the answer.

ourResult <- tapply(CO2$uptake, list(CO2$Type, CO2$Treatment), mean)
print(ourResult)

We begin with one of R’s built-in sample datasets. CO2 ships with R’s datasets package, so it’s available without an explicit import. It contains 84 rows and 5 columns recording the results of a cold tolerance experiment on a species of grass. In our sample, we’re particularly interested in the Type and Treatment factors. Each of these two factors has two levels: Quebec and Mississippi for Type, and nonchilled and chilled for Treatment.
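
If you’d like to confirm that layout yourself, the following sketch inspects the dataset with a few base R functions. The variable names co2Dim, typeLevels, and treatmentLevels are chosen here only for illustration.

```r
# CO2 comes from the datasets package, which R attaches by default
co2Dim <- dim(CO2)                        # rows and columns
typeLevels <- levels(CO2$Type)            # levels of the Type factor
treatmentLevels <- levels(CO2$Treatment)  # levels of the Treatment factor

print(co2Dim)
print(names(CO2))
print(typeLevels)
print(treatmentLevels)

# First few rows, to see how the measurements are laid out
print(head(CO2))
```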

We once again see how powerful tapply is by how easily it spans these factors. We pass the factors much as we did with the simpler data in the first example. CO2’s uptake column is supplied as the first argument. It’s split into subgroups based on the second argument. Note that in this case we build that second argument by passing CO2’s Type and Treatment columns to the list function, which gives tapply one grouping factor per list element. Finally, we pass mean as the function to apply to each subgroup.

Tapply works through every combination of the two factors’ levels, and the result is assigned to ourResult. Printing ourResult shows a table of mean uptake values with one row per Type level (Quebec and Mississippi) and one column per Treatment level (nonchilled and chilled). We’ve successfully applied mean across the levels of multiple factors in the CO2 data frame. Keep in mind that all of the previous information about the function’s arguments holds true here as well. We can change the optional simplify argument, call other functions, and so on.
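
As a rough sketch of those variations, the example below swaps in other functions and passes an extra argument through to the applied function. The names groups, medianUptake, trimmedMean, and spread are illustrative, and the trim value of 0.1 is an arbitrary choice for the demonstration.

```r
# Naming the list elements labels the dimensions of the result
groups <- list(Type = CO2$Type, Treatment = CO2$Treatment)

# A different summary function
medianUptake <- tapply(CO2$uptake, groups, median)
print(medianUptake)

# Arguments placed after the function are passed on to it (here, mean's trim)
trimmedMean <- tapply(CO2$uptake, groups, mean, trim = 0.1)
print(trimmedMean)

# An anonymous function works as well: the range of uptake within each group
spread <- tapply(CO2$uptake, groups, function(x) max(x) - min(x))
print(spread)

# A single cell can be looked up by its level names
print(medianUptake["Quebec", "chilled"])
```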

