Aggregate function in R is similar to tapply function, but it can accomplish more than tapply. There are multiple ways to use aggregate function, but we will show you the most straightforward and most popular way. This aggregation function can be used in an R data frame or similar data structure to create a summary statistic that combines different functions and descriptive statistics to get a sum of multiple columns of your data frame. It will sum each of the duplicate and unique values from each numeric column, combining each one into one data set and numeric value expression. This aggregate calculation can be done in base R, and the general form is:
aggregate(formula, data, function, …)
So, the aggregation function takes at least three numeric value arguments. First one is formula which takes form of y~x, where y is numeric variable to be divided and x is grouping variable. Next we specify the data, which is name of a dataframe or a list, a categorical variable that helps the aggregate calculation determine which column names from your data set to use in the data aggregation. The input parameter for this data aggregation must be a data frame, or the query will return null or missing values instead of sourcing the correct grouping elements. If you have a vector, you must convert it to dataframe and then use it. So, data argument is name of a dataframe or a list.
The scalar function we pass as an argument is a function we want to apply to each of subsets x (from formula argument). We can also pass function specific arguments (…) We can also pass function specific arguments (…)
For the demonstration purposes, I will use the CO2 example data dataset that is built-in in base R:
CO2data <- CO2
CO2data $height <- rnorm(84, mean=35, sd=0.5) ## To define a hypothetical plant height
We have two levels for variable Type: Mississippi and Quebec.
We have two levels for variable Treatment: chilled and nonchilled.
For every plant, CO2 uptake is measured.
We have six collumns and 84 rows. Lets say I am interested in looking at columns conc and uptake. So we want to ignore other columns and find mean uptake on different levels of conc variable.
The aggregate() function can help us here:
aggregate(uptake~conc, CO2data,mean)
First (before ~) we specify the uptake column because it contains the values on which we want to perform a function. After ~ we specify the conc variable, because it contains 7 categories that we will use to subset the uptake values. We also want to indicate that these values are from the CO2data dataframe. Finally we specify that we want to take a mean of each of the subsets of uptake value.
> aggregate(uptake~conc, CO2data,mean)
conc uptake
1 95 12.25833
2 175 22.28333
3 250 28.87500
4 350 30.66667
5 500 30.87500
6 675 31.95000
7 1000 33.58333
># The list of average uptake by levels of concentration
We can also use tapply() function to get same result, but the benefit of aggregate() is that it can go above and beyond what tapply() can do.
So, what if we want to subset by concentration, but we also want to subset by treatment type?
In the above example, we get average uptake by levels of concentration.
In the next example with treatment type also included, we will get average uptake on chilled treatment and nonchilled treatment by levels of concentration:
aggregate(uptake~conc+Treatment, CO2data,mean)
conc Treatment uptake
1 95 nonchilled 13.28333
2 175 nonchilled 25.11667
3 250 nonchilled 32.46667
4 350 nonchilled 35.13333
5 500 nonchilled 35.10000
6 675 nonchilled 36.01667
7 1000 nonchilled 37.38333
8 95 chilled 11.23333
9 175 chilled 19.45000
10 250 chilled 25.28333
11 350 chilled 26.20000
12 500 chilled 26.65000
13 675 chilled 27.88333
14 1000 chilled 29.78333
Compare these results with the one above where we only included conc variable.
But we can subset with more than just 2 columns. Here we add Type to other 2 columns to subset:
aggregate(uptake~conc+Treatment+Type, CO2data,mean)
conc Treatment Type uptake
1 95 nonchilled Quebec 15.26667
2 175 nonchilled Quebec 30.03333
3 250 nonchilled Quebec 37.40000
4 350 nonchilled Quebec 40.36667
5 500 nonchilled Quebec 39.60000
6 675 nonchilled Quebec 41.50000
7 1000 nonchilled Quebec 43.16667
8 95 chilled Quebec 12.86667
9 175 chilled Quebec 24.13333
10 250 chilled Quebec 34.46667
11 350 chilled Quebec 35.80000
12 500 chilled Quebec 36.66667
13 675 chilled Quebec 37.50000
14 1000 chilled Quebec 40.83333
15 95 nonchilled Mississippi 11.30000
16 175 nonchilled Mississippi 20.20000
17 250 nonchilled Mississippi 27.53333
18 350 nonchilled Mississippi 29.90000
19 500 nonchilled Mississippi 30.60000
20 675 nonchilled Mississippi 30.53333
21 1000 nonchilled Mississippi 31.60000
22 95 chilled Mississippi 9.60000
23 175 chilled Mississippi 14.76667
24 250 chilled Mississippi 16.10000
25 350 chilled Mississippi 16.60000
26 500 chilled Mississippi 16.63333
27 675 chilled Mississippi 18.26667
28 1000 chilled Mississippi 18.73333
If we want to subset to get more means instead of just uptake value, for example if we want to get means of uptake and height as well, we cannot do it in same way as we did with subset columns.
What we need to do is to use cbind():
aggregate(cbind(uptake,height)~conc, CO2data, mean)
conc uptake height
1 95 12.25833 35.09564
2 175 22.28333 35.22596
3 250 28.87500 34.85416
4 350 30.66667 34.93658
5 500 30.87500 35.05123
6 675 31.95000 35.08222
7 1000 33.58333 35.01246
So, here we have average values of uptake and height variables for each level of conc variable.
Now, in all our examples, we used mean function to get mean values of uptake (and height in one example), but that does not mean that we cannot use different functions. We could, for example, use length() to determine how many entries there is in each subset:
aggregate(uptake~conc,CO2data,length)
conc uptake
1 95 12
2 175 12
3 250 12
4 350 12
5 500 12
6 675 12
7 1000 12
In the upper example, we found out length of subsets in dataframe called CO2data. We are using groups in concentration vector (conc) to subset the data, and we want to find a length of the uptake value. So we want to know how many uptake measurements were taken out of concentration of 95, 175, 250 and so on.
You can see that we have 12 values for each group.
Note that instead of uptake, we could specify any other vector from our dataframe if we wanted to, such as height:
aggregate(height~conc, CO2data,length)
conc uptake
1 95 12
2 175 12
3 250 12
4 350 12
5 500 12
6 675 12
7 1000 12
And we get same asnwers because all of these vectors contained within CO2data dataframe have the same number of elements and therefore they would all be subsetted in same way.
Thank you for reading our tutorial.
If you want to read more about powerful functions such as aggregate(), you can check out free ebook ‘R for data science’ by Garrett Grolemund