How to aggregate multiple columns at once in R

Aggregate function in R is similar to tapply function, but it can accomplish more than tapply. There are multiple ways to use aggregate function, but we will show you the most straightforward and most popular way. This aggregation function can be used in an R data frame or similar data structure to create a summary statistic that combines different functions and descriptive statistics to get a sum of multiple columns of your data frame. It will sum each of the duplicate and unique values from each numeric column, combining each one into one data set and numeric value expression. This aggregate calculation can be done in base R, and the general form is:

aggregate(formula, data, function, …)

So, the aggregation function takes at least three numeric value arguments. First one is formula which takes form of y~x, where y is numeric variable to be divided and x is grouping variable. Next we specify the data, which is name of a dataframe or a list, a categorical variable that helps the aggregate calculation determine which column names from your data set to use in the data aggregation. The input parameter for this data aggregation must be a data frame, or the query will return null or missing values instead of sourcing the correct grouping elements. If you have a vector, you must convert it to dataframe and then use it. So, data argument is name of a dataframe or a list.

The scalar function we pass as an argument is a function we want to apply to each of subsets x (from formula argument). We can also pass function specific arguments (…) We can also pass function specific arguments (…)

For the demonstration purposes, I will use the CO2 example data dataset that is built-in in base R:

CO2data <- CO2
CO2data $height <- rnorm(84, mean=35, sd=0.5) ## To define a hypothetical plant height

We have two levels for variable Type: Mississippi and Quebec.

We have two levels for variable Treatment: chilled and nonchilled.

For every plant, CO2 uptake is measured.

We have six collumns and 84 rows. Lets say I am interested in looking at columns conc and uptake. So we want to ignore other columns and find mean uptake on different levels of conc variable.

The aggregate() function can help us here:

 aggregate(uptake~conc, CO2data,mean)

First (before ~) we specify the uptake column because it contains the values on which we want to perform a function. After ~ we specify the conc variable, because it contains 7 categories that we will use to subset the uptake values. We also want to indicate that these values are from the CO2data dataframe. Finally we specify that we want to take a mean of each of the subsets of uptake value.

> aggregate(uptake~conc, CO2data,mean)
             conc      uptake
1            95         12.25833
2            175        22.28333
3            250        28.87500
4            350        30.66667
5            500        30.87500
6            675        31.95000
7            1000       33.58333
># The list of average uptake by levels of concentration

We can also use tapply() function to get same result, but the benefit of aggregate() is that it can go above and beyond what tapply() can do.

So, what if we want to subset by concentration, but we also want to subset by treatment type?

In the above example, we get average uptake by levels of concentration.

In the next example with treatment type also included, we will get average uptake on chilled treatment and nonchilled treatment by levels of concentration:

aggregate(uptake~conc+Treatment, CO2data,mean)

   conc          Treatment            uptake

1    95          nonchilled           13.28333

2   175          nonchilled           25.11667

3   250          nonchilled           32.46667

4   350          nonchilled           35.13333

5   500          nonchilled           35.10000

6   675          nonchilled           36.01667

7  1000          nonchilled           37.38333

8    95          chilled              11.23333

9   175          chilled              19.45000

10  250          chilled              25.28333

11  350          chilled              26.20000

12  500          chilled              26.65000

13  675          chilled              27.88333

14 1000          chilled              29.78333

Compare these results with the one above where we only included conc variable.

But we can subset with more than just 2 columns. Here we add Type to other 2 columns to subset:

aggregate(uptake~conc+Treatment+Type, CO2data,mean)

    conc             Treatment                         Type                                                 uptake

1    95              nonchilled                        Quebec                                            15.26667

2   175              nonchilled                        Quebec                                             30.03333

3   250              nonchilled                        Quebec                                            37.40000

4   350              nonchilled                        Quebec                                            40.36667

5   500              nonchilled                        Quebec                                           39.60000

6   675              nonchilled                        Quebec                                             41.50000

7  1000              nonchilled                        Quebec                                            43.16667

8    95              chilled                           Quebec                                            12.86667

9   175              chilled                           Quebec                                            24.13333

10  250              chilled                           Quebec                                           34.46667

11  350              chilled                           Quebec                                           35.80000

12  500              chilled                           Quebec                                            36.66667

13  675              chilled                           Quebec                                            37.50000

14 1000              chilled                           Quebec                                          40.83333

15   95              nonchilled                        Mississippi                                      11.30000

16  175              nonchilled                        Mississippi                                     20.20000

17  250              nonchilled                        Mississippi                                      27.53333

18  350              nonchilled                        Mississippi                                      29.90000

19  500              nonchilled                        Mississippi                                      30.60000

20  675              nonchilled                        Mississippi                                     30.53333

21 1000              nonchilled                        Mississippi                                     31.60000

22   95              chilled                           Mississippi                                     9.60000

23  175              chilled                           Mississippi                                      14.76667

24  250              chilled                           Mississippi                                      16.10000

25  350              chilled                           Mississippi                                       16.60000

26  500              chilled                           Mississippi                                      16.63333

27  675              chilled                           Mississippi                                      18.26667

28 1000              chilled                           Mississippi                                       18.73333

If we want to subset to get more means instead of just uptake value, for example if we want to get means of uptake and height as well, we cannot do it in same way as we did with subset columns.

What we need to do is to use cbind():

              aggregate(cbind(uptake,height)~conc, CO2data, mean)

  conc                   uptake                               height

1   95                   12.25833                           35.09564

2  175                  22.28333                           35.22596

3  250                  28.87500                           34.85416

4  350                   30.66667                          34.93658

5  500                   30.87500                          35.05123

6  675                  31.95000                           35.08222

7 1000                 33.58333                           35.01246

So, here we have average values of uptake and height variables for each level of conc variable.

Now, in all our examples, we used mean function to get mean values of uptake (and height in one example), but that does not mean that we cannot use different functions. We could, for example, use length() to determine how many entries there is in each subset:

              aggregate(uptake~conc,CO2data,length)
            conc                     uptake

1             95                         12

2            175                      12

3            250                       12

4            350                        12

5            500                      12

6             675                     12

7            1000                    12

In the upper example, we found out length of subsets in dataframe called CO2data. We are using groups in concentration vector (conc) to subset the data, and we want to find a length of the uptake value. So we want to know how many uptake measurements were taken out of concentration of 95, 175, 250 and so on.

You can see that we have 12 values for each group.

Note that instead of uptake, we could specify any other vector from our dataframe if we wanted to, such as height:

aggregate(height~conc, CO2data,length)

             conc                     uptake

1             95                         12

2            175                      12

3            250                       12

4            350                        12

5            500                      12

6             675                     12

7            1000                    12

And we get same asnwers because all of these vectors contained within CO2data dataframe have the same number of elements and therefore they would all be subsetted in same way.

Thank you for reading our tutorial.

If you want to read more about powerful functions such as aggregate(), you can check out free ebook ‘R for data science’ by Garrett Grolemund