How to aggregate multiple columns at once in R

Aggregate function is similar to tapply function, but it can accomplish more than tapply. There are multiple ways to use aggregate function, but we will show you the most straightforward and most popular way. The general form is:

aggregate(formula, data, function, …)

So, the function takes at least three arguments. First one is formula which takes form of y~x, where y is numeric variable to be divided and x is grouping variable. Next we specify the data, which is name of a dataframe or a list. If you have a vector, you must convert it to dataframe and then use it. So, data argument is name of a dataframe or a list.

Function we pass as an argument is a function we want to apply to each of subsets x (from formula argument).
We can also pass function specific arguments (…)

For the demonstration purposes, I will use CO2 dataset that is built-in in R:

CO2data <- CO2

CO2data $height <- rnorm(84, mean=35, sd=0.5) ## To define a hypothetical plant height


We have two levels for variable Type: Mississippi and Quebec.

We have two levels for variable Treatment: chilled and nonchilled.

For every plant, CO2 uptake is measured.

We have six collumns and 84 rows. Lets say I am interested in looking at columns conc and uptake. So we want to ignore other columns and find mean uptake on different levels of conc variable.

The aggregate() function can help us here:

              aggregate(uptake~conc, CO2data,mean)

 First (before ~) we specify the uptake column because it contains the values on which we want to perform a function. After ~ we specify the conc variable, because it contains 7 categories that we will use to subset the uptake values. We also want to indicate that these values are from the CO2data dataframe. Finally we specify that we want to take a mean of each of the subsets of uptake value.

              > aggregate(uptake~conc, CO2data,mean)

              conc      uptake

1            95          12.25833

2            175        22.28333

3            250        28.87500

4             350       30.66667

5             500       30.87500

6            675        31.95000

7            1000      33.58333

># The list of average uptake by levels of concentration

We can also use tapply() function to get same result, but the benefit of aggregate() is that it can go above and beyond what tapply() can do.

So, what if we want to subset by concentration, but we also want to subset by treatment type?

In the above example, we get average uptake by levels of concentration.

In the next example with treatment type also included, we will get average uptake on chilled treatment and nonchilled treatment by levels of concentration:

              aggregate(uptake~conc+Treatment, CO2data,mean)

                 conc                  Treatment          uptake

1    95                  nonchilled           13.28333

2   175                 nonchilled           25.11667

3   250                 nonchilled           32.46667

4   350                 nonchilled           35.13333

5   500                 nonchilled           35.10000

6   675                 nonchilled           36.01667

7  1000                nonchilled           37.38333

8    95                   chilled                11.23333

9   175                  chilled                19.45000

10  250                chilled                 25.28333

11  350                chilled                 26.20000

12  500                chilled                 26.65000

13  675                 chilled                27.88333

14 1000                chilled                29.78333

Compare these results with the one above where we only included conc variable.

But we can subset with more than just 2 columns. Here we add Type to other 2 columns to subset:

              aggregate(uptake~conc+Treatment+Type, CO2data,mean)

       conc             Treatment                         Type                                                 uptake

1    95                  nonchilled                        Quebec                                            15.26667

2   175                 nonchilled                         Quebec                                            30.03333

3   250                 nonchilled                          Quebec                                           37.40000

4   350                 nonchilled                        Quebec                                            40.36667

5   500                 nonchilled                          Quebec                                           39.60000

6   675                 nonchilled                         Quebec                                            41.50000

7  1000                nonchilled                          Quebec                                           43.16667

8    95                  chilled                                 Quebec                                           12.86667

9   175                   chilled                              Quebec                                            24.13333

10  250                chilled                                 Quebec                                           34.46667

11  350                 chilled                                 Quebec                                          35.80000

12  500                 chilled                               Quebec                                            36.66667

13  675                  chilled                              Quebec                                            37.50000

14 1000               chilled                                  Quebec                                          40.83333

15   95                 nonchilled                         Mississippi                                      11.30000

16  175                nonchilled                         Mississippi                                     20.20000

17  250                nonchilled                         Mississippi                                      27.53333

18  350                nonchilled                        Mississippi                                      29.90000

19  500                nonchilled                         Mississippi                                      30.60000

20  675                nonchilled                         Mississippi                                     30.53333

21 1000                             nonchilled                        Mississippi                                     31.60000

22   95                  chilled                               Mississippi                                     9.60000

23  175                chilled                                Mississippi                                      14.76667

24  250                  chilled                            Mississippi                                      16.10000

25  350                chilled                                Mississippi                                      16.60000

26  500                 chilled                               Mississippi                                      16.63333

27  675                   chilled                             Mississippi                                      18.26667

28 1000                chilled                               Mississippi                                      18.73333

If we want to subset to get more means instead of just uptake value, for example if we want to get means of uptake and height as well, we cannot do it in same way as we did with subset columns.

What we need to do is to use cbind():

              aggregate(cbind(uptake,height)~conc, CO2data, mean)

                conc                   uptake                               height

1   95                   12.25833                           35.09564

2  175                  22.28333                           35.22596

3  250                  28.87500                           34.85416

4  350                  30.66667                          34.93658

5  500                  30.87500                          35.05123

6  675                  31.95000                           35.08222

7 1000                 33.58333                           35.01246

So, here we have average values of uptake and height variables for each level of conc variable.

Now, in all our examples, we used mean function to get mean values of uptake (and height in one example), but that does not mean that we cannot use different functions. We could, for example, use length() to determine how many entries there is in each subset:

              aggregate(uptake~conc,CO2data,length)

                            conc                     uptake

1             95                         12

2            175                      12

3            250                       12

4            350                        12

5            500                      12

6             675                     12

7            1000                    12

In  the upper example, we found out length of subsets in dataframe called CO2data. We are using groups in concentration vector (conc) to subset the data, and we want to find a length of the uptake value. So we want to know how many uptake measurements were taken out of concentration of 95, 175, 250 and so on.

You can see that we have 12 values for each group.

Note that instead of uptake, we could specify any other vector from our dataframe if we wanted to, such as height:

aggregate(height~conc, CO2data,length)

              conc                     uptake

1             95                         12

2            175                      12

3            250                       12

4            350                        12

5            500                      12

6             675                     12

7            1000                    12

And we get same asnwers because all of these vectors contained within CO2data dataframe have the same number of elements and therefore they would all be subsetted in same way.

Thank you for reading our tutorial.

If you want to read more about powerful functions such as aggregate(), you can check out free ebook ‘R for data science’ by Garrett Grolemund