R is easily one of the most popular options for any task related to data science. It ships with a default library that can handle a vast number of different statistical challenges. The language is also easy to extend with user-defined functionality, which is why it has a number of powerful third-party libraries. Between the default library and additional packages, you can accomplish almost anything related to data science by using R. And you can usually do so quickly and efficiently.
However, there are so many options in R that it’s not always easy to know what the most efficient way to handle any given task might be. For example, most people working with data sets will need to sum across columns. And there are quite a few different ways to go about that task. But what’s the most efficient way to sum across multiple columns in R?
Finding the Best Way To Tackle the Problem
The traveling salesman problem is one of the most widely known problems in mathematical optimization. It presents you with a seemingly simple question: how would a traveling salesman plan the shortest possible route that visits every location he needs to reach? It’s an important thought experiment that can help people optimize various elements of data science. But it also highlights an important point about anything related to data.
It’s not enough to present a solution. What we’re really after is the most efficient solution. And this is the exact issue we face when summing columns in R. There’s a wide variety of methods we could use to get the sum of columns. For example, we could create a loop that looks at the length of dataset elements and then loops by specified criteria. That method would work. But would it be the most efficient use of both programming time and computational resources? The answer is that it probably wouldn’t do well with either of those criteria.
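To make that comparison concrete, here’s a minimal sketch of the loop-based approach next to a single vectorized call. The tiny dataframe and the variable names (loop_totals, vector_totals) are made up for illustration, and rowSums here is the plain base R function rather than anything from a third-party library.

```r
# Hypothetical sample data, used only for this illustration
df <- data.frame(First = c(1, 2, 3), Second = c(4, 5, 6))

# Loop-based approach: visit each row and add up its values by hand
loop_totals <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  loop_totals[i] <- sum(unlist(df[i, ]))
}

# Vectorized approach: base R's rowSums does the same work in one call
vector_totals <- unname(rowSums(df))

print(loop_totals)    # row totals: 5, 7, 9
print(vector_totals)  # the same totals
```

Both versions give the same answer. The loop simply takes more code to write and more care to get right, and that gap tends to widen as the dataset grows.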
We’d ideally want to write a function that could handle those tasks and more while utilizing a high level of code optimization. However, we’re also in luck there because that feat’s already been accomplished. The dplyr library provides R with a collection of functions related to data manipulation. The package describes itself as a grammar of data manipulation, built around a consistent set of verbs such as mutate, select, and filter. These functions make it easy to take longer chains of tasks and implement them with only a line or two of code. And as you’ll soon see, that also includes an extremely efficient way to sum across multiple columns in R.
A Basic Sum Operation With Dplyr
We can begin by looking at a simple example. Take a look at the following code.
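library(dplyr)

# Sample dataframe: five rows, three numeric columns
df <- data.frame(First=c(1, 2, 3, 4, 5),
                 Second=c(6, 7, 8, 9, 10),
                 Third=c(11, 12, 13, 14, 15))

# Add a column holding the sum of every column in each row
df2 <- df %>% mutate(Fifth = rowSums(.))
print(df2)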
We begin by loading the dplyr library. This is what gives us access to premade functionality that makes cross-column data manipulation a snap. Next, we define a simple dataframe that consists of five rows and three columns.
The following line is where we actually sum the columns. We create a new dataframe called df2 by passing our original dataframe, df, through the %>% pipe, an infix operator that dplyr re-exports from the magrittr package. The pipe hands df to dplyr’s mutate function, which adds new columns to a dataframe or modifies existing ones. Here we give mutate a single named argument, Fifth, alongside the piped data from df, which creates a new column called Fifth. The value assigned to Fifth is the result of running base R’s rowSums function. Note that rowSums takes “.” as its argument. The period is the pipe’s placeholder for whatever sits on its left-hand side, so in this context it stands for the entire dataframe df, and rowSums adds up every column in each row. And, finally, we print out the result of that process.
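One way to see what the period is doing is to write the same operation without the pipe. The sketch below assumes the three-column sample dataframe from this section; the names piped and direct are just labels for this comparison.

```r
library(dplyr)

df <- data.frame(First = c(1, 2, 3, 4, 5),
                 Second = c(6, 7, 8, 9, 10),
                 Third = c(11, 12, 13, 14, 15))

# Piped form: "." stands for the dataframe on the left of %>%
piped <- df %>% mutate(Fifth = rowSums(.))

# Equivalent form without the pipe: df is passed explicitly both times
direct <- mutate(df, Fifth = rowSums(df))

# Both versions should produce the same dataframe
print(isTRUE(all.equal(piped, direct)))

# Row totals: 18, 21, 24, 27, 30
print(piped$Fifth)
```

Seen this way, the period isn’t selecting columns at all. It’s simply handing the whole dataframe to rowSums, which is why every column ends up in the total.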
The end result of all of this is a new column called Fifth which contains the result of summing across every column in df. This is a fairly basic example of how we can sum across multiple columns. However, it doesn’t provide us with much flexibility. It’s quite literally an “all or nothing” solution. It’s true that this is a fairly common usage scenario. But what if you wanted to be more selective with your sum designations?
More Advanced Scenarios
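Dplyr is still the most efficient way to selectively sum, even when we’re performing that action across multiple columns, and our code will remain just as concise. The key is dplyr’s across function, available in dplyr 1.0.0 and later, which picks out the columns that get passed to rowSums. In fact, you just need to replace the df2 assignment with the following line.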
df2 <- df %>% mutate(Fifth = rowSums(across(c(First, Third))))
This is quite similar to the original mutate call. The main difference is that we’re specifying that we want to run rowSums across our First and Third columns. But in many cases, it’s easier to use numerical positioning rather than specifying column names. Thankfully dplyr makes summing using numerical designations just as easy as using column names. Try replacing the df2 assignment again, but this time with the code listed below.
df2 <- df %>% mutate(Fifth = rowSums(across(c(1,3))))
This line is extremely similar to the previous version. The difference is that we used the numeric position of each column rather than the name. Instead of First we used 1, and instead of Third we used 3. But specifying a large number of columns could be tedious in larger datasets. Thankfully it’s just as easy to automatically designate a collection of columns. You can once again replace the df2 assignment with the following line.
df2 <- df %>% mutate(Fifth = rowSums(across(c(1:3))))
In this modified code we use : to specify a range of values between 1 and 3. You’d almost certainly be using a wider scope in real-world situations. But selecting three columns is sufficient for our tiny dataset. And if you run the code you’ll see that we do indeed get values summed across columns 1 to 3.
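If you’d like to compare the three selection styles side by side, a short sketch along these lines should do it. It assumes the three-column sample dataframe from earlier and dplyr 1.0.0 or later for across; the names by_name, by_position, and by_range are only labels for this comparison.

```r
library(dplyr)

df <- data.frame(First = c(1, 2, 3, 4, 5),
                 Second = c(6, 7, 8, 9, 10),
                 Third = c(11, 12, 13, 14, 15))

# Select columns by name, by position, and by a range of positions
by_name     <- df %>% mutate(Fifth = rowSums(across(c(First, Third))))
by_position <- df %>% mutate(Fifth = rowSums(across(c(1, 3))))
by_range    <- df %>% mutate(Fifth = rowSums(across(c(1:3))))

# First + Third: 12, 14, 16, 18, 20
print(by_name$Fifth)

# Positions 1 and 3 are the same two columns, so the totals match
print(by_position$Fifth)

# First + Second + Third: 18, 21, 24, 27, 30
print(by_range$Fifth)
```

The name-based and position-based versions agree because they point at the same two columns. Position-based selection is shorter to type, but it quietly changes meaning if the column order ever changes, so names are often the safer choice in longer scripts.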
At this point, you’ve seen how easy it is to cause dramatic changes to column and data selection with dplyr. We often just need to change a single character to produce dramatically different approaches to summing across multiple columns. And all of this highlights the library’s efficiency.