R is a robust language that can be used in a variety of different ways. But it’s primarily leveraged for tasks related to statistics and data science. As such, the language’s basic library consists of a wealth of mathematical functions. But there are areas where you’ll find yourself presented with more of a challenge than anticipated. For example, there’s more to converting data from long to wide than you might expect. But you’ll soon discover how to efficiently perform that conversion in R using pivot_longer.
R and Tidyr
Converting or pivoting data between wide and long formats with R’s standard library often requires convoluted use of reshape, stack, melt and unstack. But you’ll typically find that any missing functionality in R has already been implemented in third-party libraries. And that is indeed the case for pivoting from wide format to long format. In this case, the tidyr library has everything we’re looking for.
Tidyr is a popular library that, as the name suggests, provides extra functionality to make many R calculations neater and tidier. Instead of micromanaging elements while converting data between wide and long we can use a clean function that returns equally clean results. And tidyr provides that capability through the pivot_longer function.
Initial Steps To Pivoting
Pivot_longer is best understood by just jumping in and seeing it in action. Take a look at the following code.
library(tidyr)
df <- data.frame(Planet=c(‘Mercury’, ‘Venus’, ‘Earth’, ‘Mars’,’Jupiter’,’Saturn’,’Uranus’,’Neptune’),
transmission1=c(5, 10, 15, 20, 25,30,35,40),
transmission2=c(45, 50, 55, 60, 65, 70, 75, 80),
transmission3=c(85, 90, 95, 100, 105, 110, 115, 120))
df2 <- df %>% pivot_longer(cols=c(‘transmission1’, ‘transmission2′,’transmission3’),
names_to=’Chronology’,
values_to=’Transmissions’)
df2 %>% print(n = Inf)
In this example, we’ll step into the shoes of a researcher looking at information transmitted from orbital satellites around the planets in our solar system. We’re looking at three separate sets of data that relate to the eight planets. And we assign that information as a dataframe to df. Next, we want to concatenate our frame into a long format.
We’re going to assign the results to df2 and go about it in the usual way. But note that we’re using an infix operator to send df’s information to pivot_longer. This essentially acts as a pipe to pass information from left to right rather than the reverse. So the flow of data goes from df to pivot_longer and then back to df2.
We’ll circle back to take a more in-depth look at the pivotlonger function. But for now, we can end by noting how the finalized information is printed to screen. The long data is, of course, fairly long. So we’ll use the infix operator again to pipe the df2 information to the print function. The print function is also provided with an Inf argument to specify that we want the totality of our vector printed to the screen.
At this point, we’ve seen how easy it is to convert from wide to long. And as you might suspect, there’s even a similar pivot wider function to take things in the other direction. But there’s still the larger question of how the function works. What is pivot_longer actually doing to the information and how do we work with its various options?
A Deeper Dive Into Pivot_longer
The pivot_longer function is part of the tidyr library. And, as you might expect from a tidyr function, pivot_longer’s design is quite tidy. Take a look at how we used it in the previous example. It might seem like a complex collection of variables at first glance. But it’s actually fairly straightforward when you look at it from a perspective focused on functionality.
Go back and consider the first variables passed to pivot_longer. We use the cols argument to specify which columns we want to work with. In the case of our example, we want to use all of the columns, so we use every column name. Of course, we also need to tell pivot_longer how to make use of the multiple columns. This is done by using the function’s keyword arguments. For example, names_to specifies a new named column to use for the conversion. We use a simple “Chronology” filled with the order of transmission from the hypothetical satellites. But you can modify this in different usage scenarios. Multiple values indicate that multiple columns will be created. Keep in mind that there are some naming restrictions. For example, the first character in a column name can’t be a number. However, you can generally get past many invalid column names issues by just encasing it in backticks.
Invalid names might also pop up as part of the conversion process when using pivot_longer. For example, by default, the system won’t allow duplicate names. But we can supply a names_repair value to get past that problem. If we used a minimal value with names_repair it would simply allow duplicates. And if we specified unique the system would add a numeric suffix to duplicate names to make them unique.
Next, we supply the values_to arguments. This provides the system with a names pattern to use for statistical information in the individual cells. In our code, we specify this as Transmissions. And this creates a specific column with our transmission values.
In our usage scenario, we were working with numbers that perfectly align with a new structure. But pivot_longer also has us covered if there were any missing value instances in our frame. We simply need to set values_drop_na to TRUE and it would drop rows that consist solely of NA values. If we wanted to add those additional arguments we’d modify the previous code in the manner displayed below.
df2 <- df %>% pivot_longer(cols=c(‘transmission1’, ‘transmission2′,’transmission3’),
names_to=’Chronology’,
values_to=’Transmissions’,
values_drop_na = TRUE,
names_repair = “minimal”
)
In this modified version of the pivot_longer code, we add in arguments for value_drop_na and names_repair. This adds some extra flexibility to the function call and can allow for elements that would normally cause an error during the conversion. Of course, default values are set for a reason. In many situations, you’ll want to receive an error if your information isn’t formatted in the manner you expect. As such, both values_drop_na and names_repair should be used with caution.