The R programming language is one of the most popular options for advanced statistical analysis and data science. Part of this popularity is thanks to the fact that the R language has a number of built-in data types which are especially useful for analytical work. But just as important as R’s data types is the ease with which you can manipulate them.
The base R library provides users with an impressive selection of functions. And libraries like tidyr wrap that potential into easier-to-use systems. For example, tidyr provides a special function called spread. The spread function in R makes it easy to spread a key-value pair across different columns of a data frame. It’s true that analogous functionality can be found in most programming languages. But what sets R apart is the sheer ease of use and brevity of the code required to do so. And you’ll soon see how to use spread to perform some complex manipulations with key-value pairs.
The Basics of Spread and Tidyr
The tidyr library doesn’t ship with R by default. But it’s so useful that if you’re using R you’ll probably find yourself dipping into tidyr as well. As the name suggests, tidyr is used to tidy up data in R. It also helps to keep your code tidy by providing methods to implement complex functionality with only a small amount of code. Many of its functions relate to data conversion. And it really shines when you need to drastically modify both the layout and presentation of data frames. This is an especially common problem when dealing with wide and long formatted data. People often need to move data from one to the other. It’s a tedious process when done by hand. But it couldn’t be easier with R and the spread function.
Data Formats and Tidyr’s Spread Function
The nature of wide and long formats isn’t always self-evident. Consider a situation where you have a number of subjects for observation. Each subject is paired with specific variables and values. We can model this with the following R code.
ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(5, 10, 15, 20, 25, 30)
)
print(ourData)
We begin by creating a data frame called ourData and populating it with six rows and three columns. The columns correspond to subjects, variables, and values. We then proceed to print out the contents of ourData. And this shows why we might want to tweak the formatting. By default, each subject is scattered across several rows, which makes the data relatively messy to read. Thankfully that’s where tidyr shines.
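If you’d like to confirm the shape of the data frame before reshaping it, base R’s dim and str functions are a quick way to do so. The snippet below is only a minimal sketch that rebuilds the same ourData and inspects it.

```r
# Rebuild the example data frame from above.
ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(5, 10, 15, 20, 25, 30)
)

# dim() reports the number of rows first, then the number of columns.
ourShape <- dim(ourData)
print(ourShape)  # 6 3

# str() lists each column along with its type and first few values.
str(ourData)
```

Each subject appears on two rows here, once per variable. That repetition is what spread will collapse.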
The Spread Function in Theory and Practice
We can use tidyr’s spread function to reformat our data so that each subject has its own row. Take a look at the following code.
library(tidyr)
ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(5, 10, 15, 20, 25, 30)
)
ourSpread <- spread(ourData, key = ourVariables, value = ourValues)
print(ourSpread)
We begin by loading the tidyr library, which provides the spread function. We then create the ourData data frame again. The following line is where the real magic happens. We create a new variable called ourSpread to hold the output of the spread function, and we call it with three arguments. As you might expect, the first is the data frame we want to work with. In this case, that’s ourData. Next, we pass our key-value pair. The key argument, ourVariables, names the column whose values become the new column names. The value argument, ourValues, names the column whose values fill those new columns. We then print out the result to see what’s now stored in ourSpread. Saying that the resulting data is tidier would be an understatement.
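It’s worth knowing that the tidyr documentation describes spread as superseded by a newer function called pivot_wider. The sketch below shows what the same reshaping might look like with it, assuming tidyr 1.0.0 or later is installed. The variable name ourWide is just an illustrative choice.

```r
library(tidyr)

ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(5, 10, 15, 20, 25, 30)
)

# names_from plays the role of spread's key argument,
# and values_from plays the role of spread's value argument.
ourWide <- pivot_wider(
  ourData,
  names_from = ourVariables,
  values_from = ourValues
)
print(ourWide)
```

The spread function still works, so there’s no need to rewrite existing code. But pivot_wider is probably the safer choice for new projects.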
We haven’t just made the data easier to read either. The data is now organized around the subject, with each subject assigned to its own row. And each variable is now represented by a column holding the appropriate values. This layout is known as wide format. The original layout, with one row for every combination of subject and variable, is known as long format. So we haven’t just spread a key-value pair across different columns. We’ve also converted the data from long format to wide format.
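Moving in the opposite direction is handled by spread’s companion function in tidyr, gather. The following is a minimal sketch of a round trip, where the variable name ourLong is only an illustrative choice. It spreads the data into one row per subject and then gathers the X and Y columns back into a key-value pair.

```r
library(tidyr)

ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(5, 10, 15, 20, 25, 30)
)

# One row per subject, with X and Y as separate columns.
ourSpread <- spread(ourData, key = ourVariables, value = ourValues)

# Collapse the X and Y columns back into a key column and a value column.
ourLong <- gather(ourSpread, key = "ourVariables", value = "ourValues", X, Y)
print(ourLong)
```

The gathered result holds the same six observations as ourData, though the rows may come back in a different order.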
But you might wonder what would happen if our data wasn’t perfectly aligned. For example, what if we were scraping data or importing it from a CSV file and wound up with missing values? The bad news is that spread can be a little finicky when some combinations of subject and variable have no value. But that’s fairly easy to take care of when constructing the data frame. We can just insert NA values for any missing data. This could be done during import or through custom functions tailored to your needs.
Missing values that are filled in with NA are called explicit missing values. And these are a lot easier for the function to work with than if there was nothing present in that position at all. Tidyr can try to compensate for this second type of missingness, called implicit missingness, but the results aren’t always as clean as we might like. As such it’s generally a good idea to avoid implicit missing values by just cleaning data during the import process. Take a look at the following code to see how we could handle the results of a data set with explicit missing values.
library(tidyr)
ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  ourVariables = c("X", "Y", "X", "Y", "X", "Y"),
  ourValues = c(NA, 10, 15, 20, 25, 30)
)
ourSpread <- spread(ourData, key = ourVariables, value = ourValues, fill = 1)
print(ourSpread)
We begin in a fairly similar way to the prior example. The first difference can be seen in the ourValues vector. We simulate an import with a missing value by assigning NA to the first data point in ourValues. And we see how to deal with that change in the following ourSpread assignment. We call the function with the same arguments as in the prior example. However, take a look at the last argument. We now have a fill option. This argument tells the function what value to substitute when it finds a missing value.
In this example, we’re using NA to specify explicit missingness. And we supply the function with a fill of 1. The specified number means that when the function encounters a missing value it’ll replace that value with 1. The next line prints out the results. And we do, indeed, see that NA is now 1 in the newly formatted data. We could take this even further by supplying drop = FALSE to keep factor levels that don’t appear in the data. When that happens the function creates cells for those unused levels and simply fills them with the fill value.
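To make the drop argument a little more concrete, here’s a hedged sketch. It assumes ourVariables is stored as a factor with an extra level, "Z", that never appears in the data. That extra level, along with the names ourDefault and ourKept, is invented purely for illustration.

```r
library(tidyr)

ourData <- data.frame(
  ourSubject = c("A", "A", "B", "B", "C", "C"),
  # "Z" is a declared factor level with no observations (an assumption for this demo).
  ourVariables = factor(c("X", "Y", "X", "Y", "X", "Y"), levels = c("X", "Y", "Z")),
  ourValues = c(5, 10, 15, 20, 25, 30)
)

# With the default drop = TRUE, the unused level Z is ignored.
ourDefault <- spread(ourData, key = ourVariables, value = ourValues)
print(ourDefault)

# With drop = FALSE, Z gets its own column, and fill supplies its values.
ourKept <- spread(ourData, key = ourVariables, value = ourValues,
                  fill = 1, drop = FALSE)
print(ourKept)
```

This can be handy when every output needs the same set of columns, even if a particular batch of data happens to be missing one of the variables.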
It’s generally best to limit things to a key-value pair and no other arguments when first starting out with a new data set. This can help you work through the data to find out where discrepancies might arise. But once you’re familiar with the data you can start mixing in more advanced logic like the drop and fill arguments.
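One discrepancy worth checking for is the implicit missing value described earlier, where a row is simply absent rather than marked with NA. The sketch below assumes the row for subject A and variable X was never recorded. The names ourPlain and ourFilled are illustrative only.

```r
library(tidyr)

# Subject A has no row at all for variable X (an implicit missing value).
ourData <- data.frame(
  ourSubject = c("A", "B", "B", "C", "C"),
  ourVariables = c("Y", "X", "Y", "X", "Y"),
  ourValues = c(10, 15, 20, 25, 30)
)

# Key-value pair only: the absent combination shows up as NA.
ourPlain <- spread(ourData, key = ourVariables, value = ourValues)
print(ourPlain)

# Adding fill replaces that gap with the chosen value instead.
ourFilled <- spread(ourData, key = ourVariables, value = ourValues, fill = 0)
print(ourFilled)
```

Running the plain call first makes the gap visible as NA, which is a useful cue before deciding whether a fill value is appropriate for your data.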