R has a well-deserved reputation among data scientists. The language is specifically designed around concepts related to statistics and data analysis. This means that R doesn’t just let you work with large collections of data. You can also manipulate that data into a wide variety of different formats and presentations.
R essentially lets you manage data fluidly. Your variables aren’t fixed in a specific form. You can instead easily mix and match them together as needed. And this is plainly seen in operations related to joining and merging data. For example, performing a right join in many languages can be an exercise in frustration. But you’re about to see just how easy it is to set up a right join in R using the merge function.
Joins, Right Joins, and Merge in R
Before moving on to a right join it’s important to take a step back and consider what’s really happening during the process. People often think of joins as analogous to perfectly fitting one data set into another. And to be sure, that is the ideal for a join operation. But in the real world data tends to be a lot messier than we’d like and we’re often stuck working with instances that are full of only semi-related metrics.
This boils down to the fact that we often need to decide what elements of our data collection are the most important. And we then need to fit all of our other data in place around it. There’s a variety of different approaches that we can take. And each will move data around in a different way. For example, we can try to do a full outer join to fit everything in place even if there are rows with no matches. We could do a left join and include all rows from our left data frame and matching rows from the right. Or, as we’ll soon see, we can do a right join and perform what’s essentially the opposite of a left join.
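As a rough illustration of the join types described above, here is a minimal sketch using base R's merge. The frames leftFrame and rightFrame, their columns, and their values are hypothetical and exist only for this example.

```r
# Hypothetical example frames; names and values are illustrative only.
leftFrame  <- data.frame(key = c(1, 2, 3), leftValue  = c("a", "b", "c"))
rightFrame <- data.frame(key = c(2, 3, 4), rightValue = c("x", "y", "z"))

# Inner join: only keys present in both frames.
innerJoin <- merge(leftFrame, rightFrame, by = "key")

# Left join: every row of the first frame, plus matches from the second.
leftJoin <- merge(leftFrame, rightFrame, by = "key", all.x = TRUE)

# Right join: every row of the second frame, plus matches from the first.
rightJoin <- merge(leftFrame, rightFrame, by = "key", all.y = TRUE)

# Full outer join: every row of both frames, matched where possible.
fullJoin <- merge(leftFrame, rightFrame, by = "key", all = TRUE)

print(fullJoin)
```

Unmatched cells are filled with NA, so the full outer join here should contain one row per distinct key across both frames.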
Performing a Standard Right Join
Performing the various join operations by hand can take a fair amount of code in a general-purpose programming language. But thankfully R’s data-focused nature makes joins surprisingly easy. And much of that is thanks to a single base R function called merge. As the name suggests, merge combines the two data frames that we pass to it as arguments. Take a look at the following example to see just how easily we can use merge to perform a right join in R.
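# Dinosaur names keyed by an ID number
ourData1 <- data.frame(
  id = c(1, 2, 3),
  name = c("T-Rex", "Stegosaurus", "Utahraptor")
)
# Interest scores keyed by the same ID; id 4 has no name in ourData1
ourData2 <- data.frame(
  id = c(1, 2, 3, 4),
  scientificInterest = c(90, 50, 100, 1),
  publicInterest = c(90, 40, 5, 2)
)
# all.x = TRUE keeps every row of the first frame passed to merge (ourData2)
ourMerged <- merge(ourData2, ourData1, by = "id", all.x = TRUE)
print(ourMerged)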
We begin by creating two data frames. Each data frame contains information about dinosaurs. ourData1 lists dinosaur names keyed by ID number in a hypothetical index. ourData2 scores the scientific and public interest in each dinosaur by the same ID number. The two data frames are mostly compatible, aside from the fact that ourData1 is missing an entry (id 4) found in ourData2. Thankfully R is extremely good at handling these types of mismatches. We deal with this one by assigning the result of merge to a new variable called ourMerged on the next line.
The ourMerged variable holds the output of our call to the merge function. The first two arguments are our data frames, and the order matters: merge calls the first frame x and the second frame y, so here ourData2 is x and ourData1 is y. The order alone doesn’t decide which kind of join we get, though. That comes from combining it with the all.x and all.y arguments.
The by argument names the column that the two frames share, which is the key the rows are matched on. In this case, we specify that we want to join on the id. The all.x argument isn’t positional. It’s a logical flag, and setting it to TRUE tells merge to keep every row of x, ourData2, in the result and to add in the matching rows from y, ourData1. Strictly speaking, merge treats this as a left join on ourData2. It returns the same rows as a right join that lists ourData1 first and ourData2 second with all.y = TRUE, and only the column order differs. Any id in ourData2 that has no match in ourData1 gets an NA value in the columns that come from ourData1. And, when we print everything out, that’s exactly what we see. The rows are sorted by id, and the 4th row shows an NA value for the missing dinosaur name.
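To make the relationship between argument order and the all.x and all.y flags concrete, here is a small sketch that compares the call used above with the more conventional way of writing a right join in merge. The variable names keepAllX and rightJoin are introduced only for this illustration.

```r
ourData1 <- data.frame(
  id = c(1, 2, 3),
  name = c("T-Rex", "Stegosaurus", "Utahraptor")
)
ourData2 <- data.frame(
  id = c(1, 2, 3, 4),
  scientificInterest = c(90, 50, 100, 1),
  publicInterest = c(90, 40, 5, 2)
)

# Form used above: ourData2 is x, and all.x = TRUE keeps all of its rows.
keepAllX <- merge(ourData2, ourData1, by = "id", all.x = TRUE)

# Conventional right join: ourData2 is y, and all.y = TRUE keeps all of its rows.
rightJoin <- merge(ourData1, ourData2, by = "id", all.y = TRUE)

# Reorder the columns of the first result so the two can be compared directly.
sameContent <- isTRUE(all.equal(keepAllX[, names(rightJoin)], rightJoin))
print(names(keepAllX))
print(names(rightJoin))
print(sameContent)
```

Both calls keep all four rows of ourData2. The visible difference is that the columns of the frame passed first come first in the output.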
Taking More Control Over the Join
Merge makes joins relatively simple. But there are still a few caveats to keep in mind. One of the most important is how we specify which data frame’s rows should be kept. Try changing the merge line to the following.
ourMerged <- merge(ourData2, ourData1, by = "id", all.x = TRUE, all.y = TRUE)
Using both all.x and all.y tells merge that we want to keep all of the rows from both data frames, which makes this a full outer join. With our data, the output doesn’t change. Every id in ourData1 also appears in ourData2, so keeping all of ourData1’s rows adds nothing new. The next variation spells out the default value of all.y explicitly. Try replacing the merge with the following example.
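The full outer join only looks different when the second frame has rows that the first lacks. As a sketch of that case, the frame ourData1Extra below adds a hypothetical fifth entry (id 5, "Triceratops") that has no counterpart in ourData2. That entry is an assumption made for illustration and is not part of the original data.

```r
# Same names as before, plus a hypothetical id 5 that ourData2 doesn't have.
ourData1Extra <- data.frame(
  id = c(1, 2, 3, 5),
  name = c("T-Rex", "Stegosaurus", "Utahraptor", "Triceratops")
)
ourData2 <- data.frame(
  id = c(1, 2, 3, 4),
  scientificInterest = c(90, 50, 100, 1),
  publicInterest = c(90, 40, 5, 2)
)

# Keep all rows of x (ourData2) only: id 5 is dropped.
keepAllX <- merge(ourData2, ourData1Extra, by = "id", all.x = TRUE)

# Keep all rows of both frames: id 4 and id 5 both appear, padded with NA.
fullOuter <- merge(ourData2, ourData1Extra, by = "id", all.x = TRUE, all.y = TRUE)

print(keepAllX)
print(fullOuter)
```

Here the two calls should no longer agree: the full outer join gains a row for id 5 whose interest scores are NA.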
ourMerged <- merge(ourData2, ourData1, by = "id", all.x = TRUE, all.y = FALSE)
We once again see the same results as our initial code. The all.y argument defaults to FALSE, so this call is exactly what the initial all.x = TRUE implied. But there’s one last variation to try.
ourMerged <- merge(ourData2, ourData1, by = "id", all.x = FALSE, all.y = TRUE)
This time around our results are a little sparse: the row for id 4 is gone and only three rows remain. This is due to the fact that we’ve changed the nature of the join by specifying all.x as FALSE and all.y as TRUE. The join now keeps every row of y, ourData1, and only the matching rows of x, ourData2. In merge’s terms this is a right join in which ourData1 is the frame being preserved. Because every id in ourData1 has a match in ourData2, the output happens to be identical to what an inner join (all.x and all.y both FALSE) would return. A very small change in R’s syntax can have a huge impact on your data. We didn’t remove anything that was present in our code. We only swapped which flag is TRUE, and that changed which frame drives the join. It’s extremely important to be careful with your merge syntax, because such small changes can modify the entire nature of your join.
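One way to check what the last variation did is to compare it against a plain inner join on the same data. This is a minimal sketch, and the names keepAllY and innerJoin are introduced only for this comparison.

```r
ourData1 <- data.frame(
  id = c(1, 2, 3),
  name = c("T-Rex", "Stegosaurus", "Utahraptor")
)
ourData2 <- data.frame(
  id = c(1, 2, 3, 4),
  scientificInterest = c(90, 50, 100, 1),
  publicInterest = c(90, 40, 5, 2)
)

# Keep all rows of y (ourData1) and only the matching rows of x (ourData2).
keepAllY <- merge(ourData2, ourData1, by = "id", all.x = FALSE, all.y = TRUE)

# Inner join: both flags left at their default of FALSE.
innerJoin <- merge(ourData2, ourData1, by = "id")

# With this data, both results cover the same ids.
print(keepAllY$id)
print(innerJoin$id)
```

The two results only match because ourData1 has no unmatched ids. If ourData1 contained an id missing from ourData2, keepAllY would retain that row and innerJoin would not.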