How To Do a Left Join in R Without Duplicates From Right Table (Merge Function)

The R programming language’s relationship to data is one of its most important and unique traits. Of course all programming languages need to perform calculations and manipulate large amounts of data. But R takes that capability to a whole new level. It essentially packs the power of a database into basic elements like its data frame objects. This unique relationship to data means that R can also use techniques on its variables that are more commonly used within databases. For example, you can use a left join in R to merge data while avoiding duplicates from the right table. And you’re about to see just how easily this technique can be incorporated into your own code.

R’s Approach to Joins

Before moving on to left joins it’s important to take a look at R’s approach to the subject as a whole. If you’re familiar with databases and SQL then you already have a rough idea of what to expect from R. The R language shares a lot of the same concepts for joins with SQL. However, that familiarity can also be a detriment if it leads you to make too many assumptions. Because while R and SQL use similar terminology for joins, the actual syntax involved is quite dissimilar. R uses an approach to join syntax that’s better suited to a larger-scale programming language. In short, R opens the door for more advanced data manipulation where your results can be easily funneled into various additional functions.

The terminology used for these joins is similar to SQL in that it’s heavily based on spatial dimensions. This goes in line with the fact that the joins are done within a two dimensional context. You’re presented with left and right joins, which refers to which data is given preference. With a left join we’d keep all the rows from the left frame and try to match them with data from the right.

There are additional variations on this concept such as anti joins which look at what doesn’t fit into the join rather than what does. And, as you’ll soon see, joins in R also put some focus on data duplication. You’re given a lot of options to remove duplicate data or reference it in different ways. Likewise, we have a lot of freedom to decide how to retain or even modify duplicate data before moving it into another contex.

Starting Out With a Simple Left Join in R

One thing that both R and SQL joins have in common is that they can seem quite complex when you’re just hearing them described. But, as with many spatial concepts, it’s much easier to understand them by seeing actual examples. So lets dive into a simple left join using R and the merge function.

ourFrame <- data.frame(
key = c(“Troodon”, “Allosaurus”, “Apatosaurus”, “Stegosaurus”, “Triceratops”),
height = c(40, 1500, 12000, 2500, 9000)
)
ourFrame2 <- data.frame(
key = c(“Troodon”, “Allosaurus”, “Utahraptor”, “Apatosaurus”, “Stegosaurus”, “Stegosaurus”, “Triceratops”),
weight = c(40, 2000, 2000, 13000, 3000, 3000, 10000)
)

ourMergedData <- merge(ourFrame, ourFrame2, by = “key”, all.x = TRUE)
print(ourMergedData)

We begin by defining two data frames – ourFrame and ourFrame2. Each has a key consisting of a dinosaur name and some corresponding information. With ourFrame that’s height, while ourFrame2 provides us with information about the dinosaur’s estimated weight.

We proceed to do our left join by passing ourFrame and ourFrame2 to the merge function. Note that we also point to “key” as our join key for the “by” argument. We’re joining “by” the “key”. We also specify all.x = TRUE to signify the join type. Remember that R does operate through a spatial metaphor with joins. Here x signifies left and y right. Likewise, all’s TRUE state means that we want to include “all” data from the left frame. This will of course create a left join.

Note that merge will always use an “all” argument even if you don’t specify one. If the argument isn’t explicitly passed within merge it defaults to a FALSE value. So if you create a merge without specifying “all” then it’s implicitly stating “all.x = FALSE, all.y = FALSE”. And when we use a single all.x = TRUE we keep all.y in its default FALSE state. We see that in action in the next line as we print out the results. We’ve successfully joined our data. But there’s one problem – duplicates.

Taking the Concept of Left Joins a Little Further to Remove Duplicates

You might imagine that we’d have to perform a more complex join procedure in order to remove our duplicate information. But this is where we see how a combination of SQL’s database concepts and a robust programming language can come together into something truly amazing with R. Try running the following code to see just how easily we can remove the duplicates.

library(dplyr)
ourFrame <- data.frame(
key = c(“Troodon”, “Allosaurus”, “Apatosaurus”, “Stegosaurus”, “Triceratops”),
height = c(40, 1500, 12000, 2500, 9000)
)
ourFrame2 <- data.frame(
key = c(“Troodon”, “Allosaurus”, “Utahraptor”, “Apatosaurus”, “Stegosaurus”, “Stegosaurus”, “Triceratops”),
weight = c(40, 2000, 2000, 13000, 3000, 3000, 10000)
)

ourFrame2 <- ourFrame2 %>% distinct(key)
ourMergedData <- merge(ourFrame, ourFrame2, by = “key”, all.x = TRUE)
print(ourMergedData)

We start out in a similar way to the initial example. But this time around we’re importing an additional library called dplyr. This package contains a number of functions related to R joins that can make even the most complex join relatively easy. Once we’ve loaded up dplyr we proceed to define and fill ourFrame and ourFrame2. Thanks to dplyr we’re able to take a different approach to their contents. This time around we redefine ourFrame2 through dplyr’s distinct function.

Distinct can remove duplicate rows from a data frame when presented with a specific column. In this case, we use key as the argument. Distinct removes the duplicate and passes the result back to ourFrame2. This essentially redefines ourFrame2. With that out of the way, we run the same merge function that we’d used in the previous example. And you’ll see a very similar result to that example when it’s printed to screen. The main difference is that the results are now unique.