Most good data science projects involve merging data from multiple sources. We covered the basics of how to use the merge() function in our earlier tutorial about data manipulation. This article is going to go a level deeper, specifically looking at the “left join” operation between two tables.
To set up a left join in R using the merge function, we are going to need to do two things. First, we need to set the x.all = True parameter. This configures the merge to return all records with a match on the left. Second, you will need to set up the “by statement” for the right column(s) to define how the two data frames should be matched up. See the examples below for more detail and an explanation of how the left join works in R.
See also our materials on inner joins and cross joins.
What is a Left Join In R?
A left join in R is a merge operation between two data frames where the merge returns all of the rows from one table (the left side) and any matching rows from the second table. A left join in R will NOT return values of the second table which do not already exist in the first table.
For example, let us suppose we’re going to analyze a collection of insurance policies written in Georgia, Alabama, and Florida. We want to see if they are compliant with our official state underwriting standards, which we keep in a table by state for all of the 38 states where we’re licensed to sell insurance. We’re going to need to merge these two data frames together.
The left join will return a data set consisting of all of the initial insurance policies and values for the three rows on the second table they matched to. There will not be values for states outside of the three listed (GA, FL, AL).
This is in contrast to an inner join, where you only return records which match on both tables. Suppose we had policies from a 39th state we were not allowed to operate in. If we ran this as an inner join, these records will be dropped since they were present on one table but not the other.
Left Join in R – Setting Up Merge ()
Here’s the merge function that will get this done.
- source – the names of our two data frames
- by – this parameter identifies the field in the dataframes to use to match records together. Trying to merge two different column names? No problem, we’ve got you covered – article covering this point.
- all.x and all.y = Boolean which indicates if you want this to be an inner join (matches only) or an outer join (all records on one side)
Left Join in R – Code Example
We’re going to go ahead and set up the data:
# left join in R - create first data set > policies <- data.frame(Policy = c(1:9), State=c(rep("GA",3), rep("FL", 3), rep("AL", 3)), Limit=c(rep(50000,3), rep(75000, 3), rep(85000, 3))) > policies Policy State Limit 1 1 GA 50000 2 2 GA 50000 3 3 GA 50000 4 4 FL 75000 5 5 FL 75000 6 6 FL 75000 7 7 AL 85000 8 8 AL 85000 9 9 AL 85000 # left join in R - create second data set > limits <- data.frame(State=c("FL","GA","AL"), regulatory_limit=c(75000,75000,65000)) > limits State regulatory_limit 1 FL 75000 2 GA 75000 3 AL 65000
So now we’re going to merge the two data frames together. After that, we can compare the amount of the policy with the acceptable limits.
# left join in R - join the dataframes! > scored_policies<-merge(x=policies,y=limits,by="State",all.x=TRUE) > scored_policies State Policy Limit regulatory_limit 1 AL 7 85000 65000 2 AL 8 85000 65000 3 AL 9 85000 65000 4 FL 4 75000 75000 5 FL 5 75000 75000 6 FL 6 75000 75000 7 GA 1 50000 75000 8 GA 2 50000 75000 9 GA 3 50000 75000 # check to see if we're writing policies over the limit > which(scored_policies$Limit > scored_policies$regulatory_limit)  1 2 3 # three violations...apparently there is a problem in Alabama...
The last part was an example of using the which function (tutorial link).
FAQ – Common Questions related to left join in R…
How would you implement a right join?
We would support a right join by changing our merge declaration to all.y = TRUE. That flips the join clause to looking the other way; it would bring in matching values for X (left table) and all values of Y (right table b). (Technically, this would be a right outer join – the same way our left join clause is a left outer join). Either way, the join function is giving you every row in one of the dataframe(s) plus any matching row in the other.
Implementation with multiple data frames is tricky; I recommend doing joins one at a time to simplify debugging.
What is a natural join?
Within databases, a natural join is a default join type where the columns have the same names in both tables. Generally speaking, this should be used as a last resort: it’s better to explicitly define the matching value / matching key column so the merge function is safe from database changes.
Using selected columns is more work but gives you far more control. A null value (na values) in the wrong place (due to a convert error) or corrupted variable names can easily generate tricky errors to debug.
What is a cross join?
The bane of database administrators everywhere, a cross join returns the Cartesian product of the two sets of matching columns. In effect, every record in the left table matched to every record in the right table. If you’re dealing with any dataset of respectable size, this will quickly return a massive amount of data. RIP your memory…
(There are actual legitimate business analytics needs for the cross join, but they are very rare. The existence of cross joins in your code design is cause for reassessment before proceeding.)
Want to Learn more R shortcuts?
Check out our tutorial on helpful R functions. In a language where there seems to be several ways to solve any problems, this reference page can help guide you to good options for getting things done. We seek to interject a little Pythonic clarity and sustainability to the “just get it done” world of R programming.