Most good data science projects involve merging data from multiple sources. We covered the basics of how to use the merge() function in our earlier tutorial about data manipulation. This article is going to go a level deeper, specifically looking at the “left join” operation between two tables.
What is a Left Join In R?
A left join in R is a merge operation between two data frames where the merge returns all of the rows from one table (the left side) and any matching rows from the second table. A left join in R will NOT return values of the second table which do not already exist in the first table.
For example, let us suppose we’re going to analyze a collection of insurance policies written in Georgia, Alabama, and Florida. We want to see if they are compliant with our official state underwriting standards, which we keep in a table by state for all of the 38 states where we’re licensed to sell insurance. We’re going to need to merge these two data frames together.
The left join will return a data set consisting of all of the initial insurance policies and values for the three rows on the second table they matched to. There will not be values for states outside of the three listed (GA, FL, AL).
This is in contrast to an inner join, where you only return records which match on both tables. Suppose we had policies from a 39th state we were not allowed to operate in. If we ran this as an inner join, these records will be dropped since they were present on one table but not the other.
Left Join in R – Setting Up Merge ()
Here’s the merge function that will get this done.
- source – the names of our two data frames
- by – this parameter identifies the field in the dataframes to use to match records together. Trying to merge two different column names? No problem, we’ve got you covered – article covering this point.
- all.x and all.y = Boolean which indicates if you want this to be an inner join (matches only) or an outer join (all records on one side)
Left Join in R – Code Example
We’re going to go ahead and set up the data:
# left join in R - create first data set > policies <- data.frame(Policy = c(1:9), State=c(rep("GA",3), rep("FL", 3), rep("AL", 3)), Limit=c(rep(50000,3), rep(75000, 3), rep(85000, 3))) > policies Policy State Limit 1 1 GA 50000 2 2 GA 50000 3 3 GA 50000 4 4 FL 75000 5 5 FL 75000 6 6 FL 75000 7 7 AL 85000 8 8 AL 85000 9 9 AL 85000 # left join in R - create second data set > limits <- data.frame(State=c("FL","GA","AL"), regulatory_limit=c(75000,75000,65000)) > limits State regulatory_limit 1 FL 75000 2 GA 75000 3 AL 65000
So now we’re going to merge the two data frames together. After that, we can compare the amount of the policy with the acceptable limits.
# left join in R - join the dataframes! > scored_policies<-merge(x=policies,y=limits,by="State",all.x=TRUE) > scored_policies State Policy Limit regulatory_limit 1 AL 7 85000 65000 2 AL 8 85000 65000 3 AL 9 85000 65000 4 FL 4 75000 75000 5 FL 5 75000 75000 6 FL 6 75000 75000 7 GA 1 50000 75000 8 GA 2 50000 75000 9 GA 3 50000 75000 # check to see if we're writing policies over the limit > which(scored_policies$Limit > scored_policies$regulatory_limit)  1 2 3 # three violations...apparently there is a problem in Alabama...
The last part was an example of using the which function (tutorial link).
Want to Learn more R shortcuts?
Check out our tutorial on helpful R functions. In a language where there seems to be several ways to solve any problems, this reference page can help guide you to good options for getting things done. We seek to interject a little Pythonic clarity and sustainability to the “just get it done” world of R programming.