Most good data science projects involve merging data from multiple sources. We covered the basics of how to use the merge() function in our earlier tutorial about data manipulation. We’re going to take a deeper look at the “inner join” operation between two tables.
You should probably also read our tutorial on the left join in R.
What is an Inner Join In R?
An inner join is a merge operation between two data frame which seeks to only return the records which matched between the two data frames. This is in contrast to a left join, which will return all records from one table (plus any matches) and an outer join which returns everything from both sides.
An inner join in R is a merge operation between two data frames where the merge returns all of the rows that match from both tables. You are going to need to specify a common key for R use to use to match the data elements.
Continuing our insurance example, we’re going to analyze a collection of insurance policies written in Georgia, Alabama, and Florida. We wrote a bunch of policies and have experienced a couple of claims (payments) that we should compare against our original data. I’d like to compare the loss amounts to the original policy limits.null
Inner Join in R – Setting Up Merge ()
Here’s the merge function that will get this done.
results<-merge(x=source1,y=source2,by=”State”,all.x=FALSE, all.y=FALSE)
- source – the names of our two data frames
- by – this parameter identifies the field in the data frames to use to match records together. Trying to merge two different column names? No problem, we’ve got you covered – article covering this point.
- all.x and all.y = Boolean which indicates if you want this to be an inner join (matches only) or an outer join (all records on one side). Since we’re doing an inner join, we only need the matches for both.
Inner Join in R – Code Example
We’re going to go ahead and set up the data:
# inner join in R - create first data set > policies <- data.frame(Policy = c(1:9), State=c(rep("GA",3), rep("FL", 3), rep("AL", 3)), Limit=c(rep(50000,3), rep(75000, 3), rep(85000, 3))) > policies Policy State Limit 1 1 GA 50000 2 2 GA 50000 3 3 GA 50000 4 4 FL 75000 5 5 FL 75000 6 6 FL 75000 7 7 AL 85000 8 8 AL 85000 9 9 AL 85000 # inner join in R - create second data set for losses > losses <- data.frame(Policy=c(1,4,5), loss=c(500,25000,85000)) > losses Policy loss 1 1 500 2 4 25000 3 5 85000
So now we’re going to merge the two data frames together. After that, we can compare the losses on our policies with the original policy limits.
# inner join in R - join the data-frames! > scored_losses<-merge(x=policies,y=losses,by="Policy",all.x=FALSE, all.y=FALSE) > scored_losses Policy State Limit loss 1 1 GA 50000 500 2 4 FL 75000 25000 3 5 FL 75000 85000
This (admittedly contrived) analysis is actually somewhat interesting. Note that we’ve got two losses in Florida (out of three policies). Furthermore, one of our losses managed to exceed the limits of the original policy.
Were this a real business question, I’d be asking some questions in Florida.
In any event, that concludes our tutorial on how to do an inner join in R.
Want to Learn more R shortcuts?
Check out our tutorial on helpful R functions. In a language where there seems to be several ways to solve any problems, this reference page can help guide you to good options for getting things done. We seek to interject a little Pythonic clarity and sustainability to the “just get it done” world of R programming.