Most good data science projects involve merging data from multiple sources. We covered the basics of how to use the merge() function in our earlier tutorial about data manipulation. We’re going to take a deeper look at the “cross join” operation between two tables. While cross joins tend to be much more rare than inner joins and left joins, they are part of the data manipulation lineup.
What is a Cross Join In R?
An cross join is a merge operation between two data frame which seeks to return every matching records in the two data frames, including ones that lack a matching pair on the other side.
Cross joins have a bad reputation among developers. Most queries which include a cross join are in error; there are very few real world applications for this query. Unless, of course, you are a Probabilist tasked with analyzing the Cartesian explosion of a pair of variables (yes – a true real world case).
For our example, we return to the insurance losses analysis. We have a set of insurance claims (5) that we want to map back to the policy underwriting data.
Cross Join in R – Setting Up Merge ()
Here’s the merge function that will get this done.
- source – the names of our two data frames
- by – this parameter identifies the field in the dataframes to use to match records together. Trying to merge two different column names? No problem, we’ve got you covered – article covering this point.
- all.x and all.y = Boolean which indicates if you want this to be an inner join (matches only) or an outer join (all records on one side). In this example, we want everything from both sides of the merge.
Cross Join in R – Code Example
We’re going to go ahead and set up the data:
# cross join in R - create first data set > policies <- data.frame(Policy = c(1:9), State=c(rep("GA",3), rep("FL", 3), rep("AL", 3)), Limit=c(rep(50000,3), rep(75000, 3), rep(85000, 3))) > policies Policy State Limit 1 1 GA 50000 2 2 GA 50000 3 3 GA 50000 4 4 FL 75000 5 5 FL 75000 6 6 FL 75000 7 7 AL 85000 8 8 AL 85000 9 9 AL 85000 > losses <- data.frame(Policy=c(1,4,5,10,11), loss=c(500,25000,85000,5000,10000)) > losses Policy loss 1 1 500 2 4 25000 3 5 85000 4 10 5000 5 11 10000
So now we’re going to merge the two data frames together.
# cross join in R - join the dataframes! > scored_losses<-merge(x=policies,y=losses,by="Policy",all.x=TRUE, all.y=TRUE) > scored_losses Policy State Limit loss 1 1 GA 50000 500 2 2 GA 50000 NA 3 3 GA 50000 NA 4 4 FL 75000 25000 5 5 FL 75000 85000 6 6 FL 75000 NA 7 7 AL 85000 NA 8 8 AL 85000 NA 9 9 AL 85000 NA 10 10 <NA> NA 5000 11 11 <NA> NA 10000
This actually could work as a real world query example. We managed to yield a fairly nice data set of policy vs. loss data for the three states we expected in our sample. And then we have items 10 and 11…
Apparently we’re paying losses on policies where we don’t have any underwriting data. That’s a data integrity problem (plus potentially a couple of other things) and worth discussing with your IT team.
So in this case, the cross join was successful in spotting a data problem.
Want to Learn more R shortcuts?
Check out our tutorial on helpful R functions. In a language where there seems to be several ways to solve any problems, this reference page can help guide you to good options for getting things done. We seek to interject a little Pythonic clarity and sustainability to the “just get it done” world of R programming.