Cross Join in R – Merging Data Frames

Most good data science projects involve merging data from multiple sources. We covered the basics of how to use the merge() function in our earlier tutorial about data manipulation. We’re going to take a deeper look at the “cross join” operation between two tables. While cross joins tend to be much more rare than inner joins and left joins, they are part of the data manipulation lineup.

You should probably also read our tutorial on the left join in R and inner joins. Most of your real world requests will likely fall into this bucket.

What is a Cross Join In R?

An cross join is a merge operation between two data frame which seeks to return every matching records in the two data frames, including ones that lack a matching pair on the other side.

Cross joins have a bad reputation among developers. Most queries which include a cross join are in error; there are very few real world applications for this query. Unless, of course, you are a Probabilist tasked with analyzing the Cartesian explosion of a pair of variables (yes – a true real world case).

For our example, we return to the insurance losses analysis. We have a set of insurance claims (5) that we want to map back to the policy underwriting data.

Cross Join in R – Setting Up Merge ()

Here’s the merge function that will get this done.

results<-merge(x=source1,y=source2,by=”State”,all.x=TRUE, all.y=TRUE)

source – the names of our two data frames
by – this parameter identifies the field in the dataframes to use to match records together. Trying to merge two different column names? No problem, we’ve got you covered – article covering this point.
all.x and all.y = Boolean which indicates if you want this to be an inner join (matches only) or an outer join (all records on one side). In this example, we want everything from both sides of the merge.

Cross Join in R – Code Example

We’re going to go ahead and set up the data:

# cross join in R - create first data set
> policies <- data.frame(Policy = c(1:9), State=c(rep("GA",3), rep("FL", 3), rep("AL", 3)), Limit=c(rep(50000,3), rep(75000, 3), rep(85000, 3)))

> policies
  Policy State Limit
1      1    GA 50000
2      2    GA 50000
3      3    GA 50000
4      4    FL 75000
5      5    FL 75000
6      6    FL 75000
7      7    AL 85000
8      8    AL 85000
9      9    AL 85000

> losses <- data.frame(Policy=c(1,4,5,10,11), loss=c(500,25000,85000,5000,10000))
> losses
  Policy  loss
1      1   500
2      4 25000
3      5 85000
4     10  5000
5     11 10000

So now we’re going to merge the two data frames together.

# cross join in R - join the dataframes! 
> scored_losses<-merge(x=policies,y=losses,by="Policy",all.x=TRUE, all.y=TRUE)
> scored_losses
   Policy State Limit  loss
1       1    GA 50000   500
2       2    GA 50000    NA
3       3    GA 50000    NA
4       4    FL 75000 25000
5       5    FL 75000 85000
6       6    FL 75000    NA
7       7    AL 85000    NA
8       8    AL 85000    NA
9       9    AL 85000    NA
10     10  <NA>    NA  5000
11     11  <NA>    NA 10000

This actually could work as a real world query example. We managed to yield a fairly nice data set of policy vs. loss data for the three states we expected in our sample. And then we have items 10 and 11…

Apparently we’re paying losses on policies where we don’t have any underwriting data. That’s a data integrity problem (plus potentially a couple of other things) and worth discussing with your IT team.

So in this case, the cross join was successful in spotting a data problem.

Want to Learn more R shortcuts?

Check out our tutorial on helpful R functions. In a language where there seems to be several ways to solve any problems, this reference page can help guide you to good options for getting things done. We seek to interject a little Pythonic clarity and sustainability to the “just get it done” world of R programming.