The R programming language is capable of a remarkable level of flexibility. On one hand, you can use it in a fairly analogous way to programming languages like Python or even Perl. And it can easily emulate a lot of the functionality found in databases. But on the other hand, you can leverage R’s capabilities in truly unique ways that give you the best of both worlds. You can combine the scope of database queries with the power of a modern programming language. Those elements are the driving forces behind something known as an anti join. The technique lets you find unmatched rows in your datasets without needing to resort to complex investigations. And you’ll soon see exactly how to do it in R.
When Is a Join Not a Join?
Rather unsurprisingly, joins in R tend to focus on joining two collections of data together at a specified point. There are a considerable number of ways to perform a join. And each type of join has unique properties. But the different join types are heavily differentiated by the way they handle mismatched data. Do you keep mismatched data or throw it away? Do you want to keep data from one side but not the other? If so, should the right or left side of the data collections be given priority?
There’s a wide variety of ways that you can handle mismatched data. But all of it leads to an important question. What if you wanted to focus on the elements that don’t fit together instead of those which do? Or in other words, look at what doesn’t fit within a join rather than what does. That’s the basis for something known as an anti join.
Anti joins are also noteworthy for the fact that they’re not a pre-made solution that you’ll find in the standard R lexicon. You’ll need to to install the dplyr library before you can make use of anti joins. Or, at least, make use of them in a user-friendly manner that’s wrapped up into a single function. Thankfully you can install R by simply typing the following on the command line.
install.packages(“dplyr”)
If you’re using an IDE like R Studio then it’s just as easy. Dplyr is an incredibly popular R library. As such you’re almost certain to find it as an option for installation within any given R development environment. But no matter how it’s installed, you can call it in your code by simply adding this line.
library(dplyr)
Now that all of the pieces are in play it’s time to see how it all comes together. Get your favorite IDE or text editor ready because you’re about to dive into some practical anti joins.
First Step Toward an Anti Join
We can begin with a simple example. Imagine that we needed to quickly categorize improperly sorted data describing moons within our solar system. We could write a simple script to work as a diagnostic via anti join.
library(dplyr)
ourDataFrame <- data.frame(
id = c(1, 2, 3, 4, 5),
name = c(“Aegaeon”, “Atlas”, “Calypso”, “Daphnis”, “Epimetheus”),
stringsAsFactors = FALSE
)
ourDataFrame2 <- data.frame(
id = c(3, 4, 5, 6, 7),
name = c(“Calypso”, “Daphnis”, “Epimetheus”, “Himalia”, “Janus”),
stringsAsFactors = FALSE
)
unmatchedRows <- anti_join(ourDataFrame, ourDataFrame2, by = “id”)
print(unmatchedRows)
We begin by importing the dplyr library. This is where we get our anti join function from. We proceed to create the ourDataFrame and ourDataFrame2 data frames. The frames contain an id, moon name, and pass an argument to preserve our data as a string.
Next, we run our actual anti join and pass the result to unmatchedRows. This is accomplished through dplyr’s anti_join. We simply need to pass the two data frames we’re using and a key, id, to use as the reference point.
The anti_join function will compare ourDataFrame to ourDataFrame2. Data that doesn’t match will be assigned to unmatchedRows. We then proceed to print out our results. And it shows us that we’ve successfully extracted the unmatched rows from our two data frames.
Moving Forward With Anti Joins
Now that you know how to work with a basic anti join you might find yourself wondering what else it can do. Imagine a situation where you needed to quickly work through data while sorting for different parameters. To do so we’ll take on the role of a museum employee who needs to determine whether there’s enough space to realistically model various dinosaur species together in the same display. The primary criteria will be height and weight. Take a look at the following code to see how we could get around that problem through the use of an anti join.
library(dplyr)
dinoDNA <- data.frame(
species = c(“Troodon”,”Allosaurus”, “Apatosaurus”, “Stegosaurus”, “Triceratops”, “Utahraptor”, “Archaeopteryx”),
height = c(40, 1500, 12000, 2500, 9000, 1000, 99),
weight = c(40, 2000, 13000, 3000, 10000, 1100, 99),
stringsAsFactors = FALSE
)
dinoSubset <- dinoDNA %>%
filter(weight < 1999) %>%
select(species, height)
unmatchedRows <- anti_join(dinoDNA, dinoSubset, by = “species”)
print(dinoSubset)
print(unmatchedRows)
We begin by once again loading up the dplyr library. Next, we create our data frame – dinoDNA. We have all of the criteria we need to judge within this data frame. And we’re also ensuring our strings remain as characters by using the stringsAsFactors=FALSE argument. Note that we only create one data frame in this example. We’re actually going to be using an anti join with data that we generate from the dinoDNA data frame. And that’s what the declaration of dinoSubset is for on the next line.
We use a pipe operator to act as a “pipe” to filter and move our data. It begins by filtering for weight under 1999 from dinoDNA. Then the results are further refined by using select to grab species and height. The final results are passed through the pipe to dinoSubset. We then create a new variable called unmatchedRows.
It’s in unmatchedRows that we actually call anti_join. We pass our original dinoDNA data frame, and the subset of information we extracted from it as dinoSubset, as arguments to anti_join. Then we specify that we want to use species as our key. The results are then passed on to unmatchedRows. Finally, we can contrast and compare both the original and anti join data by printing both out on the following two lines. And in doing so we see that we’ve successfully used an anti join on a single data frame by creating a secondary subset of its contents.