How to Perform a Full Join in R (using the merge function)

You’d be hard-pressed to find any real debate about R’s versatility in the face of data science challenges. R is one of the most popular tools for statistical analysis, machine learning, and similar fields, and it provides a wealth of options for almost any mathematical challenge you could imagine. However, there is one unexpected downside to that versatility. You’ll often find many different ways to work with the same data, which can make it difficult to know how to, for example, join data from two different data frames. Even slight variations on a join can produce very different results. But you’ll soon discover how to take control of your data and perform a full join in R using the merge function.

The Complex Relationship Between R and Data Types

The idea of merging data can seem fairly straightforward at first glance. And it’s true that merging simple data collections in some languages is just as straightforward as you might imagine. But there’s an important point to keep in mind. R is designed around data science, and it uses more complex data structures by default than most languages. For example, data frames are one of the most common data structures in R, and they’re fairly complex: essentially little database tables held in a variable. A structure with that many features naturally comes with a lot of options. Remember, the SQL language is entirely focused on database management, and R needs to pack similar power into its handling of data frames.

Take joins, for example. You might simply want to create a new data frame that has the contents of two existing frames. But what exactly does that mean in specific terms? Think about two spreadsheets of data. Each has a name column. But what does a name actually mean? On one spreadsheet the names might be first names. The other might have people organized by surname. Names could even refer to someone’s school or workplace. Because natural language is ambiguous in this way, we often need to join data on columns that don’t obviously line up. For example, we might use a primary field of one data frame and a tertiary field of another.
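As a minimal sketch of joining on key columns that carry different names, the example below uses merge’s by.x and by.y arguments. The staff and payroll frames, their column names, and all of their values are hypothetical and exist only for illustration.

```r
# Hypothetical data: the same kind of key stored under different column names
staff <- data.frame(
  surname = c("Ng", "Okafor", "Silva"),
  office = c("Lyon", "Accra", "Porto")
)
payroll <- data.frame(
  employeeId = c(101, 102, 103),
  grade = c("B", "A", "C"),
  lastName = c("Okafor", "Silva", "Tanaka")
)

# by.x names the key column in the first frame, by.y the key column in the second
joined <- merge(staff, payroll, by.x = "surname", by.y = "lastName", all = TRUE)
print(joined)
```

The key column in the result takes its name from the first frame (surname here), and any row without a partner in the other frame is filled in with NA.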

But even if we have two fully compatible data frames there’s still a lot to consider. In those instances, we do have a useful merge function that can essentially turn two objects into one. But there are still a lot of options to consider when using it. For example, this is a full list of the arguments supported by merge.

merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, incomparables = NULL, ...)

That’s a large number of options to be sure. But don’t worry, we actually only need a small subset of those arguments for most full join operations.
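To make that subset concrete, here is a small sketch of how the all, all.x, and all.y arguments select the type of join. The leftFrame and rightFrame data are made up for illustration and are not part of the moon examples in this article.

```r
# Hypothetical frames that share ids 2 and 3 only
leftFrame <- data.frame(id = c(1, 2, 3), leftValue = c("a", "b", "c"))
rightFrame <- data.frame(id = c(2, 3, 4), rightValue = c("x", "y", "z"))

innerJoin <- merge(leftFrame, rightFrame, by = "id")                # default all = FALSE: matching ids only
leftJoin  <- merge(leftFrame, rightFrame, by = "id", all.x = TRUE)  # keep every row of leftFrame
rightJoin <- merge(leftFrame, rightFrame, by = "id", all.y = TRUE)  # keep every row of rightFrame
fullJoin  <- merge(leftFrame, rightFrame, by = "id", all = TRUE)    # keep every row of both frames

# Compare how many rows each join keeps
print(c(inner = nrow(innerJoin), left = nrow(leftJoin),
        right = nrow(rightJoin), full = nrow(fullJoin)))
```

Setting all = TRUE is shorthand for setting both all.x and all.y to TRUE, which is why it produces a full join.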

The Fundamentals of Merge

At this point, you might be thinking that a full join is going to be a rather arcane endeavor. But the following example highlights that a full join using merge can be quite simple if we’re only trying to perform an equally simple operation.

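# Moons of Jupiter and their distances from the planet
jupitersMoons <- data.frame(
  moonName = c("Io", "Europa", "Ganymede", "Callisto"),
  moonDistance = c(421000, 671000, 1070000, 1880000)
)

# Moons of Mars and their distances from the planet
marsMoons <- data.frame(
  moonName = c("Phobos", "Deimos"),
  moonDistance = c(9378, 23463)
)

# all = TRUE keeps every row from both frames (a full join)
mergedFrame <- merge(jupitersMoons, marsMoons, all = TRUE)
print(mergedFrame)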

We begin by creating two data frames, jupitersMoons and marsMoons. Each contains the names of a planet’s moons and their distances from the parent planet. This is about as compatible as anyone could hope for since both frames use the same concepts and the same column names. And this allows us to see the realization of a previously discussed concept: the simpler the data management job, the simpler the code. So even though merge supports a vast number of arguments, we only need three. The process begins by declaring a new variable called mergedFrame, which will hold the newly merged frame once our operation has been completed.

Next, we make the actual call to merge. The first two arguments are just our two data frames. We follow that up with “all = TRUE”. This is where we really define the nature of our merge as a full join. When we pass this argument we’re essentially saying to keep every row from both data frames, whether or not it has a match in the other. Since we didn’t supply a “by” argument, merge uses every column name the two frames share as the key, which here means both moonName and moonDistance. No row matches on both, so all six moons come through as separate rows, and we receive a cleanly formatted two-column data frame that contains everything from its parent frames.
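To check which columns merge treats as the key when “by” is left out, we can evaluate the default from the argument list ourselves. The sketch below reuses the two moon frames; defaultKey is a variable name introduced here purely for illustration.

```r
jupitersMoons <- data.frame(
  moonName = c("Io", "Europa", "Ganymede", "Callisto"),
  moonDistance = c(421000, 671000, 1070000, 1880000)
)
marsMoons <- data.frame(
  moonName = c("Phobos", "Deimos"),
  moonDistance = c(9378, 23463)
)

# The default for by: every column name the two frames have in common
defaultKey <- intersect(names(jupitersMoons), names(marsMoons))
print(defaultKey)

# With both columns acting as the key and no matching rows,
# the full join keeps all 4 + 2 rows in the same two columns
mergedFrame <- merge(jupitersMoons, marsMoons, all = TRUE)
print(dim(mergedFrame))
```

If the two frames happened to share a row with the same name and distance, that row would appear only once in the result, since it would count as a match.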

Taking the Full Join to the Next Level

Keep in mind that R typically scales to our needs. So that simple example had an equally simple syntax. But what if we had a slightly more difficult scenario? Say, we wanted to perform a full join using the provided data while leaving it open to calculate the distance from one planet’s moon to the other planet’s orbit. We could do so by explicitly targeting the moonName for our key. Take a look at the following code to see that idea in action.

jupitersMoons <- data.frame(
  moonName = c("Io", "Europa", "Ganymede", "Callisto"),
  moonDistance = c(421000, 671000, 1070000, 1880000)
)

marsMoons <- data.frame(
  moonName = c("Phobos", "Deimos"),
  moonDistance = c(9378, 23463)
)

# by = "moonName" makes moonName the only key column for the full join
mergedFrame <- merge(jupitersMoons, marsMoons, by = "moonName", all = TRUE)
print(mergedFrame)

This time around when we call merge we’re also passing a “by” argument, and this changes things in more ways than you might expect. The most important point is that moonName is now the only key column. Because moonDistance is no longer part of the key, and both frames have a column with that name, merge keeps both copies and appends its default suffixes to tell them apart. That gives us a moonDistance.x column (from jupitersMoons) and a moonDistance.y column (from marsMoons). No moon appears in both frames, so each row has a value in one of those columns and NA in the other. This gives us the extra space we need to add in future distance calculations. But the naming conventions don’t really give us a very good idea of what we’re looking at. We can fix that by placing the following line directly below the assignment of mergedFrame.

names(mergedFrame) <- c("moonName", "moonDistanceJupiter", "moonDistanceMars")

The names function gets or sets the names of an object. For a data frame, those are its column names. In this case, we’re renaming the two moonDistance columns into a more descriptive form. Note that the assignment relies on column order: the key column comes first, followed by the columns from the first frame and then those from the second. It’s generally best to put as much care into your new data frame structure as possible when performing a join. You don’t have to micromanage every detail. But just providing a key for the join can help you create more predictable results. However, as we’ve seen, R can also handle a lot on its own even when we don’t specify a key.
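As an alternative to renaming after the fact, the suffixes argument from merge’s argument list can label the overlapping columns during the join itself. This is a minimal sketch using the same two moon frames; the suffix strings “Jupiter” and “Mars” are our own choice.

```r
jupitersMoons <- data.frame(
  moonName = c("Io", "Europa", "Ganymede", "Callisto"),
  moonDistance = c(421000, 671000, 1070000, 1880000)
)
marsMoons <- data.frame(
  moonName = c("Phobos", "Deimos"),
  moonDistance = c(9378, 23463)
)

# suffixes replaces the default ".x" and ".y" on overlapping non-key columns
mergedFrame <- merge(jupitersMoons, marsMoons, by = "moonName", all = TRUE,
                     suffixes = c("Jupiter", "Mars"))
print(names(mergedFrame))

# Jupiter's moons have no Mars distance, so those cells are NA
print(mergedFrame[is.na(mergedFrame$moonDistanceMars), ])
```

The suffixes are pasted directly onto the original column name, so no separator is added unless you include one in the suffix yourself.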
