The R programming language is one of the undisputed stars of the data science world. Whether you’re working through an advanced machine learning algorithm or searching for quirks in combined meta-analysis data, one thing’s sure. The R programming language is well suited for the task. But there are times when R’s ease of use can be somewhat perplexing. There can be situations where doing something is easy but doing it perfectly seems to come with a lot of issues. Take data frames for example. Merging two data frames is extremely simple. In fact, it’s so simple that there are a number of different ways to go about it. But that simplicity can make it hard to know how to fine-tune a singular method to accurately fit into your particular need. For example, how would you merge two data frames in R without duplicating columns? Read on to find out what the best way to accomplish that task really is.
Data Frames, Formatting, and Structure
Data frames are a relatively common concept in programming languages that are designed around more advanced data or calculations. And those programming languages which lack data frames tend to have some form of 3rd party support for them. But with that said, a data frame in one language will usually differ in some respects from data frames in another.
R data frames are two-dimensional structures that consist of rows and columns. They’re fairly analogous to a standard spreadsheet format with an x and y-axis. For example, take a look at the following two data frames.
ourData1 <- data.frame(
planet = c(“Mercury”, “Venus”, “Earth”),
distanceFromSun = c(57.9, 108.2, 149.6)
)
ourData2 <- data.frame(
planet = c(“Mars”, “Jupiter”, “Saturn”),
numberOfMoons = c(2, 79, 82)
)
In this example, we create two data frames – ourData1 and ourData2. The ourData1 frame contains the names and distances of the first three planets in our solar system. The ourData2 frame contains the next three planets and their respective moons.
So far this is all fairly standard. But what really makes R’s data frames so powerful is the language structure. R data frames can be easily converted to and from different formats with a single function. And this isn’t something that’s expected to be a rare occurrence. The language is designed around the idea of data being somewhat fluid. You’re expected to move data around inside different containers and formats. And this concept is reflected in R data frames.
For example, you can work functions like sum into specialized loop systems like the various apply variants. This will let you loop through logistically specified parts of a data frame while assigning its contents to another variable. You can use this concept to essentially rebuild a data frame in a single line of code. And there are also specialized methods to join data frames in a specific order or by other criteria.
All of this highlights why merging data frames can be both simple and complex at the same time. It’s easy to do a simple merge of two data frames. But the system is built around the idea that users will want to have a high level of control over the process. As such, it tends to expect that you’ll add some additional arguments or use more specialized logic. And if that’s not done you can wind up with unexpected results. One of the more common examples is duplicated or missing data. But it’s extremely easy to avoid that problem by looking into the specifics of merge operations.
The Basic Merge Mechanics
Think back to our prior two data frame examples. There’s one point that might raise some concern if you’ve worked with data frames in other languages. Our column data isn’t fully compatible. If we merge our two data frames we’ll have missing values. It’s true that R will take note of that fact. But it will also try to fix it rather than exiting out with an error. Take a look at one solution to the problem in the following code.
ourData1 <- data.frame(
planet = c(“Mercury”, “Venus”, “Earth”),
distanceFromSun = c(57.9, 108.2, 149.6)
)
ourData2 <- data.frame(
planet = c(“Mars”, “Jupiter”, “Saturn”),
numberOfMoons = c(2, 79, 82)
)
mergedData <- merge(ourData1, ourData2, by = “planet”, all = TRUE)
print(mergedData)
If you run this script you might be surprised by the fact that it works so smoothly. This is thanks to the arguments passed in the merge command. The merge command is initiated with the names of the two data frames we want to merge. But we follow it with a “by” argument. This is the name of the point where the merge is going to occur. You can think of it as essentially putting a thumbtack in place to hold two printed documents together.
In this case, we use planets because we have two compatible columns named the same. We can easily merge those two data frames together thanks to the shared planets format. However, there’s also the matter of the distanceFromSun and numberOfMoons columns. Each data frame has a category that’s not present in the other. This is where the “all” argument comes in. We explicitly tell R that we want to merge all of the data. When there’s no equivalent data to be found R will instead substitute a not applicable (NA) value. This also points to the potential problems that can result in column duplication. Merge will avoid column duplication or identical columns by appending a suffix to the column names. For example, run the same code but change the first data frame to the example below.
ourData1 <- data.frame(
planet = c(“Mercury”, “Venus”, “Earth”),
distanceFromSun = c(57.9, 108.2, 149.6),
numberOfMoons = c(0, 0, 1)
)
You’ll now have a column called planet, numberOfMoons.x, and numberOfMoons.y. However, this isn’t due to a conflict between the data. It’s because merge operates by providing separate columns for items other than the column specified in the “by” argument. Note that while this can be a somewhat visually distracting option, merge is protecting us from duplicate columns through this practice. You might wonder how we’d tidy all of this up into a fully joined numberOfMoons columnn. To do that we’d be best served by using rbind, as seen below.
ourData1 <- data.frame(planet = c(“Mercury”, “Venus”, “Earth”),
distanceFromSun = c(57.9, 108.2, 149.6),
numberOfMoons = c(0, 0, 1))
ourData2 <- data.frame(planet = c(“Mars”, “Jupiter”, “Saturn”),
distanceFromSun = c(227.9, 778.3, 1433.5),
numberOfMoons = c(2, 79, 82))
combinedData <- rbind(ourData1, ourData2)
print(combinedData)
The previous example with merge avoided duplicates by renaming columns with an additional suffix. But with rbind we combine the data frames by essentially appending data to the original. This also highlights one of the great things about R. There’s no one size fits all solution. We’re instead given the option to decide which methodology is the best fit for our particular project.