We’re going to walk through how to merge two data frames in R.
This article continues the examples started in our data frame tutorial. We’re using the ChickWeight data frame example which is included in the standard R distribution. You can easily get to this by typing: data(ChickWeight) in the R console. This data frame captures the weight of chickens that were fed different diets over a period of 21 days. If you can imagine someone walking around a research farm with a clipboard for an agricultural experiment, you’ve got the right idea….
This series has a couple of parts – feel free to skip ahead to the most relevant parts.
- Inspecting your data
- Ways to Select a Subset of Data From an R Data Frame
- Create an R Data Frame
- Sort an R Data Frame
- Add and Remove Columns
- Renaming Columns
- Add and Remove Rows
- Merge Two Data Frames
This section of our tutorial is going to deal with how to combine datasets in R. There are three main techniques we are going to look at:
- cbind() – combining the columns of two data frames side-by-side
- rbind() – stacking two data frames on top of each other, appending one to the other
- merge() – joining two data frames using a common column
Using cbind() to merge two R data frames
We will start with the cbind() R function. This a simple way to join datasets in R where the rows are in the same order and the number of records are the same.
Syntax is straightforward – we’re going to use two imaginary data frames here, chicken and eggs:
# combine two datasets in r everything <-cbind(chicken, eggs)
The final result of this operation is the two data frames appended side by side. It is recommended but not required that the two data frames have the same number of rows. In the event one data frame is shorter than the other, R will recycle the values of the smaller data frame to fill the missing space.
Now, if you need to do a more complicated merge, read below. We will discuss how to merge data frames by multiple columns, set up complex joins to handle missing values, and merge using fields with different row names.
Using rbind() to merge two R data frames
We’ve encountered rbind() before, when appending rows to a data frame. This function stacks the two data frames on top of each other, appending the second data frame to the first.
For this function to operate, both data frames need to have the same number of columns and the same column names.
Using Merge to join Two Data Frames by A Common Field
This is one of the more common applications of merging two different but related data frames. We covered a simple version of this already in our example of setting buckets and flags, where we used R code to set the value of a flag. But we usually need to integrate a much larger amount of data.
For this example, assume we have a large data frame containing a detailed nutritional analysis of each diet formula, assembled via laboratory testing each sample for a variety of nutritional components, vitamins, and minerals. We want to append this to our weights data frame using the diet id as a common key.
# merge two data frames in r # r merge by rownames jointdataset <- merge(ChickWeight, LabResults, by = 'Diet')
Implementing more complicated merges
The merge operation will return a data frame that contains all records which can be matched between the two datasets. If you wanted to join a data frame on two fields, perhaps based on a daily analysis of what the chicks are fed, you could set up something like the following:
# merge two data frames in r # r merge data frames by multiple columns jointdataset <- merge(ChickWeight, LabResults, by = c('Diet','Time'))
This would match the records using the two fields.
When it comes to seeing what records are returned from the merge, you have options beyond the default criteria (the equivalent of an SQL inner join, returning only records which match both data frames). You can specify an additional parameter ‘all’ which controls which records are returned.
- Outer Join – returns all records including no-matches (in either direction)
- Left / Right Join – returns all records in on data frame and matching records in the other
- Cross Join – returns all possible combination of the rows in the two data frames; this is the infamous Cartesian join in SQL
The R code implementation of these additional joins:
# join datasets in r Outer Join: jointdataset <- merge(ChickWeight, LabResults,by = 'Diet', all=TRUE) Left Join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.x= TRUE) Right Join: jointdataset <- merge(ChickWeight, LabResults, by = 'Diet', all.y=TRUE) Cross Join: jointdataset <- merge(ChickWeight, LabResults, by = Null)
Finally, in the event the two columns you want to merge on have different names, this can be addressed by adjusting your ‘by’ parameter to handle each one separately. Sample code looks like:
# r merge by rownames - merge by different rownames jointdataset <- merge(ChickWeight, LabResults, by.x = 'Diet', by.y ='Sample')
Collectively, these options for merging an R dataframe replicate the core of SQL’s join function. You can use subset selection and other operations to implement filters as needed.
In the event you need to review another section of the tutorial…