How To Subset An R Data Frame – Practical Examples

This article continues the examples started in our data frame tutorial. We’re using the ChickWeight data frame example which is included in the standard R distribution. You can easily get to this by typing: data(ChickWeight) in the R console. This data frame captures the weight of chickens that were fed different diets over a period of 21 days. If you can imagine someone walking around a research farm with a clipboard for an agricultural experiment, you’ve got the right idea….

We’re going to walk through how to extract slices of a data frame in R. This series has a couple of parts – feel free to skip ahead to the most relevant parts.

Selecting A Subset of an R Data Frame

Suppose we only want to look at a subset of the data, say only the chicks that were fed diet #4.

To do this, we’re going to use the subset function. We are also going to save a copy of the results in a new data frame (which we will call testdiet) for easier manipulation and querying.


testdiet <- subset(ChickWeight, Diet==4)

nrow(testdiet)

length(unique(testdiet$Chick))

Running our row count and unique chick counts again, we determine that our data has a total of 118 observations from the 10 chicks fed diet 4.
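If you want to sanity-check those two numbers against the rest of the experiment, one option is to tabulate the full data frame by diet. The sketch below assumes only the built-in ChickWeight data; the names obs_per_diet and chicks_per_diet are illustrative.

```r
data(ChickWeight)

# Number of observations recorded under each diet
obs_per_diet <- table(ChickWeight$Diet)
obs_per_diet

# Number of distinct chicks assigned to each diet
chicks_per_diet <- sapply(split(ChickWeight$Chick, ChickWeight$Diet),
                          function(x) length(unique(x)))
chicks_per_diet
```

The entries for diet 4 should match the nrow and unique counts from the subset above.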

The subset command is extremely useful and can be used to filter information using multiple conditions. For example, perhaps we would like to look at only observations taken with a late time value. This allows us to ignore the early “noise” in the data and focus our analysis on mature birds. Returning to the subset function, we enter:

 
subset(ChickWeight, Diet == 4 & Time == 21)
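The filter needs a single &, which compares two logical vectors element by element and so produces one TRUE or FALSE per row. The doubled && expects a single value on each side; recent versions of R raise an error when it is given longer vectors, and older versions quietly looked at the first element only. A minimal sketch on a small made-up data frame (the name toy and its values are illustrative, not taken from ChickWeight):

```r
toy <- data.frame(Diet = c(4, 4, 1), Time = c(21, 10, 21))

# Element-wise AND: one TRUE/FALSE value per row
keep <- toy$Diet == 4 & toy$Time == 21
keep

# subset() keeps only the rows where both conditions hold
late_diet4 <- subset(toy, Diet == 4 & Time == 21)
late_diet4
```

Only the first row of toy satisfies both conditions, so it is the only row returned.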

You can also use the subset command to select specific fields within your data frame, to simplify processing.

 

testdiet <- subset(ChickWeight, select=c(weight, Time), subset=(Diet==4 & Time > 20))

This version of the subset command narrows your data frame down to only the elements you want to look at.
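The select argument accepts more than a vector of column names. As a sketch of two variations that subset supports, you can drop a column with a minus sign or take a run of adjacent columns with a colon. The result names no_id and first_two are illustrative.

```r
data(ChickWeight)

# Keep every column except Chick for the diet 4 birds
no_id <- subset(ChickWeight, Diet == 4, select = -Chick)
names(no_id)

# Keep the adjacent columns from weight through Time
first_two <- subset(ChickWeight, Diet == 4, select = weight:Time)
names(first_two)
```

Both forms filter the rows in the same way; only the set of columns returned changes.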

Other Ways to Subset A Data Frame in R

There are actually many ways to subset a data frame using R. While the subset command is the simplest and most intuitive way to handle this, you can manipulate data directly from the data frame syntax. Consider:


testdiet <- ChickWeight[ChickWeight$Diet==4,]

This approach is referred to as conditional indexing. We can select rows from the data frame by applying a condition to the overall data frame. Any row meeting that condition is returned, in this case, the observations from birds fed the test diet.
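To see what conditional indexing is doing, it can help to look at the condition by itself. It is an ordinary logical vector with one entry per row, and the brackets keep the rows marked TRUE. A small sketch (is_diet4 is an illustrative name):

```r
data(ChickWeight)

# One TRUE/FALSE value for every row of the data frame
is_diet4 <- ChickWeight$Diet == 4
length(is_diet4) == nrow(ChickWeight)

# Number of rows that satisfy the condition
sum(is_diet4)

# Rows go before the comma; leaving the column slot empty keeps all columns
testdiet <- ChickWeight[is_diet4, ]
nrow(testdiet)
```

Storing the condition in a variable first also makes it easy to reuse or combine with other conditions later.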

You can, in fact, use this syntax for selections with multiple conditions. The code below yields the same result as the examples above.


bigbirds <- ChickWeight[(ChickWeight$Diet==4) & (ChickWeight$Time==21),]

The AND operator (&) indicates both conditions are required. Use the single & here: it compares the two conditions row by row, whereas && works on single values only. You also have the option of using an OR operator (|), indicating a record should be included if it meets either condition. A possible example of this is below.

 

endpoints <- ChickWeight[(ChickWeight$Time < 3) | (ChickWeight$Time > 20),]

In this case, we are asking for all of the observations recorded either early in the experiment or late in the experiment.
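As a quick check on what the OR condition selects, you can list the Time values that survive the filter. The sketch also shows %in% as an alternative way to write the same selection when the values of interest are known in advance. Treating 0, 2 and 21 as the matching measurement days is an assumption based on the ChickWeight schedule of readings every two days plus a final one on day 21; endpoints_in is an illustrative name.

```r
data(ChickWeight)

endpoints <- ChickWeight[(ChickWeight$Time < 3) | (ChickWeight$Time > 20), ]

# Which measurement days made it through the filter?
sort(unique(endpoints$Time))

# The same rows, written as a membership test instead of two comparisons
endpoints_in <- ChickWeight[ChickWeight$Time %in% c(0, 2, 21), ]
identical(rownames(endpoints), rownames(endpoints_in))
```

The membership form tends to read better once the list of values grows beyond two or three comparisons joined with |.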

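There is also the which function, which converts a logical condition into the row numbers where that condition is TRUE. Some people find this form slightly easier to read.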

ChickWeight[which((ChickWeight$Diet == 4) & (ChickWeight$Time == 21)),
            names(ChickWeight) %in% c("weight","Time")]

This also yields the same basic result as the examples above. The which call picks the rows, and the second index shows how you can reduce the number of columns returned: names(ChickWeight) %in% c("weight","Time") keeps only the weight and Time columns in our subset of data.
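One practical difference between which and a bare logical condition shows up when the data contain missing values. A logical index with an NA in it returns a row filled with NAs, while which drops those positions (subset treats them the same way). ChickWeight has no missing values, so the sketch below uses a small made-up data frame named toy:

```r
toy <- data.frame(Diet = c(4, NA, 4), weight = c(200, 150, 180))

# Logical indexing: the NA in the condition produces an all-NA row
by_logical <- toy[toy$Diet == 4, ]
nrow(by_logical)

# which() returns only the positions that are TRUE, so the NA row is dropped
by_which <- toy[which(toy$Diet == 4), ]
nrow(by_which)
```

If your data may contain NAs in the filter column, which or subset is usually the safer choice for this reason.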

Ready for more? Let’s move on to creating your own R data frames from raw data. Or feel free to skip around.