This article continues the data science examples started in our data frame tutorial. We're using the ChickWeight data frame, which is included in the standard R distribution. You can load it by typing data(ChickWeight) in the R console. This data frame captures the weight of chickens that were fed different diets over a period of 21 days. If you can imagine someone walking around a research farm with a clipboard for an agricultural experiment, you've got the right idea.
We’re going to walk through how to extract slices of a data frame in R programming. This series has a couple of parts – feel free to skip ahead to the most relevant parts.
- Inspecting your data
- Ways to Select a Subset of Data From an R Data Frame
- How To Create an R Data Frame
- How To Sort an R Data Frame
- How to Add and Remove Columns
- Renaming Columns
- How To Add and Remove Rows
- How to Merge Two Data Frames
Selecting a Subset of an R Data Frame
Suppose we only want to look at a subset of the data, perhaps only the chicks that were fed diet #4.
To do this, we're going to use the subset command. We are also going to save a copy of the results into a new data frame (which we will call testdiet) for easier manipulation and querying. From there, nrow counts the rows in the subset, and length combined with unique counts the distinct chicks.
# subset in r example
testdiet <- subset(ChickWeight, Diet==4)
nrow(testdiet)
length(unique(testdiet$Chick))
Running our row count and unique chick counts again, we determine that our data has a total of 118 observations from the 10 chicks fed diet 4.
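To put the diet 4 numbers in context, the same counts can be produced for every diet at once. This is a minimal sketch using base R's table and tapply functions; the variable names obs_per_diet and chicks_per_diet are our own and not part of the data set.

```r
# ChickWeight ships with R in the datasets package
data(ChickWeight)

# Number of observations recorded for each diet
obs_per_diet <- table(ChickWeight$Diet)
obs_per_diet

# Number of distinct chicks on each diet
chicks_per_diet <- tapply(ChickWeight$Chick, ChickWeight$Diet,
                          function(x) length(unique(x)))
chicks_per_diet
```

The entries for diet 4 should match the nrow and length(unique(...)) results from the subset above.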
How to Subset Data in R – Multiple Conditions
The subset command in base R is extremely useful and can be used to filter information using multiple conditions. For example, perhaps we would like to look only at the diet 4 observations taken at the final time point. This allows us to ignore the early "noise" in the data and focus our analysis on mature birds. Note that conditions are combined with the single & operator, which compares the columns row by row. Returning to the subset function, we enter:
# subset in r data frame multiple conditions
subset(ChickWeight, Diet==4 & Time == 21)
You can also use the subset command to select specific fields within your data frame, to simplify processing. In this case, we keep only the weight and Time columns while still filtering the rows by column value.
# subset in r
testdiet <- subset(ChickWeight, select=c(weight, Time), subset=(Diet==4 & Time > 20))
This version of the subset command narrows your data frame down to only the elements you want to look at.
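The select argument also accepts a negated column name or a range of columns, as shown in the examples of R's own help page for subset. The sketch below assumes the column order of ChickWeight as shipped (weight, Time, Chick, Diet); the names no_diet and first_two are purely illustrative.

```r
data(ChickWeight)

# Drop a single column by negating its name
no_diet <- subset(ChickWeight, Diet == 4 & Time == 21, select = -Diet)
names(no_diet)

# Keep a run of adjacent columns with the from:to form
first_two <- subset(ChickWeight, Diet == 4 & Time == 21, select = weight:Time)
names(first_two)
```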
Other Ways to Subset A Data Frame in R
There are actually many ways to subset a data frame using R. While the subset command is the simplest and most intuitive way to handle this, you can manipulate data directly from the data frame syntax. Consider:
# subset in r - conditional indexing
testdiet <- ChickWeight[ChickWeight$Diet==4,]
This approach is referred to as conditional indexing. We select rows by applying a logical condition to one column of the data frame and using the result as the row index. Any row meeting that condition is returned, in this case the observations from birds fed the test diet. You could also use it to filter out records with a missing value in a specific column, where the data collection process failed, for example.
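As a rough illustration of the missing value case, the sketch below blanks out one weight reading in a copy of the data and then filters it back out with is.na. The missing value is injected by us purely for demonstration, and the names chicks, complete, heavy_index and heavy_subset are hypothetical.

```r
data(ChickWeight)

# Work on a copy so the original data set stays untouched
chicks <- ChickWeight

# Hypothetical data collection failure: one weight was never recorded
chicks$weight[5] <- NA

# Keep only the rows where weight is present
complete <- chicks[!is.na(chicks$weight), ]
nrow(chicks) - nrow(complete)   # one row removed

# Caution: an NA in the condition produces a row of NAs with
# conditional indexing, while subset() silently drops it
heavy_index  <- chicks[chicks$weight > 100, ]
heavy_subset <- subset(chicks, weight > 100)
nrow(heavy_index) - nrow(heavy_subset)
```

If a column may contain missing values, it is worth deciding up front whether those rows should be dropped or kept, since the two styles behave differently.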
You can, in fact, use this syntax for selections with multiple conditions (using the column name and column value for multiple columns). The code below returns the same rows as the earlier subset example that filtered on Diet and Time.
# subset in r data frame multiple conditions
bigbirds <- ChickWeight[(ChickWeight$Diet==4) & (ChickWeight$Time==21),]
You can use logical operators to combine conditions. The AND operator (&) indicates both logical conditions are required. You also have the option of using the OR operator (|), indicating a record should be included if it meets either logical condition. An example of this is below.
# subset in r
endpoints <- ChickWeight[(ChickWeight$Time < 3) | (ChickWeight$Time > 20),]
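To see how these operators behave, here is a small sketch on two made-up vectors (obs_diet and obs_time are invented for illustration, not taken from ChickWeight). The single-character forms & and | work element by element, which is what row filtering needs. The doubled forms && and || are meant for single TRUE/FALSE values, such as inside an if statement, and recent versions of R raise an error when they are given longer vectors.

```r
# Three made-up observations
obs_diet <- c(4, 4, 1)
obs_time <- c(0, 10, 21)

# Elementwise AND: both conditions must hold for that element
both <- (obs_diet == 4) & (obs_time > 5)
both    # FALSE  TRUE FALSE

# Elementwise OR: at least one condition must hold
either <- (obs_diet == 4) | (obs_time > 5)
either  # TRUE TRUE TRUE
```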
In this case, we are asking for all of the observations recorded either early in the experiment or late in the experiment.
This can be a powerful way to transform your original data frame, using logical subsetting to prune specific elements (for example, rows with missing values or with bad values in one or more columns). This allows you to remove the observations where you suspect external factors (data collection error, special causes) have distorted your results.
There is also the which function, which returns the positions of the rows where a condition is TRUE. Those positions can then be used as the row index.
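Pruning works by negating a condition with the ! operator, so that the rows matching it are dropped rather than kept. The sketch below stores the endpoint condition from the earlier example in a variable (is_endpoint, a name chosen here for illustration) and uses it both ways.

```r
data(ChickWeight)

# TRUE for observations at the very start or very end of the experiment
is_endpoint <- (ChickWeight$Time < 3) | (ChickWeight$Time > 20)

# Keep the matching rows ...
endpoints <- ChickWeight[is_endpoint, ]

# ... or prune them by negating the condition
middle <- ChickWeight[!is_endpoint, ]

# Every observation lands in exactly one of the two pieces
nrow(endpoints) + nrow(middle) == nrow(ChickWeight)
```

Storing the condition in a variable first also makes it easy to inspect with sum(is_endpoint) before any rows are removed.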
# which function in R - select columns returned
ChickWeight[which((ChickWeight$Diet == 4) & (ChickWeight$Time==21)), names(ChickWeight) %in% c("weight","Time")]
This also yields the same basic result as the examples above. In this example, which supplies the row positions, while the names(ChickWeight) %in% c("weight","Time") expression in the column position reduces the number of columns returned. We specify that we only want to look at weight and Time in our subset of data.
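One way to convince yourself that the three styles agree is to run them side by side and compare the results. This is a sketch rather than a formal proof; the names keep, by_subset, by_index and by_which are our own.

```r
data(ChickWeight)

# 1. subset() with a row condition and a column selection
by_subset <- subset(ChickWeight, Diet == 4 & Time == 21,
                    select = c(weight, Time))

# 2. Conditional indexing with a logical row index
keep <- (ChickWeight$Diet == 4) & (ChickWeight$Time == 21)
by_index <- ChickWeight[keep, c("weight", "Time")]

# 3. which() for the rows, %in% for the columns
by_which <- ChickWeight[which(keep),
                        names(ChickWeight) %in% c("weight", "Time")]

# Compare the selected weights across the three approaches
identical(by_subset$weight, by_index$weight)
identical(by_index$weight, by_which$weight)
```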
Ready for more? Let's move on to creating your own R data frames from raw data. Or feel free to skip around our tutorial on manipulating a data set using the R language.