How To Extract Rows From an R Data Frame With a Certain Value

R is one of the top programming languages targeting statistics and data science. One reason for this popularity is that the language, its syntax, and the base R library are all tightly focused on data manipulation. You can, of course, use R for more than data manipulation. But there aren’t many languages that offer R’s level of power for that particular discipline. For example, you can easily manipulate the structure and contents of R’s native data types. As you’re about to see, R even makes it easy to extract rows from an R data frame with a specific value.

Powerful Types Powered by Powerful Logic Systems

R gets some well-deserved credit for its basic data type design. R makes it easy to work with data that can naturally fit into almost any methodology you could dream of. And of course, R’s base library is also extremely impressive. But these elements are only half of the story. Another key to R’s power can be found in the language’s syntax. Your R programming style can take a variety of different forms according to your needs. You can write basic desktop functionality in scripts that are fairly similar to what you’d find in a language like Python. But you can also combine R’s data types with syntax similar to what you’d use when working with databases or Perl. This is especially important when discussing anything related to complex selection rules within data frames.

Data frames can be used as a fairly simple collection system. In this context, they’d function in a similar way to lists in Python. But it’s important to keep in mind that data frames are laid out much like database tables, with rows acting as records and columns acting as fields. That means we’ll often query R data in much the same way we’d query a database. But don’t worry, doing so is far easier than it sounds.

Databases, Frames, and Selective Syntax

The key to the practices that will let us extract rows from data frames comes in two main parts – the regular expression and logical criterion. We also find a third emergent element, logical subsetting, that essentially acts as a bridge between the two. This might sound complex at first. But we can highlight how to use some of these elements with just a few lines of code. Take a look at the following script.

# Build a small data frame of animals, their habitats, and their sizes
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)
# Keep only the rows whose animal column is "panda" or "puma"
ourSubset <- ourData[ourData$animal %in% c("panda", "puma"), ]
print(ourSubset)

We begin by defining a data frame’s contents with information about animals. This frame, ourData, is then directly accessed in the next line and selectively assigned to the ourSubset variable. But take note of how we access ourData. We use a pair of square brackets immediately after ourData. This indicates that we want a subset of ourData’s contents. Inside the brackets, whatever comes before the comma selects rows and whatever comes after it selects columns. Leaving the column position empty keeps every column. But we’re not just referencing a simple location within the data frame’s contents. We’re instead using program logic to dynamically generate the selected data. In particular, we’re using logical subsetting: the %in% test yields a TRUE or FALSE for every row, and only the TRUE rows are kept. With this process in mind, we can build on it to create larger-scale row extraction.
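To make the logical subsetting step more concrete, the sketch below splits the one-line selection into two steps. It recreates the same ourData frame; the intermediate variable name keep is only an illustrative assumption and is not part of the original script.

```r
# Recreate the article's data frame so this block runs on its own
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)

# Step 1: %in% produces one TRUE/FALSE value per row ("keep" is a hypothetical name)
keep <- ourData$animal %in% c("panda", "puma")
print(keep)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# Step 2: the logical vector goes before the comma, so it selects rows;
# the empty position after the comma keeps every column
ourSubset <- ourData[keep, ]
print(ourSubset)
```

Splitting the steps this way is purely for inspection. The one-line form in the script above does the same work.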

Putting the Pieces Together for Row Extraction

Take a look at the following code to see how we can build on the concepts we’ve looked at so far.

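# Build the same data frame of animals, habitats, and sizes
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)
# Logical criterion: rows whose habitat is exactly "savanna"
ourSubset1 <- ourData[ourData$habitat == "savanna", ]
# Regular expression: rows whose animal name begins with "p"
ourSubset2 <- ourData[grep("^p", ourData$animal), ]
# Logical subsetting: rows whose size is "medium" or "small"
ourSubset3 <- ourData[ourData$size %in% c("medium", "small"), ]
# Stack the three subsets into a single data frame
ourSubset <- rbind(ourSubset1, ourSubset2, ourSubset3)
print(ourSubset)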

You’ll recognize some of the elements from the previous script. For example, we once again assign animals and habitats to ourData. However, this time around we’re being a little more selective with our subset. We use the $ to specify a column name inside ourData. This lets us load up the habitat column from the data set’s contents. Note that we’re using a simple logical criterion format to select our information from within the specific column in ourData. Logical criterion simply means using a basic TRUE and FALSE flag for our selection. Something is or is not labeled as a savanna. The results of this selection are then assigned to ourSubset1.
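As a rough illustration of the TRUE and FALSE flags described above, the following sketch prints the result of the comparison before using it. The variable name isSavanna is an assumption made for this example. The call to subset() is a base R alternative shown for comparison, not something the original script uses.

```r
# Recreate the article's data frame so this block runs on its own
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)

# The comparison yields one flag per row ("isSavanna" is a hypothetical name)
isSavanna <- ourData$habitat == "savanna"
print(isSavanna)
# [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

# Using the flags as a row index keeps only the TRUE rows
ourSubset1 <- ourData[isSavanna, ]
print(ourSubset1)

# Base R's subset() expresses the same selection without the $ prefix
viaSubset <- subset(ourData, habitat == "savanna")
print(viaSubset)
```

The bracket form and subset() return the same rows here. The bracket form is often preferred inside functions, while subset() tends to read well in interactive work.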

We move on to use a more complex form of selective criteria. You’ll note that we call the grep function this time. If you’re familiar with grep in Linux then you already have a good idea of what’s going on. In short, grep lets us search through data using regular expressions. R’s version returns the positions of the elements that match, and we can use those positions directly as row indices. In this case, we’re using a regular expression, or regex, format when we pass ^p as an argument to grep.

The ^p character combination simply tells grep to look for any string that begins with the letter p. We could substitute any other character for p and it’d look for that letter instead. But in this case, grep searches for a column value starting with the letter p in a single column of ourData. Note that while we’re working with a specific column in each of these subset assignments, there’s no reason why we couldn’t use multiple columns. We’re simply keeping things as concise as possible for the sake of explanation. With that selection finished, we assign the resulting information to ourSubset2.
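The sketch below shows what grep actually hands back for the ^p pattern, alongside its base R sibling grepl, which returns TRUE and FALSE flags instead of positions. The variable names positions, flags, and forestP are assumptions for this example. The last selection is a hypothetical illustration of testing more than one column at once.

```r
# Recreate the article's data frame so this block runs on its own
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)

# grep returns the integer positions of the matching elements
positions <- grep("^p", ourData$animal)
print(positions)
# [1] 6 7

# grepl returns one TRUE/FALSE per element for the same pattern
flags <- grepl("^p", ourData$animal)
print(flags)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# Either result works as a row index
ourSubset2 <- ourData[positions, ]
print(ourSubset2)

# Logical flags combine with tests on other columns using &
# (hypothetical example: names starting with "p" AND a small size)
forestP <- ourData[flags & ourData$size == "small", ]
print(forestP)
```

Because grepl yields logical flags, it is usually the more convenient choice when a regex test needs to be combined with other conditions.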

We obtain our next subset by leveraging logical subsetting through the %in% operator. It checks each value in the size column for a match against medium and small. You might have noticed that every test so far examines a column when the intent is to extract rows. The two fit together because each test produces one result per row of that column, and placing that result before the comma in the square brackets keeps entire rows, with all of their columns. Everything comes together in the following line, where we use rbind to stack our three subsets and assign the result to ourSubset. We print out the results in the next line. And, as you can see, we now have extracted rows that adhere to three forms of logical selective criteria based on a certain value. Note that panda and puma satisfy both the second and third criteria, so they appear twice in the combined result.
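Stacking subsets with rbind keeps a row once for every criterion it matches, so a row that satisfies two criteria shows up twice. The sketch below is one possible way to get each matching row only once, either by removing duplicates afterwards with unique() or by joining the three tests with the | operator. The names stacked, deduped, and combined are assumptions for this example.

```r
# Recreate the article's data frame so this block runs on its own
ourData <- data.frame(
  animal = c("snail", "lion", "tiger", "giraffe", "zebra", "panda", "puma"),
  habitat = c("grassland", "savanna", "forest", "savanna", "grassland", "forest", "forest"),
  size = c("small", "large", "large", "large", "medium", "medium", "small"),
  stringsAsFactors = FALSE
)

# The three selections from the script, stacked with rbind
stacked <- rbind(
  ourData[ourData$habitat == "savanna", ],
  ourData[grep("^p", ourData$animal), ],
  ourData[ourData$size %in% c("medium", "small"), ]
)
print(nrow(stacked))   # 8 rows: panda and puma each appear twice

# Option 1: drop repeated rows after stacking
deduped <- unique(stacked)
print(nrow(deduped))   # 6 rows

# Option 2: join the tests with | so each row is evaluated once
combined <- ourData[
  ourData$habitat == "savanna" |
    grepl("^p", ourData$animal) |
    ourData$size %in% c("medium", "small"),
]
print(combined)
```

The | version also returns the rows in their original order, whereas the stacked version orders them by which criterion matched first.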

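Note that we’re using a data frame with all of the values in place. A missing value wouldn’t necessarily raise an error during this process, but it can quietly distort the results. For example, comparing NA with == yields NA rather than TRUE or FALSE, and using that as a row index produces a row filled with NA values. If you weren’t sure of your data’s integrity when using these methods then you could use an is.na() check to validate elements during the selection. This would give you the chance to modify NA values according to your specific needs.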
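Here is a small sketch of how a missing value behaves in practice and how an is.na() check can guard the selection. The withNA data frame and its okapi entry are hypothetical and exist only for this illustration. With ==, a comparison against NA gives NA, and that NA turns into a row of NA values in the result.

```r
# Hypothetical data frame with one missing habitat value
withNA <- data.frame(
  animal = c("lion", "okapi", "giraffe"),
  habitat = c("savanna", NA, "savanna"),
  stringsAsFactors = FALSE
)

# Unguarded comparison: the NA propagates and yields an all-NA row
naive <- withNA[withNA$habitat == "savanna", ]
print(naive)
print(nrow(naive))   # 3 rows, the middle one filled with NA

# Guarded comparison: is.na() excludes missing values first
safe <- withNA[!is.na(withNA$habitat) & withNA$habitat == "savanna", ]
print(safe)

# which() is another base R option; it drops NA and keeps TRUE positions
viaWhich <- withNA[which(withNA$habitat == "savanna"), ]
print(viaWhich)

# is.na() can also be used to replace missing values before selecting
filled <- withNA
filled$habitat[is.na(filled$habitat)] <- "unknown"
print(filled)
```

Whether to exclude missing values or replace them depends on what the NA means in your data, so the "unknown" placeholder here is only one possible choice.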