Using the grepl() Function in R

Pattern matching looks for a given pattern in data, for example, a literal value like the character ‘#’, the string ‘verb’ or the number ‘2018’. The matched text can then be replaced by something else, or the matches can be used to filter the data and find specific information in it. In addition to literal values, a pattern can be a regular expression, which describes a whole set of strings rather than one exact value.

For example, if we want to look for a string that starts with ‘abc’, we can specify ‘^abc’, and if we want to find a string that ends with ‘abc’, we would specify ‘abc$’. We can also specify a group of values, such as ‘ab[0-9]’, which matches the string ‘ab’ followed by any digit, or ‘[A-E]9’, which looks for an uppercase A, B, C, D or E followed by 9. And we can specify repetition: ‘[xy]*’ matches zero or more characters that are each ‘x’ or ‘y’, while ‘(xy)+’ matches one or more repetitions of the sequence ‘xy’.
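The patterns mentioned above can be tried out directly with grepl(). This is a minimal sketch: the vector samples is made up purely for illustration, and the anchored pattern ‘^[xy]+$’ is an addition to show one way of asking for strings built only from ‘x’ and ‘y’.

```r
# Hypothetical sample strings, chosen only to illustrate each pattern
samples <- c("abcdef", "xyzabc", "ab7", "C9", "xyxy", "hello")

starts_abc <- grepl("^abc", samples)     # starts with "abc"
ends_abc   <- grepl("abc$", samples)     # ends with "abc"
ab_digit   <- grepl("ab[0-9]", samples)  # contains "ab" followed by a digit
upper_nine <- grepl("[A-E]9", samples)   # contains A, B, C, D or E followed by 9
only_xy    <- grepl("^[xy]+$", samples)  # made up of "x" and "y" characters only

# Unanchored "[xy]*" also accepts an empty match, so every string matches it
any_xy_star <- grepl("[xy]*", samples)

print(samples[starts_abc])  # "abcdef"
print(samples[ends_abc])    # "xyzabc"
print(samples[ab_digit])    # "ab7"
print(samples[upper_nine])  # "C9"
print(samples[only_xy])     # "xyxy"
print(all(any_xy_star))     # TRUE
```

Note that a pattern which can match the empty string, like ‘[xy]*’ on its own, is rarely useful for filtering; anchoring it or using ‘+’ is usually what is intended.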

This tutorial explains how to search for matches of a certain character pattern in the R programming language. The article is mainly based on the grepl() R function and the similar function grep(). The basic R syntax and the definitions of the two functions are as follows:

              grepl("char", x)

              grep("char", x)

The grep R function searches for matches of a certain character pattern in a vector of character strings and returns the indices of the elements that yielded a match.

The grepl R function searches for matches of a certain character pattern in a vector of character strings and returns a logical vector indicating which elements of the vector contained a match.

Example of how to use grepl():

x <- c("d", "a", "c", "abba")

grepl("a", x)

[1] FALSE  TRUE FALSE  TRUE

As we can see, grepl() returns a logical vector with one value per element: TRUE if the element contains the pattern "a" that we passed, and FALSE if not.

Example of how to use grep():

x <- c("d", "a", "c", "abba")

grep("a", x)

[1] 2 4

grep() returns the indices of the elements of the vector that contain the pattern "a".

Both of these functions are used for pattern matching, but unlike grep, which returns a vector with the indices of the matched strings or the strings themselves, grepl returns a logical vector. The logical vector holds TRUE for each element that matches and FALSE otherwise.

For example, consider the pattern ‘ab’ followed by any digit, as in the code below:

pattern <- "ab[0-9]"

strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')

print(grepl(pattern, strings))

[1]  TRUE FALSE FALSE FALSE  TRUE
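The relationship between the two return types can be checked with a short sketch that reuses the pattern and strings from the example above. The variable names matches_lgl and matches_idx are illustrative only.

```r
pattern <- "ab[0-9]"
strings <- c("ab9", "ab", "abc8", "abc", "ab1")

matches_lgl <- grepl(pattern, strings)  # logical vector, one value per element
matches_idx <- grep(pattern, strings)   # integer indices of the matches

# which() turns the logical vector into the same indices that grep() returns
print(identical(which(matches_lgl), matches_idx))  # TRUE

# Subsetting with the logical vector gives the same strings as value = TRUE
print(identical(strings[matches_lgl],
                grep(pattern, strings, value = TRUE)))  # TRUE

# A logical vector is handy for counting matches
print(sum(matches_lgl))  # 2
```

Because the logical vector always has the same length as the input, grepl() tends to be the more convenient choice inside conditions and row filters, while grep() is convenient when the positions themselves are needed.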

———————

Data Acquisition Usage

Imagine you have a large amount of data such as years of weather reports with date, location, temperature, percentage of sunny versus cloudy, rain, snow, wind, etc. For each location, let’s say, a city along with its information could be an element of a vector. If you wanted to extract some specific data based on some pattern, you can use grep or grepl to do that. For example, you could find all cities within some region with below freezing temperatures or a certain amount of snow accumulation. Then once you have this filtered data, you could do some additional analysis on it.
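As a rough sketch of this idea, suppose each city's record is stored as one string. The record layout ("city;region;temp_c;snow_cm"), the city values and the region names below are all invented for illustration; real weather data would have its own format.

```r
# Hypothetical records in a made-up "city;region;temp_c;snow_cm" layout
weather <- c(
  "Oslo;north;-4;12",
  "Lisbon;south;15;0",
  "Tromso;north;-9;30",
  "Bergen;north;3;0"
)

# Records in the (hypothetical) "north" region whose temperature field
# starts with a minus sign, i.e. below freezing in degrees Celsius
cold_north <- grepl("^[^;]+;north;-[0-9]+;", weather)

print(weather[cold_north])
# "Oslo;north;-4;12"   "Tromso;north;-9;30"
```

For numeric thresholds, such as a minimum snow accumulation, it is usually easier to split the records into columns first and compare numbers; pattern matching is best suited to the text parts of the filter.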

———————

Let’s look at a few more examples with grep()

The grep function takes as parameters the pattern and a character vector as the data to search through for the pattern. The other parameters are optional if the default behavior is desired. Some of the parameters with default values include:

ignore.case = FALSE: by default, matching is case sensitive

value = FALSE: by default, returns a vector with the index values of the matches; with TRUE, returns the matching values themselves

fixed = FALSE: by default, treats the pattern as a regular expression; with TRUE, matches the pattern as a literal string

invert = FALSE: by default, returns the elements that match the pattern; with TRUE, returns those that do not match
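The effect of the parameters listed above can be seen on a small made-up vector. The strings are chosen so that the ‘.’ in the pattern behaves differently as a regular expression and as a literal character.

```r
# Hypothetical strings for illustration
strings <- c("a.c", "abc", "A.C", "xyz")

# Default: the pattern is a regular expression, so "." matches any character
regex_idx <- grep("a.c", strings)
print(regex_idx)    # 1 2

# fixed = TRUE: the pattern is a literal string, so "." matches only a dot
fixed_idx <- grep("a.c", strings, fixed = TRUE)
print(fixed_idx)    # 1

# ignore.case = TRUE: upper and lower case are treated as the same
nocase_idx <- grep("a.c", strings, ignore.case = TRUE)
print(nocase_idx)   # 1 2 3

# invert = TRUE with value = TRUE: return the strings that do NOT match
not_matched <- grep("a.c", strings, invert = TRUE, value = TRUE)
print(not_matched)  # "A.C" "xyz"
```

Note that value and invert are arguments of grep() only. With grepl(), the inverse is obtained by negating the result with !.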

Find data that starts with ‘abc’:

strings <- c('abcd', 'dabc', 'abcabc')

pattern <- '^abc'

print(grep(pattern, strings))

Output is [1] 1 3, which means that the 1st (‘abcd’) and 3rd (‘abcabc’) items match.

Find data that ends with ‘abc’, ignoring case:

strings <- c('abcd', 'dABc', 'abcabc')

pattern <- 'abc$'

print(grep(pattern, strings, ignore.case = TRUE))

[1] 2 3

Find data that contains ‘ab’ followed by any digit:

pattern <- "ab[0-9]"

strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')

print(grep(pattern, strings))

[1] 1 5

Find data that contains ‘ab’ followed by any digit and return the strings:

pattern <- "ab[0-9]"

strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')

print(grep(pattern, strings, value = TRUE))

[1] "ab9" "ab1"

Another great feature of grep and grepl is how well they work alongside other packages in R. I am a huge fan and user of the dplyr package by Hadley Wickham because it offers a powerful set of easy-to-use “verbs” and syntax to manipulate data sets.

However, strong and effective packages such as dplyr can be combined with base R functions to increase their practicality:

library(dplyr)

CO2_dplyr <- tbl_df(CO2) # converting CO2 into a local data frame (tbl_df() is deprecated in newer dplyr releases, where as_tibble() replaces it)

filter_dplyr_for_value_non <- CO2_dplyr %>% filter(grepl("non", Treatment))

filter_dplyr_for_value_non

Source: local data frame [42 x 5]

  Plant Type   Treatment  conc uptake

1 Qn1   Quebec nonchilled 95   16.0

2 Qn1   Quebec nonchilled 175  30.4

3 Qn1   Quebec nonchilled 250  34.8

4 Qn1   Quebec nonchilled 350  37.2

5 Qn1   Quebec nonchilled 500  35.3

6 Qn1   Quebec nonchilled 675  39.2

7 Qn1   Quebec nonchilled 1000 39.7

8 Qn2   Quebec nonchilled 95   13.6

9 Qn2   Quebec nonchilled 175  27.3

filter_dplyr_for_not_a_value <- CO2_dplyr %>% filter(!(grepl("non", Treatment)))

filter_dplyr_for_not_a_value

Source: local data frame [42 x 5]

  Plant Type   Treatment conc uptake

1 Qc1   Quebec chilled   95   14.2

2 Qc1   Quebec chilled   175  24.1

3 Qc1   Quebec chilled   250  30.3

4 Qc1   Quebec chilled   350  34.6

5 Qc1   Quebec chilled   500  32.5

6 Qc1   Quebec chilled   675  35.4

7 Qc1   Quebec chilled   1000 38.7

8 Qc2   Quebec chilled   95   9.3

9 Qc2   Quebec chilled   175  27.3

If you check the Treatment variable, you will see that it has two values, chilled and nonchilled. Here we combined filter() from the dplyr package with grepl() to select the rows for each value.
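The same split can be sketched in base R alone, which may help to show what grepl() contributes on its own. This assumes only the built-in CO2 data set; the variable names is_non, nonchilled and chilled are illustrative.

```r
# CO2 ships with R in the datasets package, so no extra library is needed.
# Treatment is a factor; grepl() works on its character labels.
is_non <- grepl("non", CO2$Treatment)

# A logical vector can be used directly as a row index
nonchilled <- CO2[is_non, ]
chilled    <- CO2[!is_non, ]

print(nrow(nonchilled))  # 42
print(nrow(chilled))     # 42

# Each subset contains a single treatment value
print(unique(as.character(nonchilled$Treatment)))  # "nonchilled"
print(unique(as.character(chilled$Treatment)))     # "chilled"
```

The dplyr version reads more naturally in a longer pipeline, while the base R version avoids the extra dependency; the grepl() call is identical in both.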

*Please check out the free e-book ‘R for Data Science’ by Hadley Wickham and Garrett Grolemund to learn more about functions for manipulating data frames, such as filter().

Summary

The grep() and grepl() functions use regular expressions or literal values as patterns to conduct pattern matching on a character vector. grep() returns the indices of the matched items, or the matched items themselves, while grepl() returns a logical vector with TRUE to represent a match and FALSE otherwise. Both functions can be used to locate a pattern, for example to filter data or to find the values that need to be changed or replaced. As such, they are used heavily in data acquisition to extract a subset of data for analysis.