Pattern matching looks for a given pattern in data, for example a literal value such as the character ‘#’, the string ‘verb’, or the number ‘2018’. A match can then be replaced by something else, or used to filter the data and find some information within it. In addition to literal values, a pattern can be a regular expression, which describes a whole set of strings rather than one specific string.
For example, to look for strings that start with ‘abc’ we can specify ‘^abc’, and to find strings that end with ‘abc’ we would specify ‘abc$’. We can also specify a set of allowed characters, such as ‘ab[0-9]’, which matches the string ‘ab’ followed by any digit, or ‘[A-E]9’, which matches an uppercase A, B, C, D, or E followed by 9. And we can specify repetition, such as ‘[xy]*’, which matches zero or more characters that are each either ‘x’ or ‘y’ (use ‘[xy]+’ to require at least one, or ‘(xy)+’ for one or more ‘xy’ sequences).
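The patterns described above can be tried out directly with grepl(). The following is a minimal sketch; the test strings are made up for illustration and are not taken from any real data set.

```r
# Made-up strings for illustrating anchors and character classes
x <- c("abcde", "xxabc", "ab7", "C9", "F9")

starts_abc <- grepl("^abc", x)     # TRUE only for "abcde"
ends_abc   <- grepl("abc$", x)     # TRUE only for "xxabc"
ab_digit   <- grepl("ab[0-9]", x)  # TRUE only for "ab7"
a_to_e_9   <- grepl("[A-E]9", x)   # TRUE for "C9", FALSE for "F9"

# Repetition: "*" allows zero or more, "+" requires at least one
y <- c("ab", "axyyb", "azb")
zero_or_more <- grepl("^a[xy]*b$", y)  # TRUE TRUE FALSE
one_or_more  <- grepl("^a[xy]+b$", y)  # FALSE TRUE FALSE

print(zero_or_more)
print(one_or_more)
```

Note that ‘[xy]*’ also accepts the case where no ‘x’ or ‘y’ is present at all, which is why "ab" matches the first repetition pattern but not the second.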
This tutorial explains how to search for matches of a character pattern in the R programming language. The article is mainly based on the grepl() R function, whose basic syntax is as follows:
grepl("char", x)
There is also a similar function, grep(), which takes the same two arguments:
grep("char", x)
The grep R function searches for matches of a character pattern in a vector of character strings and returns the indices of the elements that yielded a match.
The grepl R function searches for matches of a character pattern in a vector of character strings and returns a logical vector indicating which elements of the vector contained a match.
Example of how to use grepl:
x <- c("d", "a", "c", "abba")
grepl("a", x)
[1] FALSE TRUE FALSE TRUE
As we can see, grepl() returns a logical vector with one value per element: TRUE if the element contains the pattern "a" that we passed, and FALSE if not.
Example of how to use grep:
x <- c("d", "a", "c", "abba")
grep("a", x)
[1] 2 4
grep() returns the indices of the elements of the vector that contain the pattern "a".
Both functions are used for pattern matching. The difference is in what they return: grep returns a vector with the indices of the matching strings (or, with value = TRUE, the matching strings themselves), while grepl returns a logical vector with TRUE for a match and FALSE otherwise.
For example, the code below tests each string for the pattern ‘ab’ followed by any digit:
pattern <- "ab[0-9]"
strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')
print(grepl(pattern, strings))
[1] TRUE FALSE FALSE FALSE TRUE
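Because the two functions describe the same matches in different forms, one result can be converted into the other. The sketch below reuses the pattern and strings from the example above; the variable names hits and idx are only illustrative.

```r
pattern <- "ab[0-9]"
strings <- c("ab9", "ab", "abc8", "abc", "ab1")

hits <- grepl(pattern, strings)  # logical vector, one value per element
idx  <- grep(pattern, strings)   # integer indices of the matching elements

# which() turns the logical vector into the same indices grep() returns
same_indices <- identical(which(hits), idx)

# Logical subsetting gives the same strings as grep(..., value = TRUE)
same_values <- identical(strings[hits], grep(pattern, strings, value = TRUE))

# A logical vector can also be summed to count the matches
n_matches <- sum(hits)

print(same_indices)
print(same_values)
print(n_matches)
```

A rough rule of thumb: grepl() tends to be more convenient when the result feeds a filter or a condition, and grep() when the positions themselves are needed.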
———————
Data Acquisition Usage
Imagine you have a large amount of data such as years of weather reports with date, location, temperature, percentage of sunny versus cloudy, rain, snow, wind, etc. For each location, let’s say, a city along with its information could be an element of a vector. If you wanted to extract some specific data based on some pattern, you can use grep or grepl to do that. For example, you could find all cities within some region with below freezing temperatures or a certain amount of snow accumulation. Then once you have this filtered data, you could do some additional analysis on it.
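As a rough sketch of this idea, assume each weather report is stored as one comma-separated string. The record layout, the city names, and the values below are all hypothetical and only serve to show the filtering step.

```r
# Hypothetical records in the form: city,date,temp=<degrees>,<condition>
reports <- c(
  "Oslo,2018-01-15,temp=-4,snow",
  "Lisbon,2018-01-15,temp=14,sunny",
  "Helsinki,2018-01-15,temp=-9,cloudy",
  "Bergen,2018-01-15,temp=2,rain"
)

# Keep the records with a negative temperature (a minus sign after "temp=")
freezing <- grep("temp=-", reports, value = TRUE)

# Flag the records whose condition field, at the end of the string, is "snow"
snowy <- grepl("snow$", reports)

# Strip everything from the first comma onward to recover the city names
freezing_cities <- sub(",.*", "", freezing)

print(freezing_cities)
print(reports[snowy])
```

The filtered vectors can then be passed on to whatever additional analysis is needed.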
———————
Let's check a few more examples with grep()
The grep function takes as parameters the pattern and a character vector to search through. The remaining parameters are optional and can be left out when the default behavior is desired. Some of the parameters with default values include:
ignore.case = FALSE by default matching is case sensitive
value = FALSE by default returns a vector with the indices of the matches; with TRUE returns the matching values themselves
fixed = FALSE by default treats the pattern as a regular expression; with TRUE matches the pattern as a literal string
invert = FALSE by default returns the elements that match the pattern; with TRUE returns the elements that do not match
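The effect of fixed and invert is easiest to see on a pattern that contains a regular expression metacharacter. The following is a small sketch with made-up file names:

```r
# Made-up file names for illustration
files <- c("report.csv", "report_csv", "notes.txt", "data.CSV")

# As a regular expression, "." matches any character, so "_csv" matches too
regex_idx <- grep(".csv", files)

# With fixed = TRUE the pattern is the literal string ".csv"
fixed_idx <- grep(".csv", files, fixed = TRUE)

# invert = TRUE returns the elements that do NOT match
not_csv <- grep(".csv", files, fixed = TRUE, invert = TRUE, value = TRUE)

# Escaping the dot and ignoring case finds both ".csv" and ".CSV" endings
any_csv <- grep("\\.csv$", files, ignore.case = TRUE, value = TRUE)

print(regex_idx)
print(fixed_idx)
print(not_csv)
print(any_csv)
```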
Find data that starts with ‘abc’:
strings <- c('abcd', 'dabc', 'abcabc')
pattern <- '^abc'
print(grep(pattern, strings))
[1] 1 3
The output means that the 1st (‘abcd’) and 3rd (‘abcabc’) items match.
Find data that ends with ‘abc’, ignoring case:
strings <- c('abcd', 'dABc', 'abcabc')
pattern <- 'abc$'
print(grep(pattern, strings, ignore.case = TRUE))
[1] 2 3
Find data that contains ab followed by any digit:
pattern <- "ab[0-9]"
strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')
print(grep(pattern, strings))
[1] 1 5
Find data that contains ab followed by any digit and return the strings:
pattern <- "ab[0-9]"
strings <- c('ab9', 'ab', 'abc8', 'abc', 'ab1')
print(grep(pattern, strings, value = TRUE))
[1] "ab9" "ab1"
Another great feature of grep and grepl is their adoption by other packages in R. I am a huge fan and user of the dplyr package by Hadley Wickham because it offers a powerful set of easy-to-use “verbs” and syntax to manipulate data sets.
Strong and effective packages such as dplyr work together with base R functions like grepl, which increases their practicality:
library(dplyr)
CO2_dplyr <- tbl_df(CO2) # converting CO2 into a local data frame (tbl_df() is deprecated in newer dplyr versions in favor of as_tibble())
filter_dplyr_for_value_non <- CO2_dplyr %>% filter(grepl("non", Treatment))
filter_dplyr_for_value_non
Source: local data frame [42 x 5]
  Plant   Type  Treatment conc uptake
1   Qn1 Quebec nonchilled   95   16.0
2   Qn1 Quebec nonchilled  175   30.4
3   Qn1 Quebec nonchilled  250   34.8
4   Qn1 Quebec nonchilled  350   37.2
5   Qn1 Quebec nonchilled  500   35.3
6   Qn1 Quebec nonchilled  675   39.2
7   Qn1 Quebec nonchilled 1000   39.7
8   Qn2 Quebec nonchilled   95   13.6
9   Qn2 Quebec nonchilled  175   27.3
filter_dplyr_for_not_a_value <- CO2_dplyr %>% filter(!grepl("non", Treatment))
filter_dplyr_for_not_a_value
Source: local data frame [42 x 5]
  Plant   Type Treatment conc uptake
1   Qc1 Quebec   chilled   95   14.2
2   Qc1 Quebec   chilled  175   24.1
3   Qc1 Quebec   chilled  250   30.3
4   Qc1 Quebec   chilled  350   34.6
5   Qc1 Quebec   chilled  500   32.5
6   Qc1 Quebec   chilled  675   35.4
7   Qc1 Quebec   chilled 1000   38.7
8   Qc2 Quebec   chilled   95    9.3
9   Qc2 Quebec   chilled  175   27.3
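If you check the Treatment variable, you will see that it has the values chilled and nonchilled. Here we combined filter() from the dplyr package with grepl() to keep the rows whose Treatment contains ‘non’, and then negated the match with ! to keep the remaining rows.
*Please check out the free e-book ‘R for Data Science’ by Hadley Wickham and Garrett Grolemund to learn more about functions for manipulating data frames, such as filter().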
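For comparison, the same split can be sketched in base R alone, using the logical vector from grepl() to subset the rows of the built-in CO2 data set. The variable names below are only illustrative.

```r
# CO2 ships with base R (datasets package); Treatment is a factor,
# which grepl() converts to character before matching
is_non <- grepl("non", CO2$Treatment)

nonchilled_rows <- CO2[is_non, ]   # rows whose Treatment contains "non"
chilled_rows    <- CO2[!is_non, ]  # all remaining rows

print(nrow(nonchilled_rows))
print(nrow(chilled_rows))
print(unique(as.character(chilled_rows$Treatment)))
```

Both subsets should contain 42 rows each, matching the dplyr results shown above.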
Summary
The grep and grepl functions use regular expressions or literal values as patterns to conduct pattern matching on a character vector. grep returns the indices of the matched items, or the matched items themselves when value = TRUE, while grepl returns a logical vector with TRUE to represent a match and FALSE otherwise. Both functions can be used to locate a pattern, for example before changing or replacing it, or to filter data. As such they are used heavily in data acquisition to extract a subset of data for analysis.