How to Efficiently Compare Two Strings in R

When your program receives many variations of the same input, R offers several ways to decide whether two strings count as equal on the terms you define. Let’s start with an example: comparing strings that have the same meaning but differ in their exact characters.

Suppose the first string reads “Go To the Supermarket” and we compare it to the string “go to the supermarket”. These clearly mean the same thing, but a comparison that is sensitive to exact characters (string1 == string2) returns FALSE. Converting both strings to lower case with tolower() before comparing removes the difference in capitalization.

#define two strings
string1 <- "Iguanas"
string2 <- "iguanas"

#case-sensitive comparison
string1 == string2

[1] FALSE

#case-insensitive comparison
tolower(string1) == tolower(string2)

[1] TRUE
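
The supermarket strings described earlier behave the same way. The following is a minimal sketch; the helper name same_ignoring_case is hypothetical, introduced only for illustration, and is not part of base R.

```r
# Hypothetical helper (not a base R function): compare two strings,
# ignoring differences in capitalization
same_ignoring_case <- function(a, b) {
  tolower(a) == tolower(b)
}

sentence1 <- "Go To the Supermarket"
sentence2 <- "go to the supermarket"

# exact comparison: the capital letters differ, so this is FALSE
exact_match <- sentence1 == sentence2

# case-insensitive comparison: TRUE
loose_match <- same_ignoring_case(sentence1, sentence2)

print(exact_match)
print(loose_match)
```

Wrapping the tolower() calls in a small function keeps the comparison rule in one place, which helps if you later decide to also trim whitespace or apply other clean-up steps.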

Comparisons can also be stricter. To check whether two objects are exactly alike, use identical(). Unlike ==, which compares element by element and returns one result per element, identical() returns a single TRUE only when both vectors hold exactly the same elements in the same order. For example, a vector such as (“Apple”, “Pear”, “Orange”) is identical only to another vector holding those exact strings in that order. If you need a very narrow definition of a match, identical() is very useful.

#define two vectors of strings
vector1 <- c("hey", "hello", "HI")
vector2 <- c("hey", "hello", "hi")

#case-sensitive comparison
identical(vector1, vector2)

[1] FALSE

#case-insensitive comparison
identical(tolower(vector1), tolower(vector2))

[1] TRUE
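
To make the difference between == and identical() concrete, here is a short sketch using the same two vectors. The variable name elementwise is just an illustrative choice.

```r
vector1 <- c("hey", "hello", "HI")
vector2 <- c("hey", "hello", "hi")

# == compares position by position and returns one value per element
elementwise <- vector1 == vector2
print(elementwise)        # TRUE TRUE FALSE

# all() collapses the element-wise result into a single answer
print(all(elementwise))   # FALSE

# identical() gives that single answer directly
print(identical(vector1, vector2))   # FALSE
```

The element-wise form is handy when you want to know which positions differ, while identical() is the safer choice inside an if() condition because it always returns exactly one TRUE or FALSE.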

There may be cases where you want to find the elements that appear in both of two vectors. Given vector1 (“red”, “blue”, “orange”) and vector2 (“purple”, “orange”, “violet”), the %in% operator checks each element of vector1 against vector2 and flags the matching element, “orange”. These comparison tools are all available in base R and cover a variety of cases where strings need to be compared. Another example would be:

#define two vectors of strings
vector1 <- c("Tokyo", "Beijing", "Hong Kong")
vector2 <- c("Singapore", "Beijing", "Tokyo")

#find which strings in vector1 are also in vector2
vector1[vector1 %in% vector2]

[1] "Tokyo"   "Beijing"
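
The color example described above can be sketched the same way. The variable names matches and shared are illustrative, and intersect() is shown as an alternative from base R that the text above does not cover.

```r
vector1 <- c("red", "blue", "orange")
vector2 <- c("purple", "orange", "violet")

# %in% returns one TRUE/FALSE per element of vector1
matches <- vector1 %in% vector2
print(matches)            # FALSE FALSE TRUE

# use the logical vector to keep only the shared elements
shared <- vector1[matches]
print(shared)             # "orange"

# intersect() returns the common elements directly, without duplicates
print(intersect(vector1, vector2))   # "orange"
```

Note that subsetting with %in% keeps the order of vector1, which is why the city example lists the shared cities in the order they appear in vector1.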

More advanced comparisons are also available. The str_equal() function from the stringr package compares strings using Unicode canonical equivalence rules, under which different sequences of code points can represent the same character. Two strings that differ code point by code point, and so fail an == test, are still treated as equal when they represent the same characters. A scenario may look like this:

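#load stringr, which provides str_equal
library(stringr)

#the same visible character built two ways: one precomposed code point,
#and the letter "a" followed by a combining acute accent
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#[1] "á" "á"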

a1 == a2
#[1] FALSE
str_equal(a1, a2)
#[1] TRUE

So there are many ways to compare strings in R. The most efficient will depend on your parameters and what you require from the outcome.