Strsplit in R: How to manipulate strings the right way

In real-world data science and analysis, a lot of useful data arrives as character strings. This includes text files, raw text scraped from the web, and free-text comments recorded in an IT system. This tutorial walks through how to use the strsplit() function to unpack this data, including handling delimiters such as commas or pipes.

Let's start with the two most basic operations: assembling small pieces of text into a larger whole and breaking a large string into smaller substrings. In R, you use the paste() function to concatenate and the strsplit() function to split. This tutorial covers the strsplit() function; paste() is covered in a separate tutorial.

The strsplit() function splits the elements of a character vector x into substrings according to matches of the split argument. The split is typically a delimiter such as a comma or a whitespace character. The delimiter acts as a separator between the fields (or chunks of text), allowing you to split the original string into substrings of varying length. There are a few edge cases (such as consecutive delimiters) that we will address later in this tutorial.

The basics: How To Use the Strsplit() function

To start, I will define a short sentence in quotes:

sentence <- "The dog likes the food of owner."

Let's check how R identifies our example string:

str(sentence)

chr "The dog likes the food of owner."

So we see that R identifies our sentence as a character vector.

The first thing I am going to do is split our sentence, taking what is now one sentence and dividing it into individual words. To do that, we use the strsplit() function with a separator. In this case, the separator is a single character (the whitespace character), which splits the entire string into a character vector of words, with each word as a single string.

example1 <- strsplit(sentence, " ")

example1

[[1]]

[1] "The" "dog" "likes" "the" "food" "of" "owner."

So the first argument is the string I want to split, and the second tells strsplit() how to split it; in this case, by the whitespace character.

You may have noticed a few differences. Each word in the output now has its own quotation marks. That is because the sentence has been split apart.

What R has printed out is what we call a list.

Notice how the result above has [[1]] in the output before the actual words of the string.

That is because strsplit() always returns a list, with one character vector for each element of the input.

For example:

example1[1]

[[1]]

[1] "The" "dog" "likes" "the" "food" "of" "owner."

When we printed example1[1], we might have expected to get "The" as the output, because it is the first word of the sentence. Instead, we get the whole set of words, because example1 is a list with one element, and that element is the character vector holding all seven words. Single brackets return a sub-list, not the contents.

If we want the words themselves, we extract the first list element with double brackets:

example2 <- strsplit(sentence, " ")[[1]]

Let's compare example1 and example2:

str(example1)

List of 1

$ : chr [1:7] "The" "dog" "likes" "the" ...

str(example2)

chr [1:7] "The" "dog" "likes" "the" "food" "of" "owner."

Here we can clearly see that example2 is a character vector, as opposed to example1, which is a list containing one character vector.

If we print the first element of example2:

example2[1]

[1] "The"

We see that we get the word The.

To better understand this, compare these two views.

View(example1):

[Figure: Strsplit in R - Example 1]

View(example2):

[Figure: Strsplit in R - Example 2]

Here you can clearly see that example1 has one element (a single list entry), while example2 has 7 elements (the 7 words of the sentence).
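The same difference can be checked without the viewer. The snippet below is a minimal sketch; the names as_list and as_vector are chosen here for illustration and are not part of the tutorial's code.

```r
sentence <- "The dog likes the food of owner."

as_list   <- strsplit(sentence, " ")        # a list holding one character vector
as_vector <- strsplit(sentence, " ")[[1]]   # the character vector itself

length(as_list)     # 1
length(as_vector)   # 7

# unlist() flattens the list and, for a single input string,
# gives the same vector as [[1]]
identical(unlist(as_list), as_vector)   # TRUE
```

Note that unlist() merges the pieces of all input strings into one vector, so it is only equivalent to [[1]] when a single string was split.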

Let's look at strsplit() from another angle. Let's define a slightly different string:

x <- "123_456_789"

Here I have a character vector of length 1. I can split strings on any character, so I can split x on "_":

y <- strsplit(x, "_")

If I check the class of y:

class(y)

[1] "list"

And if I print out y:

[[1]]

[1] "123" "456" "789"

Remember how we discussed earlier that strsplit() always returns a list?

Let's try to print one element directly. Let's say I want to print the second element, 456:

y[[1]][2]

[1] "456"

And there it is. I took the first element of the list y (y[[1]]), and then the second element of that vector. If we check the class of y[[1]]:

class(y[[1]])

[1] "character"

We see that it is a character vector.
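Because the pieces are character strings, arithmetic on them needs an explicit conversion first. A minimal sketch, assuming every piece is a valid number (the names pieces and numbers are illustrative):

```r
x <- "123_456_789"

pieces  <- strsplit(x, "_")[[1]]   # character vector: "123" "456" "789"
numbers <- as.numeric(pieces)      # numeric vector: 123 456 789

sum(numbers)   # 1368
```

If a piece cannot be parsed as a number, as.numeric() returns NA for it and issues a warning, so it is worth checking the result with anyNA().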

You can also use strsplit() to split multiple strings at once. Let's define a vector with two elements:

x <- c("123:456:789", "Boogie:Woogie:Band")

Now I can split both elements of vector x by colon:

y <- strsplit(x, ":")

Now if we print out y:

[[1]]

[1] "123" "456" "789"

[[2]]

[1] "Boogie" "Woogie" "Band"

We see that the list has two elements, one per input string. Beautiful!

To reference a piece, I first select the list element and then the index within it. Let's say I want the third piece of the second element:

y[[2]][3]

[1] "Band"

Let's say I want all pieces from the first element:

y[[1]]

[1] "123" "456" "789"
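When the same position is needed from every element, indexing one element at a time gets tedious. One possible approach, sketched here with base R's vapply() (the variable name second is illustrative):

```r
x <- c("123:456:789", "Boogie:Woogie:Band")
y <- strsplit(x, ":")

# Take the second piece of every list element
second <- vapply(y, function(parts) parts[2], character(1))
second       # "456" "Woogie"

# lengths() reports how many pieces each string produced
lengths(y)   # 3 3
```

vapply() is used rather than sapply() so that the result is guaranteed to be a character vector with one value per input string.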

I can also split on more than one character. I will define a string containing the whole alphabet and split it at the vowels:

alphabet <- "abcdefghijklmnopqrstuvwxyz"

groups <- strsplit(alphabet, "[aeiou]")

groups

[[1]]

[1] "" "bcd" "fgh" "jklmn" "pqrst" "vwxyz"

In this example I passed a regular expression to strsplit(). The character class [aeiou] matches any single vowel, so every vowel acts as a delimiter and is removed from the output. The pieces are the runs of characters between vowels. The first piece is an empty string because the alphabet starts with a vowel, so nothing comes before the first match.
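The empty string at the start of the result is often unwanted. A minimal sketch of one way to drop empty pieces, using base R's nzchar() (the name non_empty is illustrative):

```r
alphabet <- "abcdefghijklmnopqrstuvwxyz"
groups   <- strsplit(alphabet, "[aeiou]")[[1]]

groups[1]   # "" because the string begins with the delimiter "a"

# nzchar() is TRUE for strings with at least one character
non_empty <- groups[nzchar(groups)]
non_empty   # "bcd" "fgh" "jklmn" "pqrst" "vwxyz"
```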

Let's now quickly see how to reverse strsplit(), that is, how to put the pieces back together.

I will again define the string from the beginning and split it:

sentence <- "The dog likes the food of owner."

splitted <- strsplit(sentence, " ")

splitted

[[1]]

[1] "The" "dog" "likes" "the" "food" "of" "owner."

Now we have a list called splitted, and its first element is a character vector with our words.

To make one sentence out of the elements of that vector, we use the paste() function:

recreated <- paste(splitted[[1]], collapse = " ")

recreated

[1] "The dog likes the food of owner."

Notice how I specified the index of the list element in paste(). The collapse argument defines what should separate our elements. If I change it to something else, we get a different result:

recreated <- paste(splitted[[1]], collapse = "-")

recreated

[1] "The-dog-likes-the-food-of-owner."
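A quick way to confirm that splitting and pasting are inverse operations is to compare the rebuilt string with the original. This is a minimal sketch; the names words and rebuilt are illustrative.

```r
sentence <- "The dog likes the food of owner."

words   <- strsplit(sentence, " ")[[1]]
rebuilt <- paste(words, collapse = " ")

identical(rebuilt, sentence)   # TRUE

# Without collapse, paste() returns one string per word instead of joining them
length(paste(words))           # 7
```

The round trip only works when the collapse string matches the delimiter that was used for splitting.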

Advanced options: Consecutive Delimiters & Regular Expression(s)

Real-world data rarely conforms to the sterile specifications found in structured systems. In theory, you should use one whitespace character between words in a given string to break them up. In practice? Someone always manages to inject consecutive delimiters into the mix.

For whitespace, there is a simple way to handle this. Replace the " " with a regular expression such as " +" or "[[:space:]]+", which matches a run of one or more whitespace characters and treats it as a single delimiter. (An empty split string, "", does something different: it splits the string into individual characters.)
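The effect of consecutive spaces is easiest to see side by side. The following is a minimal sketch with a made-up input string (messy); the variable names are illustrative.

```r
messy <- "The  dog   likes food"   # two and three spaces between some words

# A single-space delimiter produces an empty string for every extra space
single <- strsplit(messy, " ")[[1]]
single   # "The" "" "dog" "" "" "likes" "food"

# " +" matches one or more spaces, so each run counts as one delimiter
runs <- strsplit(messy, " +")[[1]]
runs     # "The" "dog" "likes" "food"

# "[[:space:]]+" also covers tabs and newlines
any_ws <- strsplit("The\tdog \n likes", "[[:space:]]+")[[1]]
any_ws   # "The" "dog" "likes"

# An empty split string breaks the input into single characters
chars <- strsplit("ab c", "")[[1]]
chars    # "a" "b" " " "c"
```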

One common issue with defining the delimiter for a structured text file is the risk of that character appearing in the underlying data. This is a very common problem with comma-delimited text. You may want to use a non-standard or special character as the delimiter instead.
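As an illustration, suppose the fields themselves contain commas and a pipe is used as the delimiter instead. The record below is made-up data. Because strsplit() treats the split argument as a regular expression by default and the pipe is a regular expression metacharacter, this sketch passes fixed = TRUE so the pipe is matched literally.

```r
record <- "Smith, John|42|Portland, OR"   # hypothetical pipe-delimited record

# fixed = TRUE matches the split string literally, not as a regular expression
fields <- strsplit(record, "|", fixed = TRUE)[[1]]
fields   # "Smith, John" "42" "Portland, OR"

# Escaping the pipe in a regular expression gives the same result
escaped <- strsplit(record, "\\|")[[1]]
identical(fields, escaped)   # TRUE
```

The commas inside the fields are left alone because only the pipe is treated as a delimiter.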

Nor is the strsplit() function limited to a single-character delimiter. You can use a regular expression to define more complex separator logic, such as matching any of several delimiters or patterns. It can also handle non-ASCII character strings, if those are part of your data (non-ASCII strings may cause issues with other R functions, so be sure to test).
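For instance, a regular expression can accept several delimiters at once and absorb any whitespace around them. This is a minimal sketch with a made-up input string; the pattern is one possible choice, not the only one.

```r
mixed <- "red, green;blue |yellow"   # hypothetical input with mixed delimiters

# [,;|] matches a comma, semicolon, or pipe (the pipe is literal inside brackets);
# [[:space:]]* on each side swallows optional surrounding whitespace
colours <- strsplit(mixed, "[[:space:]]*[,;|][[:space:]]*")[[1]]
colours   # "red" "green" "blue" "yellow"
```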

And that is it for this tutorial. We mostly worked with character vectors. If you want to learn more about handling character vectors, please check out our other tutorials.