How to use strsplit in R

In this tutorial we will take a look at how to split strings in R. A string is a sequence of characters.

Whenever you work with text, you need to be able to concatenate words (string them together) and split them apart. In R, you use the paste() function to concatenate and the strsplit() function to split. In this section we will mostly cover strsplit().

The strsplit() function splits the elements of a character vector x into substrings according to the matches of the split argument.

To start, I will define a short sentence and put it in quotes:

    sentence <- "The dog likes the food of owner."

Let's check how R identifies our example string:

    str(sentence)
     chr "The dog likes the food of owner."

So we see that R identifies our sentence as a character vector.

The first thing I am going to do is split our sentence. By splitting, I take what is now one sentence and divide it into several individual words. To do that, we use the strsplit() function:

    example1 <- strsplit(sentence, " ")
    example1
    [[1]]
    [1] "The"    "dog"    "likes"  "the"    "food"   "of"     "owner."

The first argument is the string I want to split, and the second tells strsplit() where to split it, in this case at every space.

You may have noticed a few differences in the output. Each word now has its own quotation marks, because the sentence has been split apart into separate strings.

What R has printed out is what we call a list.

Notice also the [[1]] in the output, just before the words themselves.

It is there because strsplit() always returns a list with one element per input string, and each element is a character vector holding the pieces. We passed in one string, so we get a list of length one.

For example:

    example1[1]
    [[1]]
    [1] "The"    "dog"    "likes"  "the"    "food"   "of"     "owner."

When we print example1[1], we might expect to get "The", since it is the first word of the sentence. Instead we get all the words, because example1 is a list with a single element, and that element is the character vector of words. Single brackets return a sub-list, not the words themselves.
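The difference between single and double brackets can be checked directly. This is a minimal sketch using the same sentence as above; it only calls base R functions (class() and length()).

```r
sentence <- "The dog likes the food of owner."
example1 <- strsplit(sentence, " ")

# Single brackets keep the list wrapper: the result is still a list of length 1.
class(example1[1])      # "list"
length(example1[1])     # 1

# Double brackets extract the element itself: a character vector of 7 words.
class(example1[[1]])    # "character"
length(example1[[1]])   # 7

# Chain the two kinds of indexing to reach a single word.
example1[[1]][1]        # "The"
```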

To get the words themselves as a character vector, we extract the first element of the list with [[1]]:

    example2 <- strsplit(sentence, " ")[[1]]

Let's compare example1 and example2:

    str(example1)
    List of 1
     $ : chr [1:7] "The" "dog" "likes" "the" ...

    str(example2)
     chr [1:7] "The" "dog" "likes" "the" "food" "of" "owner."

Here we can clearly see that example2 is a character vector of seven words, as opposed to example1, which is a list containing that vector.

If we print the first element of example2:

    example2[1]
    [1] "The"

We see that we get the word "The".

To better understand this, check out these two pictures.

View(example1):

[Image: Strsplit in R - Example 1]

View(example2):

[Image: Strsplit in R - Example 2]

Here you can clearly see that example1 has one element (a list entry holding the whole vector), while example2 has 7 elements (the 7 words of the sentence).
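Another way to get a plain character vector is to flatten the list with unlist(). The sketch below assumes a single input string; the variable name words is only illustrative.

```r
sentence <- "The dog likes the food of owner."

# unlist() flattens the list returned by strsplit() into one character vector.
words <- unlist(strsplit(sentence, " "))
length(words)                                    # 7

# For a single input string this matches extracting the first element.
identical(words, strsplit(sentence, " ")[[1]])   # TRUE
```

Keep in mind that with more than one input string, unlist() merges the pieces of all strings into a single vector, so you lose track of which piece came from which string.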

Let's look at strsplit() from another angle. Let's define a slightly different string:

    x <- "123_456_789"

Here I have a character vector of length 1. I can split strings at any character, so I can split our x at "_":

    y <- strsplit(x, "_")

If I check the class of y:

    class(y)
    [1] "list"

And if I print out y:

    y
    [[1]]
    [1] "123" "456" "789"

Remember how we earlier discussed the default behaviour of strsplit(), which always returns a list?

Let's try to print one piece directly. Say I want to print the second piece, 456:

    y[[1]][2]
    [1] "456"

And there it is. I referenced the first element of y (y[[1]]), and then the second element of that vector. If we check the class of y[[1]]:

    class(y[[1]])
    [1] "character"

We will notice that it is a character vector.

You can also use strsplit() to split multiple strings at once. Let's define a vector with two elements:

    x <- c("123:456:789", "Boogie:Woogie:Band")

Now I can split both elements of vector x at the colon:

    y <- strsplit(x, ":")

Now if we print out y:

    y
    [[1]]
    [1] "123" "456" "789"

    [[2]]
    [1] "Boogie" "Woogie" "Band"

We see that y is a list with two elements, one per input string. Beautiful!

To reference a piece, I simply reference the list element and then the index within it. Say I want the third piece of the second element:

    y[[2]][3]
    [1] "Band"

Or say I want all pieces of the first element:

    y[[1]]
    [1] "123" "456" "789"
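When there are many input strings, indexing each list element by hand gets tedious. One possible approach, sketched here with the same two strings, is to apply a small function to every element with vapply(). The name first_parts is only illustrative.

```r
x <- c("123:456:789", "Boogie:Woogie:Band")
y <- strsplit(x, ":")

# Take the first piece of every list element. vapply() also checks that
# each result is a single character string.
first_parts <- vapply(y, function(parts) parts[1], character(1))
first_parts   # "123" "Boogie"

# lengths() reports how many pieces each input string produced.
lengths(y)    # 3 3
```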

I can also split at any one of several characters. I will define a string containing the whole alphabet and split it at the vowels:

    alphabet <- "abcdefghijklmnopqrstuvwxyz"
    groups <- strsplit(alphabet, "[aeiou]")
    groups
    [[1]]
    [1] ""      "bcd"   "fgh"   "jklmn" "pqrst" "vwxyz"

In this example I passed the pattern "[aeiou]" to strsplit(). By default the split argument is treated as a regular expression, and [aeiou] is a character class that matches any single vowel. The string is cut at every match and the vowels themselves are dropped, so each piece holds the consonants between two vowels. The first piece is an empty string because the alphabet starts with a vowel, so nothing comes before the first match.
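Because the split argument is a regular expression by default, characters with a special meaning, such as the dot, need some care. The sketch below uses a made-up string (version) and shows two common ways to split at a literal dot, plus one way to drop the empty leading piece from the vowel example.

```r
version <- "1.2.3"   # hypothetical example string

# fixed = TRUE treats the split argument as plain text, not as a regex.
by_fixed <- strsplit(version, ".", fixed = TRUE)[[1]]
by_fixed             # "1" "2" "3"

# Alternatively, put the dot in a character class so it matches literally.
by_class <- strsplit(version, "[.]")[[1]]
by_class             # "1" "2" "3"

# Drop empty pieces, such as the one before the leading "a" in the alphabet.
alphabet <- "abcdefghijklmnopqrstuvwxyz"
groups <- strsplit(alphabet, "[aeiou]")[[1]]
non_empty <- groups[groups != ""]
non_empty            # "bcd" "fgh" "jklmn" "pqrst" "vwxyz"
```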

Let's now quickly see how to reverse strsplit(), that is, how to join the pieces back together.

I will again define the string from the beginning and split it:

    sentence <- "The dog likes the food of owner."
    splited <- strsplit(sentence, " ")
    splited
    [[1]]
    [1] "The"    "dog"    "likes"  "the"    "food"   "of"     "owner."

Now we have a list called splited, and its first element is a character vector with our words.

To make one sentence out of the elements of that vector, we use the paste() function:

    recreated <- paste(splited[[1]], collapse = " ")
    recreated
    [1] "The dog likes the food of owner."

Notice how I passed splited[[1]], the character vector, to paste(). The collapse argument defines what should separate the elements in the joined string. If I change it to something else, we get a different result:

    recreated <- paste(splited[[1]], collapse = "-")
    recreated
    [1] "The-dog-likes-the-food-of-owner."
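As a quick check that paste() with collapse really undoes strsplit(), the sketch below splits and rejoins with the same separator and compares the result to the original. The second part applies the same idea to several strings at once with vapply(). The names words and rejoined are only illustrative.

```r
sentence <- "The dog likes the food of owner."
words <- strsplit(sentence, " ")[[1]]

# Rejoining with the same separator gives back the original string.
identical(paste(words, collapse = " "), sentence)   # TRUE

# With several input strings, collapse each list element separately.
x <- c("123:456:789", "Boogie:Woogie:Band")
y <- strsplit(x, ":")
rejoined <- vapply(y, paste, character(1), collapse = ":")
identical(rejoined, x)                              # TRUE
```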

And that is it for this tutorial. In this tutorial, we mostly worked with character vectors. If you want to learn more about handling vectors, please check out our tutorials:

  • Identify and Remove Duplicate Data in R,
  • How to aggregate multiple columns at once in R,
  • Replace values in data frame r