Text Analytics in R – An Explanation of Tokenization

Many analysts and aspiring data scientists first encounter text processing while working with code libraries in the R language. Text analytics in R is a subject that can fill a whole book, as many authors have demonstrated, but let's start with tokenization and its role in understanding text.

Tokenization: Teaching a Computer to Read

Tokenization means breaking unstructured text down into a usable format, typically units such as individual words or sentences. With natural language processing tools that split the raw text and measure properties like word counts and word frequency, you can make highly complex paragraphs digestible to your analytics pipeline. Many R packages also include tools for modeling techniques such as latent semantic analysis and topic modeling, which establish relationships between words and their meanings so that the computer effectively learns to read the text for further classification.
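To make the idea concrete, here is a minimal sketch, using the tidytext package and a made-up sentence, of splitting a piece of text into word tokens:

library(dplyr)
library(tidytext)

# A tiny, made-up sample of text stored as a one-row data frame
sample_text <- tibble(line = 1, text = "Tokenization breaks raw text into words.")

# unnest_tokens() splits the text column into one lowercase word per row
sample_text %>%
  unnest_tokens(word, text)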

Tokenizer Package

Every text analytics package includes functions that break up the input text by the parameters you set, whether that means segments as small as individual words or as large as whole paragraphs. This kind of tokenization works well with text formatted to read in a certain way, like a book or a poem. Project Gutenberg offers free digital-reading content by converting books and other written works into digital files available for download. The wide range of writing styles and topics in the Gutenberg library calls for different functions to break and organize the words and phrases into vectors that R can store and analyze.
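One convenient way to pull Gutenberg texts into R is the gutenbergr package. The sketch below assumes you want a couple of Mark Twain novels; gutenberg_works() lets you confirm the catalog IDs before downloading:

library(gutenbergr)
library(dplyr)

# Browse the catalog to confirm the IDs for the titles you want
gutenberg_works(author == "Twain, Mark")

# Download two Mark Twain novels by their catalog IDs
# (74 = The Adventures of Tom Sawyer, 76 = Adventures of Huckleberry Finn)
mark_twain <- gutenberg_download(c(74, 76))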

Let's look at some code that breaks the text into words and filters out common filler words:

library(tidytext)

# Tokenize the downloaded text into words and drop common stop words
tidy_mark_twain <- mark_twain %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)

This group of functions draws on the tidytext package, which is designed to work alongside the tidyverse, and tells unnest_tokens() to break the text up by individual words. Many of the most frequent words, such as "the" and "of", add nothing to the analysis, so anti_join(stop_words) removes any token that appears in the built-in stop_words list. The resulting data frame can then be used for further analysis and for creating visuals by plotting values from the set.
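As a quick illustration, here is one possible sketch that counts the remaining words in tidy_mark_twain and plots the most frequent ones with ggplot2:

library(ggplot2)

# Count the remaining words and plot the 15 most frequent
tidy_mark_twain %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Word count", y = NULL)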

Seeing with Word Cloud

Handling repetition is a tedious part of text mining that can take up a fair amount of time unless you have the right functions. When analysts need to organize large amounts of text that they know is seeded with repeating terms and phrases, they often create a word cloud to show the most frequently occurring words in a data frame.
Before generating a word cloud, the text input has to be cleaned and prepped so that the frequent terms can be counted and sorted.
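The cleaning steps that follow use the tm package and operate on a corpus object named docs. A minimal sketch of that setup, with a hypothetical text vector standing in for your real input, might look like this:

library(tm)

# Hypothetical raw text; in practice this might be lines read from a file
# or a column pulled from a data frame
text <- c("Some raw text with Numbers 123 and punctuation!",
          "More raw text, with repeated words and a few stop words.")

# Wrap the character vector in a corpus so tm_map() can transform it
docs <- Corpus(VectorSource(text))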

# Normalize case and strip characters that would fragment the word counts
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove any custom words you don't want in the cloud (these two are placeholders)
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

The tm_map() function is a general-purpose transformer: here it converts the text to lowercase, strips numbers, punctuation, and extra white space, and removes English stop words via stopwords("english"), along with any custom words you supply. These steps shave the text data down to something the word cloud command can use to find the most repeated terms and make a visual representation.
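The wordcloud() call below reads from a data frame d holding each word and its frequency. One common way to build it from the cleaned corpus (a sketch, not code from the snippet above) is to collapse a term-document matrix into row sums:

library(wordcloud)

# Collapse the cleaned corpus into a term-document matrix, then into
# a data frame of word frequencies for wordcloud() to read
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)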

wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
The wordcloud() function can generate different visuals by adjusting its parameters: the maximum number of words displayed (max.words), the minimum frequency a word needs before it appears (min.freq), the proportion of words rotated (rot.per), and the color palette.

As you get more familiar with text mining functions, you'll find these tools can extract more than just the words themselves. As techniques for teaching programs to make judgments about language continue to evolve, the potential for advanced analysis is vast, and understanding how text mining works is just the first step.

Going Further With Sentiment Analysis

Up to this point, we've covered functions that identify and extract frequently occurring words. Now we'll look at discerning something more complex: the writer's emotional response. One of the biggest hurdles for machine learning is reading a piece of input text and inferring the mood of the person who wrote it. In sentiment analysis, or opinion mining, a given text sample is compared against a stored set of keywords or phrases, and that comparison is used to estimate the writer's frame of mind. For these comparisons, the analysis uses a lexicon, a reference library of words and phrases, and scores the text by how often those words appear and what share of the text they make up.
For this sample, we'll look at the NRC lexicon, available through the tidytext package's get_sentiments() function, which categorizes words in a simple yes/no fashion: it asks whether each word belongs to preset groups labeled after emotions (such as trust, fear, anger, and sadness) or after positive and negative sentiment. A single word can fall into several of these groups, and counting the matches across a document shows which emotions dominate the text. Here's what NRC looks like:
get_sentiments("nrc")
#> # A tibble: 13,901 × 2
#>    word        sentiment
#>    <chr>       <chr>
#>  1 abacus      trust
#>  2 abandon     fear
#>  3 abandon     negative
#>  4 abandon     sadness
#>  5 abandoned   anger
#>  6 abandoned   fear
#>  7 abandoned   negative
#>  8 abandoned   sadness
#>  9 abandonment anger
#> 10 abandonment fear
#> # … with 13,891 more rows
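
To apply the lexicon, the tokenized words are matched against it with a join. Here is a sketch that reuses the tidy_mark_twain data frame from earlier (the NRC lexicon may prompt a one-time download via the textdata package):

# Tally how often each NRC emotion appears in the tokenized Twain text
tidy_mark_twain %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment, sort = TRUE)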

Other lexicons use different words and category schemes, but an effective system of sorting doesn't have to be complex. The bing lexicon, for example, splits words into just two groups, positive and negative, while the AFINN lexicon assigns each word a numeric score reflecting how positive or negative it is. Experiment with some of the lexicons, and once you're confident, you may even want to create your own.
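
Both of those lexicons can be pulled up the same way as NRC (AFINN, like NRC, may prompt a one-time download through textdata):

# bing labels each word as positive or negative
get_sentiments("bing")

# AFINN scores each word from -5 (very negative) to +5 (very positive)
get_sentiments("afinn")

Either one drops into the same inner_join() pattern shown above.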
