How to Do One-Hot Encoding in R

Machine learning has revolutionized the way we relate to software. It is, in many ways, a bridge between the physical and digital worlds. Machine learning systems can take in large amounts of data and highlight elements that might be invisible to the human eye. However, there’s one notable caveat to most forms of machine learning.

Machine learning systems can’t simply take in undefined or unformatted data the way a human can. We need to format data specifically for them. Every language and platform has its own methods for creating a unified data format, and R in particular has some powerful built-in functionality for creating machine-learning-compatible data sets. One-hot encoding in particular is an easy-to-use solution within R. You’ll soon discover the best ways to leverage this functionality within your own code.

An Overview of One-Hot Encoding

One-hot encoding tackles a fundamental problem with logic-based comparisons. How do you perform a binary comparison on an incompatible collection of data? The simple answer is that we convert that data into a binary form. With one-hot encoding, that means translating each categorical field into multiple fields, each consisting of either a 0 or a 1.
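As a concrete sketch of the idea, here's the transformation done entirely by hand in base R. The `colors` factor and the `is_*` column names are made up for illustration; each category simply becomes its own 0/1 column.

```r
# A hypothetical factor with three categories
colors <- factor(c("Red", "Green", "Blue"))

# Hand-rolled one-hot encoding: one 0/1 column per category,
# where a 1 marks the rows belonging to that category
encoded <- data.frame(
  is_Red   = as.integer(colors == "Red"),
  is_Green = as.integer(colors == "Green"),
  is_Blue  = as.integer(colors == "Blue")
)
print(encoded)
```

Note that exactly one column is 1 in each row, which is where the name "one-hot" comes from. The functions covered below automate this expansion so you never have to write it out manually.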

One-hot encoding brings with it a number of significant benefits. The most obvious is pure compatibility with functions that require numeric input. But it also avoids a subtle modeling problem: if categories were simply numbered 1, 2, 3, many machine learning algorithms would treat those numbers as having a meaningful order and magnitude. One-hot encoding removes that false ordering. Even when numeric data isn’t strictly required, it’s usually best to make use of it whenever possible.

One significant reason to use one-hot encoding as a standard import practice is that, when a given function doesn’t require it, that’s often because the function quietly performs the conversion itself. This means you can run one-hot encoding once to convert your data set, or let the same conversion happen repeatedly inside individual functions. The former is, of course, the more computationally efficient option. This might seem like a difficult proposition at first glance. But don’t worry: it’s very far from programming in machine language or working in raw binary. In fact, R simplifies the encoding process by giving us a few easy-to-use functions.

Starting Out With the Basics

We can begin by creating a simple example with some categorical data. In this instance imagine that we want to keep track of how much feed needs to be put into a bird feeder every day. We’ll assume that different amounts of seed are consumed at different rates depending on the weather. For example, we might see more birdseed consumed on sunny days and the least during storms. Consider the following example.

ourData <- data.frame(
  weather = as.factor(c("Sunny", "Overcast", "Raining", "Thunderstorm")),
  ml = c(250, 100, 50, 10)
)
print(ourData)

We begin by creating a data frame called ourData and proceed to populate it with the relevant categories and numbers. Then we print out the result. The important point to note is that the data is laid out in a standard human-readable manner. We have two columns, weather and ml, and the relevant weather conditions and ml of feed are listed within them.

However, we ultimately want our data properly formatted for a machine learning algorithm rather than the human eye. Thankfully R provides us with a tool to do so with hardly any effort at all within its mltools library. Take a look at how easily we can perform one-hot encoding within the following example.

library(mltools)
library(data.table)

ourData <- data.frame(
  weather = as.factor(c("Sunny", "Overcast", "Raining", "Thunderstorm")),
  ml = c(250, 100, 50, 10)
)

print(ourData)
ourFormattedData <- one_hot(as.data.table(ourData))
print(ourFormattedData)

We begin by importing mltools and data.table, which give us, respectively, some extra machine learning utilities and fast data-manipulation tools. We define ourData in the same way and with the same values as before. Then we print out the resulting information to verify what we’re working with. The next line is where the mltools magic happens. We send ourData to as.data.table to properly format it as a data table. The result is then sent to one_hot. This creates a one-hot encoded version of ourData which is, in turn, assigned to ourFormattedData. And in the final line we print out ourFormattedData for comparison. The final result of this process is a one-hot encoded version of our initial data.
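It's worth checking what the converted table actually looks like. In my experience, one_hot names the new columns in the pattern column_level (for example, weather_Sunny), and leaves columns that are already numeric, like ml, untouched; the sketch below verifies that structure.

```r
library(mltools)
library(data.table)

ourData <- data.frame(
  weather = as.factor(c("Sunny", "Overcast", "Raining", "Thunderstorm")),
  ml = c(250, 100, 50, 10)
)

ourFormattedData <- one_hot(as.data.table(ourData))

# One 0/1 column per weather level, named "weather_<level>",
# plus the original numeric ml column carried through unchanged
print(colnames(ourFormattedData))
```

If a data set has several factor columns, one_hot expands all of them in a single call, which is what makes it convenient as a one-time preprocessing step.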

Different One-Hot Paths To Reach the Same End Goal

Of course, given R’s focus on data science, it shouldn’t come as a surprise to find out that there’s more than one way to approach one-hot encoding. We could also go about one-hot encoding our data using R’s caret (Classification And REgression Training) library. More specifically, we’ll make use of caret’s dummyVars function. Take a look at the following code.

library(caret)

ourData <- data.frame(
  weather = as.factor(c("Sunny", "Overcast", "Raining", "Thunderstorm")),
  ml = c(250, 100, 50, 10)
)
dummyData <- dummyVars(" ~ .", data = ourData)
ourFormattedData <- data.frame(predict(dummyData, newdata = ourData))
print(ourData)
print(ourFormattedData)

We start things off in a similar manner as the previous example. The obvious difference is that we’re now using the caret library. Its dummyVars function inspects ourData and builds an encoding recipe: every factor column is marked for expansion into a full set of 0/1 dummy variables, while numeric columns like ml are left as they are. Note that dummyVars doesn’t transform anything by itself; it only describes the transformation, which is actually applied in the ourFormattedData assignment.

We create a data frame by passing our data and the dummyData object to R’s predict function. Despite the name, no model fitting or prediction takes place here; predict simply applies the encoding recipe that dummyVars built to whatever data it’s given. That’s a genuinely useful pattern, because the same dummyData object can later be applied to new data with the same columns, guaranteeing a consistent encoding. Keep in mind that we can also take this a little further by introducing label encoding into the example. Add this line before the call to dummyVars.

ourData$weather <- as.numeric(factor(ourData$weather))

All we need to do is wrap the column in factor and pass it to as.numeric. This converts a character or factor column into integer codes, one number per category, which is label encoding. Be aware of the trade-off, though: once weather is numeric, dummyVars will pass it through as a single numeric column rather than expanding it into 0/1 columns, so for that variable you’re choosing label encoding instead of one-hot encoding. Label encoding is more compact, but it reintroduces the artificial ordering that one-hot encoding avoids, so it’s best reserved for categories that genuinely have an order. Either way, R makes it fairly easy to move toward fully numeric data sets to represent real-world collections. This is an extremely useful toolset given that a typical machine learning model depends on numeric data.
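Finally, if you'd rather avoid extra packages entirely, base R can produce the same kind of 0/1 expansion through model.matrix. This is a minimal sketch of that approach; the + 0 in the formula keeps a column for every level instead of dropping one as a reference (contrast) column.

```r
ourData <- data.frame(
  weather = as.factor(c("Sunny", "Overcast", "Raining", "Thunderstorm")),
  ml = c(250, 100, 50, 10)
)

# model.matrix builds the design matrix a linear model would use;
# "~ weather + 0" expands the factor into one 0/1 column per level
encoded <- model.matrix(~ weather + 0, data = ourData)

# Reattach the numeric ml column alongside the encoded weather columns
print(cbind(encoded, ml = ourData$ml))
```

Since model.matrix ships with R itself, it's a handy option for scripts where pulling in mltools or caret would be overkill.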
