How To Bin Data In R: Group Continuous Data Into Clean & Simple Buckets

Data binning is a way to simplify a column of data, transforming a numeric variable into a categorical variable by grouping values into buckets. Not only is this helpful when creating a plot or performing exploratory analysis, it also lets you apply categorical data analysis methods to numerical datasets. Fortunately, the R programming language offers many ways to accomplish this task.

How Does Binning Help With Data Science in R?

Binning data provides a simple way to reduce the complexity of your data by collapsing continuous variables into discrete ranges. This makes it easier to visualize relationships using a basic plot. It also makes it easier to explore the interactions between variables.

In addition to simplicity, reducing a continuous variable to a categorical variable can sometimes improve the predictive power of certain models. By grouping observations into bins, we pool more observations into each group than any single value would have on its own. This reduces the variance of per-group estimates such as the mean, which can improve the precision of statistical inferences about each group.
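To make the pooling idea concrete, here is a minimal sketch that counts the observations and averages the sales in each bin, using the sales rep data introduced in the next section. The column name call_bin and the object name summary_by_bin are illustrative choices, not part of the original examples.

```r
library(dplyr)

# Sales rep activity data (the same values used in the examples below)
df <- data.frame(calls = c(12, 15, 20, 21, 25, 28, 29, 30, 35, 45, 53, 75),
                 sales = c(2, 3, 4, 5, 3, 6, 4, 5, 8, 10, 11, 19))

# Bin the calls into four equal-width intervals, then summarise each bin
summary_by_bin <- df %>%
  mutate(call_bin = cut(calls, breaks = 4)) %>%
  group_by(call_bin) %>%
  summarise(reps = n(), mean_sales = mean(sales))

print(summary_by_bin)
```

Bins that end up with very few observations (here, the highest bin holds a single rep) still give noisy averages, which is one motivation for the equal-count bins covered later in this article.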

Basic Binning

We’re going to show how to do basic binning using the cut function from base R, together with mutate and the %>% pipe from the dplyr R package. For this example, we’re going to look at some sales rep performance data. How many calls did a salesperson make? And how many sales resulted?

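library(dplyr)  # provides mutate() and the %>% pipe

# Sales rep activity: calls made and resulting sales
df <- data.frame(calls = c(12, 15, 20, 21, 25, 28, 29, 30, 35, 45, 53, 75),
                 sales = c(2, 3, 4, 5, 3, 6, 4, 5, 8, 10, 11, 19))

# cut() is base R; breaks = 4 requests four equal-width intervals
df %>% mutate(new_bin = cut(calls, breaks = 4))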

This code buckets our observations into four bins:

   calls sales     new_bin
1     12     2 (11.9,27.8]
2     15     3 (11.9,27.8]
3     20     4 (11.9,27.8]
4     21     5 (11.9,27.8]
5     25     3 (11.9,27.8]
6     28     6 (27.8,43.5]
7     29     4 (27.8,43.5]
8     30     5 (27.8,43.5]
9     35     8 (27.8,43.5]
10    45    10 (43.5,59.2]
11    53    11 (43.5,59.2]
12    75    19 (59.2,75.1]

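In this case, the cut function split the range of calls into four intervals of equal width. Note that equal width does not mean equal counts: the bins above hold 5, 4, 2, and 1 observations. For more on using the cut function, check out our deeper guide here.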
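The automatically chosen boundaries above are not especially readable. As a minimal sketch, cut also accepts a vector of explicit break points and a matching set of labels. The break points (0, 25, 50, 100) and the labels ("low", "medium", "high") below are hypothetical values picked only for illustration.

```r
calls <- c(12, 15, 20, 21, 25, 28, 29, 30, 35, 45, 53, 75)

# Hypothetical break points and labels, chosen only for illustration
bins <- cut(calls,
            breaks = c(0, 25, 50, 100),
            labels = c("low", "medium", "high"),
            right = TRUE)  # intervals are (0,25], (25,50], (50,100]

# Count how many observations landed in each labelled bin
table(bins)
```

With right = TRUE (the default), each interval includes its upper boundary, so a rep with exactly 25 calls lands in the "low" bin. Any value outside the outermost break points would be returned as NA.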

Generating Equally Weighted Bins (Quantiles)

Suppose we need a binning method that splits the data into bins with an equal (or nearly equal) number of observations. We can use the ntile function in the dplyr R package to accomplish this. Sample code is shown below:

df %>% mutate(new_bin = ntile(calls, n=4))

This R code will split the sales call activity dataset from the previous example into four similarly sized bins, ranked by numeric value. These can be referred to as quantiles (or quartiles and deciles – for 4 and 10 bins, respectively).

The results of splitting the dataset into 4 bins:

   calls sales new_bin
1     12     2       1
2     15     3       1
3     20     4       1
4     21     5       2
5     25     3       2
6     28     6       2
7     29     4       3
8     30     5       3
9     35     8       3
10    45    10       4
11    53    11       4
12    75    19       4

One other advantage of the ntile function: it labels your bins in clean numerical order.
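The ntile labels do not show where one bin ends and the next begins. If the boundaries matter, one possible alternative is to compute them with base R's quantile function and pass them to cut. This is a sketch of that approach, not a drop-in replacement for ntile; the names breaks and quartile_bin are illustrative.

```r
calls <- c(12, 15, 20, 21, 25, 28, 29, 30, 35, 45, 53, 75)

# Quartile boundaries: minimum, 25th, 50th, 75th percentile, maximum
breaks <- quantile(calls, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value inside the first bin;
# labels = FALSE returns integer bin codes instead of interval labels
quartile_bin <- cut(calls, breaks = breaks, include.lowest = TRUE, labels = FALSE)

print(breaks)
table(quartile_bin)
```

On this particular dataset the grouping matches the ntile result, with three observations per bin. The two approaches can differ on other data: ntile always fills bins by rank, so tied values may be split across bins, while cut fails with an error if repeated values produce duplicate break points.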