Sometimes when doing data science, it is necessary to categorize data that has not already been categorized. This often occurs when the person writing the program to analyze a dataset is not the person who created it. In such situations, it may be necessary to cut a numerical vector into segments and set up those segments as a factor. Fortunately, the process for doing this is quite easy.
Description
The function used in R for cutting a continuous variable is the cut function. Its basic format is cut(vector, breaks, labels), where “vector” is the vector being cut, “breaks” are the values used to set the boundaries of the levels, and “labels” are the names given to those levels. The function will work without the third argument, but the levels are then named after their intervals, so labels are usually needed to get meaningful results. The “breaks” can consist of any values that cover the range of the vector being cut. The “labels” can be any values that form a factor, with one label per interval, which means one fewer label than break values. You will want labels that have meaning rather than generic ones such as letters of the alphabet, unless those letters have special meaning within the context of the work. The labels are arbitrary as far as the function is concerned, so it is best to choose ones that will be meaningful to you.
Explanation
The cut function divides the range of a continuous variable at the break values and assigns each element to the interval it falls in, producing a factor variable. By default the intervals are open on the left and closed on the right, so the lowest break value has to be lower than the lowest value in the vector and the highest break value has to be greater than or equal to the highest value in the vector. If these two conditions are not met, one or more of the factor values will be NA instead of one of the labels you have set. The way to avoid this is to make sure that the lowest and highest break values are outside the expected range of the vector. If you do not know the range of the vector for certain, you can check it with range(), or adjust the breaks by trial and error until you no longer get any NA values. You can also use the NA values deliberately, as a way of eliminating any values outside a certain range.
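The boundary rules above can be checked on a small vector. The following is a minimal sketch, not part of the original examples: the vector scores and the break values are made up for illustration. It shows a value equal to the lowest break and a value above the highest break both becoming NA, and one possible way of avoiding that by using -Inf and Inf as the outer breaks.

```r
# Hypothetical data for illustration only
scores <- c(0, 15, 50, 100, 120)

# Breaks that do not cover the whole range: 0 equals the lowest break
# (intervals are open on the left by default) and 120 exceeds the
# highest break, so both become NA.
narrow <- cut(scores, breaks = c(0, 50, 100))

# Using -Inf and Inf as the outer breaks means every finite value
# lands in some interval.
wide <- cut(scores, breaks = c(-Inf, 50, 100, Inf))

print(narrow)
print(wide)

# Values outside the breaks can be dropped by filtering on the NA results.
in_range <- scores[!is.na(narrow)]
print(in_range)
```

The last step is the "eliminating values outside a certain range" idea: anything that cut could not place is simply filtered out.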
Examples
Here are three examples of code illustrating the cut function in action. They each show different aspects of this function.
> df = data.frame(A = c("Bob", "Sue", "Chuck", "Ann", "Tim", "Rob", "Beth", "Tom"),
+                 B = c(55, 67, 76, 84, 93, 44, 71, 99))
> df
      A  B
1   Bob 55
2   Sue 67
3 Chuck 76
4   Ann 84
5   Tim 93
6   Rob 44
7  Beth 71
8   Tom 99
> df$C = cut(df$B, breaks = c(0, 60, 70, 80, 90, 100))
> df
      A  B        C
1   Bob 55   (0,60]
2   Sue 67  (60,70]
3 Chuck 76  (70,80]
4   Ann 84  (80,90]
5   Tim 93 (90,100]
6   Rob 44   (0,60]
7  Beth 71  (70,80]
8   Tom 99 (90,100]
This example uses a data frame containing fictitious names and grades. Because no labels were supplied, column C contains the interval that each value falls into, written in the form (lower,upper].
> df = data.frame(A = c("Bob", "Sue", "Chuck", "Ann", "Tim", "Rob", "Beth", "Tom"),
+                 B = c(55, 67, 76, 84, 93, 44, 71, 99))
> df
      A  B
1   Bob 55
2   Sue 67
3 Chuck 76
4   Ann 84
5   Tim 93
6   Rob 44
7  Beth 71
8   Tom 99
> df$c = cut(df$B,
+            breaks = c(0, 60, 70, 80, 90, 100),
+            labels = c("F", "D", "C", "B", "A"))
> df
      A  B c
1   Bob 55 F
2   Sue 67 D
3 Chuck 76 C
4   Ann 84 B
5   Tim 93 A
6   Rob 44 F
7  Beth 71 C
8   Tom 99 A
> class(df$c)
[1] "factor"
This example uses the same data frame of fictitious names and grades. This time grade labels were supplied, so column c gives the letter grade for each value, and the call to class confirms that the new column is a factor.
> t = as.numeric(Sys.time())
> set.seed(t)
> X = as.integer(abs(rnorm(10)*10))
> A = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
> df = data.frame(A, X)
> df
   A  X
1  A 10
2  B 16
3  C  8
4  D  3
5  E  7
6  F  1
7  G  7
8  H  8
9  I  0
10 J  5
> df$C = cut(df$X,
+            breaks = c(0, 5, 10, 15, 20, 25),
+            labels = c("Low", "medium low", "medium", "medium high", "high"))
> df
   A  X           C
1  A 10  medium low
2  B 16 medium high
3  C  8  medium low
4  D  3         Low
5  E  7  medium low
6  F  1         Low
7  G  7  medium low
8  H  8  medium low
9  I  0        <NA>
10 J  5         Low
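This example uses a data frame containing random values, to show how the function works in such cases. The seed is taken from the system clock, so the values will differ each time the code is run. In the run shown, the data includes a zero. Because the lowest interval is open on the left, zero is not inside it, and the function returns NA for that row instead of a label.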
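If the zero should count as part of the lowest category rather than come back as NA, one option is the include.lowest argument of cut, which closes the first interval on the left. The following is a minimal sketch that hard-codes the X values printed above, since the original values came from a random draw.

```r
# X values copied from the printed data frame above
X <- c(10, 16, 8, 3, 7, 1, 7, 8, 0, 5)

C <- cut(X,
         breaks = c(0, 5, 10, 15, 20, 25),
         labels = c("Low", "medium low", "medium", "medium high", "high"),
         include.lowest = TRUE)  # first interval now includes 0

print(data.frame(X, C))
```

With this setting the row containing zero is labeled Low, and the other rows keep the labels they had before.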
Application
The cut function has many applications. One is counting how many values fall into each bin, as is done for a histogram. Another, illustrated in the examples above, is assigning a letter grade to numerical values. This is not limited to educational settings; it applies to any case where such grading is useful. More generally, the cut function is useful whenever you are categorizing data based on numerical values.
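As a rough sketch of the histogram use case, the counts per bin can be obtained by tabulating the factor that cut returns. The code below reuses the grades from the examples above; the bin edges are made up for illustration, and include.lowest = TRUE is set so that the bins match the defaults of hist.

```r
# Grades from the earlier examples; bin edges are illustrative only
values <- c(55, 67, 76, 84, 93, 44, 71, 99)
edges  <- c(40, 60, 80, 100)

# Count how many values fall in each bin
bins   <- cut(values, breaks = edges, include.lowest = TRUE)
counts <- table(bins)
print(counts)

# hist() computes the same counts when given the same breaks;
# plot = FALSE returns the numbers without drawing anything
h <- hist(values, breaks = edges, plot = FALSE)
print(h$counts)
```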
The cut function is a handy way of creating factor categories from the contents of a numerical vector. It is easy to use, but you have to be careful with the break values to get the results you are looking for.