How to Use the scale() Function in R

The scale() function does what its name suggests: it standardizes the data you pass to it, by default centering each column on its mean and dividing by its standard deviation. If you’re not familiar with statistics, this might take a bit of work to understand at first. However, scale() is a useful tool for preparing data in your data science projects.

What is the scale function?

In short, the scale() function converts a set of values to z-scores. Its main argument, x, is the numeric data to transform, typically a vector or a matrix. It also has two optional parameters: center = TRUE, and scale = TRUE.

The center and scale arguments are fairly simple to use. If center is a numeric vector, the corresponding value is subtracted from each column; if it is TRUE, the mean of each column is subtracted from that column; if it is FALSE, no centering is done. scale works similarly: if it is a numeric vector, each column is divided by the corresponding value. If scale is TRUE and center is also TRUE, each centered column is divided by its standard deviation; if scale is TRUE but center is FALSE, each column is divided by its root mean square instead. Finally, if scale is FALSE, no scaling is done.
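The argument combinations described above can be sketched as follows. The matrix m and its values are hypothetical, chosen only so the results are easy to check by hand.

```r
# Hypothetical two-column matrix used only for illustration
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Defaults: subtract each column's mean, then divide by its standard deviation
z <- scale(m)

# Centering only: subtract the column means and leave the spread unchanged
centered <- scale(m, center = TRUE, scale = FALSE)

# Numeric vectors: subtract 1 and 10, then divide by 2 and 20, column by column
custom <- scale(m, center = c(1, 10), scale = c(2, 20))

# Scaling without centering divides by the root mean square,
# sqrt(sum(x^2) / (n - 1)), rather than by the standard deviation
rms <- scale(m, center = FALSE, scale = TRUE)
```

Passing explicit vectors is handy when you want to apply the center and scale values from one data set to another, rather than recomputing them.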

In summary, what the scale function returns depends on its arguments. With the defaults, it returns the z-scores of the individual points in a data set. With other settings, it can center the data set only, or provide scaled values that can be compared to other data sets.

What is a Z-Score?

A z-score is a statistical measure that tells you how far away from the average (in a statistical sense) a data point is. In many cases, data points follow a bell curve, where few of the values are much smaller or much larger than the average. A z-score for a data point is calculated by subtracting the average of all data points from the data point and dividing this result by the standard deviation of the data set. A positive z-score means the point is above the average, and a negative one means it is below.
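As a quick check of that formula, here is a minimal sketch that computes z-scores by hand and compares them with scale(). It uses the Pflugerville temperatures from the example later in this article; the variable names are only illustrative. Note that sd() returns the sample standard deviation (it divides by n - 1), which is also what scale() uses.

```r
temps <- c(86, 88, 87, 89, 93, 95, 94)

# z-score by hand: (value - mean) / standard deviation
manual_z <- (temps - mean(temps)) / sd(temps)

# scale() returns a one-column matrix; as.vector() drops the matrix
# structure and attributes so the two results can be compared directly
builtin_z <- as.vector(scale(temps))

all.equal(manual_z, builtin_z)
```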

Using the Scale Function

Here is a basic use of the scale function. Suppose that you have a list of high temperatures for two imaginary cities for a week in August. You’re trying to determine how each day compares with what is typical for its city. The two cities run at different temperatures, so the data needs to be centered and scaled column by column before the values can be compared.

Suppose that our list of high temperatures for the first imaginary city, Pflugerville, is: 86, 88, 87, 89, 93, 95, 94 and the second city, Thames, is 95, 99, 89, 96, 93, 98, 99.

We would first want to create a vector for each city’s high temperatures:
PflugervilleTemperatures <- c(86, 88, 87, 89, 93, 95, 94) and ThamesTemperatures <- c(95, 99, 89, 96, 93, 98, 99). We would then have two different choices for running the scale function to produce the z-scores for these two sets of values.

The first option is to run the scale function on each vector by itself. If we ran scale(PflugervilleTemperatures), we would get this result:

           [,1]
[1,] -1.1779055
[2,] -0.6282163
[3,] -0.9030609
[4,] -0.3533717
[5,]  0.7460068
[6,]  1.2956961
[7,]  1.0208515
attr(,"scaled:center")
[1] 90.28571
attr(,"scaled:scale")
[1] 3.638419

Each of the values above is the z-score for the corresponding data point. attr(,"scaled:center") and attr(,"scaled:scale") are the mean and standard deviation, respectively. If the data is approximately normally distributed, you can convert each z-score into a cumulative probability, for example with pnorm(). This estimates the proportion of the data that falls below that point. You would want to save the results of the scale function to a variable so that you can use a specific value:
scaledPflugervilleTemperatures <- scale(PflugervilleTemperatures) and you would be able to access a specific z-score by using its position in the result, which is a one-column matrix. In this example, scaledPflugervilleTemperatures[6] would return 1.295696.
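Putting those steps together, a minimal sketch might look like this. The names sixth_z, center_used, scale_used, and lower_tail are only illustrative, and the pnorm() step assumes the temperatures are roughly normally distributed, which seven data points cannot really confirm.

```r
PflugervilleTemperatures <- c(86, 88, 87, 89, 93, 95, 94)
scaledPflugervilleTemperatures <- scale(PflugervilleTemperatures)

# The result is a 7 x 1 matrix; a single index picks out the sixth z-score
sixth_z <- scaledPflugervilleTemperatures[6]

# The mean and standard deviation used are stored as attributes
center_used <- attr(scaledPflugervilleTemperatures, "scaled:center")
scale_used <- attr(scaledPflugervilleTemperatures, "scaled:scale")

# Assumption: normally distributed data. pnorm() gives the share of a
# standard normal distribution that lies below the given z-score
lower_tail <- pnorm(sixth_z)
```

Here lower_tail comes out at roughly 0.90, meaning that under the normality assumption about 90% of days would be cooler than the sixth day.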

The other choice for running the scale function is to combine both data sets into a two-dimensional array. The following code could be used to create a 7 row, 2 column array: result <- array(c(PflugervilleTemperatures,ThamesTemperatures),dim = c(7,2)). You can then pass the array to the scale function in its entirety, and the scale function will compute z-scores for each column separately, using that column’s own mean and standard deviation.

If we combined the two vectors into an array, the scale function would return the following z-scores:

           [,1]       [,2]
[1,] -1.1779055 -0.1567724
[2,] -0.6282163  0.9406342
[3,] -0.9030609 -1.8028821
[4,] -0.3533717  0.1175793
[5,]  0.7460068 -0.7054756
[6,]  1.2956961  0.6662825
[7,]  1.0208515  0.9406342
attr(,"scaled:center")
[1] 90.28571 95.57143
attr(,"scaled:scale")
[1] 3.638419 3.644957
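As an alternative sketch of the same column-wise scaling, cbind() can build the matrix and label the columns, so you do not have to remember which city is column 1 and which is column 2. The names temps and scaled are only illustrative; the values are the ones from this example.

```r
PflugervilleTemperatures <- c(86, 88, 87, 89, 93, 95, 94)
ThamesTemperatures <- c(95, 99, 89, 96, 93, 98, 99)

# cbind() binds the vectors as named columns of a 7 x 2 matrix
temps <- cbind(Pflugerville = PflugervilleTemperatures,
               Thames = ThamesTemperatures)

# Each column is standardized with its own mean and standard deviation
scaled <- scale(temps)

# Sanity check: every scaled column has mean 0 (up to rounding) and sd 1
round(colMeans(scaled), 10)
apply(scaled, 2, sd)

# Look up a z-score by row number and city name
scaled[3, "Thames"]
```

Because each city is scaled against its own week, a z-score tells you how unusual a day was for that city, not which city was hotter in absolute terms.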

As you can see, the scale() function makes it easy to standardize a single data set or several columns at once, so that values measured on different scales can be compared directly.