The Pearson Correlation Coefficient in R

Introduction

The Pearson correlation coefficient was invented by the British statistician Karl Pearson in the late 19th century (Norton 1978). It is used to measure the strength and direction of the linear relationship between two quantitative variables. The coefficient is usually denoted by the symbol r and ranges from -1 to 1. A value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and a value of 0 indicates no correlation (Ryan et al. 1976).

The Pearson correlation coefficient is calculated by taking the covariance of the two variables and dividing it by the product of their standard deviations. The formula is:

$$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$

where $\mathrm{Cov}(x, y)$ is the covariance of $x$ and $y$, and $s_x$ and $s_y$ are their standard deviations.
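As a sanity check, the formula can be reproduced directly in R with the built-in cov() and sd() functions; the vectors below are made up purely for illustration:

x <- c(1, 2, 3, 4, 5) # hypothetical vectors, for illustration only
y <- c(2, 4, 5, 4, 5)

cov(x, y) / (sd(x) * sd(y)) # covariance over the product of standard deviations

cor(x, y) # built-in equivalent; the two values should match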

Implementation in R

The Pearson correlation coefficient can be computed in R with the built-in cor() function. For example, if the variables x and y are stored in a data frame df, the coefficient can be calculated with cor(df$x, df$y).
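As a minimal sketch, assuming a data frame df with numeric columns x and y (the names and values here are placeholders):

df <- data.frame(
  x = c(1.2, 2.4, 3.1, 4.8, 5.0), # hypothetical values for illustration
  y = c(2.0, 3.9, 4.1, 6.2, 6.8)
)

cor(df$x, df$y)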

Here’s how the Pearson correlation is applied to the built-in cars dataset:

library(dplyr)
data("cars")

cars %>%
  summary()

The cars dataset was recorded in the 1920s and comprises the speeds of cars and the distances they took to stop. The data consist of two numerical variables measured independently of each other. Conducting a Pearson analysis on this dataset seeks to answer whether there is a linear relationship between the speed of cars and their corresponding stopping distances.

cor(cars$speed, cars$dist)

Applying the cor() function to the two variables in the dataset, we get a Pearson r value of 0.8068949. This r value suggests a strong positive linear correlation between the two variables.

The cor() function has several helpful arguments. Aside from specifying the paired data to be analyzed, one can set use = "complete.obs" to exclude observations with missing values; by default, cor() returns NA if it finds missing values in the dataset. Moreover, the method argument can be specified to change the correlation method from "pearson" (the default) to "spearman" or "kendall". These two rank-based methods are typically used for non-parametric data, or data that does not resemble a normal distribution. Thus, a prerequisite for a Pearson correlation is to ensure that both correlated variables are approximately normally distributed. More information can be found in the function's documentation (run ?cor in R).
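As a brief sketch of these arguments on the cars data (which happens to have no missing values, so use = "complete.obs" is shown purely for illustration), with the Shapiro-Wilk test as one common way, among others, to screen the normality prerequisite:

cor(cars$speed, cars$dist, use = "complete.obs") # drop rows containing NA before computing r

cor(cars$speed, cars$dist, method = "spearman") # switch to a rank-based coefficient

shapiro.test(cars$speed) # a large p-value is consistent with normality
shapiro.test(cars$dist)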

Helpful features of cor()

If, instead of two vectors, a data frame of numeric variables is passed to the function, cor() generates a correlation matrix as its output.

cor(cars)

This is particularly helpful for data sets containing multiple variables that you would like to correlate with each other.

data(mtcars)

first.five.mtcars <-
  mtcars %>%
  select(1:5) # selects the first 5 variables (columns) of mtcars

cor(first.five.mtcars)

Moreover, with the reshape2 and ggplot2 packages, a few lines of code can generate a heat map: a visualization of the correlation matrix.

library(reshape2) # run install.packages("reshape2") if not yet installed

cor.mat <- round(cor(first.five.mtcars), 2) # round() rounds the coefficients to 2 decimal places

reduced.cor.mat <- melt(cor.mat) # melt() reshapes the matrix into long format (Var1, Var2, value)

library(ggplot2)

reduced.cor.mat %>%
  ggplot(aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue")

Applications of the Pearson correlation coefficient

The Pearson correlation coefficient has many applications across a variety of fields. As demonstrated earlier, it is used in scientific research to measure the strength of the linear relationship between two variables. In finance, it can measure the relationship between stock prices and economic indicators; in data mining, it helps identify patterns in data and inform predictions about future outcomes; and in medicine, it can quantify the association between patient symptoms and disease diagnosis.

When using the Pearson correlation coefficient, it is important to consider the assumptions it makes about the data: that each variable is normally distributed, that the two variables are linearly related, and that there are no outliers in the data. It is important to check these assumptions before using the Pearson correlation coefficient (Okwonu, Asaju, and Arunaye 2020).
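A quick visual inspection is often a reasonable first check of these assumptions; below is a rough sketch using base R graphics on the cars data:

plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)") # does the point cloud look roughly linear?

boxplot(cars$dist, horizontal = TRUE) # points beyond the whiskers flag potential outliers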

The Pearson correlation usually serves as an initial assessment of the relationship between variables. The Pearson r can then guide predictive analyses such as simple or multiple linear regression, or other statistical methods.
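For instance, a simple linear regression can follow directly from the correlation computed earlier; below is a minimal sketch using base R's lm() on the cars data:

speed.dist.model <- lm(dist ~ speed, data = cars) # regress stopping distance on speed

summary(speed.dist.model) # coefficients, R-squared, and p-values

In this simple two-variable case, the R-squared reported by summary() equals the square of the Pearson r obtained above (0.8068949^2, roughly 0.65).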

References

Norton, Bernard J. 1978. “Karl Pearson and Statistics: The Social Origins of Scientific Innovation.” Social Studies of Science 8 (1): 3–34. http://www.jstor.org/stable/284855.

Okwonu, Friday Zinzendoff, Bolaji Laro Asaju, and Festus Irimisose Arunaye. 2020. “Breakdown Analysis of Pearson Correlation Coefficient and Robust Correlation Methods.” IOP Conference Series: Materials Science and Engineering 917 (1): 012065. https://doi.org/10.1088/1757-899x/917/1/012065.

Ryan, Thomas A, Brian L Joiner, Barbara F Ryan, et al. 1976. Minitab Student Handbook. Duxbury Press.
