When doing data science, you often need to understand how related variables move together. A correlation between two variables measures the strength and direction of their relationship. The Pearson correlation coefficient measures the linear relationship between two variables, and it is an example of a zero-order correlation. It is the default coefficient computed by R's cor() function.
What is a zero-order correlation?
Zero-order correlations are correlations between two variables that do not control for the influence of any other variable. Correlations that control for the influence of a third variable are called first-order correlations, and correlations that control for the influence of two other variables are called second-order correlations. The Pearson correlation coefficient is a zero-order correlation between two variables whose value ranges from -1 to 1:
- One shows a perfect positive relationship.
- Zero shows there is no linear relationship.
- A negative one shows a perfect negative relationship.
Because it can take any value across this range, the Pearson coefficient is a good example of what a zero-order correlation can show.
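The three cases in the list above can be reproduced with a few lines of R. This is a minimal sketch: pearson_r is a hypothetical helper written for illustration (it is not part of R), and the small vectors are made-up data chosen so that each case comes out cleanly.

```r
# Hypothetical helper: Pearson's r computed from its definition,
# the covariance divided by the product of the standard deviations.
pearson_r <- function(x, y) {
  cov(x, y) / (sd(x) * sd(y))
}

x <- c(1, 2, 3, 4, 5)

r_pos  <- pearson_r(x, 2 * x + 1)         # y rises exactly in line with x
r_neg  <- pearson_r(x, -3 * x + 10)       # y falls exactly in line with x
r_zero <- pearson_r(x, c(2, 1, 0, 1, 2))  # symmetric V shape, no linear trend

print(c(positive = r_pos, negative = r_neg, none = r_zero))

# The hand-written version should agree with the built-in cor() function.
print(all.equal(r_pos, cor(x, 2 * x + 1)))
```

Note that the V-shaped data is clearly related to x, yet its coefficient is zero: the Pearson coefficient only measures linear relationships.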
The R cor() function
The R cor function computes the correlation coefficient between variables; by default this is the Pearson coefficient. It has the format cor(x, y), where “x” and “y” are the numeric vectors being correlated. When used with a data frame, it has the format cor(df), where “df” is the data frame whose columns are correlated. This is a simple function to use, but you need to be aware of missing values and of the possibility that a zero-order correlation does not tell the whole story about the variables. When all goes well, however, it gives a good indication of the linear relationship between the variables, and it is a simple starting place for an evaluation.
Code example – Correlation Between Two Variables
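In this example, we use the cor function on two vectors: the first time as separate vectors and the second time as columns of a data frame. Because the random seed is taken from the current time, the numbers you get will differ from those shown here.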
> t = as.numeric(Sys.time())
> set.seed(t)
> x = rnorm(100)
> y = rnorm(100)
> cor(x,y)
[1] -0.05795285
> df = data.frame(x, y)
> head(df)
            x          y
1 -0.20603713 -0.1032161
2 -1.36476014  0.9872797
3  0.40895404  0.5785601
4 -1.19948379 -0.3422506
5 -0.06184609  0.9271071
6  0.03113741  0.7985931
> cor(df)
            x           y
x  1.00000000 -0.05795285
y -0.05795285  1.00000000
Note that when we correlate the two vectors by themselves, we get a single number describing the relationship between them. When we pass them to the function as a data frame, we get a matrix with every possible pairing: each column is correlated with itself (the ones on the diagonal) and with the other column.
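To make the link between the two forms concrete, the sketch below pulls a single entry out of the correlation matrix and compares it with the two-vector call. The fixed seed and the names m and r_xy are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary fixed seed so the run is repeatable
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(x, y)

m <- cor(df)

# Pull the x-versus-y entry out of the matrix by row and column name.
r_xy <- m["x", "y"]

# It is the same number cor() returns for the two vectors directly.
print(all.equal(r_xy, cor(x, y)))

# The matrix is symmetric, with ones on the diagonal.
print(isSymmetric(m))
print(diag(m))
```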
Code example – Correlations Between Multiple Variables
In this example, we put three vectors in a data frame and then pass the data frame as the argument to the cor function.
> t = as.numeric(Sys.time())
> set.seed(t)
> x = rnorm(100)
> y = rnorm(100)
> z = rnorm(100)
> df = data.frame(x, y, z)
> head(df)
           x          y          z
1 -0.8992010  0.4794814 -0.6279939
2  0.5735586  0.3619869 -1.0066940
3  1.1040821  0.5271032  0.8246742
4 -0.9078135 -0.2761389  0.9254025
5 -0.4676835  0.4207246  0.8302730
6  1.7401241 -0.3310062  0.1478014
> cor(df)
             x           y            z
x  1.000000000 -0.07137339 -0.001565518
y -0.071373390  1.00000000  0.105806784
z -0.001565518  0.10580678  1.000000000
As before, each column is correlated with itself and with the other columns. Passing a data frame is the simplest way to correlate several variables at once; a call such as cor(x, y, z) produces an error, because cor accepts only x and y as data arguments.
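If you only care about how one variable relates to the others, you do not need the full matrix. As a rough sketch (the seed and the name one_vs_rest are arbitrary), cor also accepts a vector as x and a data frame of the remaining columns as y, and returns a one-row matrix:

```r
set.seed(2)  # arbitrary fixed seed so the run is repeatable
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
df <- data.frame(x, y, z)

# Correlate x with each of the other columns in a single call.
one_vs_rest <- cor(df$x, df[, c("y", "z")])
print(one_vs_rest)

# The values match the "x" row of the full correlation matrix.
full <- cor(df)
print(all.equal(as.numeric(one_vs_rest), as.numeric(full["x", c("y", "z")])))
```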
Troubleshooting errors: handling missing data
Unfortunately, you can run into several potential problems. For example, a zero-order correlation may partly reflect the influence of other variables on the two being compared, so it does not by itself show that they are directly related. Furthermore, a single missing value can turn the results into NA.
> t = as.numeric(Sys.time())
> set.seed(t)
> x = rnorm(100)
> y = rnorm(100)
> z = rnorm(100)
> x[rbinom(100, 1, 0.3) == 1] = NA
> y[rbinom(100, 1, 0.3) == 1] = NA
> z[rbinom(100, 1, 0.3) == 1] = NA
> df = data.frame(x, y, z)
> head(df)
           x          y          z
1 -0.8992010  0.4794814         NA
2         NA  0.3619869 -1.0066940
3  1.1040821  0.5271032  0.8246742
4 -0.9078135         NA  0.9254025
5 -0.4676835  0.4207246  0.8302730
6  1.7401241 -0.3310062  0.1478014
> cor(df)
   x  y  z
x  1 NA NA
y NA  1 NA
z NA NA  1
> cor(df, use = "complete.obs")
            x          y           z
x  1.00000000 0.11335432 -0.05466917
y  0.11335432 1.00000000  0.08516062
z -0.05466917 0.08516062  1.00000000
In this example, our data frame has several missing values. As a result, the cor function returns NA for every pair that involves a column with missing values. Adding the argument use = "complete.obs" causes the function to drop every row that contains a missing value before calculating, which yields a complete set of correlations.
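Dropping whole rows can throw away a lot of data when the missing values are spread across columns. The sketch below, on a small made-up data frame, compares "complete.obs" with "pairwise.complete.obs", another documented value of the use argument that keeps every row where both columns of a given pair are present.

```r
# Small hand-made data frame with one NA in each of two columns.
df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, NA, 5),
  z = c(NA, 3, 1, 4, 2, 6)
)

# Rows with no missing values at all: these are what "complete.obs" keeps.
kept <- complete.cases(df)
print(sum(kept))

listwise <- cor(df, use = "complete.obs")
pairwise <- cor(df, use = "pairwise.complete.obs")

# Listwise deletion is the same as dropping incomplete rows first.
print(all.equal(listwise, cor(na.omit(df))))

# Pairwise deletion only drops row 5 for the x-y pair, because that is
# the only row where x or y is missing.
print(all.equal(pairwise["x", "y"], cor(df$x[-5], df$y[-5])))
```

Pairwise deletion uses more of the data, but each entry of the matrix may then be based on a different set of rows, so the entries are not strictly comparable with each other.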
Real-world applications of zero-order correlations
There are many applications of zero-order correlations; they are among the most commonly used statistical tools. Businesses use them to understand the relationship between advertising and the revenue that it produces. In agriculture, they help in understanding how fertilizer and water relate to crop yield. Doctors use them to gauge how the use of a medicine relates to patient outcomes. Sports teams use them to understand how training relates to player performance. They can also describe the relationship between speed and fuel consumption in vehicles. The applications of such correlations between variables are endless.
Summary and closing
A zero-order correlation is a simple statistical tool for showing the relationship between two variables. These correlations are simple because they do not consider additional factors such as third variables. However, this can lead to misleading conclusions if those additional factors exist. The cor function calculates this correlation for pairs of vectors or for the columns of a data frame. This type of correlation has a lot of applications, but you do need to keep an eye out for the possibility that you are using too simple a model. You can check for this by comparing the zero-order correlation with correlations that control for the other variables. With the proper precautions, zero-order correlations can be a helpful statistical tool.
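One way to check whether a third variable is behind a zero-order correlation is to compute the first-order correlation and compare the two. The sketch below does this on simulated data in which z drives both x and y; the seed, sample size, and noise level are arbitrary assumptions. It uses the residual approach: regress x and y on z with lm, then correlate what is left over.

```r
set.seed(123)  # arbitrary fixed seed so the run is repeatable
n <- 500
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)  # x depends on z plus noise
y <- z + rnorm(n, sd = 0.5)  # y depends on z plus noise, not on x

# Zero-order correlation: ignores z entirely.
zero_order <- cor(x, y)

# First-order correlation of x and y controlling for z:
# correlate the residuals of x and y after regressing each on z.
first_order <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

print(c(zero_order = zero_order, first_order = first_order))
```

Here the zero-order correlation is strong while the first-order correlation is close to zero, which suggests that the apparent link between x and y comes from z.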