The phi coefficient measures the association between two binary variables, based on how often each combination of their values occurs. It shows how strongly the variables are related and whether the relationship is positive or negative. The coefficient is only meaningful when both variables are binary, that is, when each variable can take only one of two values. The two variables can then be placed in a chart showing the four possible combinations that result from their overlap, and the value of the coefficient is calculated from the number of occurrences of each combination.
What is the Phi Coefficient
The phi coefficient is a value that shows the degree of association between two binary variables. It serves as an effect size: its magnitude shows how strong the relationship is, and its sign shows the direction. It is calculated by the formula:
Phi = (AD - BC) / √[(A+B)(C+D)(A+C)(B+D)]
where A, B, C, and D are the counts in the following chart:
[0][1]
[0] A B
[1] C D
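As a minimal sketch, the formula can be wrapped in a small R function. The name phi_from_counts is illustrative and not part of any package; its arguments follow the A, B, C, D layout of the chart above.

```r
# Hypothetical helper: phi from the four cell counts of a 2x2 chart.
# A = [0][0], B = [0][1], C = [1][0], D = [1][1]
phi_from_counts <- function(A, B, C, D) {
  numerator <- A * D - B * C
  denominator <- sqrt((A + B) * (C + D) * (A + C) * (B + D))
  numerator / denominator
}

phi_from_counts(1, 1, 1, 1)  # balanced chart: 0
phi_from_counts(2, 1, 1, 2)  # more counts in A and D: 0.3333333
phi_from_counts(1, 2, 2, 1)  # more counts in B and C: -0.3333333
```

If an entire row or column of the chart is zero, the denominator is zero and this sketch returns NaN, because phi is undefined in that case.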
The following patterns produce a phi = 0.
[0][1]
[0] 1 1
[1] 1 1
phi = 0
[0][1]
[0] 2 2
[1] 1 1
phi = 0
[0][1]
[0] 1 1
[1] 2 2
phi = 0
[0][1]
[0] 2 1
[1] 2 1
phi = 0
[0][1]
[0] 1 2
[1] 1 2
phi = 0
The following patterns have a positive phi showing a positive relationship.
[0][1]
[0] 2 1
[1] 1 1
phi = 0.17
[0][1]
[0] 1 1
[1] 1 2
phi = 0.17
[0][1]
[0] 2 1
[1] 1 2
phi = 0.33
The following patterns have a negative phi showing a negative relationship.
[0][1]
[0] 1 2
[1] 1 1
phi = -0.17
[0][1]
[0] 1 1
[1] 2 1
phi = -0.17
[0][1]
[0] 1 2
[1] 2 1
phi = -0.33
These examples illustrate that counts in A and D push phi in the positive direction, while counts in B and C push it in the negative direction. This follows from the numerator of the formula, AD - BC.
Why Use the Phi Coefficient in R
There are many reasons to use the phi coefficient when working with binary variables in R. It can be helpful in machine learning, where it is also known as the Matthews correlation coefficient. In this case, it is used to compare predicted and actual results when checking the validity of a model.
|| Predicted
Actual || Yes No
Yes TP FN
No FP TN
TP is the number of true positives.
TN is the number of true negatives.
FP is the number of false positives.
FN is the number of false negatives.
When you calculate the coefficient for this arrangement, it shows how well the predictions agree with the actual results.
Phi = 1 shows total agreement between actual and predicted.
Phi = 0 shows predictions that are no better than random guessing.
Phi = -1 shows total disagreement between actual and predicted.
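The comparison of predicted and actual results can be sketched in base R as follows. The labels are made up for illustration and do not come from a real model; table() builds the 2x2 arrangement, with actual classes in rows and predicted classes in columns.

```r
# Made-up labels for illustration only (not real model output).
actual    <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"),
                    levels = c("Yes", "No"))
predicted <- factor(c("Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes"),
                    levels = c("Yes", "No"))

# Rows are actual classes, columns are predicted classes.
cm <- table(Actual = actual, Predicted = predicted)

# Convert to numeric so the products cannot overflow integer range.
TP <- as.numeric(cm["Yes", "Yes"])  # actual Yes, predicted Yes
FN <- as.numeric(cm["Yes", "No"])   # actual Yes, predicted No
FP <- as.numeric(cm["No", "Yes"])   # actual No, predicted Yes
TN <- as.numeric(cm["No", "No"])    # actual No, predicted No

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc  # 0.5 for these labels
```

Fixing the factor levels keeps the table 2x2 even if one class never appears in the predictions.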
You also want to use this coefficient when dealing with 2x2 contingency tables, where it is also known as the mean square contingency coefficient and Yule's phi.
phi = 1 shows a completely positive relationship between the two variables.
phi = 0 shows no relationship between the two variables.
phi = -1 shows a completely negative relationship between the two variables.
The following example, based on actual data, shows the coefficient in action.
Male Female
Theist 57 69
Atheist 12 6
phi = -0.1418599
First of all, this means that there is a slightly negative relationship in this situation. The sign of phi does not indicate anything qualitative about the relationship. In fact, it can be reversed simply by reversing the order of the values in one of the variables. It merely indicates which side the relationship favors.
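The value above can be reproduced in R, together with the sign reversal that comes from reordering one variable. This is a sketch: phi_2x2 is a hypothetical helper name, and the matrix simply restates the counts from the table.

```r
# Counts from the table above: rows Theist/Atheist, columns Male/Female.
beliefs <- matrix(c(57, 12, 69, 6), nrow = 2,
                  dimnames = list(c("Theist", "Atheist"),
                                  c("Male", "Female")))

# Hypothetical helper applying the phi formula to a 2x2 matrix.
phi_2x2 <- function(m) {
  (m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]) /
    sqrt(prod(rowSums(m)) * prod(colSums(m)))
}

phi_2x2(beliefs)         # about -0.14, matching the value above
phi_2x2(beliefs[2:1, ])  # rows swapped: same magnitude, opposite sign
```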
Examples of Code to Calculate the Phi Coefficient
Here are three examples of code that calculate this coefficient. All three examples use the same values, so they produce the same result.
> x = data.frame(a = c(4, 8),
+ b = c(9, 6))
> x
a b
1 4 9
2 8 6
> phic = (x$a[1] * x$b[2] - x$b[1] * x$a[2]) / sqrt((x$a[1] + x$b[1]) * (x$a[2] + x$b[2]) * (x$a[1] + x$a[2]) * (x$b[1] + x$b[2]))
> phic
[1] -0.2651974
This example applies the coefficient’s formula to a simple data frame.
> x = data.frame(Male = c(4, 8),
+ Female = c(9, 6))
> rownames(x) = c("Nonsmoker", "Smoker")
> x
Male Female
Nonsmoker 4 9
Smoker 8 6
> phic = (x$Male[1] * x$Female[2] - x$Female[1] * x$Male[2]) / sqrt((x$Male[1] + x$Female[1]) * (x$Male[2] + x$Female[2]) * (x$Male[1] + x$Male[2]) * (x$Female[1] + x$Female[2]))
> phic
[1] -0.2651974
This example applies the coefficient’s formula to a data frame that has meaningful column and row names. It is not based on real data.
> x = matrix(c(4, 8, 9, 6), nrow = 2)
> x
[,1] [,2]
[1,] 4 9
[2,] 8 6
> phic = (x[1,1] * x[2,2] - x[2,1] * x[1,2]) / sqrt((x[1,1] + x[1,2]) * (x[2,1] + x[2,2]) * (x[1,1] + x[2,1]) * (x[1,2] + x[2,2]))
> phic
[1] -0.2651974
This example shows how to apply the coefficient’s formula to a matrix. In each of these cases, the data structure is slightly different, but the formula and the result are the same.
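Because the three examples differ only in how the counts are stored, the formula can be written once and reused. The sketch below assumes a hypothetical wrapper named phi_table that converts its input with as.matrix(), so it accepts a matrix or a data frame of counts.

```r
# Hypothetical wrapper: accepts a 2x2 matrix or data frame of counts.
phi_table <- function(x) {
  m <- as.matrix(x)
  stopifnot(is.numeric(m), identical(dim(m), c(2L, 2L)))
  margins <- c(rowSums(m), colSums(m))
  if (any(margins == 0)) {
    return(NA_real_)  # phi is undefined when a row or column is empty
  }
  (m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]) / sqrt(prod(margins))
}

counts_matrix <- matrix(c(4, 8, 9, 6), nrow = 2)
counts_frame  <- data.frame(Male = c(4, 8), Female = c(9, 6),
                            row.names = c("Nonsmoker", "Smoker"))

phi_table(counts_matrix)  # -0.2651974
phi_table(counts_frame)   # -0.2651974
```

Returning NA for an empty row or column is a design choice of this sketch. It avoids a silent division by zero.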
Common Problems with Using Phi
There are problems that can arise when using phi with a pair of binary variables. Besides the possibility of messing up the calculation by putting values in the wrong place, there are problems directly related to statistics itself.
One problem is not having a sufficient sample size. If your sample is too small, it will likely not be representative of the population, and the results will be meaningless. For example, if you have a population of a million people and you survey only five of them, you will probably not have a representative sample. This is an easy problem to fall into whenever you are sampling from a population, so make sure the sample is large enough before interpreting phi.
Another frequent problem involves marginal probabilities. A marginal probability is also known as an unconditional probability because it does not depend on any other event. In a contingency table, the two variables may have no influence on each other, so that the counts reflect nothing more than random variation. Phi can still come out non-zero in that situation, as the following examples show.
> t = as.numeric(Sys.time())
> set.seed(t)
> x = matrix(abs(as.integer(rnorm(4)*10)), nrow = 2)
> x
[,1] [,2]
[1,] 11 14
[2,] 13 6
> phic = (x[1,1] * x[2,2] - x[2,1] * x[1,2]) / sqrt((x[1,1] + x[1,2]) * (x[2,1] + x[2,2]) * (x[1,1] + x[2,1]) * (x[1,2] + x[2,2]))
> phic
[1] -0.2429353
> t = as.numeric(Sys.time())
> set.seed(t)
> x = matrix(abs(as.integer(rnorm(4)*10)), nrow = 2)
> x
[,1] [,2]
[1,] 15 8
[2,] 1 11
> phic = (x[1,1] * x[2,2] - x[2,1] * x[1,2]) / sqrt((x[1,1] + x[1,2]) * (x[2,1] + x[2,2]) * (x[1,1] + x[2,1]) * (x[1,2] + x[2,2]))
> phic
[1] 0.5420113
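These two examples use random numbers, with the system time being used as the seed, so your own output will differ from run to run. They still produce non-zero values for phi. This means that you can get clearly positive or negative results even when the values are random. Before interpreting phi, check that it is larger than chance alone would explain, for example with a significance test such as the chi-squared test of independence.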
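One way to check whether a non-zero phi could be due to chance is a chi-squared test of independence, available in base R as chisq.test(). The sketch below reuses the counts from the first random table above. For a 2x2 table, the uncorrected chi-squared statistic equals n times phi squared, where n is the total count. Treat this as an illustration, not the only valid test.

```r
# Counts from the first random table above.
x <- matrix(c(11, 13, 14, 6), nrow = 2)

phic <- (x[1, 1] * x[2, 2] - x[2, 1] * x[1, 2]) /
  sqrt(prod(rowSums(x)) * prod(colSums(x)))

# Chi-squared test of independence without continuity correction.
test <- chisq.test(x, correct = FALSE)

# For a 2x2 table the statistic equals n * phi^2.
unname(test$statistic)
sum(x) * phic^2

# A large p-value means the table is compatible with no association.
test$p.value
```

With very small counts, chisq.test() warns that the approximation may be inaccurate. Base R's fisher.test() is a common alternative in that case.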
Alternatives in Using the Phi Coefficient
So far, we have written out the full formula for the phi coefficient, but there is an alternative way of calculating it that works for both matrices and data frames. This alternative requires having the psych package installed. The phi function from the psych package has the format of phi(ct), where “ct” is the contingency table being evaluated. The following two examples show it being used with a matrix and a data frame.
> library(psych)
> x = matrix(c(4, 8, 9, 6), nrow = 2)
> x
[,1] [,2]
[1,] 4 9
[2,] 8 6
> phi(x)
[1] -0.27
This example shows the phi function being used with a matrix.
> library(psych)
> x = data.frame(a = c(4, 8),
+ b = c(9, 6))
> x
a b
1 4 9
2 8 6
> phi(x)
[1] -0.27
This example shows the phi function being used with a data frame. In both cases, the result is rounded to two decimal places, which is the default of the function’s digits argument.
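If the psych package is not available, base R offers another route. The phi coefficient equals the Pearson correlation between the two binary variables when each is coded as 0 and 1. The sketch below expands the same table of counts into one pair of 0/1 values per observation and passes them to cor(). The names row_var and col_var are illustrative.

```r
# Counts from the examples above (filled column by column).
x <- matrix(c(4, 8, 9, 6), nrow = 2)

# One 0/1 value per observation: the row and column of its cell.
# as.vector(x) lists the cells in the order [1,1], [2,1], [1,2], [2,2].
row_var <- rep(c(0, 1, 0, 1), times = as.vector(x))
col_var <- rep(c(0, 0, 1, 1), times = as.vector(x))

cor(row_var, col_var)  # -0.2651974, unrounded
```

This form is convenient when the data are already stored as two 0/1 columns and not as a table of counts.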
The phi coefficient is a way of showing the relationship between two binary variables. It shows which side the data favors. While it is a useful tool, there are potential problems you should watch out for. Used properly, it supplies important relationship information about the two variables. Thank you for taking the time to read this article.