If you call the predict function on a linear regression model (fit with lm) whose fit is rank deficient, you will get this warning message. Fixing the problem usually requires simplifying the model. The warning does not stop the program from running or producing results. It does, however, indicate that those results may be meaningless.

### Description of the warning (prediction from a rank-deficient fit may be misleading)

The predict function generates predictions from a fitted model, such as a linear regression created with the lm function. It uses the values from the data set you supply to calculate predicted values and, optionally, prediction intervals. When the underlying fit is rank deficient, meaning lm could not estimate all of the requested coefficients, predict still returns values, but it produces this warning message. The warning is a safeguard against silently trusting an overparameterized fit: it occurs because there is a real problem with the model, not with predict itself.

### Explanation of the warning message

There are two main ways that you can create a rank deficiency that will trigger this warning message. The first is by having two predictor variables that correlate perfectly. The second is to have more parameters than your data set has observations.

```
> df = data.frame(A=c(1,2,3,4),
+                 B=c(2,4,6,8),
+                 C=c(6,10,19,26))
>
> m = lm(C~A+B, data=df)
>
> predict(m, df)
   1    2    3    4
 4.9 11.8 18.7 25.6
Warning message:
In predict.lm(m, df) :
  prediction from a rank-deficient fit may be misleading
```

In this example, the predictor variables A and B are perfectly correlated, because B = 2*A. That redundancy produces our warning message.
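One way to see the deficiency directly (a sketch reusing the same data frame as the example above) is to inspect the fitted coefficients: lm reports NA for any term it had to drop.

```r
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(2, 4, 6, 8),    # B = 2 * A: perfectly collinear
                 C = c(6, 10, 19, 26))
m <- lm(C ~ A + B, data = df)

coef(m)           # the coefficient for B is NA: lm dropped it from the fit
m$rank            # 2 coefficients were actually estimated
length(coef(m))   # 3 were requested, so rank < length and predict() warns
```

Whenever m$rank is smaller than length(coef(m)), any call to predict on that model will trigger the warning.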

```
> df = data.frame(A=c(1,2,3,4),
+                 B=c(3,6,12,24),
+                 C=c(4,6,8,10),
+                 D=c(6,10,14,18))
>
> model = lm(D~A*B*C, data=df)
>
> predict(model, df)
 1  2  3  4
 6 10 14 18
Warning message:
In predict.lm(model, df) :
  prediction from a rank-deficient fit may be misleading
```

In this example, the formula D~A\*B\*C expands to eight parameters (an intercept, three main effects, and four interaction terms), but the data set has only four observations. It is this excess of parameters over observations that produces the warning message.
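You can count the parameters yourself (again reusing the data from the example): the formula asks for eight coefficients while only four observations are available, so the surplus terms come back as NA.

```r
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(3, 6, 12, 24),
                 C = c(4, 6, 8, 10),
                 D = c(6, 10, 14, 18))
model <- lm(D ~ A * B * C, data = df)

length(coef(model))      # 8: intercept, A, B, C, A:B, A:C, B:C, A:B:C
model$rank               # at most 4, since there are only 4 observations
sum(is.na(coef(model)))  # the inestimable terms are reported as NA
```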

### How to fix the warning

These two examples show how to fix this problem.

```
> df = data.frame(A=c(1,2,3,4),
+                 B=c(2,4,6,8),
+                 C=c(6,10,19,26))
>
> m = lm(C~A, data=df)
>
> predict(m, df)
   1    2    3    4
 4.9 11.8 18.7 25.6
```

In this example, the problem is fixed by removing the redundant predictor variable B from the model. Eliminating the redundancy clears the warning message.

```
> df = data.frame(A=c(1,2,3,4),
+                 B=c(3,6,12,24),
+                 C=c(4,6,8,10),
+                 D=c(6,10,14,18))
>
> model = lm(D~A*B, data=df)
>
> predict(model, df)
 1  2  3  4
 6 10 14 18
```

In this example, the problem is fixed by removing predictor variable C (and its interactions) from the model. This clears the warning message by reducing the parameter count from eight to four, no more than the number of observations.

This warning message is easy to understand and easy to fix: it simply requires adjusting the model. In both examples, removing a predictor variable solved the problem. Once you understand why it occurs, it is easy to avoid.

This warning commonly appears in data science, predictive analytics, and machine learning projects, especially when you are working with "wild" training data. It basically indicates that your independent variables may not be as independent as you believe, particularly if you were forced to pick from the variables available or to use categorical coding schemes you did not create.
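A quick screen for hidden dependence among predictors, using only base R, is to compare the rank of the design matrix against its number of columns before fitting anything (this sketch reuses the first example's data):

```r
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(2, 4, 6, 8),
                 C = c(6, 10, 19, 26))

X <- model.matrix(~ A + B, data = df)  # design matrix: intercept, A, B
qr(X)$rank        # 2
ncol(X)           # 3: rank < columns means the fit will be rank deficient
cor(df$A, df$B)   # 1: the two predictors are perfectly correlated
```

If qr(X)$rank equals ncol(X), lm will be able to estimate every coefficient and predict will not warn.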

Other alternatives beyond adjusting the model's parameters include changing your overall approach, such as trying a different linear model specification or cluster analysis. If you're feeling brave, you could also simply move on to cross validation and assess the outcome, since this is a warning about a prediction, not an outright error. (Don't do this for your thesis; it is strictly a "real world" move for when you can mitigate the risk of model failure through other means.)
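If you do go the cross-validation route, a minimal leave-one-out sketch in base R looks like this (it uses the simplified model from the fix above; the data are the same toy values, so the error estimate illustrates the method rather than a benchmark):

```r
df <- data.frame(A = c(1, 2, 3, 4),
                 B = c(2, 4, 6, 8),
                 C = c(6, 10, 19, 26))

# Leave-one-out cross-validation: refit without each row, then predict it back.
errors <- sapply(seq_len(nrow(df)), function(i) {
  fit <- lm(C ~ A, data = df[-i, ])        # train on the other rows
  pred <- predict(fit, newdata = df[i, ])  # predict the held-out row
  df$C[i] - pred                           # out-of-sample residual
})
sqrt(mean(errors^2))  # LOOCV root mean squared error
```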