How to Count Missing Values in R

Data sets used for analysis often contain missing data. When that happens, counting the missing values in R is a useful first step in dealing with them. Dealing with missing data includes both working around it and replacing it. In both cases, knowing how much data is missing helps you better understand the results you get.

Description: How to Count Missing Values in R

To count missing values in R, use the sum function in the form sum(is.na(x)), where “x” is the data set being evaluated for missing values. This is a straightforward process that simply finds and counts the missing values in a data set. The sum function can also count other values within a data set; the is.na form is the special case focused on missing values. To count ordinary values, pass sum a logical comparison in the form sum(x == value), where “x” is the data set and “value” is the value being looked for. If “x” itself contains missing values, add na.rm = TRUE, because comparing NA with any value yields NA.
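The two forms can be compared side by side. The following is a minimal sketch; the vector and the variable names count_threes and count_missing are made up for illustration and are not part of the original article.

```r
# Hypothetical example vector; the values are made up for illustration.
x <- c(3, 5, 3, NA, 7, 3)

# x == 3 is NA wherever x is NA, so na.rm = TRUE drops those comparisons
# before the TRUE values are added up.
count_threes <- sum(x == 3, na.rm = TRUE)
print(count_threes)

# Missing values are counted with is.na() rather than ==,
# because x == NA evaluates to NA for every element.
count_missing <- sum(is.na(x))
print(count_missing)
```

Without na.rm = TRUE, the first call would return NA, since the sum of a logical vector that contains NA is itself NA.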

Explanation – How This Works

In real-life situations a missing value is usually a missing observation, so when you count missing values you are actually counting missing observations. The counting works in two steps. First, is.na(x) checks every element of the data set and returns a logical vector that is TRUE wherever the value is NA and FALSE everywhere else. Second, the sum function adds up that vector, treating each TRUE as 1 and each FALSE as 0. Non-missing values therefore contribute nothing, and the total is the number of NA values. The general form works the same way: sum counts each element for which the logical argument is true.
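To make the mechanism visible, the sketch below splits the one-line count into its parts. The intermediate names missing_flags, n_missing and prop_missing are introduced here for illustration only.

```r
x <- c(1, 2, NA, 3, NA, 4, 5)

# is.na() returns one TRUE/FALSE flag per element of x.
missing_flags <- is.na(x)
print(missing_flags)

# sum() coerces TRUE to 1 and FALSE to 0, so the total is the NA count.
n_missing <- sum(missing_flags)
print(n_missing)

# mean() of the same flags gives the proportion of values that are missing.
prop_missing <- mean(missing_flags)
print(prop_missing)
```

Taking the mean of the flags instead of the sum is a convenient way to express the same information as a fraction of the data set.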

Examples of Counting Missing Values in R

Here are three examples of counting missing values in R. They each show missing values being counted under different situations.

> x = c(1, 2, NA, 3, NA, 4, 5)
> x
[1]  1  2 NA  3 NA  4  5
> sum(is.na(x))
[1] 2

This example is a simple case of counting NA values. A similar process can be used to count NaN values by replacing is.na with is.nan. These “Not a Number” values usually result from calculations whose results are undefined, such as dividing zero by zero. Note that is.na also returns TRUE for NaN, so sum(is.na(x)) counts both kinds.
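The difference between NA and NaN counts can be sketched as follows. The vector y and the variable names are hypothetical; is.na and is.nan are base R functions.

```r
# 0/0 is undefined and yields NaN; 1/0 yields Inf, which is not a missing value.
y <- c(1, NA, 0/0, 1/0, 4)

# is.nan() is TRUE only for NaN.
n_nan <- sum(is.nan(y))

# is.na() is TRUE for both NA and NaN.
n_na_or_nan <- sum(is.na(y))

# Combining the two isolates the values that are NA but not NaN.
n_na_only <- sum(is.na(y) & !is.nan(y))

print(c(n_nan, n_na_or_nan, n_na_only))
```

Which count is appropriate depends on whether undefined calculation results should be treated as missing observations in your analysis.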

> x = c(1, 2, 3, 4, 5, 6, 7)
> x
[1] 1 2 3 4 5 6 7
> sum(is.na(x))
[1] 0

This example illustrates the result when there are no NA values.

> x = c(2, 3, NA, 7, 8, NA, 9)
> x
[1]  2  3 NA  7  8 NA  9
> sum(is.na(x))
[1] 2

This example repeats the count on a different vector, which also contains two NA values.

Imputation: How to Count Missing Values in R

Imputation is the process of substituting a value for an NA value. A common choice is the mean of the vector or data frame column, but there are ways to produce unique values for each case.

> x = c(2, 3, NA, 7, 8, NA, 9)
> x
[1]  2  3 NA  7  8 NA  9
> x[is.na(x)] = mean(x, na.rm=TRUE)
> x
[1] 2.0 3.0 5.8 7.0 8.0 5.8 9.0

In this example, we used the mean function to perform a mean imputation of a vector: each NA is replaced by the mean of the non-missing values, which here is 5.8. An alternative way of performing an imputation is to use the impute function from the Hmisc package.
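The same idea extends to data frame columns. The following is a minimal base R sketch, assuming a small made-up data frame named measurements with columns height and weight; it does not use the Hmisc package.

```r
# Hypothetical data frame; column names and values are made up for illustration.
measurements <- data.frame(
  height = c(150, NA, 170, 180),
  weight = c(60, 70, NA, 90)
)

# Replace the NAs in each column with the mean of that column's
# non-missing values.
for (col in names(measurements)) {
  col_mean <- mean(measurements[[col]], na.rm = TRUE)
  measurements[[col]][is.na(measurements[[col]])] <- col_mean
}

print(measurements)

# After imputation there should be no missing values left.
remaining_na <- sum(is.na(measurements))
print(remaining_na)
```

Counting the NA values again after imputing is a quick check that every missing value was actually replaced.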

Application: How to Count Missing Values in R

One basic application for counting NA values is as a measurement of the quality of your data. If a high percentage of the data you are working with consists of NA values, then you have a data quality problem. Another application for counting NA values is in preparation for dealing with their presence. For example, the number of NA values is needed for some methods of imputation, particularly if you are trying to provide a unique imputed value for each missing value. Counting NA values provides useful statistical information about the quality of the data you are working on. It is not something that should be ignored.
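As a rough illustration of the data quality check, the sketch below reports missing values per column of a data frame. The data frame survey and its columns are hypothetical; colSums, colMeans and complete.cases are standard R functions.

```r
# Hypothetical data frame; column names and values are made up for illustration.
survey <- data.frame(
  id    = 1:5,
  score = c(10, NA, 30, NA, 50),
  group = c("a", "b", NA, "b", "a")
)

# Number of missing values in each column.
na_count <- colSums(is.na(survey))
print(na_count)

# Percentage of missing values in each column.
na_percent <- round(100 * colMeans(is.na(survey)), 1)
print(na_percent)

# Number of rows that contain at least one missing value.
n_incomplete_rows <- sum(!complete.cases(survey))
print(n_incomplete_rows)
```

What counts as a "high" percentage is a judgment call that depends on the data and the analysis; the code only supplies the numbers.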

Base R has no dedicated function for counting missing values. However, you can combine the sum function with the is.na missing-value detection function to accomplish the task. It is a straightforward process that can tell you a lot about the quality of the data you are working with.