When doing data science, you will sometimes encounter a data set that contains missing values. R designates these with NA, and it is often important to count them. In R, this is a simple process that requires nesting one function inside another.

## Counting NA values in an R vector

Counting NA values is a simple matter of using the is.na function as an argument to the sum function. The format is sum(is.na(vector)), where “vector” is the name of the vector being evaluated. The is.na function returns TRUE for each missing element and FALSE for every other element, and sum adds up the TRUE values, so the nested call counts the NA values within the vector. If there are no NA values, the result is zero. Otherwise, the result is an integer giving the number of NA values.
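To see why the combination works, it can help to look at each step separately. The following is a minimal sketch using a small made-up vector (the names `v`, `missing_flags`, and `na_count` are only for illustration).

```r
# A small illustrative vector; the values are made up for this sketch
v <- c(2, NA, 7, NA)

# is.na() returns TRUE wherever a value is missing
missing_flags <- is.na(v)
print(missing_flags)

# sum() treats TRUE as 1 and FALSE as 0, so it counts the missing values
na_count <- sum(missing_flags)
print(na_count)
```

Writing sum(is.na(v)) does both steps in one expression.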

## Why We Do This

Whenever a vector or a data frame has a position where a value is expected but none is available, R represents that position with an NA value. Counting them is a matter of combining a function that checks each element for NA with a function that totals how many it finds. All you need to supply is the vector to evaluate.
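As one concrete illustration of how such gaps can arise, here is a minimal sketch in which a numeric field is left empty in a small inline CSV. The data and the names `csv_text` and `d` are made up for this example.

```r
# Made-up inline CSV: the second data row has no value in column b
csv_text <- "a,b\n1,2\n3,\n"

# read.csv() parses the empty numeric field as NA
d <- read.csv(text = csv_text)
print(d)

# The same nested call counts the missing values in column b
sum(is.na(d$b))
```

Note that this applies to numeric columns. By default, an empty field in a character column is read as an empty string rather than NA.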

## Examples of counting NA values in an R vector

Here we have three examples of counting NA values. The first is a basic defined vector, the second extracts three columns from one of the built-in data sets, and the third is randomly generated.

> x = c(1,5,NA,9,4,NA,4,NA,5)

> x

[1] 1 5 NA 9 4 NA 4 NA 5

> sum(is.na(x))

[1] 3

This example takes a vector defined with the c function and counts the number of NA values.

> df = datasets::airquality

> oz = df$Ozone

> sr = df$Solar.R

> wd = df$Wind

> sum(is.na(oz))

[1] 37

> sum(is.na(sr))

[1] 7

> sum(is.na(wd))

[1] 0

This example uses one of the built-in data sets, specifically the airquality data set. The code extracts the ozone, solar radiation, and wind columns and counts the NA values in each one.
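When every column needs checking, the same idea can be applied to the whole data frame at once. The sketch below is one way to do it, assuming base R's colSums function. The name `na_per_column` is only for illustration.

```r
# airquality ships with base R in the datasets package
df <- datasets::airquality

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)
```

The result is a named vector with one count per column, so the Ozone, Solar.R, and Wind entries match the counts obtained one column at a time.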

> t = as.numeric(Sys.time())

> set.seed(t)

> l = as.integer(abs(rnorm(1)*10))

> n = as.integer(rnorm(l)*10)

> x = as.integer(abs(rnorm(l)*10))

> for(i in seq_len(l)){if(n[i] < 0){x[i] = NA}}

> x

[1] 6 10 NA NA NA 9 NA 14 NA 2 NA NA 9 NA NA 16

> sum(is.na(x))

[1] 9

This example builds a vector of random length containing random data and replaces a random number of its elements with NA.
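The element-by-element replacement can also be written with logical indexing, which is the more common R idiom. The following is a sketch rather than a drop-in replacement: it assumes a fixed seed and a fixed length (`len`, a hypothetical name) so that the result is reproducible.

```r
set.seed(42)  # fixed seed, chosen arbitrarily, so the run is reproducible

len <- 10                               # hypothetical fixed length for this sketch
n <- as.integer(rnorm(len) * 10)        # helper values that decide where NAs go
x <- as.integer(abs(rnorm(len) * 10))   # the data vector

# Logical indexing replaces the loop: set x to NA wherever n is negative
x[n < 0] <- NA

# The NA count equals the number of negative helper values
print(sum(is.na(x)) == sum(n < 0))
```

Logical indexing also behaves sensibly when the vector happens to be empty, since there is simply nothing to replace.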

## How To Use This…

One important application of this process is comparing the number of NA values with the total size of the vector, so as to judge how complete the data is. For example, if you have a vector with a thousand data points, but eight hundred of them are missing, then the two hundred remaining values are probably too few to support reliable conclusions.
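One way to make this comparison is to compute the share of missing values directly. The sketch below reuses the vector from the first example. The names `na_count` and `na_share` are only for illustration, and what share counts as too high depends on the analysis.

```r
x <- c(1, 5, NA, 9, 4, NA, 4, NA, 5)   # the vector from the first example

na_count <- sum(is.na(x))    # number of missing values
na_share <- mean(is.na(x))   # proportion of missing values

print(na_count)
print(na_share)
```

Because mean averages the TRUE and FALSE values, it gives the same result as dividing sum(is.na(x)) by length(x).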

When doing data science, it is important to know that the data you are working with is complete enough to be trusted, and counting NA values is an important part of this process. It will help you to know whether or not your data set is loaded with junk.