Real-world data collection doesn’t always follow the rules.
Sometimes a manufacturing sensor breaks and you can only get good readings on four of your six measurement spots on the assembly line. Perhaps one of the marks on the quality sheet is illegible. You could even be missing samples for an entire shift. Stuff happens.
Unfortunately, this can affect your statistical calculations. Certain procedures don’t handle missing values gracefully. We’re going to discuss a few ways to remove NA values in R, so that you can limit your calculations to rows which meet a certain standard of completeness. We also have a separate article that provides options for replacing NA values with zero.
Identifying missing values
We can test for the presence of missing values via the is.na() function.
test <- c(1, 2, 3, NA)
is.na(test)
In the example above, is.na() returns a logical vector (FALSE FALSE FALSE TRUE) indicating which elements are NA.
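Because the result is a logical vector, it can be counted and indexed. The following is a minimal sketch; the variable names and the sample readings are made up for illustration.

```r
# Hypothetical readings from an assembly line; NA marks a failed sensor
readings <- c(10.2, NA, 9.8, NA, 10.1)

missing <- is.na(readings)        # FALSE TRUE FALSE TRUE FALSE
n_missing <- sum(missing)         # TRUE counts as 1, so this is the number of NAs
which_missing <- which(missing)   # positions of the missing values

# On a data frame, is.na() returns a logical matrix;
# colSums() then counts the NAs in each column
line_data <- data.frame(spot1 = c(1.0, NA, 1.2), spot2 = c(NA, NA, 0.9))
na_per_column <- colSums(is.na(line_data))
```

A per-column count like this is a quick way to see whether the gaps are concentrated in one measurement spot before deciding how to handle them.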
na.omit() – remove rows with NA from a data frame
This is the easiest option. The na.omit() function returns the data frame without any rows that contain NA values.
completerecords <- na.omit(datacollected)
Passing your data frame through the na.omit() function is a simple way to purge incomplete records from your analysis. It is an efficient way to remove NA values in R.
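As a small worked example, the sketch below builds a hypothetical datacollected data frame (the columns and values are invented for illustration) and passes it through na.omit().

```r
# Hypothetical quality sheet: two of the four rows have a missing reading
datacollected <- data.frame(
  shift = c("A", "A", "B", "B"),
  spot1 = c(10.1, NA, 9.9, 10.0),
  spot2 = c(5.2, 5.1, NA, 5.0)
)

completerecords <- na.omit(datacollected)
nrow(completerecords)   # 2 rows remain

# na.omit() records the rows it dropped in the "na.action" attribute
dropped <- attr(completerecords, "na.action")
```

The na.action attribute gives you a record of what was removed, which is useful if you later need to report how many observations were excluded.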
complete.cases() – returns a logical vector flagging rows without NA values
The complete.cases() function allows a more detailed review and inspection of incomplete rows before you drop them. Dropping rows with na.omit() relies on the sweeping assumption that the dropped rows are similar to the typical member of the dataset.
This frequently doesn’t hold true in the real world. Continuing our example of a process improvement project, small gaps in record keeping can be a signal of broader inattention to how the machinery needs to operate. If good record-keeping by an operator is a sign of diligent management, we would expect better performance from other areas of the process as well. So removing the NA values in R might not be the right decision here. We should consider inspecting them to evaluate if other factors are at work.
We accomplish this with the complete.cases() function. This R function examines a data frame and returns a logical vector that is TRUE for each row with no missing values and FALSE for each row that contains one. Negating it selects the incomplete rows, so we can examine the dropped records and purge them if we wish.
fullrecords <- collecteddata[complete.cases(collecteddata), ]
droprecords <- collecteddata[!complete.cases(collecteddata), ]
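To illustrate the kind of inspection described here, the sketch below splits a hypothetical collecteddata data frame into complete and incomplete rows and checks whether the gaps cluster with one operator. The operator column and all values are assumptions made for the example.

```r
# Hypothetical records: both gaps come from the same operator
collecteddata <- data.frame(
  operator = c("ann", "bob", "ann", "bob"),
  spot1 = c(10.1, NA, 9.9, NA),
  spot2 = c(5.2, 5.1, 5.3, 5.0)
)

ok <- complete.cases(collecteddata)   # TRUE where a row has no missing values

fullrecords <- collecteddata[ok, ]    # rows to keep for analysis
droprecords <- collecteddata[!ok, ]   # rows to inspect before discarding

# Tabulate the incomplete rows by operator to look for a pattern
gaps_by_operator <- table(droprecords$operator)
```

If the incomplete rows turn out to share a shift, operator, or machine, that is a hint the data are not missing at random and simply dropping them may bias the results.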
Fix in place using na.rm
For certain statistical functions in R, you can guide the calculation around a missing value by including the na.rm parameter (na.rm = TRUE). The rows with NA values are retained in the data frame but excluded from the relevant calculations. Support for this parameter varies by package and function, so please check the documentation for your specific package.
This is often the best option if you find there are significant trends in the observations with NA values, because those rows stay in the data frame for further inspection. Use the na.rm parameter to guide your code around the missing values and proceed from there.
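A short sketch of na.rm with a few base R functions that support it (mean(), sum(), and colMeans()); the readings are made up for illustration.

```r
# Hypothetical readings with one missing value
readings <- c(10.2, NA, 9.8, 10.0)

mean(readings)                          # NA: one missing value propagates
avg <- mean(readings, na.rm = TRUE)     # mean of the three observed values
total <- sum(readings, na.rm = TRUE)    # sum of the three observed values

# The same parameter works column-wise on a data frame
line_data <- data.frame(spot1 = c(1, NA, 3), spot2 = c(4, 5, NA))
col_means <- colMeans(line_data, na.rm = TRUE)
```

Note that each column's mean is computed over its own observed values, so different columns may be based on different numbers of rows.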
NA values and regression analysis
Removal of missing values can distort a regression analysis. This is particularly true if you are working with higher order or more complicated models. Fortunately, there are several options in the common packages for working around these issues.
The lm() function includes an na.action argument. As part of defining your model, you can indicate how the regression function should handle missing values. Two possible choices are na.omit and na.exclude. na.omit removes the affected rows from the calculations. na.exclude also removes them from the fit, but pads the residuals and fitted values with NA in the positions of the dropped rows, so they stay aligned with the original data.
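The difference between the two options is easiest to see side by side. The sketch below fits the same model twice on a hypothetical data frame (column names and values are invented for illustration) and compares the residual vectors.

```r
# Hypothetical process data: rows 3 and 4 each have a missing value
line_data <- data.frame(
  temp = c(20, 21, NA, 23, 24, 25),
  output = c(5.1, 5.3, 5.6, NA, 6.0, 6.2)
)

fit_omit <- lm(output ~ temp, data = line_data, na.action = na.omit)
fit_excl <- lm(output ~ temp, data = line_data, na.action = na.exclude)

# Both models are fit on the same four complete rows,
# so the coefficients are identical
coef(fit_omit)
coef(fit_excl)

length(residuals(fit_omit))   # 4: only the rows used in the fit
length(residuals(fit_excl))   # 6: padded with NA to line up with line_data

# The padded residuals can be stored next to the original rows
line_data$resid <- residuals(fit_excl)
```

With na.omit, the last assignment would fail because the residual vector is shorter than the data frame, which is the practical reason to prefer na.exclude when you want to join model output back to the source data.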
For more information about handy functions for cleaning up data, check out our functions reference.