Real-world data collection doesn’t always follow the rules.
Sometimes a manufacturing sensor breaks and you can only get good readings on four of your six measurement spots on the assembly line. Perhaps one of the marks on the quality sheet is illegible. You could even be missing samples for an entire shift. Stuff happens.
Unfortunately, this can affect your statistical calculations. Certain procedures don’t handle missing data gracefully. We’re going to discuss a few ways to remove NA values (R’s marker for missing data) in R programming. This allows you to limit your calculations to rows in your R data frame which meet a certain standard of completeness. We also have a separate article that provides options for replacing NA values with zero.
Identifying Missing Values
We can test for the presence of missing data via the is.na() function. It checks every element of a vector, matrix, or data frame and returns a logical result of the same shape, with TRUE wherever the value is NA and FALSE everywhere else. This tells you which cells in the original data are missing before you use any method to remove them or replace them with zeros.
# remove na in r - test for missing values (is.na example)
test <- c(1,2,3,NA)
is.na(test)
In the previous example, is.na() returns the logical vector FALSE FALSE FALSE TRUE, flagging the fourth element as the one with an NA value.
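The same check works on a whole data frame. The sketch below uses a small made-up data frame (the name readings and its columns are hypothetical, not from a real data set) to count missing values per column and locate the affected cells.

```r
# Hypothetical example data: two measurement spots on the assembly line
readings <- data.frame(
  spot_a = c(10.1, NA, 9.8, 10.3),
  spot_b = c(5.2, 5.1, NA, NA)
)

# is.na() on a data frame returns a logical matrix of the same shape
missing_mask <- is.na(readings)

# Count missing values per column and overall
na_per_column <- colSums(missing_mask)
na_total <- sum(missing_mask)

# Row and column position of every missing cell
na_positions <- which(missing_mask, arr.ind = TRUE)

print(na_per_column)
print(na_positions)
```

Counting before removing gives you a quick sense of how much data a cleanup step would cost.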
na.omit() – remove rows with NA from a data frame
This is the easiest option. The na.omit() function returns a copy of your data frame or matrix without any rows that contain NA values. For most cleanup jobs, it is the quickest way to remove NA rows in the R programming language.
# remove na in r - remove rows - na.omit function / option
completerecords <- na.omit(datacollected)
Passing your data frame or matrix through the na.omit() function is a simple way to purge incomplete records from your analysis. It is an efficient way to remove NA values in R.
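As a minimal sketch of what this looks like in practice, the example below builds a small made-up version of datacollected (the column names and values are assumptions for illustration) and runs it through na.omit(). The result keeps a record of which rows were dropped in its "na.action" attribute.

```r
# Hypothetical stand-in for the datacollected data frame
datacollected <- data.frame(
  shift    = c("A", "B", "C", "D"),
  temp     = c(70.2, NA, 71.5, 69.9),
  pressure = c(30.1, 29.8, NA, 30.4)
)

# Drop every row that has at least one NA
completerecords <- na.omit(datacollected)

# na.omit() records the indices of the removed rows as an attribute
dropped_rows <- as.integer(attr(completerecords, "na.action"))

print(completerecords)
print(dropped_rows)
```

Checking the dropped row indices is a cheap way to confirm you are not discarding more data than you expected.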
complete.cases() – returns logical vector flagging rows without NA values
This allows you to perform more detailed review and inspection. The na.omit() function relies on the sweeping assumption that the rows it drops are similar to the typical member of the dataset, and are not total outliers.
This frequently doesn’t hold true in the real world. Continuing our example of a process improvement project, small gaps in record keeping can be a signal of broader inattention to how the machinery needs to operate. If good record-keeping is a sign of diligent management, we would expect the well-documented parts of the process to perform better than the ones with gaps, so the rows with missing values may differ systematically from the rest. Removing the NA values in R might not be the right decision here. We should consider inspecting subsets of the original data to evaluate if other factors are at work.
We accomplish this with the complete.cases() function. This R function examines a data frame and returns a logical vector with one entry per row: TRUE for rows with no missing values and FALSE for rows that contain at least one NA. We can use it to split off the incomplete records, examine them, and purge them if we wish.
# na in R - complete.cases example
# complete.cases() is TRUE for rows with no missing values
fullrecords <- collecteddata[complete.cases(collecteddata), ]
droprecords <- collecteddata[!complete.cases(collecteddata), ]
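To make the split concrete, here is a small sketch using a made-up collecteddata data frame (the operator, temp, and defects columns are hypothetical). It also shows one possible refinement: checking completeness on a subset of columns only.

```r
# Hypothetical stand-in for the collecteddata data frame
collecteddata <- data.frame(
  operator = c("Ann", "Bo", "Cy", "Di"),
  temp     = c(70.2, NA, 71.5, 69.9),
  defects  = c(1, 4, NA, 0)
)

# One TRUE/FALSE per row: TRUE means the row has no NA
is_complete <- complete.cases(collecteddata)

# Note the trailing comma: we are selecting rows, not columns
fullrecords <- collecteddata[is_complete, ]
droprecords <- collecteddata[!is_complete, ]

# Check completeness on selected columns only (here: temp)
temp_known <- collecteddata[complete.cases(collecteddata[, "temp", drop = FALSE]), ]

print(droprecords)
```

Looking at droprecords before deleting anything lets you see whether the gaps cluster around a particular operator, shift, or machine.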
Fix in place using na.rm
For certain statistical functions in R, you can guide the calculation around a missing value by including the na.rm parameter (na.rm = TRUE). The rows with NA values are retained in the data frame but excluded from the relevant calculations. Support for this parameter varies by package and function in the R language, so please check the documentation for your specific package.
This is often the best option if you find there are significant trends in the observations with na values. Use the na.rm parameter to guide your code around the missing values and proceed from there. We prepared a guide to using na.rm.
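A short sketch of the difference, using a made-up vector of readings: without na.rm the result is NA, and with na.rm = TRUE the missing value is skipped while the original data stays untouched.

```r
# Hypothetical sensor readings with one missing value
readings <- c(10.1, NA, 9.8, 10.3)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(readings)

# With na.rm = TRUE, the NA is skipped for this calculation only
mean_clean <- mean(readings, na.rm = TRUE)
sum_clean  <- sum(readings, na.rm = TRUE)

print(mean_with_na)
print(mean_clean)

# The original vector still contains the NA
print(anyNA(readings))
```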
NA Values and regression analysis
Removal of missing values can distort a regression analysis. This is particularly true if you are working with higher order or more complicated models. Fortunately, there are several options in the common packages for working around these issues.
If you are using the lm function, it includes a na.action option. As part of defining your model, you can indicate how the regression function should handle missing values. Two possible choices are na.omit and na.exclude. na.omit will omit all rows that contain missing values from the calculations. The na.exclude option also leaves those rows out of the fit, but pads the residuals and fitted values with NA at the corresponding positions so that they stay aligned with the rows of the original data. This is often more effective than procedures that delete rows from the calculations.
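The difference between the two options shows up in the residuals. The sketch below fits the same model both ways on a small made-up data frame (process, speed, and output are hypothetical names and values).

```r
# Hypothetical process data with one missing response value
process <- data.frame(
  speed  = c(1, 2, 3, 4, 5, 6),
  output = c(2.1, 3.9, NA, 8.2, 9.8, 12.1)
)

fit_omit    <- lm(output ~ speed, data = process, na.action = na.omit)
fit_exclude <- lm(output ~ speed, data = process, na.action = na.exclude)

# Both fits use the same five complete rows, so the coefficients match
print(coef(fit_omit))
print(coef(fit_exclude))

# na.omit: residuals only for the rows used in the fit
res_omit <- residuals(fit_omit)

# na.exclude: residuals padded with NA, one per row of the original data
res_exclude <- residuals(fit_exclude)

print(length(res_omit))
print(length(res_exclude))
```

Because the padded residuals line up with the original rows, you can attach them back to the data frame as a new column without any index bookkeeping.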
You also have the option of attempting to “heal” the data using custom procedures. In this situation, apply is.na() to the data set to generate a logical vector that identifies which values need to be adjusted. From there, you can build your own “healing” logic.
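As one possible illustration of such healing logic, the sketch below replaces missing values in a single column with that column's mean. Mean replacement is an assumption chosen for brevity, not a recommendation, and the readings data frame is made up; the right rule depends on your process.

```r
# Hypothetical readings with a gap in spot_a
readings <- data.frame(
  spot_a = c(10.1, NA, 9.8, 10.3),
  spot_b = c(5.2, 5.1, NA, 5.3)
)

# Logical vector marking which spot_a values need to be adjusted
needs_fix <- is.na(readings$spot_a)

# Work on a copy so the original data stays available for comparison
healed <- readings

# Assumed healing rule: fill gaps with the mean of the observed values
healed$spot_a[needs_fix] <- mean(readings$spot_a, na.rm = TRUE)

print(healed)
```

Keeping needs_fix around lets you flag which values were filled in, so later analysis can tell measured readings from healed ones.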