Data Cleanup: Remove NA rows in R

Real world data collection doesn’t always follow the rules.

Sometimes a manufacturing sensor breaks and you can only get good readings on four of your six measurement spots on the assembly line. Perhaps one of the marks on the quality sheet is illegible. You could even be missing samples for an entire shift. Stuff happens. Unfortunately, this can affect your statistical calculations.

In data analysis and machine learning, datasets with missing values are common. R represents a missing value as NA; NaN ("not a number") marks an undefined numeric result, and is.na() treats it as missing too. Many statistical functions and machine learning algorithms either fail or return NA when they encounter missing data, so you often need to remove the affected rows first. Doing so limits your calculations to the rows in your R dataframe that meet a certain standard of completeness. We also have a separate article that provides options for replacing NA values with zero.

This article covers several methods for removing rows with missing values in R, including na.omit(), complete.cases(), and is.na(), with sample code to help you implement them in your own projects.

Identifying Missing Values

We can test for the presence of missing data via the is.na() function. It checks every element of a vector, matrix, or dataframe and returns a logical result of the same shape: TRUE where the value is NA and FALSE otherwise. This tells you which cells in the original data are missing before you use any method to remove them or replace them with zeros.

# remove na in r - test for missing values (is.na example)
test <- c(1,2,3,NA)
is.na(test)

In the previous example, is.na() returns a logical vector (FALSE FALSE FALSE TRUE) indicating which elements are NA.
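
The same check works on a whole dataframe. The following is a minimal sketch, assuming a small made-up dataframe called readings (the name and columns are hypothetical), that counts the missing values per column and locates each missing cell:

```r
# Hypothetical sensor data; names and values are made up for illustration
readings <- data.frame(
  station = c("A", "B", "C", "D"),
  temp    = c(20.1, NA, 19.8, 21.3),
  load    = c(5, 7, NA, NA)
)

# Logical matrix with the same shape as the dataframe
na_map <- is.na(readings)

# Number of missing values in each column
na_per_column <- colSums(na_map)

# Row and column position of every missing cell
na_positions <- which(na_map, arr.ind = TRUE)

print(na_per_column)
print(na_positions)
```

A per-column count like this is a quick way to see whether the gaps are spread evenly or concentrated in one measurement.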

na.omit() – remove rows with NA from a dataframe

This is the easiest option. The na.omit() function returns a copy of your data without any rows that contain NA or NaN values. It is the simplest way to remove NA rows in the R programming language.

# remove na in r - remove rows - na.omit function / option
completerecords <- na.omit(datacollected)

Passing your data frame or matrix through the na.omit() function is a simple way to purge incomplete records from your analysis. It removes every row that contains an NA or NaN value.
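
As a rough illustration, here is a self-contained version with a made-up datacollected dataframe (the columns and values are assumptions for the example). It also reads the na.action attribute that na.omit() attaches to its result, which records the rows that were dropped:

```r
# Hypothetical readings from six measurement spots; two are missing
datacollected <- data.frame(
  spot    = 1:6,
  reading = c(10.2, NA, 9.8, 10.5, NA, 10.1)
)

# Keep only the rows with no missing values
completerecords <- na.omit(datacollected)

# na.omit() stores the indices of the removed rows as an attribute
dropped_rows <- as.integer(attr(completerecords, "na.action"))

print(nrow(completerecords))
print(dropped_rows)
```

Checking the dropped row indices before moving on is a cheap way to confirm that you removed what you expected.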

complete.cases() – returns a logical vector flagging rows without NA values

This allows you to perform more detailed review and inspection. The na.omit() function relies on the sweeping assumption that the dropped rows are similar to the typical member of the dataset, and are not total outliers.

This frequently doesn’t hold true in the real world. Continuing our example of a process improvement project, small gaps in record keeping can be a signal of broader inattention to how the machinery needs to operate. If good record keeping is a sign of diligent management, we would expect operators who keep complete records to perform better in other areas of the process as well. So removing the NA values in R might not be the right decision here. We should consider inspecting subsets of the original data to evaluate if other factors are at work.

We accomplish this with the complete.cases() function. This R function examines a dataframe and returns a logical vector with one entry per row: TRUE if the row has no missing values and FALSE if it has at least one. Negating the vector, or passing it to which() to get row numbers, lets you dig deeper into the other column values of the incomplete rows before you remove them from your project.

# na in R - complete.cases example
fullrecords <- collecteddata[complete.cases(collecteddata), ]
droprecords <- collecteddata[!complete.cases(collecteddata), ]
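
To make the inspection step concrete, the sketch below uses a made-up collecteddata dataframe (operator, shift, and output are hypothetical columns) to list the incomplete rows and check whether the gaps cluster in one shift:

```r
# Hypothetical production records; names and values are made up
collecteddata <- data.frame(
  operator = c("ann", "bob", "cy", "dee"),
  shift    = c("day", "night", "night", "night"),
  output   = c(101, NA, 98, NA)
)

# TRUE for rows with no missing values
is_complete <- complete.cases(collecteddata)

# Row numbers of the incomplete records
incomplete_rows <- which(!is_complete)

# Look at the other columns of those rows before deciding to drop them
print(collecteddata[incomplete_rows, ])

# Cross-tabulate shift against completeness to spot patterns
gaps_by_shift <- table(collecteddata$shift, is_complete)
print(gaps_by_shift)
```

If the missing rows pile up in one shift or under one operator, dropping them silently would hide exactly the signal described above.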

Fix in place using na.rm

For certain statistical functions in R, you can guide the calculation around a missing value by including the na.rm parameter (na.rm = TRUE). The NA values are retained in the dataframe but excluded from the relevant calculations. Support for this parameter varies by package and function in the R language, so please check the documentation for your specific package.

This is often the best option if you find there are significant trends in the observations with na values. Use the na.rm parameter to guide your code around the missing values and proceed from there. We prepared a guide to using na.rm.
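
A short sketch of the difference, using a made-up thickness vector (the name and values are assumptions for the example) and base R's mean() and sum(), both of which accept na.rm:

```r
# Hypothetical measurements with one missing reading
thickness <- c(2.1, 2.3, NA, 2.2)

# Without na.rm the missing value propagates into the result
mean_default <- mean(thickness)

# With na.rm = TRUE the NA is skipped for this calculation only
mean_skipped <- mean(thickness, na.rm = TRUE)
total        <- sum(thickness, na.rm = TRUE)

print(mean_default)
print(mean_skipped)

# The original vector is unchanged and still has four elements
print(length(thickness))
```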

Using the dplyr filter function

If you are using the dplyr R package, you can invoke the filter function, filter(), to keep only the rows that meet a specific condition. Unlike bracket-based subsetting in base R, filter() also drops rows where the condition evaluates to NA. Combined with is.na(), this is an efficient way to drop rows with NA values.
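
The contrast with bracket subsetting can be sketched as follows. The scores dataframe is made up for illustration, and the dplyr calls are wrapped in a requireNamespace() check on the assumption that the package may not be installed everywhere:

```r
# Hypothetical data with one missing value
scores <- data.frame(id = 1:4, value = c(5, NA, 12, 8))

# Base R: the NA comparison produces an extra all-NA row in the result
base_result <- scores[scores$value > 6, ]
print(base_result)

if (requireNamespace("dplyr", quietly = TRUE)) {
  # filter() keeps only rows where the condition is TRUE
  dplyr_result <- dplyr::filter(scores, value > 6)

  # Drop rows with a missing value in one specific column
  no_na_rows <- dplyr::filter(scores, !is.na(value))

  print(dplyr_result)
  print(no_na_rows)
}
```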

Using a dropna function

Inspired by the pandas dropna method in Python, there are several versions of a dropna function in various R packages. These include drop_na() in tidyr (part of the tidyverse) and DropNA() in the DataCombine package.
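
A minimal sketch of tidyr's drop_na(), assuming the tidyr package is installed (the samples dataframe is made up for illustration). Called with no column names it drops any row with a missing value; naming a column restricts the check to that column:

```r
# Hypothetical data with gaps in two different columns
samples <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

if (requireNamespace("tidyr", quietly = TRUE)) {
  # Drop rows with an NA in any column
  all_complete <- tidyr::drop_na(samples)

  # Drop rows only when column a is missing
  a_complete <- tidyr::drop_na(samples, a)

  print(all_complete)
  print(a_complete)
}
```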

NA Values and regression analysis

Removal of missing values can distort a regression analysis. This is particularly true if you are working with higher order or more complicated models. Fortunately, there are several options in the common packages for working around these issues.

If you are using the lm function, it includes an na.action option. As part of defining your model, you can indicate how the regression function should handle missing values. Two possible choices are na.omit and na.exclude. na.omit drops every row with a missing value from the calculations. The na.exclude option also leaves those rows out of the fit, but pads the residuals and fitted values with NA in the corresponding positions so that they still line up with the rows of the original data. This is often more convenient than procedures that delete rows before fitting.
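
The difference between the two options shows up in the length of the residual vector. Here is a small sketch with a made-up process dataframe (names and values are assumptions for the example):

```r
# Hypothetical process data; two responses are missing
process <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, NA, 6.2, 7.9, NA, 12.1)
)

fit_omit    <- lm(y ~ x, data = process, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = process, na.action = na.exclude)

# Both models are fitted on the same four complete rows
print(coef(fit_omit))
print(coef(fit_exclude))

# na.omit: residuals only for the rows that were used
res_omit <- residuals(fit_omit)

# na.exclude: residuals padded with NA to match the original rows
res_exclude <- residuals(fit_exclude)

print(length(res_omit))
print(length(res_exclude))
```

Because the padded residuals have one entry per original row, they can be attached back to the dataframe as a new column without any index bookkeeping.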

You also have the option of attempting to “heal” the data using custom procedures in your R code. In this situation, apply is.na() to the data set to generate a logical vector that identifies which rows need to be adjusted. From there, you can build your own “healing” logic.

Handling Missing Character Values

On the surface, handling missing values of character data is similar to handling missing values of numeric data. There are, however, a few quirks. Here are a few things to keep in mind:

Identify the missing values: In character data, missing values are often represented as blanks or empty strings. You can use the is.na() function to identify missing values in character data. However, note that is.na() only detects NA; an empty string ("") or a string of spaces is an ordinary value as far as R is concerned, so you need to test for those separately or convert them to NA first.
Decide on a method for handling missing values: There are several methods for handling missing values in character data. One common method is to replace missing values with a specific value, such as “Unknown” or “Not Applicable”. Another method is to impute missing values using a prediction algorithm or a statistical model.
Be very cautious when imputing missing values: Unlike numeric data values, character values are often categorical in nature (different names, people, etc.) and most trivial imputation methods (taking the mean or median, etc.) can be relied upon to give wildly inaccurate results. This is especially true in marketing and social sciences applications, where being unable to collect data about a factor is often a proxy for other useful signals about the observation. These are often best labeled “unknown” and treated as a group of their own.

Here’s an example of some code for converting missing character values into a group called “unknown”:

# Load the dataset
data <- read.csv("dataset.csv")

# Replace missing values with "Unknown"
data$character_variable[is.na(data$character_variable)] <- "Unknown"

In the code above, we first load the dataset. We then use the $ operator to select the character_variable column and the [ ] operator, together with is.na(), to pick out its missing entries and replace them with the string “Unknown”.

Note that this is just one example of how to handle missing values in character data using base R. The choice of method will depend on the specific dataset and research question being addressed.
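
Since is.na() does not flag blanks, one possible preliminary step is to convert empty or whitespace-only strings to NA. The sketch below uses a made-up notes vector; the commented read.csv() line shows the na.strings argument, which can do the same conversion for empty fields at load time (dataset.csv is the placeholder file name from the example above):

```r
# Hypothetical character data with blanks and one true NA
notes <- c("ok", "", "  ", NA, "rework")

# is.na() only sees the real NA
print(sum(is.na(notes)))

# Treat empty and whitespace-only strings as missing
notes_clean <- notes
is_blank <- !is.na(notes_clean) & trimws(notes_clean) == ""
notes_clean[is_blank] <- NA

print(sum(is.na(notes_clean)))

# Now the replacement catches all three missing entries
notes_clean[is.na(notes_clean)] <- "Unknown"
print(notes_clean)

# Alternative at load time (not run here):
# data <- read.csv("dataset.csv", na.strings = c("NA", ""))
```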

An Alternative Option – Imputation Methods

By the way, you don’t always have to drop records with missing data. In some cases, it may be acceptable to fill in the missing values in a dataset with estimated values. This is referred to as imputation. Several imputation methods can be implemented in base R, including mean imputation, median imputation, and regression imputation. The choice of imputation method depends on the nature of the data and the research question being addressed.

Here’s an example of how to fill in missing values using mean imputation in R using base R:

# Load the dataset
data <- read.csv("dataset.csv")

# Check for missing values
sum(is.na(data))

# Impute missing values using mean imputation
data_imputed <- data
for (i in seq_len(ncol(data))) {
  # is.numeric() also covers integer columns, which read.csv() creates for whole numbers
  if (is.numeric(data[[i]])) {
    data_imputed[is.na(data[[i]]), i] <- mean(data[[i]], na.rm = TRUE)
  }
}

# Check for missing values in the imputed dataset
sum(is.na(data_imputed))

In the code above, we first load the dataset and check for missing values using the is.na() function.

Next, we use a for loop to iterate over each column in the dataset. If the column contains numeric data, we impute missing values using mean imputation. The is.na() function is used to identify missing values, and the mean() function is used to calculate the mean of the non-missing values.

Finally, we count the missing values that remain in the imputed dataset using the is.na() function. If every column is numeric, the count should be zero. Any NA values in non-numeric columns are left untouched by the loop and will still be counted.

Note that this is just one example of how to use imputation methods in base R. There are many other imputation methods available, and the choice of method will depend on the specific dataset and research question being addressed. Additionally, there are also many packages available in R that provide more advanced imputation methods.
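
As one alternative, median imputation is less sensitive to extreme values than the mean. The sketch below is an illustration only: impute_median is a hypothetical helper written for this example, and the measurements dataframe is made up.

```r
# Hypothetical helper: fill NAs in a numeric vector with its median
impute_median <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  }
  x  # non-numeric vectors are returned unchanged
}

# Made-up data with one extreme value and one text column
measurements <- data.frame(
  width = c(4.0, NA, 4.4, 10.0),
  label = c("a", NA, "c", "d")
)

# Apply the helper to every column, keeping the dataframe structure
measurements_imputed <- measurements
measurements_imputed[] <- lapply(measurements, impute_median)

print(measurements_imputed)
```

Here the missing width is filled with 4.4 (the median of the observed values) rather than being pulled toward the outlier, and the missing label is left as NA for separate handling.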

Frequently Asked Questions

R Function for Excluding Rows with Missing Values

The na.omit() function serves to filter out any rows in a dataset that include missing values, denoted as NA, ensuring a cleaner data analysis process.

Utilizing na.omit for Data Cleaning in R

To exclude rows with missing data in a dataframe, one can apply na.omit(your_dataframe). This command filters out incomplete cases, simplifying subsequent data analysis tasks.

Tidyverse Approach for Omitting NA Rows

Within the tidyverse collection of R packages, filter() from dplyr combined with is.na() allows for precise removal of rows that contain missing values, enhancing data tidiness.

Handling NA Values in R Vectors

It is possible to purge NA values from a vector without altering other elements by using logical indexing, for example: your_vector[!is.na(your_vector)].

Substituting NA Values in a Dataframe

To replace NA values with zeros or another specified constant in a dataframe, one could employ the replace() function or write code such as: your_dataframe[is.na(your_dataframe)] <- 0.
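
Both approaches are sketched below on a made-up counts dataframe, assuming all columns are numeric (the names and values are hypothetical):

```r
# Hypothetical defect counts with missing entries
counts <- data.frame(defects = c(0, NA, 3), rework = c(NA, 1, 2))

# Whole dataframe: replace every NA with zero
counts_zero <- counts
counts_zero[is.na(counts_zero)] <- 0

# Single column: replace() returns a modified copy of the vector
counts_one_col <- counts
counts_one_col$defects <- replace(counts$defects, is.na(counts$defects), 0)

print(counts_zero)
print(counts_one_col)
```

The column-wise form is the safer choice when zero is a sensible default for some variables but not for others.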

Differentiating na.rm and na.omit in R

The na.rm parameter, often found in descriptive statistics functions, excludes NA values from a single calculation without changing the data, whereas na.omit() returns a new dataframe, matrix, or vector with the missing entries (or the rows containing them) removed.

For more information about handy functions for cleaning up data (beyond ways to remove na in r), check out our functions reference, data science articles, and general tutorial.