The Case for Complete Cases
A fair amount of my career has focused on quality engineering and process improvement. In that line of work we look at a lot of tick sheets and similar ad hoc data collection efforts, and it is not uncommon for a few values to be... missing. Since many statistical procedures depend on a complete or balanced data set, you’ve got a decision to make about fixing or dropping records with missing values.
One option is simply setting missing values to zero. While this is a valid approach for certain studies, it can create additional problems: a zero is treated as a real measurement, so it distorts averages and hides the fact that the value was never recorded. There is also often signal value in the missing data. A sloppy process operator will generally do a poor job of collecting quality samples. The converse is often true: rigorous records can indicate high attention to detail by the operating team. We often want to split the data into complete cases (rows with all values present) and incomplete cases (rows with at least one missing value).
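As a rough illustration of the zero-fill problem, here is a minimal sketch using made-up tick-sheet readings (the `readings` vector and its values are hypothetical, not taken from a real study). It compares the average after zero-filling with the average of the observed values, and counts the gaps so that the missingness itself is not lost.

```r
# Hypothetical tick-sheet readings; NA marks a sample the operator skipped
readings <- c(10.2, 9.8, NA, 10.1, NA, 9.9)

# Replacing NA with zero treats the gaps as real measurements
# and pulls the average toward zero
zero_filled <- ifelse(is.na(readings), 0, readings)
mean(zero_filled)

# Ignoring the missing values keeps the average near the observed readings
mean(readings, na.rm = TRUE)

# Counting the gaps preserves the signal that samples were skipped
sum(is.na(readings))
```

Whether zero is a sensible stand-in depends on the study; for a physical measurement like the one assumed here, it usually is not.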
We can accomplish this using the complete.cases() function.
complete.cases in R – Get a Logical Vector of Rows Without NA Values
Missing or NA values can cause a whole world of trouble, messing up anything you might do with your data. complete.cases() in R will help change that.
The complete.cases() function examines a data frame and returns a logical vector with one element per row: TRUE for rows with no missing values (complete cases) and FALSE for rows with at least one NA (incomplete cases). Negating that vector with ! selects the incomplete rows, so we can examine those records and purge them if we wish.
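# Rows with no NA values (the trailing comma selects rows, not columns)
complete_records <- sampledata[complete.cases(sampledata), ]
# Rows with at least one NA value
partial_records <- sampledata[!complete.cases(sampledata), ]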
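To make the split concrete, the sketch below runs it on a small invented data frame. The column names and values are assumptions for illustration only, and the result names differ from the ones above so the two snippets can be compared side by side.

```r
# Hypothetical inspection data: four operators, two measurements each
sampledata <- data.frame(
  operator = c("A", "B", "C", "D"),
  width    = c(5.1, NA, 4.9, 5.0),
  height   = c(2.0, 2.1, NA, 1.9)
)

# One logical value per row: TRUE where every column has a value
is_complete <- complete.cases(sampledata)
is_complete

# Row subsetting needs the trailing comma; without it R would try
# to select columns instead of rows
complete_rows   <- sampledata[is_complete, ]
incomplete_rows <- sampledata[!is_complete, ]

complete_rows$operator    # operators with full records
incomplete_rows$operator  # operators with at least one gap
```

Storing the logical vector once and reusing it avoids scanning the data frame twice and keeps the two subsets consistent with each other.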
This technique lets us inspect the rows with NA values and then exclude them, either by subsetting or with the na.omit() function, or find an alternate way of dealing with the missing values. Using complete.cases() in R, we can clean up our data and make it easier to carry out statistical calculations like finding the standard deviation or building a confidence interval. Finding complete cases is a breeze, and yet another invaluable skill for any good R programmer.
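Here is a small sketch of that cleanup step, again on invented numbers (the `measurements` data frame is hypothetical). It drops incomplete rows with na.omit(), then computes a standard deviation and a 95% confidence interval for the mean on what remains. Using t.test() for the interval is one common choice, not the only one.

```r
# Hypothetical measurements with a couple of gaps
measurements <- data.frame(
  width  = c(5.1, NA, 4.9, 5.0, 5.2),
  height = c(2.0, 2.1, NA, 1.9, 2.2)
)

# na.omit() drops every row that contains at least one NA
cleaned <- na.omit(measurements)
nrow(cleaned)

# Same rows as subsetting with complete.cases()
same_rows <- measurements[complete.cases(measurements), ]

# Summary statistics now run without needing na.rm = TRUE
sd(cleaned$width)

# 95% confidence interval for the mean width, based on the t distribution
t.test(cleaned$width)$conf.int
```

With only a handful of complete rows left, the interval will be wide, which is a useful reminder of how much information the dropped records took with them.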
Examples For Common Uses
You’re going to use complete case analysis to check an R data frame for missing data. This is commonly done to clean up a data set before a linear regression or logistic regression study. Missing observations can create biased estimates under many analysis methods, because a regression model or imputation model can only reflect the values that were actually observed in the original data.
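A minimal sketch of complete case analysis ahead of a linear regression follows. The `process` data frame, its column names, and its values are all made up for illustration. It also shows that, under R's default na.action setting, lm() drops incomplete rows on its own, so filtering explicitly mainly serves to make the choice visible.

```r
# Hypothetical process data with two missing yield readings
process <- data.frame(
  temperature = c(150, 160, 170, 180, 190, 200),
  yield       = c(61, NA, 68, 72, NA, 80)
)

# Keep only rows where both variables were recorded
complete_process <- process[complete.cases(process), ]

# Fit on the complete cases
fit <- lm(yield ~ temperature, data = complete_process)
nobs(fit)  # number of rows actually used in the fit

# With the default na.action (na.omit), lm() silently drops the same rows
fit_default <- lm(yield ~ temperature, data = process)
all.equal(coef(fit), coef(fit_default))
```

Checking nobs() against nrow() of the original data is a quick way to see how many records a model quietly discarded.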
What makes this complicated is that there is often usable information in records with NA values. For example, if you’re studying industrial failures, missing data in the maintenance dataset may indicate neglect. Similar insights can come from longitudinal data. There is no easy path to an unbiased estimate: you’ve got to decide what role missing values vs. complete data will play in a data analysis.
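One way to keep that information, sketched here on an invented maintenance table, is to record missingness as its own column before deciding what to drop. The `maintenance` data frame, its columns, and the `record_incomplete` flag are hypothetical names chosen for this example.

```r
# Hypothetical maintenance log; NA means no inspection was recorded
maintenance <- data.frame(
  machine               = c("M1", "M2", "M3", "M4", "M5", "M6"),
  days_since_inspection = c(12, NA, 30, NA, 8, NA),
  failed                = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Flag incomplete records instead of discarding them right away
maintenance$record_incomplete <- !complete.cases(maintenance)

# Compare the failure rate for complete vs. incomplete records
failure_rates <- tapply(maintenance$failed, maintenance$record_incomplete, mean)
failure_rates
```

In this toy data the machines with gaps in their records fail more often, which is the kind of pattern that would be invisible after a blanket na.omit(). Real data may or may not show it, so the comparison is worth running before choosing a strategy.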