The case for complete cases
A fair amount of my career has focused on quality engineering and process improvement. We tend to look at a lot of tick sheets and similar ad-hoc data collection efforts in that line of work, and it is not uncommon that a few values are... missing. Since many statistical procedures depend on a complete and/or balanced data set, you've got a decision to make about fixing or dropping records with missing values.
One option is simply setting missing values to zero. While this is a valid approach for certain studies, it can create additional problems: a filled-in zero is indistinguishable from a real measurement of zero, and it will skew averages and totals. There is also often signal value in the missing data itself. A sloppy process operator will generally do a poor job of collecting quality samples. The converse is often true – rigorous records can indicate high attention to detail by the operating team. We often want to split the data into records with complete cases (all values present) and records with missing values (the rows for which R's complete.cases() returns FALSE).
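To make the zero-fill problem concrete, here is a minimal sketch. The defects vector is made-up illustration data, not taken from a real tick sheet; it only shows how imputed zeros pull an average down and how the count of missing entries can be kept as its own signal.

```r
# Hypothetical tick-sheet readings; NA marks a sample that was never recorded
defects <- c(4, 6, NA, 5, NA)

# Average of the values that were actually recorded
recorded_mean <- mean(defects, na.rm = TRUE)   # 5

# Same data with missing values replaced by zero
zero_filled <- ifelse(is.na(defects), 0, defects)
zero_filled_mean <- mean(zero_filled)          # 3

# Keep the signal instead of hiding it: how many samples were skipped?
missing_count <- sum(is.na(defects))           # 2
```

The zero-filled mean understates the defect rate, and the fact that two of five samples were never taken disappears from the data entirely.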
We can accomplish this using the complete.cases() function.
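complete.cases() – get a logical vector flagging rows with no NA values
The complete.cases() function examines a data frame and returns a logical vector with one entry per row: TRUE where the row has no missing values, FALSE where at least one value is NA. Using that vector as a row index (note the trailing comma) splits the data, so we can examine the incomplete records and purge them if we wish.
complete_records <- sampledata[complete.cases(sampledata), ]
partial_records <- sampledata[!complete.cases(sampledata), ]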
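As a small worked example of the split, the sketch below builds a made-up sampledata data frame (the shift, temp, and weight columns are hypothetical and purely illustrative) and separates complete rows from incomplete ones.

```r
# Hypothetical sample data: two measurements per shift, some left blank
sampledata <- data.frame(
  shift  = c("A", "B", "C", "D"),
  temp   = c(71.2, NA, 70.8, 69.9),
  weight = c(10.1, 10.3, NA, 10.0)
)

# One logical value per row: TRUE only when every column has a value
is_complete <- complete.cases(sampledata)   # TRUE for A and D, FALSE for B and C

# Row indexing with the logical vector; the trailing comma keeps all columns
complete_records <- sampledata[is_complete, ]
partial_records  <- sampledata[!is_complete, ]

# How many records fall on each side of the split
n_complete <- nrow(complete_records)
n_partial  <- nrow(partial_records)
```

Storing the result of complete.cases() once and negating it for the second subset avoids scanning the data frame twice and guarantees the two subsets are exact complements.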
Need more tips on cleaning up and manipulating data? Check out our tips page.