Perhaps you’ve seen it. (Or maybe you haven’t?) Or worse, the dang thing is sitting on your console right now! You’re running an R script and a little warning message, “NAs introduced by coercion”, appears in your console log. Since it’s a warning message, not an error message, the program will finish executing.
But the doubt it creates remains… is your data correct?
Fear not, dearest reader… we shall explain why this coercion warning message appeared and how to keep it under control.
Why You Have a Coercion Warning Message
The warning is the result of attempting a data type conversion that cannot produce a valid value. The *most* common version of this problem occurs when you’re trying to convert character string data into numeric data.
Faced with the string “55”, a proper data type conversion to numeric should generate the number 55. From there, you can perform any mathematical operation that expects a number (adding, dividing, etc.).
The problem occurs if we’re faced with a string that can’t be read as a number. For example, a lone dash “-” used as a placeholder, or a value with a stray character in it such as “1,234”. Neither is a valid number, so the conversion fails for that value. The good news is that R handles this specific data type conversion failure fairly elegantly: instead of stopping, it puts a missing value (NA) in that position and raises the warning. The bad news? You’ve got missing values... which could stand in for real data that needs to be part of your analysis.
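To see the warning in action, here is a minimal sketch using base R’s as.numeric(). The values in raw are made up for illustration; they are not from any real dataset.

```r
# Hypothetical input: two clean numbers, a placeholder dash, and a value
# containing a thousands separator.
raw <- c("55", "3.14", "-", "1,234")

# as.numeric() converts what it can and returns NA for the rest,
# raising the warning "NAs introduced by coercion".
converted <- as.numeric(raw)

# TRUE marks the positions that failed to convert.
failed <- is.na(converted)
print(failed)
```

The first two values convert normally; the dash and the comma-formatted value both come back as NA.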
Solution: Look for the format error
The good news is that the source of this error is generally pretty easy to track down, once you’re aware of the cause. Look for where you convert data. Focus on any string (or similar) data type value / variable that isn’t likely to convert cleanly.
Potential sources include failing to parse a column properly, bad column names or layout specifications, and formatting errors in the character string data (a thousands separator, for example). Let the comma hunt begin!
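One way to run that hunt is to compare the data before and after conversion: anything that was present as text but is NA afterwards is a culprit. The sketch below assumes a character vector named raw with invented values; swap in your own column.

```r
# Hypothetical column as read from a file; the last entry is genuinely missing.
raw <- c("55", "1,234", "n/a", "12.5", NA)

# Convert quietly here, since we are about to inspect the failures ourselves.
converted <- suppressWarnings(as.numeric(raw))

# Present before conversion, missing after: these are the problem strings.
bad <- is.na(converted) & !is.na(raw)
offenders <- unique(raw[bad])
print(offenders)

# If the culprit is a thousands separator, strip it before converting.
fixed <- as.numeric(gsub(",", "", "1,234", fixed = TRUE))
print(fixed)
```

Listing the distinct offenders usually makes the pattern obvious (commas, text placeholders, stray symbols), and each pattern can then be cleaned up before the conversion step.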
Other Solutions:
There’s also the Chuck Norris approach: drive right through the warning, using the suppressWarnings() function. If Chuck doesn’t care about a warning message, why should you?
As a general rule, this actually works very well if you’re cranking through a large dataset of what is basically “raw text data” from a bulk source such as a phonebook or government list. Give it a quick check to ensure you’re not systematically ignoring any key variables (or types of record) and power through the occasional bad data point. I used to do this for raw address data: over 99% of consumer addresses follow a standard format… the remaining 1% wasn’t worth the code to unpack and process them (for a small team). The approach of generally suppressing warnings was an easy way to handle this.
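As a rough sketch of that approach, wrap only the conversion call in suppressWarnings() rather than the whole script, so other warnings still reach you. The values below are invented for illustration.

```r
# Hypothetical raw values with one entry that will not convert.
raw <- c("10", "20", "unknown")

# Silence the coercion warning for this one call only.
values <- suppressWarnings(as.numeric(raw))

# The bad entry is still NA, so downstream code has to skip it explicitly.
avg <- mean(values, na.rm = TRUE)
print(avg)
```

Note that suppressing the warning does not repair anything: the NA is still in the data, and functions like mean() need na.rm = TRUE to work around it.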
Just make sure you’re not systematically biased in terms of which data points go missing and that you’re not losing enough data to put the overall integrity of your findings at risk. Spot-checking a sample of the data helps.
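A quick way to check both concerns is to measure how much data the conversion lost overall and whether the losses cluster in one kind of record. This is only a sketch: the data frame, its source and amount columns, and the values in them are all hypothetical.

```r
# Hypothetical data: an amount column stored as text, plus a grouping column.
df <- data.frame(
  source = c("web", "web", "mail", "mail", "mail"),
  amount = c("10", "20", "n/a", "-", "30"),
  stringsAsFactors = FALSE
)

df$amount_num <- suppressWarnings(as.numeric(df$amount))

# Rows where a value existed as text but was lost in conversion.
lost <- is.na(df$amount_num) & !is.na(df$amount)

# Overall share of values lost.
share_lost <- mean(lost)
print(share_lost)

# Share lost within each source; a lopsided result suggests systematic bias.
lost_by_source <- tapply(lost, df$source, mean)
print(lost_by_source)
```

In this made-up example every lost value comes from the “mail” records, which is exactly the kind of pattern that should stop you from powering through.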