Anyone who’s worked with statistics can testify to the fact that raw numbers can be deceptive. On one hand, statistics are one of the best ways to form an objective view of any given situation. But on the other, statistical noise can prompt misguided interpretations. One of the best examples of this phenomenon is something that can occur when we average out a set of values. If we have one data point on an extreme end it can massively raise or lower the group average. For example, consider an exam with 1,000 questions.
Ten people take the test, but it turns out that it’s in a language nine of them don’t even speak. The tenth both speaks the language and is an expert in the subject, and he scores 1,000 out of 1,000. The other nine don’t answer a single question and score 0. Averaging those values gives a group mean of 100 correct answers, which is a wildly misleading impression of the group’s ability to work with the subject matter. It overstates what nine of the test takers know and drastically understates what the expert knows.
This is one of those times when a statistical summary can differ dramatically from what actually happened, and it’s why special techniques are used to compensate. One of the most common and effective of these techniques is known as a trimmed or truncated mean. As the name suggests, we trim a fixed share of the smallest and largest values to remove outliers from a sample or probability distribution. The remaining values are then used to calculate the mean. In effect, this is a simple form of weighting: the trimmed values get a weight of zero, and every remaining value is weighted equally.
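To make the exam example concrete, here is a minimal sketch in R, the language used later in this article. The variable names are made up for illustration, and the trim argument it relies on is explained in more detail further down.

```r
# Exam scores from the example: one expert scores 1000, nine people score 0
scores <- c(1000, rep(0, 9))

# Plain mean: (1000 + 9 * 0) / 10
plainMean <- mean(scores)

# 10% trimmed mean: with 10 scores, one value is dropped from each end,
# so one of the zeros and the 1000 are removed before averaging
trimmedMean <- mean(scores, trim = 0.1)

print(plainMean)    # 100
print(trimmedMean)  # 0
```

The plain mean of 100 describes nobody in the group, while the trimmed mean of 0 matches what nine of the ten test takers actually scored.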
The prior example is a little hyperbolic by design. But in real-world situations, uncharacteristic extreme values often do lead to improper interpretations. This is especially common when working with smaller sample sizes during scientific studies, and the issue also comes up frequently in finance and investing, where trimming is widely used. For example, a trimmed mean inflation rate is a common way to look at inflation, because it gives a more realistic picture than an average that includes the most extreme price changes.
Of course, the prior examples are only the tip of the iceberg. Trimmed mean calculations are a part of a wide variety of different fields. And we can even implement it in R to quickly sort through our data. But we need to look at the specifics of trimmed mean calculations before implementing them in code.
We touched on the idea of a trimmed mean as what we get after truncating a sample. But up until this point, the specific details of what’s trimmed have been kept general for a good reason. There’s no single, fixed percentage for trimmed means. People instead refer to a trimmed mean by the percentage that was removed from each end. For example, if a set had the top 10% and the bottom 10% of its values removed before calculating the mean, then the result would be a 10% trimmed mean.
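As a rough illustration of how the chosen percentage changes the result, the sketch below computes several trimmed means of the same made-up sample in R. The values in x and the names trims and results are assumptions for this example only.

```r
# Made-up sample of 10 values with one obvious outlier (100)
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 100)

# Fractions to trim from EACH end: 0 is the plain mean, and in R a trim
# of 0.5 (or more) returns the median
trims <- c(0, 0.1, 0.2, 0.5)

# Compute one trimmed mean per fraction
results <- sapply(trims, function(p) mean(x, trim = p))
names(results) <- paste0(trims * 100, "%")

print(results)
```

With this sample the plain mean is 16.3, the 10% trimmed mean is 7.625, and both the 20% trimmed mean and the median are 7.5. A single outlier accounts for the whole gap between the first and second figures.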
In short, a mean is the sum of all data points divided by the total instances or observations. And a trimmed mean results from trimming x percent of the top and bottom extremes of those data points. With that in mind, how would we go about implementing the concept in R?
Implementing the Trimmed Mean in R
You might be readying yourself to create a trimming function by hand. But R’s developers have us covered. The language has trimming built into the mean function through its trim argument. Take a look at the following code to see just how easy it is to find a trimmed mean in R.
ourData <- c(1,500,500,500,500,500,500,500,500,500,500,500,500,500,500,1000000)
# trim=0.1 removes 10% of the observations from each end before averaging
tMean <- mean(ourData, trim=0.1)
print(tMean)
We begin by defining a vector called ourData. Most of its 16 values are 500, but a 1 and a 1,000,000 sit at the extremes as outliers. Against that backdrop, it’s easy to see which values should be removed when trimming, and the next two lines show how well that works. We run R’s mean on ourData and supply a trim value of 0.1 to indicate 10%. R rounds the number of trimmed values down, so 10% of 16 observations means one value is dropped from each end: the 1 and the 1,000,000. The calculated mean is then assigned to tMean. Finally, we print tMean to the screen. The result shows that the mean is indeed 500 when trimmed.
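To see what the trim argument is doing, it can help to write the same calculation out by hand. The helper below, manualTrimmedMean, is a hypothetical function written for this illustration and is not part of R. It is only a sketch that assumes a numeric vector with no missing values and a trim fraction between 0 and 0.5.

```r
# Hypothetical helper for illustration: a hand-rolled trimmed mean.
# Assumes x is numeric with no NA values and 0 <= trim < 0.5.
manualTrimmedMean <- function(x, trim = 0.1) {
  n <- length(x)
  k <- floor(n * trim)             # number of values to drop from each end
  sorted <- sort(x)                # order values so the extremes sit at the ends
  kept <- sorted[(k + 1):(n - k)]  # keep only the middle portion
  mean(kept)
}

ourData <- c(1, rep(500, 14), 1000000)

manualResult <- manualTrimmedMean(ourData, trim = 0.1)
builtInResult <- mean(ourData, trim = 0.1)

print(manualResult)   # 500
print(builtInResult)  # 500
```

Both approaches agree here. In practice the built-in mean is the better choice, since it also handles details such as the na.rm argument.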
However, there is an important point to keep in mind. When we’re working with trimmed means, we often want to compare results or work with multiple sets of data at once. Fortunately, we can accomplish that with just a few tweaks to our earlier code. Take a look at the following.
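# Illustrative values (assumed): mostly 500, with a 1 and a 1,000,000 as outliers
vals <- c(1, rep(500, 14), 1000000)
ourData <- data.frame(study1 = vals, study2 = vals * 2, study3 = vals)
tMean <- sapply(ourData[c("study1", "study3")], function(x) mean(x, trim=0.1))
print(tMean)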
In this example, we’re presented with a situation where we want to calculate the trimmed mean of selected portions of a larger dataset. We define ourData as a data frame containing three separate columns. Then we pass the study1 and study3 columns to sapply, which applies the mean function with a trim of 0.1 to each selected column, and we assign the results to tMean. When we print tMean, it shows the expected result of 500 for each of the two selected columns.
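Since comparison is often the goal, one possible extension is to compute the plain and trimmed means side by side for every column. The data frame studies and its values below are invented for this sketch and are not taken from any real study.

```r
# Hypothetical data: three studies of 16 observations each,
# every one with a very low and a very high outlier
studies <- data.frame(
  study1 = c(1, rep(500, 14), 1000000),
  study2 = c(0, rep(250, 14), 90000),
  study3 = c(1, rep(500, 14), 1000000)
)

# For each column, return both the plain mean and the 10% trimmed mean.
# sapply combines the per-column results into a matrix with one column per study.
comparison <- sapply(studies, function(x) {
  c(plain = mean(x), trimmed = mean(x, trim = 0.1))
})

print(comparison)
```

The trimmed row comes out as 500, 250, and 500, while the plain row is pulled far upward by the single large outlier in each column.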