Percentages and percentiles are related, and the terms are sometimes used interchangeably, but they are different things. A percentage expresses a fraction of a whole as a number out of 100, while a percentile is a value in a data set below which a given percentage of the data points fall. Both provide useful information about a data set, but they answer different questions: a percentage tells you how much, and a percentile tells you where a value sits relative to the rest of the data.
Percentiles are a type of order statistic: they are computed from the data sorted from the smallest value to the largest, and they describe how a distribution is spread out. Several common summary measures are defined directly in terms of percentiles. The median is the 50th percentile, the interquartile range is the distance between the 25th and 75th percentiles, and outliers are often flagged by how far they fall beyond those two values. Percentiles also connect to z scores: in a standard normal distribution every percentile corresponds to a fixed z score, so important values such as the 80th percentile (z ≈ 0.84) and the 95th percentile (z ≈ 1.645) are relatively easy to point out on the bell curve.
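As a quick illustration of the standard normal case, base R's qnorm() returns the z score at a given percentile of a normal distribution. The sketch below looks up the two percentiles mentioned above; the variable names are only illustrative.

```r
# z scores at the 80th and 95th percentiles of the standard normal
# distribution (mean 0 and standard deviation 1 are qnorm's defaults)
z80 <- qnorm(0.80)
z95 <- qnorm(0.95)

round(c(z80, z95), 3)
# [1] 0.842 1.645

# pnorm() goes the other way: from a z score back to the fraction below it
pnorm(z95)
# [1] 0.95
```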
The pth percentile of a data set is the value that has a certain percentage (p) of the data points below it. To demonstrate how the process works, I will find the 12th, 37th, 62nd, and 87th percentiles of the following data set.
5 10 12 15 20 24 27 30 35
Here is our example, already in numerical order; there are nine values in this data set. To find a percentile, take that percentage of the number of values in the data set, count up that many whole values, and then go to the next value up. That value is our percentile.
- 12% of 9 = 1.08 – percentile = 10
- 37% of 9 = 3.33 – percentile = 15
- 62% of 9 = 5.58 – percentile = 24
- 87% of 9 = 7.83 – percentile = 30
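The counting steps above can be sketched in a few lines of R. The helper name manual_percentile is hypothetical (it is not a built-in function), and it assumes the rule exactly as described here: take p% of n, count up that many whole values, and move to the next one.

```r
# Hypothetical helper: percentile by the "count up, then take the next value" rule
manual_percentile <- function(x, p) {
  x <- sort(x)                      # the rule needs the data in numerical order
  position <- p / 100 * length(x)   # e.g. 12% of 9 = 1.08
  index <- floor(position) + 1      # count up whole values, then go one further
  index <- pmin(index, length(x))   # guard: never step past the largest value
  x[index]
}

x <- c(5, 10, 12, 15, 20, 24, 27, 30, 35)
manual_result <- manual_percentile(x, c(12, 37, 62, 87))
manual_result
# [1] 10 15 24 30
```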
What these results show is that roughly 12% of the values are less than 10, 37% are less than 15, 62% are less than 24, and 87% are less than 30. With only nine values these figures are approximate, and the process naturally works better with larger data sets. This is in part because you need at least a hundred data points before every percentile from the 1st to the 99th can correspond to its own data point.
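One way to see how approximate these statements are with only nine values is to compute the actual share of the data lying below each percentile. This is a small check in base R; the variable names are illustrative.

```r
x <- c(5, 10, 12, 15, 20, 24, 27, 30, 35)
percentile_values <- c(10, 15, 24, 30)   # the four percentiles found by hand

# mean() of a logical vector gives the fraction of TRUE values
share_below <- sapply(percentile_values, function(v) mean(x < v))
round(100 * share_below, 1)
# [1] 11.1 33.3 55.6 77.8
```

Each share is the largest multiple of 1/9 that stays under the target percentage, which is why the match improves as the data set grows.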
The three quartiles of a data set are the values that sit at the quarter marks of the data set. Specifically, they are the values at the 25th, 50th, and 75th percentiles. Quartiles are one kind of quantile, the general term for cut points that divide sorted data into equal-sized groups, and the distance between the 25th percentile and the 75th percentile is known as the interquartile range. The calculation method is the same as for the percentile calculations above.
- 25% of 9 = 2.25 – quartile 1 = 12
- 50% of 9 = 4.50 – quartile 2 = 20
- 75% of 9 = 6.75 – quartile 3 = 27
This connects the percentile and quartile calculations and shows how closely the concepts are related. This is why R uses the same quantile() function for both.
How to find percentiles in R
So how do you find percentiles in R? You find a percentile in R by using the quantile() function. It returns each requested percentage paired with the value that is the corresponding percentile.
    # how to find percentiles in R - quantile in r
    > x = c(5,10,12,15,20,24,27,30,35)
    > quantile(x)
      0%  25%  50%  75% 100%
       5   12   20   27   35
This is the default version of this function, and it produces the 0th percentile, 25th percentile, 50th percentile, 75th percentile, and 100th percentile.
    # how to find percentiles in r - quantile in r
    > x = c(5,10,12,15,20,24,27,30,35)
    > quantile(x, probs = c(0.125,0.375,0.625,0.875))
    12.5% 37.5% 62.5% 87.5%
       10    15    24    30
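Here, we add the probs (probabilities) argument, which lets you request other percentiles as fractions between 0 and 1. These four fractions land exactly on data points. For probabilities that fall between data points, quantile() interpolates between the neighboring values by default, so its results can differ slightly from the counting method used earlier.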
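For comparison with the counting method from earlier, quantile() also accepts a type argument that selects among several calculation methods. As a sketch, type = 1 picks actual data values without interpolating and, for this data set and these percentages, should reproduce the hand-calculated results, while the default (type = 7) interpolates between neighboring values.

```r
x <- c(5, 10, 12, 15, 20, 24, 27, 30, 35)
p <- c(0.12, 0.37, 0.62, 0.87)

# type = 1: inverse of the empirical distribution function, no interpolation
by_counting <- quantile(x, probs = p, type = 1)

# default (type = 7): linear interpolation between neighboring data points
interpolated <- quantile(x, probs = p)

by_counting
interpolated
```

Which type to use is a judgment call; the default is usually fine for large data sets, where the methods give nearly identical answers.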
There are many applications for finding a percentile in R. Here’s a good example using a long data set, R's built-in treering data, which consists of 7,980 data points.
    # how to find percentiles in R using treering data
    > quantile(treering)
       0%   25%   50%   75%  100%
    0.000 0.837 1.034 1.197 1.908
Here, we have the quartiles along with the minimum and maximum values. One thing they reveal about these tree rings is that the values tend to be concentrated in the middle. The IQR is 0.36 while the range is 1.908, meaning that the middle half of the values spans only about 19% of the range of the data set.
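The figures in this paragraph can be reproduced with base R's IQR() and range() functions. This is a sketch that assumes the built-in treering data set is available, as it is in a standard R installation; the variable names are illustrative.

```r
# treering ships with R in the datasets package
iqr_width <- IQR(treering)            # 75th percentile minus 25th percentile
full_range <- diff(range(treering))   # maximum minus minimum

# share of the full range covered by the middle half of the values
iqr_share <- iqr_width / full_range
round(c(iqr = iqr_width, range = full_range, share = iqr_share), 3)
```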
Finding the values that sit at given percentages of a data set can tell you much about it, including how concentrated and how skewed the values are. It is one example of R as a tool in data science.