ggplot2 is a powerful and flexible plotting package for the R programming language, part of what is known as the tidyverse. In this tutorial we’re going to cover how to create a ggplot2 boxplot from your data frame, one of the most fundamental tools in descriptive statistics. There are many ways to style and format ggplot2 boxplots, which makes them a good way to share data with non-data scientists.
Let’s start by introducing the topic of data exploration, more specifically the early stages. Your boss emails you a file, indicating it should be analyzed to explain or predict some effect. As a first step you will want to run some descriptive statistics on the data set. This usually includes some basic graphs to understand the shape, center, and spread of a particular variable.
Here are a few charts you might generate:
- Basic Boxplot (or multiple boxplots)
- Violin plot – version of the box plot
- Bar Plot (or bar chart)
- Density plot
- Line Plot
- Scatter plot (maybe)
The goal is to quickly get a handle on the shape, center, and spread of a statistical distribution. What does the data look like? And in the event you generate multiple boxplots (see our tutorial on a side by side or grouped boxplot), you can quickly assess the predictive power of a categorical variable.
The Data for the R ggplot2 boxplot
A quick piece of housekeeping: you will need to install the R ggplot2 package (not the older ggplot package; you need ggplot2). If you do not already have it, please install it (guidance is here).
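As a minimal sketch of that setup step, the snippet below installs ggplot2 only when it is missing and then loads it. It assumes a CRAN mirror is reachable from your R session; if you manage packages another way, skip the install line.

```r
# Install ggplot2 only if it is not already available (assumes CRAN access).
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so ggplot(), aes(), and geom_boxplot() are available.
library(ggplot2)
```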
The data set used in this tutorial will be the ubiquitous Iris data set that is available by default in the base R installation. To access it just call it by its name: iris. Let us take a quick look at what this dataset contains:
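One way to take that quick look is with a few base R helpers. This is only a sketch; any of the three calls on its own gives a reasonable first impression of the data.

```r
# iris ships with base R (in the datasets package), so nothing needs importing.
str(iris)      # column names, types, and a preview of the values
head(iris)     # first six rows
summary(iris)  # per-column summaries, including counts per species
```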
There are five variables in this dataset. Species gives us the categorical names of each of the species of irises that has been measured. The remaining four variables are numerical and contain the measurements of the lengths and widths of the sepals and petals of each flower. When confronted with a dataset like this, one that contains several numerical variables, it is important to understand the spread of the numerical values and how they differ across the categories. Boxplots are well suited to that. They allow us to plot and compare multiple numerical distributions side by side, giving us insight not only into the spread but also into other key statistical information such as the median, quartiles, and outliers.
Plotting boxplots in ggplot2 is very straightforward. We know that ggplot2 uses the grammar of graphics paradigm and thus all types of plots can be created by adding a corresponding geom_*() function to the base ggplot() plot function. In the case of a boxplot it is geom_boxplot().
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()
This is the bare minimum boxplot from ggplot2. Let us take a moment to refresh ourselves on how to read and interpret boxplots. Right off the bat, we see three shapes, or “boxes”. Each one corresponds to a unique categorical value from the Species variable in our data set. The y-axis shows the numerical values of Sepal.Length. The full length of each shape, from the end of one whisker to the end of the other, covers the range of values for that category, excluding any outliers.
In our plot, we see a dot on the bottom side of the rightmost shape. This represents an outlier. An outlier is a value that is so distant from the rest of the range that including it might wildly skew any conclusions we draw from observing this range. By default, geom_boxplot draws a point as an outlier when it lies more than 1.5 times the interquartile range beyond the edge of the box. It is marked separately so that you can account for it in your analyses. Because boxplots make outliers easy to spot, they often help us find anomalies or inconsistencies in a dataset. This lets us deal with these “errors” manually before going into any modeling or forecasting.
Using a boxplot for data exploration on our dataset has already been of help to us by giving us information on outliers. Let us go a little further. You will notice that each box has a pair of vertical lines, one extending from each end. These are called “whiskers”. They designate the two tail ends of the distribution. More specifically, the box itself represents the middle 50% of the values in that category. The line running horizontally through the center of each box represents the median of the data. Each figure, from one endpoint to the other, gives us a concise statistical summary of the distribution of values in each category. In this way, we are able to gain intuitive insight into the relationship between the different categories.
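To check which values are being drawn as outlier dots, you can apply the usual 1.5 times interquartile range rule by hand. The sketch below uses a hypothetical helper, find_outliers, written for this tutorial; it relies on the default quantile() method, so results should match the plot closely but are not guaranteed to be identical in every edge case.

```r
# Hypothetical helper: return the values lying more than 1.5 * IQR
# below the first quartile or above the third quartile.
find_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
}

# Apply the rule to Sepal.Length separately for each species.
outliers_by_species <- lapply(split(iris$Sepal.Length, iris$Species), find_outliers)
outliers_by_species
```

The virginica entry should contain the low value that appears as the dot under the rightmost box.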
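The numbers behind each box can also be computed directly, which is a handy cross-check when reading the plot. This sketch computes the minimum, quartiles, median, and maximum of Sepal.Length per species; box_stats is just an illustrative variable name.

```r
# One column per species; rows are min, first quartile, median,
# third quartile, and max of Sepal.Length.
box_stats <- sapply(
  split(iris$Sepal.Length, iris$Species),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1)
)
box_stats
```

The 25% and 75% rows correspond to the bottom and top of each box and the 50% row to the median line. The 0% and 100% rows include any outliers, so they can extend past the whiskers.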
Depending on the analysis you intend to conduct and the results you are aiming for, you can use a boxplot such as this to gain different insights. For example, you can clearly see that the setosa species has a narrower distribution. We also notice that, overall, its values are much lower than those of the other species. You can also see that the versicolor species has a range that is slightly offset. If you would like to compare the medians more easily, you can introduce a notch.
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(notch=TRUE)
These are just some of the conclusions you can draw from a boxplot.
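For the notched version, the ggplot2 documentation describes the notches as extending 1.58 * IQR / sqrt(n) on either side of the median, which gives a rough 95% interval for comparing medians. The sketch below reproduces that calculation with a hypothetical helper, notch_limits, so you can see the numbers the notches are based on.

```r
# Hypothetical helper: notch limits around the median, using the
# 1.58 * IQR / sqrt(n) half-width described in the geom_boxplot docs.
notch_limits <- function(x) {
  half_width <- 1.58 * IQR(x) / sqrt(length(x))
  c(lower = median(x) - half_width,
    median = median(x),
    upper = median(x) + half_width)
}

# One column per species.
notches <- sapply(split(iris$Sepal.Length, iris$Species), notch_limits)
notches
```

If the notch intervals of two species do not overlap, that suggests their medians differ.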
Making It Look Pretty
So… let’s have a quick talk about making it look pretty, e.g. making the graph look nice for your business management. (Silly people, they insist on an axis label or an added horizontal line.)
The geom function (geom_boxplot) has a variety of options you can tap into to create a better looking graph. Let’s walk through some of the more common requests.
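Since axis labels and a horizontal line were mentioned, here is a minimal sketch of both. The title, the label text, and the choice of the overall mean as the reference line are illustrative assumptions, not requirements.

```r
library(ggplot2)

# Basic boxplot with readable labels and a dashed reference line
# at the overall mean sepal length (an arbitrary choice for illustration).
p_labeled <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  labs(title = "Sepal length by species",
       x = "Species",
       y = "Sepal length (cm)") +
  geom_hline(yintercept = mean(iris$Sepal.Length), linetype = "dashed")

p_labeled
```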
Changing Box Plot Line Colors
Going back to our example, you can cause the colors of the different series to change by keying them to a categorical variable. For example, we could adjust the example above via:
ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) + geom_boxplot()
This will cause each Species to have a different color in the box plot.
You can also handle this manually, by series, using the scale_color_manual function and specifying a vector of assigned colors. Here p is assumed to be a variable holding the color-mapped plot from above.
p + scale_color_manual(values=c("red", "blue", "green"))
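Put together as a runnable sketch, the color example looks like the following. The variable names p and p_colored are only conventions for this tutorial, and naming the colors by species is optional; it just makes the assignment explicit rather than dependent on factor order.

```r
library(ggplot2)

# Map the outline color to Species inside aes() and store the plot.
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_boxplot()

# Override the default palette with one color per species.
p_colored <- p + scale_color_manual(
  values = c(setosa = "red", versicolor = "blue", virginica = "green")
)

p_colored
```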
Changing ggplot2 Boxplot Fill Colors
Similar to the color option, there is a fill option you can use in the same fashion. You can set one fill for the entire chart by passing it to the geom function (as fill="red") or let colors get assigned automatically by mapping fill to a categorical variable inside aes() (fill = Species).
Want custom manual control? Similar to line colors, you can use the scale_fill_manual function to manually set the color for each group. Here p is assumed to be a plot whose fill is mapped to Species.
p + scale_fill_manual(values=c("red", "blue", "green"))
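The fill options can be sketched end to end as follows. The color "lightblue" and the variable names are arbitrary choices for illustration.

```r
library(ggplot2)

# Option 1: a single fill color for every box, set outside aes().
p_single <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(fill = "lightblue")

# Option 2: fill mapped to Species inside aes(), then overridden manually.
p_fill <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(setosa = "red", versicolor = "blue", virginica = "green")
  )

p_single
p_fill
```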
Moving the Box Plot Legend Around
A favorite ask of managers. Move the legend.
Not a problem. Set the legend.position argument of theme() to "top", "left", "right", "bottom", or "none".
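A minimal sketch of moving the legend, assuming a plot with a fill legend to move; the variable names are illustrative only.

```r
library(ggplot2)

# A plot that has a legend because fill is mapped to Species.
p_legend <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot()

# Place the legend above the plot; use "none" to hide it entirely.
p_top <- p_legend + theme(legend.position = "top")
p_hidden <- p_legend + theme(legend.position = "none")

p_top
```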
Creating a Notched Box Plot
A notched box plot can be created by setting the notch option in the geom function, as in geom_boxplot(notch=TRUE).
And there you have it, a notched box plot.
Adding a Mean Data Point to Boxplot
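This is accomplished via the stat_summary function. Here’s an example. In ggplot2 3.3.0 and later the argument is named fun; older versions used fun.y, which is now deprecated.
p + stat_summary(fun=mean, geom="point", shape=2, size=3)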
This plots the mean (in addition to the median) as part of the box plot. Very useful for multiple boxplots.
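To confirm where the mean markers should land, you can compute the per-species means directly and compare them with the plot. This is a sketch assuming ggplot2 3.3.0 or later (for the fun argument); species_means and p_mean are illustrative names.

```r
library(ggplot2)

# Per-species means of Sepal.Length, for comparison with the plotted markers.
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_means

# Boxplot with the mean of each group added as an open triangle (shape = 2).
p_mean <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 2, size = 3)

p_mean
```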