This series has covered a number of complex topics so far, from themes to legends to facets, all in an effort to become more familiar with ggplot2. Today, the focus is on a simpler topic. Among the many plots ggplot2 makes available to a data scientist, the histogram cannot be ignored. Each plot has its use, and histograms offer their two cents when needed. This tutorial covers histograms and how to implement them using ggplot2.
Histograms
Histograms are often confused with bar charts. Quite often, the two terms are used interchangeably in discussions. Many do not know the difference and never attempt to understand it. These are different plots that offer different benefits, so it is important to know the difference. The confusion stems from how similar they look. They both have “bars”, so to speak, that represent data. However, the type of data each represents is quite different. If the data being used contains categorical values or, in other words, if it has categories, then a bar chart should be used. In such a case, the x-axis represents the categories, or the categorical variable, while the y-axis represents the value for each category. So, remember, if the data contains categories, use bar plots.
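As a minimal sketch of the bar chart case, the snippet below uses R's built-in iris dataset, where Species is the categorical variable. Plotting the mean petal length per species is an illustrative choice, not something prescribed by this series, and the names species_means and bar_plot are made up for this example.

```r
library(ggplot2)

# One value per category: the mean petal length of each species
species_means <- aggregate(Petal.Length ~ Species, data = iris, FUN = mean)

# Categories on the x-axis, the value for each category on the y-axis
bar_plot <- ggplot(species_means, aes(x = Species, y = Petal.Length)) +
  geom_col()

print(bar_plot)
```

Here geom_col() draws one bar per category with the height taken from the y variable, which is the two-variable situation described above.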
In contrast to categorical data, there is continuous data. What is often desired when working with such data is an understanding of how the values are distributed across ranges. If such a visualization is desired, then a histogram is required. Whereas a bar chart represents two variables, the variable containing the categories and the variable containing the values, a histogram represents only one: a continuous variable. More precisely, it represents the frequency of different ranges within that variable. For example, say that during the course of a study, a list of the ages of the people involved was gathered. Assuming the range of possible ages for a human starts at zero and goes to 120, the range could be split into 12 distinct ranges, each 10 years in length. Then, a histogram would be used to plot the number of persons falling into each range. If there were five people with ages between 20 and 30, the bar representing this range in the histogram would have a height of five.
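The counting behind a histogram can be sketched in base R before any plotting is involved. The ages below are invented purely for illustration, and the choice to make each range include its lower bound (so that 20 counts as part of 20 to 30) is an assumption of this sketch.

```r
# Hypothetical ages, made up for this example
ages <- c(3, 17, 21, 24, 25, 28, 29, 34, 41, 47, 52, 58, 63, 77, 85, 102)

# Break points every 10 years from 0 to 120 define 12 ranges
breaks <- seq(0, 120, by = 10)

# right = FALSE makes each range closed on the left: [20,30), [30,40), ...
age_bins <- cut(ages, breaks = breaks, right = FALSE)

# Number of people in each range; these counts are the bar heights
counts <- table(age_bins)
print(counts)
```

Five of the made-up ages fall between 20 and 30, so the bar for that range would have a height of five.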
Since histograms represent consecutive ranges of a continuous variable, the bars are drawn directly next to one another, with no spaces between adjacent ranges. The bars in bar charts, on the other hand, represent distinct categories, and therefore it makes sense that there are spaces between them. This is one of the ways histograms and bar charts can be easily distinguished from each other. The subranges the bars represent in a histogram are called “bins”. The width of the bins can be adjusted as desired. Sometimes, narrower bins are required, for example when more detailed information is needed on the dataset being represented. Other times, a more general understanding of the dataset is desired, and therefore wider bins are used.
The Data
In keeping with the previous articles in this series, the Iris dataset will be used for this tutorial as well. For those unfamiliar with it, this dataset contains 150 samples. Three different species of Iris were measured, with 50 samples for each species. Four features were measured for each sample: the length and width of the sepals and of the petals. To get a better grasp of the dataset, summary() and str() can be invoked:
summary(iris)
str(iris)
For the sake of this tutorial, Petal.Length will be used, since it has the largest spread of all the features and should prove to be the best one for showcasing the use of a histogram.
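One way to check the claim about spread is to compare the width of the range (maximum minus minimum) of each measurement. This is only one possible definition of spread, chosen here for simplicity; the variable names are illustrative.

```r
# The first four columns are numeric; the fifth, Species, is a factor
measurements <- iris[, 1:4]

# Width of the range of each feature
spread <- sapply(measurements, function(x) diff(range(x)))
print(spread)

# Name of the feature with the widest range
widest <- names(which.max(spread))
print(widest)
```

Under this definition, Petal.Length comes out as the feature with the widest range.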
How to plot a histogram using ggplot2
By now, enough has been covered in this series on how to plot with ggplot2 and use the ggplot() function. Those unfamiliar with this library are advised to go over the previous articles in this series. Given the grammar of graphics approach ggplot2 follows, one would expect a geom_*() function for plotting a histogram. This is indeed the case:
ggplot(iris, aes(Petal.Length)) + geom_histogram()
The first thing to notice here is that this is the first time we see only one variable in aes(). This is due to the nature of histograms, as explained in detail above. Histograms plot information on only one variable, whereas most other plots show two or more. Histograms can be thought of as one-dimensional plots, whereas scatter plots and bar charts are two-dimensional or more.
The second thing to notice is the message ggplot2 prints.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot2 tells us that the “bins” argument was given the default value of 30. It also tells us that we can pick a better value using the “binwidth” argument. What are bins, you may be asking? As previously alluded to, they are the subranges into which a histogram splits the variable being plotted. In the example of the ages of the participants in the study, the variable being plotted has a range from 0 to 120. If this range were split into subranges of width 10, there would be 12 subranges in total, meaning 12 bins, each with a width of 10. In the histogram we just plotted, the number of bins defaulted to 30, which is equivalent to specifying bins=30. This means ggplot2 picks the subranges so that exactly 30 bins, some of which may be empty, cover the complete range of the data (in this case, 1.0 to 6.9).
There are two ways to adjust the bins in a histogram. The “binwidth” argument can be used within geom_histogram() to set the width of the bins. Alternatively, “bins” can be used to set the number of bins. Below are examples of each:
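To see the bins ggplot2 actually computed, the plot's layer data can be inspected. The sketch below relies on ggplot2's layer_data() helper, which returns one row per bin for a histogram layer; the names p and bin_table are illustrative.

```r
library(ggplot2)

p <- ggplot(iris, aes(Petal.Length)) + geom_histogram()

# One row per bin, including its boundaries (xmin, xmax) and its count
bin_table <- layer_data(p)

# Number of bins chosen by default
print(nrow(bin_table))

# Boundaries and counts of the first few bins
print(head(bin_table[, c("xmin", "xmax", "count")]))

# Every observation falls into exactly one bin
print(sum(bin_table$count))
```

Bins with a count of zero still appear as rows in this table, which is why the plot may show gaps even though the bins themselves are contiguous.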
ggplot(iris, aes(Petal.Length)) + geom_histogram(bins=10)
This ensures that there are a total of 10 bins, or bars, in the resulting plot. And:
ggplot(iris, aes(Petal.Length)) + geom_histogram(binwidth=0.5)
This ensures that each bin, or bar, has a width of 0.5.
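The effect of the two arguments can be checked with a short sketch that again uses layer_data() to read back the computed bins. The variable names are made up for this example.

```r
library(ggplot2)

by_count <- ggplot(iris, aes(Petal.Length)) + geom_histogram(bins = 10)
by_width <- ggplot(iris, aes(Petal.Length)) + geom_histogram(binwidth = 0.5)

count_bins <- layer_data(by_count)
width_bins <- layer_data(by_width)

# Number of bins produced when bins = 10 is requested
print(nrow(count_bins))

# Width of each bin when binwidth = 0.5 is requested
bin_widths <- width_bins$xmax - width_bins$xmin
print(unique(round(bin_widths, 8)))

# With a fixed width, the number of bins follows from the range of the data
print(nrow(width_bins))
```

The two arguments are alternative ways of describing the same binning: fixing the number of bins determines their width, and fixing the width determines their number.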
That is all that is needed to get started with histograms in ggplot2. The same customization principles explained in previous tutorials apply to this plot as well. Legends, themes, colors, and any other modifications may be added to it, just as with any other plot in ggplot2.
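As one possible illustration of such customization, the sketch below adds colors, labels, and a theme to the histogram. The specific colors, title, and axis labels are arbitrary choices made for this example.

```r
library(ggplot2)

styled <- ggplot(iris, aes(Petal.Length)) +
  # fill colors the bars; color draws their outlines
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white") +
  labs(
    title = "Distribution of petal length in the iris dataset",
    x = "Petal length (cm)",
    y = "Number of flowers"
  ) +
  theme_minimal()

print(styled)
```

The white outline is a common way to make the boundaries between adjacent bins easier to see.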