How To Create Barplots Using ggplot2

Barplots are among the simplest plots in a data scientist's repertoire. Despite their simplicity, they are often very useful and insightful. Rarely is a dashboard designed, or a report written, in which a barplot does not play some part. Like any other plotting library or program, ggplot2 offers a straightforward approach to building barplots. There are a few tricky points that need to be carefully considered, such as the internal stat functions ggplot2 employs when calculating the values of a barplot. Aside from that, what requires the most attention is proper customization of the plot.

Earlier articles covered customization and the use of theme(). Everything said there about customizing titles, legends, colors, and more also applies to the barplots shown in this tutorial. Some examples will be presented. However, ggplot2 is incredibly extensive in its customization capabilities, so a near-endless number of customizations can be applied to a ggplot2 barplot. Exploring them further is left to the reader.

Data

For this tutorial, a simple custom dataset will be generated. Up until now, the iris dataset, which ships with base R, was used for all examples. However, the iris dataset is not particularly suited to barplots. There are two main uses for barplots. When a dataset contains multiple occurrences of specific categorical values, a barplot may be used to visualize the number of occurrences of each value. The other use is to display a value assigned to each category in the data.

Since the iris dataset does not lend itself well to either of these two uses, a simple example dataset will be generated for this tutorial. The dataset will be in the form of a data frame. At this point, readers who feel they lack adequate knowledge of R or RStudio may choose to take a quick “R Basics” tutorial before continuing. The data frame will contain two columns. The first column will contain the categories of the dataset and will be called “Letters”. The second column will contain the value belonging to each category and will be called “Values”.

df <- data.frame(Letters = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
                 Values = floor(abs(rnorm(5, 0, 15))))

At any point, when confronted with an unfamiliar function, you can use the console to look up its documentation. Try typing “?rnorm” into the console. You will find that this function generates random values drawn from a normal distribution; here the three arguments request 5 values with a mean of 0 and a standard deviation of 15. Since the “Letters” column contains five entries, only five random values are needed, hence the “5” as the first argument of rnorm(). Only positive numbers are desired, so abs() is used to take the absolute value of the results. Additionally, floor() is applied to drop the decimals, leaving just whole numbers. Now we have a simple dataset containing five rows. Type “df” into the console to see the dataset:

  Letters Values
1   Alpha      5
2    Beta      4
3   Gamma     16
4   Delta     16
5 Epsilon     41

Now hold on a minute. You might notice that your values are completely different from the ones shown here! This makes sense, since rnorm() is supposed to generate random numbers. However, since this is a tutorial and the reader should be able to follow along exactly, we need a way to generate the same random numbers every time. To do this, we can set the “seed” of the random number generator before generating the data.

set.seed(2020)

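If you set the seed to the same number used here (2020) and then re-run the data frame code above, you should see the exact same values! Note that set.seed() has to be run again immediately before each re-run of the data frame code; otherwise the generator simply continues from where it left off and produces new values. Try running the two together a couple more times. Notice how the values do not change.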
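The effect of the seed, and of each step in the Values expression, can be checked with a short sketch. The names raw, first, second and third are only illustrative and are not used elsewhere in this tutorial.

```r
# Fix the seed so the draws below are reproducible
set.seed(2020)

# Five draws from a normal distribution with mean 0 and standard deviation 15
raw <- rnorm(5, mean = 0, sd = 15)

# abs() drops the sign, floor() drops the decimals
first <- floor(abs(raw))

# Resetting the seed restarts the generator, so the same values come back
set.seed(2020)
second <- floor(abs(rnorm(5, 0, 15)))

identical(first, second)  # TRUE

# Without resetting the seed, the generator carries on from its current
# state, so this third draw will generally not match the first two
third <- floor(abs(rnorm(5, 0, 15)))
```

Printing raw next to first is a quick way to see what abs() and floor() each contribute.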

Barplots

Now that we have our dataset, let us finally make our first barplot. As you know by now, every plot in ggplot2 is built from a geom_*() function. For a barplot, as you might have guessed, the function is geom_bar(). Using our dataset, we can attempt a simple barplot as follows:

library(ggplot2)
ggplot(df, aes(x = Letters, y = Values)) + geom_bar()

Uh-oh. There is a problem. We get an error message telling us that stat_count() can only have an x or a y aesthetic, not both. What is stat_count()? This will be explained shortly, but first let us rerun our code with only the x aesthetic assigned.

ggplot(df, aes(x = Letters)) + geom_bar()

This results in an error-free plot. However, it does not look right. The “Letters” are all correctly mapped to the x-axis but their values are all 1! In order to understand the reason for this, we need a little bit of background information on how ggplot2 functions calculate values for plotting.

Stat functions

Often, a plot does not directly display the information provided to it through the dataset. It first needs to transform the data into some other form. In ggplot2, these are called statistical transformations, or stat functions for short. Stat functions are a large topic that is well beyond the scope of this tutorial, so we will only cover them briefly.

Each geom_*() function in ggplot2 is assigned a default stat function, which transforms the data before the plot is drawn. In the case of geom_bar(), the default is stat_count() (see ?stat_count). The resulting transformation is a condensed (or summarized) version of our dataset that counts the number of times each categorical value is found in the data. In our dataset each letter occurs only once, so the count for each letter is 1. This explains the plot above.
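Counting is easier to see when categories repeat. The sketch below uses a small hypothetical data frame, df_repeated (not part of this tutorial's dataset), in which “Alpha” appears three times, “Beta” once and “Gamma” twice. It assumes the ggplot2 package is installed; counts and p_counts are illustrative names.

```r
library(ggplot2)

# Hypothetical data in which each letter occurs a different number of times
df_repeated <- data.frame(
  Letters = c("Alpha", "Alpha", "Alpha", "Beta", "Gamma", "Gamma")
)

# Base R view of what the default stat computes: occurrences per category
counts <- table(df_repeated$Letters)
print(counts)

# geom_bar() with its default stat draws one bar per letter,
# with bar heights equal to the counts above
p_counts <- ggplot(df_repeated, aes(x = Letters)) + geom_bar()
print(p_counts)
```

Only the x aesthetic is mapped here, because the bar heights come from the counting step rather than from a column in the data.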

So, geom_bar(), by default, plots the count of each categorical value. If, instead, we would like to plot another variable corresponding to each categorical value (in our case, Values), we have to use a different stat function. The “identity” stat function maps a variable directly to an aesthetic without transforming it in any way. By setting the “stat” argument of geom_bar(), we can finally create the plot we are looking for.

ggplot(df, aes(x = Letters, y = Values)) + geom_bar(stat = "identity")
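Because plotting values that are already in the data is so common, ggplot2 also provides geom_col(), which is documented on the same help page as geom_bar() and uses the identity stat by default. The sketch below rebuilds the tutorial's data frame so that it can run on its own; p_identity and p_col are illustrative names.

```r
library(ggplot2)

# Recreate the tutorial's data frame with the same seed
set.seed(2020)
df <- data.frame(Letters = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
                 Values = floor(abs(rnorm(5, 0, 15))))

# Explicit identity stat, as in the tutorial
p_identity <- ggplot(df, aes(x = Letters, y = Values)) +
  geom_bar(stat = "identity")

# geom_col() plots the y values as they are, with no stat argument needed
p_col <- ggplot(df, aes(x = Letters, y = Values)) +
  geom_col()

print(p_col)
```

Which of the two to use is largely a matter of taste; the rest of this tutorial keeps geom_bar(stat = "identity").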

Basic customization of a barplot

The topic of customization was covered extensively in previous tutorials. All of the techniques shown there apply to our barplot as well. This includes, but is not limited to, customizing the colors, sizes and titles of all the components of the plot. For a recap, please review the previous tutorials.

Sometimes a barplot looks better horizontal than vertical. This is easily achieved using coord_flip(), as shown below:

ggplot(df, aes(x = Letters, y = Values)) + geom_bar(stat = "identity") + coord_flip()
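As an alternative to coord_flip(), ggplot2 version 3.3.0 and later can work out the orientation of the bars from the aesthetics, so the categories can be mapped to y directly. The following is a minimal sketch assuming such a version is installed; it rebuilds the data frame so that it runs on its own, and p_flip and p_horizontal are illustrative names.

```r
library(ggplot2)

# Recreate the tutorial's data frame with the same seed
set.seed(2020)
df <- data.frame(Letters = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
                 Values = floor(abs(rnorm(5, 0, 15))))

# The approach from the tutorial: build vertical bars, then flip the axes
p_flip <- ggplot(df, aes(x = Letters, y = Values)) +
  geom_bar(stat = "identity") +
  coord_flip()

# Assumes ggplot2 >= 3.3.0: categories on y, values on x, no coord_flip()
p_horizontal <- ggplot(df, aes(x = Values, y = Letters)) +
  geom_bar(stat = "identity")

print(p_horizontal)
```

With coord_flip() the aesthetics keep their original names, so axis labels and theme settings for "x" still refer to Letters; with the second form they refer to Values.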

You might also want to change the width of the bars. This can be achieved with the “width” parameter of geom_bar():

ggplot(df, aes(x = Letters, y = Values)) + geom_bar(stat = "identity", width = 0.5)

One thing to be careful of when customizing a barplot is the distinction between the “color” and “fill” aesthetics. “Color” controls the color of the bars' borders, not their interiors. To map a variable to the interior color of the bars, you need to use “fill”. For more information on ggplot2 barplots, see ?geom_bar.
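The difference between “color” and “fill” is easiest to see side by side. The sketch below rebuilds the tutorial's data frame so that it runs on its own; the specific colors and the names p_constant and p_mapped are arbitrary choices made for illustration.

```r
library(ggplot2)

# Recreate the tutorial's data frame with the same seed
set.seed(2020)
df <- data.frame(Letters = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
                 Values = floor(abs(rnorm(5, 0, 15))))

# Constant settings (outside aes()): every bar gets a steel-blue
# interior and a black border
p_constant <- ggplot(df, aes(x = Letters, y = Values)) +
  geom_bar(stat = "identity", fill = "steelblue", color = "black")

# Mapped aesthetic (inside aes()): each letter gets its own fill color
# and a legend entry, while the border stays black for all bars
p_mapped <- ggplot(df, aes(x = Letters, y = Values, fill = Letters)) +
  geom_bar(stat = "identity", color = "black")

print(p_mapped)
```

Swapping fill = Letters for color = Letters in the second plot would color only the borders and leave the interiors at their default gray.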