This article is a continuation of the ggplot2 series. The first half of the series was primarily concerned with the aesthetics of ggplot2. This included themes, creating facets, and customizing legends, among other things. The last couple of articles have been concerned more with the data-related side of ggplot2, delving into how to create different kinds of plots. The article immediately preceding this one concerned the histogram. It went into detail explaining what a histogram is and how it is plotted using ggplot2. One of the distinguishing characteristics of a histogram, as mentioned in that article, is that it plots only one variable. It is, therefore, a relatively simple plot and quite fitting as an introduction to plots.
This article, however, is concerned with plots that display more than one variable. In fact, most plots are used for plotting multiple variables, and the scatter plot is one of the most common of them. It is therefore only natural for the transition from single-variable plots to multi-variable plots to be made through scatter plots. Notice that the term “multi-variable” is used, as opposed to “two-variable”. The reason is that scatter plots are not limited to only two variables. This will be explained more clearly in the following section.
Scatter plots are most commonly used to display two variables of a given dataset. The values in the dataset are displayed as points. The position of a point on the x-axis represents one variable, while its position on the y-axis represents the second variable. By mapping other aesthetics of the points to further variables, the plot may be extended from a two-variable plot to a multi-variable plot. For example, color may be added to the scatter plot, so that the varying colors of the points represent a third variable. Likewise, shape, size, and transparency can all be mapped to other variables, extending the range of variables visible at a glance. However, enticing as this may seem, it is rarely advisable to plot too many variables on one plot. Too much information is often worse than no information. Good practice in data visualization is to keep the amount of information communicated per plot to a minimum. There should be only one or two main focus points in the graphic. The audience’s eyes should fall naturally on the important messages being reported about the data. They should not have to frantically search for what the plot is attempting to communicate. Still, multi-variable plotting often comes in handy, especially during data exploration. This article will cover these options and leave the decision up to the reader.
How to plot a scatter plot in ggplot2
In keeping with the style of the previous articles, this article will use the Iris dataset. This dataset is available by default within R. All that is required to access it is to refer to it by its name (“iris”). The dataset contains four numerical variables, or features, along with a categorical variable giving the species. It is made up of a total of 150 observations, 50 for each species of Iris. Its small size is why it is often used as a practical test dataset when trying out new packages or tools. When confronted with a new dataset, scatter plots are often helpful in giving us insights into the data.
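The structure described above can be confirmed directly in R before any plotting. The following is a minimal sketch using only base R; the variable names n_obs, species_counts, and numeric_cols are illustrative and not part of the dataset.

```r
# iris ships with base R (in the datasets package), so nothing needs loading.
str(iris)

# Number of observations (rows) in the dataset
n_obs <- nrow(iris)

# Number of observations per species
species_counts <- table(iris$Species)

# Names of the numerical columns
numeric_cols <- names(iris)[sapply(iris, is.numeric)]

print(n_obs)
print(species_counts)
print(numeric_cols)
```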
When ggplot2 is used for plotting, the scatter plot is often the default choice. It is the go-to option for those seeking to quickly get a glimpse of, or compare, some variables in their data. The geom_*() function used for the scatter plot, geom_point(), gets its name from the points that the plot uses to display the information:
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
This outputs a nicely formatted scatter plot with black points, or dots, representing the values in the data. ggplot2 takes care of setting suitable axis limits and point sizes.
This plot looks nice visually; however, it might be difficult to discern much from it during data exploration. That is because there is no apparent trend among the points. Rather than dismissing the plot based on this conclusion, further work may be done to discover any underlying patterns that may exist in the data. One such step would be to display the different categories, in this case species, on the same plot. This might reveal patterns or relationships among the different categories.
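The lack of an overall trend can also be checked numerically. As a rough sketch, the Pearson correlation between the two plotted variables can be computed with base R's cor(). The variable name r is illustrative, and a single correlation coefficient is only a coarse summary of what the plot shows.

```r
# Pearson correlation between sepal length and sepal width over all 150 rows.
# A value close to zero is consistent with the absence of a clear linear trend.
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(round(r, 2))
```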
The simplest way to map a third variable to a scatter plot is by using the color aesthetic. Mapping variables to plot aesthetics in ggplot2 is straightforward. It is done inside aes(), here within the call to ggplot(). In fact, this has already been done twice in the previous plot, as x and y are both aesthetics (position aesthetics). To map a variable to the color aesthetic, use “color”:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()
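It may help to contrast mapping an aesthetic with setting it. The sketch below is an illustration rather than part of the original example: an aesthetic placed inside aes() varies with the data, while the same argument placed outside aes() applies one fixed value to every point. The object names p_mapped and p_fixed and the color "blue" are arbitrary choices.

```r
library(ggplot2)

# Mapped: color is inside aes(), so it varies with the Species column
# and ggplot2 adds a legend for it.
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species))

# Set: color is outside aes(), so every point gets the same fixed color
# and no legend is drawn.
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue")

print(p_mapped)
print(p_fixed)
```

Here aes() is placed in geom_point() rather than in ggplot(). For a plot with a single layer, both placements give the same result.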
Adding a third aesthetic prompts ggplot2 to include a legend in the plot. This legend explains what the aesthetic values mean. Note that not every aesthetic suits every type of variable. Shape can only be mapped to discrete (categorical) variables, whereas color, size, and transparency can also be mapped to continuous variables. Here, color encodes a categorical variable, the species variable in the iris dataset.
Using the color aesthetic in our scatter plot, we are immediately rewarded. The colors reveal a relationship among the variables. This is a good example of how scatter plots can help in exploratory data analysis. It can be clearly observed that the sepal length and width measurements of the Iris plant are clustered, or grouped, according to species. There is an underlying pattern and it is not all a random mess. Many insights may be drawn from this plot, depending on the direction the analysis is to take. For example, it can clearly be observed that the setosa species always has sepal lengths of less than 6 centimeters. More exploration may be conducted using other aesthetics. This is left to the reader to explore (try: shape, alpha, size).
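As one possible starting point for that exploration, the sketch below maps Species to both color and shape and sets a fixed transparency. The alpha and size values are arbitrary choices, and the names p and max_by_species are illustrative. The last lines check the observation about setosa sepal lengths numerically with base R.

```r
library(ggplot2)

# Map Species to both color and shape. alpha and size are set (not mapped)
# to fixed values so that overlapping points remain distinguishable.
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                      color = Species, shape = Species)) +
  geom_point(alpha = 0.7, size = 2)
print(p)

# Largest sepal length recorded for each species, to check the claim
# that setosa stays below 6 centimeters.
max_by_species <- tapply(iris$Sepal.Length, iris$Species, max)
print(max_by_species)
```

Because color and shape are mapped to the same variable, ggplot2 combines them into a single legend.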