Guide to ggplot2: Scatter Plots

This article continues the ggplot2 series. The first half of the series dealt mainly with the aesthetics of ggplot2 output: themes, facets, and legend customization, among other things. The last couple of articles have turned to the data side of ggplot2 and shown how to create different kinds of plots. The article immediately preceding this one covered the histogram, explaining what a histogram is and how it is plotted using ggplot2. One of the distinguishing characteristics of a histogram, as mentioned in that article, is that it plots only one variable. It is therefore a relatively simple plot and a fitting introduction to plotting.

This article, however, is concerned with plots that display more than one variable. In fact, most plots used in data analysis display more than one variable. The ggplot scatter plot is one of the most common of these. It is therefore only natural for the transition from single-variable plots to multi-variable plots to be made through scatterplots. Notice that the term “multi-variable” is used, as opposed to “two-variable”. The reason is that scatterplots are not limited to two variables. This will be explained more clearly in the following section on creating a scatterplot in R.

Scatter Plots

Scatter plots are most commonly used to display two variables of a given dataset. The values in the dataset are displayed as points. The position of a point along the x-axis represents one variable, while its position along the y-axis represents the second variable. By mapping other aesthetics of the points to additional variables, the plot may be extended from a simple two-variable scatter plot to a multi-variable one, without adding a third axis. For example, color may be added to the scatter plot, so that the color of each point represents a third, categorical variable. Like color, the shape, size, and transparency of the points can all be mapped to other variables, extending the range of variables visible at one glance in the plot. Further layers can also be added on top of the points, such as a trend line from a linear regression, which helps show how strongly the two position variables are correlated.

However, enticing as this may seem, it is rarely advisable to plot too many variables on one plot. Too much information is often worse than no information. Good practice in data visualization is to keep the amount of information communicated per plot to a minimum. There should be only one or two main focus points in the graphic. Even a 3d scatter plot can be confusing, with too many axes for the reader to follow the axis labels or a regression line. The audience’s eyes should fall naturally on the important messages being reported about the data. They should not have to search frantically for what the plot is attempting to communicate. Still, multi-variable plotting often comes in handy, especially during data exploration. This article shows how the additional aesthetics work and leaves the decision up to the reader.

How to plot a scatter plot in ggplot2

In keeping with the previous articles (box plot, line plot, etc.), this article will use the iris dataset. This dataset is available by default within R. All that is required to access it is to refer to it by its name (“iris”). It contains four numerical variables, or features (sepal length, sepal width, petal length, and petal width), plus a categorical variable giving the species. A total of 150 observations, 50 for each species of iris, make up the dataset. Because it is small and ships with R, it is often used as a practical test dataset when trying out new packages or tools. When confronted with a new dataset, scatterplots are often helpful in giving us insights into the data.
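The structure described above can be checked directly in the console. This is a minimal sketch using only base R; the name species_counts is introduced here purely for illustration.

```r
# Inspect the built-in iris dataset (no packages required)
str(iris)                               # variable names and types

species_counts <- table(iris$Species)   # observations per species
print(species_counts)

print(dim(iris))                        # number of rows and columns
```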

When the ggplot2 package is used for plotting, the scatter plot is often the default choice. It is the go-to option for those seeking to quickly get a glimpse of, or compare, some variables in their data. The geom_*() function used for the scatter plot, geom_point(), gets its name from the points that the plot uses to display the information:

# scatterplot in R
library(ggplot2)  # load the package before plotting
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()

This outputs a nicely formatted scatter plot with black points, or dots, representing the values in the data. ggplot2 takes care of the graphical parameters, such as suitable labels for the vertical and horizontal axes and the size of the data points.

This scatterplot looks nice visually; however, it might be difficult to discern much from it during data exploration. That is because there is no apparent trend among the points. Rather than dismissing the plot based on this conclusion, further work may be done to discover any underlying patterns that may exist in the data. One such step would be to display the different categories, in this case species, on the same plot. This might reveal patterns or relationships among the different categories.
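The impression that there is no overall trend can also be checked before adding more variables. The sketch below is one possible way to do so, not the only one: it computes the Pearson correlation coefficient with base R's cor() and overlays a straight-line fit with geom_smooth(method = "lm"). The names r and p_trend are illustrative. A coefficient close to zero would support the impression of a weak linear relationship.

```r
library(ggplot2)

# Pearson correlation between the two plotted variables
r <- cor(iris$Sepal.Length, iris$Sepal.Width)
print(round(r, 2))

# Same scatter plot with a straight-line (linear model) fit drawn on top;
# se = FALSE hides the confidence band around the line
p_trend <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p_trend)
```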

The simplest way to map a third variable to a scatter plot is by using the color aesthetic. Mapping variables to plot aesthetics in ggplot2 is straightforward. It is done in aes() in the ggplot() function. In fact, this is already done twice, as x and y are both aesthetics (position aesthetics). To map a variable to the color aesthetic, use “color”:

# how to make scatter plot in r
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()

Adding a third aesthetic prompts ggplot2 to include a legend in the plot. This legend explains what the aesthetic means. Note that not every aesthetic suits every type of variable: shape, for example, cannot be mapped to a continuous variable, whereas color and size can. Here, color encodes a categorical variable, the Species variable in the iris dataset.
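The legend appears only when an aesthetic is mapped to a variable inside aes(). Setting a fixed value outside aes() changes the look of the points without adding a legend. The following sketch contrasts the two cases; the names p_mapped and p_fixed, and the choice of "blue", are illustrative.

```r
library(ggplot2)

# Mapped: color varies with Species, so ggplot2 draws a legend
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Set: every point is blue; no variable is involved, so no legend is drawn
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "blue")

print(p_mapped)
print(p_fixed)
```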

Using the color aesthetic in our scatter plot, we are immediately rewarded. The colors reveal a relationship among the variables. This is a good example of how scatter plots can help in exploratory data analysis. It can be clearly observed that the sepal length and width of the iris plants are clustered, or grouped, according to species. There is an underlying pattern, and the points are not a random mess. Many insights may be drawn from this plot, depending on the direction the analysis is to take. For example, it can clearly be observed that the setosa species always has sepal lengths of less than 6 centimeters. More exploration may be conducted using other aesthetics. This is left to the reader to explore (try: shape, alpha, size). For more about creating a box plot, line plot, pairs plot, 3d plot, histogram, or more, check out the other articles here on ProgrammingR!
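As a starting point for that exploration, the sketch below checks the observation about setosa sepal lengths against the data and then maps Species to both color and shape. The values chosen for size and alpha are arbitrary, and they are set as constants rather than mapped to variables; the names setosa_max and p_shape are illustrative.

```r
library(ggplot2)

# Check the observation about setosa sepal lengths directly from the data
setosa_max <- max(iris$Sepal.Length[iris$Species == "setosa"])
print(setosa_max)

# Map Species to both color and shape; size and alpha are fixed values
# that make overlapping points easier to see
p_shape <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                            color = Species, shape = Species)) +
  geom_point(size = 2, alpha = 0.7)
print(p_shape)
```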