Though people often see only the final visuals of a data science project, most of the time and effort a data scientist puts into his work goes into the data exploration phase. This is when he examines the data in an attempt to understand it and grasp its underlying features. Data visualization plays as crucial a role in this phase as it does in the final one. The easier it is for the data scientist to visualize raw data, and the more readily he can view any aspect of it he desires, the easier his job becomes. When you first approach a new dataset, many possibilities go through your head. You need a set of tools that allows you to focus on exploring those possibilities without being slowed down by the logistics of how those explorations take place. The package we are learning about in this series, ggplot2, does just that. Besides being a publication-ready plot-rendering tool, it is an excellent tool for data exploration.
One of the challenges a data scientist faces when first confronted with a new dataset is understanding what the dataset is and which key features and patterns are present within it. He needs to understand which parts of the data need manipulation, which features are most likely to be useful, what relationships exist between different parts of the data, and anything else he can figure out. In order to do this, he needs to view his data from different “angles”, so to speak. This means he needs to be able to “cut” and “slice” the data in any way he desires. Not only that, he needs to be able to do this on the spot and visualize the result instantly, both to gain an intuitive understanding of the data and to spot anything useful that would be impossible to see through mere data processing. This can be achieved through facets in ggplot2.
Facets allow you to split and view your data in all sorts of ways. ggplot2 provides two functions for creating facets from a dataset: facet_grid() and facet_wrap(). In this tutorial, we will focus on facet_wrap(). These functions allow us to split our data along any variable or variables and view the resulting subsets all at once. Because the split happens while the plot is being generated, we don’t have to waste time or code filtering or selecting certain features, assigning them to data frames, and plotting them individually, only to realize that we should have split along different variables. In the data exploration phase of a project, one often needs to try many different combinations, and doing this manually is very time-consuming. facet_wrap() lets you choose the variables to split your data by as you plot it. The variable you want to split by is passed to facet_wrap() as an argument, so viewing the data split by a different variable is a matter of changing only that one argument.
We will continue using the iris dataset to learn about ggplot2. For those unfamiliar with the famous Iris flower data set, it contains five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Any data scientist can immediately tell from the variable names that the first four are numeric values (measurements) and that the fifth ought to be categorical. Let us start exploring this dataset by first viewing a scatterplot of sepal lengths and widths:
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
You will immediately notice that this is not very useful. All the data points are jumbled together into one big mess, and it is very hard to ascertain much from the plot. One solution is color-coding. Let us color-code the species and plot again:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point()
This is an improvement on our first plot. We start noticing evident groups within the dataset. These signify that individuals of the same species have similar characteristics (at least when it comes to their sepal lengths and widths). However, it is still a bit confusing. Some of the species (versicolor and virginica) seem to have overlapping and closely related characteristics. We would like to view each species on its own in order to better see its characteristics without interference from the other species. This is where facets and facet_wrap() come in handy.
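For comparison, here is a minimal sketch of the manual route that facets spare us: subsetting a single species and plotting it on its own. The names setosa_only and p_setosa are ours, chosen only for illustration.

```r
library(ggplot2)

# Manual alternative to faceting: keep only the rows for one species...
setosa_only <- subset(iris, Species == "setosa")

# ...then build a separate plot from that subset.
p_setosa <- ggplot(setosa_only, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
p_setosa
```

Repeating this for versicolor and virginica would take two more subsets and two more plots, and the three plots would not share axes unless we set the limits ourselves.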
facet_wrap() is added onto a ggplot2 plot in the same manner we have been adding on theme() and geom_*() functions to plots, in accordance with ggplot2’s style. It takes as its main argument the variable you desire to split your data by. In our case, this is the Species variable. In order to inform ggplot2 that you want to split your data “by” this variable, you use a tilde (~). You can think of this tilde as meaning “by”. Therefore, facet_wrap( ~ Species) may be read as “split data by Species”. Let us see this in action:
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point() + facet_wrap( ~ Species)
The result is a “faceted” view of our data. We see three miniplots, each showing the subset of our data for one value of the Species variable. In our case, the variable we are using to split the data has only three categories. However, when the splitting variable has more categories, you may wish to control how the miniplots are arranged. You can achieve this with the nrow or ncol arguments, which set the number of rows or columns in the layout.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + geom_point() + facet_wrap( ~ Species, nrow = 2)
This time, the miniplots are laid out in two rows rather than one: two in the first row and the third in the second. More information on facet_wrap() and its arguments is available in the R documentation (?facet_wrap or help(facet_wrap)).
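To illustrate the earlier point that switching the split is a matter of changing one argument, the sketch below facets by a different variable. Since iris has only one categorical variable, we assume a hypothetical helper column, Petal.Size, created by binning Petal.Length into three equal-width ranges; it is not part of the original dataset, and the names iris_binned and p_binned are ours as well.

```r
library(ggplot2)

# Hypothetical helper column (not part of iris): bin petal length
# into three equal-width ranges and label them.
iris_binned <- iris
iris_binned$Petal.Size <- cut(
  iris_binned$Petal.Length,
  breaks = 3,
  labels = c("short", "medium", "long")
)

# Same plot as before; only the faceting variable changes,
# and ncol is used instead of nrow to fix the number of columns.
p_binned <- ggplot(iris_binned, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~ Petal.Size, ncol = 3)
p_binned
```

Because the points are still colored by Species, each miniplot also shows which species fall into each petal-length range.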