Making Data Plots Sexy
Outside data teams, data analytics is often seen as a specialized field in which highly trained experts draw on extensive skill sets to produce professional-level plots and diagrams for boardroom presentations. In truth, most of the work of creating elegant data visualizations is handled by programs and functions that the data science community makes freely available. The ggpubr package for R is a library of functions for creating publication-ready plots with little to no advanced R programming skill.
As we dig into ggpubr, we’ll look at how the package turns values into box plots, dot plots, and violin plots, along with other visual elements such as tables and text labels, to make clearly readable visualizations. Along the way there are samples of the code discussed, plus instructions for placing different visuals on one page so you can compare, say, a dot chart and a density plot without flipping back and forth. Let’s look at how to take data from variable to visual, and then start practicing with the resources provided at the end.
Starting with ggplot2 in R
The tidyverse library ggplot2 is a whole bible of graphics for RStudio users. Once it is set up and running, users pass their data set to ggplot2 and build up a plot by adding layers, scales, and coordinate systems with functions such as these:
coord_flip() – <i>Coordinate System</i>
scale_colour_brewer() – <i>Scales</i>
geom_point() – <i>Scatter Plots</i>
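As a rough illustration of how these pieces combine, the sketch below builds one plot from the built-in mtcars data set using all three functions. The choice of columns and of the "Set1" palette is an assumption made for the example, not something ggplot2 prescribes.

```r
library(ggplot2)

# Map weight and fuel economy to the axes, and cylinder count to colour
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +                    # layer: scatter plot points
  scale_colour_brewer(palette = "Set1") +   # scale: ColorBrewer palette (assumed choice)
  coord_flip()                              # coordinate system: swap the x and y axes

print(p)
```

Each `+` adds one component, so any of the three lines can be removed or swapped without touching the others.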
To install ggpubr specifically, use the install.packages("ggpubr") command, then load it with library(ggpubr). From there it’s easy to call on its library of plotting functions for different kinds of data visuals.
Here’s one sample for a scatter plot:
library(ggpubr)
# df is assumed here to be the built-in mtcars data set
df <- mtcars
ggscatter(df, x = "wt", y = "mpg",
          color = "black", shape = 21, size = 3, # Point color, shape and size
          add = "reg.line", # Add regression line
          add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
          conf.int = TRUE, # Add confidence interval
          cor.coef = TRUE, # Add correlation coefficient. See ?stat_cor
          cor.coeff.args = list(method = "pearson", label.x = 3, label.sep = "\n")
)
This plots the given values and layers on a regression line, its confidence interval, and the correlation coefficient. What would otherwise be a plain scatter of points gains depth with each additional parameter. When the plot is drawn, R also prints the message `geom_smooth()` using formula 'y ~ x', which simply reports the formula used to fit the regression line.
To show how these functions are used, let’s take a data sample that needs sorting. For this example, we’ll compare fuel economy across car models and see which ones rank worst. We’ll load the built-in “mtcars” data set and start graphing it in different formats.
We’re looking at fuel efficiency in relation to the number of cylinders, so we’ll add a column holding each car’s name and pull out the columns we need.
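The preparation step described above might look like the following sketch. It assumes the working copy is called dfm (the name used in the rest of this section) and that the cylinder count is converted to a factor so it can act as a fill category. Both are assumptions for illustration rather than requirements of ggpubr.

```r
# Work on a copy of the built-in data set
data("mtcars")
dfm <- mtcars

# Treat the number of cylinders as a category (assumed, so it can drive fill colors)
dfm$cyl <- as.factor(dfm$cyl)

# Car names are stored as row names; copy them into a regular column
dfm$name <- rownames(dfm)
```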
head(dfm[, c("name", "wt", "mpg", "cyl")])
Now that we have our reference data, we’ll plot it using functions from the ggpubr package.
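ggbarplot(dfm, x = "name", y = "mpg",
          fill = "cyl",            # Fill color by number of cylinders
          color = "white",         # White bar borders
          palette = "jco",         # jco journal color palette
          sort.val = "desc",       # Sort values in descending order
          sort.by.groups = FALSE,  # Don't sort inside each group
          x.text.angle = 90        # Rotate x axis text vertically
)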
Notice the long list of arguments, which control everything about the bar graph from the fill color of each category to the angle of the axis text. Though it may seem tedious, this level of control lets you adapt your graphs to your needs and to whichever factors you want to emphasize. Once you learn to use these parameters to full effect, you can create graphs that hold your audience’s attention.
Don’t forget that you can show more advanced values and calculations, with the functions doing the bulk of the work. Let’s use our toolset to calculate each car’s mpg z-score from our data set and draw a bar plot of the scores.
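As one hedged illustration of that control, the sketch below changes two arguments of the bar plot call so that bars are sorted in ascending order within each cylinder group. The first lines repeat an assumed setup of dfm (a copy of mtcars with a name column and cyl as a factor) so the block runs on its own.

```r
library(ggpubr)

# Assumed setup: copy of mtcars with a name column and cyl as a factor
dfm <- mtcars
dfm$cyl <- as.factor(dfm$cyl)
dfm$name <- rownames(dfm)

p_grouped <- ggbarplot(dfm, x = "name", y = "mpg",
                       fill = "cyl",
                       color = "white",
                       palette = "jco",
                       sort.val = "asc",        # ascending instead of descending
                       sort.by.groups = TRUE,   # sort inside each cyl group
                       x.text.angle = 90)
print(p_grouped)
```

Only sort.val and sort.by.groups differ from the earlier call, yet the chart now reads as three separate rankings, one per cylinder count.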
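# z-score: how many standard deviations each car's mpg lies from the mean
dfm$mpg_z <- (dfm$mpg - mean(dfm$mpg)) / sd(dfm$mpg)
# Label cars below the mean as "low" and the rest as "high"
dfm$mpg_grp <- factor(ifelse(dfm$mpg_z < 0, "low", "high"),
                      levels = c("low", "high"))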
We’ll inspect the data and then set the parameters for a bar plot of the z-scores calculated by the first lines of code, sorted in ascending order.
head(dfm[, c("name", "wt", "mpg", "mpg_z", "mpg_grp", "cyl")])
ggbarplot(dfm, x = "name", y = "mpg_z",
          fill = "mpg_grp",            # Fill color by mpg group
          color = "white",             # White bar borders
          palette = "jco",             # jco journal color palette
          sort.val = "asc",            # Sort values in ascending order
          sort.by.groups = FALSE,      # Don't sort inside each group
          x.text.angle = 90,           # Rotate x axis text vertically
          ylab = "MPG z-score",
          xlab = FALSE,
          legend.title = "MPG Group"
)
Multiple Graphs on One Page
Now that you are more familiar with the ggpubr library and its commands, let’s take some publication-ready plots and put them all in one space. ggpubr’s own ggarrange() function handles this. Alternatively, load the <i>gridExtra</i> R package and use its grid.arrange() and <i>arrangeGrob()</i> functions to place different plots on the same page.
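The arrangement code in this section refers to plot objects such as bxp and dp without defining them. One way to create a box plot and a dot plot to experiment with is sketched below. The ToothGrowth data set, the chosen columns, and the extra scatter plot sp are assumptions made for illustration.

```r
library(ggpubr)

# Assumed example data: built-in ToothGrowth, with dose treated as a category
tg <- ToothGrowth
tg$dose <- as.factor(tg$dose)

# Box plot of tooth length by dose
bxp <- ggboxplot(tg, x = "dose", y = "len",
                 color = "dose", palette = "jco")

# Dot plot of the same values
dp <- ggdotplot(tg, x = "dose", y = "len",
                color = "dose", palette = "jco", binwidth = 1)

# A simple scatter plot that can fill a full-width row
sp <- ggscatter(mtcars, x = "wt", y = "mpg")
```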
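# sp, bxp and dp are assumed to be an existing scatter plot, box plot and dot plot
ggarrange(sp,                                                  # First row: plot "A"
          ggarrange(bxp, dp, ncol = 2, labels = c("B", "C")),  # Second row: plots "B" and "C"
          nrow = 2,
          labels = "A")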
When using grid.arrange() and arrangeGrob(), be aware that neither will align the axes of the plots on the page. If the plots need to be aligned on their axes, load the cowplot package and use its plot_grid() function to get the placement you want. ggarrange() offers the same control through its align argument.
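A minimal sketch of axis alignment with cowplot follows. The two plots and their names are hypothetical and built with ggplot2 only so the block stands on its own. The second plot deliberately has wider y-axis labels, which is the situation where unaligned panels stop lining up.

```r
library(ggplot2)
library(cowplot)

# Two plots whose y-axis labels differ in width
p_top <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_bottom <- ggplot(mtcars, aes(wt, disp * 1000)) + geom_point()

# align = "v" lines the panels up vertically; axis = "lr" matches left and right edges
aligned <- plot_grid(p_top, p_bottom, ncol = 1, align = "v", axis = "lr")
print(aligned)
```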
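library(gridExtra)
grid.arrange(sp,                              # First row: one plot spanning both columns (sp assumed to exist)
             arrangeGrob(bxp, dp, ncol = 2),  # Second row: two plots side by side
             nrow = 2)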
With functions from both packages, you can enrich your page with more detail and formatting using just a few straightforward functions. Once the spacing of the page is defined, whether with grid.layout() from the grid package or with the widths and heights arguments of ggarrange(), the plots and visuals you want can be shown in full detail. Take a scatter plot with marginal density plots, for example:
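# Scatter plot colored by species
sp <- ggscatter(iris, x = "Sepal.Length", y = "Sepal.Width",
                color = "Species", palette = "jco",
                size = 3, alpha = 0.6) +
  border()                       # Draw a border around the plot panel
# Marginal density plots of x (top panel) and y (right panel)
xplot <- ggdensity(iris, "Sepal.Length", fill = "Species",
                   palette = "jco")
yplot <- ggdensity(iris, "Sepal.Width", fill = "Species",
                   palette = "jco") +
  rotate()                       # Flip so the density runs along the y axis
# Strip axes and labels from the marginal plots
yplot <- yplot + clean_theme()
xplot <- xplot + clean_theme()
# Arranging the plot
ggarrange(xplot, NULL, sp, yplot,
          ncol = 2, nrow = 2, align = "hv",
          widths = c(2, 1), heights = c(1, 2),
          common.legend = TRUE)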
Using the iris data set, these functions draw a scatter plot of sepal length against sepal width, with points grouped and colored by species. The resulting chart shows the scatter plot along with marginal density plots for both sepal length and sepal width. With the right functions, you can annotate your data in many ways and make professional-level presentations. If your audience needs more detail, try adding features such as a reference table, a clear graphical model, and a paragraph of text describing your conclusions about the values being shown.
# Density plot of sepal length
density.p <- ggdensity(iris, x = "Sepal.Length",
                       fill = "Species", palette = "jco")
# Descriptive statistics of sepal length by species
stable <- desc_statby(iris, measure.var = "Sepal.Length",
                      grps = "Species")
stable <- stable[, c("Species", "length", "mean", "sd")]
# Draw the summary table with the medium orange theme
stable.p <- ggtexttable(stable, rows = NULL,
                        theme = ttheme("mOrange"))
# Draw the text paragraph
text <- paste("iris data set gives the measurements in cm",
              "of the variables sepal length and width",
              "and petal length and width, respectively,",
              "for 50 flowers from each of 3 species of iris.",
              "The species are Iris setosa, versicolor, and virginica.", sep = " ")
text.p <- ggparagraph(text = text, face = "italic", size = 11, color = "black")
# Stack the plot, table and text on one page
ggarrange(density.p, stable.p, text.p,
          ncol = 1, nrow = 3,
          heights = c(1, 0.5, 0.3))
These features come in handy when you present information to convince an audience of your hypothesis and conclusion. If you want to convince your superiors of a product’s popularity, use a clear graphic showing the spikes in purchases for your chosen parameters and watch your proposal gain followers.
Where To Go From Here
Now that you’ve got your head around ggpubr and its functions, look at where you can apply these new skills. In a business environment, communication is one of the biggest contributors to productivity. As advanced data collection has come to drive company-wide decisions, confusion has grown among colleagues unfamiliar with the language of data, and that breakdown in communication can cause major delays on vital projects. With a quick install and some input data sets, you can now give your coworkers a guided tour of advanced data analytics through visual graphics. Make annotated scatter plots of the year’s sales figures and bring out the success factors that would otherwise stay buried in code. With this R library at your disposal, the data side of the business can reach its less technical side and connect teams in a new way through your publication skills.
To learn more and try some hands-on coding, try these sources:
Source Code: rpkgs.datanovia.com/ggpubr/
A Look at ggplot and some of its challenges: https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html#Plotting_with_ggplot2
For more illustrations on data: https://www.r-graph-gallery.com/ggplot2-package.html