In data science, you always need to verify that the results behind your claims are sound. Working out p-values for a plot gets tricky as you run more tests and end up with more groups to validate, both pair by pair and as a whole, to see whether your hypothesis still holds. Here we’ll look at how to ease that chore with a function that compares the means of <i>all</i> the groups in a plot and prints the resulting p-values directly on the graph.
The stat_compare_means function in R builds on the more general compare_means function, which runs mean-comparison tests on the groups named in a formula and returns the p-values and significance levels as a data frame. The stat_ version takes those same p-values and significance levels and adds them as labels to a ggplot graph. It can be added to any plot built with the ggpubr package. The basic call is fairly simple:
stat_compare_means(mapping = NULL, comparisons = NULL, hide.ns = FALSE,
                   label = NULL, label.x = NULL, label.y = NULL, ...)
Here we see the main arguments spelled out: comparisons takes a list of two-element vectors naming the groups to compare, hide.ns hides the "ns" (not significant) label when it would otherwise appear, label chooses what is printed (for example "p.format" for the p-value or "p.signif" for significance stars), and label.x and label.y set where the label sits along the x and y axes. Deciding on these settings in advance helps keep your plots consistent as you run more tests and build more graphs, each needing its own p-values.
Moving on to testing, we’ll show how to compare two groups against each other and see how compare_means and stat_compare_means report the result on one graph.
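To make those arguments concrete, here is a minimal sketch. It assumes the ggpubr package is installed and uses R's built-in ToothGrowth data set as a stand-in (it is not the data used in this article); the label position of x = 1.5, y = 40 is an arbitrary choice for illustration.

```r
library(ggpubr)

# ToothGrowth ships with R: len (numeric), supp (factor: OJ, VC), dose (numeric).
# It is used here only as a stand-in data set.
p_base <- ggboxplot(ToothGrowth, x = "supp", y = "len")

# Print significance stars instead of the default "method, p = value" label,
# and place the label at x = 1.5, y = 40 (in data units; arbitrary values).
p_labeled <- p_base +
  stat_compare_means(label = "p.signif", label.x = 1.5, label.y = 40)

# Printing the object draws the plot.
print(p_labeled)
```

Leaving label.x and label.y at NULL lets the function pick a position on its own, which is usually fine for a first look.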
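library(ggpubr)
# CommunityDevelopGrowth is assumed to be a data frame with a two-level
# grouping column "years" and a numeric column "change"
compare_means(change ~ years, data = CommunityDevelopGrowth)
p <- ggboxplot(CommunityDevelopGrowth, x = "years", y = "change",
               color = "years", palette = "jco",
               add = "jitter")
p + stat_compare_means()                   # default method: Wilcoxon test
p + stat_compare_means(method = "t.test")  # switch to a t-test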
We’ve now set up a box plot of “change” for each “years” group, with a p-value printed on it for the difference between the two groups. The first call uses the default Wilcoxon test; the second switches to a t-test.
Now let’s look at comparing more than two groups. If we want specific pairs marked on the chart, we tell stat_compare_means which named groups to compare:
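If you want to see the numbers behind the label before plotting, compare_means returns them as a data frame. The sketch below again assumes ggpubr is installed and swaps in the built-in ToothGrowth data set, comparing tooth length (len) between the two supplement groups (supp).

```r
library(ggpubr)

# Default method is the Wilcoxon test; one row per pair of groups.
res_wilcox <- compare_means(len ~ supp, data = ToothGrowth)

# Same comparison with a t-test instead.
res_t <- compare_means(len ~ supp, data = ToothGrowth, method = "t.test")

# Each result names the two groups, the p-value, and a significance symbol.
print(res_wilcox[, c("group1", "group2", "p", "p.signif", "method")])
print(res_t[, c("group1", "group2", "p", "p.signif", "method")])
```

Checking the table first is a quick way to confirm that the label on the plot is reporting the test you intended.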
my_comparisons <- list(c("x1", "y3"), c("xa", "yb"), c("x2b", "y4c"))
and while we can leave it at these pairwise comparisons, we can also add a global p-value that tests all the groups together:
p + stat_compare_means(comparisons = my_comparisons, label.y = c(1, 2, 3)) +
  stat_compare_means(label.y = 10)
In the first call, label.y gives the height of each comparison bracket, one value per pair. The last line adds the global p-value across all the groups in our data frame, and its label.y places that label at y = 10. The global test is a useful check alongside the pairwise results, but keep in mind you will rarely need to be that exact in your testing to require this level of validation.
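For a version you can run as-is, the following sketch applies the same pattern to the built-in ToothGrowth data set, which has three dose groups (0.5, 1 and 2). The data set and the label height of 50 are stand-ins chosen for illustration, and ggpubr is assumed to be installed.

```r
library(ggpubr)

# Pairs of dose groups to compare; the names must match the x-axis labels.
my_comparisons <- list(c("0.5", "1"), c("1", "2"), c("0.5", "2"))

p_dose <- ggboxplot(ToothGrowth, x = "dose", y = "len") +
  # One bracket with a p-value for each listed pair
  stat_compare_means(comparisons = my_comparisons) +
  # Global p-value across all three groups, placed at y = 50
  stat_compare_means(label.y = 50)

print(p_dose)
```

With more than two groups, the second call reports a Kruskal-Wallis test by default; passing method = "anova" is the usual parametric alternative.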
We’ll end with an issue other programmers have run into with stat_compare_means. One often-cited problem is that, when adding significance levels for specified comparisons, the function does not seem to honor the hide.ns argument. One workaround is to calculate the values on your own and add them to the plot with the geom_signif function from the ggsignif package. If you would rather not rewrite the call that way, you can single out the values you want plotted by making them a subset of your original data source and running that subset through stat_compare_means instead. Programming your ggplots can often seem arduous, with all the nitpicking over command lines. Keep in mind that the best fix is usually the simplest.
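The geom_signif workaround could look something like the sketch below. It is one possible approach under stated assumptions, not a fix taken from any particular bug report: it uses the built-in ToothGrowth data set as a stand-in, assumes ggpubr and ggsignif are installed, and treats 0.05 as the cutoff (the alpha variable is introduced here for illustration).

```r
library(ggpubr)
library(ggsignif)

alpha <- 0.05  # assumed significance cutoff

# Compute every pairwise p-value up front.
pairs <- compare_means(len ~ dose, data = ToothGrowth)

# Keep only the pairs that pass the cutoff; this is where "ns" pairs drop out.
sig <- pairs[pairs$p < alpha, ]

# Turn the remaining rows into the list-of-pairs form geom_signif expects.
sig_comparisons <- unname(Map(c, sig$group1, sig$group2))

p_sig <- ggboxplot(ToothGrowth, x = "dose", y = "len") +
  geom_signif(comparisons = sig_comparisons,
              annotations = sig$p.signif,  # precomputed labels, no test is rerun
              step_increase = 0.1)         # stack the brackets so they do not overlap

print(p_sig)
```

Because the labels are supplied through annotations, geom_signif only draws what you hand it, so filtering the table is all it takes to keep non-significant pairs off the plot.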