When doing data science, you sometimes need to compare the theoretical values from a model to actual data. A normal probability plot is just such a comparison. In this case, the theoretical model is a normal probability distribution, the bell-shaped pattern that many kinds of independent random measurements tend to follow. This makes it an excellent tool for figuring out whether or not your data is approximately normally distributed.

### Typical Approaches To Plotting Probability in R

R provides five base functions involved with plotting probability distributions, and the ggplot2 package supplies a sixth possibility through its ggplot function. The dnorm function has the format dnorm(x), where “x” is the vector of values being evaluated, and it returns the normal density at each of those values (a standard normal with mean 0 and standard deviation 1, unless you supply the mean and sd arguments). The plot function has the basic format plot(x, y), where “x” and “y” are two vectors serving as plotting coordinates. Combined with the results of dnorm, it produces a plot of the normal density curve over the range of your data.

The qqnorm function has the format qqnorm(x), where “x” is the data set being evaluated. It is the standard base function for drawing a normal probability plot, also known as a normal quantile-quantile plot or QQ plot. The qqline function has the format qqline(x), where “x” is the vector containing the data being evaluated, and it adds a reference line to your QQ plot that by default passes through the first and third quartiles. The qqplot function has the format qqplot(x, y), where “x” and “y” are the two data sets being compared. You can also use the ggplot function from the ggplot2 package to draw a QQ plot for data contained in a data frame. Together, these functions give you flexible tools for evaluating your data.
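As a small illustration of what dnorm returns, the sketch below evaluates the normal density on a grid of values rather than on real data. The grid and the second curve's mean and standard deviation are arbitrary choices made for this example.

```r
# Evaluate the normal density at a few evenly spaced points.
x <- seq(-3, 3, by = 0.5)
dens_std <- dnorm(x)                     # standard normal: mean 0, sd 1
dens_alt <- dnorm(x, mean = 1, sd = 2)   # a different normal curve (arbitrary values)

# The standard normal density peaks at 0 with height 1 / sqrt(2 * pi).
peak <- dnorm(0)

# Draw the standard curve as a solid line and the alternative as a dashed line.
plot(x, dens_std, type = "l", lwd = 2, ylab = "Density")
lines(x, dens_alt, lty = 2)
```

Note that dnorm does not estimate anything from your data; it only evaluates the density of the normal distribution you describe through its arguments.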

### Specific Implementations: Normal Probability Plot in R

Regardless of the exact approach, the basic process for creating a normal probability plot is the same, even though the commands differ. For a density curve, dnorm evaluates the normal density at each value in the data set, and plot draws those heights against the sorted values. For a QQ plot, the data is sorted and paired with the quantiles that a normal distribution predicts for a sample of the same size; the theoretical quantiles form one axis of the graph and the actual values form the other. If the data is close to normal, the points fall close to a straight line. The QQ plot is simply a comparison between a theoretical and an actual data set where the theoretical one is a normal distribution, and you can use the same type of graph to compare real-world data to any theoretical model that you want. That is where the plot, qqplot, and ggplot functions come in handy. In each of these cases, if you are comparing your data set to a normal distribution, the results are essentially the same; the functions simply display them differently or supply additional information.
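To make the behind-the-scenes step concrete, here is a minimal sketch that builds the two axes of a QQ plot by hand and checks them against what qqnorm computes. The fixed seed and the variable names are assumptions chosen for this illustration.

```r
set.seed(42)  # fixed seed, chosen only so the sketch is reproducible
x <- rnorm(100)

n <- length(x)
probs <- ppoints(n)           # plotting positions strictly between 0 and 1
theoretical <- qnorm(probs)   # quantiles expected under a standard normal
actual <- sort(x)             # observed values in increasing order

# Plot theoretical quantiles against the sorted data.
plot(theoretical, actual,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")

# qqnorm returns the same pairs, in the original order of the data.
qq <- qqnorm(x, plot.it = FALSE)
```

Sorting qq$x and qq$y recovers the theoretical and actual vectors built above, which shows that qqnorm is a convenience wrapper around this pairing.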

### Examples – Normal Probability Plot in R

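Here we have six examples of code that deal with the process of producing a normal probability plot. They include various aspects of the process and the functions that are a part of it. Each example seeds the random number generator from the system clock, so the plots differ from run to run; pass a fixed number to set.seed instead if you want reproducible output.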

> t = as.numeric(Sys.time())

> set.seed(t)

> x = rnorm(100)

> x = sort(x)

> y = dnorm(x)

> plot(x, y, type = "l", lwd = 2)

This first example simply illustrates the process of plotting the standard normal density curve, evaluated at the sorted values of a random vector.

> t = as.numeric(Sys.time())

> set.seed(t)

> x = rnorm(100)

> qqnorm(x)

> qqline(x)

This example illustrates the production of a simple normal probability plot. It uses the most basic form of the qqnorm function.

> t = as.numeric(Sys.time())

> set.seed(t)

> mu = 10

> sigma = 5

> x = rnorm(100, mean = mu, sd = sigma)

> qqnorm(x)

> qqline(x)

This example illustrates the production of a simple normal probability plot with a non-zero mean and a standard deviation that is not equal to one. It also uses the most basic form of the qqnorm function.

> t = as.numeric(Sys.time())

> set.seed(t)

> x = rnorm(100)

> qqnorm(x, main = "Normal Probability Plot", xlab = "Normal", ylab = "Data")

> qqline(x, col = "red")

This example illustrates the production of a simple normal probability plot, with extra arguments added to the qqnorm and qqline functions to show additional features. These features include naming the plot and both of the axes, along with selecting a color for the reference line.

> t = as.numeric(Sys.time())

> set.seed(t)

> x = rnorm(100)

> y = rnorm(100)

> qqplot(x,y)

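> abline(0, 1, col = "red")

This example illustrates using the qqplot function to compare two random vectors. Note that there is both an x and y in this function. Because both samples come from the same distribution, the points should fall near the line y = x, which abline(0, 1) draws. The qqline function is not used here because it compares x against normal quantiles rather than against y.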

> library("ggplot2")

> t = as.numeric(Sys.time())

> set.seed(t)

> x = rnorm(100)

> df = data.frame(x)

> ggplot(df, aes(sample = x)) + stat_qq() + stat_qq_line(colour = "red")

In this example, we produce a normal probability plot using the ggplot function from the ggplot2 package.

### Application: Normal Probability Plot in R

The main application of a normal probability plot is to show whether or not data is approximately normally distributed. This is important because a set of measurements that departs markedly from a normal distribution suggests that something more than independent random variation is at work. Such departures can not only expose fraudulent data but also suggest other hypotheses explaining the data points.
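The following sketch shows what such a departure can look like by placing a normal sample next to a right-skewed one. The exponential sample, the fixed seed, and the qq_correlation helper are all assumptions introduced for this illustration; the correlation is only a rough numeric summary of how straight the points are, not a formal test of normality.

```r
set.seed(1)  # fixed seed, an assumption made only for reproducibility
normal_data <- rnorm(200)   # should track the reference line
skewed_data <- rexp(200)    # right-skewed, so the points should curve away

# Hypothetical helper: correlation between theoretical and sample quantiles.
qq_correlation <- function(v) {
  qq <- qqnorm(v, plot.it = FALSE)
  cor(qq$x, qq$y)
}

r_normal <- qq_correlation(normal_data)
r_skewed <- qq_correlation(skewed_data)

# Draw the two plots side by side for a visual comparison.
old_par <- par(mfrow = c(1, 2))
qqnorm(normal_data, main = "Normal sample")
qqline(normal_data, col = "red")
qqnorm(skewed_data, main = "Skewed sample")
qqline(skewed_data, col = "red")
par(old_par)
```

In the skewed panel the points bend away from the red line at the upper end, and r_skewed comes out lower than r_normal, which matches the visual impression.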

Generating a normal probability plot is a handy way of testing data. The process can compare data not only to a normal distribution but to other models as well. It is a useful tool to master for data science, and one worth learning in R.
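As one possible illustration of comparing data to a different model, the sketch below checks a sample against an exponential distribution instead of a normal one. The choice of distribution, the rate of 2, and the fixed seed are assumptions made for this example; in practice the rate would usually have to be estimated from the data.

```r
set.seed(7)  # fixed seed, chosen only so the sketch is reproducible
x <- rexp(200, rate = 2)

# Quantiles that an exponential distribution with the same rate predicts.
n <- length(x)
theoretical <- qexp(ppoints(n), rate = 2)

# qqplot sorts both inputs and plots them against each other.
qq <- qqplot(theoretical, x,
             xlab = "Exponential Quantiles", ylab = "Data")
abline(0, 1, col = "red")  # points near this line indicate a good fit
```

The same pattern should work for other models by swapping qexp for the matching quantile function, such as qunif or qgamma.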