What does n = n() mean in R? (Unlock dplyr's capabilities)

Sometimes when doing data science, you need to keep track of how many observations you have. The n() function is a tool for this job. It is part of the dplyr package, which must be loaded before the function can be used.

Description

The n() function is part of the dplyr package. It takes no arguments and returns the number of observations in the current group. It is commonly written as n = n(), which stores that count in a column named n. The count can be displayed, but it is more often used in later processing steps, for example to check that a group has enough observations before it is analyzed. The value is group-specific, and n() has no meaning outside a dplyr function that supplies a group. These limits do not prevent it from being a useful tool.
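To make the description concrete, here is a minimal sketch of n = n() on a small made-up data frame. The scores data and its column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: three rows for team "a", one row for team "b"
scores <- data.frame(
  team  = c("a", "a", "a", "b"),
  score = c(10, 12, 9, 20)
)

# n() takes no arguments; it returns the number of rows in the current group.
# Writing n = n() stores that count in a column called n.
group_sizes <- scores %>%
  group_by(team) %>%
  summarise(n = n())

print(group_sizes)
```

The result has one row per team, with n equal to 3 for team "a" and 1 for team "b".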

How It Works Within the Mutate Function

The n() function is designed to be called inside dplyr verbs, chiefly the summarise, mutate and filter functions. The n = n() form is used within mutate, where it adds the count as a column on every row, and within summarise, where it produces one count per group. In all three cases, n() supplies the number of observations in the current data group. Within filter, n() is compared against a threshold to keep or drop whole groups by size. For example, the condition n() >= 100 keeps only the groups that have at least a hundred observations, while n() < 100 keeps only the smaller groups.
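The difference between the mutate and filter uses can be sketched on a toy data frame. The orders data and the min_rows threshold are assumptions made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
orders <- data.frame(
  customer = c("ann", "ann", "ann", "bob", "cy", "cy"),
  amount   = c(5, 7, 3, 10, 4, 6)
)

by_customer <- group_by(orders, customer)

# mutate() keeps every row and repeats the group's size on each of them
with_counts <- mutate(by_customer, n = n())

# filter() keeps only the rows whose group has at least min_rows observations
min_rows <- 2  # assumed threshold
large_groups <- filter(by_customer, n() >= min_rows)

print(with_counts)
print(large_groups)
```

Here with_counts still has all six rows, while large_groups drops the single "bob" row because that group has only one observation.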

Examples

Here is an example of the n() function used in all three functions. It requires both the dplyr package and the nycflights13 package, and it prints some of the flights data from nycflights13.

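> library(dplyr)
> library(nycflights13)
> if (require("nycflights13")) {
+ # Group the flights by airline carrier
+ carriers = group_by(flights, carrier)
+ # One row per carrier with its flight count (result not printed)
+ summarise(carriers, n())
+ # Every flight plus a column n holding its carrier's count (not printed)
+ mutate(carriers, n = n())
+ # Flights of carriers with fewer than 100 flights; as the last
+ # expression in the block, this is the only result printed
+ filter(carriers, n() < 100)}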
# A tibble: 32 x 19
# Groups: carrier [1]
year month day dep_time sched_dep_time dep_delay arr_time
1 2013 1 30 1222 1115 67 1402
2 2013 11 3 1424 1430 -6 1629
3 2013 11 10 1443 1430 13 1701
4 2013 11 17 1422 1430 -8 1610
5 2013 11 25 1803 1759 4 2011
6 2013 11 30 1648 1647 1 1814
7 2013 6 15 1626 1635 -9 1810
8 2013 6 22 1846 1635 131 2107
9 2013 8 27 1755 1805 -10 1956
10 2013 8 28 2039 1805 154 2213
# … with 22 more rows, and 12 more variables:

As you can see, all three functions are called, but they do not combine into one result. Because the calls sit inside a block in braces, R prints only the value of the last expression, so the table displayed here is the output of filter alone: the 32 flights of the one carrier with fewer than 100 flights. The data concerns flights departing from New York City in 2013. It is a fairly simple process, since each of the functions being used takes only two arguments.
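To inspect the result of each function rather than only the last one, each call can be assigned to its own name and printed. The sketch below swaps in R's built-in mtcars data, grouped by its cyl column, so that it runs without nycflights13; the cutoff of 10 is an arbitrary choice for illustration.

```r
library(dplyr)

# mtcars ships with R; group its 32 cars by number of cylinders
by_cyl <- group_by(mtcars, cyl)

counts <- summarise(by_cyl, n = n())  # one row per group with its size
tagged <- mutate(by_cyl, n = n())     # all rows, plus a column n
small  <- filter(by_cyl, n() < 10)    # only groups with fewer than 10 rows

print(counts)
print(tagged)
print(small)
```

With mtcars, the group sizes are 11, 7 and 14 for 4, 6 and 8 cylinders, so the filter keeps only the seven 6-cylinder cars.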

Application

The primary application of the n() function is to keep track of the number of observations being processed within the summarise, mutate and filter functions of the dplyr package. A common goal is to make sure that enough observations are being processed for the results to be statistically meaningful. For example, if you are conducting an election poll, asking only five people is probably not enough to draw a reliable conclusion, but asking a thousand people might be.
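The poll scenario can be sketched as a minimum-group-size check before summarising. The poll data, the region names and the min_responses cutoff are all hypothetical; the cutoff is an arbitrary illustration, not a statistical rule.

```r
library(dplyr)

# Hypothetical poll: 5 responses from "north", 1000 from "south"
poll <- data.frame(
  region  = rep(c("north", "south"), times = c(5, 1000)),
  support = c(rep(1, 5), rep(c(1, 0), times = 500))
)

min_responses <- 100  # assumed cutoff, chosen only for this example

reliable <- poll %>%
  group_by(region) %>%
  filter(n() >= min_responses) %>%  # drop groups that are too small
  summarise(responses = n(), share = mean(support))

print(reliable)
```

Only the "south" region survives the filter, so the summary reports a support share for the 1000-response group and leaves out the five-response group entirely.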

Keeping track of the number of observations is an important part of making sure that results are statistically meaningful. The n() function is a handy tool for this when you are working with the dplyr package.