Selecting Random Samples in R: Sample() Function

Many statistical and business analysis projects require you to select a sample from a list of values, particularly when running simulations. R provides the sample() function for this purpose. It is useful for both combinatorial problems and statistical simulation.

Random sampling can be a touchy subject with certain audiences. This article focuses on the essentials of using sample() to select values from a list. We will also briefly discuss more advanced options for sampling and random number generation.

R Sample() – Random Selections From A List

R has a convenient function for handling sample selection: sample(). This function addresses the common cases:

  • Picking from a finite set of values (sampling without replacement)
  • Sampling with replacement
  • Using all values (reordering) or only a subset (selecting a few items)

By default, sample() returns every value in the list, shuffled into a random order. Sample code is below:

# r sample - simple random sampling in r
# general form: sample(x), where x is a vector of values
sample(1:10)

This request returns the following:

 [1]  7  8  2  9  1  4  6  3 10  5

As you can see, we’ve shuffled the list of the first 10 numbers into a different order.
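Because the shuffle is random, rerunning the call gives a different order each time. When a result needs to be repeatable, base R's set.seed() can fix the state of the random number generator first. The following is a minimal sketch; the seed value 42 and the variable names are arbitrary choices for illustration.

```r
# Fix the random number generator state so the shuffle is repeatable.
# The seed value (42) is arbitrary; any integer works.
set.seed(42)
first_shuffle <- sample(1:10)

# Resetting the same seed reproduces the same permutation.
set.seed(42)
second_shuffle <- sample(1:10)

identical(first_shuffle, second_shuffle)  # TRUE
sort(first_shuffle)                       # still the values 1 through 10
```

Setting a seed is mostly useful for tutorials, tests, and reports where someone else needs to get the same numbers.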

But what if a value can be selected more than once? This is known as sampling with replacement. sample() supports this through an additional argument, replace, which can be TRUE or FALSE (T and F are common shorthand). The default is FALSE, meaning no replacement. For example:

 
# r sample with replacement from vector
sample(1:10, replace = TRUE)

This yields the following result. As you can see, some values are picked more than once.

 [1]  4  7 10  9  4  6  6  4  3  4

We can add the size parameter to return only a few values. The following code will pick three values.

# r sample multiple times without replacement
sample(1:10, size = 3, replace = FALSE)

This yields the following result:

[1] 3 6 8

Here is the same call with replacement turned on (the output shown was carefully selected):

 
# r sample with replacement from vector
sample(1:10, size = 3, replace = TRUE)
[1] 9 9 1

It took a couple of trials to get a selection that contains a repeated value.
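The two modes also differ in how many values can be requested. Without replacement, size cannot exceed the number of values available, and sample() stops with an error; with replacement, any size is allowed. Here is a short sketch using only base R (the variable names are illustrative):

```r
values <- 1:10

# Without replacement, every drawn value is distinct.
draw <- sample(values, size = 5, replace = FALSE)
anyDuplicated(draw) == 0  # TRUE

# Asking for more values than exist fails without replacement.
# try() captures the error so the script keeps running.
too_many <- try(sample(values, size = 15, replace = FALSE), silent = TRUE)
inherits(too_many, "try-error")  # TRUE

# With replacement, the same request succeeds and must contain repeats.
big_draw <- sample(values, size = 15, replace = TRUE)
length(big_draw)  # 15
```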

As a practical use case, we can use this to figure out who will pick up the bar tab for an R meetup.

 
# r sample - generate random sample in r
sample(c('Joe','Karl','Jack','Larry','Curly',
         'Moe','Kim','Kathy','Sam','Jim'), size = 1)
[1] "Kim"

Drinks are apparently on Kim this week.

Adjusting Probabilities

The prior examples assume every value in the list is equally likely to be selected. But sample() also allows us to adjust the probability of each item being selected. We do this with the prob argument, which takes one weight per value.

Our next example imagines us on a factory floor. We make widgets, which have a certain chance of being defective. Our quality isn’t great, so there is a 25% chance of a widget being defective. We can simulate this using the following code.

 
# r sample with unequal probabilities (prob argument)
sample(c('Good','Bad'), size = 6, replace = TRUE, prob = c(.75, .25))
[1] "Bad"  "Good" "Bad"  "Good" "Good" "Bad" 

As you can see, we stumbled upon a particularly bad sample, with more defects than expected. With an average defect rate of 25%, we would expect 1.5 defects on average in 6 trials, so finding 1 or 2 would be typical. Instead, we found three defects, a 50% defect rate. Indeed, our client should hire a quality consultant, ideally a consultant who knows R.
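To put a number on how unusual that sample is, the exact chance of three defects in six widgets follows from the binomial distribution, and a larger simulated batch shows the defect rate settling near 25%. This is a sketch under the same assumptions as above (independent widgets, 25% defect rate); the seed, the batch size of 10,000, and the variable names are arbitrary choices for illustration.

```r
# Exact probability of exactly 3 defects in 6 widgets at a 25% defect rate.
p_three <- dbinom(3, size = 6, prob = 0.25)
round(p_three, 3)  # 0.132

# Simulate a much larger batch; the seed only makes the run repeatable.
set.seed(123)
batch <- sample(c("Good", "Bad"), size = 10000, replace = TRUE,
                prob = c(0.75, 0.25))

# Share of defective widgets in the simulated batch (close to 0.25).
defect_rate <- mean(batch == "Bad")
defect_rate
table(batch)
```

So a run of three defects in six is uncommon but far from impossible: it should turn up in roughly one sample out of eight.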

R Sample Dataframe: Randomly Select Rows In R Dataframes

Until now, our examples have used the sample function in R to select a random subset of the values in a vector. In practice, you are more likely to be asked to draw a random sample from an existing data frame, randomly selecting rows from the larger set of observations. Continuing our product quality example, our quality lab is probably measuring and inspecting each item in multiple ways. We can compile these measurements into a data frame of quality observations, with each observation measuring the item along multiple dimensions. Since life exists in more than one dimension, you can easily adapt R’s random sampling process to support this.

 
# r sample dataframe; selecting a random subset in r
# df is a data frame; pick 5 rows

df[sample(nrow(df), 5), ]

In this example, we are using the sample function in R to select a random subset of 5 rows from a larger data frame.
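For a self-contained version of the same idea, here is a sketch that builds a small made-up quality data frame and samples rows from it. The data frame name (quality), the column names (widget_id, weight_g, passed), and all values are hypothetical, invented purely for illustration.

```r
# Hypothetical quality-lab measurements; all names and values are made up.
set.seed(1)  # arbitrary seed so the example is repeatable
quality <- data.frame(
  widget_id = 1:20,
  weight_g  = round(rnorm(20, mean = 100, sd = 2), 1),
  passed    = sample(c(TRUE, FALSE), size = 20, replace = TRUE,
                     prob = c(0.75, 0.25))
)

# Pick 5 distinct row indices, then use them to subset the data frame.
picked_rows <- sample(nrow(quality), 5)
quality_sample <- quality[picked_rows, ]

nrow(quality_sample)  # 5
quality_sample
```

The indexing works because passing a single number n to sample() draws from the integers 1 through n, so sample(nrow(quality), 5) returns five distinct row numbers.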

If you are using the dplyr package to manipulate data, there’s an even easier way: the sample_n function. (Recent dplyr releases supersede it with slice_sample(), though sample_n still works.)

 
# dplyr r sample_n example
library(dplyr)
sample_n(df, 10)

Generating Random Numbers in R

Our examples up to this point have dealt with random selections from finite sets. But what if we need to generate a random number from a continuous range using R?

The next part of our tutorial will address generating floating point numbers and values from a specific statistical distribution.
