How To Randomly Split Data In R

Many statistical procedures require you to randomly split your data into a development sample and a holdout sample. The split lets you validate your insights and reduces the risk of over-fitting your model to your data. The development sample is used to create the model and the holdout sample is used to confirm your findings. These are also referred to as a training sample and a testing sample.

How to Split Data into Training and Testing in R

We are going to use the rock dataset from the built-in R datasets. The data (see below) is for a set of rock samples. We are going to split the dataset into two parts: 80% of the rows for model development and the remaining 20% for validation.

> head(rock)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

When doing an automated split, you need to start by determining the sample size. We accomplish this by counting the rows and taking the appropriate fraction (80%) of them, rounded down, as the size of the development sample. Next, we set a seed so the split is reproducible and use the sample function to draw that many row numbers at random, without replacement.

The final part involves splitting out the data set into the two portions.

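# Split data into training and testing in R
sample_size <- floor(0.8 * nrow(rock))  # 80% of the rows go to development
set.seed(777)                           # fix the seed so the split is reproducible

# Randomly pick the row numbers for the development sample
picked <- sample(seq_len(nrow(rock)), size = sample_size)
development <- rock[picked, ]   # the rows that were picked
holdout <- rock[-picked, ]      # all remaining rows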
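
As a quick sanity check, the sketch below repeats the split and confirms that the two samples do not overlap and together cover every row. The names n_dev and n_hold are introduced here for illustration only.

```r
# Repeat the split so this block runs on its own
sample_size <- floor(0.8 * nrow(rock))
set.seed(777)
picked <- sample(seq_len(nrow(rock)), size = sample_size)
development <- rock[picked, ]
holdout <- rock[-picked, ]

# Count the rows in each sample (n_dev and n_hold are illustrative names)
n_dev <- nrow(development)
n_hold <- nrow(holdout)

# Every row should land in exactly one of the two samples
stopifnot(n_dev + n_hold == nrow(rock))
stopifnot(length(intersect(rownames(development), rownames(holdout))) == 0)

cat("development rows:", n_dev, "\n")
cat("holdout rows:", n_hold, "\n")
```

Because floor() rounds down, the development sample can be slightly smaller than 80% when the row count does not divide evenly; the holdout sample picks up the remainder.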

Why Randomly Split Data in R?

This technique is known as holdout validation, the simplest form of cross-validation. It is an important way to protect the integrity of your findings when you cannot repeat your test on a different sample.

The power of statistical modeling is that you can make inferences using the attributes in your data to predict how similar observations should perform. For example, if we note that sales tend to spike on Mondays (due to typical shopping patterns), a sales prediction model might incorporate the day of the week as a predictive attribute. When you apply a statistical model to data outside your original training set, you are making the assumption that the same predictive relationships will continue to apply to the new dataset.

Unfortunately, thanks to spurious correlations, you will at some point identify a statistical relationship that is due to random chance. If you test enough variables, you will eventually find one that appears to accurately predict the results in your sample purely by chance.
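
To see how easily this happens, here is a small simulation sketch. Everything in it is an assumption made for illustration: the outcome and all 200 candidate predictors are independent random noise, and the sizes n_obs and n_vars are arbitrary. Even so, searching across the predictors turns up one that looks related to the outcome.

```r
set.seed(777)
n_obs <- 50     # number of observations (arbitrary, for illustration)
n_vars <- 200   # number of candidate predictors, all pure noise

# The outcome and every predictor are independent random noise
outcome <- rnorm(n_obs)
predictors <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)

# Correlation of each noise predictor with the outcome
correlations <- cor(predictors, outcome)[, 1]

# The strongest correlation, found purely by searching
best <- which.max(abs(correlations))
cat("best predictor:", best, "\n")
cat("its correlation:", round(correlations[best], 3), "\n")

# How many noise predictors look 'significant' at the 5% level
p_values <- apply(predictors, 2, function(x) cor.test(x, outcome)$p.value)
cat("predictors with p < 0.05:", sum(p_values < 0.05), "\n")
```

None of these relationships is real, so none should be expected to hold up in a fresh sample.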

The holdout sample is your insurance policy against false insights. By splitting off part of your sample and requiring that any findings replicate within that sample, you reduce the risk of selecting a model built on false trends. This reduces the risk of your model failing in field testing.
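
A minimal sketch of that check on the rock data follows. The model formula (log of perm against the other three columns) and the rmse helper are assumptions chosen for illustration, not a recommended model. The idea is to fit on the development sample only and then compare the error on both samples.

```r
# Repeat the split so this block runs on its own
sample_size <- floor(0.8 * nrow(rock))
set.seed(777)
picked <- sample(seq_len(nrow(rock)), size = sample_size)
development <- rock[picked, ]
holdout <- rock[-picked, ]

# Fit the model on the development sample only
# (the formula is an illustrative assumption)
model <- lm(log(perm) ~ area + peri + shape, data = development)

# Root mean squared error helper (hypothetical name)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Score the same model on both samples
dev_pred <- predict(model, newdata = development)
holdout_pred <- predict(model, newdata = holdout)
dev_rmse <- rmse(log(development$perm), dev_pred)
holdout_rmse <- rmse(log(holdout$perm), holdout_pred)

cat("development RMSE:", round(dev_rmse, 3), "\n")
cat("holdout RMSE:", round(holdout_rmse, 3), "\n")
```

A holdout error far above the development error suggests the model has latched onto patterns that do not generalize. With a holdout sample this small, the comparison is noisy, so treat it as a rough signal rather than a precise measurement.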