Validate Me! Simple Test vs. Holdout Samples in R

In statistics, it is often necessary not only to model data but also to test that model. To do this, you randomly separate the data into two groups, so that both groups are representative regardless of the order of the original sample.

Statistical model

When data scientists work with a data set, they often build a statistical model, which involves finding a mathematical formula that fits the data. Once you have a model, you need to validate it by comparing it against a validation sample. To get this sample, the original sample is randomly separated into two: a training data set used to develop the model, and a second set used to validate it.

Model validation

Validating a mathematical model requires a different data set from the one used to develop it. The model is compared with this second data set to see whether it fits that data as well. If it does not, another model is needed. A common way of obtaining a validation sample is to split the original sample into two parts and hold one part back for validation.

Holdout sample

Part of evaluating a mathematical model involves separating the original data set into training data and a holdout sample. The training data is used to develop the model by fitting a mathematical formula to it. This formula is then applied to the holdout sample to validate it. For the comparison to be valid, both data sets must be representative of the original data. If you have only one original data set, separating it randomly helps keep both sets representative.
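To see why the random part matters, here is a minimal sketch that compares a split by position with a random split. It assumes R's built-in iris data set, whose rows are ordered by species. The variable names and the seed value are illustrative choices.

```r
data(iris)               # built-in; rows are ordered by species
n <- nrow(iris)          # 150 rows
half <- floor(n / 2)     # 75

# Naive split by position: the first half versus the rest.
ordered_train <- iris[seq_len(half), ]
ordered_holdout <- iris[-seq_len(half), ]

# Random split of the same size.
set.seed(37)
idx <- sample(seq_len(n), size = half)
random_train <- iris[idx, ]
random_holdout <- iris[-idx, ]

# Species counts in each piece.
table(ordered_train$Species)
table(ordered_holdout$Species)
table(random_train$Species)
table(random_holdout$Species)
```

The positional split leaves no virginica rows in the training half, so a model built on it would never see that species. A random split will typically place all three species in both halves.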

How to split data into training and testing in R

To split data into training and testing sets in R, use the sample function. Its signature is sample(x, size, replace = FALSE, prob = NULL), and it returns a vector of randomly selected values from x. The x argument is the group of values to select from. The size argument is the number of values returned in the output vector. The replace argument is an optional logical value that decides whether a value can be selected more than once. The prob argument is an optional vector of probability weights, one for each element of x, that makes some values more likely to be selected than others.

 > x=c(4,11,25,35,45,55,68,73,86,99)
 > set.seed(5)
 > a=sample(1:10,5,FALSE)
 > a
 [1] 2 9 7 3 1
 > x[a]
 [1] 11 86 68 25  4
 > x[-a]
 [1] 35 45 55 73 99

In this example, the result of the sample function is used as an index into the vector x: x[a] selects the sampled values, and the negative index x[-a] selects everything else. The set.seed function fixes the starting point of the random number generator so that others can replicate your results.
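The optional arguments of sample and the effect of set.seed can be sketched as follows. This is a small illustration that reuses the vector x from above. The weights in w and the sizes are arbitrary values chosen for the example.

```r
x <- c(4, 11, 25, 35, 45, 55, 68, 73, 86, 99)

# Same seed, same draw: set.seed() resets the random number stream.
set.seed(5)
first <- sample(x, size = 5)
set.seed(5)
second <- sample(x, size = 5)
identical(first, second)   # TRUE

# replace = TRUE allows the same value to be drawn more than once,
# so size may exceed length(x).
with_dupes <- sample(x, size = 15, replace = TRUE)

# prob takes one weight per element of x; here the first value is
# weighted nine times as heavily as each of the others.
w <- c(9, rep(1, 9))
weighted <- sample(x, size = 3, prob = w)
```

For a training and holdout split, leave replace at its default of FALSE so that no row is selected twice.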

Application

Here is a practical example that uses R's built-in iris data set, which contains sepal and petal measurements for iris flowers.

 > data(iris)
 > smp_size = floor(0.5 * nrow(iris))
 > smp_size
 [1] 75
 > set.seed(37)
 > train_ind = sample(seq_len(nrow(iris)), size = smp_size)
 > train = iris[train_ind, ]
 > test = iris[-train_ind, ]
 > train
 > test

This example demonstrates the steps needed to separate a data set into training and testing samples. These two samples can then be used to create and test a mathematical model.
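As one possible way to use the two samples, the sketch below fits a straight line on the training half and measures its error on the testing half. The choice of lm, the formula Petal.Width ~ Petal.Length, and root mean squared error as the measure are assumptions made for illustration. The split itself is repeated so that the block runs on its own.

```r
data(iris)
smp_size <- floor(0.5 * nrow(iris))
set.seed(37)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]

# Fit a straight line on the training half only.
fit <- lm(Petal.Width ~ Petal.Length, data = train)

# Apply the fitted formula to the held-out half.
pred <- predict(fit, newdata = test)

# Root mean squared error on each half; a holdout error far above the
# training error would suggest the model does not generalise.
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_test <- sqrt(mean((test$Petal.Width - pred)^2))
c(train = rmse_train, holdout = rmse_test)
```

If the two errors are close, the formula fits the holdout sample about as well as the data it was built from.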

Dividing a data set to provide a testing sample is critical to validating a mathematical model of the data. While no single function does the entire job, the process is still quite simple in R.
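If you split data often, the steps can be wrapped in a small helper. The function below is a hypothetical convenience written for this article, not part of base R. Its name, arguments, and defaults are assumptions.

```r
# Hypothetical helper, not part of base R.
# Returns a list with a random training part and the remaining holdout part.
split_holdout <- function(data, train_frac = 0.5, seed = NULL) {
  stopifnot(is.data.frame(data), train_frac > 0, train_frac < 1)
  if (!is.null(seed)) set.seed(seed)   # optional, for repeatable splits
  n_train <- floor(train_frac * nrow(data))
  stopifnot(n_train >= 1)              # negative indexing needs at least one row
  idx <- sample(seq_len(nrow(data)), size = n_train)
  list(train = data[idx, , drop = FALSE],
       holdout = data[-idx, , drop = FALSE])
}

parts <- split_holdout(iris, train_frac = 0.8, seed = 37)
nrow(parts$train)    # 120
nrow(parts$holdout)  # 30
```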