Sometimes, it is necessary to organize a dataset around specific properties. A dummy variable that indicates whether a property condition has been met is useful for statistical modeling because it makes it easier to group similar items.
Dummy variables
Dummy variables are variables added to a dataset to indicate whether an observation belongs to a category. They are used when you want to break the data into categories based on specific properties. You need one fewer dummy variable than the number of categories you want to create. For example, to divide a group of people according to the type of vehicle they drive in a dataset that has five different types of vehicles, you would set up four dummy variables, each with a value of 1 or 0. Each dummy variable would represent one vehicle type, indicated by a value of 1, and the fifth type would be indicated by all four dummy variables being equal to 0.
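The five-vehicle case can be sketched in R as follows. This is only an illustration: the drivers data frame and its vehicle types are hypothetical, and model.matrix() is one of several ways to build the columns. By default it treats the first factor level as the baseline, so five categories yield four 0/1 columns.

```r
# Hypothetical data: six drivers across five vehicle types
drivers <- data.frame(
  driver  = c("A", "B", "C", "D", "E", "F"),
  vehicle = factor(c("sedan", "truck", "suv", "van", "motorcycle", "sedan"))
)

# model.matrix() expands the factor into 0/1 columns using treatment coding.
# The first level (alphabetically "motorcycle") is the baseline and gets no
# column of its own; [, -1] drops the intercept column.
dummies <- model.matrix(~ vehicle, data = drivers)[, -1]

# Four columns for five categories; the motorcycle row is all zeros
print(dummies)
```

Each row has at most one 1, and the baseline category is recognized by all four columns being 0.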
How to create a dummy variable in R
Creating a dummy variable in R is quite simple because all that is needed is the %in% operator, which returns TRUE for each element that matches one of the values being looked for and FALSE otherwise.
> them = data.frame(ID=c("Bob","Sue","Tom","Ann"),
+ sex=c("M","F","M","F"),
+ Height=c(5.4,5.2,6,5.6),
+ Weight=c(152,135,200,NA))
> them
ID sex Height Weight
1 Bob M 5.4 152
2 Sue F 5.2 135
3 Tom M 6.0 200
4 Ann F 5.6 NA
Here, we have a dataframe showing four people with their sex, height, and weight.
> them$male = them$sex %in% 'M'
> them
ID sex Height Weight male
1 Bob M 5.4 152 TRUE
2 Sue F 5.2 135 FALSE
3 Tom M 6.0 200 TRUE
4 Ann F 5.6 NA FALSE
Here, we have added the dummy variable them$male to the data frame as a new column. It is TRUE for each row where sex is 'M' and FALSE otherwise, and printing the data frame shows the original data with the new column added.
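Some modeling workflows expect 1/0 rather than TRUE/FALSE. A minimal sketch, assuming the same four people as in the example above, wraps the test in as.integer(). It also shows one reason %in% may be preferable to == when values are missing; the sex_with_gap vector is hypothetical and exists only for illustration.

```r
# Same four people as in the example above (only the columns needed here)
them <- data.frame(ID  = c("Bob", "Sue", "Tom", "Ann"),
                   sex = c("M", "F", "M", "F"))

# as.integer() turns TRUE/FALSE into 1/0
them$male <- as.integer(them$sex %in% "M")

# With a missing value, == yields NA while %in% yields FALSE
sex_with_gap <- c("M", NA, "F")      # hypothetical vector for illustration
eq_result <- sex_with_gap == "M"     # TRUE NA FALSE
in_result <- sex_with_gap %in% "M"   # TRUE FALSE FALSE
```

Whether a missing value should count as 0 or stay missing depends on the analysis, so the choice between the two operators is worth making deliberately.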
Practical application
In statistical modeling, being able to group similar items together is often important. For example, a list of the changes in gas mileage of different vehicles over time would probably not produce meaningful results unless you can separate the vehicles by the number of cylinders.
# how to create a dummy variable in r - base data
> salesteam = data.frame ('employee'=1:8, 'pastjob'=c('sales','sales admin','sales','sales admin','ops','ops','R&D','IT'), 'results'=c(150,200,250,300,125,150,175,150))
> salesteam
employee pastjob results
1 1 sales 150
2 2 sales admin 200
3 3 sales 250
4 4 sales admin 300
5 5 ops 125
6 6 ops 150
7 7 R&D 175
8 8 IT 150
# how to create a dummy variable in R
> salesteam$didsales = salesteam$pastjob %in% c('sales','sales admin')
> salesteam
employee pastjob results didsales
1 1 sales 150 TRUE
2 2 sales admin 200 TRUE
3 3 sales 250 TRUE
4 4 sales admin 300 TRUE
5 5 ops 125 FALSE
6 6 ops 150 FALSE
7 7 R&D 175 FALSE
8 8 IT 150 FALSE
# how to create a dummy variable in R - roll up
> aggregate(salesteam, by=list(salesteam$didsales),FUN=mean)
Group.1 employee pastjob results didsales
1 FALSE 6.5 NA 150 0
2 TRUE 2.5 NA 225 1
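This example creates a dummy variable for a sales team and uses the aggregate() function to show average performance by group: employees with a sales background averaged 225, compared with 150 for everyone else. The pastjob column shows NA because a mean cannot be computed for text values. Having this information tells the manager a lot about how the team is performing as a group.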
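If only the average of results is of interest, the formula interface of aggregate() is one way to limit the roll-up to that column. The sketch below rebuilds the same salesteam data so that it runs on its own; it is an alternative to the call above, not a replacement for it.

```r
# Rebuild the sales team data from the example above
salesteam <- data.frame(
  employee = 1:8,
  pastjob  = c("sales", "sales admin", "sales", "sales admin",
               "ops", "ops", "R&D", "IT"),
  results  = c(150, 200, 250, 300, 125, 150, 175, 150)
)

# Dummy variable: TRUE when the past job was in sales
salesteam$didsales <- salesteam$pastjob %in% c("sales", "sales admin")

# Average only the results column, grouped by the dummy variable
avg <- aggregate(results ~ didsales, data = salesteam, FUN = mean)
print(avg)  # didsales FALSE averages 150, TRUE averages 225
```

Because the text column never reaches mean(), the output has no NA column and the grouping column keeps the name didsales instead of Group.1.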
Dummy variables are a useful tool for creating groups within datasets. R makes doing this extremely easy because it can be done with a single operation. This is one of the many reasons that R is an excellent tool for data science.