How To Create a Dummy Variable in R

Sometimes it is necessary to organize a dataset around specific properties. A dummy variable records whether each observation meets a given condition. This makes it easy to group similar items, which is why dummy variables are so useful in statistical modeling.

Dummy variables

A dummy variable is a variable added to a dataset to flag whether each observation belongs to a particular category, typically with a value of 1 or 0. It is used when you want to break the data into categories based on specific properties. You need one fewer dummy variable than the number of categories you want to represent. For example, to divide a group of people according to the type of vehicle they drive, in a dataset with five vehicle types, you would set up four dummy variables, each with a value of 1 or 0. Each dummy variable represents one vehicle type, indicated by a 1, and the fifth type is indicated by all four dummy variables being equal to 0.
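As a rough sketch of the five-vehicle example, base R's model.matrix() can build the four 0/1 columns from a single categorical column. The drivers dataframe and its values are hypothetical and used only for illustration. The sketch assumes R's default treatment contrasts, where the first factor level (here "coupe", the first alphabetically) is the category represented by all zeros.

```r
# Hypothetical data: five vehicle types (not taken from the article)
drivers <- data.frame(
  person  = c("A", "B", "C", "D", "E", "F"),
  vehicle = c("sedan", "truck", "suv", "van", "coupe", "truck")
)

# Store vehicle as a factor; levels are sorted alphabetically,
# so "coupe" becomes the baseline category
drivers$vehicle <- factor(drivers$vehicle)

# model.matrix() returns an intercept column plus one 0/1 column
# per non-baseline level; [, -1] drops the intercept
dummies <- model.matrix(~ vehicle, data = drivers)[, -1]

print(dummies)
```

Each row has at most one 1, and the row for the "coupe" driver is all zeros, which is how the fifth category is identified without a fifth column.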

How to create a dummy variable in R

Creating a dummy variable in R is quite simple because all it takes is one operator, %in%. It checks each element of a variable and returns TRUE where the element matches one of the values being looked for and FALSE otherwise.
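One practical reason to prefer %in% over == for this job is how the two treat missing values. The short sketch below uses a hypothetical vector (not from the article) to show the difference.

```r
# Hypothetical vector with one missing value (not from the article)
sex <- c("M", "F", NA, "M")

# %in% always returns TRUE or FALSE, even when the element is NA
is_male_in <- sex %in% "M"

# == passes the missing value through as NA
is_male_eq <- sex == "M"

print(is_male_in)
print(is_male_eq)
```

With %in%, the dummy variable never contains NA, which may or may not be what you want. If a missing category should stay missing in the dummy, == is the safer choice.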

> them = data.frame(ID=c("Bob","Sue","Tom","Ann"),
+ sex=c("M","F","M","F"),
+ Height=c(5.4,5.2,6,5.6),
+ Weight=c(152,135,200,NA))
> them
ID sex Height Weight
1 Bob M 5.4 152
2 Sue F 5.2 135
3 Tom M 6.0 200
4 Ann F 5.6 NA

Here, we have a dataframe showing four people with their sex, height, and weight.

> them$male = them$sex %in% 'M'
> them
ID sex Height Weight male
1 Bob M 5.4 152 TRUE
2 Sue F 5.2 135 FALSE
3 Tom M 6.0 200 TRUE
4 Ann F 5.6 NA FALSE

Here, we have added the dummy variable them$male to the dataframe, giving us a new column. When the dataframe is printed, we get the same data with the new variable added, which is TRUE for males and FALSE for everyone else.
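The %in% operator produces a logical TRUE/FALSE column. If a 1/0 column is preferred, as in the definition of a dummy variable given earlier, one option is to wrap the result in as.integer(). This is a minimal sketch that rebuilds only the relevant columns of the dataframe. The column name male01 is an assumption chosen for illustration.

```r
# Rebuild the relevant columns of the article's dataframe
them <- data.frame(ID  = c("Bob", "Sue", "Tom", "Ann"),
                   sex = c("M", "F", "M", "F"))

# Logical dummy variable, as in the article
them$male <- them$sex %in% "M"

# Assumed extra column: the same flag stored as integer 1/0
them$male01 <- as.integer(them$male)

print(them)
```

R treats TRUE as 1 and FALSE as 0 in arithmetic, so the logical version often works as is; the integer version mainly helps when exporting the data or when a tool expects numeric input.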

Practical application

In statistical modeling, being able to group similar items together is often important. For example, a list of the changes in gas mileage of different vehicles over time would probably not produce meaningful results unless you can separate the vehicles by the number of cylinders.

# how to create a dummy variable in R - base data
> salesteam = data.frame(employee=1:8, pastjob=c('sales','sales admin','sales','sales admin','ops','ops','R&D','IT'), results=c(150,200,250,300,125,150,175,150))
> salesteam
   employee     pastjob results
 1        1       sales     150
 2        2 sales admin     200
 3        3       sales     250
 4        4 sales admin     300
 5        5         ops     125
 6        6         ops     150
 7        7         R&D     175
 8        8          IT     150

# how to create a dummy variable in R
> salesteam$didsales = salesteam$pastjob %in% c('sales','sales admin')
> salesteam
   employee     pastjob results didsales
 1        1       sales     150     TRUE
 2        2 sales admin     200     TRUE
 3        3       sales     250     TRUE
 4        4 sales admin     300     TRUE
 5        5         ops     125    FALSE
 6        6         ops     150    FALSE
 7        7         R&D     175    FALSE
 8        8          IT     150    FALSE

# how to create a dummy variable in R - roll up
> aggregate(salesteam, by=list(salesteam$didsales),FUN=mean)
   Group.1 employee pastjob results didsales
 1   FALSE      6.5      NA     150        0
 2    TRUE      2.5      NA     225        1

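This example creates a dummy variable for a sales team and uses the aggregate() function to show the average performance of each group: employees with a sales background averaged 225, compared with 150 for everyone else. The pastjob column shows NA because a mean cannot be computed for text values (R also issues a warning for this), and the employee average is not meaningful because that column is only an ID. Having this information about a sales team tells the manager a lot about how the team is performing as a group.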
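To average only the results column and avoid the NA values and warnings, one alternative is the formula interface of aggregate(). The sketch below rebuilds the same salesteam data; the name by_group is an assumption used for illustration.

```r
# Rebuild the article's sales team data
salesteam <- data.frame(
  employee = 1:8,
  pastjob  = c("sales", "sales admin", "sales", "sales admin",
               "ops", "ops", "R&D", "IT"),
  results  = c(150, 200, 250, 300, 125, 150, 175, 150)
)

# Dummy variable: TRUE for employees with a sales background
salesteam$didsales <- salesteam$pastjob %in% c("sales", "sales admin")

# Average results within each value of the dummy variable only
by_group <- aggregate(results ~ didsales, data = salesteam, FUN = mean)

print(by_group)
```

The result has one row per group and just two columns, didsales and results, so the comparison between the groups is easier to read.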

Dummy variables are a useful tool for creating groups within datasets. R makes this extremely easy because it can be done with a single operation. This is one of the many reasons that R is an excellent tool for data science.