How to Run a Chi Square Test in R

You can use the Chi Square test in R to evaluate the association between two categorical variables. For the purposes of this exercise, we're going to use a common marketing application: did a larger share of prospects accept a new offer that we are testing than accepted our existing offer?

Assume we have a larger data set (every prospect and their information, including whether they responded). We can boil this down to a high-level table that summarizes our count of records. We will feed this into the chi-square test in R to assess whether there is a statistically significant relationship between the two categorical variables.
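
As a rough sketch of that summarization step, assume the record-level data sits in a data frame with one row per prospect. The name `prospects` and its `offer` and `outcome` columns are hypothetical, and the rows are generated here only so the snippet runs on its own. Base R's table() does the counting.

```r
# Hypothetical record-level data: one row per prospect.
# The data frame and column names are illustrative assumptions.
prospects <- data.frame(
  offer = factor(rep(c("old", "new"), times = c(5040, 5065)),
                 levels = c("old", "new")),
  outcome = factor(c(rep(c("accept", "reject"), times = c(40, 5000)),
                     rep(c("accept", "reject"), times = c(65, 5000))),
                   levels = c("accept", "reject"))
)

# Collapse the records into a 2 x 2 table of counts (offer by outcome)
counts_from_records <- table(offer = prospects$offer, outcome = prospects$outcome)
counts_from_records
```

A table built this way can be passed to chisq.test() directly, with no need to type the counts by hand.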

Using a Chi Square Test of Independence in R

We'll generate the data set for the chi square test example directly in code, as shown below.

# Chi Square test in R example; data setup
# Rows are offers, columns are outcomes (counts of prospects)

> recordcounts <- as.table(rbind(c(40, 5000), c(65, 5000)))
> dimnames(recordcounts) <- list(offer = c("old", "new"), outcome = c("accept", "reject"))

# Chi Square test in R example; inspect data
> recordcounts
     outcome
offer accept reject
  old     40   5000
  new     65   5000

# Chi Square test in R example; run Pearson's chi-squared test
> chisq.test(recordcounts)

        Pearson's Chi-squared test with Yates' continuity correction

data:  recordcounts
X-squared = 5.424, df = 1, p-value = 0.01986
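
The printed summary is only part of what chisq.test() returns. The sketch below stores the result in a variable (the names `result` and `uncorrected` are illustrative) to look at the expected counts under independence, and reruns the test with `correct = FALSE` to show how the Yates' continuity correction mentioned in the output can be switched off for comparison. Whether to apply the correction is a judgment call that this example does not settle.

```r
# Rebuild the table so this snippet runs on its own
recordcounts <- as.table(rbind(c(40, 5000), c(65, 5000)))
dimnames(recordcounts) <- list(offer = c("old", "new"),
                               outcome = c("accept", "reject"))

# For a 2 x 2 table, chisq.test() applies Yates' correction by default
result <- chisq.test(recordcounts)

result$expected   # counts we would expect if offer and outcome were independent
result$statistic  # the X-squared value from the printed output
result$p.value    # the p-value from the printed output

# The same test without the continuity correction, for comparison
uncorrected <- chisq.test(recordcounts, correct = FALSE)
uncorrected$statistic
```

A common rule of thumb is to check that every expected count is at least 5 before relying on the chi-square approximation, which `result$expected` makes easy to confirm.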

Interpretation of p Value in Chi Square Test

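In review, we selected two groups of roughly 5,000 prospects (one for each offer). The offer was presented, resulting in a binary outcome (accept, reject). We tallied the number of each outcome into the table above: 40 acceptances and 5,000 rejections for the old offer, and 65 acceptances and 5,000 rejections for the new offer.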

Visual inspection suggests the new offer might have done well (it yielded 65 acceptances against 40 for the old offer, our current champion). But is this real or an artifact of chance? We run a chi-square test to gain perspective.
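
To put numbers on that visual inspection, the raw counts can be turned into acceptance rates per offer. This is a minimal sketch using base R's prop.table(); the variable name `acceptance_rates` is an illustrative choice.

```r
# Rebuild the table so this snippet runs on its own
recordcounts <- as.table(rbind(c(40, 5000), c(65, 5000)))
dimnames(recordcounts) <- list(offer = c("old", "new"),
                               outcome = c("accept", "reject"))

# Row-wise proportions: each row (offer) sums to 1
acceptance_rates <- prop.table(recordcounts, margin = 1)[, "accept"]

# Express as percentages, rounded to two decimals
round(100 * acceptance_rates, 2)
```

With these counts the rates come out to roughly 0.79% for the old offer and 1.28% for the new one.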

With a p-value below .02, we would reject the hypothesis that offer and outcome are independent and accept that the difference is unlikely to be due to chance alone. (Typical alpha is .05 or .025, depending on the standards of your employer.)
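
The decision rule can also be written out in code. The sketch below assumes an alpha of .05; the variable names `alpha` and `significant` are illustrative, and the threshold should be whatever your organization uses.

```r
# Rebuild the table so this snippet runs on its own
recordcounts <- as.table(rbind(c(40, 5000), c(65, 5000)))
dimnames(recordcounts) <- list(offer = c("old", "new"),
                               outcome = c("accept", "reject"))

alpha <- 0.05                        # assumed significance threshold
result <- chisq.test(recordcounts)   # same test as in the example above

# TRUE when the p-value falls below the chosen threshold
significant <- result$p.value < alpha

if (significant) {
  message("Reject independence: acceptance appears to differ between offers.")
} else {
  message("No significant association detected at this alpha.")
}
```

Setting `alpha <- 0.025` instead leads to the same conclusion here, since the p-value from the example is below both thresholds.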

R Chi Square Test – Summary & Usage

The chi squared test is one of the most common screening tests used for categorical data. In addition to evaluating whether the variables in a dataset are independent, it can be used as part of larger procedures. One example of this is the CHAID (Chi-square Automatic Interaction Detection) process, used for building segmentation models using categorical data.

As you can see from the example, it is a very convenient process in the R language.
