R Language in E-commerce - Split Testing with Tidyverse

Retail has always been a customer-focused business, even on the internet. To stay relevant, online sellers rely on sales forecasting and statistical models to track their growing customer base and adjust their marketing to what attracts shoppers to their site. Programmers working for these retailers need to be fluent in the languages behind those models, comfortable taking pools of user data and turning them into visual representations that help executives make decisions about their companies' activities. It's best to grasp the basics of R before going further into its use.

The R language has been around since the 1990s, and many users regard it as a beginner's entry into coding. With its simple syntax and plethora of documented guides, those entirely new to coding, or those who need more background to grasp the full utility of R and its packages, don't have to look far for answers. With a full library of functions and tools for statistical modeling, R is widely used by data scientists for predictive analytics. The goal of this report is to inform readers about Tidyverse, a collection of R packages that share a consistent set of commands for cleaning and organizing large inputs of data. Later sections show Tidyverse code snippets as a visual aid. At the end, there are links to more advanced uses of R for those interested in further data analysis, along with examples of written code for hands-on learning.

Welcome To Tidyverse

Tidyverse is a collection of R packages that provides tools for data analytics and visualization. Although the tidyverse is considerably younger than R itself, many contributors have worked on its tools and functions, and analysts look at it as a workflow optimizer: its consistent approach to organizing data makes it reliable for processing huge datasets. With many packages released by the R community, there is constant updating to adapt to new needs in testing and visualization. Today, data exported from Google Analytics can be analyzed in R to test countless hypotheses, and analysts across different sectors form a large user community. As e-commerce groups analyze more data on customer behavior patterns, these approachable functions are likely to stay in high demand for the foreseeable future.
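As a rough illustration of that consistent structure, here is a minimal sketch using dplyr, one of the core tidyverse packages. The `orders` table and its column names are invented for this example and do not come from any real dataset.

```r
library(dplyr)

# Hypothetical order data, made up for illustration only
orders <- tibble(
  customer = c("a", "b", "c", "d", "e", "f"),
  device   = c("mobile", "desktop", "mobile", "mobile", "desktop", "desktop"),
  total    = c(20, 35, 15, 50, 10, 45)
)

# Keep orders of at least 15, then count and average them per device
device_summary <- orders %>%
  filter(total >= 15) %>%
  group_by(device) %>%
  summarise(n_orders = n(), avg_total = mean(total))

print(device_summary)
```

Each step takes a data frame and returns a data frame, which is what lets the verbs chain together with `%>%`.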

A/B Testing – Finding What Works

Imagine a group of responsive customers ordering from retailers like Amazon or Etsy. Online shoppers want to check out their cart, so they look for the payment button to complete their order. The button is colored gray, so people with slow internet or poor eyesight can't tell whether they hit the button to submit their order. At the retail site's HQ, the data analytics team wants to see what changing the button from gray to green, to make it easier to see, would do for their rate of completed orders. They assemble a pool of some 20,000 participants, divided into a test group with the green button and a control group still using the gray button. Now here's the takeaway:
The two groups will give results that the team compares on a single variable: the color of the button.
The analysts can set up the experiment in a tool like Google Optimize to run it and report results to compare.
With the results exported from Google Analytics, R can compare the variants and test whether the difference in outcomes between the two groups is statistically meaningful.
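The random split described above might be sketched as follows. This is only an illustration: the `experiment` table, the `visitor_id` column, and the group labels are hypothetical names, and a real experiment tool would handle assignment itself.

```r
library(dplyr)

set.seed(42)  # make the random split reproducible

n_visitors <- 20000  # pool size from the scenario above

# Shuffle an equal number of labels so each group gets exactly half the pool
labels <- sample(rep(c("control_gray", "test_green"), each = n_visitors / 2))

# Hypothetical visitor table: one row per visitor with the assigned button color
experiment <- tibble(
  visitor_id = seq_len(n_visitors),
  variant    = labels
)

# Confirm the two groups are the same size
group_sizes <- experiment %>% count(variant)
print(group_sizes)
```

Assigning visitors at random is what allows a later difference in order rates to be attributed to the button color rather than to who happened to land in each group.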

RStudio Split Testing

The best way to see A/B testing in practice is through relevant examples. Writing about data science topics on Medium, Etoma Egot explored a split-testing model in the popular RStudio environment. Testing starts with a market analysis team asking whether a different color button on their retail site would influence their conversion numbers. To further define the dataset given by the pool of visitors, the analysts have made it so that both the control and test groups contain two categories of visitors: those who visit and don't hit the button, and those who hit the button to purchase their item. Using filter functions in R, the program combs the dataset to differentiate the inputs:

# Load the tidyverse for %>% and filter()
# ABTest is assumed to be a data frame already loaded, with columns variant and converted
library(tidyverse)

# Let's filter out conversions for variant A
conversion_subset_A <- ABTest %>% filter(variant == "A" & converted == "TRUE")

# Total number of conversions for variant A
conversions_A <- nrow(conversion_subset_A)

# Number of visitors for variant A
visitors_A <- nrow(ABTest %>% filter(variant == "A"))

# Conversion rate for variant A
conv_rate_A <- (conversions_A / visitors_A)
print(conv_rate_A) # 0.02773925

# Let's take a subset of conversions for variant B
conversion_subset_B <- ABTest %>% filter(variant == "B" & converted == "TRUE")

# Number of conversions for variant B
conversions_B <- nrow(conversion_subset_B)

# Number of visitors for variant B
visitors_B <- nrow(ABTest %>% filter(variant == "B"))

# Conversion rate for variant B
conv_rate_B <- (conversions_B / visitors_B)
print(conv_rate_B) # 0.05068493
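
The per-variant steps above can also be collapsed into a single grouped summary. The sketch below uses a tiny invented stand-in for `ABTest` (10 visitors per variant, with `converted` stored as a logical rather than as text), so its rates are not the ones from the article.

```r
library(dplyr)

# Toy stand-in for the ABTest data: invented values, 10 visitors per variant
ABTest <- tibble(
  variant   = rep(c("A", "B"), each = 10),
  converted = c(rep(c(TRUE, FALSE), times = c(2, 8)),
                rep(c(TRUE, FALSE), times = c(4, 6)))
)

# One row per variant: visitors, conversions, and conversion rate
conv_summary <- ABTest %>%
  group_by(variant) %>%
  summarise(
    visitors    = n(),
    conversions = sum(converted),
    conv_rate   = conversions / visitors
  )

print(conv_summary)
```

Grouping avoids repeating the same filter for every variant, which matters once a test has more than two arms.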

Further Testing

R and the tidyverse can do far more with the data once the conversion rates are in hand. The next sample takes the conversion rate of the standard version and compares it to a "personalized" variant where one factor has changed:

uplift <- (conv_rate_B - conv_rate_A) / conv_rate_A * 100
uplift # 82.72%
# B is better than A by about 83%

By tracking the uplift across different personalizations, you can see which factors produce the best outcome compared to the original. From here, the conversion rates are used to define the dataset's pooled probability, standard error, margin of error, and point estimate:

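# Pooled conversion probability across both variants
p_pool <- (conversions_A + conversions_B) / (visitors_A + visitors_B)
print(p_pool) # 0.03928325

# Pooled standard error of the difference in conversion rates
SE_pool <- sqrt(p_pool * (1 - p_pool) * ((1 / visitors_A) + (1 / visitors_B)))
print(SE_pool) # 0.01020014

# Margin of error at the 95% confidence level
MOE <- SE_pool * qnorm(0.975)
print(MOE) # 0.0199919

# Point estimate: difference between the two conversion rates
d_hat <- conv_rate_B - conv_rate_A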
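
One way to read the point estimate and margin of error together is as an approximate 95% interval for the difference in conversion rates. The sketch below simply plugs in the values printed above. Note that it reuses the pooled standard error from this walkthrough, whereas an unpooled standard error is the more common choice for a confidence interval, so treat it as an approximation.

```r
# Values copied from the printed results above
conv_rate_A <- 0.02773925
conv_rate_B <- 0.05068493
MOE         <- 0.0199919

# Point estimate of the difference in conversion rates
d_hat <- conv_rate_B - conv_rate_A

# Approximate 95% interval: point estimate plus or minus the margin of error
ci_lower <- d_hat - MOE
ci_upper <- d_hat + MOE
print(c(lower = ci_lower, upper = ci_upper))

# TRUE when the whole interval lies above zero
excludes_zero <- ci_lower > 0
print(excludes_zero)
```

If the interval stays above zero, the data are consistent with variant B converting better than variant A rather than merely matching it.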

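These measurements collectively give a z-score and a p-value. The z-score expresses how many standard errors the observed difference lies from zero, and the p-value is the probability of seeing a difference at least that large if the two variants actually performed the same.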

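# z-score: observed difference measured in standard errors
z_score <- d_hat / SE_pool
print(z_score) # 2.249546

# Two-sided p-value from the standard normal distribution
p_value <- pnorm(q = -z_score, mean = 0, sd = 1) * 2
print(p_value) # 0.02447777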

This test has a p-value of about 0.024, which is below the conventional 0.05 threshold, so the difference is statistically significant at the 5% level and the data favor the new variant over the standard. When looking at different variants for testing, these calculations can help filter out factors that do not measurably affect the final result, so there's no wasted time on pointless comparisons.
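
As a cross-check, base R's `prop.test()` can run the same two-proportion comparison in one call. The counts below are an assumption: they are back-calculated from the printed rates (20/721 ≈ 0.02773925 and 37/730 ≈ 0.05068493) and are not quoted from the source data.

```r
# Counts back-calculated from the printed rates (an assumption, not source data)
conversions <- c(A = 20, B = 37)
visitors    <- c(A = 721, B = 730)

# correct = FALSE turns off the continuity correction so the result
# lines up with the pooled z-test worked through by hand
res <- prop.test(x = conversions, n = visitors, correct = FALSE)

# The chi-squared statistic is the square of the pooled z-score
z_from_chisq <- sqrt(unname(res$statistic))
print(z_from_chisq)
print(res$p.value)
```

With these assumed counts, the statistic and p-value should agree with the hand-computed z-score and p-value up to rounding.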

How Is This Relevant?

Many e-commerce sites concern themselves with knowing their customer base. With open access to tools like the Google Analytics API, they can rely more on split testing of user groups to determine everything from the shopper interface at checkout to how they market on different platforms. Facebook uses A/B testing to see which content presentation attracts more viewers to click on advertisements and gains more subscribers. When Amazon wants to know which products its shoppers buy in bulk and which they gravitate towards, it uses control groups to choose content for its suggestion bars based on tested responses. As more retail moves to online marketplaces, data analytics will make growing use of split testing to single out customer behaviors, making R a more valuable skill for your resume. As programming mixes with marketing and retail, learning R is not just handy but a necessary part of your skillset.

To read more about A/B testing and the tidyverse library of functions, try these sources.
<a href="https://medium.com/@etomaa/a-b-testing-analysis-using-rstudio-c9b5c67d6107">Etoma Egot</a>, A/B Testing Analysis with RStudio: this is the sample code referenced above. Etoma writes on Medium.com and publishes in its Towards Data Science publication.
RPubs: DataCamp's Paige Piccini provides ample code outlining the whole process of hypothesis testing in R.
Tidy Modeling with R: a full-length book, readable online, that explains the available library functions and how to use them effectively for different modeling applications.

Hopefully, this answered some questions about A/B Testing and Tidyverse. Thank you for reading.