To A/B or Not to A/B: Unleashing the Power of A/B Testing



CONTENTS

First things first: what is an A/B test?
Why spend my time and efforts in A/B testing?
So, how's it done? How do I effectively perform and analyze an A/B test?
Misconceptions: What companies usually don't do while A/B testing, but should
Number Crunching: The logic & math behind A/B testing
Examples of successful A/B tests
How A/B tests are done at Factspan

During Barack Obama's 2008 presidential campaign, his campaign committee set out to optimize every aspect of the campaign. They figured that the right web design could mean raising millions of dollars in campaign funds. But how did they decide on the right design? They tried as many as 24 different variations of the web page, using a mixture of images and CTA buttons, and finally arrived at the combination that brought the best results. The outcome? A 40% increase in signup rates, which led to $60 million in fundraising! The experiments they ran on the website are exactly what A/B testing is all about.

First things first: what is an A/B test?

It's all in the name: an A/B test is where you test two or more different versions of your product to find out which one performs best. But this doesn't mean the two versions (A and B) are poles apart. They need to be identical except for a few minor tweaks that we expect will change the user's behavior. Version A (Control) is the currently used version, and Version B (Treatment) is the one with the minor modification.

Why spend my time and efforts in A/B testing?

Because one minor change can make a world of difference. No exaggeration. If we take the web space, for instance, particularly user experience design, A/B testing can help you identify changes to web pages that increase the user's interest. Here's what else A/B tests can do for you:

- Optimize product onboarding, user engagement and in-product experiences
- Reduce marketing spend
- Give you guaranteed ROI (Return on Investment)

Those tiny changes in your product can cause a significant increase in click-through rates, conversions, average order value, or whatever else you're looking to achieve. This means more leads generated, better sales and increased revenue, based on data rather than gut feelings or guesswork. But why isn't this powerful tool more popular, then? Because not many companies use it right: over 75% of companies admit to not having suitable expertise to optimize their landing pages.

So, how's it done? How do I effectively perform and analyze an A/B test?

1. Define a problem statement
2. Plan the experiment
3. Identify target customers
4. Set up the experiment
5. Measure the KPIs
6. Analyze the statistics
7. Derive your insights

Step 1: Define a problem statement

Which means figuring out why you need A/B testing in the first place. This is where you must identify the problems with the current version of your product. Paying special attention here will minimize effort and resources in the later stages. Here's an example: the signup page has too many fields, which might cause the customer to lose interest while creating an account. The signup page could thus be subjected to experimentation.

Step 2: Plan the experiment

The most important thing before conducting an A/B test is to identify an MVP (Minimum Viable Product), the version of the product that gives the maximum amount of validated learning about customers with the least effort. Defining the MVP then leads to the knowledge-gathering phase (discovery): understanding the "why" behind the experiment. It involves:

- Setting up hypotheses around the MVPs that address customer needs. For instance, if it is identified that the bag page of an e-commerce website is not intuitive enough for customers, and there is a gap between the customer's understanding and the details mentioned on the webpage, the possible hypotheses can be:
  - Hypothesis 1: Providing a link to check for available promo codes will ease the customer's pain of searching for valid promo codes, thus improving conversions, revenue per session, and AOV (average order value)
  - Hypothesis 2: Providing information about acceptable payment methods visually will help customers understand it at a glance, thus reducing confusion and exits
  - Hypothesis 3: Enhancing the color and size of the checkout button will attract the customer's attention, thus resulting in higher checkout, AOV, and revenue
  - Hypothesis 4: Implementing the feature of editing the quantity on the bag page will give customers the flexibility to modify their orders, thus resulting in higher bag conversion and other revenue metrics
- Experimental design, where the key identifiers and performance indicators are defined to measure the performance of the website versions. A lightweight sketch of how such a plan might be captured is shown below.
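To make the planning step concrete, here is a minimal, hypothetical sketch (in Python) of how the outputs of Step 2 could be captured in code. The field names (name, hypothesis, primary_kpi, and so on) and the example values are illustrative assumptions, not part of any specific Factspan tooling.

    from dataclasses import dataclass, field

    @dataclass
    class ExperimentPlan:
        """Discovery-phase outputs of an A/B test (illustrative structure)."""
        name: str                      # MVP / experiment name
        hypothesis: str                # the change and the expected customer behavior
        primary_kpi: str               # the metric used to call the test
        secondary_kpis: list = field(default_factory=list)
        traffic_split: dict = field(default_factory=lambda: {"control": 0.5, "test": 0.5})
        min_runtime_days: int = 14     # at least two weeks, per the guidance in Step 4

    plan = ExperimentPlan(
        name="Payment badges on bag page",
        hypothesis="Showing accepted payment methods reduces friction at checkout",
        primary_kpi="bag_conversion_rate",
        secondary_kpis=["revenue_per_session", "average_order_value"],
    )
    print(plan)

Writing the plan down in one place like this keeps the hypothesis, the success metric, and the split decisions tied together before any traffic is exposed to the change.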

Step 3: Identify target customers

You need to know who you're making the changes for; this will give you accurate data. Consider the device used, demographics, the operating system, time duration, webpage details, and any other attribute you think is relevant. While conducting an A/B test, visitors are first randomly split between the Control and Test groups so there is no bias. Then, based on the experiment's specific attributes (such as the ones defined above), the visitors are filtered for further testing to get more accurate results.

Step 4: Set up the experiment

For this, you need to define:

- The desired traffic split. Try to ensure that the control and test arms receive a similar amount of traffic, and keep a certain portion of the visitors in a holdout group in case questions arise about the long-term impact of a feature. The holdout group receives the same experience as the control group; keep it at a lower volume of traffic for a longer duration to ensure a stabilized result.
- The runtime. Typically, running your test for at least two weeks is effective, since it will cover a variety of customers and events.
- Experiment-specific tags/identifiers. Define test variant tags, page IDs, element interaction tags and so on, to filter your data according to the relevant success metric.
- Scripting. This means building code to aggregate the data and calculate the KPIs (key performance indicators).

Step 5: Measure the KPIs

A/B test measurement usually starts with a traffic split check, followed by query execution for the defined KPIs. We then perform the lift calculation of the KPI, which is the relative performance of the test KPI numbers with respect to control. Finally, revenue projections are made to see what the impact on annual sales could be. Here are a few best practices we've found through experimentation:

- Measure the traffic split daily and highlight the days when the split seems off. Generally, if the actual split follows the defined split within a ±1% tolerance, it is within limits. If the split seems off, it means there are issues in data tagging, which must be checked and rectified.
- Run the experiment for the defined period, uninterrupted. The KPIs should be measured weekly, so that any unusual effects are observed early and subjected to further drill-down.
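To make Steps 4 and 5 concrete, here is a minimal Python sketch of the daily traffic split check (±1% tolerance) and the relative lift calculation described above. The visitor and conversion counts, the function names, and the 50/50 target split are assumptions for illustration only.

    # Illustrative traffic split check and KPI lift calculation (hypothetical data).

    def split_ok(control_visitors, test_visitors, target_control=0.5, tolerance=0.01):
        """Return True if the observed control share is within +/-1% of the target."""
        total = control_visitors + test_visitors
        observed_control = control_visitors / total
        return abs(observed_control - target_control) <= tolerance

    def lift(control_rate, test_rate):
        """Relative lift of the test KPI with respect to control."""
        return (test_rate - control_rate) / control_rate

    # Hypothetical daily numbers
    control_visitors, control_conversions = 10_120, 456
    test_visitors, test_conversions = 10_043, 489

    print("Split within tolerance:", split_ok(control_visitors, test_visitors))

    control_cr = control_conversions / control_visitors  # control conversion rate
    test_cr = test_conversions / test_visitors           # test conversion rate
    print(f"Conversion lift: {lift(control_cr, test_cr):.2%}")

A daily check like this is what surfaces the tagging issues mentioned above before they contaminate a full week of data.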

Step 6: Analyze the statistics

The KPIs alone don't suffice to draw a conclusion, because only a sample group is exposed to the experiment. The observed differences must therefore be tested for statistical significance before they can be generalized to the whole customer base (see the hypothesis testing section below).

Step 7: Derive your insights

Based on the KPI numbers, statistical p-values, and pre-power analysis, you can draw insights and make recommendations. This is how you identify the most impactful MVPs, which can be taken to the next level and implemented on the website.

Misconceptions: What Companies Usually Don't Do While A/B Testing, But Should

While the concept of A/B testing is simple to implement, many organizations fail to follow the best protocol for conducting an A/B test. Be it due to premature significance, gut-feeling-based sample estimations, or pre-closure of experiments, companies tend to get lured in if they see desirable results partway through the experiment. It is strongly suggested not to manipulate experiments based on early judgment. Here are a few things companies should do while A/B testing:

- Estimate the desired sample size and test duration. Defining a sample size and test run time based on gut feeling may lead to statistically insignificant results.
- Check for pre-power results. A pre-power analysis before starting the experiment is recommended to find the number of visitors that needs to be reached within the experiment's run time.
- Re-test the experiments. An experiment that has failed once could work sometime in the future.
- Let the experiment run its course for the duration you had originally planned, even if the performance already looks undoubtedly great.

Number Crunching: The Logic and Math in A/B Testing

Before jumping to conclusions, the authenticity of the A/B test results must be verified by appropriate statistical methods.

Hypothesis testing

A hypothesis test evaluates two mutually exclusive statements (the null hypothesis H₀ and the alternate hypothesis H₁) about a population to determine which statement is better supported by the sample data. The success of an A/B test depends on how strong the hypothesis is.

Steps to test the hypothesis

1. Define the null hypothesis (H₀), a hypothesis/statement opposite to the guess
2. State the alternate/working/research hypothesis, denoted by H₁
3. Set the probability of having a Type I error (α), also called the significance level
4. Compute the probability value (p-value) and compare it with α
5. If p-value ≤ α, then H₀ is rejected, and the results are statistically significant
6. If p-value > α, then H₀ cannot be rejected, and the results are statistically insignificant

Statistical Decision  | H₀ True       | H₀ False
----------------------|---------------|---------------
Reject H₀             | Type I error  | Correct
Do not reject H₀      | Correct       | Type II error

Understanding the terminology

- The significance level (α) represents the probability of rejecting the null hypothesis when it is true
- A Type I error is the incorrect rejection of a true null hypothesis (false positive); in other words, detecting an effect that is not present. The probability of a Type I error is denoted by α
- A Type II error is the failure to reject a false null hypothesis (false negative); an error that fails to detect an effect that is present. The probability of a Type II error is denoted by β
- The p-value is the probability of seeing a result at least as extreme as the observed one, given that the null hypothesis is true

[Figure: a two-tailed test, with "Reject H₀" critical regions in both tails (p < α) and a "Fail to reject H₀" region in the center (p > α)]

In the figure above, the two shaded areas (critical regions) are equidistant from the null hypothesis value, and each area has a probability of 0.025, for a total significance level of 0.05. A worked example of this decision rule is sketched below.
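The following is a minimal sketch of a pooled two-proportion z-test on hypothetical conversion counts, assuming α = 0.05 (0.025 in each tail, matching the figure above). The counts and the choice of a two-sided test are assumptions for illustration; they are not results from any experiment in this paper.

    from math import sqrt
    from scipy.stats import norm

    # Hypothetical results: (conversions, visitors) for control and test
    control_conv, control_n = 456, 10_120
    test_conv, test_n = 489, 10_043

    p_c = control_conv / control_n
    p_t = test_conv / test_n

    # Pooled two-proportion z-test (H0: the two conversion rates are equal)
    p_pool = (control_conv + test_conv) / (control_n + test_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / test_n))
    z = (p_t - p_c) / se
    p_value = 2 * norm.sf(abs(z))          # two-sided p-value

    alpha = 0.05
    z_crit = norm.ppf(1 - alpha / 2)       # about 1.96: the edges of the two shaded tails
    print(f"z = {z:.3f} (critical value ±{z_crit:.2f}), p-value = {p_value:.4f}")
    if p_value <= alpha:
        print("Reject H0: the difference is statistically significant")
    else:
        print("Fail to reject H0: the difference is not statistically significant")

The same decision can be read either from the p-value (compare with α) or from the z statistic (compare with the ±1.96 critical values that bound the shaded regions).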

Statistical Power: Detecting an effect that exists

Where the significance level is the probability of detecting an effect when none exists, power is the probability of detecting an effect when one does exist. So, when power is low, there is a big chance that the real winner of an A/B test goes unrecognized. Statistical power is the likelihood that a study will detect an effect when there is an effect to be detected. High statistical power reduces the probability of making a Type II error.

Four quantities interact in any statistical significance test: the effect size (observed difference or lift), the sample size (N), the significance level (α) and the statistical power (1 − β); fixing any three determines the fourth. You can increase statistical power by:

- Enlarging the true effect size (testing bolder changes)
- Minimizing the variance, and thus the standard deviation
- Increasing the sample size

A rough sketch of how these quantities translate into a required sample size per arm is shown below. With that, you've been introduced to the ABCs of A/B tests; the next section walks through a few use cases so you understand better.
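Here is a minimal sketch of a pre-power (sample size) calculation for a conversion-rate test, using statsmodels' standard power utilities. The baseline rate, the 5% relative lift to be detected, α = 0.05 and power = 0.80 are illustrative assumptions, not recommendations.

    # Pre-power analysis: visitors needed per arm to detect a given lift (illustrative).
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    baseline_rate = 0.045            # hypothetical control conversion rate
    target_rate = baseline_rate * 1.05   # aim to detect a 5% relative lift
    alpha, power = 0.05, 0.80        # significance level and desired power (1 - beta)

    effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
    n_per_arm = NormalIndPower().solve_power(effect_size=effect_size,
                                             alpha=alpha, power=power,
                                             ratio=1.0, alternative='two-sided')
    print(f"Approximate visitors needed per arm: {n_per_arm:,.0f}")

Comparing this required sample size with the expected traffic over the planned runtime is what tells you, before launch, whether the two-week window from Step 4 is actually enough.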

Examples of Successful A/B Tests

Use Case #1: Displaying accepted payment methods on the bag page

Experiment details

- Experiment name (MVP): Adding payment badges on the bag page
- Hypothesis: Displaying accepted payment methods will result in customers experiencing less friction during payment, thus leading to higher bag conversion

[UX wireframe: Control vs. Test, with an information badge about accepted payments in the Test variant]

The difference between Control and Test

- As shown in the comparative wireframe above, the control version is the unchanged version of the website
- The test variant has the accepted payment methods added to the bag page in the form of a payment badge (no other changes)

Impact

- More than 1% uplift in checkout conversion, bringing in significant incremental monetary value
- A comparatively higher revenue per session in test, with an uplift of 1.2% versus control, suggesting much-reduced friction for customers while shopping

Recommendation(s)

Adding a payment badge on the bag page enhanced the customer experience, resulting in increased conversion and incremental revenue. The test version was scaled up on the online platform after the experiment.
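As a purely illustrative aside, here is how a revenue-per-session lift like the 1.2% above could be translated into an annual revenue projection, following the projection step described in Step 5. The session volume and baseline revenue per session below are hypothetical and are not figures from this experiment.

    # Illustrative annualization of a revenue-per-session (RPS) lift (hypothetical inputs).
    annual_sessions = 50_000_000     # assumed yearly site sessions
    baseline_rps = 2.40              # assumed baseline revenue per session, in dollars
    rps_lift = 0.012                 # 1.2% relative lift observed in the test

    incremental_revenue = annual_sessions * baseline_rps * rps_lift
    print(f"Projected incremental annual revenue: ${incremental_revenue:,.0f}")
    # With these assumed inputs: 50,000,000 * 2.40 * 0.012 = $1,440,000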

Use Case #2: Having minimum fields on the sign-up page

Experiment details

- Experiment name (MVP): Reducing the friction that stops customers from completing the account creation form by removing a few fields
- Hypothesis: The poor sign-up rate can be attributed to the large number of fields required to sign up, and reducing that number can increase the sign-up rate

The difference between Control and Test

- As shown in the comparative wireframes below, the control version is the unchanged version of the website
- The test variants are modified versions of the page, with the following changes:
  1. Test 1: Name fields only (with email & password)
  2. Test 2: Address fields only (with email & password)
  3. Test 3: Date of birth only (with email & password)
  4. Test 4: Security fields only (with email & password)

[Wireframes: Control vs. the four Test versions]

Impact

- More than an 11% increase in sign-up rate by keeping only the name fields (Test 1), versus Control
- An estimated $3.4 million increment in annual revenue

Recommendation(s)

Minimizing the length of the sign-up page by keeping only the name fields, with email & password, maximized the account creation rate, bringing significant incremental revenue to the online business.

Use Case #3: PayPal App Integration

Experiment details

- Experiment name (MVP): PayPal App Integration
- Hypothesis: Integrating a PayPal checkout feature will enhance payment flexibility, resulting in higher customer satisfaction, bag conversion and AOV

["Checkout with PayPal" CTA on the bag page. Note: the Control version doesn't have the Checkout with PayPal button on the bag page]

The difference between Control and Test

- The control version is the unchanged version of the website
- The test variant is the modified version of the page, with the additional functionality of checking out with a PayPal account (highlighted in the figure)

Impact

- The additional checkout option resulted in a relative improvement of more than 2% in checkout conversion, versus control
- A comparatively higher revenue per session in test, with an increase of 5% versus control, suggesting much-reduced friction for customers while shopping

Recommendation(s)

Providing online shoppers with the additional flexibility of checking out with their PayPal account led to improved customer satisfaction, and hence better conversion and AOV for the business.

How A/B tests are done at Factspan

This whitepaper explained how you can run and measure an A/B test. You also learned about the logic behind A/B tests, the best practices, and the common misconceptions. Plus, we showed you how a US-based e-commerce giant benefited from A/B testing.

At Factspan, the process of A/B test measurement is set up and streamlined via automated modules, from data access on online platforms (e.g., KISSmetrics, Omniture, Coremetrics) through to the reporting stage. The automated setup also helps in monitoring stepwise progress, exception handling, traffic split, significance, and more. Online businesses need robust modules that can assist them in A/B test design, and that's where Factspan also brings expertise in identifying the most impactful MVPs that can be subjected to experimentation.