A/B testing (also known as split testing and bucket testing) is a randomized method of comparing two versions of an element (A and B) against each other to determine which one performs better using a metric to define success. We subject both versions to experimentation simultaneously, measure which version was more successful, and select that version for real-world use.
Which features of my product or service will increase my conversion rate?
Which is the most effective design of my product or service to increase sales?
What is the most effective design of my website to generate traffic?
Which website layout will lead to a higher sales-conversion rate?
A/B testing is similar to the experiments you did in Science 101. Remember the one where you tested various substances to see which supports plant growth and which suppresses it? You measured the growth of plants at different intervals as they were subjected to different conditions, and in the end tallied the increase in height of the different plants.
A/B testing allows you to show potential customers and users two versions of the same element and let them determine the winner. As the name implies, two versions (A and B) are compared that are identical except for one variation that might affect a user's behavior. Version A might be the currently used version (control), while version B is modified in some respect (variation).
In online settings, such as web design (especially user-experience design), the goal is to identify changes to web pages that increase or maximize an outcome of interest. Constantly testing and optimizing your web page can increase revenue, donations, leads, registrations, downloads, and user-generated content, while providing teams with valuable insight about their visitors.
For instance, on an ecommerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements such as copy, layouts, images, and colors.
By measuring the impact that changes have on your metrics (such as signups, downloads, purchases, or whatever else your goals might be), you can ensure that every change produces positive results.
This differs from multivariate testing, which tests multiple variations of a page at the same time.
The vastly larger group of statistics broadly referred to as "multivariate testing" or "multinomial testing" is similar to A/B testing, but may test more than two different versions at the same time and/or has more controls. Simple A/B tests are not valid for observational, quasi-experimental, or other nonexperimental situations, as is common with survey data, offline data, and other, more complex phenomena.
Imagine a company, Acme Cables, that operates a web store selling cables. The company’s ultimate goal is to sell more cables and increase their yearly revenue, thus the checkout funnel is the first place Acme’s head of marketing will focus optimization efforts.
The “Buy” button on each product page is the first element visitors interact with at the start of the checkout process. The team hypothesizes that making the button more prominent on the page would lead to more clicks and therefore more purchases. The team then makes the button red in the variation and leaves the button grey in the original. They are able to quickly set up an A/B test using an A/B testing tool that pits the two variations against each other.
As the test runs, all visitors to the Acme Cables site are bucketed into a variation. They are equally divided between the red button page and the original page.
Once enough visitors have run through the test, the Acme team ends the test and is able to declare a winner. The results show that 4.5 percent of visitors clicked on the red "Buy" button, and 1 percent clicked on the original version. The red “Buy” button led to a significant uplift in conversion rate, so Acme then redesigns their product pages accordingly. In subsequent A/B tests, Acme will apply the insight that red buttons convert better on their site than grey buttons.
You can also use it when you want to test your headline, but you have three possible variations. In this case, running a single test and splitting your visitors (or recipients in the case of an email) into three groups instead of two is reasonable and would likely still be considered an A/B test. This is more efficient than running three separate tests (A vs. B, B vs. C, and A vs. C). You may want to give your test an extra couple of days to run so you still have enough results on which to base any conclusions.
Testing more than one thing at a time, such as a headline and call to action, is a multivariate test, which is more complicated to run. There are plenty of resources out there for multivariate testing, but we won’t be covering that when talking about A/B testing.
Once you've concluded the test, you should update your product and/or site with the desired content variation(s) and remove all elements of the test as soon as possible.
A/B testing is not an overnight project. Depending on the amount of traffic you get, you might want to run tests for anywhere from a few days to a couple of weeks. And you’ll only want to run one test at a time for the most accurate results.
Considering the impact A/B testing can have on your bottom line, though, it’s worth taking a few weeks to properly conduct tests. Test one variable at a time, and give each test sufficient time to run.
Define the question you want to answer: "Why is the bounce rate of my website higher than the industry standard?" Start an A/B test by identifying a goal for your company, such as "Reduce bounce rate."
Do background research: Understand customer/consumer behavior. For websites, you can use Google Analytics and any other analytics tools. For other purposes, you can use consumer behavior analytics if they are available.
Construct a hypothesis: Define the hypothesis you want to test in a concise and measurable manner, such as "Adding more links in the footer will reduce the bounce rate."
Define metrics and significant difference: Derive one metric that measures the hypothesis, in this case the “bounce rate.” Once you have defined the metric, set the significant difference, that is, the minimum difference between the two versions’ metrics that will mean the change is worth it. For example: Version B must have at least 3 percent less bounce rate than Version A to consider it a successful and meaningful improvement.
Calculate the number of visitors/days you need to run the test for: Always calculate the number of visitors required for a test before starting the test. You can use an A/B Test Duration Calculator for website purposes.
Test your hypothesis: You create two products/services A and B, in which the variation (Version B) has the hypothesis you want to test, in this case a footer with more links. You test it against the original and gather results of the metric selected, in this case bounce rate.
Analyze data and draw conclusions: If the footer with more links reduces bounce rate more than the target set, then you can conclude that an increased number of links in the footer is one of the factors that reduces bounce. If there is no meaningful difference in bounce, then go back to Step 3 and construct a new hypothesis.
Report results to all concerned: Let others in marketing, IT, and UI/UX know of the test results and insights generated.
We must set from the beginning the significant difference (practically significant), that is, what difference between the version will lead to change. This decision is based on several factors, including investment of the changes and periodicity of changes. For online testing, a 1-2 percent difference is enough to justify the change. For offline testing (such as new medicine or a new hardware product), the difference to make the change beneficial can be around a 10-15 percent difference in magnitude.
We must ensure that what has been observed is repeatable, and not an isolated case. The size of the experiment must be constructed in a way that the statistical significance bar is lower than the practical significance.
It is important to note that if segmented results are expected from the A/B test, the test should be properly designed at the outset to be evenly distributed across key customer attributes, such as gender. That is, the test should both (a) contain a representative sample of men vs. women, and (b) assign men and women randomly to each “treatment” (treatment A vs. treatment B). Failure to do so could lead to experiment bias and inaccurate conclusions being drawn from the test.
Giving a test insufficient time can mean skewed results, as you don’t get a large enough group of visitors to be statistically accurate. Running a test for too long can also give skewed results, though, since there are more variables you can’t control over a longer period. Make sure that you stay abreast of anything that might affect your test results, so that you can account for any statistical anomalies when reviewing your results. If you’re in doubt, it’s perfectly reasonable to retest.
A/B testing is not so good for testing:
New things (such as change of version or novelty effect).
Too many changes, as the results will not be conclusive.
If something is missing (such as feature, style, information).
Keep your A/B Testing variations to a minimum to ensure meaningful results - @sircastel
Define your metrics and minimum success rate before running A/B Testing - @sircastel
Got a tip? Add a tweetable quote by emailing us: firstname.lastname@example.org
VWO - SaaS Pricing A/B Testing
Airbnb - Experiments at Airbnb
Got a case study? Add a link by emailing us: email@example.com