A/B testing (also known as split testing and bucket testing) is a randomized method of comparing two versions of an element (A and B) against each other to determine which one performs better using a metric to define success. You subject both versions to experimentation simultaneously, measure which version was more successful, and select that version for real-world use.
Which features of my product or service will increase my conversion rate?
Which is the most effective design of my product or service to increase sales?
What is the most effective design of my website to generate traffic?
Which website layout will lead to a higher sales conversion rate?
A/B testing is similar to the experiments you did in Science 101. Remember the one where you tested various substances to see which supports plant growth and which suppresses it? You measured the growth of plants at different intervals as they were subjected to different conditions and in the end tallied the increase in height of the different plants.
A/B testing allows you to show potential customers and users two versions of the same element and let them determine the winner. As the name implies, two versions (A and B) are compared that are identical except for one variation that might affect a user's behavior. Version A might be the currently used version (control), while version B is modified in some respect (variation).
In online settings, such as web design (especially user experience design), the goal is to identify changes to web pages that increase or maximize an outcome of interest. Constantly testing and optimizing your web page can increase revenue, donations, leads, registrations, downloads, and user generated content, while providing teams with valuable insight about their visitors.
For instance, on an ecommerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can sometimes be seen through testing elements like copy, layouts, images, and colors, but not always.
By measuring the impact that changes have on your metrics such as signups, downloads, purchases, or whatever else your goals might be, you can ensure that every change produces positive results.
This differs from multivariate testing, which tests multiple variations of a page at the same time.
The vastly larger group of statistics broadly referred to as multivariate testing or multinomial testing is similar to A/B testing but may test more than two different versions at the same time and/or has more controls, etc. Simple A/B tests are not valid for observational, quasi-experimental or other nonexperimental situations, as is common with survey data, offline data, and other, more complex phenomena.
Imagine a company, Acme Cables, that operates a web store selling cables. The company’s ultimate goal is to sell more cables and increase their yearly revenue, thus the checkout funnel is the first place Acme’s head of marketing will focus optimization efforts.
The “Buy” button on each product page is the first element visitors interact with at the start of the checkout process. The team hypothesizes that making the button more prominent on the page would lead to more clicks and therefore more purchases. The team then makes the button red in variation one and leaves the button grey in the original. They are able to quickly set up an A/B test using an A/B testing tool that pits the two variations against each other.
As the test runs, all visitors to the Acme Cables site are bucketed into a variation. They are equally divided between the red button page and the original page.
Once enough visitors have run through the test, the Acme team ends the test and is able to declare a winner. The results show that 4.5 percent of visitors clicked on the red buy button and 1 percent clicked on the original version. The red “Buy” button led to a significant uplift in conversion rate, so Acme then redesigns their product pages accordingly. In subsequent A/B tests, Acme will apply the insight that red buttons convert better on their site than grey buttons.
You can also use it when you want to test your headline, but you have three possible variations. In that case, running a single test and splitting your visitors (or recipients in the case of an email) into three groups instead of two is reasonable and would likely still be considered an A/B test. This is more efficient than running three separate tests (A vs. B, B vs. C, and A vs. C). You may want to give your test an extra couple of days to run, so that you still have enough results on which to base any conclusions. Testing more than one thing at a time, such as a headline and call to action, is a multivariate test, which is more complicated to run. There are plenty of resources out there for multivariate testing, but we won’t be covering that when talking about A/B testing. Once you've concluded the test, you should update your product and/or site with the desired content variation(s) and remove all elements of the test as soon as possible.
A/B testing is not an overnight project. Depending on the amount of traffic you get, you might want to run tests for anywhere from a few days to a couple of weeks. And you’ll only want to run one test at a time for the most accurate results. Considering the impact A/B testing can have on your bottom line, though, it’s worth taking a few weeks to properly conduct tests. Test one variable at a time, and give each test sufficient time to run.
Define the question you want to answer: "Why is the bounce rate of my website higher than industry standard?" Start an A/B test by identifying a goal for your company. E.g. reduce bounce rate.
Do background research: Understand customer/consumer behavior. For websites you can use Google Analytics and any other analytics tools. For other purposes you can use consumer behavior analytics if they are available. Construct a hypothesis: define the hypothesis you want to test in a concise and measurable manner, e.g., "Adding more links in the footer will reduce the bounce rate."
Define metrics and significant difference: derive one metric that measures the hypothesis, in this case the “bounce rate.” Once you have defined the metric, set the significant difference, that is, the minimum difference between the two versions’ metrics that will mean the change is worth it. E.g. significant difference: version B must have at least 3 percent less bounce rate than version A to consider it a successful and meaningful improvement.
Calculate the number of visitors/days you need to run the test for: Always calculate the number of visitors required for a test before starting the test. You can use an A/B Test Duration Calculator for website purposes.
Test your hypothesis: You create two products/services A and B, in which the variation (version B) has the hypothesis you want to test, in this case a footer with more links. You test it against the original and gather results of the metric selected, in this case bounce rate.
Analyze data and draw conclusions: If the footer with more links reduces bounce rate more than the target set, then you can conclude that increased number of links in the footer is one of the factors that reduces bounce. If there is no meaningful difference in bounce, then go back to step 3 and construct a new hypothesis.
Report results to all concerned: let others in marketing, IT and UI/UX know of the test results and insights generated.
We must set from the beginning the significant difference (practically significant), that is, what difference between the version will lead to change. This decision is based on several factors such as: investment of the changes, periodicity of changes, etc. For online testing, a 1-2 percent difference is enough to justify the change. For offline testing (e.g. new medicine or new hardware product), the difference to make the change beneficial can be around 10-15 percent difference in magnitude.
We must ensure that what has been observed is repeatability, and not an isolated case. The size of the experiment must be in a way that the statistical significance bar is lower than the practical significance.
It is important to note that if segmented results are expected from the A/B test, the test should be properly designed at the outset to be evenly distributed across key customer attributes, such as gender. That is, the test should both (a) contain a representative sample of men vs. women, and (b) assign men and women randomly to each “treatment” (treatment A vs. treatment B). Failure to do so could lead to experiment bias and inaccurate conclusions being drawn from the test.
Giving a test insufficient time can mean skewed results, as you don’t get a large enough group of visitors to be statistically accurate. Running a test for too long can also give skewed results, though, since there are more variables you can’t control over a longer period. Make sure that you stay abreast of anything that might affect your test results, so that you can account for any statistic anomalies when reviewing your results. If you’re in doubt, it’s perfectly reasonable to retest.
A/B testing is not so good for testing:
New things (e.g. change of version or novelty effect)
Too many changes, as the results will not be conclusive
If something is missing (e.g. feature, style, information)
Got a tip? Add a tweetable quote by emailing us: [email protected]
VWO - SaaS Pricing A/B Testing
Airbnb - Experiments at Airbnb
Got a case study? Add a link by emailing us: [email protected]