Anyone else see the irony concerning AB testing case studies?
On the one hand you believe in taking lots of unbiased random samples, counting conversions objectively, and even using a little bit of stats to account for sampling errors, but on the other hand you only choose to report the few successful outcomes in your case study? That’s neither random nor unbiased.
If I were a skeptical person I’d hypothesize that there were 999 unsuccessful experiments before the first successful one that gets written up in the case study. You conduct a number of experiments until you get a result you like, but the 999 prior experiments don’t get reported. I don’t mean to pick on anyone because we are all guilty of publication bias and we all want to put our best foot forward, but that doesn’t change the fact that the data is incomplete.
If you are thinking about AB testing, resist the urge to wait for significance. If you keep a test running in the hope that it reaches significance, you are basically saying that the result depends on the intentions of the observer. Surely that can’t be right. The data should speak for itself. Before you hit the “start experiment” button, you should therefore decide the number of observations you’ll gather, and then stop when you reach that number. If the result is inconclusive, so be it. You can use it to guide what you test next.
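To make "decide the number of observations up front" a bit more concrete, here is a minimal sketch of one common way to pick that number: the normal-approximation sample size formula for comparing two conversion rates. The function name, the default significance level (5%) and the default power (80%) are my own assumptions for illustration, not something any particular testing tool prescribes.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a two-sided test.

    Hypothetical helper: uses the standard normal approximation for
    comparing two proportions. alpha and power are assumed defaults.
    """
    if not (0 < baseline_rate < 1 and 0 < expected_rate < 1):
        raise ValueError("rates must be strictly between 0 and 1")
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ, otherwise there is nothing to detect")

    # Critical values for the chosen significance level and power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)

    # Pooled rate under the null hypothesis of no difference.
    pooled = (baseline_rate + expected_rate) / 2

    numerator = (
        z_alpha * math.sqrt(2 * pooled * (1 - pooled))
        + z_beta * math.sqrt(
            baseline_rate * (1 - baseline_rate)
            + expected_rate * (1 - expected_rate)
        )
    ) ** 2
    return math.ceil(numerator / (expected_rate - baseline_rate) ** 2)


if __name__ == "__main__":
    # Example: 10% baseline conversion, hoping to detect a lift to 12%.
    print(sample_size_per_variant(0.10, 0.12))
```

With a 10% baseline and a hoped-for lift to 12%, this works out to a few thousand visitors per variant. The point is that the number is fixed before the experiment starts, so the decision to stop no longer depends on what the results happen to look like that day.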
The quality of AB testing should be judged on the process of testing itself, not so much on the outcomes. I just worry that headlines such as “We used AB testing to lift conversions by 34.35355%” are discouraging to people when they start testing themselves and don’t get the same result.
3 comments
I think the larger question is whether these sets of cases are generalizable in any real sense. So, setting the bias question aside, is there even any information that can be extracted from a meta-analysis of all of these cases?
That’s a great point. I suspect that you can only rely on your own data and experience. The fact that another site achieved a certain result is pretty much irrelevant. Maybe there should be a disclaimer: *results not typical.
Ha, great observation! It’s partly because most sites publishing case studies are selling their expertise at setting up these tests, but I’m sure it’s also that it just doesn’t seem as interesting to report on cases where nothing much happens. Of course, that kind of practical info would actually be very useful for people figuring out how to deal with inconclusive or less-than-impressive results in real life!