AB Testing case studies are biased

Anyone else see the irony in AB testing case studies?

On the one hand, you believe in taking lots of unbiased random samples, counting conversions objectively, and even using a little statistics to account for sampling error; on the other hand, you choose to report only the few successful outcomes in your case study. That’s neither random nor unbiased.

If I were a skeptical person, I’d hypothesize that there were 999 unsuccessful experiments before the one successful experiment that gets written up in the case study. You run experiment after experiment until you get a result you like, and the 999 prior experiments never get reported. I don’t mean to pick on anyone, because we are all guilty of [publication bias][4] and we all want to put our best foot forward, but that doesn’t change the fact that the data is incomplete.

If you are thinking about AB testing, resist the urge to wait for significance. If you keep watching and hoping until you reach significance, you are effectively saying that the result depends on the intentions of the observer. Surely that can’t be right; the data should speak for itself. Before you hit the “start experiment” button, decide how many observations you’ll gather, then stop when you reach that number. If the result is inconclusive, so be it. You can use it to guide what to test next.
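You can see why waiting for significance is dangerous with a quick simulation. The sketch below (my own illustration, not from the post; the parameters `base_rate`, `peek_every`, and `n_max` are arbitrary choices) runs A/A tests where both variants are identical, so *every* “significant” result is a false positive. A tester who peeks at the running z-test and would stop at the first significant reading triggers far more false positives than one who only looks once, at the predetermined sample size:

```python
import math
import random

def z_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-proportion z-test: True if the two-sided p-value is below alpha."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_a / n_a - conv_b / n_b) / se
    # two-sided p-value from the standard normal CDF (via math.erf)
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p < alpha

def run_aa_test(n_max, peek_every, base_rate=0.05, rng=random):
    """One A/A test: both variants convert at the same rate.

    Returns (peeking_hit, fixed_hit): whether an impatient tester who
    checks every `peek_every` visitors would ever see "significance",
    and whether the single fixed-horizon test at n_max is significant.
    """
    conv_a = conv_b = 0
    peeking_hit = False
    for i in range(1, n_max + 1):
        conv_a += rng.random() < base_rate
        conv_b += rng.random() < base_rate
        if i % peek_every == 0 and z_significant(conv_a, i, conv_b, i):
            peeking_hit = True  # the impatient tester stops and celebrates here
    fixed_hit = z_significant(conv_a, n_max, conv_b, n_max)
    return peeking_hit, fixed_hit

random.seed(1)
trials = 2000
peek = fixed = 0
for _ in range(trials):
    p_hit, f_hit = run_aa_test(n_max=2000, peek_every=100)
    peek += p_hit
    fixed += f_hit

print(f"false-positive rate with peeking:      {peek / trials:.1%}")
print(f"false-positive rate at fixed horizon:  {fixed / trials:.1%}")
```

The fixed-horizon tester’s false-positive rate stays near the nominal 5%, while the peeking tester’s is several times higher, because checking 20 times gives chance 20 opportunities to cross the threshold.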

The quality of AB testing should be judged on the testing process itself, not so much on the outcomes. I just worry that headlines such as “We used AB testing to lift conversions by 34.35355%” discourage people who start testing themselves and don’t get the same results.

[4]: http://www.freakonomics.com/2012/08/14/is-statistically-significant-really-significant/