Experimental Design in Practice: Dos and Don’ts of Group Size

Today I would like to share an example of design and analysis that I came across a while ago. It taught me that simplicity is often more important than complicated scientific logic.

A few years ago I was asked to look into direct mail programs run by another department in my company. The design was very similar to what I ran: you have a schedule with multiple mailpieces going out to targets, and a random control group is held out from the mail. The analysis was pretty simple: define your response window, calculate new-customer connect rates during this window for both the mail and control groups, and then compare them to determine lift.
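
For concreteness, here is a minimal sketch of that comparison, assuming a flat file with one row per target; the column names and the 90-day window are illustrative, not the department's actual schema:

```python
import pandas as pd

def lift(df: pd.DataFrame, window_days: int = 90) -> float:
    """Connect-rate lift of the mail group over the holdout control group.

    Assumes one row per target with (illustrative) columns:
      group        - 'mail' or 'control'
      connect_days - days from mail drop to new-customer connect, NaN if none
    """
    # A target counts as a responder only if it connected within the window
    responded = df["connect_days"].le(window_days)
    rates = responded.groupby(df["group"]).mean()
    return rates["mail"] - rates["control"]
```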

I did just that. Since the other department ran around 20 different mail lists every quarter, many with additional sub-segments and variations of mailpieces, and I was asked to do a quick assessment of a year's worth of effort, I just summed everything up and ended up with… a very meaningful negative impact of mail on new connects. That could not have been just a random aberration, since the total sample size was measured in millions.

What was the problem?

I dug into the numbers and noticed that some of the control groups were very small; for example, a mail program with 1 million targets would have a control group of just 1,000 targets. I stopped right there and went to talk to people.

The answer was something I never expected, although it did have logic to it. This department used a formula, conveniently furnished by corporate analytics, to determine control group sizes for their mailpieces. The sizes were chosen to reach a certain level of statistical significance (they used an independent-samples t-test). The department applied the formula to every mail program they ran. Seems logical.
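
I don't know the exact corporate formula, but a standard two-proportion power calculation of the kind they describe looks roughly like the sketch below; the response rates, alpha, and power are placeholders, not their actual inputs.

```python
from scipy.stats import norm

def required_control_size(p_control: float, p_mail: float, mail_size: int,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest control group that can detect the lift (p_mail - p_control)
    with a two-sided two-proportion z-test, given a fixed mail group size.

    Uses var(rate difference) = p_mail(1-p_mail)/n_mail + p_ctrl(1-p_ctrl)/n_ctrl.
    """
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    delta = p_mail - p_control

    # Variance budget left for the control group after the mail group's share
    budget = (delta / (z_alpha + z_beta)) ** 2 - p_mail * (1 - p_mail) / mail_size
    if budget <= 0:
        raise ValueError("lift is too small to detect even with a very large control group")
    return int(round(p_control * (1 - p_control) / budget))

# Placeholder numbers: detecting a 0.5% -> 0.7% lift on a 1,000,000-piece mailing
print(required_control_size(p_control=0.005, p_mail=0.007, mail_size=1_000_000))
```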

There was a small catch in the programs. Essentially, they ran two types of programs: carpet bombing and targeted. Targeted programs went to segments with a known demographic profile and had a relatively high response rate. Carpet bombing programs went out to the rest of the targets and had a relatively low response rate. Some people will ask why carpet bomb at all if the response rate is low. The answer is that they were not simply after raw response rate, but after lift, i.e. the difference between mail and control response rates, and the two are often unrelated.

Here is roughly what this department got as control size recommendations from the t-test formula:

[Table from the original post: roughly 50,000 mail / 3,000 control for the Targeted program, and 1,000,000 mail / 1,000 control for the Carpet Bombing program.]
Now, if we follow these recommendations for both programs and then sum up the results straight up, this is the lift we will be getting.

[Figure from the original post: lift calculation for the combined mail and control groups.]
This happens because the combined mail and control groups are no longer representative of each other: the Targeted Mail segment makes up 3,000/(1,000+3,000) = 75% of the combined control group, but only 50,000/(1,000,000+50,000) ≈ 4.8% of the combined mail group. In other words, we are largely comparing the response rate of the Carpet Bombing mail group to the response rate of the Targeted control group.
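
To make the mechanism concrete, here is a small worked example; the group sizes are the ones above, while the response rates are invented purely to show how every segment can have positive lift and yet the pooled comparison comes out negative.

```python
# Group sizes from the example above; response rates are invented for illustration.
segments = {
    #                  (mail_n,    mail_rate, ctrl_n, ctrl_rate)
    "Targeted":        (50_000,    0.050,     3_000,  0.040),   # +1.0 pt lift
    "Carpet Bombing":  (1_000_000, 0.006,     1_000,  0.005),   # +0.1 pt lift
}

mail_resp = ctrl_resp = mail_n = ctrl_n = 0
for name, (mn, mr, cn, cr) in segments.items():
    print(f"{name}: lift = {(mr - cr) * 100:+.2f} pts")   # positive in every segment
    mail_resp += mn * mr
    ctrl_resp += cn * cr
    mail_n += mn
    ctrl_n += cn

# Pooling first and differencing second weights the segments differently in the
# mail and control totals (Carpet Bombing is ~95% of mail but only 25% of control),
# so the sign of the lift flips.
pooled_lift = mail_resp / mail_n - ctrl_resp / ctrl_n
print(f"Pooled: lift = {pooled_lift * 100:+.2f} pts")      # comes out negative
```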
