Experiment Design in Practice: When Random Selection is Impossible

One of the most important principles of true experiment design is making sure the control group is representative of the treatment group. This is generally achieved by random assignment. However, sometimes random assignment of customers to the groups is impossible. Here are some examples:

  • Market-level tests. Whether you are testing a coupon redeemable at any store in the market or market-wide media like radio or TV, market tests are quite common.
  • Operational restrictions. Sometimes you can only flip the test on for a large group of people at once. Many pricing experiments work that way.
  • Legal restrictions. Again, often related to pricing: there are laws and regulations on how and for whom you can change prices.
  • Natural experiments. This is a catch-all category of past actions that analysts are often asked to look into. That’s just how things happened, and unless you have a time machine, you can’t go back and randomize the treatment.

In science, these situations fall under the umbrella of either quasi-experiments or observational studies. However, you don’t have to give up on effective measurement if a randomized test is impossible. Even if the groups are not representative, there are many things analysts can do to compensate.

There are several situations in which randomized selection is impossible. Let’s take the first one – your minimal treatment unit is large, like a DMA (designated market area) or an organizational unit. In this case, it makes sense to compare against the most “like” unit, or several such units, if you can.


First, see if you can include more than one unit in the treatment group: two small units are usually better than one big unit. Second, if you can choose your units, look for well-matched pairs – similar DMAs or org units. Then create your imperfect control from the other, similar units. Even if you have only one treatment unit, it usually pays to have two or more control units, as that reduces the variance in case something happens in a control territory (a flood, a hurricane, or another unexpected local event).
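Here is a minimal sketch of how you might rank candidate control units against a treatment DMA. It assumes a hypothetical `weekly_sales` DataFrame with one column of pre-period sales per DMA; the column names and distance metric are illustrative, not a prescribed method.

```python
# A minimal sketch of picking "like" control units for a single treatment DMA.
# Assumes a hypothetical DataFrame `weekly_sales` with one row per week and one
# column per DMA holding pre-period sales; names and data are illustrative.
import pandas as pd

def rank_control_candidates(weekly_sales: pd.DataFrame,
                            treatment_dma: str,
                            n_controls: int = 2) -> list[str]:
    # Normalize each DMA's history so we compare trend shape, not absolute size.
    normalized = (weekly_sales - weekly_sales.mean()) / weekly_sales.std()
    treated = normalized[treatment_dma]
    candidates = normalized.drop(columns=[treatment_dma])
    # Euclidean distance between each candidate's pre-period trend and the treatment's.
    distances = ((candidates.sub(treated, axis=0)) ** 2).sum().pow(0.5)
    # Keep the closest few units; two or more controls hedge against local shocks.
    return distances.nsmallest(n_controls).index.tolist()
```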

To make sure your units are well matched, look at their history over time and compare test and control. Naturally, you want to compare the outcome that will be measured in the test, like sales, new subscribers, or churn. This is called “pre-period matching”, and six months of comparison is usually ideal, though a shorter period can work as well. If there are systemic differences between your matched groups, you may be able to compensate for them by segmenting customers or locations and then reweighting the segments to match your test group.
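As an illustration of that reweighting step, the sketch below weights control-segment outcomes by the test group's segment mix. The column names (`segment`, `outcome`) are assumptions for the example, not a fixed schema.

```python
# A minimal sketch of reweighting a control group by segment so its mix matches
# the test group. Column names ("segment", "outcome") are illustrative assumptions.
import pandas as pd

def reweighted_control_mean(test: pd.DataFrame,
                            control: pd.DataFrame,
                            segment_col: str = "segment",
                            outcome_col: str = "outcome") -> float:
    # Share of each segment in the test group -- the composition we want to mimic.
    test_mix = test[segment_col].value_counts(normalize=True)
    # Average outcome per segment in the control group.
    control_means = control.groupby(segment_col)[outcome_col].mean()
    # Align weights to the control segments and renormalize in case some
    # test segments are missing from the control pool.
    weights = test_mix.reindex(control_means.index).fillna(0)
    weights = weights / weights.sum()
    return float((control_means * weights).sum())
```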

The next case is an observational study: a situation where you get historical records of a natural experiment and now have to figure out what incremental impact it had. Generally, this natural test group consists of customers rather than org units. In this case, your best bet is to create a synthetic control group from the customers that were not in the natural experiment. Often the differences between those who are “in” vs. those who are “out” are quite drastic, which makes creating a synthetic control group more difficult.

To match your synthetic control customers to the treatment group, you first need to know which variables impact your outcome the most. For example, if your outcome is churn rate, you may know that a customer’s marital status impacts the likelihood of an upgrade, but if it does not impact the probability of churning, you should not use this variable for matching. You want to keep the number of matching variables manageable; between two and six is usually a good number. Pro tip: look for variables that describe the customer’s past behavior, like RFM (recency, frequency, monetary value), especially if you deal with existing customers rather than prospects.
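For readers who have not built RFM variables before, here is a small sketch. It assumes a hypothetical `transactions` table with `customer_id`, `order_date`, and `amount` columns; the names are illustrative.

```python
# A minimal sketch of building RFM (recency, frequency, monetary value) matching
# variables from a hypothetical transactions table. Column names are assumptions.
import pandas as pd

def build_rfm(transactions: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    grouped = transactions.groupby("customer_id")
    return pd.DataFrame({
        # Recency: days since the customer's most recent purchase.
        "recency_days": (as_of - grouped["order_date"].max()).dt.days,
        # Frequency: number of purchases in the observed window.
        "frequency": grouped["order_date"].count(),
        # Monetary value: total spend in the observed window.
        "monetary": grouped["amount"].sum(),
    })
```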

There are two ways to match your synthetic control – recruiting and weighting. Recruiting refers to building the control as a stratified random sample based on the composition of the test group. Weighting means re-weighting the control group’s outcome to better represent the test group. Using both is ideal, but if you don’t have an automated system or the resources to do both, I recommend starting with re-weighting. In my practice, it’s an efficient way to match the groups quickly.
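The weighting approach was sketched earlier; here is a comparable sketch of recruiting via stratified sampling. The stratum column name, sample size, and seed are assumptions for the example.

```python
# A minimal sketch of "recruiting": drawing a stratified random sample of control
# customers whose composition matches the test group. The "stratum" column,
# sample size, and seed are illustrative assumptions.
import pandas as pd

def recruit_control(test: pd.DataFrame,
                    pool: pd.DataFrame,
                    stratum_col: str = "stratum",
                    size: int = 10_000,
                    seed: int = 42) -> pd.DataFrame:
    # Target share of each stratum, taken from the test group.
    target_mix = test[stratum_col].value_counts(normalize=True)
    samples = []
    for stratum, share in target_mix.items():
        candidates = pool[pool[stratum_col] == stratum]
        # Cap each stratum at what the pool can actually supply.
        n = min(int(round(share * size)), len(candidates))
        samples.append(candidates.sample(n=n, random_state=seed))
    return pd.concat(samples, ignore_index=True)
```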

In the case of a synthetic control group, you also want to make sure that customer behavior in the pre-period confirms that your groups are well matched.
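One simple way to do that check is to compare the weekly outcome of the two groups before the treatment starts, as in the sketch below. The `week` and `outcome` column names are assumptions for the example.

```python
# A minimal sketch of a pre-period sanity check: compare the weekly average
# outcome of the test group and the synthetic control before treatment.
# Column names ("week", "outcome") are illustrative assumptions.
import pandas as pd

def pre_period_gap(test: pd.DataFrame, control: pd.DataFrame) -> pd.DataFrame:
    test_weekly = test.groupby("week")["outcome"].mean().rename("test")
    control_weekly = control.groupby("week")["outcome"].mean().rename("control")
    report = pd.concat([test_weekly, control_weekly], axis=1)
    # A persistent or trending gap here means the groups are not well matched.
    report["relative_gap"] = (report["test"] - report["control"]) / report["control"]
    return report
```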

If you have to go this route, my recommendation is to look at a solution called Test&Learn by Applied Predictive Technologies. They are a powerhouse at creating matched controls, and there is really nobody on the market who does it better. Highly recommended.
