Controlled experiments are the gold standard for determining causality. The most important principle of this design is making sure the control is representative of the treatment group. This is typically achieved by random assignment of subjects into groups.
However, random assignment is often impossible. Here are some examples:
- Market-level tests. Whether you are testing a coupon redeemable at any store in the market or DMA-level media like radio or TV, market tests are common.
- Operational restrictions. Sometimes you have to run the test in a specific group of customers. Experiments tied to systems and operations work this way.
- Legal restrictions. In my practice, these came up around pricing, where laws and regulations govern what you can do with prices and promotions.
- Natural experiments. This is a catch-all category of past actions that many analysts are asked to look into. A natural experiment is simply how things happened to happen, and unless you have a time machine, you can’t go back and randomize the treatment.
In science, these scenarios fall under the umbrella of either quasi-experiments or observational studies. The good news is you don’t have to give up on determining causality if a true randomized test is impossible. Even when the groups are not representative, there are many things analysts can do to adjust for the differences.
I have dealt with many quasi-experiments and observational studies, from market-level tests to natural experiments. Here are my recommendations on how to get the best value out of them.
Market/Organizational Level Test
In these quasi-tests, your minimum test size is large, like a DMA or an organizational unit (region, division). Therefore, it makes sense to compare the test unit to the most “like” unit, or several units, if you can.
- Having more than one unit in the treatment group is preferable to testing in a single unit/market. Two small units are better than one big unit because they tend to have more stable results, aka lower variance. This applies to both treatment and control units, so always try to compare to a control that is a sum of multiple units/markets.
- If you are able to choose your test and control units, look for well-matched pairs – similar DMAs or org units. Make sure your company’s geography is well represented among the test units.
To make sure your groups are well matched, you should look at their history and compare test and control performance over time. The best metric to compare is the one measured in the test, like sales, new subscribers, or churn. This is called “pre-period matching”, and six months of comparison is usually ideal, though a shorter period might work as well.
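To make this concrete, here is a minimal sketch of a pre-period comparison in Python, assuming a monthly table with market, month, and sales columns; the column names and the scaled mean-absolute-gap metric are my own illustrative choices, not the only way to do it.

```python
import numpy as np
import pandas as pd

def rank_control_candidates(df, test_market, candidate_markets, metric="sales"):
    """Rank candidate control markets by how closely their pre-period
    history tracks the test market (smaller score = better match)."""
    # One column per market, one row per month
    wide = df.pivot_table(index="month", columns="market", values=metric)
    test_series = wide[test_market]
    scores = {}
    for m in candidate_markets:
        # Mean absolute gap, scaled by the test market's level so
        # large and small markets are comparable
        scores[m] = np.mean(np.abs(wide[m] - test_series)) / test_series.mean()
    return pd.Series(scores).sort_values()

# Toy pre-period data (made-up numbers): market A is the test market
pre_period = pd.DataFrame({
    "month":  ["2024-01", "2024-02", "2024-03"] * 3,
    "market": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "sales":  [100, 110, 105, 98, 112, 103, 150, 160, 170],
})
print(rank_control_candidates(pre_period, "A", candidate_markets=["B", "C"]))
```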
If there are some systemic differences in your matched groups, you may be able to compensate for them by segmenting customers/locations in the control and then reweighting the segments to match your test group.
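Here is a minimal sketch of that segment-and-reweight step, assuming a segment column and a sales metric; the column names and toy numbers are illustrative.

```python
import pandas as pd

def reweighted_control_mean(test, control, segment_col="segment", metric="sales"):
    """Weight each control segment by its share in the test group,
    then return the weighted control mean of the metric."""
    test_shares = test[segment_col].value_counts(normalize=True)
    control_means = control.groupby(segment_col)[metric].mean()
    # Only segments present in both groups contribute
    common = test_shares.index.intersection(control_means.index)
    weights = test_shares[common] / test_shares[common].sum()
    return (control_means[common] * weights).sum()

# Toy example: the control is skewed toward the "urban" segment
test = pd.DataFrame({"segment": ["urban"] * 3 + ["rural"] * 7,
                     "sales":   [12, 11, 13, 8, 9, 7, 8, 9, 8, 7]})
control = pd.DataFrame({"segment": ["urban"] * 7 + ["rural"] * 3,
                        "sales":   [12, 13, 11, 12, 13, 12, 11, 8, 9, 7]})
print("naive control mean:     ", control["sales"].mean())
print("reweighted control mean:", reweighted_control_mean(test, control))
```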
Natural Experiments or Observational Studies
In a natural experiment, there is a change in the relationship with some but not all customers, and our goal is to measure its impact on behavior.
If this natural test group consists of customers rather than organizational units, your best bet is to create a synthetic control group from the customers that were not in the natural experiment. Beware that this synthetic control group may not be representative of the treatment group. In this case, you need to match your synthetic control to your treatment group.
How to Create Matched Synthetic Control Groups
The match is based on historical performance, and you need to know which variables impact your outcome the most. For example, if your outcome is the churn rate and you know that seasonality impacts churn, you should match your groups by month or use year-over-year measurement.
Make sure the number of your matching variables is manageable. Between two and six is usually a good number. Pro tip: if you are matching active customers rather than prospects, look at variables that describe their past behavior, like RFM (recency, frequency, monetary value).
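For illustration, here is a minimal sketch of deriving RFM matching variables from a transaction log; the column names (customer_id, order_date, amount) and the cutoff date are assumptions, not anything prescribed above.

```python
import pandas as pd

def rfm_table(transactions, as_of):
    """Recency (days since last order), frequency (order count),
    and monetary value (total spend) per customer, as of a cutoff date."""
    as_of = pd.Timestamp(as_of)
    grouped = transactions.groupby("customer_id").agg(
        last_order=("order_date", "max"),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    grouped["recency_days"] = (as_of - grouped["last_order"]).dt.days
    return grouped[["recency_days", "frequency", "monetary"]]

# Toy transaction log with made-up values
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-20", "2024-02-10",
         "2024-01-15", "2024-02-01", "2024-03-30"]),
    "amount": [40.0, 55.0, 20.0, 10.0, 15.0, 12.0],
})
print(rfm_table(transactions, as_of="2024-04-01"))
```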
There are two ways to match your synthetic control to the test group – recruiting and weighting. Recruiting means creating the control group from a stratified random sample that mimics the structure of your test group. Weighting means re-weighting the outcome of the control group by the matching variables to better represent the test group.
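Below is a minimal sketch of the recruiting approach – a stratified random sample drawn from the untreated pool so it mirrors the test group’s mix; the weighting approach works like the reweighting sketch shown earlier. The stratum column and the sample sizes here are illustrative assumptions.

```python
import pandas as pd

def recruit_control(test, pool, strata_col, n_control, random_state=42):
    """Sample from the untreated pool so each stratum's share in the
    control matches its share in the test group."""
    target_shares = test[strata_col].value_counts(normalize=True)
    samples = []
    for stratum, share in target_shares.items():
        candidates = pool[pool[strata_col] == stratum]
        # Round to the target count, but never ask for more than we have
        n = min(int(round(share * n_control)), len(candidates))
        samples.append(candidates.sample(n=n, random_state=random_state))
    return pd.concat(samples, ignore_index=True)

# Toy example: the test group is 60% "high" RFM, 40% "low"
test = pd.DataFrame({"rfm_segment": ["high"] * 6 + ["low"] * 4})
pool = pd.DataFrame({"rfm_segment": ["high"] * 50 + ["low"] * 150})
control = recruit_control(test, pool, strata_col="rfm_segment", n_control=100)
print(control["rfm_segment"].value_counts(normalize=True))  # ~60% high / 40% low
```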
Using both recruiting and weighting is ideal, but if you don’t have an automated system or resources to do both, I recommend starting with re-weighting. In my practice, it’s an efficient way to match the groups quickly.
To confirm that your synthetic control is well matched to your treatment group, you should compare their performance in the pre-period.
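One way to run that confirmation, sketched under the assumption that you have a pre-period outcome value per customer; the Welch t-test and the 0.05 threshold are my own illustrative choices.

```python
import numpy as np
from scipy import stats

def pre_period_check(test_outcome, control_outcome, alpha=0.05):
    """Compare pre-period outcomes: a small p-value suggests the groups
    were already different before the treatment, i.e. a poor match."""
    t_stat, p_value = stats.ttest_ind(test_outcome, control_outcome, equal_var=False)
    gap_pct = 100 * (np.mean(test_outcome) - np.mean(control_outcome)) / np.mean(control_outcome)
    return {"gap_pct": round(gap_pct, 2),
            "t_stat": round(t_stat, 2),
            "p_value": round(p_value, 4),
            "well_matched": p_value > alpha}

# Simulated pre-period outcomes for a quick demonstration
rng = np.random.default_rng(0)
test_pre = rng.normal(100, 10, size=500)
control_pre = rng.normal(101, 10, size=500)
print(pre_period_check(test_pre, control_pre))
```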
If you have to go this route, my recommendation is to look at a solution called Test&Learn by Applied Predictive Technologies. They are a powerhouse at creating matched controls, and there is really nobody on the market better at it. Highly recommended.