The bread and butter of marketing analytics is evaluating the results of marketing programs and tests. When measuring campaign performance, the first thing that comes to mind is the direct outcome (redemptions, sales, clicks, etc.). While it is relatively easy to calculate the direct outcome, it is much harder to judge whether that number is good or bad.
This is where the art of appropriate comparison becomes very handy. If you can compare your program targets to a matched group of customers that were not targeted, called the benchmark or control group, then you can tell how the program really impacted behavior.
In analytics, bias is a situation where your benchmark group is not representative of your target group. As a result, your assessment of the results of the program can be wrong, leading to bad decisions.
Let me be very honest here: bias can be hard to recognize, even for experienced analysts. Moreover, even if you recognize the presence of bias, it is often hard to explain how and why this bias was introduced. Too often, by the time you come across a “wonderful” piece of biased results and recognize the shortcomings, the “good” news has already traveled all over the organization, and it is almost impossible to stop this train.
I have no better way to teach about bias than to provide real-life examples of analyses that were biased. My hope is that readers will see how similar analyses in their own businesses may have the same problem, and will take steps to eliminate bias.
Selection Bias Due to an External Confounding Variable
Rate Increase Subscriber Analysis.
This is a classic case of Simpson's paradox, where we have a confounding variable that makes groups incomparable.
The goal of the analysis was to figure out how change in retail price of cable service impacted disconnect rate. The disconnect rate of customers who had the rate increase was compared to that of those who did not have the rate increase. The customers were matched by product holding.
The initial analysis showed that disconnect rate for those who had the rate increase was lower than those who did not. In layman's terms, increasing the rates appeared to have reduced churn.
In this case, selection bias was caused by the difference in tenure structure of the groups. Customers became eligible for the rate increase after their promotion expired, when they were paying retail rates. New customers were getting promotional rates, and thus were not eligible for the rate increase.
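Short tenure is strongly associated with high disconnect rates, so the group without the rate increase, which was full of new customers, had a higher overall disconnect rate simply because of its tenure mix. A breakdown of the disconnect rate by tenure showed that those who had the rate increase, in fact, had a higher churn rate in every tenure group.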
Solution: reweight the subscriber churn rate by tenure.
After the reweighting is done, the outcome is more indicative of what is going on with subscriber behavior after the rate increase.
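To make the reweighting concrete, here is a minimal Python sketch. The tenure bands and all counts are invented for illustration and are not the data from the original analysis, and the function names are hypothetical. It uses direct standardization, one common way to reweight: each group's tenure-level churn rates are weighted by the same combined tenure mix.

```python
# Hypothetical counts: tenure band -> (subscribers, disconnects in the period).
rate_increase = {"0-12 mo": (1_000, 60), "13-36 mo": (4_000, 120), "37+ mo": (5_000, 100)}
no_increase = {"0-12 mo": (6_000, 300), "13-36 mo": (3_000, 75), "37+ mo": (1_000, 15)}


def crude_rate(group):
    """Overall disconnect rate, ignoring tenure mix."""
    subscribers = sum(n for n, _ in group.values())
    disconnects = sum(d for _, d in group.values())
    return disconnects / subscribers


def standardized_rate(group, weights):
    """Disconnect rate reweighted to a common tenure mix."""
    return sum(weights[band] * d / n for band, (n, d) in group.items())


# Common weights: the tenure mix of both groups combined.
total = sum(rate_increase[b][0] + no_increase[b][0] for b in rate_increase)
weights = {b: (rate_increase[b][0] + no_increase[b][0]) / total for b in rate_increase}

print("crude:       ", crude_rate(rate_increase), crude_rate(no_increase))
print("standardized:", standardized_rate(rate_increase, weights),
      standardized_rate(no_increase, weights))
```

With these made-up numbers the crude rate is lower for the rate increase group, even though its churn is higher in every tenure band; after reweighting to a common tenure mix, the rate increase group shows the higher churn.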
Year Over Year Retail Ticket Analysis.
A large national retailer looked at the dynamics of its transactions by ticket (total $ per transaction) on a year over year basis. The conclusion was that customers have reduced their purchases in the lower priced ticket buckets. The analysis did not take into account a change in pricing that was caused by the change in commodity prices of the underlying raw materials.
Solution: while in some businesses it is possible to adjust the data using price inflation, for a retailer with over 100K SKUs it was not a feasible option. The real solution is to scrap this analysis and look at the data in a different way.
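A small Python sketch of why the ticket buckets are misleading when prices move. The ticket values, bucket boundaries, and the 10% price increase are all assumptions made up for this illustration; the baskets themselves are identical in both years.

```python
# Hypothetical tickets (total $ per transaction) for the same baskets of goods.
tickets_last_year = [8.00, 9.50, 12.00, 19.00, 24.00]
PRICE_INCREASE = 0.10  # assumed across-the-board price change


def bucket(ticket):
    """Assign a ticket to an (assumed) price bucket."""
    if ticket < 10:
        return "under $10"
    if ticket < 20:
        return "$10-$20"
    return "$20 and over"


def bucket_counts(tickets):
    counts = {"under $10": 0, "$10-$20": 0, "$20 and over": 0}
    for ticket in tickets:
        counts[bucket(ticket)] += 1
    return counts


# Same baskets, higher prices: customer behavior has not changed at all.
tickets_this_year = [t * (1 + PRICE_INCREASE) for t in tickets_last_year]

last_year = bucket_counts(tickets_last_year)
this_year = bucket_counts(tickets_this_year)
print(last_year)
print(this_year)
```

The lowest bucket shrinks and the highest one grows purely because tickets crossed the bucket boundaries, which is the pattern the original analysis read as a change in purchasing.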
Selection Bias Due to Internal Design Flaw
Regression to the Mean: the low performing stores project.
Every year a national retail chain placed the bottom 5% of stores into a special group, which was addressed by a taskforce. Every year, the work of this taskforce was a great success, as the majority of the stores inevitably moved up from the "bottom 5" the next year.
This phenomenon was described by Nobel Laureate Daniel Kahneman in his book “Thinking, Fast and Slow”, and it is called regression to the mean. This outcome is explained by the fact that performance on a store level is not fully deterministic - there are random factors that impact performance in the short term, and these factors rarely persist for longer than a year. This is why the following year any outlier store is likely to perform closer to the mean (average).
For example, if there was road construction in front of a store, and access to the store was restricted for much of the summer, it is likely that the store performance was artificially low in the classification year, and would improve dramatically the year after.
This particular analysis also had a bonus shortcoming, as performance was measured on a year over year basis. A drop in sales in the classification year would make the basis for the following year lower, thus pushing the store up the rankings, and magnifying the effect of regression to the mean.
Solution: use historical data to run a “placebo program” analysis. Select your groups in prior years when there were no special programs for underperforming stores, and see how much regression to the mean was happening historically. That can put the result in perspective.
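The placebo idea can be sketched in Python. Here simulated data stands in for the historical data, so everything about it is an assumption: store performance is modeled as a persistent quality level plus independent year-specific noise of the same size, and no program of any kind is applied.

```python
import random


def share_leaving_bottom(n_stores=2000, bottom_share=0.05, seed=0):
    """Share of year-1 bottom stores that are out of the bottom group in year 2."""
    rng = random.Random(seed)
    # Assumed model: persistent quality plus independent yearly noise.
    quality = [rng.gauss(100, 10) for _ in range(n_stores)]
    year1 = [q + rng.gauss(0, 10) for q in quality]
    year2 = [q + rng.gauss(0, 10) for q in quality]

    k = int(n_stores * bottom_share)
    bottom1 = set(sorted(range(n_stores), key=lambda i: year1[i])[:k])
    bottom2 = set(sorted(range(n_stores), key=lambda i: year2[i])[:k])
    return len(bottom1 - bottom2) / k


placebo_share = share_leaving_bottom()
print(f"Left the bottom 5% with no program at all: {placebo_share:.0%}")
```

Under these assumptions most of the bottom stores leave the group on their own. With real data, the same calculation on pre-program years gives the baseline against which the taskforce result should be judged.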
Biased by Design: Retail Channel Analysis.
This was the big analysis piece that was cited by my previous employer, a large multi-channel US retailer, on many occasions, including quarterly earnings calls. Please note that the analysis was done in the early 2000s, and the retailer had a strong catalog business, while eCommerce was still an emergent channel. Here is how they described the conclusions from this analysis: the more channels the customer buys from, the more she buys. Those who buy only from brick and mortar stores spend less than those who buy from stores and catalog, and those who buy from stores and catalog spend less than those who buy from all three channels. The spend by channel looked additive, and the chart looked like this:
The analysis appears to make a strong statement, but is it odd that the chart is perfectly additive? Turns out, it is odd. The problem with this analysis was that the database used to produce it had 1.4 transactions per identified customer, and the sample was not normalized for the total number of transactions the customer had. Why is that a problem? Because any customer who shopped all three channels must have had at least three transactions, and any customer who shopped two channels must have had at least two. In essence, all you are doing is comparing customers who had 3.1 transactions to those who had 2.2 transactions to those who had 1.2 transactions. How about that for pitting winners against losers?
Solution: make analysis groups more representative by suppressing those with fewer than 3 transactions.
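Here is a hedged Python simulation of this effect. All parameters are assumptions: every transaction is worth the same amount, the channel of each transaction is picked at random, and the transaction-count distribution is invented so that it averages about 1.4 per customer. By construction, channel choice has no effect on spend. The last step uses a stricter variant of the fix, comparing only customers with exactly three transactions.

```python
import random
from collections import defaultdict

CHANNELS = ("store", "catalog", "web")
TICKET = 50.0  # assumed: every transaction is worth the same in every channel


def simulate_customers(n_customers=20_000, seed=0):
    """Return (transactions, channels used, total spend) per customer."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n_customers):
        # Assumed skew toward very few transactions per customer.
        n_tx = rng.choices([1, 2, 3, 4, 5], weights=[75, 15, 6, 3, 1])[0]
        channels = {rng.choice(CHANNELS) for _ in range(n_tx)}
        customers.append((n_tx, len(channels), n_tx * TICKET))
    return customers


def avg_spend_by_channels(customers):
    """Average spend keyed by the number of channels a customer used."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, n_channels, spend in customers:
        totals[n_channels][0] += spend
        totals[n_channels][1] += 1
    return {c: s / n for c, (s, n) in sorted(totals.items())}


customers = simulate_customers()
naive = avg_spend_by_channels(customers)
matched = avg_spend_by_channels([c for c in customers if c[0] == 3])
print("all customers:         ", naive)
print("exactly 3 transactions:", matched)
```

The naive comparison shows spend rising with the number of channels even though channels are irrelevant in this setup; once the transaction count is held fixed, the difference disappears.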
Market Basket Analysis.
Market basket analysis is a method of looking into whether a customer is more or less likely to buy item B if that customer bought item A. The underlying theory is that there are different types of customers who have different market baskets, for example A+B and C+D being most common. This is another case of analysis where the sample tends to be biased by design based on how many "best customers" the business has in each customer grouping.
To illustrate, here is a numerical example of how market basket analysis of three categories (A - most popular, B, and C - least popular) can play out in such a way that, in every group, customers are at least as likely to buy another category as the average customer. In the table, Customers 1-4 bought every category, Customers 5-6 bought Categories A and B, while the rest bought just one Category.
The effect is most amplified on the tail end of the Categories, as those who bought C appear to be more likely to buy every other category than average. This happens because share of "best customers", in this case, defined as those who bought every category, is going to be higher in the tail-end categories.
This effect can be seen in statements alleging that buying a certain category of products makes customers more likely to buy in other categories. Statements like "those who bought C were also more likely to buy B, at 80% compared to 67% on average" do imply that there is some sort of causative effect of buying C on buying B, or that purchases in these categories are somehow related, which is not always the case.
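The arithmetic can be checked with a short Python sketch. The customer list below is an assumption: it follows the description above (customers 1-4 bought everything, customers 5-6 bought A and B, the rest bought a single category), but the number of single-category customers is my own choice, picked so that the 80% versus 67% figures come out.

```python
# Hypothetical baskets; the counts of single-category customers are assumed.
baskets = (
    [{"A", "B", "C"}] * 4   # customers 1-4 bought every category
    + [{"A", "B"}] * 2      # customers 5-6 bought A and B
    + [{"A"}] * 3           # single-category customers
    + [{"B"}] * 2
    + [{"C"}] * 1
)


def purchase_rate(category, among=None):
    """Share of customers buying `category`, optionally among buyers of `among`."""
    pool = [b for b in baskets if among is None or among in b]
    return sum(category in b for b in pool) / len(pool)


for bought in "ABC":
    for other in "ABC":
        if other != bought:
            print(f"P({other} | bought {bought}) = {purchase_rate(other, bought):.0%}"
                  f"  vs.  P({other}) = {purchase_rate(other):.0%}")
```

In this made-up data, buyers of every category look at least as likely as average to buy each other category, and buyers of C look the most "cross-sellable", simply because four of the five C buyers are the customers who bought everything.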
Subscribers Who Did X Analysis.
This is by far my favorite example of research design bias. It is very subtle, hard to identify, produces very believable effects, and the effects always go "the right way", i.e. showing that the program "works". It speaks so directly to people's natural instincts on how to measure outcomes that I see this mistake being made over and over. This is a great example of a self-fulfilling prophecy.
Research question: We have run program X for our customers over the past year, and many customers have responded. Here is the list of customer accounts and dates on which they participated in X. Can you tell us if participating in X reduces churn?
Study design: Customers who used program X are the treatment group. The beginning of the year is the start date, and the treatment group is cleaned out to keep only customers active at the beginning of the year. Control group: customers who were active at the beginning of the year, are not in the treatment group, and are matched by main churn drivers. Compare churn rates of the treatment and control groups over the past year.
Result: Assuming uniform distribution of events X throughout the year and no impact of program X on churn rates, the treated group will show roughly half the churn of the control group.
The explanation is survivorship bias. The treatment group is not comparable to the control group in one very important way - we know that every customer in the treated group used program X at some point during the analysis year. In other words, the customer was active at that point in time. There is no restriction on being active at any other point during the analysis year for the control customers. Presuming the usage of program X was uniformly distributed throughout the year, customers in the treated group, on average, had half the time to disconnect compared to the control group.
Solution: Never let your classification and outcome periods overlap. Split the classification year into a manageable number of periods, classify your customers during these periods, create a matched control group, and analyze the outcome after the classification period is closed. For example, select customers who used X in January, create a matching control group, and analyze churn in February-April.
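Both the flawed design and the fix can be simulated in a few lines of Python. This is a sketch under stated assumptions, not a reproduction of any real study: every customer has the same constant disconnect hazard, program X has no effect on churn at all, usage dates are uniform over the year, and the parameter values are invented. Because all customers are identical here, matching on churn drivers is skipped.

```python
import random

HAZARD = 0.2    # assumed constant annual disconnect hazard, same for everyone
P_USES_X = 0.3  # assumed share of customers who would use X during the year


def simulate(n=200_000, seed=0):
    """Return (years until disconnect, date X was used or None) per customer."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        churn_time = rng.expovariate(HAZARD)
        x_time = rng.random() if rng.random() < P_USES_X else None
        # X can only happen while the customer is still active.
        used_x_at = x_time if x_time is not None and x_time < churn_time else None
        customers.append((churn_time, used_x_at))
    return customers


def churn_rate(group, start, end):
    """Share of `group` disconnecting in (start, end]; all must be active at start."""
    return sum(start < t <= end for t, _ in group) / len(group)


customers = simulate()

# Flawed design: classification and outcome periods are both the full year.
treated = [c for c in customers if c[1] is not None]
control = [c for c in customers if c[1] is None]
flawed_treated = churn_rate(treated, 0.0, 1.0)
flawed_control = churn_rate(control, 0.0, 1.0)

# Fixed design: classify on January usage among customers still active at the
# end of January, then measure churn in February-April only.
JAN_END, APR_END = 1 / 12, 4 / 12
active = [c for c in customers if c[0] > JAN_END]
jan_users = [c for c in active if c[1] is not None and c[1] <= JAN_END]
jan_non_users = [c for c in active if c[1] is None or c[1] > JAN_END]
fixed_treated = churn_rate(jan_users, JAN_END, APR_END)
fixed_control = churn_rate(jan_non_users, JAN_END, APR_END)

print(f"flawed design: treated {flawed_treated:.1%} vs. control {flawed_control:.1%}")
print(f"fixed design:  treated {fixed_treated:.1%} vs. control {fixed_control:.1%}")
```

Under these assumptions the flawed design should show the treated group churning at roughly half the rate of the control group, while the fixed design should show the two groups churning at about the same rate, as expected for a program with no effect.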
Comparing Former Customers to Active Customers (disconnector market research).
This is something you see in subscription service market research: to understand the disconnects we survey the disconnectors and then compare the splits to active customers. Something like this:
The conclusions made from this research are along the lines of: 'Tough times' segment is over-represented among the disconnects, thus we are losing these customers. 'Happily retired' segment is under-represented among the disconnects, thus, we are growing our business among these customers.
In reality, this chart is simply a reflection of how transient these groups are. Some segments have a large percent of people changing services every month, contributing an outsize share of both connects and disconnects. The growth of the segment is a function of net difference of ins and outs, and is not related to the turnover rate.
Let's look at this data as ingredients of churn rate calculation:
When we see this report, do we conclude that some segments are growing or shrinking at a certain rate? Generally, no. We accept that certain segments have different churn rates by the nature of the segments, and these different rates do not mean that the structure of the active base is changing. Yet, this is the same type of data as the disconnector research, simply presented differently.
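A minimal Python sketch of this point, using made-up monthly numbers. The figures and the "Everyone else" segment are assumptions for illustration only and do not come from the original research.

```python
# Hypothetical monthly figures per segment: (active base, connects, disconnects).
segments = {
    "Tough times": (20_000, 900, 800),
    "Everyone else": (30_000, 600, 600),
    "Happily retired": (50_000, 400, 600),
}

total_base = sum(base for base, _, _ in segments.values())
total_disconnects = sum(d for _, _, d in segments.values())

report = {}
for name, (base, connects, disconnects) in segments.items():
    report[name] = {
        "share_of_base": base / total_base,
        "share_of_disconnects": disconnects / total_disconnects,
        "churn_rate": disconnects / base,
        "net_change": connects - disconnects,  # what actually drives growth
    }

for name, row in report.items():
    print(f"{name:16s} base {row['share_of_base']:.0%}  "
          f"disconnects {row['share_of_disconnects']:.0%}  "
          f"churn {row['churn_rate']:.1%}  net {row['net_change']:+d}")
```

In this example 'Tough times' is heavily over-represented among disconnects yet grows by 100 customers, while 'Happily retired' is under-represented among disconnects yet shrinks by 200. The disconnect share reflects turnover, and only the net of connects and disconnects says anything about growth.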
Two Types of Bias in Analytics
I have gone through many examples of bias in marketing and general business analytics, and have come to the conclusion that some of them are more straightforward and easier to explain than others. Unfortunately, it is the difficult kind that is more subtle, harder to explain, and more prevalent in the business world. I believe the distinction is worth noting.
There are two main sources of bias in analytics: selection bias and analysis design bias.
- In the case of simple selection bias, the groups of analysis subjects are different due to a confounding variable external to the analysis.
- In the case of the internal design flaw, the design of the study is a proverbial self-fulfilling prophecy, as performance plays into assignment of the groups, often through a proxy variable.
Some may wonder why there is a distinction between two types of bias. In both cases, bias is bad and leads to incorrect interpretation of the data. After all, bias is bias. However, let me highlight the difference for you using the following example.
Let’s say you are evaluating results of a group of runners in a running race using average time. In the case of simple selection bias you may get incomparable results by not taking into account an “external” variable like athlete’s gender, age, years of running experience, etc. It is fairly easy to understand why such variables are important and what you can do to eliminate this kind of bias.
However, if some circular logic in the design of your study results in a comparison of top finishers against bottom finishers, no amount of reweighting on the “external” variables will produce a good analysis, because this type of study is internally flawed. I have observed many comparison group setups that have both types of bias, and fixing one type is not going to fix the whole analysis. This is why internal design bias is more common and harder to recognize.