The bread and butter of marketing analytics is evaluating the results of marketing programs or tests that the company runs. When looking at a campaign, the first thing that comes to mind is the direct outcome of the program (redemptions, sales, clicks, etc.). While it is relatively easy to calculate the direct outcome, it is much harder to judge whether that number is good or bad.
This is where the art of appropriate comparison comes in handy. If you can compare your program subjects to a matched group of non-program subjects, called the benchmark or control group, then you can tell how the program really impacted behavior.
In analytics, bias is a situation where your benchmark group is not representative of your program group. As a result, your assessment of the results of the program can be wrong, leading to bad decisions.
Let me be very honest here: bias can be hard to recognize, even for experienced analysts. Moreover, even if you recognize the presence of bias, it is often hard to explain how and why this bias was introduced. Too often, by the time you come across a “wonderful” piece of biased analysis and recognize the shortcomings, the “good” news has traveled all over the organization, and it is almost impossible to stop this train.
I have no better way to teach about bias than to provide real-life examples of analyses that were biased. My hope is that readers will see how similar analyses in their own businesses may suffer the same problems, and will take steps to eliminate bias.
Selection Bias Due to an External Confounding Variable
Rate Increase Subscriber Analysis.
This is a classic case of Simpson's paradox, where we have a confounding variable that makes groups incomparable.
The goal of the analysis was to figure out how a change in the retail price of a cable service impacted subscriber behavior, the disconnect rate in particular. The disconnect rate of customers who had the rate increase was compared to that of customers who did not. The customers were matched by product holding. The initial analysis showed that the disconnect rate for those who had the rate increase was lower than for those who did not.
Selection bias was caused by the difference in the tenure structure of the two groups. Customers became eligible for the rate increase only after their promotion expired and they were paying retail rates. New customers were on promotional rates, and thus not eligible for the rate increase.
Since short tenure is strongly associated with high disconnect rates, the breakdown of the disconnect rate by tenure showed that those who had the rate increase, in fact, had a higher churn rate in every tenure group.
Solution: reweight the subscriber churn rate by tenure.
After the reweighting, the outcome is far more indicative of what is actually going on with subscriber behavior after the rate increase.
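To make the reweighting concrete, here is a minimal sketch with invented numbers (all counts and rates are hypothetical, not from the original analysis). The rate-increase group skews long-tenure, which masks its higher churn within every tenure bucket until churn is standardized to a common tenure mix:

```python
# Hypothetical churn data by tenure bucket: {bucket: (customers, churn rate)}
groups = {
    "rate_increase": {"<1yr": (100, 0.25), ">=1yr": (900, 0.08)},
    "no_increase":   {"<1yr": (700, 0.20), ">=1yr": (300, 0.05)},
}

def overall_churn(buckets):
    total = sum(n for n, _ in buckets.values())
    churned = sum(n * rate for n, rate in buckets.values())
    return churned / total

# Naive comparison: the rate-increase group looks BETTER overall...
naive_ti = overall_churn(groups["rate_increase"])   # ~9.7%
naive_ct = overall_churn(groups["no_increase"])     # ~15.5%

# ...even though its churn is higher within every tenure bucket.
# Fix: reweight the treated churn rates to the control group's tenure mix.
ctrl_total = sum(n for n, _ in groups["no_increase"].values())
weights = {t: n / ctrl_total for t, (n, _) in groups["no_increase"].items()}
adjusted_ti = sum(weights[t] * rate
                  for t, (_, rate) in groups["rate_increase"].items())  # ~19.9%
```

After standardization, the rate increase is correctly seen to raise churn (19.9% vs 15.5%), reversing the naive conclusion.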
Year Over Year Retail Ticket Analysis.
A large national retailer looked at the dynamics of its transactions by ticket (total $ sold) on a year-over-year basis. The conclusion was that customers had reduced their purchases in the lower-priced ticket buckets. The analysis did not take into account a change in pricing that was driven by the commodity prices of the underlying raw materials.
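A toy numeric illustration of the mechanism (all ticket amounts are invented): the very same ten purchases, repriced 15% higher the next year, drain the lowest ticket bucket with zero change in customer behavior.

```python
# Hypothetical tickets: identical purchase behavior two years running,
# but year-2 prices are 15% higher due to commodity costs.
year1 = [6, 8, 9, 12, 18, 25, 9, 11, 7, 19]
year2 = [round(t * 1.15, 2) for t in year1]

def bucket_counts(tickets, edges=(10, 20)):
    """Count tickets in <$10, $10-20, and >=$20 buckets."""
    lo = sum(t < edges[0] for t in tickets)
    mid = sum(edges[0] <= t < edges[1] for t in tickets)
    hi = sum(t >= edges[1] for t in tickets)
    return lo, mid, hi

# year1 buckets: (5, 4, 1); year2 buckets: (3, 4, 3)
# The under-$10 bucket "loses" purchases even though nothing changed.
```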
Solution: while in some businesses it is possible to adjust the data for price inflation, for a retailer with over 100K SKUs it was not a feasible option. The real solution is to scrap this analysis and look at the data in a different way.
Selection Bias Due to Internal Design Flaw
Regression to the Mean: the Low-Performing Stores Project.
Every year a national retail chain designated the bottom 5% of stores into a special group, which was addressed by a taskforce. Every year, the work of this taskforce was a great success, as the majority of the stores inevitably moved up from the "bottom 5" the next year.
This phenomenon was described by Nobel laureate Daniel Kahneman in his book “Thinking, Fast and Slow”, and it is called regression to the mean. The outcome is explained by the fact that performance at the store level is not fully deterministic: there are random factors that impact performance in the short term, and these factors rarely persist for longer than a year. This is why, the following year, any outlier store is likely to perform closer to the mean (average).
For example, if there was road construction in front of a store and access was restricted for much of the summer, store performance is likely to improve dramatically the next year. This particular analysis also had a bonus shortcoming, as performance was measured on a year-over-year basis. A drop in sales in the selection year lowers the basis for the following year, pushing the store up the rankings and magnifying the effect of regression to the mean.
Solution: use historical data to run a “no program” program analysis. Select your groups in prior years, when no special program was run, and see how much regression to the mean was happening historically. That puts the result in perspective.
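Such a backtest is easy to sketch on simulated data (store counts, performance scale, and noise levels are all invented): stable store quality plus transient yearly shocks, no intervention anywhere, and yet last year's bottom 5% "improves" sharply on its own.

```python
import random

random.seed(0)

N = 10_000
# Each store has a stable underlying quality...
true_perf = [random.gauss(100, 10) for _ in range(N)]
# ...plus transient shocks (road construction, weather, ...) each year.
year1 = [p + random.gauss(0, 10) for p in true_perf]
year2 = [p + random.gauss(0, 10) for p in true_perf]

# "No program" backtest: select last year's bottom 5% with no taskforce at all.
cutoff = sorted(year1)[int(0.05 * N)]
bottom = [i for i in range(N) if year1[i] <= cutoff]
avg_y1 = sum(year1[i] for i in bottom) / len(bottom)
avg_y2 = sum(year2[i] for i in bottom) / len(bottom)
# avg_y2 lands much closer to the mean of 100 despite no intervention,
# which is the baseline any taskforce must beat to claim success.
```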
Biased by Design: Retail Channel Analysis.
This was a big analysis piece that was cited by my previous employer, a large multi-channel US retailer, on many occasions, including quarterly earnings calls. Please note that the analysis was done in the early 2000s, when the retailer had a strong catalog business and eCommerce was still an emergent channel. Here is how they described the conclusions: the more channels the customer buys from, the more she buys. Those who buy only from brick-and-mortar stores spend less than those who buy from stores and catalog, and those who buy from stores and catalog spend less than those who buy from all three channels. The spend by channel looked neatly additive, with each additional channel stacking on top of the previous one.
The analysis appears to make a strong statement, but isn't it odd that the spend looks so perfectly additive? It turns out, it is odd. The problem with this analysis was that the database of transactions had 1.4 transactions per identified customer, and the sample was not normalized for the total number of transactions a customer had. Why is that a problem? Because any customer who shopped all three channels must have had at least three transactions, and those who shopped two channels must have had at least two. In essence, you are comparing customers who had 3.1 transactions to those who had 2.2 transactions to those who had 1.2 transactions. How about that for pitting winners against losers?
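The mechanics are easy to reproduce in simulation (the transaction and channel distributions are invented for illustration): give every customer identical per-transaction spend and purely random channel choice, and the "more channels, more spend" staircase appears on its own.

```python
import random

random.seed(1)

spend_by_nchannels = {1: [], 2: [], 3: []}
for _ in range(50_000):
    # Transactions per customer: mean ~1.4, like the database in the text.
    n_txn = random.choices([1, 2, 3], weights=[0.7, 0.2, 0.1])[0]
    # Channel of each transaction is random: no customer "prefers" channels.
    txns = [random.choices(["store", "catalog", "web"],
                           weights=[0.7, 0.2, 0.1])[0]
            for _ in range(n_txn)]
    # Every transaction is an identical $50: no behavioral difference at all.
    spend_by_nchannels[len(set(txns))].append(50 * n_txn)

avgs = {k: sum(v) / len(v) for k, v in spend_by_nchannels.items() if v}
# avgs[3] > avgs[2] > avgs[1] purely because shopping k distinct channels
# mechanically requires at least k transactions.
```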
Market Basket Analysis.
Market basket analysis is a method of looking into whether a customer is more or less likely to buy item B given that the customer bought item A. The underlying theory is that there are different types of customers with different market baskets, for example with A+B and C+D being the most common. This is another case where the sample tends to be biased by design, depending on how many "best customers" the business has relative to its total customer base.
To illustrate, here is a numerical example of how market basket analysis of three categories (A - most popular, B, and C - least popular) can play out so that, in every group, customers are just as or more likely to buy another category as the average. In the example, Customers 1-4 bought every category, Customers 5-6 bought Categories A and B, while the rest bought just one category.
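The original table is not reproduced here, but the setup can be sketched as one plausible nine-customer dataset (the exact composition of the single-category buyers is my assumption) that matches the 80% vs 67% figures quoted below. Buying C appears to "lift" the odds of buying B, even though the baskets were simply assigned:

```python
# Hypothetical baskets: 1-4 bought everything, 5-6 bought A+B,
# and the rest bought a single category each.
baskets = {
    1: {"A", "B", "C"}, 2: {"A", "B", "C"},
    3: {"A", "B", "C"}, 4: {"A", "B", "C"},
    5: {"A", "B"}, 6: {"A", "B"},
    7: {"A"}, 8: {"A"}, 9: {"C"},
}

total = len(baskets)
# Baseline: share of all customers who bought B.
b_overall = sum("B" in b for b in baskets.values()) / total      # 6/9 ~ 67%
# Conditional: share of C buyers who also bought B.
c_buyers = [b for b in baskets.values() if "C" in b]
b_given_c = sum("B" in b for b in c_buyers) / len(c_buyers)      # 4/5 = 80%
```

Four of the five C buyers are "best customers" who bought everything, which is the entire source of the apparent lift.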
The effect is most amplified at the tail end of the product basket, as those who bought C appear more likely than average to buy every other category. This happens because the share of "best customers", in this case defined as those who bought every category, is higher at the tail ends of the category list.
This effect usually appears in reports as statements alleging that buying a certain category of products makes customers buy more in other categories. A statement like "those who bought C were also more likely to buy B, at 80% compared to 67% on average" implies some causative effect of buying C, or that purchases of the respective categories are somehow related, which is not always the case.
Subscribers Who Did X Analysis.
This is by far my favorite example of research design bias. It is very subtle, hard to identify, and produces very believable effects, and the effects always point "the right way", i.e. showing that the program "works". It also fits so well with practical ways of measuring outcomes that I see it made over and over. This is the epitome of the self-fulfilling prophecy.
Research question: We have run program X for our customers over the past year, and many customers have responded. Here is the list of customer accounts and dates on which they participated in X. Can you tell us if participating in X reduces churn?
Study design: Customers who used program X are marked as the treatment group. The treatment group is cleaned of those who were not active a year ago, so that only active customers remain. To compensate for self-selection, find customers who were active a year ago and not in the treatment group, and create a matching control group. Compare churn rates of the treatment and control groups over the past year.
Result: Assuming a uniform distribution of X events throughout the year and no impact of program X on churn, the treated group will show roughly half the churn of the control group.
The answer is survivorship bias. The treatment group is not representative of the control group in one very important way: we know that every customer in this group used program X at some point during the year under study, which means the customer was still active at that point in time. There is no restriction on being active at any other point in time for the control customers. Presuming usage of program X was uniformly distributed throughout the year, customers in the treated group, on average, had half the time to disconnect compared to the control group.
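A small simulation makes the effect visible (the base churn rate and population size are invented, and program X has zero effect by construction): conditioning membership in the "used X" group on being active at the usage date roughly halves the measured churn.

```python
import random

random.seed(7)

N = 200_000
ANNUAL_CHURN = 0.10  # hypothetical base churn; X has no effect at all

treated = []   # churn outcomes for customers observed using program X
everyone = []  # churn outcomes for all customers active at year start
for _ in range(N):
    # Time of churn within the year, or None if the customer survives it.
    churn_t = random.random() if random.random() < ANNUAL_CHURN else None
    x_date = random.random()  # X usage date, uniform over the year
    churned = churn_t is not None
    everyone.append(churned)
    # A customer can only appear in the "used X" list if still active
    # on the usage date -- this is the hidden survivorship condition.
    if churn_t is None or churn_t > x_date:
        treated.append(churned)

treated_rate = sum(treated) / len(treated)     # ~5%
control_rate = sum(everyone) / len(everyone)   # ~10%
```

With no program effect whatsoever, the treated group shows about half the churn, exactly the "success" the study design guarantees.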
Comparing Former Customers to Active Customers (disconnector market research).
This is something you see in market research: to understand disconnects, we survey the disconnectors and then compare the segment splits to those of active customers.
The conclusions made from this kind of research are usually along the lines of: 'Tough times' segment is over-represented among the disconnects, thus we are losing these customers. 'Happily retired' segment is under-represented among the disconnects, thus, we are growing our business among these customers.
In reality, such a comparison is simply a reflection of different disconnect rates by customer segment. It provides no information about the growth of any of these segments. To determine whether a segment is shrinking or growing, we need to consider both connects into and disconnects from the segment, and more transient segments tend to be over-represented in both connects and disconnects.
When we see a plain churn-rate-by-segment report, do we conclude that some segments are growing or shrinking at a certain rate? Generally, no. We accept that segments have different churn rates by their nature, and that these different rates do not mean the structure of the active base is changing. Yet this is the same data as in the disconnector research, simply presented differently.
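A two-segment sketch shows the trap (segment names echo the text; shares and churn rates are invented): with connects matching disconnects, the base is perfectly stable, yet the high-churn segment looks dramatically over-represented among disconnectors.

```python
# Hypothetical steady-state base: each segment's connects equal its
# disconnects, so the base composition is NOT changing at all.
segments = {
    # name: (share of active base, annual churn rate)
    "Tough times":     (0.20, 0.20),
    "Happily retired": (0.80, 0.05),
}

disconnects = {name: share * churn for name, (share, churn) in segments.items()}
total = sum(disconnects.values())
disconnect_mix = {name: d / total for name, d in disconnects.items()}
# Base mix:       Tough times 20%, Happily retired 80%
# Disconnect mix: Tough times 50%, Happily retired 50%
# "Tough times" looks alarmingly over-represented among disconnects,
# yet the active base composition is completely stable.
```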
Two Types of Bias in Analytics
I have gone through many examples of bias in marketing and general business analytics, and have come to the conclusion that some of them are more straightforward and easier to explain than others. Unfortunately, it is the difficult kind that is more subtle, harder to explain, and more prevalent in the business world. I believe the distinction is worth noting.
There are two main sources of bias in analytics: selection and analysis design.
- In the case of simple selection bias, the groups of analysis subjects are different due to a confounding variable external to the analysis.
- In the case of the internal design flaw, the design of the study is a proverbial self-fulfilling prophecy, as performance plays into assignment of the groups, often through a proxy variable.
Some may wonder why there is a distinction between two types of bias. In both cases, bias is bad and leads to incorrect interpretation of the data. After all, bias is bias. However, let me highlight the difference for you using the following example.
Let’s say you are evaluating the results of a running race using average time. In the case of simple selection bias, you may get incomparable results by not taking into account an “external” variable like the athletes’ gender, age, or years of running experience. It is fairly easy to understand why such variables are important and what you can do to eliminate this kind of bias. However, if some circular logic in the design of your study results in a comparison of top finishers against bottom finishers, no amount of reweighting on “external” variables will produce a good analysis, because the study is internally flawed. I have observed many comparison group setups that have both types of bias, and fixing one type is not going to fix the whole analysis. This is why internal design bias is more common and harder to recognize.