Why My Model with 90% Accuracy Doesn’t Work
Model Performance Metrics for Imbalanced Datasets
When you’re dealing with marketing problems like customer churn prediction (churn is when a customer stops using a company’s product over a certain period of time), the raw dataset is often imbalanced: the percentage of your customers who churn is usually a lot lower than the percentage who don’t. In this example, the binary classification problem might have an 80–20 split, with only 20% of customers discontinuing their engagement with the company and 80% continuing to make purchases.
The problem is, that 20% could be VERY important to the business’s bottom line.
Think about it: a gifting company has 100,000 customers with an average value of $50 per person. That’s $5,000,000 from those customers, but if 20% stop buying from the company, you’re losing $1,000,000! Over the course of a few years, that can add up for even the largest of ecommerce companies or brick-and-mortar shops. Therefore, a major goal for the company’s marketing department is to predict when a customer will churn and implement some sort of intervention to prevent that from happening.
Machine Learning for Churn Prediction
If your company has a good data science/data analytics team, you could be in luck. With a good customer churn prediction model, you can intervene before a customer abandons your business and potentially win back their loyalty. That brings us to the next problem: when we’re working with binary classification models, imbalanced classes tend to make things a bit messy. The core performance metric that many analysts turn to is accuracy, but accuracy only tells you so much…
What is Accuracy?
In another post on Towards Data Science, Koo Ping Shung reviews different performance metrics to evaluate your ML models. In this article, we will review a few of these options and why they might be better suited for imbalanced data than accuracy.
Accuracy = Total Number of Correct Predictions / Total Number of Predictions
Intuitively, accuracy makes sense — you want to know how many predictions you got right. Unfortunately, with imbalanced data, it’s not that simple. Let’s look at an example…
You have your customer churn data from the marketing department for the past year: there were a total of 100,000 customers and 20,000 of them churned. Now, if we were to predict that all 100,000 will NOT churn, that means we have 80,000/100,000 correct (80% accuracy). In that example, you actually failed to identify any of the problem cases. Take it up a notch and make the data even more imbalanced (a 90–10 split): if you again predict that no one will churn, your model comes out with 90% accuracy, despite not identifying a single problem case.
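To make this concrete, here is a minimal sketch of the same scenario (assuming scikit-learn and NumPy, with synthetic labels standing in for the real churn data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 100,000 customers, 20,000 of whom churned (1 = churn, 0 = no churn)
y_true = np.array([1] * 20_000 + [0] * 80_000)

# A "model" that simply predicts that no one will churn
y_pred = np.zeros_like(y_true)

# 0.8 -- 80% accuracy while catching exactly zero churners
print(accuracy_score(y_true, y_pred))
```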
In the end, the model with the 90% accuracy might not work.
So, how do we address this problem?
As discussed in the previously noted article, there are other metrics of model performance. For our purposes, we will review three such measures:
Precision
Precision = True Positives / (True Positives + False Positives)
This might not seem as intuitive as accuracy, but in essence, precision tells you how many of the cases you flagged were actually on target. You get ‘points’ for every churner you correctly identify, but you lose some every time you flag a customer who didn’t actually churn. So if we were to catch all 20,000 of our customers who churned, that’s 20,000 true positives, but if we also lump in another 20,000 who didn’t really churn, our precision goes down, as seen below.
Without false positives: 20,000/(20,000+0)= 100%
With false positives: 20,000/(20,000+20,000)= 50%
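The same arithmetic can be reproduced in code; this is a rough sketch assuming scikit-learn, again with synthetic labels rather than real churn data:

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = np.array([1] * 20_000 + [0] * 80_000)

# Perfect model: flags exactly the 20,000 churners
y_perfect = y_true.copy()
print(precision_score(y_true, y_perfect))  # 1.0

# Sloppy model: flags the 20,000 churners plus 20,000 loyal customers
y_sloppy = y_true.copy()
y_sloppy[20_000:40_000] = 1  # 20,000 false positives
print(precision_score(y_true, y_sloppy))  # 0.5
```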
Precision comes in handy when you have an imbalanced dataset and you want to prevent false positives (or Type 1 errors, as they’re called in stats). For instance, if you’re diagnosing cancer and implementing a risky treatment, you want to make sure that you’re only treating those who are sick. If you use the treatment on a large portion of people who aren’t really ill (false positives), you risk some very negative effects.
Recall
Recall = True Positives / (True Positives + False Negatives)
If precision is used to prevent false positives, recall is the equivalent measure for preventing false negatives (also known as ‘misses’ or Type 2 errors). Looking at the same type of example, if we classify all of our churned customers correctly and don’t miss any of them, we will have perfect recall, as seen below:
Without false negatives: 20,000/(20,000+0)= 100%
Now, if we miss 5,000 of our target cases, those misses become false negatives and our recall drops, as seen below:
With false negatives: 15,000/(15,000+5,000) = 75%
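Here is the recall version of the same sketch (again assuming scikit-learn and synthetic labels):

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1] * 20_000 + [0] * 80_000)

# Model that misses 5,000 of the 20,000 churners (5,000 false negatives)
y_pred = y_true.copy()
y_pred[:5_000] = 0

print(recall_score(y_true, y_pred))  # 0.75
```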
Recall is a good alternative to accuracy if your classes are imbalanced and the most important thing is that you find all the problem cases. In the previous example with customer churn prediction, let’s say the goal of the analysis is to find out who is most likely to churn and send out a noninvasive message to those customers, reminding them to make another purchase.
If the only thing you are risking by having false positives is sending out a few extra notices, recall could be a fine choice for a performance metric. You might not care if you waste 500 postcards by sending them to people who weren’t actually likely to churn; what you care about is that you caught ALL of the true positive cases (you identified every hit).
F1 Score
The F1 score is probably the least intuitive performance metric, but it just might be the right one for your problem.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Basically, the F1 score combines precision and recall, so both your false positives and your false negatives count against your model’s performance.
If you want to better understand the raw numbers that go into your calculation, Wikipedia does a good job of breaking down the math.
If you identified 15,000 of the 20,000 target cases, but you also falsely identified 5,000 (and missed 5,000), your F1 score would look like this, using the equivalent form F1 = True Positives / (True Positives + 0.5 × (False Positives + False Negatives)):
F1 = 15,000 / (15,000 + 0.5 × (5,000 + 5,000)) = 75%
In other words, the F1 score is the harmonic mean of the precision and recall measures: here, precision and recall are both 75%, so the F1 score is also 75%.
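Putting all three values together on the same hypothetical predictions (a sketch assuming scikit-learn and synthetic labels):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1] * 20_000 + [0] * 80_000)

# Model that catches 15,000 churners, misses 5,000, and wrongly flags 5,000 loyal customers
y_pred = y_true.copy()
y_pred[:5_000] = 0          # 5,000 false negatives
y_pred[20_000:25_000] = 1   # 5,000 false positives

print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```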
Next Steps
Now that we’ve gone through an example of how you might analyze an imbalanced dataset, it is clear that accuracy might not always be the best measure. The bottom line is, you can have a model with 90% accuracy but 0% recall or precision. If you are running a logistic regression in python, you might want to consider a few options for dealing with these types of problems, such as weighting the rare class and scoring the model with precision, recall, and F1 instead of accuracy; a rough sketch follows.
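This sketch assumes scikit-learn, with make_classification standing in for a real customer table; the class_weight='balanced' option is one common way (not the only one) to make a logistic regression pay more attention to the rare churn class:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic 80-20 "churn" data as a stand-in for real customer features
X, y = make_classification(n_samples=100_000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight='balanced' penalizes mistakes on the rare churn class more heavily
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Per-class precision, recall, and F1 tell you far more here than accuracy alone
print(classification_report(y_test, model.predict(X_test)))
```

classification_report prints precision, recall, and F1 for each class, which says far more about an 80–20 problem than a single accuracy number.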
Imbalanced classification problems can be difficult to deal with, even when using R and python for your machine learning algorithms. However, the goal of this article was to keep you from drawing misleading conclusions based on an incomplete analysis of the data. The key takeaways are as follows:
Accuracy can look impressive on imbalanced data even when the model misses every case you care about.
Precision punishes false positives: it tells you how many of the cases you flagged were real.
Recall punishes false negatives: it tells you how many of the real cases you caught.
The F1 score is the harmonic mean of precision and recall, so it balances the two.
I hope you enjoyed this article and if you have any questions, please feel free to reach out!