Make Data Science easier: Evaluating Analyses — Introduction
Every day we can see machine learning models being applied in the most diverse areas of knowledge, and having an accurate model matters. That is why it is so important to choose an appropriate and meaningful evaluation process for your analysis.

In the vast majority of research papers, the standard measures of performance reported are accuracy, precision, recall, F1 score, and the ROC curve with its area under the curve (AUC).
So, without further ado, let’s get started.
Accuracy
Firstly, let us look at the following confusion matrix. What is the accuracy for the model?

Very easily, you will notice that the accuracy for this model is very, very high: 99.9%! Wow! It looks like you have hit the jackpot and found the holy grail. Broadly speaking:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
What are TP, TN, FP, and FN?
True Positives (TP) — These are the correctly predicted positive values, meaning the actual class is yes and the predicted class is also yes. E.g. the actual class indicates that this passenger survived and the predicted class tells you the same thing.
True Negatives (TN) — These are the correctly predicted negative values, meaning the actual class is no and the predicted class is also no. E.g. the actual class says this passenger did not survive and the predicted class tells you the same thing.
False positives and false negatives occur when the actual class contradicts the predicted class.
False Positives (FP) — When the actual class is no and the predicted class is yes. E.g. the actual class says this passenger did not survive, but the predicted class tells you that this passenger will survive.
False Negatives (FN) — When the actual class is yes but the predicted class is no. E.g. the actual class indicates that this passenger survived, but the predicted class tells you that the passenger will die.
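To make the arithmetic concrete, here is a minimal Python sketch. The confusion-matrix image is not reproduced here, so the counts below are hypothetical, chosen only so that the derived metrics land close to the figures quoted in this post:

```python
# Hypothetical confusion-matrix counts (the original matrix image is not
# reproduced here); chosen so the derived metrics come out close to the
# figures quoted in this post.
TP, TN, FP, FN = 89, 75_835, 24, 52

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy: {accuracy:.3f}")  # ~0.999

# The catch: the classes are heavily imbalanced, so a "model" that always
# predicts the majority (negative) class scores almost as well.
always_negative = (TN + FP) / (TP + TN + FP + FN)
print(f"Always-predict-negative accuracy: {always_negative:.3f}")  # ~0.998
```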
Not so fast: when one class vastly outnumbers the other, a model can reach very high accuracy simply by predicting the majority class every time. Ok, so accuracy is not the be-all and end-all metric to use when selecting the best model… now what?
Precision and Recall
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all passengers that were labeled as survived, how many actually survived? High precision relates to a low false positive rate. We got a precision of 0.788, which is pretty good.
Precision = TP / (TP + FP)
Let me put the confusion matrix and its parts here.

Immediately, you can see that Precision talks about how precise your model is: out of those predicted positive, how many of them are actually positive.
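Reusing the same hypothetical counts as in the sketch above, precision is simply the true positives divided by everything the model flagged as positive:

```python
TP, FP = 89, 24  # hypothetical counts from the sketch above

precision = TP / (TP + FP)
print(f"Precision: {precision:.3f}")  # ~0.788
```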
Recall is the ratio of correctly predicted positive observations to all observations in the actual class (yes). The question recall answers is: of all the passengers that truly survived, how many did we label as survived? We got a recall of 0.631, which is good for this model as it is above 0.5. Almost the same logic for the expression:
Recall = TP / (TP + FN)
There you go! So Recall calculates how many of the actual positives our model captures by labeling them as positive (true positives). Applying the same understanding, Recall is the metric to use for selecting our best model when there is a high cost associated with a false negative.
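And recall, again with the same hypothetical counts, divides the true positives by all of the actual positives (true positives plus false negatives):

```python
TP, FN = 89, 52  # hypothetical counts from the sketch above

recall = TP / (TP + FN)
print(f"Recall: {recall:.3f}")  # ~0.631
```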
F1 Score
F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar costs; if their costs are very different, it is better to look at both Precision and Recall. In our case, the F1 score is 0.701.
F1 Score = 2*(Recall * Precision) / (Recall + Precision)
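To close the loop on the arithmetic, here is a small sketch with the same hypothetical counts as before; in practice you would usually let scikit-learn's precision_score, recall_score, and f1_score (in sklearn.metrics) compute these directly from the raw labels:

```python
TP, FP, FN = 89, 24, 52  # hypothetical counts from the sketch above

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1:.3f}")  # ~0.701
```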
F1 Score might be a better measure to use if we need to seek a balance between Precision and Recall and there is an uneven class distribution (a large number of actual negatives).
Attention
In a lot of cases, you're looking at a binary prediction problem: stars versus galaxies, healthy versus unhealthy, lives versus dies, whatever. So we can think of that in terms of positives and negatives. And when a system predicting those outcomes makes a mistake, there is often a very different cost associated with a false positive than with a false negative.
So, whenever you build a model, this article should help you figure out what these metrics mean and how well your model has performed.
I hope you found this blog useful. Please leave a comment or send me an email if you think I missed any important details, or if you have any other questions or feedback about this topic.
That’s it! I hope your reading has been helpful 😉
Thanks 🙏🏽

