Making data science easier: Noise vs. Outliers

An outlier is a data point that differs markedly from the rest of the data; related terms include abnormalities, discordant observations, and deviants. Noise, by contrast, can be defined as mislabeled examples (class noise) or errors in the values of attributes (attribute noise). An outlier is the broader concept: it covers such errors as well as discordant data that may arise from natural variation within the population or process.

3 min read · Apr 28, 2021

As such, outliers often contain interesting and useful information about the underlying system.

These properties have been exploited in fraud control, intrusion detection systems, web robot detection, weather forecasting, law enforcement, and medical diagnosis, generally with supervised outlier detection methods.

The consequences of not screening the data for outliers can be catastrophic.

The negative effects of outliers can be summarized as:

Increase in error variance and reduction in statistical power;
Decrease in normality for the cases where outliers are non-randomly distributed;
Model bias by corrupting the true relationship between exposure and outcome.
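The first and third effects can be illustrated with a minimal sketch. The data below are hypothetical (ten points lying exactly on y = 2x, then the same points with one corrupted outcome), and the helper names `ols_fit` and `residual_variance` are made up for this illustration; they are not from any particular library.

```python
from statistics import fmean

def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residual_variance(xs, ys):
    """Unbiased estimate of the error variance around the fitted line."""
    slope, intercept = ols_fit(xs, ys)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return rss / (len(xs) - 2)

# Hypothetical clean data lying exactly on y = 2x.
xs = list(range(1, 11))
ys_clean = [2 * x for x in xs]
# Same data with the last outcome corrupted (20 replaced by 100).
ys_outlier = ys_clean[:-1] + [100]

slope_clean, _ = ols_fit(xs, ys_clean)
slope_outlier, _ = ols_fit(xs, ys_outlier)

print(f"slope: {slope_clean:.2f} -> {slope_outlier:.2f}")
print(f"error variance: {residual_variance(xs, ys_clean):.2f} "
      f"-> {residual_variance(xs, ys_outlier):.2f}")
```

In this toy setting a single corrupted point out of ten roughly triples the estimated slope and turns a perfect fit into one with a large error variance. Real data would show a less dramatic but similar pattern.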

A good understanding of the data itself is required before choosing a model to detect outliers. Several factors influence the choice of an outlier identification method, including the type of data, its size and distribution, the availability of ground truth about the data, and the need for interpretability in a model. For example, regression-based models are better suited for finding outliers in linearly correlated data, while clustering methods are advisable when the data is not linearly distributed along correlation planes. This chapter describes some of the most common methods for outlier detection, but many others exist.
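As a rough sketch of the regression-based idea for linearly correlated data, one option is to fit a line and flag the points whose residuals are unusually large. The version below scores residuals with a modified z-score based on the median absolute deviation (MAD), using the commonly cited constant 0.6745 and cutoff 3.5. The scoring rule, the threshold, the data, and the function names are all assumptions made for illustration, not a method prescribed here.

```python
from statistics import fmean, median

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_residual_outliers(xs, ys, threshold=3.5):
    """Return indices whose regression residual has a modified z-score above threshold."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    if mad == 0:
        # Degenerate case: at least half of the residuals are identical.
        return []
    return [
        i for i, r in enumerate(residuals)
        if abs(0.6745 * (r - center) / mad) > threshold
    ]

# Hypothetical data: roughly y = 2x, with one discordant value at index 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 40.0, 12.1, 13.9, 16.2, 17.8, 20.1]

flagged = flag_residual_outliers(xs, ys)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because the outlier itself distorts the fitted line and inflates the ordinary residual spread. This approach assumes a linear relationship; for data without one, a clustering or distance-based method would be the more natural choice.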

Evaluating the effectiveness of an outlier detection algorithm and comparing different approaches is complex. Moreover, the ground truth about outliers is often unavailable, as in unsupervised scenarios, which hampers any rigorous quantitative assessment of the algorithms.
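When labeled outliers are available, a detector's output can be scored against them. The sketch below computes precision, recall, and F1 from two sets of indices. The index values are invented for illustration, and `precision_recall_f1` is a hypothetical helper, not a library function.

```python
def precision_recall_f1(true_outliers, flagged):
    """Score flagged indices against known outlier indices."""
    true_set, flagged_set = set(true_outliers), set(flagged)
    true_positives = len(true_set & flagged_set)
    precision = true_positives / len(flagged_set) if flagged_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical ground truth and detector output (row indices).
true_outliers = [4, 17, 23]
flagged = [4, 17, 30, 41]

precision, recall, f1 = precision_recall_f1(true_outliers, flagged)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Without ground truth none of these quantities can be computed, which is why unsupervised settings are so much harder to assess.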