Identifying Anomalies through Unsupervised Machine Learning Approaches


In the realm of Data Science, identifying and handling outliers is crucial for maintaining the integrity and accuracy of analysis and modeling. Two popular methods for outlier detection are Local Outlier Factor (LOF) and Gaussian Mixture Models (GMM), both of which can be easily implemented using Scikit-Learn.

1. Local Outlier Factor (LOF) for Outlier Detection

Local Outlier Factor (LOF) is an algorithm that compares the local density of a point to the densities of its neighbors. Points in regions significantly less dense than their neighbors are flagged as outliers [1][4].

Here's a step-by-step guide on how to use LOF in Python:

  1. Fit the model on the data with `fit_predict`.
  2. `fit_predict` returns `-1` for outliers and `1` for inliers.
  3. The `negative_outlier_factor_` attribute gives the outlier score (lower means more likely outlier).

```python
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

X = np.array([[1, 2], [2, 3], [1, 1], [10, 10], [10, 12], [11, 11]])

lof = LocalOutlierFactor(n_neighbors=2)
outlier_pred = lof.fit_predict(X)            # -1 for outliers, 1 for inliers
outlier_scores = lof.negative_outlier_factor_  # lower = more anomalous

print("Outlier Predictions:", outlier_pred)
print("Outlier Scores:", outlier_scores)
```

2. Gaussian Mixture Models (GMM) for Outlier Detection

Gaussian Mixture Models (GMM) work by fitting multiple Gaussian distributions to the data. Points with very low probability under all these Gaussian components are outliers [4].

Here's a step-by-step guide on how to use GMM in Python:

  1. Fit a model on the dataset.
  2. Calculate the log probability of each point.
  3. Set a threshold (e.g., based on quantiles or manual inspection) to flag points with low likelihood as outliers.

```python
from sklearn.mixture import GaussianMixture
import numpy as np

X = np.array([[1, 2], [2, 3], [1, 1], [10, 10], [10, 12], [11, 11]])

gmm = GaussianMixture(n_components=2, covariance_type='full')
gmm.fit(X)

# Log-likelihood of each sample under the fitted mixture
log_prob = gmm.score_samples(X)

# Flag the lowest 10% of likelihoods as outliers
threshold = np.percentile(log_prob, 10)
outliers = log_prob < threshold

print("Outliers detected (Gaussian Mixture):", outliers)
```

Summary

| Method | How it works | Key Scikit-Learn class | Output |
|------------------------|-------------------------------------------------------------------------------|------------------------|----------------------------------------|
| Local Outlier Factor | Compares local point density vs neighbors' density; outliers in sparse areas | `LocalOutlierFactor` | `-1` outlier, `1` inlier |
| Gaussian Mixture Model | Fits multiple Gaussians; points with low probability are outliers | `GaussianMixture` | Boolean mask based on chosen threshold |

Both methods can detect outliers but from different perspectives: density vs probabilistic distribution.
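As a sketch of this contrast, the snippet below runs both detectors on the same synthetic data (the data, `n_neighbors=10`, and the bottom-5% GMM threshold are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# A dense cluster of 50 points plus one far-away point
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])

# Density perspective: LOF flags points whose local density is
# much lower than that of their neighbors (-1 = outlier)
lof_pred = LocalOutlierFactor(n_neighbors=10).fit_predict(X)

# Probabilistic perspective: GMM flags points with low log-likelihood
gmm = GaussianMixture(n_components=1, random_state=0).fit(X)
log_prob = gmm.score_samples(X)
gmm_pred = log_prob < np.percentile(log_prob, 5)  # bottom 5% flagged

print("LOF flags the far point:", lof_pred[-1] == -1)
print("GMM flags the far point:", bool(gmm_pred[-1]))
```

Both perspectives agree on the obvious far-away point, but they can disagree near cluster boundaries, which is why the choice of method (and threshold) matters.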

References

  • LOF explanation and example: [1][4]
  • GMM anomaly detection explanation: [4]
  • Scikit-learn documentation for LOF and GMM utilities

A few additional details are worth noting. Outliers can distort analysis and affect modeling, particularly in Linear Regression. The LOF implementation lives in the `sklearn.neighbors` module as `LocalOutlierFactor` and is built on the k-nearest neighbors algorithm; its `contamination` hyperparameter sets the expected proportion of outliers in the dataset, which determines the threshold for flagging points. GMM scores each observation according to the density of the region in which it lies: points in high-density areas are less likely to be outliers, while points in low-density areas are more likely. More generally, a Gaussian Mixture Model divides the data into n groups by fitting n Gaussian distributions and assigning each observation to the component under which it has the highest probability. Other outlier detection algorithms exist as well, such as Isolation Forest, Z-Score, and IQR. The car_crashes dataset from the seaborn package can be used to demonstrate the LOF algorithm on real data.
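For completeness, here is a minimal sketch of the other approaches mentioned: Isolation Forest from scikit-learn, plus simple Z-Score and IQR rules with NumPy. The toy data, the `contamination=0.2` setting, and the 2-standard-deviation and 1.5×IQR cutoffs are conventional illustrative choices, not values from the article:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

x = np.array([9.0, 10.0, 10.5, 9.5, 10.2, 25.0])  # 25.0 is the obvious outlier

# Isolation Forest: isolates anomalies with random splits; returns -1 for outliers
iso_pred = IsolationForest(contamination=0.2, random_state=0).fit_predict(
    x.reshape(-1, 1)
)

# Z-Score: flag points more than 2 standard deviations from the mean
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 2

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print("Isolation Forest:", iso_pred)
print("Z-Score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

The Z-Score and IQR rules are univariate and threshold-based, while Isolation Forest, like LOF and GMM, handles multivariate data.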

[1] Aurélien Géron, "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow"
[4] Michael Walker, "Data Cleaning and Exploration with Machine Learning"
