Skip to content

A Julia implementation of Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data, Kriegel et al., 2009.

License

Notifications You must be signed in to change notification settings

sharash/subspace-outlier-detection

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Subspace outlier detection

This repo contains a Jupyter notebook with a Julia implementation of the outlier detection algorithm from Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data, Kriegel et al., 2009.

See the notebook here for code and further details.

Idea behind SOD

For any given point, find a set of reference points, for example the nearest neighbors. Using a subset of the features, create a hyperplane along which the variance of the reference points is high. Then see how far from this hyperplane the point of interest is.

To put it another way, find the stable features for the reference points, then measure how far away the point of interest is from the reference points in these stable features. In the following image, SOD finds the green hyperplane (line), and measures the distance between the red point and the line to determine whether a point is an outlier.

At this point, I suggest a simple modification: what if we use a PCA to create a variance-maximizing hyperplane fit to the reference points, and measure the distance to the point of interest? In other words, instead of creating a hyperplane using a subset of the features, we create the hyperplane using a linear combination of all features (using PCA).

To explain it using images, SOD creates a hyperplane (line) along the axis that the reference point features vary the most, as in the figure above, and then measures the distance between that line and the point of interest. PCA outlier detection finds the hyperplane (line) that maximizes the variance of the reference point features:

Unlike SOD, the hyperplane does not need to be parallel to the feature axes. In theory, this makes PCA outlier detection more flexible and thus useful for a greater range of datasets.

Results on real data

To assess the quality of the methods, I use three datasets where some points have been labeled outliers. Here are the ROC curves for SOD:

and the corresponding curves for PCAOD:

The PCA-based method seems to have a slight edge on the arrythmia and ionosphere datasets.

See the Jupyter notebook for code and further details.

About

A Julia implementation of Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data, Kriegel et al., 2009.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published