This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Analyzer

Abe Stanway edited this page Jun 22, 2013 · 7 revisions

The Analyzer service is responsible for analyzing collected data. It uses a simple divide-and-conquer strategy: it first checks Redis to get the total number of metrics stored, then starts a number of processes equal to settings.ANALYZER_PROCESSES and assigns each process a share of the metrics. Analyzing a metric is CPU-intensive, because each timeseries must be decoded from MessagePack and then run through the algorithms. It is therefore advisable to set settings.ANALYZER_PROCESSES to roughly the number of cores you have, leaving a few free for the Horizon service and for Redis.
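
As a rough illustration of the divide-and-conquer step, the sketch below splits a list of metric names into one contiguous slice per process. The function name assign_metrics and the slicing scheme are assumptions made for this example, not necessarily how the Analyzer itself partitions its work.

```python
from math import ceil


def assign_metrics(metric_names, n_processes):
    """Split metric_names into n_processes contiguous slices.

    Hypothetical helper: later slices may be shorter, or empty when
    there are fewer metrics than processes.
    """
    if n_processes < 1:
        raise ValueError("n_processes must be at least 1")
    # Number of metrics each process handles, rounded up so nothing is dropped.
    per_process = ceil(len(metric_names) / n_processes) if metric_names else 0
    return [
        metric_names[i * per_process:(i + 1) * per_process]
        for i in range(n_processes)
    ]


if __name__ == "__main__":
    metrics = ["metric.%d" % i for i in range(10)]
    # n_processes stands in for settings.ANALYZER_PROCESSES.
    for worker, chunk in enumerate(assign_metrics(metrics, 4)):
        print(worker, chunk)
```

Each slice would then be handed to its own worker process, which is why the process count should stay below the number of available cores.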

Algorithms

Skyline was designed to handle a very large number of metrics, for which picking models by hand would prove infeasible. As such, Skyline relies upon the consensus of an ensemble of a few different algorithms. If the majority of algorithms agree that any given metric is anomalous, the metric will be classified as anomalous, and will be surfaced to the webapp.
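
The consensus idea can be sketched in a few lines. This is a minimal illustration, assuming a timeseries is a list of (timestamp, value) pairs and that each algorithm is a callable returning a boolean; the function name is_anomalous and the toy algorithms are made up for the example.

```python
def is_anomalous(timeseries, algorithms, consensus):
    """Return True when at least `consensus` algorithms vote True.

    `consensus` plays the role of settings.CONSENSUS in this sketch.
    """
    votes = [bool(algorithm(timeseries)) for algorithm in algorithms]
    return votes.count(True) >= consensus


if __name__ == "__main__":
    # Three toy voters, purely for illustration.
    def last_is_max(ts):
        return ts[-1][1] == max(v for _, v in ts)

    def last_is_positive(ts):
        return ts[-1][1] > 0

    def never(ts):
        return False

    series = [(1, 1.0), (2, 2.0), (3, 9.0)]
    # With three algorithms, a majority means at least two votes.
    print(is_anomalous(series, [last_is_max, last_is_positive, never], 2))
```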

Currently, Skyline does not come with very many algorithmic batteries included. This is by design. We have included a few algorithms to get you started, but you are not obligated to use them and are encouraged to extend them to accommodate your particular data. Indeed, you are ultimately responsible for using the proper statistical tools the correct way with respect to your data.

Of course, we welcome all pull requests containing additional algorithms to make this tool as robust as possible. To this end, the algorithms were designed to be very easy to extend and modify. All algorithms are located in algorithms.py. To add an algorithm to the ensemble, simply define your algorithm and add its name to settings.ALGORITHMS. Make sure your algorithm returns either True or False, and be sure to update the settings.CONSENSUS setting appropriately.
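
To make the extension steps concrete, here is a hedged sketch of what adding an algorithm might look like. The algorithm last_value_above_mean, the REGISTRY dictionary, and the resolve_algorithms helper are all hypothetical; ALGORITHMS and CONSENSUS stand in for the corresponding settings, and the way names are looked up here is an assumption rather than Skyline's own mechanism.

```python
def last_value_above_mean(timeseries):
    """Hypothetical algorithm: vote True if the latest value exceeds the mean.

    Assumes timeseries is a list of (timestamp, value) pairs.
    """
    values = [value for _, value in timeseries]
    if not values:
        return False
    return values[-1] > sum(values) / len(values)


# Stand-ins for settings.ALGORITHMS and settings.CONSENSUS.
ALGORITHMS = ["last_value_above_mean"]
CONSENSUS = 1

# Explicit name-to-function mapping, assumed for this sketch.
REGISTRY = {"last_value_above_mean": last_value_above_mean}


def resolve_algorithms(names, registry):
    """Turn the configured algorithm names into callables."""
    return [registry[name] for name in names]


if __name__ == "__main__":
    series = [(1, 1.0), (2, 1.0), (3, 10.0)]
    for algorithm in resolve_algorithms(ALGORITHMS, REGISTRY):
        print(algorithm.__name__, algorithm(series))
```

Whenever the list of algorithms grows or shrinks, the consensus threshold should be revisited so that it still represents the majority you intend.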

Algorithm philosophy

The basic algorithm is based on 3-sigma, derived from Shewhart's statistical process control. However, you are not limited to 3-sigma-based algorithms if you don't want to use them. As long as it returns a boolean, you can add any sort of algorithm you like to run on a timeseries and cast a vote.
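
For readers unfamiliar with 3-sigma, the following is a minimal sketch of such a check, not one of the algorithms shipped in algorithms.py. It assumes a timeseries is a list of (timestamp, value) pairs and flags the latest value when it lies more than three standard deviations from the mean of the preceding values; the function name three_sigma is made up for this example.

```python
from statistics import mean, pstdev


def three_sigma(timeseries, sigmas=3.0):
    """Vote True if the latest value is more than `sigmas` standard
    deviations away from the mean of the earlier values."""
    values = [value for _, value in timeseries]
    if len(values) < 2:
        # Not enough history to estimate a mean and deviation.
        return False
    history, latest = values[:-1], values[-1]
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of the history
    return abs(latest - mu) > sigmas * sigma


if __name__ == "__main__":
    steady = [(i, 10.0 + (i % 2)) for i in range(20)]
    print(three_sigma(steady + [(20, 10.0)]))
    print(three_sigma(steady + [(20, 50.0)]))
```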

Explanation of Exceptions:

TooShort: The timeseries was too short, as defined in settings.MIN_TOLERABLE_LENGTH
Incomplete: The timeseries was less than settings.FULL_DURATION seconds long
Stale: The timeseries has not received a new datapoint in more than settings.STALE_PERIOD seconds
Boring: The timeseries has been the same value for the past settings.MAX_TOLERABLE_BOREDOM seconds
Other: There's probably an error in your code, if you've been making changes.
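
The first four exceptions can be read as sanity checks that run before a timeseries is analyzed. The sketch below follows the descriptions above literally; it is an illustration under assumptions, not Skyline's actual code. The function check_timeseries, the order of the checks, and the constant values are all hypothetical stand-ins for the corresponding settings, and a timeseries is assumed to be a list of (timestamp, value) pairs.

```python
import time


class TooShort(Exception):
    pass


class Incomplete(Exception):
    pass


class Stale(Exception):
    pass


class Boring(Exception):
    pass


# Illustrative stand-ins for the values in settings.
MIN_TOLERABLE_LENGTH = 1
FULL_DURATION = 86400
STALE_PERIOD = 500
MAX_TOLERABLE_BOREDOM = 100


def check_timeseries(timeseries, now=None):
    """Raise one of the exceptions above if the timeseries should be skipped."""
    now = time.time() if now is None else now
    if not timeseries or len(timeseries) < MIN_TOLERABLE_LENGTH:
        raise TooShort()
    first_ts, last_ts = timeseries[0][0], timeseries[-1][0]
    if last_ts - first_ts < FULL_DURATION:
        raise Incomplete()
    if now - last_ts > STALE_PERIOD:
        raise Stale()
    # Distinct values seen in the most recent MAX_TOLERABLE_BOREDOM seconds.
    recent = {v for ts, v in timeseries if ts >= last_ts - MAX_TOLERABLE_BOREDOM}
    if len(recent) == 1:
        raise Boring()
```

A caller would typically catch these exceptions per metric, count them by type, and move on to the next metric rather than abort the whole run.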