Question about the value of "n_remove" in riskloc #11

ZhihuangLi1221 opened this issue Nov 10, 2022 · 3 comments

@ZhihuangLi1221

Hi,

I hope you are well.

When I used riskloc on my dataset, I noticed that it can precisely find the root cause. However, my goal is to find the anomalies that occur more frequently, so I would consider the rare root causes it finds to be outliers. I then tried increasing the value of "n_remove", but still did not get the result I expected.

Also, when I decreased "n_remove" to 1, the "cutoff" value shifted a lot and the output returned null. When I did the same thing on another dataset, the result was not affected. I compared the distributions of the measurements of the two datasets: the first looks more like a normal distribution, while the second looks more like a long-tailed distribution.

Here are my questions:

  1. Is adjusting n_remove the right way to achieve what I want? If so, is there a more reliable approach than setting the constant arbitrarily?
  2. Does the distribution of the measurements affect the performance of the algorithm?

I am looking forward to your reply.

@chaochaobar

I have similar questions. Is there a more reasonable way to set the 'n_remove' parameter? Looking forward to the author's reply.

@shaido987
Owner

Hello @ZhihuangLi1221 and @chaochaobar,

Thanks for your interest.

n_remove is used to remove some outliers in the deviation scores to get a reasonable cutoff point. This cutoff point is then used to partition the data into an abnormal and a normal part. This way of finding the cutoff point assumes that the normal data is relatively evenly distributed around 0, with a few possible outliers that n_remove handles.
Illustrative example (blue dots are normal data while the three colors are concurrent anomalies with different root causes):

[Figure: hard_example_edited_multi_line]

In the figure above, the dashed green line represents the minimum deviation score with 5 outliers removed (i.e., using n_remove=5) while the dashed red line is the maximum deviation score with outliers removed (also 5).
Since the minimum absolute value is smaller than the maximum absolute value, we determine that the anomalies are on the right-hand side of the plot (i.e., the real values of the anomalies are below the predicted values). The cutoff point is then the negation of the minimum value (green solid line in the figure). You can refer to Algorithm 1 in the paper.
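
To make this concrete, here is a minimal sketch of that cutoff logic (this is not the exact code in the repository; the function and variable names are only for illustration):

    import numpy as np

    def compute_cutoff(deviation_scores, n_remove=5):
        """Sketch of the cutoff selection described above (see Algorithm 1 in the paper)."""
        scores = np.sort(np.asarray(deviation_scores))
        # Minimum and maximum deviation scores after treating the n_remove most
        # extreme values on each side as outliers.
        trimmed_min = scores[n_remove]
        trimmed_max = scores[-(n_remove + 1)]
        if abs(trimmed_min) < abs(trimmed_max):
            # Anomalies lie on the positive (right-hand) side; the cutoff is the
            # negation of the trimmed minimum (the solid green line in the figure).
            return -trimmed_min
        # Otherwise the anomalies lie on the negative side; mirror the logic.
        return -trimmed_max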

For your questions:

  1. Adjusting n_remove removes outliers when computing the cutoff point, but it does not remove any data points from the partition deemed abnormal (all data points to the right of the solid green line in the figure above are still considered when localizing the root cause). So you can't use n_remove to filter out rare anomalies/data points. Instead, you can try increasing pep_threshold (the proportional ep_threshold); only anomalies with a higher explanatory power are then returned, which should remove smaller anomalies.

    Alternatively, if you only want to consider larger aggregated elements (and not very fine-grained/specific anomalies), you could adjust the code to only search n layers deep by setting a maximum value here (see the first sketch after this list):

    def search_anomaly(df, attributes, pruned_elements, risk_threshold=0.5, adj_ep_threshold=0.0, debug=True):
        for layer in range(1, len(attributes) + 1):

    Or, if you have some knowledge of what points should be removed, you can remove these as a preprocessing step before running riskloc.

  2. Yes, the distribution of the normal data's deviation scores will affect the result. The cutoff point is computed under the assumption that the deviation scores are spread relatively evenly around 0. You could plot a figure similar to the one above to check whether the obtained cutoff point is reasonable and, if not, how it needs to be adjusted.

    I created an example where the normal data has a long tail in the positive direction:
    [Figure: long_tail]

    As you can see, the cutoff point is too conservative: a lot of the normal data points will be included when computing the potential root cause, which may affect the accuracy. You could look at the clustering methods used in Autoroot and Squeeze (KDE clustering / using a histogram) and try to adapt them to return a single cutoff point, to see if they work better for your data. Or, as a first step, you can set a fixed cutoff value (e.g., 1 in the long-tailed data figure above); see the second sketch after this list.
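
Regarding the layer limit mentioned in point 1, a minimal sketch of the kind of change I mean, using a hypothetical max_layers argument, could look like this:

    def search_anomaly(df, attributes, pruned_elements, risk_threshold=0.5, adj_ep_threshold=0.0,
                       debug=True, max_layers=2):
        # Only search the first max_layers layers instead of all len(attributes) layers,
        # so only larger aggregated elements are considered as root cause candidates.
        for layer in range(1, min(max_layers, len(attributes)) + 1):
            ...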
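
For the long-tailed case in point 2, here is a rough sketch of the two workarounds (a fixed cutoff, or a single cutoff derived from a KDE in the spirit of the Autoroot/Squeeze clustering). It assumes the anomalies are on the positive side and is not the exact method from either paper:

    import numpy as np
    from scipy.stats import gaussian_kde

    def cutoff_for_long_tailed(deviation_scores, fixed_cutoff=None, grid_size=500):
        """Alternative cutoff selection for long-tailed deviation scores."""
        if fixed_cutoff is not None:
            # First step: simply use a manually chosen value (e.g., 1 in the figure above).
            return fixed_cutoff
        scores = np.asarray(deviation_scores)
        # Place the cutoff at the lowest-density point of a KDE between 0 and the
        # largest score, i.e., in the valley between the normal cluster around 0
        # and the anomalous cluster further to the right.
        kde = gaussian_kde(scores)
        grid = np.linspace(0.0, scores.max(), grid_size)
        return float(grid[np.argmin(kde(grid))])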

@ZhihuangLi1221
Author

Hi @shaido987 ,

I really appreciate your reply; it helps a lot.
