HotSpot method: on the interpretability of using PS to measure root-cause confidence #5

Open
mambasmile opened this issue Jun 27, 2022 · 5 comments

Comments

@mambasmile

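Hello, the PS method uses the RE (ripple effect) to measure the confidence that something is the cause. How should I understand the principle behind the PS method?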

image

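Many people's guess is something like the following:
If an attribute value is the cause, then the change in the attribute value and the changes in its samples follow the ripple effect;
if the change in the attribute value and the changes in its samples follow the ripple effect, then the attribute value is the cause.

Is this understanding correct?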

@shaido987
Owner

Hello,

Although I know a bit of Chinese, I'm in no way fluent so I will answer in English.

Following the ripple effect property, we know that:

  • The prediction errors of a root cause will propagate to its descendant leaf elements.
  • The impact is proportional: the change in a leaf element's real/actual value depends on how large that element is (measured by its forecast value).

So we know that the above are properties of the true root cause. The problem now is to find which set of elements is the root cause. To do this we need to search through sets of elements and measure their likelihood of being the root cause (HotSpot uses the PS score to do this).

What HotSpot does is:

  1. Assume a set of elements S is the root cause.
  2. Change the real/actual values of all descendant leaf elements, i.e., the forecasting error in S is proportionally applied to the leaf elements. If S has a 20% forecast error then all leaf elements also have a 20% error.
  3. If the adjusted values (a in the formula) are close to the actual values of the leaf elements (v), then S has a high potential score (PS). In the case where a == v, the distance between the two d(v,a) will be 0 and the PS score will be 1.
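
The three steps above can be sketched in a few lines. This is a minimal illustration, not code from the HotSpot repository: it assumes the PS formula takes the commonly cited form PS = max(1 - d(v, a) / d(v, f), 0) with Euclidean distance d, where f is the vector of forecast values. The function names `ripple_adjust` and `potential_score` and the sample numbers are made up for the example.

```python
import numpy as np

def ripple_adjust(f, v, in_s):
    """Step 2: spread the forecast error of S proportionally over its leaves.

    f, v  : forecast and actual values of all leaf elements
    in_s  : boolean mask, True for leaf elements that descend from S
    """
    f = np.asarray(f, dtype=float)
    v = np.asarray(v, dtype=float)
    in_s = np.asarray(in_s, dtype=bool)
    f_s, v_s = f[in_s].sum(), v[in_s].sum()
    a = f.copy()  # leaves outside S keep their forecast value
    if f_s > 0:
        # Same relative error for every leaf in S:
        # a = f - (f_S - v_S) * f / f_S
        a[in_s] = f[in_s] * (v_s / f_s)
    return a

def potential_score(f, v, in_s):
    """Step 3: PS is high when the adjusted values a are close to the actual values v."""
    f = np.asarray(f, dtype=float)
    v = np.asarray(v, dtype=float)
    a = ripple_adjust(f, v, in_s)
    d_va = np.linalg.norm(v - a)
    d_vf = np.linalg.norm(v - f)
    if d_vf == 0:
        return 0.0  # assumption: nothing deviates from the forecast, so nothing to explain
    return max(1.0 - d_va / d_vf, 0.0)

# Hypothetical data: the first two leaves belong to S and both drop by 20%.
f = [100, 50, 200, 80]
v = [80, 40, 200, 80]
ps = potential_score(f, v, [True, True, False, False])
print(ps)
```

Here every leaf of S deviates by exactly the same 20%, so a equals v and the score reaches its maximum of 1.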

The key idea is that a root cause in multi-dimensional data like this will affect all the descendant elements evenly. This is what the PS score (and GPS in Squeeze, NPS in AutoRoot, and partly the risk score in RiskLoc) try to measure.

I hope the above helped a bit in understanding. If you have an interest in this work, consider starring the GitHub repository.

@mambasmile
Author

thanks

But in reality there are situations where S decreases by 20% while e does not necessarily decrease by 20%, so the ripple effect has certain limitations. Do you know which scenarios the ripple effect is suitable for?

@shaido987
Owner

I assume e is a leaf element of S? Since S decreases by 20%, this 20% needs to come from somewhere, and that somewhere is the leaf elements of S (since together they make up S). For S to have a forecast error of 20%, the leaf elements (as an aggregate, i.e., together) must also have a forecast error of 20% due to the nature of the multi-dimensional problem.
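
A small numeric check of the aggregate argument, using made-up numbers: the relative forecast error of S equals the forecast-weighted average of its leaves' relative errors, so the leaves must add up to 20% together even when no single leaf is at exactly 20%.

```python
# Hypothetical forecast (f) and actual (v) values for the two leaf elements of S.
f = [100.0, 300.0]
v = [60.0, 260.0]

# Relative forecast error of each leaf: 40% and about 13.3%.
leaf_err = [(fi - vi) / fi for fi, vi in zip(f, v)]

# Relative forecast error of S itself, computed from the summed values.
agg_err = (sum(f) - sum(v)) / sum(f)

# Forecast-weighted average of the leaf errors; matches agg_err.
weighted = sum(fi * ei for fi, ei in zip(f, leaf_err)) / sum(f)

print(leaf_err, agg_err, weighted)
```

The aggregate error is 20% although the individual leaves are at 40% and roughly 13%, which is the uneven case raised in the question.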

If S is a root cause of an anomaly, then the forecasting error will be evenly distributed among its leaf elements, following the ripple effect. If the forecasting error is more randomly distributed among the leaf elements, then it's less likely that S is the root cause. This is also the assumption behind the ripple effect, so it's suitable in situations where you believe that prediction errors in the root cause elements will be evenly distributed (in practice this seems to work quite well).

In practice, I found that the most difficult step is to get accurate forecasting values for all leaf elements. Since these are usually quite fine-grained, they don't actually have much data and any forecasts are often inaccurate. This can skew the results.

@mambasmile
Author

Thanks for your answer.
If an attribute value is the root cause and drops by 20%, the samples corresponding to that attribute value should all change evenly by 20%; that is the ripple effect theory.
I personally think that this theory does not generalize particularly well.

For example, in the following figure province=Beijing is the root cause. The KPI corresponding to province=Beijing has dropped by 40%. The first sample (Province=Beijing, ISP=Mobile) has dropped by 60%, while the second sample (Province=Beijing, ISP=Unicom) does not change, so the ripple effect does not hold here.

image

@shaido987
Owner

shaido987 commented Jun 28, 2022

Actually, I would say that it does work; however, the true root cause is not Beijing. In the example, (beijing, unicom) is normal, so it does not make much sense to say that the whole (beijing, *) is abnormal. Instead, the root cause that best explains the anomaly should be (beijing, mobile). Note that both (shanghai, mobile) and (guangdong, mobile) are normal, so the root cause won't be (*, mobile).

So, even if (beijing, *) has dropped 40% and should by itself be considered abnormal, the location of the problem is actually the Mobile ISP in Beijing.
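
To make this concrete, here is a rough sketch that scores the three candidates from the example. The leaf values are invented to match the percentages mentioned above (the figure's real numbers are not reproduced here), and the score assumes the form PS = max(1 - d(v, a) / d(v, f), 0) with Euclidean distance; `potential_score` is an illustrative helper, not the repository's API.

```python
import numpy as np

def potential_score(f, v, in_s):
    """Assumed PS form: max(1 - d(v, a) / d(v, f), 0), with Euclidean distance."""
    f = np.asarray(f, dtype=float)
    v = np.asarray(v, dtype=float)
    in_s = np.asarray(in_s, dtype=bool)
    a = f.copy()
    f_s = f[in_s].sum()
    if f_s > 0:
        # Ripple effect: every leaf under S gets the same relative error as S.
        a[in_s] = f[in_s] * (v[in_s].sum() / f_s)
    d_vf = np.linalg.norm(v - f)
    if d_vf == 0:
        return 0.0  # assumption: no deviation from forecast, nothing to explain
    return max(1.0 - np.linalg.norm(v - a) / d_vf, 0.0)

# Hypothetical leaf elements as (province, isp) with forecast f and actual v.
leaves = [("beijing", "mobile"), ("beijing", "unicom"),
          ("shanghai", "mobile"), ("shanghai", "unicom"),
          ("guangdong", "mobile"), ("guangdong", "unicom")]
f = [200, 100, 150, 90, 120, 60]
v = [80, 100, 150, 90, 120, 60]  # only (beijing, mobile) drops, by 60%

# Candidate root causes, each expressed as a mask over the leaf elements.
candidates = {
    "(beijing, *)": [p == "beijing" for p, i in leaves],
    "(*, mobile)": [i == "mobile" for p, i in leaves],
    "(beijing, mobile)": [(p, i) == ("beijing", "mobile") for p, i in leaves],
}

scores = {name: potential_score(f, v, mask) for name, mask in candidates.items()}
for name, ps in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ps:.3f}")
```

With these made-up numbers, (beijing, mobile) scores 1, (beijing, *) about 0.53, and (*, mobile) about 0.30, so the ranking agrees with the argument above: (beijing, *) as a whole drops 40%, but its error is not spread evenly over its leaves.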
