Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mov quantile #761

Merged
merged 1 commit into from
Aug 1, 2024
Merged

Mov quantile #761

merged 1 commit into from
Aug 1, 2024

Conversation

novertia
Copy link

Fix #751

This PR adds the moving quantile algorithm based on the articles mentioned in the issue. It follows the quantile approximation strategy when the rolling window do not have enough data. This PR adds both the code logic under mov_quantile.h as well as the required unit test under test_mov_quantile.cpp

@github-actions github-actions bot added the work in progress Pull request is still in progress and changing label Jul 17, 2024
@gavv gavv added the contribution A pull-request by someone else except maintainers label Jul 17, 2024
@novertia novertia marked this pull request as ready for review July 19, 2024 04:49
@github-actions github-actions bot added ready for review Pull request can be reviewed and removed work in progress Pull request is still in progress and changing labels Jul 19, 2024
@novertia
Copy link
Author

@gavv
for calculating the index of heap to return (root index) of partition heap I am currently taking a percentile of size_t checking if its 0 -100 and then calculating the index (percentile * window_length)/100. I am not sure if this approach is okay or should I instead take percentile between 0-1 and calculate and type cast accordingly.

@gavv
Copy link
Member

gavv commented Jul 19, 2024

Thanks for PR! Me or @adrianopol will do review in the upcoming days.

Regarding 0..100 vs 0..1: I guess we should either name it MovPercentile and use 0..100 or name it MovQuantile and use 0..1. In both cases, we need to use float or double for this number.

@novertia
Copy link
Author

If we use float or double for taking the percentile or quantile for index calculation at one point we will have to type cast window length from size_t to double or percentile to size_t since the index and window_length is in size_t. Should this approach be okay considering precision loss cases for the aforementioned type casting.

Can we take k instead of a percentile or quantile which can be the kth smallest element user want from the window length like in the document. Although not sure if this approach is in sync with the requirement.

@gavv
Copy link
Member

gavv commented Jul 20, 2024

Just to be sure I understand the question correctly. You're asking what pair of parameters to use, among these three alternatives, right?

  1. window_size (L) + K in range [0; L)
  2. window_size (L) + percentile (P) in range [0; 100) (then K = L * P / 100)
  3. window_size (L) + quantile (Q) in range [0; 1) (then K = L * Q)

I think in code that will use MovPercentile/MovQuantile, the original parameters that we have would be window_size and percentile/quantile. E.g. we may need to compute 95th percentile of inter-arrival jitter for last 300 packets. We'll likely have numbers 300 and 95 (or 0.95) in config. We could compute K from those numbers, though it would be convenient if MovPercentile/MovQuantile would compute it for us.

As for rounding errors, I believe that if Q is 32-bit float in range [0; 1), then we can compute K precisely for any L in range [0; 2^23), because floats in range [0; 1) can be mapped one-to-one to integers in range [0; 2^23). If that's true, it looks more than enough and we don't have to bother about rounding errors.

I suggest to stick with MovQuantile approach with parameters L + Q, just to be sure we use floats or doubles in range [0; 1), since that range has highest precision (24 bits for 32-bit float, 48 bits for 64-bit float). If you think floats are not enough, let's use doubles.

@novertia
Copy link
Author

Sure I understood will make the changes

novertia added a commit to novertia/roc-toolkit that referenced this pull request Jul 20, 2024
@adrianopol adrianopol added the review in progress Pull request is being reviewed label Jul 25, 2024
@adrianopol adrianopol self-requested a review July 25, 2024 20:06
@gavv gavv removed the ready for review Pull request can be reviewed label Jul 25, 2024
@gavv gavv added this to the next milestone Aug 1, 2024
@gavv gavv merged commit 6f18841 into roc-streaming:develop Aug 1, 2024
1 check passed
@gavv gavv requested review from gavv and removed request for adrianopol August 1, 2024 11:06
@gavv
Copy link
Member

gavv commented Aug 1, 2024

Awesome, thanks a lot for PR and sorry for delay!

All suggestions I have are cosmetic, so I just committed them by myself: 4e6b413 (copyright year, make methods private, use const, and a few renames).

I also added a simple stress test with random window size and values: 0bc19da, seems to work fine.

One small question about this snippet:

        if (win_filled_) {
            ...
            min_heap_index_ = win_len_ - 1;
            max_heap_index_ = 0;

Are there situations when min_heap_index_ and max_heap_index_ are not already set to these values in this branch?

@gavv gavv removed the review in progress Pull request is being reviewed label Aug 1, 2024
@github-actions github-actions bot added the work in progress Pull request is still in progress and changing label Aug 1, 2024
@gavv gavv removed the work in progress Pull request is still in progress and changing label Aug 1, 2024
@novertia
Copy link
Author

novertia commented Aug 1, 2024

At the very start when there are not enough values to fill the window at that point the values will be different. Both of them will start from heap root and max_heap_index_ will decrement and min_heap_index_ will increment. Once the window is filled then they will take on the mentioned values. This method will still give us quantile even if we don't have all the values in the window.

Thanks for merging the PR

@gavv
Copy link
Member

gavv commented Aug 1, 2024

Yeah, I mean, by the time when we enter if (win_filled_) { branch, min_heap_index_ and max_heap_index_ always already have values win_len_ - 1 and 0, right? And these assignments are redundant.

Or am I missing something, and and it's possible that win_filled_ is true, but those two variables have different values?

@novertia
Copy link
Author

novertia commented Aug 1, 2024

That is correct and the assignments are redundant. Sorry i missed that.

@novertia
Copy link
Author

novertia commented Aug 1, 2024

when win_filled_ that values max_heap_index_ will be 0 and min_heap_index_ is win_len - 1 there can't be another case.

@gavv
Copy link
Member

gavv commented Aug 1, 2024

Got it, thanks. No worries, I just wanted to ensure I'm not missing something important.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution A pull-request by someone else except maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants