`count_by` and `endpoints` grouping modes #1312

mathemancer · 2022-04-26T12:07:11Z

Fixes #384

This adds two grouping modes: count_by for Number-typed columns, and endpoints for any set of columns that can be ordered.

Technical details

`endpoints`

This mode is meant to be internal for now, although it can be called from the API if desired. It is internal because some validation of the parameters (specifically, the ordering of the endpoint tuples described below) isn't performant without multiple queries per transaction, which our execute_query function doesn't currently support. To avoid duplicating work, I'm deferring that validation until we have that functionality.

The way this mode works is that you give an ascending-order array of arrays where each inner array represents a tuple of values from the columns chosen for the grouping. The values do not need to exist in the columns, but they do need to be of appropriate type. That is, the bounds can be chosen between values, as long as there is space between those values for that type. Order for tuples is defined in the same way that PostgreSQL orders rows by a set of columns. This means that if you give the columns in a different order, the order of the tuples in the given array needs to change correspondingly.

The value defining the mode is "endpoints", and the extra parameter "bound_tuples" is required (and is an array of arrays where each inner array has the same number of elements as the number of columns).
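To illustrate the tuple ordering described above, here is a minimal pure-Python sketch. It is not the code in this PR, which builds the ranges in SQL. The helper names `validate_bound_tuples` and `group_rows_by_endpoints` are hypothetical. The sketch relies on Python's tuple comparison, which compares element by element in the same way PostgreSQL orders rows by a set of columns (NULL handling is ignored here). It also assumes the bounds are strictly ascending and that rows outside the outermost bounds are left ungrouped.

```python
from bisect import bisect_right


def validate_bound_tuples(bound_tuples, num_columns):
    """Check tuple width and (assumed strict) ascending order of the bounds."""
    bounds = [tuple(b) for b in bound_tuples]
    if any(len(b) != num_columns for b in bounds):
        raise ValueError("each bound tuple needs one value per grouping column")
    # Tuples compare element by element, first column first.
    if any(lower >= upper for lower, upper in zip(bounds, bounds[1:])):
        raise ValueError("bound tuples must be in ascending order")
    return bounds


def group_rows_by_endpoints(rows, bound_tuples, num_columns):
    """Map each half-open range [lower, upper) to the rows that fall inside it."""
    bounds = validate_bound_tuples(bound_tuples, num_columns)
    groups = {}
    for row in rows:
        key = tuple(row)
        # Index of the last bound that is <= key, or -1 if key is below all bounds.
        idx = bisect_right(bounds, key) - 1
        if 0 <= idx < len(bounds) - 1:
            groups.setdefault((bounds[idx], bounds[idx + 1]), []).append(key)
    return groups


rows = [(1, "a"), (1, "c"), (2, "a"), (3, "z")]
bound_tuples = [[1, "b"], [2, "b"], [3, "b"]]
print(group_rows_by_endpoints(rows, bound_tuples, num_columns=2))
```

Note that the bounds `(1, "b")` and `(2, "b")` need not appear in the data. Swapping the column order would require reordering the values inside every bound tuple, and possibly the tuples themselves, so that the array is still ascending.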

`count_by`

This mode lets a user specify global_min, global_max, and count_by parameters, each of which should be a number (ideally with count_by < global_max - global_min). This will return groups satisfying those parameters in the following way:

Given [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], if the user chooses global_min = 2, global_max = 17, and count_by = 3, then the resulting groups will be: [2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 13], [14, 15]. The parameters do not need to be integers. Internally, this sets up bounds by choosing the global_min as the lowest tuple, then adding count_by iteratively until the global_max is reached, then using the endpoints mode internally. Note that for continuous data, this means the intervals will be greater than or equal to their lower bound, but strictly less than their upper bound.
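The bound generation described above can be sketched in plain Python as follows. This is an illustration only, not the implementation in this PR. The function names `count_by_bounds` and `group_values` are hypothetical. When count_by does not evenly divide global_max - global_min, this sketch lets the last bound overshoot global_max, which is an assumption: the description above does not say how that case is handled.

```python
from bisect import bisect_right


def count_by_bounds(global_min, global_max, count_by):
    """Return bounds global_min, global_min + count_by, ... up to global_max."""
    if count_by <= 0:
        raise ValueError("count_by must be positive")
    bounds = []
    step = 0
    while True:
        # Multiply rather than accumulate to limit floating-point drift.
        bound = global_min + step * count_by
        bounds.append(bound)
        if bound >= global_max:
            return bounds
        step += 1


def group_values(values, bounds):
    """Place each value in the half-open interval [lower, upper) containing it."""
    groups = [[] for _ in bounds[:-1]]
    for value in values:
        idx = bisect_right(bounds, value) - 1
        if 0 <= idx < len(groups):
            groups[idx].append(value)
    return groups


bounds = count_by_bounds(global_min=2, global_max=17, count_by=3)
print(bounds)                           # [2, 5, 8, 11, 14, 17]
print(group_values(range(16), bounds))  # [[2, 3, 4], [5, 6, 7], [8, 9, 10], [11, 12, 13], [14, 15]]
```

The values 0 and 1 fall below global_min and so end up in no group, matching the example above.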

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the master branch of the repository.
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

mathemancer added 11 commits April 12, 2022 16:53

first pass custom grouping function

adcbb86

Merge branch 'master' into custom_grouping

373e8cc

fix bugs in custom range select generating function

99ca9e2

wire endpoints function up to API

0571bdc

add tests, including one before solution

eca7752

Move GroupBy validation to __init__

e468b41

Merge branch 'master' into custom_grouping

7fa5548

add count_by mode to GroupBy

14a3516

test count_by grouping mode, fix bugs found

442bcc7

remove debugging print statement

a25d676

Merge branch 'master' into custom_grouping

cf77378

mathemancer requested review from a team, silentninja and dmos62 and removed request for a team and silentninja April 26, 2022 12:07

kgodey assigned dmos62 Apr 26, 2022

kgodey added the pr-status: review label Apr 26, 2022

mathemancer mentioned this pull request Apr 26, 2022

First letter grouping #1314

Merged

7 tasks

Merge branch 'master' into custom_grouping

65a6653

mathemancer mentioned this pull request May 3, 2022

Preprocessed distinct grouping mode #1342

Merged

7 tasks

dmos62 approved these changes May 4, 2022

Merge branch 'master' into custom_grouping

7d6fc78

dmos62 enabled auto-merge May 4, 2022 12:19

dmos62 merged commit 103c099 into master May 4, 2022

dmos62 deleted the custom_grouping branch May 4, 2022 12:39
