Determine the method for generating batches efficiently #619

Closed
renzokuken opened this issue Oct 21, 2022 · 1 comment · Fixed by #653
Labels
priority: p1 High priority. Fix may be included in the next release.

Comments

@renzokuken
Collaborator

Sub-issue of issue #598

The fastest solution is a MIN/MAX split of a numeric primary key, with the batch boundaries computed outside the database and then used to generate config files for efficient partitioned execution.
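A minimal sketch of the MIN/MAX split, assuming the MIN and MAX of the key have already been fetched from the database. The function names `min_max_ranges` and `range_filters` and the filter syntax are hypothetical and only illustrate the idea; they are not taken from the tool.

```python
def min_max_ranges(min_id: int, max_id: int, num_batches: int) -> list[tuple[int, int]]:
    """Split the inclusive key range [min_id, max_id] into contiguous sub-ranges."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must be >= min_id")
    total = max_id - min_id + 1
    # Never create more batches than there are distinct key values.
    num_batches = min(num_batches, total)
    base, extra = divmod(total, num_batches)
    ranges = []
    start = min_id
    for i in range(num_batches):
        # The first `extra` batches absorb the remainder, one value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_filters(column: str, ranges: list[tuple[int, int]]) -> list[str]:
    """Render each inclusive range as a SQL-style filter expression."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]
```

Note that this splits the key range evenly, not the rows: if the key values are sparse or skewed, the resulting batches can hold very different row counts.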

Longer term, we may need more complex approaches, such as RANK_ORDER and MOD division, to generate batches across composite analytical keys.
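A rough sketch of both ideas for composite keys, written in Python for illustration only; in practice the equivalent logic would presumably run as SQL inside the database. `mod_bucket` and `rank_batches` are hypothetical names, and treating RANK_ORDER as "sort the keys and cut the ordering into equal slices" is an assumption about what is meant.

```python
import hashlib


def mod_bucket(key_values: tuple, num_batches: int) -> int:
    """Assign a row to a batch by hashing its (possibly composite) key, then taking MOD."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    # Join key parts with a separator that is unlikely to occur in the data.
    joined = "\x1f".join(str(v) for v in key_values)
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_batches


def rank_batches(keys: list[tuple], num_batches: int) -> list[list[tuple]]:
    """Sort composite keys and cut the ordering into near-equal slices by rank."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    ordered = sorted(keys)
    batches: list[list[tuple]] = [[] for _ in range(num_batches)]
    for rank, key in enumerate(ordered):
        batches[rank * num_batches // len(ordered)].append(key)
    return batches
```

The MOD approach needs no ordering pass but gives non-contiguous batches; the rank approach gives contiguous, evenly sized batches at the cost of ranking every row.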

Exit criteria for this ticket should be the generation of multiple YAML files that contain partitioned filters.
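To make the exit criteria concrete, here is a minimal sketch that writes one YAML file per partition filter. The file naming (`0000.yaml`, `0001.yaml`, ...) and the single top-level `filter` key are assumptions for illustration; the real config files would carry the full validation config, whose layout is not described in this issue.

```python
from pathlib import Path


def write_partition_configs(filters: list[str], out_dir: str) -> list[Path]:
    """Write one small YAML file per partition filter and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, expr in enumerate(filters):
        path = target / f"{index:04d}.yaml"
        # Hypothetical layout: a single `filter` key holding a double-quoted
        # YAML string, so backslashes and double quotes must be escaped.
        escaped = expr.replace("\\", "\\\\").replace('"', '\\"')
        path.write_text(f'filter: "{escaped}"\n', encoding="utf-8")
        written.append(path)
    return written
```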

@renzokuken renzokuken self-assigned this Oct 21, 2022
@nehanene15 nehanene15 added the priority: p1 High priority. Fix may be included in the next release. label Nov 22, 2022
@renzokuken renzokuken assigned renzokuken and unassigned renzokuken Dec 12, 2022
@mohdt786
Contributor

TODO:

  1. How should the partitions be saved?
    i. Ask the user for a config folder instead of a config file and group the YAMLs per table (multiple YAML config files per table)
    ii. Save all the partitions for a table in the same config file (one YAML config file per table)
    iii. Save all the partitions for all the tables in the same config file (a single YAML file for all tables)
  2. Combine the partition filter with any existing filter specified by the user
  3. Add the filters after the config-manager object is created, to avoid hitting the database with multiple queries
  4. User-specified filters are currently not supported for custom-query validation?
    i. Add filter support
    ii. Add the partition filter in the query itself
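Items 2 and 4.ii could look roughly like the sketch below. `combine_filters` and `wrap_custom_query` are hypothetical helpers, not existing functions in the tool, and the subquery alias syntax is an assumption that may need adjusting per database.

```python
from typing import Optional


def combine_filters(user_filter: Optional[str], partition_filter: str) -> str:
    """AND the partition filter onto an optional user-supplied filter."""
    if user_filter is None or not user_filter.strip():
        return partition_filter
    # Parenthesize both sides so an OR in either filter keeps its meaning.
    return f"({user_filter.strip()}) AND ({partition_filter})"


def wrap_custom_query(query: str, partition_filter: str) -> str:
    """Apply the partition filter to a custom query by wrapping it as a subquery."""
    inner = query.strip().rstrip(";").strip()
    # Assumption: the database accepts an aliased subquery in this form.
    return f"SELECT * FROM ({inner}) AS partitioned WHERE {partition_filter}"
```

Both helpers only build strings, so they fit item 3 as well: they can run after the config-manager object exists without issuing any extra queries.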

Later:

  1. Add multiple partitioning strategies: primary key, MOD + primary key, hash + primary key + MOD, hash + MOD

mohdt786 added a commit that referenced this issue Jan 25, 2023
…653) (Issue #619,#662)

Features:
1. New command 'generate-table-partitions' added to generate partitions for `row` type validation
2. --partition-num: Number of partitions/config files to create.
    Range=[1,1000]
    If the specified value is greater than count(*), the value is coalesced to count(*)
3. --config-dir: Directory Path to store YAML Config Files. Either local or GCS path can be supplied
5. Added required arguments group to distinguish from optional arguments
6. Added mutually exclusive arguments group for --hash and --concat
7. --partition-key: Column on which the partitions would be generated. Column type must be integer. Defaults to Primary key

Tests:
Added unit tests for partition_builder.py, which also provide coverage for partition_row_builder.py

README.md & examples.md:
1. Added description for usage of 'generate-table-partitions' command
2. Added examples
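
The `--partition-num` rule from the commit message above (range [1, 1000], coalesced to count(*) when larger) can be sketched as follows. `effective_partition_count` is a hypothetical name; raising an error for out-of-range values and for an empty table is an assumption, since the commit message does not say how those cases are handled.

```python
MIN_PARTITIONS = 1
MAX_PARTITIONS = 1000


def effective_partition_count(requested: int, row_count: int) -> int:
    """Validate the requested partition count and cap it at the table's row count."""
    if not MIN_PARTITIONS <= requested <= MAX_PARTITIONS:
        # Assumption: out-of-range values are rejected rather than clamped.
        raise ValueError(
            f"partition count must be in [{MIN_PARTITIONS}, {MAX_PARTITIONS}]"
        )
    if row_count < 1:
        # Assumption: an empty table cannot be partitioned.
        raise ValueError("row_count must be at least 1")
    # More partitions than rows would leave some partitions empty.
    return min(requested, row_count)
```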
4 participants