feat: Issue619 determine the method for generating batches efficiently #653
New:
Arguments - file: cli_tools.py
1. New command 'get-partitions' added to generate partitions for the following validation types:
   1. row
   2. custom-query (TODO)
2. --partition-type: Specify the type of partition logic:
   1. primary_key
   2. primary_key_mod (TODO)
   3. hash_mod (TODO)
3. --partition-num: Number of partitions/config files to create. Range=[1,1000]. If the specified value is greater than count(*), the value is coalesced to count(*).
4. --config-dir: Directory path to store YAML config files
5. Added required arguments group to distinguish from optional arguments
6. Added mutually exclusive arguments group for --hash and --concat
Constants - file: consts.py
1. Added DEFAULT_PARTITION_TYPE
2. Added PARTITION_TYPES
Partition methods - file: __main__.py
1. _get_arg_partition_type(args): Extract and return partition logic
2. partition_and_store_config_files(args): Build/split config managers and store YAML files
3. partition_configs(args, config_managers): Create a list of lists of config managers using partition filters
4. _get_primary_key_partition_filters(args, config_manager): Get filters for primary_key partition logic
5. _add_partition_filters_to_config(config_managers, partition_filters): Split ConfigManager objects and add partition filters
6. get_dataframe(config_manager): Build source and target pandas dataframes from input ConfigManager object
7. build_primary_key_agg_config_managers_from_args(args): Build a list of ConfigManager objects for finding count, min and max of primary_key
Partition methods - file: data_validation.py
1. get_pandas_df(): Build source and target queries, return source and target dataframes
Type hints and docstrings:
1. Added type hints to the above methods
2. Added docstrings with description, args and return type for the above methods
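The argument layout described above can be sketched with the standard library's argparse. This is only an illustration under stated assumptions, not the code from cli_tools.py: the helper names `partition_num_type` and `build_parser` are hypothetical, the `row` sub-command level is omitted for brevity, and only a few of the listed flags are shown.

```python
import argparse

MAX_PARTITIONS = 1000  # upper bound taken from the Range=[1,1000] note above


def partition_num_type(value: str) -> int:
    """Parse --partition-num and reject values outside [1, 1000] (hypothetical helper)."""
    number = int(value)
    if not 1 <= number <= MAX_PARTITIONS:
        raise argparse.ArgumentTypeError(
            f"--partition-num must be in [1, {MAX_PARTITIONS}], got {number}"
        )
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a reduced parser with one 'get-partitions' sub-command (illustrative only)."""
    parser = argparse.ArgumentParser(prog="data-validation")
    subparsers = parser.add_subparsers(dest="command")
    partitions = subparsers.add_parser("get-partitions")

    # A named group so required flags are listed apart from optional ones in --help.
    required = partitions.add_argument_group("required arguments")
    required.add_argument("--primary-keys", required=True)
    required.add_argument("--config-dir", required=True)
    required.add_argument("--partition-num", required=True, type=partition_num_type)

    # Only primary_key is offered here; the other types are marked TODO in the description.
    partitions.add_argument(
        "--partition-type", default="primary_key", choices=["primary_key"]
    )

    # --hash and --concat cannot be combined.
    exclusive = partitions.add_mutually_exclusive_group(required=True)
    exclusive.add_argument("--hash")
    exclusive.add_argument("--concat")
    return parser


args = build_parser().parse_args(
    [
        "get-partitions",
        "--primary-keys", "station_id,region_id",
        "--config-dir", "partitions_dir",
        "--partition-num", "20",
        "--hash", "*",
    ]
)
```

Passing both --hash and --concat, or a --partition-num outside the range, makes argparse exit with a usage error instead of returning a namespace.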
…ing-batches-efficiently
New:
Partition methods - file: __main__.py
1. _add_partition_filters_and_store(config_managers, partition_filters, config_dir, args): Split ConfigManager objects, add partition filters and store in target dir
2. _get_arg_config_dir(args): Return the YAML config folder path from args as a string
Partition methods - file: cli_tools.py
1. get_target_table_folder_path(config_dir, target_folder_name): Create and return target directory
Partition methods - file: state_manager.py
1. create_partition_config_directory(config_dir: str, target_folder_name: str)
Type hints and docstrings:
1. Added type hints to the above methods
2. Added docstrings with description, args and return type for the above methods
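As a rough sketch of what a directory helper with the signature listed above might do, the snippet below joins the two path parts and creates the folder if it is missing. The body is an assumption (the commit message gives only the signature), it handles local paths only, and the folder name in the demo is illustrative.

```python
import os
import tempfile


def create_partition_config_directory(config_dir: str, target_folder_name: str) -> str:
    """Create <config_dir>/<target_folder_name> if needed and return its path.

    Assumed behaviour for illustration; the real state_manager.py method may differ.
    """
    target_path = os.path.join(config_dir, target_folder_name)
    # exist_ok=True makes repeated runs against the same directory safe.
    os.makedirs(target_path, exist_ok=True)
    return target_path


# Demo inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    created = create_partition_config_directory(tmp_dir, "citibike_stations")
    created_exists = os.path.isdir(created)
    created_name = os.path.basename(created)
```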
New:
Arguments - file: cli_tools.py
1. New command 'get-partitions' added to generate partitions for the following validation types:
   1. row
   2. custom-query (TODO)
2. --partition-type: Specify the type of partition logic:
   1. primary_key
   2. primary_key_mod (TODO)
   3. hash_mod (TODO)
3. --partition-num: Number of partitions/config files to create. Range=[1,1000]. If the specified value is greater than count(*), the value is coalesced to count(*).
4. --config-dir: Directory path to store YAML config files
5. Added required arguments group to distinguish from optional arguments
6. Added mutually exclusive arguments group for --hash and --concat
Example:
    data-validation get-partitions row \
      -sc BQ_CONN \
      -tc BQ_CONN \
      -tbls bigquery-public-data.new_york_citibike.citibike_stations,mohammedturky-sql.dvt.citibike_stations \
      --primary-keys station_id,region_id \
      --hash * \
      --filter-status fail \
      --filters 'station_id>3000:station_id>3000' \
      --config-dir partitions_dir \
      --partition-type primary_key \
      --partition-num 20
Constants - file: consts.py
1. Added DEFAULT_PARTITION_TYPE
2. Added PARTITION_TYPES
Partition methods - file: __main__.py
1. _get_arg_partition_type(args): Extract and return partition logic
2. partition_and_store_config_files(args): Build/split config managers and store YAML files
3. partition_configs(args, config_managers): Create a list of lists of config managers using partition filters
4. _get_primary_key_partition_filters(args, config_manager): Get filters for primary_key partition logic
5. _add_partition_filters_and_store(config_managers, partition_filters, config_dir, args): Split ConfigManager objects, add partition filters and store in target dir
6. get_dataframe(config_manager): Build source and target pandas dataframes from input ConfigManager object
7. build_primary_key_agg_config_managers_from_args(args): Build a list of ConfigManager objects for finding count, min and max of primary_key
8. _get_arg_config_dir(args): Return the YAML config folder path from args as a string
Partition methods - file: data_validation.py
1. get_pandas_df(): Build source and target queries, return source and target dataframes
Partition methods - file: cli_tools.py
1. get_target_table_folder_path(config_dir, target_folder_name): Create and return target directory
Partition methods - file: state_manager.py
1. create_partition_config_directory(config_dir: str, target_folder_name: str)
Type hints and docstrings:
1. Added type hints to the above methods
2. Added docstrings with description, args and return type for the above methods
New command added - 'get-partitions'
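To make the primary_key partition idea concrete, here is a minimal sketch that turns a key's count, min and max into range filters. It assumes a single integer key and evenly sized ranges; the function name `primary_key_partition_filters` and the exact filter wording are hypothetical and are not taken from the PR, whose actual logic may differ.

```python
from typing import List


def primary_key_partition_filters(
    key: str, min_value: int, max_value: int, row_count: int, partition_num: int
) -> List[str]:
    """Split [min_value, max_value] into contiguous range filters (illustrative)."""
    # Coalesce as the description states: never more partitions than rows.
    partition_num = max(1, min(partition_num, row_count))
    span = max_value - min_value + 1
    # Extra guard in this sketch: never more partitions than distinct integer values.
    partition_num = min(partition_num, span)

    filters = []
    for index in range(partition_num):
        lower = min_value + (span * index) // partition_num
        upper = min_value + (span * (index + 1)) // partition_num
        if index == partition_num - 1:
            # The last range is closed so the maximum key is included.
            filters.append(f"{key} >= {lower} and {key} <= {max_value}")
        else:
            filters.append(f"{key} >= {lower} and {key} < {upper}")
    return filters


two_partitions = primary_key_partition_filters("key", 0, 19, row_count=100, partition_num=2)
```

Each filter could then be attached to a copy of the base validation config, giving one YAML file per key range.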
/gcbrun
/gcbrun
/gcbrun
/gcbrun
Validation type: custom-query Partition type: primary_key_mod & hash_mod
/gcbrun
cc: @Raniksingh
The biggest suggestion is to use Ibis instead of pandas to generate the partitions, similar to the pattern in the random row builder.
/gcbrun
/gcbrun
/gcbrun
/gcbrun
/gcbrun
LGTM besides 2 comments on the tests - thanks!
tests/unit/test_partition_builder.py
parser = cli_tools.configure_arg_parser()
mock_args = parser.parse_args(CLI_ARGS_JSON_SOURCE)

mock_partition_filters = _generate_fake_partition_filters(
Instead of generating fake partition filters, we can just manually check that the partition_filters_list has the correct length/attributes that we expect so we don't have to copy the logic over.
expected_partition_filters_list = ["key >= 0 and key < 10", "key >= 10 and key < 20"]
assert len(partition_filters_list[0]) == 2
assert partition_filters_list[0] == expected_partition_filters_list
This can also be done for _generate_fake_yaml_configs() to avoid having any logic in the tests.
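The reviewer's point, comparing against literal expected values instead of re-deriving them with helper logic, might look like the following sketch. `build_partition_filters` is a hypothetical stand-in for the code under test, not a function from the PR.

```python
from typing import List


def build_partition_filters(key: str, bounds: List[int]) -> List[str]:
    """Hypothetical stand-in for the code under test."""
    return [
        f"{key} >= {low} and {key} < {high}"
        for low, high in zip(bounds, bounds[1:])
    ]


def test_partition_filters_match_literal_expectations() -> None:
    # The expected values are written out by hand, so the test carries no logic
    # that could silently mirror a bug in the implementation.
    expected = ["key >= 0 and key < 10", "key >= 10 and key < 20"]
    actual = build_partition_filters("key", [0, 10, 20])
    assert len(actual) == 2
    assert actual == expected
```

A hand-written expectation also doubles as documentation of the filter format a reader should expect in the generated configs.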
Removed generation of partition_filters_list and yaml_configs_list via function. Added expected PARTITION_FILTERS_LIST and YAML_CONFIGS_LIST to a json file tests/unit/test_inputs/test_partition_builder.json since expected YAML_CONFIGS_LIST is too large to store in tests/unit/test_partition_builder.py and would make it less readable.
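Keeping large expected values in a JSON file and loading them in the test could be sketched as below. The two top-level keys come from the comment above; the helper name `load_expected`, the sample contents and the temporary file are assumptions made so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_expected(path: Path) -> dict:
    """Read expected test values from a JSON fixture file (hypothetical helper)."""
    with path.open() as handle:
        return json.load(handle)


# Stand-in for tests/unit/test_inputs/test_partition_builder.json with made-up contents.
sample_fixture = {
    "PARTITION_FILTERS_LIST": [["key >= 0 and key < 10", "key >= 10 and key < 20"]],
    "YAML_CONFIGS_LIST": [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    fixture_path = Path(tmp_dir) / "test_partition_builder.json"
    fixture_path.write_text(json.dumps(sample_fixture))
    expected = load_expected(fixture_path)

expected_filters = expected["PARTITION_FILTERS_LIST"][0]
```

In a real test module the path would typically be resolved relative to the test file so the fixture is found regardless of the working directory.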
/gcbrun
@@ -0,0 +1,1956 @@
{
"PARTITION_FILTERS_LIST": [
This file is really long... can we adjust to 3-5 partitions instead of 20?
Done. Reduced test input size.
/gcbrun
/gcbrun
Closes issues #619, #662
Added partition support to generate multiple YAML config files
New:
- Arguments - file: cli_tools.py
- Partition methods - file: __main__.py
- Partition methods - file: partition_builder.py
- Partition methods - file: partition_row_builder.py
- Partition methods - file: cli_tools.py
- Partition methods - file: state_manager.py
- Type Hints and Doc string: