feat: Issue619 determine the method for generating batches efficiently #653

Conversation

mohdt786
Contributor

@mohdt786 mohdt786 commented Dec 21, 2022

Closes issues #619 and #662.
Added partition support to generate multiple YAML config files.

New:

- Arguments - file: cli_tools.py

1. New command 'generate-partitions' added to generate partitions for the following Validation types:
    1. row 
2. --partition-type: Specify the type of partition logic:
    1. primary_key
    2. primary_key_mod(TODO)
    3. hash_mod(TODO)
3. --partition-num: Number of partitions/config files to create.
    Range=[1,1000]
    If the specified value is greater than count(*), the value is coalesced to count(*)
4. --config-dir: Directory Path to store YAML Config Files
5. Added required arguments group to distinguish from optional arguments
6. Added mutually exclusive arguments group for --hash and --concat
Example: data-validation generate-partitions row \
            -sc BQ_CONN \
            -tc BQ_CONN \
            -tbls bigquery-public-data.new_york_citibike.citibike_stations,mohammedturky-sql.dvt.citibike_stations \
            --primary-keys station_id,region_id \
            --hash \* \
            --filter-status fail \
            --filters 'station_id>3000:station_id>3000' \
            --config-dir partitions_dir \
            --partition-type primary_key \
            --partition-num 20

For details, use: data-validation generate-partitions -h
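The argument layout described above can be sketched with `argparse`. This is a minimal illustration rather than the code in `cli_tools.py`: the function names, the choice of which arguments are required, and the default of `primary_key` for `--partition-type` are assumptions made for the example.

```python
import argparse

# Only primary_key is implemented; primary_key_mod and hash_mod are listed as TODO.
PARTITION_TYPES = ["primary_key"]


def partition_num_type(value: str) -> int:
    """Parse --partition-num and enforce the documented range [1, 1000]."""
    number = int(value)
    if not 1 <= number <= 1000:
        raise argparse.ArgumentTypeError("--partition-num must be in [1, 1000]")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the arguments listed in the description."""
    parser = argparse.ArgumentParser(prog="data-validation")
    subparsers = parser.add_subparsers(dest="command")
    partition_parser = subparsers.add_parser("generate-partitions")
    partition_parser.add_argument("validation_type", choices=["row"])

    # A named group so that required arguments show up separately in -h output.
    required = partition_parser.add_argument_group("required arguments")
    required.add_argument("--primary-keys", required=True)
    required.add_argument("--config-dir", required=True)
    required.add_argument("--partition-num", required=True, type=partition_num_type)

    # --hash and --concat cannot be combined (assumed here: one of them is needed).
    hash_or_concat = partition_parser.add_mutually_exclusive_group(required=True)
    hash_or_concat.add_argument("--hash")
    hash_or_concat.add_argument("--concat")

    # Assumption: the default partition type is primary_key.
    partition_parser.add_argument(
        "--partition-type", choices=PARTITION_TYPES, default="primary_key"
    )
    return parser
```

With this sketch, an out-of-range `--partition-num` or a command line that passes both `--hash` and `--concat` makes `argparse` exit with a usage error before any partitioning logic runs.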

- Partition methods - file: __main__.py

partition_and_store_config_files(): Build ConfigManager and Route based on Validate command

- Partition methods - file: partition_builder.py

1. _get_arg_config_dir(): Get config directory from args
2. _get_arg_partition_type(): Get partition type from args
3. _get_yaml_from_config(): Convert ConfigManager object to YAML
4. partition_configs(): Route according to partition type
5. _get_primary_key_partition_filters(): Get partition filters for Primary key type partition logic
6. _add_partition_filters_and_store(): Add partition filters to generate multiple yamls and store
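As a rough illustration of the `primary_key` partition logic, the sketch below turns a primary key's min, max and row count into one filter string per partition. The description does not spell out the exact algorithm, so this is an assumption: it supposes a single integer primary key, splits the key range into equal-width slices, and uses a hypothetical function name.

```python
from typing import List


def primary_key_partition_filters(
    primary_key: str,
    min_value: int,
    max_value: int,
    row_count: int,
    partition_num: int,
) -> List[str]:
    """Return one range filter per partition for an integer primary key."""
    # Coalesce the requested number of partitions to count(*), with a floor of 1.
    partition_num = max(1, min(partition_num, row_count))
    span = max_value - min_value + 1

    filters = []
    for index in range(partition_num):
        lower = min_value + (span * index) // partition_num
        upper = min_value + (span * (index + 1)) // partition_num
        if index == partition_num - 1:
            # The last partition is closed so that max_value is included.
            filters.append(f"{primary_key} >= {lower} and {primary_key} <= {max_value}")
        else:
            filters.append(f"{primary_key} >= {lower} and {primary_key} < {upper}")
    return filters
```

Each filter could then be appended to a copy of the validation config, producing one YAML file per partition. Equal-width slices only give evenly sized partitions when the key values are spread uniformly, which is one reason other partition types such as `hash_mod` are listed as TODO.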

- Partition methods - file: partition_row_builder.py

1. _compile_query(): Return Ibis Query object
2. get_max_query(): Build max query
3. get_min_query(): Build min query
4. get_count_query(): Build count query
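The PR builds the min, max and count queries with Ibis. To show what those aggregates provide without depending on an Ibis backend, the sketch below uses the standard-library `sqlite3` module as a stand-in and fetches all three values in one query. The function name and the combined query are assumptions for illustration, not the approach taken in `partition_row_builder.py`.

```python
import sqlite3
from typing import Dict, Optional


def primary_key_stats(
    conn: sqlite3.Connection, table: str, primary_key: str
) -> Dict[str, Optional[int]]:
    """Return count, min and max of a primary key column."""
    # Identifiers are interpolated directly, so they must come from trusted config.
    query = (
        f"SELECT COUNT(*), MIN({primary_key}), MAX({primary_key}) FROM {table}"
    )
    count, min_value, max_value = conn.execute(query).fetchone()
    return {"count": count, "min": min_value, "max": max_value}


# Usage with a small in-memory table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE stations (station_id INTEGER PRIMARY KEY)")
connection.executemany(
    "INSERT INTO stations (station_id) VALUES (?)", [(i,) for i in range(3001, 3006)]
)
stats = primary_key_stats(connection, "stations", "station_id")
```

The three values are the only inputs the `primary_key` partition logic needs from the database: count(*) caps the number of partitions, while min and max bound the key range to split.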

- Partition methods - file: cli_tools.py

get_target_table_folder_path(config_dir, target_folder_name): Create and return target directory

- Partition methods - file: state_manager.py

create_partition_config_directory(config_dir: str,target_folder_name: str): Create target directory for each table
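A per-table target directory of this kind can be created with `pathlib`. The sketch below is an assumption about the behaviour, not the code in `state_manager.py`: the function names are hypothetical, only local paths are handled, and the zero-padded YAML file names are made up for the example.

```python
from pathlib import Path
from typing import List


def make_table_config_dir(config_dir: str, target_folder_name: str) -> Path:
    """Create <config_dir>/<target_folder_name> if needed and return it."""
    target = Path(config_dir) / target_folder_name
    # parents=True creates config_dir itself; exist_ok=True makes reruns safe.
    target.mkdir(parents=True, exist_ok=True)
    return target


def partition_file_paths(target_dir: Path, partition_num: int) -> List[Path]:
    """Return one hypothetical YAML path per partition, e.g. 0000.yaml."""
    return [target_dir / f"{index:04d}.yaml" for index in range(partition_num)]
```

Keeping one folder per table means a later run can pick up all partition files for a table from a single directory.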

- Type Hints and Doc string:

1. Added Type Hints to the above methods
2. Added Doc string with desc, args and return type for above methods

New:
    Arguments - file: cli_tools.py
    1. New command 'get-partitions' added to generate partitions for the following Validation types:
        1. row
        2. custom-query(TODO)
    2. --partition-type: Specify the type of partition logic:
        1. primary_key
        2. primary_key_mod(TODO)
        3. hash_mod(TODO)
    3. --partition-num: Number of partitions/config files to create.
        Range=[1,1000]
        If the specified value is greater than count(*), the value is coalesced to count(*)
    4. --config-dir: Directory Path to store YAML Config Files
    5. Added required arguments group to distinguish from optional arguments
    6. Added mutually exclusive arguments group for --hash and --concat

    Constants - file: consts.py
    1. Added DEFAULT_PARTITION_TYPE
    2. Added PARTITION_TYPES

    Partition methods - file: __main__.py
    1. _get_arg_partition_type(args): extract and return partition logic
    2. partition_and_store_config_files(args): Build/split config managers and store yaml files
    3. partition_configs(args, config_managers): Create a list of lists of config managers using partition filters
    4. _get_primary_key_partition_filters(args, config_manager): Get filters for primary_key partition logic
    5. _add_partition_filters_to_config(config_managers, partition_filters): Split ConfigManager objects and Add partition Filters
    6. get_dataframe(config_manager): Build source and target pandas dataframes from input ConfigManager object
    7. build_primary_key_agg_config_managers_from_args(args): Build a list of ConfigManager object for finding count, min and max of primary_key

    Partition methods - file: data_validation.py
    1. get_pandas_df(): Build source and target queries, return source and target dataframes

    Type Hints and Doc string:
    1. Added Type Hints to the above methods
    2. Added Doc string with desc, args and return type for above methods
New:

    Partition methods - file: __main__.py
    1. _add_partition_filters_and_store(config_managers, partition_filters,config_dir,args): Split ConfigManager objects, Add partition Filters and store in target dir
    2. _get_arg_config_dir(args): Return String yaml config folder path from args.

    Partition methods - file: cli_tools.py
    1. get_target_table_folder_path(config_dir, target_folder_name): Create and return target directory

    Partition methods - file: state_manager.py
    1. create_partition_config_directory(config_dir: str,target_folder_name: str)

    Type Hints and Doc string:
    1. Added Type Hints to the above methods
    2. Added Doc string with desc, args and return type for above methods
@mohdt786 mohdt786 linked an issue Dec 21, 2022 that may be closed by this pull request
New:
    Arguments - file: cli_tools.py
    1. New command 'get-partitions' added to generate partitions for the following Validation types:
        1. row
        2. custom-query(TODO)
    2. --partition-type: Specify the type of partition logic:
        1. primary_key
        2. primary_key_mod(TODO)
        3. hash_mod(TODO)
    3. --partition-num: Number of partitions/config files to create.
        Range=[1,1000]
        If the specified value is greater than count(*), the value is coalesced to count(*)
    4. --config-dir: Directory Path to store YAML Config Files
    5. Added required arguments group to distinguish from optional arguments
    6. Added mutually exclusive arguments group for --hash and --concat
    Example: data-validation get-partitions row \
                -sc BQ_CONN \
                -tc BQ_CONN \
                -tbls bigquery-public-data.new_york_citibike.citibike_stations,mohammedturky-sql.dvt.citibike_stations \
                --primary-keys station_id,region_id \
                --hash \* \
                --filter-status fail \
                --filters 'station_id>3000:station_id>3000' \
                --config-dir partitions_dir \
                --partition-type primary_key \
                --partition-num 20

    Constants - file: consts.py
    1. Added DEFAULT_PARTITION_TYPE
    2. Added PARTITION_TYPES

    Partition methods - file: __main__.py
    1. _get_arg_partition_type(args): extract and return partition logic
    2. partition_and_store_config_files(args): Build/split config managers and store yaml files
    3. partition_configs(args, config_managers): Create a list of lists of config managers using partition filters
    4. _get_primary_key_partition_filters(args, config_manager): Get filters for primary_key partition logic
    5. _add_partition_filters_and_store(config_managers, partition_filters,config_dir,args): Split ConfigManager objects, Add partition Filters and store in target dir
    6. get_dataframe(config_manager): Build source and target pandas dataframes from input ConfigManager object
    7. build_primary_key_agg_config_managers_from_args(args): Build a list of ConfigManager object for finding count, min and max of primary_key
    8. _get_arg_config_dir(args): Return String yaml config folder path from args.

    Partition methods - file: data_validation.py
    1. get_pandas_df(): Build source and target queries, return source and target dataframes

    Partition methods - file: cli_tools.py
    1. get_target_table_folder_path(config_dir, target_folder_name): Create and return target directory

    Partition methods - file: state_manager.py
    1. create_partition_config_directory(config_dir: str,target_folder_name: str)

    Type Hints and Doc string:
    1. Added Type Hints to the above methods
    2. Added Doc string with desc, args and return type for above methods

New Command added - 'get-partitions'
@mohdt786 mohdt786 changed the title Issue619 determine the method for generating batches efficiently feat: Issue619 determine the method for generating batches efficiently Dec 21, 2022
@mohdt786
Contributor Author

/gcbrun

@mohdt786 mohdt786 marked this pull request as draft December 21, 2022 19:09
@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786 mohdt786 marked this pull request as ready for review December 22, 2022 03:16
Validation type: custom-query
Partition type: primary_key_mod & hash_mod
@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

cc: @Raniksingh

Collaborator

@nehanene15 nehanene15 left a comment

The biggest suggestion is to use Ibis instead of pandas to generate the partitions, similar to the pattern in the random row builder.

Resolved review threads (now outdated): data_validation/cli_tools.py (5), data_validation/state_manager.py (1), data_validation/__main__.py (4)
@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

Collaborator

@nehanene15 nehanene15 left a comment

LGTM besides 2 comments on the tests - thanks!

tests/unit/test_partition_builder.py
parser = cli_tools.configure_arg_parser()
mock_args = parser.parse_args(CLI_ARGS_JSON_SOURCE)

mock_partition_filters = _generate_fake_partition_filters(
Collaborator

Instead of generating fake partition filters, we can just manually check that the partition_filters_list has the correct length/attributes that we expect so we don't have to copy the logic over.

expected_partition_filters_list = ["key >= 0 and key < 10", "key >= 10 and key < 20"]
assert len(partition_filters_list[0]) == len(expected_partition_filters_list)
assert partition_filters_list[0] == expected_partition_filters_list

This can also be done for _generate_fake_yaml_configs() to avoid having any logic in the tests.

Contributor Author

Removed generation of partition_filters_list and yaml_configs_list via function.

Added expected PARTITION_FILTERS_LIST and YAML_CONFIGS_LIST to a json file tests/unit/test_inputs/test_partition_builder.json since expected YAML_CONFIGS_LIST is too large to store in tests/unit/test_partition_builder.py and would make it less readable.
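A test that reads its expected values from a JSON file might look like the sketch below. The helper name and the tiny fixture content are hypothetical; the real file is tests/unit/test_inputs/test_partition_builder.json, and a temporary file is used here only so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_expected_values(path: Path) -> Dict[str, Any]:
    """Load hard-coded expected test values from a JSON file."""
    with path.open() as handle:
        return json.load(handle)


# Hypothetical fixture content, far smaller than the real test input.
fixture = {
    "PARTITION_FILTERS_LIST": [
        ["key >= 0 and key < 10", "key >= 10 and key <= 19"]
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    fixture_path = Path(tmp_dir) / "test_partition_builder.json"
    fixture_path.write_text(json.dumps(fixture))
    expected = load_expected_values(fixture_path)

# A test would compare the builder's real output against these values.
expected_filters = expected["PARTITION_FILTERS_LIST"][0]
```

Because the expected values are plain data, the test contains no copy of the partitioning logic, which is the point of the review comment above.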

@mohdt786
Contributor Author

/gcbrun

@@ -0,0 +1,1956 @@
{
"PARTITION_FILTERS_LIST": [
Collaborator

This file is really long... can we adjust to 3-5 partitions instead of 20?

Contributor Author

Done. Reduced test input size

@mohdt786
Contributor Author

/gcbrun

@mohdt786
Contributor Author

/gcbrun

@mohdt786 mohdt786 merged commit f79c308 into develop Jan 25, 2023
@mohdt786 mohdt786 deleted the issue619-determine-the-method-for-generating-batches-efficiently branch January 25, 2023 04:19
@mohdt786 mohdt786 self-assigned this Feb 15, 2023
2 participants