
feat!: Adds custom query row level hash validation feature. #440

Merged Apr 28, 2022 (36 commits; changes shown from 33 commits)

Commits
308d5b4
added custom-query sub-option to validate command
Robby29 Mar 2, 2022
614ade6
add source and target query option in custom query
Robby29 Mar 7, 2022
72b55f3
added min,max,sum aggregates with custom query
Robby29 Mar 8, 2022
1567c09
fixed hive t0 column name addition issue
Robby29 Mar 12, 2022
f9ab9f2
added empty query file check
Robby29 Mar 13, 2022
455eeab
linting fixes
Robby29 Mar 13, 2022
f732c41
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Mar 16, 2022
5b2b24b
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Mar 21, 2022
994064f
added unit tests
Robby29 Mar 21, 2022
9b084e4
Merge branch 'develop' of https://github.com/Robby29/professional-ser…
Robby29 Mar 21, 2022
78b8146
incorporated black linting changes
Robby29 Mar 21, 2022
cd43524
incorporated flake linter changes
Robby29 Mar 21, 2022
a6d8ab2
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Mar 24, 2022
3bec521
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Apr 4, 2022
7546fa9
Fixed result schema status to validation_status to avoid duplicate co…
Raniksingh Apr 4, 2022
cb622fd
Fixed linting on tests folder
Raniksingh Apr 5, 2022
dbea3e7
BREAKING CHANGE: update BQ results schema column name 'status' to 'va…
nehanene15 Apr 5, 2022
8d29441
Added script to update Bigquery schema
Raniksingh Apr 9, 2022
e2ed1d3
Moved bq_utils to right folder
Raniksingh Apr 11, 2022
3f26768
Updated bash script path and formatting
Raniksingh Apr 11, 2022
56287fa
Added custom query row validation feature
Robby29 Apr 12, 2022
fc5294d
Added custom query row validation feature.
Robby29 Apr 12, 2022
8c0ea59
Merge branch 'develop' of https://github.com/GoogleCloudPlatform/prof…
Robby29 Apr 12, 2022
c929532
Incorporated black and flake8 linting changes.
Robby29 Apr 12, 2022
d4b0f11
Added wildcard-include-string-len sub option
Robby29 Apr 13, 2022
9169fc4
Fixed custom query column bug
Robby29 Apr 14, 2022
baa511a
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Apr 14, 2022
05ab925
Made changes as per review from @dhercher
Robby29 Apr 14, 2022
b6b374d
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Apr 23, 2022
210222a
new changes according to Neha's review requests
Robby29 Apr 23, 2022
8448587
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Apr 28, 2022
74ab258
changed custom query type from list to string
Robby29 Apr 28, 2022
ee26db9
made custom query type argument required=true
Robby29 Apr 28, 2022
0e30f18
Merge branch 'GoogleCloudPlatform:develop' into develop
Robby29 Apr 28, 2022
70b78f0
typo changes
Robby29 Apr 28, 2022
021865f
typo fix
Robby29 Apr 28, 2022
54 changes: 51 additions & 3 deletions README.md
@@ -230,9 +230,9 @@ data-validation (--verbose or -v) validate schema
Defaults to table.
```

#### Custom Query Validations
### Custom Query Column Validations

Below is the command syntax for custom query validations.
Below is the command syntax for custom query column validations.

```
data-validation (--verbose or -v) validate custom-query
@@ -246,7 +246,10 @@ data-validation (--verbose or -v) validate custom-query
Comma separated list of tables in the form schema.table=target_schema.target_table
Target schema name and table name are optional.
i.e 'bigquery-public-data.new_york_citibike.citibike_trips'
--source-query-file SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
--custom-query-type CUSTOM_QUERY_TYPE, -cqt CUSTOM_QUERY_TYPE
Type of custom query validation: ('row'|'column')
Enter 'column' for custom query column validation
--source-query-file SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
File containing the source sql commands
--target-query-file TARGET_QUERY_FILE, -tqf TARGET_QUERY_FILE
File containing the target sql commands
@@ -273,6 +276,51 @@
The [Examples](docs/examples.md) page provides a few examples of how this tool can
be used to run custom query validations.
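
For illustration only, a minimal column-validation invocation might look like the following; the connection names (`my_source_conn`, `my_target_conn`) and query file paths are placeholders, not values taken from this PR:

```
data-validation validate custom-query \
  -sc my_source_conn \
  -tc my_target_conn \
  --tables-list bigquery-public-data.new_york_citibike.citibike_trips \
  --custom-query-type column \
  --source-query-file source_query.sql \
  --target-query-file target_query.sql \
  --format table
```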


### Custom Query Row Validations

#### (Note: Row hash validation is currently only supported for BigQuery, Impala/Hive and Teradata)

To run row-level validations, pass the `--hash` flag with the value `*`; all fields
of the custom query result will then be concatenated and hashed.

Below is the command syntax for custom query row validations, followed by an example invocation.

```
data-validation (--verbose or -v) validate custom-query
--source-conn or -sc SOURCE_CONN
Source connection details
See: *Data Source Configurations* section for each data source
--target-conn or -tc TARGET_CONN
Target connection details
See: *Connections* section for each data source
--tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
Comma separated list of tables in the form schema.table=target_schema.target_table
Target schema name and table name are optional.
i.e 'bigquery-public-data.new_york_citibike.citibike_trips'
--custom-query-type CUSTOM_QUERY_TYPE, -cqt CUSTOM_QUERY_TYPE
Type of custom query validation: ('row'|'column')
Enter 'row' for custom query row validation
--source-query-file SOURCE_QUERY_FILE, -sqf SOURCE_QUERY_FILE
File containing the source sql commands
--target-query-file TARGET_QUERY_FILE, -tqf TARGET_QUERY_FILE
File containing the target sql commands
--hash '*' '*' to hash all columns.
[--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
BigQuery destination for validation results. Defaults to stdout.
See: *Validation Reports* section
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
Defaults to table.
```
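
As a sketch only (the connection names and query file paths below are placeholders, not values from this PR), a row-level custom query validation could be invoked like this:

```
data-validation validate custom-query \
  -sc my_source_conn \
  -tc my_target_conn \
  --tables-list bigquery-public-data.new_york_citibike.citibike_trips \
  --custom-query-type row \
  --source-query-file source_query.sql \
  --target-query-file target_query.sql \
  --hash '*' \
  --format table
```

With `--hash '*'`, every column returned by the source and target queries is concatenated and hashed, and the resulting hash values (joined on `hash__all`, per the `data_validation.py` change in this PR) are compared row by row.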

The [Examples](docs/examples.md) page provides a few examples of how this tool can
be used to run custom query row validations.


### Running Custom SQL Exploration

There are many occasions where you need to explore a data source while running
6 changes: 6 additions & 0 deletions data_validation/__main__.py
@@ -177,6 +177,12 @@ def build_config_from_args(args, config_manager):

if config_manager.validation_type == consts.CUSTOM_QUERY:
config_manager.append_aggregates(get_aggregate_config(args, config_manager))
if args.custom_query_type is not None:
config_manager.append_custom_query_type(args.custom_query_type)
else:
raise ValueError(
"Expected custom query type to be given, got empty string."
)
if args.source_query_file is not None:
query_file = cli_tools.get_arg_list(args.source_query_file)
config_manager.append_source_query_file(query_file)
12 changes: 12 additions & 0 deletions data_validation/cli_tools.py
@@ -537,6 +537,12 @@ def _configure_schema_parser(schema_parser):
def _configure_custom_query_parser(custom_query_parser):
"""Configure arguments to run custom-query validations."""
_add_common_arguments(custom_query_parser)
custom_query_parser.add_argument(
"--custom-query-type",
"-cqt",
required=True,
help="Which type of custom query (row/column)",
)
custom_query_parser.add_argument(
"--source-query-file",
"-sqf",
@@ -609,6 +615,12 @@ def _configure_custom_query_parser(custom_query_parser):
"-pk",
help="Comma separated list of primary key columns 'col_a,col_b'",
)
custom_query_parser.add_argument(
"--wildcard-include-string-len",
"-wis",
action="store_true",
help="Include string fields for wildcard aggregations.",
)


def _add_common_arguments(parser):
11 changes: 7 additions & 4 deletions data_validation/combiner.py
@@ -75,9 +75,11 @@ def generate_report(
differences_pivot = _calculate_differences(
source, target, join_on_fields, run_metadata.validations, is_value_comparison
)

source_pivot = _pivot_result(
source, join_on_fields, run_metadata.validations, consts.RESULT_TYPE_SOURCE
)

target_pivot = _pivot_result(
target, join_on_fields, run_metadata.validations, consts.RESULT_TYPE_TARGET
)
@@ -149,7 +151,6 @@ def _calculate_difference(field_differences, datatype, validation, is_value_comp
.else_(consts.VALIDATION_STATUS_SUCCESS)
.end()
)

return (
difference.name("difference"),
pct_difference.name("pct_difference"),
@@ -178,7 +179,6 @@ def _calculate_differences(
# When no join_on_fields are present, we expect only one row per table.
# This is validated in generate_report before this function is called.
differences_joined = source.cross_join(target)

differences_pivots = []
for field, field_type in schema.items():
if field not in validations:
@@ -201,7 +201,6 @@
)
]
)

differences_pivot = functools.reduce(
lambda pivot1, pivot2: pivot1.union(pivot2), differences_pivots
)
@@ -210,7 +209,11 @@

def _pivot_result(result, join_on_fields, validations, result_type):
all_fields = frozenset(result.schema().names)
validation_fields = all_fields - frozenset(join_on_fields)
validation_fields = (
all_fields - frozenset(join_on_fields)
if "hash__all" not in join_on_fields
else all_fields
)
pivots = []

for field in validation_fields:
11 changes: 11 additions & 0 deletions data_validation/config_manager.py
@@ -150,6 +150,17 @@ def append_query_groups(self, grouped_column_configs):
self.query_groups + grouped_column_configs
)

@property
def custom_query_type(self):
"""Return custom query type from config"""
return self._config.get(consts.CONFIG_CUSTOM_QUERY_TYPE, "")

def append_custom_query_type(self, custom_query_type):
"""Append custom query type config to existing config."""
self._config[consts.CONFIG_CUSTOM_QUERY_TYPE] = (
self.custom_query_type + custom_query_type
)

@property
def source_query_file(self):
"""Return SQL Query File from Config"""
2 changes: 1 addition & 1 deletion data_validation/consts.py
@@ -50,7 +50,7 @@
CONFIG_MAX_RECURSIVE_QUERY_SIZE = "max_recursive_query_size"
CONFIG_SOURCE_QUERY_FILE = "source_query_file"
CONFIG_TARGET_QUERY_FILE = "target_query_file"

CONFIG_CUSTOM_QUERY_TYPE = "custom_query_type"
CONFIG_FILTER_SOURCE_COLUMN = "source_column"
CONFIG_FILTER_SOURCE_VALUE = "source_value"
CONFIG_FILTER_TARGET_COLUMN = "target_column"
9 changes: 9 additions & 0 deletions data_validation/data_validation.py
@@ -291,10 +291,19 @@ def _execute_validation(self, validation_builder, process_in_memory=True):
if self.config_manager.validation_type == consts.ROW_VALIDATION
else set(validation_builder.get_group_aliases())
)
if (
self.config_manager.validation_type == consts.CUSTOM_QUERY
and self.config_manager.custom_query_type == "row"
):
join_on_fields = set(["hash__all"])

# If row validation from YAML, compare source and target agg values
is_value_comparison = (
self.config_manager.validation_type == consts.ROW_VALIDATION
or (
self.config_manager.validation_type == consts.CUSTOM_QUERY
and self.config_manager.custom_query_type == "row"
)
)

if process_in_memory:
Expand Down
Loading