feat: Refactor CLI to fit Command Pattern #303

Merged · 5 commits · Sep 16, 2021
88 changes: 61 additions & 27 deletions README.md
@@ -70,9 +70,19 @@ Once you have your connections set up, you are ready to run the validations.

### Validation command syntax and options

Below are the command syntax and options for running validations from the CLI.
DVT supports column (including grouped column) and schema validations.

#### Column Validations

Below is the command syntax for column validations. To run a grouped column validation,
specify the `--grouped-columns` flag. You can take a grouped column validation
a step further by also providing the `--primary-keys` flag: if a mismatch is found,
DVT will dive into the slice containing the error and find the row (primary key value) with the
inconsistency. A worked example follows the syntax block below.

```
data-validation run
--type or -t TYPE Type of Data Validation (Column, GroupedColumn, Row, Schema)
data-validation (--verbose or -v) validate column
--source-conn or -sc SOURCE_CONN
Source connection details
See: *Data Source Configurations* section for each data source
@@ -83,35 +93,33 @@ data-validation run
Comma separated list of tables in the form schema.table=target_schema.target_table
Target schema name and table name are optional.
i.e. 'bigquery-public-data.new_york_citibike.citibike_trips'
--grouped-columns or -gc GROUPED_COLUMNS
[--grouped-columns or -gc GROUPED_COLUMNS]
Comma separated list of columns for Group By, i.e. col_a,col_b
(Optional) Only used in GroupedColumn validations
--primary-keys or -pc PRIMARY_KEYS
[--primary-keys or -pk PRIMARY_KEYS]
Comma separated list of columns to use as primary keys
(Optional) Only use in Row validations
--count COLUMNS Comma separated list of columns for count or * for all columns
--sum COLUMNS Comma separated list of columns for sum or * for all numeric
--min COLUMNS Comma separated list of columns for min or * for all numeric
--max COLUMNS Comma separated list of columns for max or * for all numeric
--avg COLUMNS Comma separated list of columns for avg or * for all numeric
--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE
(Optional) BigQuery destination for validation results. Defaults to stdout.
(Note) Only use with grouped column validation
[--count COLUMNS] Comma separated list of columns for count or * for all columns
[--sum COLUMNS] Comma separated list of columns for sum or * for all numeric
[--min COLUMNS] Comma separated list of columns for min or * for all numeric
[--max COLUMNS] Comma separated list of columns for max or * for all numeric
[--avg COLUMNS] Comma separated list of columns for avg or * for all numeric
[--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
BigQuery destination for validation results. Defaults to stdout.
See: *Validation Reports* section
--service-account or -sa PATH_TO_SA_KEY
(Optional) Service account to use for BigQuery result handler output.
--filters SOURCE_FILTER:TARGET_FILTER
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--filters SOURCE_FILTER:TARGET_FILTER]
Colon separated string values of source and target filters.
If the target filter is not provided, the source filter will run on both source and target tables.
See: *Filters* section
--config-file or -c CONFIG_FILE
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
--threshold or -th THRESHOLD
(Optional) Float value. Maximum pct_difference allowed for validation to be considered a success. Defaults to 0.0
--labels or -l KEY1=VALUE1,KEY2=VALUE2
(Optional) Comma-separated key value pair labels for the run.
--verbose or -v Verbose logging will print queries executed
--format or -fmt Format for stdout output, Supported formats are (text, csv, json, table)
It defaults to table.
[--threshold or -th THRESHOLD]
Float value. Maximum pct_difference allowed for validation to be considered a success. Defaults to 0.0
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
Defaults to table.
```
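
For example, a grouped column validation that compares a per-group aggregate might look like the
following sketch. The connection names and the `start_station_name`/`tripduration` columns are
illustrative only, not part of this PR:

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips
--grouped-columns start_station_name --sum tripduration
```

Here the sum of `tripduration` is compared per `start_station_name` group rather than once across
the whole table.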

The default aggregation type is a 'COUNT *'. If no aggregation flag (i.e. count,
@@ -120,6 +128,33 @@ sum, min, etc.) is provided, the default aggregation will run.
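
For instance, a minimal sketch that relies on the default aggregation (reusing the illustrative
`my_bq_conn` connection) validates only the row counts of the two tables:

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips
```
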
The [Examples](docs/examples.md) page provides many examples of how the tool can be
used to run powerful validations without writing any queries.

#### Schema Validations
Below is the syntax for schema validations. These can be used to compare column types between source
and target.

```
data-validation (--verbose or -v) validate schema
--source-conn or -sc SOURCE_CONN
Source connection details
See: *Data Source Configurations* section for each data source
--target-conn or -tc TARGET_CONN
Target connection details
See: *Connections* section for each data source
--tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
Comma separated list of tables in the form schema.table=target_schema.target_table
Target schema name and table name are optional.
i.e. 'bigquery-public-data.new_york_citibike.citibike_trips'
[--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
BigQuery destination for validation results. Defaults to stdout.
See: *Validation Reports* section
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
Defaults to table.
```
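
As a sketch, reusing the illustrative connections from above, a schema validation needs only the
connections and the table list:

```
data-validation validate schema -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips
```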

### Running Custom SQL Exploration

There are many occasions where you need to explore a data source while running
@@ -142,7 +177,7 @@ case specific CLI arguments or editing the saved YAML configuration file.
For example, the following command creates a YAML file for the validation of the
`new_york_citibike` table:
```
data-validation run -t Column -sc my_bq_conn -tc my_bq_conn -tbls
data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips -c citibike.yaml
```
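
The saved validation can then be executed from the YAML file; assuming the `run-config` command,
which this PR does not change, that looks like:

```
data-validation run-config -c citibike.yaml
```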

@@ -360,8 +395,7 @@ View the schema of the results [here](terraform/results_schema.json).
### Configure tool to output to BigQuery

```
data-validation run
-t Column
data-validation validate column
-sc bq_conn
-tc bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
48 changes: 34 additions & 14 deletions data_validation/__main__.py
@@ -80,15 +80,16 @@ def build_config_from_args(args, config_manager):
config_manager (ConfigManager): Validation config manager instance.
"""
config_manager.append_aggregates(get_aggregate_config(args, config_manager))
if config_manager.validation_type in [
consts.GROUPED_COLUMN_VALIDATION,
consts.ROW_VALIDATION,
]:
if args.primary_keys and not args.grouped_columns:
raise ValueError(
"Grouped columns must be specified for primary key level validation."
)
if args.grouped_columns:
grouped_columns = cli_tools.get_arg_list(args.grouped_columns)
config_manager.append_query_groups(
config_manager.build_config_grouped_columns(grouped_columns)
)
if config_manager.validation_type in [consts.ROW_VALIDATION]:
if args.primary_keys:
primary_keys = cli_tools.get_arg_list(args.primary_keys, default_value=[])
config_manager.append_primary_keys(
config_manager.build_config_grouped_columns(primary_keys)
@@ -103,12 +104,14 @@ def build_config_managers_from_args(args):
"""Return a list of config managers ready to execute."""
configs = []

config_type = args.type
if args.type is None:
config_type = args.validate_cmd.capitalize()
else:
config_type = args.type

source_conn = cli_tools.get_connection(args.source_conn)
target_conn = cli_tools.get_connection(args.target_conn)

labels = cli_tools.get_labels(args.labels)

result_handler_config = None
if args.bq_result_handler:
result_handler_config = cli_tools.get_result_handler(
@@ -119,14 +122,18 @@
args.result_handler_config, args.service_account
)

filter_config = []
if args.filters:
filter_config = cli_tools.get_filters(args.filters)
# Schema validation will not accept filters, labels, or threshold as flags
filter_config, labels, threshold = [], [], 0.0
if config_type != consts.SCHEMA_VALIDATION:
if args.filters:
filter_config = cli_tools.get_filters(args.filters)
if args.threshold:
threshold = args.threshold
labels = cli_tools.get_labels(args.labels)

source_client = clients.get_data_client(source_conn)
target_client = clients.get_data_client(target_conn)

threshold = args.threshold if args.threshold else 0.0
format = args.format if args.format else "table"

is_filesystem = True if source_conn["source_type"] == "FileSystem" else False
@@ -149,7 +156,10 @@ filter_config=filter_config,
filter_config=filter_config,
verbose=args.verbose,
)
configs.append(build_config_from_args(args, config_manager))
if config_type != consts.SCHEMA_VALIDATION:
config_manager = build_config_from_args(args, config_manager)

configs.append(config_manager)

return configs

@@ -302,7 +312,7 @@ def run_validations(args, config_managers):


def store_yaml_config_file(args, config_managers):
"""Build a YAML config file fromt he supplied configs.
"""Build a YAML config file from the supplied configs.

Args:
config_managers (list[ConfigManager]): List of config manager instances.
@@ -338,6 +348,14 @@ def run_connections(args):
raise ValueError(f"Connections Argument '{args.connect_cmd}' is not supported")


def validate(args):
""" Run commands related to data validation."""
if args.validate_cmd == "column" or args.validate_cmd == "schema":
run(args)
else:
raise ValueError(f"Validation Argument '{args.validate_cmd}' is not supported")


def main():
# Create Parser and Get Deployment Info
args = cli_tools.get_parsed_args()
@@ -353,6 +371,8 @@ def main():
print(find_tables_using_string_matching(args))
elif args.command == "query":
print(run_raw_query_against_connection(args))
elif args.command == "validate":
validate(args)
else:
raise ValueError(f"Positional Argument '{args.command}' is not supported")
