Commit

feat: Refactor CLI to fit Command Pattern (#303)
* feat: refactor CLI to command pattern

* fix: added format flag for validate command

* fix: update readme formatting

* fix: update sample code to use new CLI options
nehanene15 committed Sep 16, 2021
1 parent 923413d commit f6d2b9d
Showing 14 changed files with 304 additions and 94 deletions.
88 changes: 61 additions & 27 deletions README.md
@@ -70,9 +70,19 @@
Once you have your connections set up, you are ready to run the validations.

### Validation command syntax and options

Below are the command syntax and options for running validations from the CLI.
DVT supports column (including grouped column) and schema validations.

#### Column Validations

Below is the command syntax for column validations. To run a grouped column validation,
simply specify the `--grouped-columns` flag. You can take grouped column validations a
step further by also providing the `--primary-keys` flag: if a mismatch is found, DVT will
drill down into the slice containing the error and find the row (primary key value) with
the inconsistency. An example follows the syntax block below.

```
data-validation (--verbose or -v) validate column
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        e.g. 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--grouped-columns or -gc GROUPED_COLUMNS]
                        Comma separated list of columns for Group By, e.g. col_a,col_b
  [--primary-keys or -pk PRIMARY_KEYS]
                        Comma separated list of columns to use as primary keys
  [--count COLUMNS]     Comma separated list of columns for count or * for all columns
  [--sum COLUMNS]       Comma separated list of columns for sum or * for all numeric
  [--min COLUMNS]       Comma separated list of columns for min or * for all numeric
  [--max COLUMNS]       Comma separated list of columns for max or * for all numeric
  [--avg COLUMNS]       Comma separated list of columns for avg or * for all numeric
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
                        BigQuery destination for validation results. Defaults to stdout.
                        See: *Validation Reports* section
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--filters SOURCE_FILTER:TARGET_FILTER]
                        Colon separated string values of source and target filters.
                        If the target filter is not provided, the source filter will run on both source and target tables.
                        See: *Filters* section
  [--config-file or -c CONFIG_FILE]
                        YAML config file path to be used for storing validations.
  [--threshold or -th THRESHOLD]
                        Float value. Maximum pct_difference allowed for the validation to be considered a success. Defaults to 0.0.
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key-value pair labels for the run.
  [--format or -fmt]    Format for stdout output. Supported formats are (text, csv, json, table).
                        Defaults to table.
```
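
For instance, a grouped column validation with primary-key drill-down might look like the
following sketch (the connection name `my_bq_conn` and the choice of `start_station_id` and
`tripduration` as columns are illustrative assumptions, not part of this commit):

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
--grouped-columns start_station_id --primary-keys start_station_id
--count tripduration
```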

The default aggregation type is a 'COUNT *'. If no aggregation flag (i.e. count,
sum, min, etc.) is provided, the default aggregation will run.
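
As an illustration of that default, the following sketch (connection name hypothetical)
performs only the COUNT * validation, since no aggregation flags are given:

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```
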
The [Examples](docs/examples.md) page provides many examples of how the tool can be
used to run powerful validations without writing any queries.

#### Schema Validations
Below is the command syntax for schema validations. These can be used to compare
column types between the source and target tables.

```
data-validation (--verbose or -v) validate schema
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        e.g. 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
                        BigQuery destination for validation results. Defaults to stdout.
                        See: *Validation Reports* section
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--config-file or -c CONFIG_FILE]
                        YAML config file path to be used for storing validations.
  [--format or -fmt]    Format for stdout output. Supported formats are (text, csv, json, table).
                        Defaults to table.
```
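
A minimal schema validation sketch (again with a hypothetical connection name):

```
data-validation validate schema -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```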

### Running Custom SQL Exploration

There are many occasions where you need to explore a data source while running
@@ -142,7 +177,7 @@
case specific CLI arguments or editing the saved YAML configuration file.
For example, the following command creates a YAML file for the validation of the
`new_york_citibike` table:
```
data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips -c citibike.yaml
```

@@ -360,8 +395,7 @@
View the schema of the results [here](terraform/results_schema.json).
### Configure tool to output to BigQuery

```
data-validation validate column
-sc bq_conn
-tc bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```
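
The diff truncates the remaining flags of this example. Based on the options above, a
complete invocation would presumably end with the result handler flag, along these lines
(project, dataset, and table names hypothetical):

```
data-validation validate column
-sc bq_conn
-tc bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
-bqrh my-project.dvt_results.results
```
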
48 changes: 34 additions & 14 deletions data_validation/__main__.py
@@ -80,15 +80,16 @@ def build_config_from_args(args, config_manager):
        config_manager (ConfigManager): Validation config manager instance.
    """
    config_manager.append_aggregates(get_aggregate_config(args, config_manager))
    if args.primary_keys and not args.grouped_columns:
        raise ValueError(
            "Grouped columns must be specified for primary key level validation."
        )
    if args.grouped_columns:
        grouped_columns = cli_tools.get_arg_list(args.grouped_columns)
        config_manager.append_query_groups(
            config_manager.build_config_grouped_columns(grouped_columns)
        )
    if args.primary_keys:
        primary_keys = cli_tools.get_arg_list(args.primary_keys, default_value=[])
        config_manager.append_primary_keys(
            config_manager.build_config_grouped_columns(primary_keys)
@@ -103,12 +104,14 @@ def build_config_managers_from_args(args):
    """Return a list of config managers ready to execute."""
    configs = []

    if args.type is None:
        config_type = args.validate_cmd.capitalize()
    else:
        config_type = args.type

    source_conn = cli_tools.get_connection(args.source_conn)
    target_conn = cli_tools.get_connection(args.target_conn)

    result_handler_config = None
    if args.bq_result_handler:
        result_handler_config = cli_tools.get_result_handler(
            args.bq_result_handler, args.service_account
        )
    elif args.result_handler_config:
        result_handler_config = cli_tools.get_result_handler(
            args.result_handler_config, args.service_account
        )

    # Schema validation will not accept filters, labels, or threshold as flags
    filter_config, labels, threshold = [], [], 0.0
    if config_type != consts.SCHEMA_VALIDATION:
        if args.filters:
            filter_config = cli_tools.get_filters(args.filters)
        if args.threshold:
            threshold = args.threshold
        labels = cli_tools.get_labels(args.labels)

    source_client = clients.get_data_client(source_conn)
    target_client = clients.get_data_client(target_conn)

    format = args.format if args.format else "table"

    is_filesystem = True if source_conn["source_type"] == "FileSystem" else False
@@ -149,7 +156,10 @@
            filter_config=filter_config,
            verbose=args.verbose,
        )
        if config_type != consts.SCHEMA_VALIDATION:
            config_manager = build_config_from_args(args, config_manager)

        configs.append(config_manager)

    return configs

@@ -302,7 +312,7 @@ def run_validations(args, config_managers):


def store_yaml_config_file(args, config_managers):
    """Build a YAML config file from the supplied configs.

    Args:
        config_managers (list[ConfigManager]): List of config manager instances.
@@ -338,6 +348,14 @@ def run_connections(args):
        raise ValueError(f"Connections Argument '{args.connect_cmd}' is not supported")


def validate(args):
    """Run commands related to data validation."""
    if args.validate_cmd == "column" or args.validate_cmd == "schema":
        run(args)
    else:
        raise ValueError(f"Validation Argument '{args.validate_cmd}' is not supported")


def main():
    # Create Parser and Get Deployment Info
    args = cli_tools.get_parsed_args()
@@ -353,6 +371,8 @@
        print(find_tables_using_string_matching(args))
    elif args.command == "query":
        print(run_raw_query_against_connection(args))
    elif args.command == "validate":
        validate(args)
    else:
        raise ValueError(f"Positional Argument '{args.command}' is not supported")
