Commit

feat: Refactor CLI to fit Command Pattern (#303)
* feat: refactor CLI to command pattern

* fix: added format flag for validate command

* fix: update readme formatting

* fix: update sample code to use new CLI options
nehanene15 committed Sep 16, 2021
1 parent 923413d commit f6d2b9d
Showing 14 changed files with 304 additions and 94 deletions.
88 changes: 61 additions & 27 deletions README.md
@@ -70,9 +70,19 @@
Once you have your connections set up, you are ready to run the validations.

### Validation command syntax and options

Below are the command syntax and options for running validations from the CLI.
DVT supports column (including grouped column) and schema validations.

#### Column Validations

Below is the command syntax for column validations. To run a grouped column validation,
simply specify the `--grouped-columns` flag. You can take grouped column validations a
step further by also providing the `--primary-keys` flag: if a mismatch is found, DVT will
drill down into the slice containing the error and find the row (primary key value) with
the inconsistency. An example follows the syntax block below.

```
data-validation (--verbose or -v) validate column
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        e.g. 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--grouped-columns or -gc GROUPED_COLUMNS]
                        Comma separated list of columns for Group By, e.g. col_a,col_b
  [--primary-keys or -pk PRIMARY_KEYS]
                        Comma separated list of columns to use as primary keys
  [--count COLUMNS]     Comma separated list of columns for count or * for all columns
  [--sum COLUMNS]       Comma separated list of columns for sum or * for all numeric
  [--min COLUMNS]       Comma separated list of columns for min or * for all numeric
  [--max COLUMNS]       Comma separated list of columns for max or * for all numeric
  [--avg COLUMNS]       Comma separated list of columns for avg or * for all numeric
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
                        BigQuery destination for validation results. Defaults to stdout.
                        See: *Validation Reports* section
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--filters SOURCE_FILTER:TARGET_FILTER]
                        Colon separated string values of source and target filters.
                        If the target filter is not provided, the source filter will run on both source and target tables.
                        See: *Filters* section
  [--config-file or -c CONFIG_FILE]
                        YAML config file path to be used for storing validations.
  [--threshold or -th THRESHOLD]
                        Float value. Maximum pct_difference allowed for the validation to be considered a success. Defaults to 0.0.
  [--labels or -l KEY1=VALUE1,KEY2=VALUE2]
                        Comma-separated key-value pair labels for the run.
  [--format or -fmt]    Format for stdout output. Supported formats are (text, csv, json, table).
                        Defaults to table.
```
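
For instance, a grouped column validation with primary-key drill-down might look like the
following sketch (the connection name `my_bq_conn` and the choice of `start_station_id` and
`tripduration` as columns are illustrative assumptions, not part of this commit):

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
--grouped-columns start_station_id --primary-keys start_station_id
--count tripduration
```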

The default aggregation type is a 'COUNT *'. If no aggregation flag (i.e. count,
sum, min, etc.) is provided, the default aggregation will run.
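
As an illustration of that default, the following sketch (connection name hypothetical)
performs only the COUNT * validation, since no aggregation flags are given:

```
data-validation validate column -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```
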
The [Examples](docs/examples.md) page provides many examples of how the tool can be
used to run powerful validations without writing any queries.

#### Schema Validations
Below is the command syntax for schema validations. These can be used to compare
column types between the source and target tables.

```
data-validation (--verbose or -v) validate schema
  --source-conn or -sc SOURCE_CONN
                        Source connection details
                        See: *Data Source Configurations* section for each data source
  --target-conn or -tc TARGET_CONN
                        Target connection details
                        See: *Connections* section for each data source
  --tables-list or -tbls SOURCE_SCHEMA.SOURCE_TABLE=TARGET_SCHEMA.TARGET_TABLE
                        Comma separated list of tables in the form schema.table=target_schema.target_table
                        Target schema name and table name are optional.
                        e.g. 'bigquery-public-data.new_york_citibike.citibike_trips'
  [--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
                        BigQuery destination for validation results. Defaults to stdout.
                        See: *Validation Reports* section
  [--service-account or -sa PATH_TO_SA_KEY]
                        Service account to use for BigQuery result handler output.
  [--config-file or -c CONFIG_FILE]
                        YAML config file path to be used for storing validations.
  [--format or -fmt]    Format for stdout output. Supported formats are (text, csv, json, table).
                        Defaults to table.
```
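
A minimal schema validation sketch (again with a hypothetical connection name):

```
data-validation validate schema -sc my_bq_conn -tc my_bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```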

### Running Custom SQL Exploration

There are many occasions where you need to explore a data source while running
@@ -142,7 +177,7 @@
case specific CLI arguments or editing the saved YAML configuration file.
For example, the following command creates a YAML file for the validation of the
`new_york_citibike` table:
```
data-validation validate column -sc my_bq_conn -tc my_bq_conn -tbls
bigquery-public-data.new_york_citibike.citibike_trips -c citibike.yaml
```

@@ -360,8 +395,7 @@
View the schema of the results [here](terraform/results_schema.json).
### Configure tool to output to BigQuery

```
data-validation validate column
-sc bq_conn
-tc bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
```
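
The diff truncates the remaining flags of this example. Based on the options above, a
complete invocation would presumably end with the result handler flag, along these lines
(project, dataset, and table names hypothetical):

```
data-validation validate column
-sc bq_conn
-tc bq_conn
-tbls bigquery-public-data.new_york_citibike.citibike_trips
-bqrh my-project.dvt_results.results
```
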
48 changes: 34 additions & 14 deletions data_validation/__main__.py
@@ -80,15 +80,16 @@ def build_config_from_args(args, config_manager):
        config_manager (ConfigManager): Validation config manager instance.
    """
    config_manager.append_aggregates(get_aggregate_config(args, config_manager))
    if args.primary_keys and not args.grouped_columns:
        raise ValueError(
            "Grouped columns must be specified for primary key level validation."
        )
    if args.grouped_columns:
        grouped_columns = cli_tools.get_arg_list(args.grouped_columns)
        config_manager.append_query_groups(
            config_manager.build_config_grouped_columns(grouped_columns)
        )
    if args.primary_keys:
        primary_keys = cli_tools.get_arg_list(args.primary_keys, default_value=[])
        config_manager.append_primary_keys(
            config_manager.build_config_grouped_columns(primary_keys)
@@ -103,12 +104,14 @@ def build_config_managers_from_args(args):
    """Return a list of config managers ready to execute."""
    configs = []

    if args.type is None:
        config_type = args.validate_cmd.capitalize()
    else:
        config_type = args.type

    source_conn = cli_tools.get_connection(args.source_conn)
    target_conn = cli_tools.get_connection(args.target_conn)

    result_handler_config = None
    if args.bq_result_handler:
        result_handler_config = cli_tools.get_result_handler(
            args.bq_result_handler, args.service_account
        )
    elif args.result_handler_config:
        result_handler_config = cli_tools.get_result_handler(
            args.result_handler_config, args.service_account
        )

    # Schema validation will not accept filters, labels, or threshold as flags
    filter_config, labels, threshold = [], [], 0.0
    if config_type != consts.SCHEMA_VALIDATION:
        if args.filters:
            filter_config = cli_tools.get_filters(args.filters)
        if args.threshold:
            threshold = args.threshold
        labels = cli_tools.get_labels(args.labels)

    source_client = clients.get_data_client(source_conn)
    target_client = clients.get_data_client(target_conn)

    format = args.format if args.format else "table"

    is_filesystem = True if source_conn["source_type"] == "FileSystem" else False
@@ -149,7 +156,10 @@
            filter_config=filter_config,
            verbose=args.verbose,
        )
        if config_type != consts.SCHEMA_VALIDATION:
            config_manager = build_config_from_args(args, config_manager)

        configs.append(config_manager)

    return configs

@@ -302,7 +312,7 @@ def run_validations(args, config_managers):


def store_yaml_config_file(args, config_managers):
    """Build a YAML config file from the supplied configs.

    Args:
        config_managers (list[ConfigManager]): List of config manager instances.
@@ -338,6 +348,14 @@ def run_connections(args):
        raise ValueError(f"Connections Argument '{args.connect_cmd}' is not supported")


def validate(args):
    """Run commands related to data validation."""
    if args.validate_cmd == "column" or args.validate_cmd == "schema":
        run(args)
    else:
        raise ValueError(f"Validation Argument '{args.validate_cmd}' is not supported")


def main():
    # Create Parser and Get Deployment Info
    args = cli_tools.get_parsed_args()
@@ -353,6 +371,8 @@
        print(find_tables_using_string_matching(args))
    elif args.command == "query":
        print(run_raw_query_against_connection(args))
    elif args.command == "validate":
        validate(args)
    else:
        raise ValueError(f"Positional Argument '{args.command}' is not supported")
