diff --git a/README.md b/README.md
index f4c1b856f..56d072547 100644
--- a/README.md
+++ b/README.md
@@ -1,27 +1,45 @@
 # Data Validation Tool
-The goal of the data validation tool is to allow easy comparison and validation between different tables. This Python CLI tool supports several types of comparison:
-  - Count Validation: Total counts and other aggregates match between source and destination
-  - Partitioned Count: Partitioned counts and other aggregations between source and destination
-    - Grouped values ie SELECT updated_at::DATE, COUNT(1) matches for each date
-  - Filters: Count or Partitioned Count WHERE FILTER_CONDITION
+The goal of the data validation tool is to allow easy comparison and validation
+between different tables. This Python CLI tool supports several types of
+comparison:
+
+```
+- Count Validation: Total counts and other aggregates match between source and destination
+- Partitioned Count: Partitioned counts and other aggregations between source and destination
+  - Grouped values, i.e. SELECT updated_at::DATE, COUNT(1) matches for each date
+- Filters: Count or Partitioned Count WHERE FILTER_CONDITION
+```

 ## Installation
-The [installation](docs/installation.md) page describes the prerequisites and setup steps needed to install and use the data validation tool.
+
+The [installation](docs/installation.md) page describes the prerequisites and
+setup steps needed to install and use the data validation tool.

 ## Usage
-Before using this tool, you will need to create connections to the source and target tables. Once the connections are created, you can run validations on those tables. Validation results can be printed to stdout (default) or outputted to BigQuery. The validation tool also allows you to save or edit validation configurations in a YAML file. This is useful for running common validations or updating the configuration.
+
+Before using this tool, you will need to create connections to the source and
+target tables. Once the connections are created, you can run validations on
+those tables. Validation results can be printed to stdout (default) or output
+to BigQuery. The validation tool also allows you to save or edit validation
+configurations in a YAML file. This is useful for running common validations or
+updating the configuration.

 ### Connections
-The [Connections](docs/connections.md) page provides details about how to create and list connections for the validation tool.
+
+The [Connections](docs/connections.md) page provides details about how to create
+and list connections for the validation tool.

 ### Running CLI Validations

 The data validation CLI is a main interface to use this tool.
-The CLI has several different commands which can be used to create and re-run validations.
+The CLI has several different commands which can be used to create and re-run
+validations.

-The validation tool first expects connections to be created before running validations. To create connections please review the [Connections](connections.md) page.
+The validation tool first expects connections to be created before running
+validations. To create connections, please review the
+[Connections](docs/connections.md) page.

 Once you have your connections set up, you are ready to run the validations.
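+For example, assuming source and target connections named `bq` have already
+been created as described on the Connections page, a simple count validation of
+a single table could look like the sketch below (the flags are documented in
+the CLI reference that follows, and the same example is developed further in
+the Query Configurations section):
+
+```
+# Sketch only: assumes connections named "bq" were created beforehand
+# (see the Connections page).
+data-validation run -t Column -sc bq -tc bq \
+  -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]'
+```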
@@ -36,8 +54,8 @@ data-validation run
   --target-conn TARGET_CONN, -tc TARGET_CONN
                         Target connection details
                         See: *Data Source Configurations* section for each data source
-  --tables-list TABLES, -tbls TABLES
-                        JSON List of tables
+  --tables-list TABLES, -tbls TABLES
+                        JSON List of tables
                         '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips","target_table_name":"citibike_trips"}]'
   --grouped-columns GROUPED_COLUMNS, -gc GROUPED_COLUMNS
                         JSON List of columns to use in group by '["col_a"]'
@@ -60,60 +78,96 @@ data-validation run
   --verbose, -v         Verbose logging will print queries executed
 ```

-The [Examples](docs/examples.md) page provides many examples of how a tool can used to run powerful validations without writing any queries.
-
+The [Examples](docs/examples.md) page provides many examples of how the tool can
+be used to run powerful validations without writing any queries.

 ### Running Custom SQL Exploration
+
 There are many occasions where you need to explore a data source while running
-validations. To avoid the need to open and install a new client, the CLI allows
-you to run custom queries.
-```
-data-validation query
-  --conn connection-name The named connection to be queried.
-  --query, -q The Raw query to run against the supplied connection
-```
+validations. To avoid the need to open and install a new client, the CLI allows
+you to run custom queries.
+
+```
+data-validation query
+  --conn connection-name The named connection to be queried.
+  --query, -q The Raw query to run against the supplied connection
+```

 ## Query Configurations

-You can customize the configuration for any given validation by providing use case specific CLI arguments or editing the saved YAML configuration file.
+You can customize the configuration for any given validation by providing
+use-case-specific CLI arguments or editing the saved YAML configuration file.

-For example, the following command creates a YAML file for the validation of the `new_york_citibike` table.
-```
-data-validation run -t Column -sc bq -tc bq -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]' -c citibike.yaml
-```
+For example, the following command creates a YAML file for the validation of the
+`new_york_citibike` table:
+
+```
+data-validation run -t Column -sc bq -tc bq -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]' -c citibike.yaml
+```

 Here is the generated YAML file named `citibike.yaml`:
+
 ```
 result_handler: {}
-source: bq
-target: bq
-validations:
-- aggregates:
-  - field_alias: count
-    source_column: null
-    target_column: null
-    type: count
-  filters: []
-  labels: []
-  schema_name: bigquery-public-data.new_york_citibike
-  table_name: citibike_trips
-  target_schema_name: bigquery-public-data.new_york_citibike
-  target_table_name: citibike_trips
-  type: Column
-
+source: bq
+target: bq
+validations:
+- aggregates:
+  - field_alias: count
+    source_column: null
+    target_column: null
+    type: count
+  filters: []
+  labels: []
+  schema_name: bigquery-public-data.new_york_citibike
+  table_name: citibike_trips
+  target_schema_name: bigquery-public-data.new_york_citibike
+  target_table_name: citibike_trips
+  type: Column
 ```

-You can now edit the YAML file if, for example, the `new_york_citibike` table is stored in datasets that have different names in the source and target systems. Once the file is updated and saved, the following command runs the new validation:
+You can now edit the YAML file if, for example, the `new_york_citibike` table is
+stored in datasets that have different names in the source and target systems.
+Once the file is updated and saved, the following command runs the new
+validation:
+
 ```
 data-validation run-config -c citibike.yaml
 ```

+The Data Validation Tool exposes several components that can be stitched
+together to generate a wide range of queries.
+
+### Aggregated Fields
+
+Aggregate fields contain the SQL fields that you want to produce an aggregate
+for. Currently the functions `COUNT()`, `AVG()`, `SUM()`, `MIN()` and `MAX()`
+are supported.
+
+#### Sample Aggregate Config
+
+```
+validations:
+- aggregates:
+  - field_alias: count
+    source_column: null
+    target_column: null
+    type: count
+  - field_alias: count__tripduration
+    source_column: tripduration
+    target_column: tripduration
+    type: count
+  - field_alias: sum__tripduration
+    source_column: tripduration
+    target_column: tripduration
+    type: sum
+  - field_alias: bit_xor__hashed_column
+    source_column: hashed_column
+    target_column: hashed_column
+    type: bit_xor
+```

 ### Filters

-Currently the only form of filter supported is a custom filter written by you in the syntax of the given source. In future we will also release pre-built filters to cover certain usecases (ie. `SELECT * FROM table WHERE created_at > 30 days ago;`).
+Filters let you apply a WHERE statement to your validation query. Currently the
+only form of filter supported is a custom filter written by you in the syntax of
+the given source. In the future we will also release pre-built filters to cover
+certain use cases (e.g. `SELECT * FROM table WHERE created_at > 30 days ago;`).

 #### Custom Filters
+
 ```
 {
     "type": "custom",
@@ -122,18 +176,147 @@ Currently the only form of filter supported is a custom filter written by you in
 }
 ```

-Note that you are writing the query to execute, which does not have to match between source and target as long as the results can be expected to align.
+Note that you are writing the query to execute, which does not have to match
+between source and target as long as the results can be expected to align.
+
+### Grouped Columns
+
+Grouped Columns contain the fields you want your aggregations to be broken out
+by, e.g. `SELECT last_updated::DATE, COUNT(*) FROM my.table GROUP BY 1` will
+produce a result set that breaks down the count of rows per calendar date.
+
+### Calculated Fields
+
+Sometimes direct comparisons are not feasible between databases due to
+differences in how particular data types may be handled. These differences can
+be resolved by applying functions to columns in the source query itself.
+Examples might include trimming whitespace from a string, converting strings to
+a single case for case-insensitive comparison, or rounding numeric types to a
+significant figure.
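+If you are unsure how one of these functions behaves in a particular engine,
+you can try the expression directly with the `data-validation query` command
+described earlier. A minimal sketch, assuming a connection named `my_conn` and
+the illustrative table `my.table` used in the examples below:
+
+```
+# "my_conn" and "my.table" are placeholders; substitute your own connection
+# name and table.
+data-validation query \
+  --conn my_conn \
+  --query "SELECT RTRIM(col_a) AS rtrim_col_a FROM my.table LIMIT 10"
+```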
+Once a calculated field is defined, it can be referenced by other calculated
+fields at any "depth" or higher. Depth controls how many subqueries are executed
+in the resulting query. For example, with the following YAML config...
+
+```
+- calculated_fields:
+  - field_alias: rtrim_col_a
+    source_calculated_columns: ['col_a']
+    target_calculated_columns: ['col_a']
+    type: rtrim
+    depth: 0 # generated off of a native column
+  - field_alias: ltrim_col_b
+    source_calculated_columns: ['col_b']
+    target_calculated_columns: ['col_b']
+    type: ltrim
+    depth: 0 # generated off of a native column
+  - field_alias: concat_col_a_col_b
+    source_calculated_columns: ['rtrim_col_a', 'ltrim_col_b']
+    target_calculated_columns: ['rtrim_col_a', 'ltrim_col_b']
+    type: concat
+    depth: 1 # calculated one query above
+```
+
+is equivalent to the following SQL query...
+
+```
+SELECT
+  CONCAT(rtrim_col_a, ltrim_col_b) AS concat_col_a_col_b
+FROM (
+  SELECT
+      RTRIM(col_a) AS rtrim_col_a
+    , LTRIM(col_b) AS ltrim_col_b
+  FROM my.table
+  ) AS table_0
+```
+
+Calculated fields can be used by aggregate fields to produce validations on
+calculated or sanitized raw data, such as calculating the aggregate hash of a
+table. For example, the following YAML config...
+
+```
+validations:
+- aggregates:
+  - field_alias: xor__multi_statement_hash
+    source_column: multi_statement_hash
+    target_column: multi_statement_hash
+    type: bit_xor
+  calculated_fields:
+  - field_alias: multi_statement_hash
+    source_calculated_columns: [multi_statement_concat]
+    target_calculated_columns: [multi_statement_concat]
+    type: hash
+    depth: 2
+  - field_alias: multi_statement_concat
+    source_calculated_columns: [calc_length_col_a,
+                                calc_ifnull_col_b,
+                                calc_rstrip_col_c,
+                                calc_upper_col_d]
+    target_calculated_columns: [calc_length_col_a,
+                                calc_ifnull_col_b,
+                                calc_rstrip_col_c,
+                                calc_upper_col_d]
+    type: concat
+    depth: 1
+  - field_alias: calc_length_col_a
+    source_calculated_columns: [col_a]
+    target_calculated_columns: [col_a]
+    type: length
+    depth: 0
+  - field_alias: calc_ifnull_col_b
+    source_calculated_columns: [col_b]
+    target_calculated_columns: [col_b]
+    type: ifnull
+    depth: 0
+  - field_alias: calc_rstrip_col_c
+    source_calculated_columns: [col_c]
+    target_calculated_columns: [col_c]
+    type: rstrip
+    depth: 0
+  - field_alias: calc_upper_col_d
+    source_calculated_columns: [col_d]
+    target_calculated_columns: [col_d]
+    type: upper
+    depth: 0
+```
+
+is equivalent to the following SQL query...
+
+```
+SELECT
+  BIT_XOR(multi_statement_hash) AS xor__multi_statement_hash
+FROM (
+  SELECT
+    FARM_FINGERPRINT(multi_statement_concat) AS multi_statement_hash
+  FROM (
+    SELECT
+      CONCAT(calc_length_col_a,
+             calc_ifnull_col_b,
+             calc_rstrip_col_c,
+             calc_upper_col_d) AS multi_statement_concat
+    FROM (
+      SELECT
+          CAST(LENGTH(col_a) AS STRING) AS calc_length_col_a
+        , IFNULL(col_b,
+                 'DEFAULT_REPLACEMENT_STRING') AS calc_ifnull_col_b
+        , RTRIM(col_c) AS calc_rstrip_col_c
+        , UPPER(col_d) AS calc_upper_col_d
+      FROM my.table
+      ) AS table_0
+    ) AS table_1
+  ) AS table_2
+```
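+If you want to confirm the SQL that is actually generated and executed for a
+configuration like the ones above, the `--verbose` flag on `data-validation run`
+prints the queries executed. A sketch reusing the earlier `new_york_citibike`
+example:
+
+```
+# --verbose enables verbose logging, which prints the queries executed.
+data-validation run -t Column -sc bq -tc bq \
+  -tbls '[{"schema_name":"bigquery-public-data.new_york_citibike","table_name":"citibike_trips"}]' \
+  --verbose
+```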

 ## Validation Reports

 The data validation tool can write the results of a validation run to Google
 BigQuery or print to Std Out.

-The output handlers tell the data validation where to store the results of each validation.
-By default the handler will print to stdout.
+The output handlers tell the data validation where to store the results of each
+validation. By default the handler will print to stdout.

 ### Configure tool to output to BigQuery
+
 ```
 {
     # Configuration Required for All Data Sources
@@ -164,19 +347,26 @@ The find-tables tool:

 ## Add Support for an existing Ibis Data Source

-If you want to add an Ibis Data Source which exists, but was not yet supported in the Data Validation tool, it is a simple process.
+If you want to add an Ibis Data Source which exists but is not yet supported
+in the Data Validation tool, it is a simple process.
+
+1. In `data_validation/data_validation.py`

-1. In data_validation/data_validation.py
-    - Import the extened Client for the given source (ie. from ibis.sql.mysql.client import MySQLClient).
-    - Add the "": Client to the global CLIENT_LOOKUP dictionary.
+   - Import the extended Client for the given source (e.g.
+     `from ibis.sql.mysql.client import MySQLClient`).
+   - Add the `"<source_type>": Client` entry to the global `CLIENT_LOOKUP`
+     dictionary.

-2. In third_party/ibis/ibis_addon/operations.py
-    - Add the RawSQL operator to the data source registry (for custom filter support).
+2. In `third_party/ibis/ibis_addon/operations.py`

-3. You are done, you can reference the data source via the config.
-    - Config: {"source_type": "", ...KV Values required in Client...}
+   - Add the RawSQL operator to the data source registry (for custom filter
+     support).
+
+3. You are done; you can now reference the data source via the config.
+
+   - Config: `{"source_type": "<source_type>", ...KV values required by the Client...}`

 ## Deploy to Composer
+
 ```
 #!/bin/bash