feat: Add support to generate a JSON config file only for application purposes (#1089)

* Initial changes to get the JSON config

* Initial changes to add new config-file-json flag feature

* feat: only support saving to JSON, not running JSON configs

* feat: remove configs run flags for JSON configs

* Update docstrings

* Start adding JSON config file in our docs

* Add unit test. Remove unused function param

* Add PROJECT_ID env for unit tests at Nox file

* Reformatted noxfile.py with Black lib

* Reformatted noxfile.py with Black lib

* Change scope to get the PROJECT_ID

* Move new JSON config test to BQ system tests file

* Update related documentation. Delete unused directory

* Changes after PR review

---------

Co-authored-by: Neha Nene <[email protected]>
helensilva14 and nehanene15 committed Feb 12, 2024
1 parent c599ebf commit d463038
Showing 16 changed files with 164 additions and 99 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -84,6 +84,7 @@ terraform.rc

# Custom
*.yaml
*.json
partitions_dir
setup.sh

37 changes: 26 additions & 11 deletions README.md
@@ -53,8 +53,8 @@ setup steps needed to install and use the Data Validation Tool.
Before using this tool, you will need to create connections to the source and
target tables. Once the connections are created, you can run validations on
those tables. Validation results can be printed to stdout (default) or outputted
to BigQuery (recommended). DVT also allows you to save or edit validation
configurations in a YAML file. This is useful for running common validations or
to BigQuery (recommended). DVT also allows you to save and edit validation
configurations in a YAML or JSON file. This is useful for running common validations or
updating the configuration.

### Managing Connections
@@ -123,7 +123,10 @@ data-validation (--verbose or -v) (--log-level or -ll) validate column
If target filter is not provided, the source filter will run on source and target tables.
See: *Filters* section
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
YAML Config File Path to be used for storing validations and other features.
See: *Running DVT with YAML Configuration Files* section
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations only for application purposes.
[--threshold or -th THRESHOLD]
Float value. Maximum pct_difference allowed for validation to be considered a success. Defaults to 0.0
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
@@ -159,7 +162,7 @@ and finally hashing the row.

Under the hood, row validation uses
[Calculated Fields](https://github.com/GoogleCloudPlatform/professional-services-data-validator#calculated-fields) to
apply functions such as IFNULL() or RTRIM(). These can be edited in the YAML config to customize your row validation.
apply functions such as IFNULL() or RTRIM(). These can be edited in the YAML or JSON config file to customize your row validation.

```
data-validation (--verbose or -v) (--log-level or -ll) validate row
@@ -190,7 +193,10 @@ data-validation (--verbose or -v) (--log-level or -ll) validate row
If target filter is not provided, the source filter will run on source and target tables.
See: *Filters* section
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
YAML Config File Path to be used for storing validations and other features.
See: *Running DVT with YAML Configuration Files* section
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations only for application purposes.
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
@@ -267,7 +273,10 @@ data-validation (--verbose or -v) (--log-level or -ll) validate schema
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
YAML Config File Path to be used for storing validations and other features.
See: *Running DVT with YAML Configuration Files* section
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations only for application purposes.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
Defaults to table.
[--filter-status or -fs STATUSES_LIST]
@@ -318,7 +327,10 @@ data-validation (--verbose or -v) (--log-level or -ll) validate custom-query col
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
YAML Config File Path to be used for storing validations and other features.
See: *Running DVT with YAML Configuration Files* section
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations only for application purposes.
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
@@ -377,7 +389,10 @@ data-validation (--verbose or -v) (--log-level or -ll) validate custom-query row
[--service-account or -sa PATH_TO_SA_KEY]
Service account to use for BigQuery result handler output.
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
YAML Config File Path to be used for storing validations and other features.
See: *Running DVT with YAML Configuration Files* section
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations only for application purposes.
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
@@ -426,7 +441,7 @@ The following command creates a YAML file for the validation of the
my_bq_conn -tbls bigquery-public-data.new_york_citibike.citibike_trips -c
citibike.yaml`.

The vaildation config file is saved to the GCS path specified by the `PSO_DV_CONFIG_HOME`
The validation config file is saved to the GCS path specified by the `PSO_DV_CONFIG_HOME`
env variable if that has been set; otherwise, it is saved to wherever the tool is run.

You can now edit the YAML file if, for example, the `new_york_citibike` table is
@@ -627,7 +642,7 @@ significant figure.

Once a calculated field is defined, it can be referenced by other calculated
fields at any "depth" or higher. Depth controls how many subqueries are executed
in the resulting query. For example, with the following YAML config...
in the resulting query. For example, with the following YAML config:

```yaml
- calculated_fields:
@@ -648,7 +663,7 @@ in the resulting query. For example, with the following YAML config
depth: 1 # calculated one query above
```

is equivalent to the following SQL query...
is equivalent to the following SQL query:

```sql
SELECT
```
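Taken together, the README changes above document the new `--config-file-json`/`-cj` flag on every validate command. As a minimal sketch (not part of the commit), the flag could be exercised end to end like this, assuming the `data-validation` CLI is installed; the connection name, table, and output path mirror the citibike example and are illustrative:

```python
# Generate a JSON config with the new -cj flag, then load it for application use.
import json
import subprocess

subprocess.run(
    [
        "data-validation", "validate", "column",
        "-sc", "my_bq_conn",   # source connection name (assumed)
        "-tc", "my_bq_conn",   # target connection name (assumed)
        "-tbls", "bigquery-public-data.new_york_citibike.citibike_trips",
        "-cj", "citibike.json",  # new flag: save the config as JSON instead of YAML
    ],
    check=True,
)

with open("citibike.json") as f:
    config = json.load(f)  # plain dict, including the source_conn/target_conn blocks

print(config["source_conn"])
```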
47 changes: 44 additions & 3 deletions data_validation/__main__.py
@@ -45,13 +45,21 @@


def _get_arg_config_file(args):
"""Return String yaml config file path."""
"""Return String YAML config file path."""
if not args.config_file:
raise ValueError("YAML Config File was not supplied.")

return args.config_file


def _get_arg_config_file_json(args):
"""Return String JSON config file path."""
if not args.config_file_json:
raise ValueError("JSON Config File was not supplied.")

return args.config_file_json


def get_aggregate_config(args, config_manager: ConfigManager):
"""Return list of formated aggregation objects.
@@ -489,6 +497,26 @@ def convert_config_to_yaml(args, config_managers):
return yaml_config


def convert_config_to_json(config_managers) -> dict:
"""Return dict objects formatted for json validations.
JSON configs correspond to ConfigManager objects and therefore can only correspond to
one table validation.
Args:
config_managers (list[ConfigManager]): List of config manager instances.
"""

if len(config_managers) > 1:
raise ValueError(
"JSON configs can only be created for single table validations."
)
config_manager = config_managers[0]
json_config = config_manager.config
json_config[consts.CONFIG_SOURCE_CONN] = config_manager.get_source_connection()
json_config[consts.CONFIG_TARGET_CONN] = config_manager.get_target_connection()
return json_config


def run_validation(config_manager, dry_run=False, verbose=False):
"""Run a single validation.
@@ -552,7 +580,7 @@ def run_validations(args, config_managers):
for config_manager in config_managers:
if config_manager.config and consts.CONFIG_FILE in config_manager.config:
logging.info(
"Currently running the validation for yml file: %s",
"Currently running the validation for YAML file: %s",
config_manager.config[consts.CONFIG_FILE],
)
try:
@@ -580,6 +608,17 @@ def store_yaml_config_file(args, config_managers):
cli_tools.store_validation(config_file_path, yaml_configs)


def store_json_config_file(args, config_managers):
"""Build a JSON config file from the supplied configs.
Args:
config_managers (list[ConfigManager]): List of config manager instances.
"""
json_config = convert_config_to_json(config_managers)
config_file_path = _get_arg_config_file_json(args)
cli_tools.store_validation_json(config_file_path, json_config)


def partition_and_store_config_files(args: Namespace) -> None:
"""Build multiple YAML Config files using user specified partition logic
@@ -597,7 +636,7 @@

def run(args) -> None:
"""Splits execution into:
1. Build and save single Yaml Config file
1. Build and save single Config file (YAML or JSON)
2. Run Validations
Args:
Expand All @@ -609,6 +648,8 @@ def run(args) -> None:
config_managers = build_config_managers_from_args(args)
if args.config_file:
store_yaml_config_file(args, config_managers)
elif args.config_file_json:
store_json_config_file(args, config_managers)
else:
run_validations(args, config_managers)

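To make the new branch in `run()` concrete, here is a toy sketch of the dispatch order; the string results stand in for DVT's real helpers:

```python
# Toy model of run(): --config-file wins, then --config-file-json,
# otherwise validations execute immediately.
from argparse import Namespace

def dispatch(args: Namespace) -> str:
    if args.config_file:
        return "store_yaml_config_file"
    elif args.config_file_json:
        return "store_json_config_file"
    return "run_validations"

assert dispatch(Namespace(config_file="v.yaml", config_file_json=None)) == "store_yaml_config_file"
assert dispatch(Namespace(config_file=None, config_file_json="v.json")) == "store_json_config_file"
assert dispatch(Namespace(config_file=None, config_file_json=None)) == "run_validations"
```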
15 changes: 13 additions & 2 deletions data_validation/cli_tools.py
@@ -321,7 +321,7 @@ def _configure_raw_query(subparsers):


def _configure_validation_config_parser(subparsers):
"""Configure arguments to run a data validation YAML config."""
"""Configure arguments to run a data validation YAML config file."""
validation_config_parser = subparsers.add_parser(
"configs", help="Run validations stored in a YAML config file"
)
@@ -922,7 +922,12 @@ def _add_common_arguments(optional_arguments, required_arguments):
optional_arguments.add_argument(
"--config-file",
"-c",
help="Store the validation in the YAML Config File Path specified",
help="Store the validation config in the YAML File Path specified",
)
optional_arguments.add_argument(
"--config-file-json",
"-cj",
help="Store the validation config in the JSON File Path specified to be used for application use cases",
)
optional_arguments.add_argument(
"--format",
@@ -1074,6 +1079,12 @@ def store_validation(validation_file_name, yaml_config):
mgr.create_validation_yaml(validation_file_name, yaml_config)


def store_validation_json(validation_file_name, json_config):
"""Store the validation JSON config under the given name."""
mgr = state_manager.StateManager()
mgr.create_validation_json(validation_file_name, json_config)


def store_partition(target_file_path, yaml_config, target_folder_path=None):
"""Store the partition YAML config under the given name."""
mgr = state_manager.StateManager(target_folder_path)
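The flag itself is plain argparse. A standalone sketch with the same option strings (this parser is a stand-in, not DVT's real one) shows that the dest `config_file_json` is derived from the long option:

```python
import argparse

parser = argparse.ArgumentParser(prog="data-validation")
parser.add_argument("--config-file", "-c",
                    help="Store the validation config in the YAML File Path specified")
parser.add_argument("--config-file-json", "-cj",
                    help="Store the validation config in the JSON File Path specified")

args = parser.parse_args(["-cj", "citibike.json"])
print(args.config_file_json)  # citibike.json
```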
1 change: 1 addition & 0 deletions data_validation/consts.py
@@ -19,6 +19,7 @@
SECRET_MANAGER_PROJECT_ID = "secret_manager_project_id"
CONFIG = "config"
CONFIG_FILE = "config_file"
CONFIG_FILE_JSON = "config_file_json"
CONFIG_SOURCE_CONN_NAME = "source_conn_name"
CONFIG_TARGET_CONN_NAME = "target_conn_name"
CONFIG_SOURCE_CONN = "source_conn"
13 changes: 12 additions & 1 deletion data_validation/state_manager.py
@@ -116,6 +116,17 @@ def create_validation_yaml(self, name: str, yaml_config: Dict[str, str]):
yaml_config_str = dump(yaml_config, Dumper=Dumper)
self._write_file(validation_path, yaml_config_str)

def create_validation_json(self, name: str, json_config: Dict[str, str]):
"""Create a validation file and store the given config as JSON.
Args:
name (String): The name of the validation.
json_config (Dict): A dictionary with the validation details.
"""
validation_path = self._get_validation_path(name)
json_config_str = json.dumps(json_config)
self._write_file(validation_path, json_config_str)

def create_partition_yaml(self, target_file_path: str, yaml_config: Dict[str, str]):
"""Create a validation file and store the given config as YAML.
@@ -131,7 +142,7 @@ def get_validation_config(self, name: str, config_dir=None) -> Dict[str, str]:
"""Get a validation configuration from the expected file.
Args:
name: The name of the validation.
name: The name of the validation file.
Returns:
A dict of the validation values from the file.
"""
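`create_validation_json` differs from its YAML sibling only in the serializer. A minimal sketch of that step, with an illustrative config and a plain file write standing in for `_write_file`:

```python
import json

json_config = {"source_conn": {"source_type": "BigQuery", "project_id": "my-project"}}
json_config_str = json.dumps(json_config)  # same call the method makes

with open("validation.json", "w") as f:  # stand-in for StateManager._write_file
    f.write(json_config_str)

assert json.loads(json_config_str) == json_config  # lossless round trip
```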
16 changes: 4 additions & 12 deletions samples/airflow/README.md
@@ -15,19 +15,14 @@ By default, the DAG will output the results to BigQuery as a result handler.
### Instructions

1. Download the DAG file in this directory
2. Update the JSON configuration for your use case (update table names, etc.)
2. Get the JSON configuration for your use case, explained in the next section
3. Upload it to the DAGs folder in your Airflow environment

### Limitations
## JSON Configuration

The Airflow DAG expects the raw config JSON which is not the same as a YAML config converted to JSON.
The Airflow DAG expects a JSON config content which is not the same as a YAML config converted to JSON format. The parameters in a typical YAML config file for DVT are slightly different from the JSON config file version, which is generated after DVT parses the YAML.

Parameters in a typical YAML config file are slightly different from the raw JSON config,
which is generated after DVT parses the YAML. The [build_config_manager()](https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/config_manager.py#L429)
method generates the JSON config and should be used as a reference.

Our Cloud Run sample also expects a raw JSON config in the `'data'` variable shown
[here](https://github.com/GoogleCloudPlatform/professional-services-data-validator/tree/develop/samples/run#test-cloud-run-endpoint).
You can get the JSON content specific for your scenario by using our CLI and providing the argument to generate the JSON config file [`--config-file-json` or `-cj <filepath>.json`]. IMPORTANT: do not forget to make the necessary adjustments between JSON and Python objects, check [this link as a reference](https://python-course.eu/applications-python/json-and-python.php).

For example, the following YAML config is equivalent to the JSON config below, where the source param is written as `source_conn`.

@@ -58,6 +53,3 @@ validations:
]
}
```

For more implementation details, [this](https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/config_manager.py#L444)
is where the raw JSON config is generated in the DVT code.
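One way to make the JSON-versus-Python-object adjustments mentioned above is to load the generated file with the `json` module instead of pasting its text into the DAG; a sketch, with an assumed file name:

```python
import json

# json.load converts JSON literals (true/false/null) into Python objects
# (True/False/None), so the result is a dict usable directly in the DAG.
with open("citibike.json") as f:  # produced with --config-file-json / -cj
    config = json.load(f)

print(type(config))  # <class 'dict'>
```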
2 changes: 2 additions & 0 deletions samples/airflow/dvt_airflow_dag.py
@@ -56,6 +56,8 @@ def validation_function(project):

BQ_CONN = {"source_type": "BigQuery", "project_id": project}

# You can get the JSON content specific for your scenario by using our CLI and providing the argument to generate the JSON config file [`--config-file-json` or `-cj <filepath>.json`].
# IMPORTANT: do not forget to make the necessary adjustments between JSON and Python objects, check this link as a reference: https://python-course.eu/applications-python/json-and-python.php.
GROUPED_CONFIG_COUNT_VALID = {
# BigQuery Specific Connection Config
"source_conn": BQ_CONN,
4 changes: 3 additions & 1 deletion samples/functions/README.md
@@ -32,7 +32,9 @@ cd ../../
```

### JSON Configuration
Below is an example of the JSON configuration that can be passed to the Cloud Function.

Below is an example of the JSON configuration that can be passed to the Cloud Function. You can get the JSON content specific for your scenario by using our CLI and providing the argument to generate the JSON config file [`--config-file-json` or `-cj <filepath>.json`]. IMPORTANT: do not forget to make the necessary adjustments between JSON and Python objects, check [this link as a reference](https://python-course.eu/applications-python/json-and-python.php).

```json
{
"config":{
5 changes: 3 additions & 2 deletions samples/run/README.md
@@ -34,7 +34,7 @@ gcloud run deploy --image gcr.io/${PROJECT_ID}/data-validation \

You can easily run a request via Python. For a quick test, we have provided this logic in `test.py` to run a validation against a public BigQuery table. The example is similar and also shows how you can forward results to BigQuery from the Cloud Run job:

```
```python
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -75,7 +75,8 @@ def get_cloud_run_url(service_name, project_id):

return re.findall("URL:.*\n", description)[0].split()[1].strip()

# You can get the JSON content specific for your scenario by using our CLI and providing the argument to generate the JSON config file [`--config-file-json` or `-cj <filepath>.json`].
# IMPORTANT: do not forget to make the necessary adjustments between JSON and Python objects, check this link as a reference: https://python-course.eu/applications-python/json-and-python.php.
data = {
"source_conn": {
"source_type": "BigQuery",
3 changes: 2 additions & 1 deletion samples/run/test.py
@@ -39,7 +39,8 @@ def get_cloud_run_url(service_name, project_id):

return re.findall("URL:.*\n", description)[0].split()[1].strip()


# You can get the JSON content specific for your scenario by using our CLI and providing the argument to generate the JSON config file [`--config-file-json` or `-cj <filepath>.json`].
# IMPORTANT: do not forget to make the necessary adjustments between JSON and Python objects, check this link as a reference: https://python-course.eu/applications-python/json-and-python.php.
data = {
"source_conn": {
"source_type": "BigQuery",
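Following the new comment, the `data` dict can come straight from a generated file rather than being maintained inline. A sketch, assuming the file path and a requests-style POST (the sample's actual request code is not shown here):

```python
import json
import requests

with open("citibike.json") as f:  # generated with --config-file-json / -cj
    data = json.load(f)

url = "https://data-validation-xxxxx-uc.a.run.app"  # assumed Cloud Run endpoint
resp = requests.post(url, json=data)
print(resp.status_code, resp.text)
```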
Empty file removed samples/tests/__init__.py