
feat: Add support to generate a JSON config file only for applications purposes #1089

Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -84,6 +84,7 @@ terraform.rc

# Custom
*.yaml
*.json
partitions_dir
setup.sh

14 changes: 12 additions & 2 deletions README.md
@@ -54,7 +54,7 @@ Before using this tool, you will need to create connections to the source and
target tables. Once the connections are created, you can run validations on
those tables. Validation results can be printed to stdout (default) or outputted
to BigQuery (recommended). DVT also allows you to save or edit validation
configurations in a YAML file. This is useful for running common validations or
configurations in a YAML or JSON file. This is useful for running common validations or
updating the configuration.

### Managing Connections
@@ -124,6 +124,8 @@ data-validation (--verbose or -v) (--log-level or -ll) validate column
See: *Filters* section
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations.
[--threshold or -th THRESHOLD]
Float value. Maximum pct_difference allowed for validation to be considered a success. Defaults to 0.0
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
@@ -191,6 +193,8 @@ data-validation (--verbose or -v) (--log-level or -ll) validate row
See: *Filters* section
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations.
[--labels or -l KEY1=VALUE1,KEY2=VALUE2]
Comma-separated key value pair labels for the run.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
@@ -268,6 +272,8 @@ data-validation (--verbose or -v) (--log-level or -ll) validate schema
Service account to use for BigQuery result handler output.
[--config-file or -c CONFIG_FILE]
YAML Config File Path to be used for storing validations.
[--config-file-json or -cj CONFIG_FILE_JSON]
JSON Config File Path to be used for storing validations.
[--format or -fmt] Format for stdout output. Supported formats are (text, csv, json, table).
Defaults to table.
[--filter-status or -fs STATUSES_LIST]
@@ -426,7 +432,7 @@ The following command creates a YAML file for the validation of the
my_bq_conn -tbls bigquery-public-data.new_york_citibike.citibike_trips -c
citibike.yaml`.

The vaildation config file is saved to the GCS path specified by the `PSO_DV_CONFIG_HOME`
The validation config file is saved to the GCS path specified by the `PSO_DV_CONFIG_HOME`
env variable if that has been set; otherwise, it is saved to wherever the tool is run.
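
The lookup rule above can be sketched as follows (an illustrative reimplementation of the described behavior, not the tool's actual code):

```python
import os

def resolve_config_dir() -> str:
    """Sketch of the rule above: prefer the PSO_DV_CONFIG_HOME path
    (which may be a GCS path such as gs://bucket/dir) and fall back to
    the directory the tool is run from."""
    return os.environ.get("PSO_DV_CONFIG_HOME") or os.getcwd()
```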

You can now edit the YAML file if, for example, the `new_york_citibike` table is
@@ -472,6 +478,10 @@ In Cloud Run, the [job](https://cloud.google.com/run/docs/create-jobs) must be r

By default, each partition validation is retried up to 3 times if there is an error. In Kubernetes and Cloud Run, you can set the parallelism to the number you want. Keep in mind that if you are validating 1000's of partitions in parallel, you may find that setting the parallelism too high (say 100) may result in timeouts and slow down the validation.
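
The retry behavior described above can be sketched as follows (illustrative only, not DVT's actual implementation):

```python
# Illustrative retry wrapper: attempt a partition validation up to 3 times,
# re-raising the last error only after the final attempt fails.
def run_with_retries(validate, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return validate()
        except Exception:
            if attempt == max_attempts:
                raise
```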

### JSON Configuration Files

Validations can also be saved to a JSON config file with the `--config-file-json` (or `-cj`) flag, mirroring the YAML workflow described above. A JSON config describes a single table validation.
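
As a rough sketch (file name and keys are illustrative, mirroring constants visible in this PR), a stored JSON config is an ordinary JSON document that round-trips through Python's `json` module:

```python
import json

# Illustrative only: a minimal column-validation config as it might be
# stored by --config-file-json. Key names mirror constants in this PR.
config = {
    "type": "Column",
    "schema_name": "pso_data_validator",
    "table_name": "dvt_core_types",
    "target_schema_name": "pso_data_validator",
    "target_table_name": "dvt_core_types",
    "threshold": 0.0,
}

# The state manager saves with json.dumps and reads back with json.loads,
# so the config must round-trip losslessly.
serialized = json.dumps(config)
assert json.loads(serialized) == config
```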

### Validation Reports

The result handlers tell DVT where to store the results of
47 changes: 44 additions & 3 deletions data_validation/__main__.py
@@ -45,13 +45,21 @@


def _get_arg_config_file(args):
"""Return String yaml config file path."""
"""Return String YAML config file path."""
if not args.config_file:
raise ValueError("YAML Config File was not supplied.")

return args.config_file


def _get_arg_config_file_json(args):
"""Return String JSON config file path."""
if not args.config_file_json:
raise ValueError("JSON Config File was not supplied.")

return args.config_file_json


def get_aggregate_config(args, config_manager: ConfigManager):
"""Return list of formated aggregation objects.

@@ -489,6 +497,26 @@ def convert_config_to_yaml(args, config_managers):
return yaml_config


def convert_config_to_json(config_managers) -> dict:
"""Return dict objects formatted for json validations.
JSON configs correspond to ConfigManager objects and therefore can only correspond to
one table validation.

Args:
config_managers (list[ConfigManager]): List of config manager instances.
"""

if len(config_managers) > 1:
raise ValueError(
"JSON configs can only be created for single table validations."
)
config_manager = config_managers[0]
json_config = config_manager.config
json_config[consts.CONFIG_SOURCE_CONN] = config_manager.get_source_connection()
json_config[consts.CONFIG_TARGET_CONN] = config_manager.get_target_connection()
return json_config


def run_validation(config_manager, dry_run=False, verbose=False):
"""Run a single validation.

@@ -552,7 +580,7 @@ def run_validations(args, config_managers):
for config_manager in config_managers:
if config_manager.config and consts.CONFIG_FILE in config_manager.config:
logging.info(
"Currently running the validation for yml file: %s",
"Currently running the validation for YAML file: %s",
config_manager.config[consts.CONFIG_FILE],
)
try:
@@ -580,6 +608,17 @@ def store_yaml_config_file(args, config_managers):
cli_tools.store_validation(config_file_path, yaml_configs)


def store_json_config_file(args, config_managers):
"""Build a JSON config file from the supplied configs.

Args:
args (Namespace): Command line arguments, providing the JSON config file path.
config_managers (list[ConfigManager]): List of config manager instances.
"""
json_config = convert_config_to_json(config_managers)
config_file_path = _get_arg_config_file_json(args)
cli_tools.store_validation_json(config_file_path, json_config)


def partition_and_store_config_files(args: Namespace) -> None:
"""Build multiple YAML Config files using user specified partition logic

@@ -597,7 +636,7 @@

def run(args) -> None:
"""Splits execution into:
1. Build and save single Yaml Config file
1. Build and save single Config file (YAML or JSON)
2. Run Validations

Args:
@@ -609,6 +648,8 @@
config_managers = build_config_managers_from_args(args)
if args.config_file:
store_yaml_config_file(args, config_managers)
elif args.config_file_json:
store_json_config_file(args, config_managers)
else:
run_validations(args, config_managers)

11 changes: 11 additions & 0 deletions data_validation/cli_tools.py
@@ -924,6 +924,11 @@ def _add_common_arguments(optional_arguments, required_arguments):
"-c",
help="Store the validation in the YAML Config File Path specified",
)
optional_arguments.add_argument(
"--config-file-json",
"-cj",
help="Store the validation in the JSON Config File Path specified",
)
optional_arguments.add_argument(
"--format",
"-fmt",
@@ -1074,6 +1079,12 @@ def store_validation(validation_file_name, yaml_config):
mgr.create_validation_yaml(validation_file_name, yaml_config)


def store_validation_json(validation_file_name, json_config):
"""Store the validation JSON config under the given name."""
mgr = state_manager.StateManager()
mgr.create_validation_json(validation_file_name, json_config)


def store_partition(target_file_path, yaml_config, target_folder_path=None):
"""Store the partition YAML config under the given name."""
mgr = state_manager.StateManager(target_folder_path)
1 change: 1 addition & 0 deletions data_validation/consts.py
@@ -19,6 +19,7 @@
SECRET_MANAGER_PROJECT_ID = "secret_manager_project_id"
CONFIG = "config"
CONFIG_FILE = "config_file"
CONFIG_FILE_JSON = "config_file_json"
CONFIG_SOURCE_CONN_NAME = "source_conn_name"
CONFIG_TARGET_CONN_NAME = "target_conn_name"
CONFIG_SOURCE_CONN = "source_conn"
13 changes: 12 additions & 1 deletion data_validation/state_manager.py
@@ -116,6 +116,17 @@ def create_validation_yaml(self, name: str, yaml_config: Dict[str, str]):
yaml_config_str = dump(yaml_config, Dumper=Dumper)
self._write_file(validation_path, yaml_config_str)

def create_validation_json(self, name: str, json_config: Dict[str, str]):
"""Create a validation file and store the given config as JSON.

Args:
name (String): The name of the validation.
json_config (Dict): A dictionary with the validation details.
"""
validation_path = self._get_validation_path(name)
json_config_str = json.dumps(json_config)
self._write_file(validation_path, json_config_str)

def create_partition_yaml(self, target_file_path: str, yaml_config: Dict[str, str]):
"""Create a validation file and store the given config as YAML.

@@ -131,7 +142,7 @@ def get_validation_config(self, name: str, config_dir=None) -> Dict[str, str]:
"""Get a validation configuration from the expected file.

Args:
name: The name of the validation.
name: The name of the validation file.
Returns:
A dict of the validation values from the file.
"""
5 changes: 4 additions & 1 deletion noxfile.py
@@ -69,7 +69,10 @@ def unit(session):
"--cov-config=.coveragerc",
"--cov-report=term",
os.path.join("tests", "unit"),
env={"PSO_DV_CONFIG_HOME": ""},
env={
"PSO_DV_CONFIG_HOME": "",
"PROJECT_ID": os.environ.get("PROJECT_ID", "pso-kokoro-resources"),
},
*session.posargs,
)

53 changes: 53 additions & 0 deletions tests/unit/test__main.py
@@ -19,6 +19,7 @@

from data_validation import cli_tools, consts
from data_validation import __main__ as main
from tests.system.data_sources.test_bigquery import BQ_CONN


TEST_CONN = '{"source_type":"Example"}'
@@ -87,6 +88,33 @@
"kube_completions": True,
"config_dir": "gs://pso-kokoro-resources/resources/test/unit/test__main/4partitions",
}
TEST_JSON_VALIDATION_CONFIG = {
consts.CONFIG_TYPE: "Column",
consts.CONFIG_SOURCE_CONN_NAME: "mock-conn",
consts.CONFIG_TARGET_CONN_NAME: "mock-conn",
consts.CONFIG_TABLE_NAME: "dvt_core_types",
consts.CONFIG_SCHEMA_NAME: "pso_data_validator",
consts.CONFIG_TARGET_SCHEMA_NAME: "pso_data_validator",
consts.CONFIG_TARGET_TABLE_NAME: "dvt_core_types",
consts.CONFIG_LABELS: [],
consts.CONFIG_THRESHOLD: 0.0,
consts.CONFIG_FORMAT: "table",
consts.CONFIG_RESULT_HANDLER: None,
consts.CONFIG_FILTERS: [],
consts.CONFIG_USE_RANDOM_ROWS: False,
consts.CONFIG_RANDOM_ROW_BATCH_SIZE: None,
consts.CONFIG_FILTER_STATUS: ["fail"],
consts.CONFIG_AGGREGATES: [
{
consts.CONFIG_SOURCE_COLUMN: None,
consts.CONFIG_TARGET_COLUMN: None,
consts.CONFIG_FIELD_ALIAS: "count",
consts.CONFIG_TYPE: "count",
},
],
consts.CONFIG_SOURCE_CONN: BQ_CONN,
consts.CONFIG_TARGET_CONN: BQ_CONN,
}


@mock.patch(
@@ -195,3 +223,28 @@ def test_config_runner_3(mock_args, mock_build, mock_run, caplog):
assert mock_run.call_args.args[0].config_dir is None
assert os.path.basename(mock_run.call_args.args[0].config_file) == "0002.yaml"
assert len(mock_run.call_args.args[1]) == 1


@mock.patch(
"data_validation.state_manager.StateManager.get_connection_config",
return_value=BQ_CONN,
)
def test_column_validation_convert_config_to_json(mock_conn):
parser = cli_tools.configure_arg_parser()
args = parser.parse_args(
[
"validate",
"column",
"-sc=mock-conn",
"-tc=mock-conn",
"-tbls=pso_data_validator.dvt_core_types",
"--filter-status=fail",
"--config-file-json=bq-column-validation.json",
]
)
config_managers = main.build_config_managers_from_args(args)
assert len(config_managers) == 1

json_config = main.convert_config_to_json(config_managers)
# Assert the generated config matches the expected structure
assert json_config == TEST_JSON_VALIDATION_CONFIG