docs: Distributed DVT Cloud Run Jobs sample (#1133)
* initial set up for Cloud Run Jobs quickstart

* docs: add cloud run jobs sample

* docs: clean up

* docs: updates from review
nehanene15 committed May 20, 2024
1 parent 15c81cf commit f51f327
Showing 15 changed files with 152 additions and 108 deletions.
README.md (16 changes: 7 additions & 9 deletions)
@@ -435,7 +435,7 @@ For example, this flag can be used as follows:

Running DVT with YAML configuration files is the recommended approach if:
* you want to customize the configuration for any given validation OR
-* you want to run DVT at scale (i.e. row validations across many partitions)
+* you want to run DVT at scale (i.e. run multiple validations sequentially or in parallel)

We recommend generating YAML configs with the `--config-file <file-name>` flag when running a validation command, which supports
GCS and local paths.
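As a minimal sketch of that recommendation (the connection names, table, and bucket below are placeholders, not part of this commit), a validation can be captured to YAML:

```
# Generate a YAML config for this validation;
# --config-file supports local and gs:// paths.
data-validation validate row \
  -sc my_source_conn -tc my_target_conn \
  -tbls my_schema.my_table \
  --primary-keys id \
  --hash '*' \
  --config-file gs://my-bucket/validations/my_table.yaml
```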
@@ -451,7 +451,7 @@
```
data-validation (--verbose or -v) (--log-level or -ll) configs run
[--dry-run or -dr] If this flag is present, prints the source and target SQL generated in lieu of running the validation.
[--kube-completions or -kc]
Flag to indicate usage in Kubernetes index completion mode.
-                        See *Scaling DVT to run 10's to 1000's of validations concurrently* section
+                        See *Scaling DVT* section
```

@@ -469,19 +469,17 @@ View the complete YAML file for a Grouped Column validation on the
[Examples](https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/docs/examples.md#sample-yaml-config-grouped-column-validation) page.


-#### Scaling DVT to run 10's to 1000's of validations concurrently
+### Scaling DVT

-As described above, you can run multiple validation configs sequentially with the `--config-dir` flag. To optimize the validation speed for large tables, you can run DVT concurrently using GKE Jobs ([Google Kubernetes Jobs](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-workloads-overview#batch_jobs)) or [Cloud Run Jobs](https://cloud.google.com/run/docs/create-jobs). If you are not familiar with Kubernetes or Cloud Run Jobs, see [Scaling DVT with Kubernetes](docs/internal/kubernetes_jobs.md) for a detailed overview.
+You can scale DVT to validate large tables by running it in a distributed manner. To optimize validation speed, you can use GKE Jobs ([Google Kubernetes Engine Jobs](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-workloads-overview#batch_jobs)) or [Cloud Run Jobs](https://cloud.google.com/run/docs/create-jobs). If you are not familiar with Kubernetes or Cloud Run Jobs, see [Scaling DVT with Distributed Jobs](https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/docs/internal/distributed_jobs.md) for a detailed overview.

-In order to validate partitions concurrently, both the `--kube-completions` and `--config-dir` flags are required. The `--kube-completions` flag specifies that the validation is being run in indexed completion mode in Kubernetes or as multiple independent tasks in Cloud Run. The `--config-dir` flag will specify the directory with the YAML files to be executed in parallel. If you used `generate-table-partitions` to generate the YAMLs, this would be the directory where the partition files numbered `0000.yaml` to `<partition_num - 1>.yaml` are stored i.e (`my_config_dir/source_schema.source_table/`)

-In order to run DVT in a container, you have to build a Docker image - [see instructions here](https://github.com/GoogleCloudPlatform/professional-services-data-validator/tree/develop/samples/docker#readme). You will also need to set the `PSO_DV_CONN_HOME` environment variable to point to a GCS directory where the connection configuration files are stored. In Cloud Run you can use the option `--set-env-vars` or `--update-env-vars` to pass [the environment variable](https://cloud.google.com/run/docs/configuring/services/environment-variables#setting). We recommend that you use the `bq-result-handler` to save your validation results to BigQuery.
+We recommend first generating table partitions with the `generate-table-partitions` command for your large tables. Then, use Cloud Run or GKE to validate each partition in parallel. See the [Cloud Run Jobs Quickstart sample](https://github.com/GoogleCloudPlatform/professional-services-data-validator/tree/develop/samples/cloud_run_jobs) to get started.
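For that first step, a hedged sketch (the connections, table, bucket, and partition count are placeholders):

```
# Split one large row validation into 50 equally sized partition configs.
# The YAMLs land under gs://my-bucket/partitions/my_schema.my_table/
# as 0000.yaml through 0049.yaml.
data-validation generate-table-partitions \
  -sc my_source_conn -tc my_target_conn \
  -tbls my_schema.my_table \
  --primary-keys id \
  --hash '*' \
  --config-dir gs://my-bucket/partitions \
  --partition-num 50
```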

-The Cloud Run and Kubernetes pods must run in a network with access to the database servers. Every Cloud Run job is associated with a [service account](https://cloud.google.com/run/docs/securing/service-identity). You need to ensure that this service account has access to Google Cloud Storage (to read connection configuration and YAML files) and BigQuery (to publish results). If you are using Kubernetes, you will need to use a service account with the same privileges as mentioned for Cloud Run. In Kubernetes, you will need to set up [workload identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) so the DVT container can impersonate the service account.
+When running DVT in a distributed fashion, both the `--kube-completions` and `--config-dir` flags are required. The `--kube-completions` flag indicates that the validation is running in indexed completion mode in Kubernetes or as multiple independent tasks in Cloud Run. If the `-kc` flag is set but the job is not running in indexed mode, DVT issues a warning and the container processes all the validations sequentially. DVT also issues a warning if `-kc` is set and a single `--config-file` is provided instead of a config directory.
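Inside the container, the two flags combine into a single command; a sketch with placeholder paths:

```
# Run in indexed completion mode: each task validates only the YAML
# whose number matches its own task index.
data-validation configs run \
  --kube-completions \
  --config-dir gs://my-bucket/partitions/my_schema.my_table/
```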

-In Cloud Run, the [job](https://cloud.google.com/run/docs/create-jobs) must be run as multiple, independent tasks with the task count set to the number of partitions generated. In Kubernetes, set the number of completions to the number of partitions generated - see [Kubernetes Parallel Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs). The option `--kube-completions or -kc` tells DVT that many DVT containers are running in a Kubernetes cluster. Each DVT container only validates the specific partition YAML (based on the index assigned by Kubernetes control plane). If the `-kc` option is used and you are not running in indexed mode, you will receive a warning and the container will process all the validations sequentially. If the `-kc` option is used and a config directory is not provided (a `--config-file` is provided instead), a warning is issued.
+The `--config-dir` flag specifies the directory with the YAML files to be executed in parallel. If you used `generate-table-partitions` to generate the YAMLs, this is the directory where the partition files numbered `0000.yaml` to `<partition_num - 1>.yaml` are stored, e.g. `gs://my_config_dir/source_schema.source_table/`. When creating your Cloud Run job, set the number of tasks equal to the number of table partitions so that each task index matches a YAML file to be validated. When executed, each Cloud Run task validates one partition, and the tasks run in parallel.

By default, each partition validation is retried up to 3 times if there is an error. In Kubernetes and Cloud Run, you can set the parallelism to whatever value you want. Keep in mind that if you are validating thousands of partitions, setting the parallelism too high (say, 100) may result in timeouts and slow down the validation.
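Putting the Cloud Run pieces together, a hedged sketch using standard `gcloud run jobs` flags (the job name, image, region, and bucket are placeholders):

```
# One task per partition YAML; bounded parallelism to avoid overloading
# the databases; up to 3 retries per failed task.
gcloud run jobs create dvt-row-validation \
  --image us-docker.pkg.dev/my-project/dvt/data-validation:latest \
  --region us-central1 \
  --tasks 50 \
  --parallelism 10 \
  --max-retries 3 \
  --set-env-vars PSO_DV_CONN_HOME=gs://my-bucket/connections \
  --command data-validation \
  --args configs,run,--kube-completions,--config-dir,gs://my-bucket/partitions/my_schema.my_table/

gcloud run jobs execute dvt-row-validation --region us-central1
```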

### Validation Reports

docs/examples.md (4 changes: 2 additions & 2 deletions)
@@ -51,10 +51,10 @@
````shell script
data-validation generate-table-partitions \
--primary-keys tree_id \
--hash '*' \
--filters 'tree_id>3000' \
-  -cdir partitions_dir \
+  --config-dir partitions_dir \
--partition-num 200
````
-Above command creates multiple partitions based on `--partition-key`. Number of generated configuration files is decided by `--partition-num`
+The above command creates multiple partitions based on the primary key. The number of generated configuration files is set by `--partition-num`.

docs/internal/kubernetes_jobs.md → docs/internal/distributed_jobs.md
@@ -1,22 +1,28 @@
-# Scaling Data Validation with Kubernetes Jobs
+# Scaling Data Validation with Distributed Jobs

## Nature of Data Validation
Data Validation by nature is a batch process. We are presented with a set of arguments, the validation is performed, the results are provided and Data Validation completes. Data Validation can also take time (multiple seconds or minutes) if a large amount of data needs to be validated.

Data Validation has the `generate-table-partitions` function that partitions a row validation into a specified number of smaller, equally sized validations. Using this feature, validation of large tables can be split into row validations of many partitions of the same tables. See [partition table PRD](partition_table_prd.md) for details on partitioning. This process generates a sequence of yaml files which can be used to validate the individual partitions.

-## Kubernetes Workloads
+## GKE and Cloud Run Jobs

Kubernetes supports different types of workloads, including a few batch workload types. The Job workload is a batch workload that retries execution until a specified number of completions succeed. If a row validation has been split into `n` partitions, then we need to validate each partition and merge the results of the validation. Using Kubernetes Jobs, we need to successfully run `n` completions of the job, as long as we guarantee that each completion is associated with a different partition. Since release 1.21, Kubernetes provides a type of job management called indexed completions that supports the Data Validation use case. A Kubernetes job can use multiple parallel worker processes, and the control plane assigns each worker process an index number that identifies which part of the overall task (i.e. which partition) to work on. Cloud Run uses slightly different terminology for a similar model: it refers to each job completion as a task and to all the tasks together as the job. The index is available in the environment variable `JOB_COMPLETION_INDEX` (in Cloud Run, `CLOUD_RUN_TASK_INDEX`). An explanation is provided in [Introducing Indexed Jobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/).

Indexed completion mode supports partitioned yaml files generated by the `generate-table-partitions` command if each worker process runs only the yaml file corresponding to its index. The `--kube-completions` or `-kc` flag, when running configs from a directory, indicates that the validation is running in indexed jobs mode on Kubernetes or Cloud Run. This way, each container only processes the specific validation yaml file corresponding to its index.
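As a rough illustration (not DVT's literal implementation), resolving a partition file from the completion index amounts to:

```
# Kubernetes sets JOB_COMPLETION_INDEX; Cloud Run sets CLOUD_RUN_TASK_INDEX.
INDEX="${CLOUD_RUN_TASK_INDEX:-${JOB_COMPLETION_INDEX:-0}}"
# Task 7 of a 50-task job maps to 0007.yaml.
CONFIG="$(printf '%04d.yaml' "$INDEX")"
data-validation configs run \
  --config-file "gs://my-bucket/partitions/my_schema.my_table/${CONFIG}"
```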

### Passing database connection parameters

Usually, DVT stores database connection parameters in the `$HOME/.config/google-pso-data-validator` directory. If the environment variable `PSO_DV_CONN_HOME` is set, the connection information is fetched from that location instead. When running DVT in a container, there are two options for where to locate the database connection parameters.
-1. Build the container image with the database connection parameters. This is not recommended because the container image is specific to the databases being used OR
-2. Build a DVT container image as specified in the [Docker samples directory](../../samples/docker/README.md). When running the container, pass the `PSO_DV_CONN_HOME` environment variable to point to a GCS location containing the connection configuration. This is the suggested approach.
-
-## Future Work
-### Use GCP secret manager to store database connection configuration
-DVT currently uses the GCP secret manager to secure each element of the database connection configuration. The current implementation stores the name of the secret for each element (host, port, user, password etc) in the database configuration file. When the connection needs to be made, the values are retrieved from the secret manager and used to connect to the database. While this is secure, database configurations are still stored in the filesystem.
+1. Build the container image with the database connection parameters. This is not recommended because the container image would be limited to the databases being used.
+1. Build a DVT container image as specified in the [Cloud Run Jobs samples directory](../../samples/cloud_run_jobs/README.md). When creating a job, specify the `PSO_DV_CONN_HOME` environment variable to point to a GCS location containing the connection configuration. This is the suggested approach.
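For example (bucket path hypothetical), the variable can be set locally for testing or attached to an existing Cloud Run job:

```
# Point DVT at connection configs in GCS instead of $HOME/.config/...
export PSO_DV_CONN_HOME=gs://my-bucket/connections

# Or update a deployed Cloud Run job with the same variable:
gcloud run jobs update dvt-row-validation \
  --region us-central1 \
  --update-env-vars PSO_DV_CONN_HOME=gs://my-bucket/connections
```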

+## Future Work
+
+### Use GCP Secret Manager to store database connection configuration
+
+DVT currently uses the GCP Secret Manager to secure each element of the database connection configuration. The current implementation stores the name of the secret for each element (host, port, user, password etc.) in the database configuration file. When the connection needs to be made, the values are retrieved from the Secret Manager and used to connect to the database. While this is secure, connection configurations referencing secrets are still stored in the file system.

-Another way to use the secret manager is to store the complete connection configuration as the value of the secret. Then while specifying the connection, we can also indicate to DVT to look for the connection configuration in the secret manager rather than the file system. Since most DVT commands connect to one or more databases, nearly every DVT command can take optional arguments that tells DVT to look for the connection configuration in the secret manager.
+Another way to use the Secret Manager is to store the complete connection configuration as the value of the secret. Then, while specifying the connection, DVT can look for the connection configuration in the Secret Manager rather than the file system.
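A workable sketch of that proposal with standard `gcloud secrets` commands (the secret name and connection file are hypothetical, and the connection-file layout is assumed from DVT's config directory):

```
# Store an entire DVT connection configuration as one secret value.
gcloud secrets create dvt-my-source-conn --data-file=connection.json

# Read it back, e.g. from a wrapper script, into the directory DVT reads.
gcloud secrets versions access latest --secret=dvt-my-source-conn \
  > "$HOME/.config/google-pso-data-validator/my_source_conn.connection.json"
```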