# Scaling Data Validation with Kubernetes Jobs

## Nature of Data Validation
Data Validation is by nature a batch process: it is given a set of arguments, the validation is performed, the results are reported, and Data Validation completes. Validation can also take time (multiple seconds or minutes) if a large amount of data needs to be validated.

Data Validation provides the `generate-table-partitions` command, which splits a row validation into a specified number of smaller, equally sized validations. Using this feature, the validation of two large tables can be split into row validations of matching partitions of those tables. See the [partition table PRD](partition_table_prd.md) for details on partitioning. This process generates a sequence of YAML files that can be used to validate the individual partitions.
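As a sketch, generating the partitioned YAML files might look like the following. The connection names, table name, and exact flag spellings are assumptions for illustration; check `data-validation generate-table-partitions --help` for the authoritative interface.

```shell
# Hypothetical invocation: split one row validation into 50 partition
# YAML files under partitions/. Connection and table names are placeholders.
data-validation generate-table-partitions \
  -sc my_source_conn -tc my_target_conn \
  -tbls my_project.my_dataset.big_table \
  --primary-keys id \
  --hash '*' \
  --config-dir partitions/ \
  --partition-num 50
```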

## Kubernetes Workloads
Kubernetes supports several workload types, including a few batch workload types. The Job workload is a batch workload that retries execution until a specified number of runs complete successfully. If a row validation has been split into `n` partitions, we need to validate each partition and merge the results. Using a Kubernetes Job, we therefore need `n` successful completions of the job, as long as we can guarantee that each completion is associated with a different partition. Kubernetes provides a type of Job management called indexed completions that supports the Data Validation use case. A Kubernetes Job can use multiple parallel worker processes; each worker process has an index number, set by the control plane, that identifies which part of the overall task (i.e. which partition) to work on. The index is available in the environment variable `JOB_COMPLETION_INDEX` (in Cloud Run the equivalent variable is `CLOUD_RUN_TASK_INDEX`). An explanation is provided in [Introducing Indexed Jobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/#:~:text=Indexed%20%3A%20the%20Job%20is%20considered,and%20the%20JOB_COMPLETION_INDEX%20environment%20variable).
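From a worker's point of view, the index-to-partition mapping can be sketched as below. The zero-padded `0000.yaml` naming scheme is an assumption about how the partition files are numbered, and the commented `data-validation` invocation is illustrative.

```shell
# Each indexed worker reads its index (JOB_COMPLETION_INDEX on Kubernetes,
# CLOUD_RUN_TASK_INDEX on Cloud Run) and derives its partition file name.
IDX="${JOB_COMPLETION_INDEX:-${CLOUD_RUN_TASK_INDEX:-0}}"
# Assumed naming scheme: partitions numbered 0000.yaml, 0001.yaml, ...
PARTITION_FILE=$(printf '%04d.yaml' "$IDX")
# Outside a Job both variables are unset, so this prints 0000.yaml
echo "$PARTITION_FILE"
# The worker would then validate only its own partition, e.g.:
# data-validation configs run --config-file "partitions/$PARTITION_FILE"
```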

Indexed completion mode supports the partitioned YAML files generated by `generate-table-partitions`, provided each worker process runs only the YAML file corresponding to its index. I have introduced an optional flag, `--kube-completions` or `-kc`, for this purpose. When this flag is passed to `data-validation configs run` with a config directory and the container is running as an indexed Job, each container processes only the validation YAML file corresponding to its index. If the flag is passed to `data-validation configs run` with a config directory but DVT is not running in indexed Job mode, a warning is issued. In all other cases, the flag is ignored.
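For illustration, an indexed Job running one DVT container per partition might look like the sketch below. The image name, completion counts, and config directory are placeholders, and the container's entrypoint is assumed to be the `data-validation` CLI; only the `-kc` flag is part of this change.

```shell
# Hypothetical Job spec: 50 completions, one per partition YAML file.
# completionMode: Indexed gives each pod its JOB_COMPLETION_INDEX.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: dvt-row-validation
spec:
  completionMode: Indexed
  completions: 50        # one per partition file
  parallelism: 10        # validate up to 10 partitions at a time
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: dvt
        image: my-registry/data-validation:latest   # placeholder image
        args: ["configs", "run", "-kc", "--config-dir", "partitions/"]
EOF
```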
### IAM Permissions

### Passing database connection parameters
DVT stores database connection parameters in the `$HOME/.config/google-pso-data-validator` directory, with passwords in plain text. This is insecure and does not support rotation of passwords. A better approach would be to use the (GCP) Secret Manager and retrieve the password only at the time we connect to the database. Currently, DVT uses the Secret Manager only to retrieve secrets when a connection is added, and then stores them in the `.config` directory.

I am proposing a simple change: whenever a connection parameter is specified, allow the user to optionally specify a secret manager (provider, project-id). If a secret manager is specified, DVT retrieves the connection information directly from the secret manager at the time it creates the connection. With this change, DVT can run in a container on Cloud Run or Kubernetes, fetching the connection information from the GCP Secret Manager. Cloud Run currently has a limitation that multiple secrets [cannot be mounted at the same path](https://cloud.google.com/run/docs/configuring/services/secrets#disallowed_paths_and_limitations). Since DVT requires connections to two different databases, with both sets of connection info mounted in the same directory (`$HOME/.config/google-pso-data-validator`), DVT cannot effectively run within Cloud Run.
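Under this proposal, adding a connection might look like the sketch below. The flag names are hypothetical illustrations of the (provider, project-id) pair described above, not an existing interface.

```shell
# Hypothetical flags for the proposed change: tell DVT to resolve the
# connection's secrets from GCP Secret Manager instead of local files.
data-validation connections add \
  --secret-manager-type GCP \
  --secret-manager-project-id my-project \
  --connection-name my_bq_conn \
  BigQuery --project-id my-project
```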
## Future Work