Added an Implementation Note on how to scale DVT with Kubernetes
Shortened the option to the 2-character code -kc
sundar-mudupalli-work committed Oct 14, 2023
1 parent 0f22e62 commit 7d0ab39
Showing 3 changed files with 20 additions and 3 deletions.
2 changes: 1 addition & 1 deletion data_validation/__main__.py
@@ -303,7 +303,7 @@ def config_runner(args):
config_managers = build_config_managers_from_yaml(args, config_file_path)
else:
if args.kube_completions :
logging.warning("--kube-completions or -kubecomp specified, however not running in Kubernetes Job completion, check your command line")
logging.warning("--kube-completions or -kc specified, however not running in Kubernetes Job completion, check your command line")
mgr = state_manager.StateManager(file_system_root_path=args.config_dir)
config_file_names = mgr.list_validations_in_dir(args.config_dir)
config_managers = []
4 changes: 2 additions & 2 deletions data_validation/cli_tools.py
@@ -345,9 +345,9 @@ def _configure_validation_config_parser(subparsers):
)
run_parser.add_argument(
"--kube-completions",
"-kubecomp",
"-kc",
action="store_true",
help="When validating multiple table partitions generated by generate-table-partitions, use this flag to tell Kubernetes",
help="When validating multiple table partitions generated by generate-table-partitions, using DVT in Kubernetes in index completion mode use this flag so that all the validations are completed",
)

get_parser = configs_subparsers.add_parser(
17 changes: 17 additions & 0 deletions docs/internal/kubernetes_jobs.md
@@ -0,0 +1,17 @@
# Scaling Data Validation with Kubernetes Jobs

## Nature of Data Validation
Data Validation is by nature a batch process. We are presented with a set of arguments, the validation is performed, the results are provided, and Data Validation completes. Data Validation can also take time (multiple seconds or minutes) if a large amount of data needs to be validated.

Data Validation provides the `generate-table-partitions` command, which partitions a row validation into a specified number of smaller, equally sized validations. Using this feature, the validation of two large tables can be split into row validations of the individual partitions of those tables. See the [partition table PRD](partition_table_prd.md) for details on partitioning. This process generates a sequence of yaml files, each of which can be used to validate one partition.

## Kubernetes Workloads
Kubernetes supports different types of workloads, including a few batch workload types. The Job workload is a batch workload that retries execution until a specified number of completions succeed. If a row validation has been split into `n` partitions, we need to validate each partition and merge the results. Using a Kubernetes Job, we need `n` successful completions, as long as we guarantee that each completion is associated with a different partition. Kubernetes provides a form of job management called indexed completion that supports this use case. A Kubernetes Job can run multiple worker processes in parallel, and the control plane assigns each worker an index number that identifies which part of the overall task (i.e. which partition) to work on. The index is available in the environment variable `JOB_COMPLETION_INDEX` (in Cloud Run the environment variable is `CLOUD_RUN_TASK_INDEX`). An explanation is provided in [Introducing Indexed Jobs](https://kubernetes.io/blog/2021/04/19/introducing-indexed-jobs/#:~:text=Indexed%20%3A%20the%20Job%20is%20considered,and%20the%20JOB_COMPLETION_INDEX%20environment%20variable).
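
As a minimal sketch (not DVT code; the directory path and file layout are assumptions for illustration), a worker can map its completion index to one of the partition yaml files like this:

```python
import os
from pathlib import Path

# Kubernetes sets JOB_COMPLETION_INDEX for indexed Jobs;
# Cloud Run sets CLOUD_RUN_TASK_INDEX for its tasks.
index = int(
    os.environ.get("JOB_COMPLETION_INDEX")
    or os.environ.get("CLOUD_RUN_TASK_INDEX")
    or 0
)

# Assumption: one yaml file per partition in a shared config directory.
config_dir = Path("/workspace/partitions")
partition_files = sorted(config_dir.glob("*.yaml"))
print(f"Worker {index} should validate {partition_files[index]}")
```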

Indexed completion mode supports the partitioned yaml files generated by `generate-table-partitions`, provided each worker process runs only the yaml file corresponding to its index. I have introduced an optional flag, `--kube-completions` or `-kc`. When this flag is used with `data-validation configs run` on a config directory and the container is running as an indexed Job, each container processes only the validation yaml file corresponding to its index. If the flag is used with `data-validation configs run` on a config directory but DVT is not running in indexed Job mode, a warning is issued. In all other cases, the flag is ignored.
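
The selection behavior can be summarized with the following sketch (illustrative only; the function and variable names are assumptions, not DVT's actual implementation):

```python
import logging
import os
from typing import List


def select_config_files(kube_completions: bool, config_files: List[str]) -> List[str]:
    """Return the validation yaml files this invocation should process."""
    index = os.environ.get("JOB_COMPLETION_INDEX")
    if kube_completions and index is not None:
        # Indexed Job completion: run only the partition matching this pod's index.
        return [sorted(config_files)[int(index)]]
    if kube_completions:
        # Flag supplied, but we are not running as an indexed Job completion.
        logging.warning(
            "--kube-completions or -kc specified, however not running in "
            "Kubernetes Job completion, check your command line"
        )
    # Default behavior: process every yaml file in the config directory.
    return config_files
```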
### IAM Permissions
### Passing database connection parameters
DVT stores database connection parameters in the `$HOME/.config/google-pso-data-validator` directory, with passwords in plain text. This can be insecure and does not support password rotation. A better approach would be to use the (GCP) Secret Manager and retrieve the credentials only when connecting to the database. Today, DVT uses the Secret Manager to retrieve secrets when connections are added, but then stores them in the `.config` directory.

I am proposing a simple change: whenever a connection parameter is specified, allow the user to optionally specify a secret manager (provider, project-id). If a secret manager is specified, DVT retrieves the connection information directly from the secret manager at the time the connection is created. With this change, DVT can run in a container in Cloud Run or Kubernetes, fetching the connection information from the GCP Secret Manager. Cloud Run currently has a limitation that multiple secrets [cannot be mounted at the same path](https://cloud.google.com/run/docs/configuring/services/secrets#disallowed_paths_and_limitations). Since DVT requires connections to two different databases, with both sets of connection info mounted in the same directory (`$HOME/.config/google-pso-data-validator`), DVT cannot effectively run within Cloud Run.
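
A sketch of the proposed retrieval, using the google-cloud-secret-manager client (the project, secret names, and connection fields below are assumptions for illustration):

```python
from google.cloud import secretmanager


def fetch_secret(project_id: str, secret_id: str) -> str:
    """Fetch the latest version of a secret from GCP Secret Manager."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")


# Illustrative only: resolve the password from Secret Manager when the
# connection is created, instead of reading it from a file on disk.
connection = {
    "source_type": "Postgres",
    "host": "10.0.0.5",
    "user": "dvt",
    "password": fetch_secret("my-project", "dvt-postgres-password"),
}
```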
## Future Work
