docs: Add sample shell script and documentation to execute validations at a BigQuery dataset level (#910)

* Create bq-dataset-level-validation.sh

* Add script and README for BQ dataset level validation

* Apply suggestions from code review

Co-authored-by: Neha Nene <[email protected]>

* Add temp file deletion. Update variable names and README

---------

Co-authored-by: Neha Nene <[email protected]>
helensilva14 and nehanene15 committed Aug 2, 2023
1 parent e1d590b commit a84da45
Showing 2 changed files with 67 additions and 0 deletions.
40 changes: 40 additions & 0 deletions samples/bq_utils/README.md
# Helper scripts for BigQuery validations

## Dataset-level

We do not natively support validating an entire BQ dataset. This script is a workaround to accomplish that task.

This script will run validations on all the BigQuery tables within a provided dataset **as long as the table names match between source and target datasets.**

**IMPORTANT:** The script will only run column and schema validations for BigQuery source and target databases.
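
For reference, for each table the script issues one column validation and one schema validation, along these lines (the connection names are the ones created by the script; the table ID is illustrative):

```shell script
# Sketch of the two per-table commands run by the script (my_table is a placeholder):
data-validation validate column -sc my_bq_conn_source -tc my_bq_conn_target \
    -bqrh your-project.pso_data_validator.results \
    -tbls your-project.dataset1.my_table=your-project.dataset2.my_table

data-validation validate schema -sc my_bq_conn_source -tc my_bq_conn_target \
    -bqrh your-project.pso_data_validator.results \
    -tbls your-project.dataset1.my_table=your-project.dataset2.my_table
```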

1. Enter the directory:

```shell script
cd samples/bq_utils/
```

1. Grant execution permissions to the file:

```shell script
chmod u+x bq-dataset-level-validation.sh
```

1. To run the validations, execute the script with the following parameters:

```shell script
./bq-dataset-level-validation.sh [SOURCE_BQ_PROJECT] [SOURCE_BQ_DATASET] [TARGET_BQ_PROJECT] [TARGET_BQ_DATASET] [FULLNAME_BQ_RESULT_HANDLER] <OPTIONAL ARGUMENTS>
```

For example:

```shell script
./bq-dataset-level-validation.sh your-project dataset1 your-project dataset2 your-project.pso_data_validator.results
```

(Optional) Add a filter. Suppose all your tables have a partition timestamp column and you want to validate only a specific timeframe; you can pass the filter as an optional argument:

```shell script
./bq-dataset-level-validation.sh your-project dataset1 your-project dataset2 your-project.pso_data_validator.results "--filters 'partitionTs BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -3 DAY) AND CURRENT_TIMESTAMP()'"
```
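
(Optional) After the run finishes, you can inspect the outcomes in the result handler table. A minimal sketch, assuming the standard DVT results schema (the column names below, such as `run_id` and `validation_status`, come from that schema rather than from this sample):

```shell script
bq query --use_legacy_sql=false \
    'SELECT run_id, validation_type, source_table_name, validation_status
     FROM `your-project.pso_data_validator.results`
     ORDER BY start_time DESC
     LIMIT 20'
```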

27 changes: 27 additions & 0 deletions samples/bq_utils/bq-dataset-level-validation.sh
#!/bin/bash
# Usage: ./bq-dataset-level-validation.sh SOURCE_BQ_PROJECT SOURCE_BQ_DATASET TARGET_BQ_PROJECT TARGET_BQ_DATASET FULLNAME_BQ_RESULT_HANDLER [OPTIONAL ARGUMENTS]

# Get all tables from the source dataset and save them in a temporary file (one table name per line).
# `bq ls` prints a two-line header, so skip it, squeeze repeated spaces, and keep the table-name column.
bq ls --max_results 100 "$1:$2" | tail -n +3 | tr -s ' ' | cut -d' ' -f2 > source_tables.csv

# Create the BQ connection for the source project
data-validation connections add \
    --connection-name my_bq_conn_source BigQuery \
    --project-id "$1"

# Create the BQ connection for the target project
data-validation connections add \
    --connection-name my_bq_conn_target BigQuery \
    --project-id "$3"

input="./source_tables.csv"
# Perform both column and schema validation for every table in the given dataset.
# eval is used so that any quoted optional arguments (e.g. "--filters '...'") are re-parsed as intended.
while IFS= read -r table
do
    command="data-validation validate column -sc my_bq_conn_source -tc my_bq_conn_target -bqrh $5 -tbls $1.$2.$table=$3.$4.$table ${@:6}"
    eval "$command"

    command="data-validation validate schema -sc my_bq_conn_source -tc my_bq_conn_target -bqrh $5 -tbls $1.$2.$table=$3.$4.$table"
    eval "$command"
done < "$input"

# Delete the temporary file
rm "$input"
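
A note on the table-listing step: the script parses the human-readable output of `bq ls`. An alternative, hedged sketch uses `bq`'s CSV output format instead (assuming `--format=csv` emits a single header row followed by one row per table, with `tableId` as the first column):

```shell script
# Hypothetical alternative to the tail/tr/cut pipeline above:
bq ls --format=csv --max_results 100 "$1:$2" | tail -n +2 | cut -d',' -f1 > source_tables.csv
```

This avoids depending on the exact spacing of the pretty-printed table.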
