fix: Status when source and target agg values are 0 #393

Merged (52 commits, Mar 18, 2022)

Commits
a6d5521 adding addons for impala hive hashing functions (renzokuken, Feb 25, 2022)
82941c2 fix: adding nvl function for hive (nehanene15, Mar 3, 2022)
9444f5d fix: import ops (nehanene15, Mar 3, 2022)
0b8fd47 fix: import fixed_arity (nehanene15, Mar 3, 2022)
a5e218c merge develop (nehanene15, Mar 3, 2022)
babb571 override class (nehanene15, Mar 3, 2022)
e98b7f5 override class (nehanene15, Mar 3, 2022)
72f1675 override class (nehanene15, Mar 3, 2022)
2d2c5ed override class (nehanene15, Mar 3, 2022)
f844b0c move logic to ibis_addon (nehanene15, Mar 3, 2022)
8f1b4ee replacing isnull with nvl (nehanene15, Mar 3, 2022)
fb81cba adding ifnull (nehanene15, Mar 3, 2022)
4d440a4 adding isnull (nehanene15, Mar 3, 2022)
42fd92f adding unaryop import (nehanene15, Mar 3, 2022)
a8532e1 adding nvl function (nehanene15, Mar 3, 2022)
e3cfefc trying FillNa (nehanene15, Mar 4, 2022)
b1fe078 test FillNa (nehanene15, Mar 4, 2022)
6c8be5c test FillNa (nehanene15, Mar 4, 2022)
c4acd0b missing import (nehanene15, Mar 4, 2022)
8927348 debug (nehanene15, Mar 4, 2022)
1e38473 debug (nehanene15, Mar 4, 2022)
0634c8c debug (nehanene15, Mar 4, 2022)
3d43857 debug (nehanene15, Mar 4, 2022)
55ed779 debug (nehanene15, Mar 4, 2022)
45586c4 debug (nehanene15, Mar 4, 2022)
ef5655b debug (nehanene15, Mar 4, 2022)
6213e00 debug (nehanene15, Mar 4, 2022)
68221b9 debug (nehanene15, Mar 4, 2022)
6f550b7 debug (nehanene15, Mar 4, 2022)
5081748 debug (nehanene15, Mar 4, 2022)
cc8329c debug (nehanene15, Mar 7, 2022)
fc9b6b9 debug (nehanene15, Mar 8, 2022)
27ebe06 debug (nehanene15, Mar 8, 2022)
179960d debug (nehanene15, Mar 8, 2022)
af39f87 debug (nehanene15, Mar 8, 2022)
c965dcc debug (nehanene15, Mar 8, 2022)
f0bee9c debug (nehanene15, Mar 8, 2022)
e4d3236 debug (nehanene15, Mar 8, 2022)
1ac5d90 debug (nehanene15, Mar 8, 2022)
787a6be debug (nehanene15, Mar 8, 2022)
499c585 debug (nehanene15, Mar 8, 2022)
2d31103 debug (nehanene15, Mar 8, 2022)
de5c097 debug (nehanene15, Mar 9, 2022)
2f2e135 docs: updates (nehanene15, Mar 14, 2022)
54fa810 update docs (nehanene15, Mar 14, 2022)
bd6bb02 support tinyint and smallint types (nehanene15, Mar 18, 2022)
91b24ad fix: doc updates, status bug fix (nehanene15, Mar 18, 2022)
c2f757d merge conflicts (nehanene15, Mar 18, 2022)
c3a0fe6 fix:merge conflicts (nehanene15, Mar 18, 2022)
4363054 fix: lint (nehanene15, Mar 18, 2022)
190d1a6 fix: update unit test (nehanene15, Mar 18, 2022)
1810fd8 lint (nehanene15, Mar 18, 2022)
25 changes: 17 additions & 8 deletions README.md
@@ -1,6 +1,6 @@
# Data Validation Tool

The Data Validation Tool (DVT) is an open sourced Python CLI tool based on the
The Data Validation Tool (Beta) is an open sourced Python CLI tool based on the
[Ibis framework](https://ibis-project.org/docs/tutorial/01-Introduction-to-Ibis.html)
that compares heterogeneous data source tables with multi-leveled validation
functions.
@@ -9,7 +9,7 @@ Data validation is a critical step in a Data Warehouse, Database or Data Lake
migration project, where structured or semi-structured data from both the source
and the destination tables are compared to ensure they are matched and correct
after each migration step (e.g. data and schema migration, SQL script
translation, ETL migration, etc.). The Data Validation Tool provides an
translation, ETL migration, etc.). The Data Validation Tool (DVT) provides an
automated and repeatable solution to perform this task.

DVT supports the following validation types: * Table level * Table row count *
@@ -136,11 +136,20 @@ used to run powerful validations without writing any queries.

#### Row Validations

(Note: Row hash validation is currently only supported for BigQuery, Teradata, and Impala/Hive)

Below is the command syntax for row validations. In order to run row level
validations you need to pass a `--primary-keys` flag which defines what field(s)
the validation will be compared along, as well as a `--comparison-fields` flag
which specifies the values (e.g. columns) whose raw values will be compared
based on the primary key join. Additionally you can use
the validation will be compared on, as well as either the `--comparison-fields` flag
or the `--hash` flag.

The `--comparison-fields` flag specifies the values (e.g. columns) whose raw values will be compared
based on the primary key join. The `--hash` flag will run a checksum across all columns in
the table. This will include casting to string, sanitizing the data, concatenating, and finally
hashing the row. To exclude columns from the checksum, use the YAML config to customize the validation.
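
As a rough illustration of that checksum pipeline, here is a minimal Python sketch. The exact casting, sanitization rules, separator, and hash function are engine-specific DVT internals, so every detail below is an assumption for illustration only:

```python
import hashlib

def row_hash(row: dict) -> str:
    # Cast every value to a string (NULLs become empty strings), sanitize
    # with a simple strip, concatenate with a separator, then hash.
    parts = ["" if value is None else str(value).strip() for value in row.values()]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

print(row_hash({"id": 1, "name": " Alice ", "city": None}))
```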


Additionally you can use
[Calculated Fields](#calculated-fields) to compare derived values such as string
counts and hashes of multiple columns.

@@ -156,12 +165,12 @@ data-validation (--verbose or -v) validate row
Comma separated list of tables in the form schema.table=target_schema.target_table
Target schema name and table name are optional.
e.g. 'bigquery-public-data.new_york_citibike.citibike_trips'
[--primary-keys or -pk PRIMARY_KEYS]
--primary-keys or -pk PRIMARY_KEYS
Comma separated list of columns to use as primary keys
[--comparison-fields or -fields comparison-fields]
--comparison-fields or -comp-fields FIELDS
Comma separated list of columns to compare. Can either be a physical column or an alias
See: *Calculated Fields* section for details
[--hash COLUMNS] Comma separated list of columns to perform a hash operation on or * for all columns
--hash '*' '*' to hash all columns. To exclude columns, use the YAML config.
[--bq-result-handler or -bqrh PROJECT_ID.DATASET.TABLE]
BigQuery destination for validation results. Defaults to stdout.
See: *Validation Reports* section
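
For instance, a full-row hash validation might be invoked as below. This is a sketch: the connection names, table, and key column are placeholders, and the `-sc`/`-tc` source and target connection flags are assumed from parts of the docs not shown in this diff:

```
data-validation validate row \
  -sc my_source_conn -tc my_target_conn \
  --tables-list my_schema.my_table \
  --primary-keys id \
  --hash '*'
```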
11 changes: 10 additions & 1 deletion data_validation/combiner.py
@@ -115,7 +115,8 @@ def _calculate_difference(field_differences, datatype, validation, is_value_comp
)
else:
difference = (target_value - source_value).cast("float64")
pct_difference = (

pct_difference_nonzero = (
ibis.literal(100.0)
* difference
/ (
Expand All @@ -126,6 +127,14 @@ def _calculate_difference(field_differences, datatype, validation, is_value_comp
).cast("float64")
).cast("float64")

# Handles the case where the source and target agg values are both 0
pct_difference = (
ibis.case()
.when(difference == ibis.literal(0), ibis.literal(0).cast("float64"))
.else_(pct_difference_nonzero)
.end()
)

th_diff = (pct_difference.abs() - pct_threshold).cast("float64")
status = (
ibis.case()
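
The intent of the new case expression, sketched in plain Python rather than Ibis. The denominator's guard for a zero source value is elided in the hunk above, so it is simplified here:

```python
def pct_difference(source_value: float, target_value: float) -> float:
    """Plain-Python mirror of the Ibis case expression above (simplified)."""
    difference = target_value - source_value
    # When the aggregates are equal, including when both are 0, report 0.0
    # instead of evaluating 0 / 0, which produced NaN and a spurious failure.
    if difference == 0:
        return 0.0
    # NOTE: the real expression also guards a zero source value in the
    # denominator; that part of the hunk is elided, so it is omitted here.
    return 100.0 * difference / source_value

assert pct_difference(0, 0) == 0.0     # previously NaN
assert pct_difference(50, 55) == 10.0
```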
2 changes: 1 addition & 1 deletion data_validation/consts.py
@@ -120,7 +120,7 @@
RESULT_TYPE_TARGET = "target"

# Ibis Object Info
NUMERIC_DATA_TYPES = ["float64", "int32", "int64", "decimal"]
NUMERIC_DATA_TYPES = ["float64", "int8", "int16", "int32", "int64", "decimal"]

FORMAT_TYPES = ["csv", "json", "table", "text"]

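
For context on the two new entries: a hedged sketch of how such a list is typically consulted. The `is_numeric` helper is hypothetical, not DVT code:

```python
NUMERIC_DATA_TYPES = ["float64", "int8", "int16", "int32", "int64", "decimal"]

def is_numeric(ibis_type: str) -> bool:
    # Hive/Impala TINYINT surfaces in Ibis as int8 and SMALLINT as int16,
    # so both must be listed for aggregate validations to treat them as numeric.
    return ibis_type in NUMERIC_DATA_TYPES

assert is_numeric("int8") and is_numeric("int16")
```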
8 changes: 4 additions & 4 deletions docs/connections.md
@@ -268,10 +268,10 @@ Please note that for Group By validations, the following property must be set in

`set hive:hive.groupby.orderby.position.alias=true`

If you are running Hive on Dataproc, you will also need to run
`pip install ibis-framework[impala]`

Currently only INT, BIGINT, FLOAT, and DOUBLE data types are supported for Hive aggregation.
If you are running Hive on Dataproc, you will also need to install the following:
```
pip install ibis-framework[impala]
```

```
{
3 changes: 1 addition & 2 deletions tests/unit/test_data_validation.py
@@ -13,7 +13,6 @@
# limitations under the License.

import json
import numpy
import pandas
import pytest
import random
@@ -501,7 +500,7 @@ def test_zero_both_values(module_under_test, fs):
col_a_result_df = result_df[result_df.validation_name == "count_col_a"]
col_a_pct_diff = col_a_result_df.pct_difference.values[0]

assert numpy.isnan(col_a_pct_diff)
assert col_a_pct_diff == 0.0


def test_status_success_validation(module_under_test, fs):
2 changes: 0 additions & 2 deletions third_party/ibis/ibis_addon/operations.py
@@ -43,7 +43,6 @@
# from third_party.ibis.ibis_snowflake.compiler import SnowflakeExprTranslator
# from third_party.ibis.ibis_oracle.compiler import OracleExprTranslator <<<<<< DB2


class BitXor(Reduction):
"""Aggregate bitwise XOR operation."""

@@ -124,7 +123,6 @@ def format_hashbytes_teradata(translator, expr):
else:
raise ValueError(f"unexpected value for 'how': {how}")


def format_hashbytes_hive(translator, expr):
arg, how = expr.op().args
compiled_arg = translator.translate(arg)
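
The hunk cuts off above. For orientation, a hedged sketch of how such a Hive translation rule typically completes, modeled on the Teradata variant shown earlier; the exact SQL function and supported `how` values are assumptions, not the repo's code:

```python
def format_hashbytes_hive(translator, expr):
    arg, how = expr.op().args
    compiled_arg = translator.translate(arg)
    if how == "sha256":
        # Hive exposes SHA-2 as sha2(expr, bit_length)
        return f"sha2({compiled_arg}, 256)"
    raise ValueError(f"unexpected value for 'how': {how}")
```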
1 change: 0 additions & 1 deletion third_party/ibis/ibis_impala/api.py
@@ -22,7 +22,6 @@

_impala_to_ibis_type = udf._impala_to_ibis_type


def impala_connect(
host=None,
port=10000,