I ended up with a memory issue when I tried to do hash-based row validation for a 59 million row table using DVT. The error message looked like this:
"numpy.core._exceptions.MeomoryError: Unable to allocate 247. TiB for an array with shape (33977224160767,) and data type int64"
Do you have any suggestions or recommended practices for doing actual row-to-row validation between source and target systems that have millions or billions of rows?
I am aware that I can run the row hash validation on only a random/stratified sample set from the table, but I am looking for ways to validate the whole table.
Memory is the biggest constraint for whole-table validations on large tables. The alternatives are to either add more memory to the machine running DVT, or to filter the table and validate it in chunks.
We are working on Issue #619, which would support creating table partitions based on a numeric partition key to address this constraint. We recently released support for running multiple YAMLs from a directory (PR #654), so each YAML can represent a portion of the table and the set can be run sequentially. For example, the first config filters on id >= 0 and id < 10; the second on id >= 10 and id < 20; and so on (see the sketch below for one way to generate such filter ranges).
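As a rough illustration of the chunking idea, here is a minimal Python sketch that splits a numeric, monotonically increasing id range into equal-width chunks and prints the filter clause for each one. The helper name `chunk_filters`, the chunk size, and the config file names are illustrative assumptions, not part of the DVT CLI or its YAML schema; the printed clauses would be copied into each per-chunk config's filter.

```python
# Sketch: generate per-chunk filter clauses for a numeric, monotonically
# increasing key. Names and file layout here are illustrative only.

def chunk_filters(min_id: int, max_id: int, chunk_size: int) -> list[str]:
    """Return half-open range filters covering [min_id, max_id]."""
    filters = []
    lower = min_id
    while lower <= max_id:
        upper = lower + chunk_size
        filters.append(f"id >= {lower} and id < {upper}")
        lower = upper
    return filters

if __name__ == "__main__":
    # Example: a 59M-row table split into ~1M-row chunks,
    # assuming ids are roughly dense starting at 0.
    for i, clause in enumerate(chunk_filters(0, 59_000_000, 1_000_000)):
        print(f"config_{i:03d}.yaml -> filter: {clause}")
```

Each resulting YAML then validates a bounded slice of the table, keeping the per-run memory footprint small, and the directory of configs can be run sequentially.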
This has been merged in PR #653. Note that V1 only supports a numeric, monotonically increasing key; we will be working on supporting other partitioning keys as well.