Large table row validation partitioning for memory optimization #598
Comments
A simpler solution may be to provide an option such as data-validation --splits=5, or something equivalent, telling the data validator to split the validation dataset into 5 parts and validate one part at a time. If users want a distributed solution, they can build that themselves. To split the dataset, I used a statement in Oracle; I am sure you can do similar things in Postgres and BigQuery. The min and max from that SQL statement provide the filters for each split. It works great with a single primary key, and it may also work for a composite primary key.
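The Oracle statement itself was not captured in this thread. As a hedged illustration of the approach described (one bucket per split, with each bucket's min and max primary-key values becoming a filter), the sketch below builds an NTILE boundary query and turns its results into per-split filters. The table name orders, the column order_id, and the helper filters_from_boundaries are placeholders, not the commenter's actual SQL or code.

```python
# Hedged reconstruction of the described approach, not the commenter's SQL.
# "orders" and "order_id" are placeholder names for the table and its
# single-column numeric primary key.
SPLITS = 5

BOUNDARY_SQL = f"""
SELECT split_id,
       MIN(order_id) AS min_pk,
       MAX(order_id) AS max_pk
FROM (
    SELECT order_id,
           NTILE({SPLITS}) OVER (ORDER BY order_id) AS split_id
    FROM orders
)
GROUP BY split_id
ORDER BY split_id
"""

def filters_from_boundaries(rows):
    """Turn (split_id, min_pk, max_pk) rows returned by BOUNDARY_SQL into
    WHERE-style filters, one per split, so each part of the dataset can be
    validated on its own."""
    return [
        f"order_id >= {min_pk} AND order_id <= {max_pk}"
        for _split_id, min_pk, max_pk in rows
    ]
```

Running the boundary query once against the source yields one (min, max) pair per split; applying the same filters to both source and target lets each split be compared independently.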
Yeah, that is the idea behind the partitioning. We may support a command along these lines. Looks like Ibis supports NTile here: https://github.com/ibis-project/ibis/blob/0f748e08d6af1ef760b0191e5cdd0ae0170fff64/ibis/expr/operations/analytic.py#L211
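If the partition boundaries were generated through Ibis rather than hand-written SQL, the expression might look roughly like the sketch below. This is only an illustration: the unbound table, the column names, and the assumption that the linked NTile op is reachable as an ntile() method on column expressions are mine, and the exact API may differ between Ibis versions.

```python
import ibis

# Unbound table standing in for the real source table; the schema and name
# are placeholders for illustration.
t = ibis.table([("id", "int64"), ("amount", "float64")], name="orders")

# Assumption: NTile is exposed as a column-level ntile() method, combined
# with a window ordered by the primary key, assigning each row to one of
# 5 buckets.
w = ibis.window(order_by=t.id)
bucketed = t.mutate(split_id=t.id.ntile(5).over(w))

# Min/max primary key per bucket; these pairs become the per-partition
# WHERE filters used by each row-validation config.
boundaries = bucketed.group_by("split_id").aggregate(
    min_pk=bucketed.id.min(),
    max_pk=bucketed.id.max(),
)
```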
Closed with PR #653
For large table row validations, DVT can run into memory/CPU limitations due to the nature of a row-by-row comparison. DVT should have an option to partition a table using filters (WHERE conditions) and spawn a validation config for each partition. This way, a distributed system can validate a table partition rather than a whole table at the row level.
For the first iteration of this issue, we can write sample code that shows how to filter a table based on a numeric PK and run validations based on those filters.
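As a starting point for that sample code, here is a minimal sketch under stated assumptions: it splits a numeric primary-key range into equal-width partitions and emits one WHERE-style filter per partition, which a caller could then attach to per-partition row-validation configs. The function name, the default column name, and the example range are illustrative, not part of DVT's actual interface.

```python
# Minimal sketch of the "first iteration" idea: split a numeric primary-key
# range into equal-width partitions and emit one WHERE-style filter per
# partition. Names and the example range are placeholders.
def partition_filters(min_pk: int, max_pk: int, partitions: int, pk_col: str = "id"):
    """Yield one filter string per partition covering [min_pk, max_pk]."""
    width = (max_pk - min_pk + 1) // partitions or 1
    for i in range(partitions):
        lo = min_pk + i * width
        # The last partition absorbs any remainder so the full range is covered.
        hi = max_pk if i == partitions - 1 else lo + width - 1
        yield f"{pk_col} >= {lo} AND {pk_col} <= {hi}"

if __name__ == "__main__":
    for f in partition_filters(1, 1_000_000, 5):
        # Each filter would be attached to its own row-validation config so a
        # worker validates only that slice of the table.
        print(f)
```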