Large table row validation partitioning for memory optimization #598

Closed
nehanene15 opened this issue Oct 6, 2022 · 3 comments
Labels
priority: p0 (Highest priority. Critical issue. Will be fixed prior to next release.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

@nehanene15
Collaborator

nehanene15 commented Oct 6, 2022

For large table row validations, DVT can run into memory/CPU limitations due to the nature of a row-by-row comparison. DVT should have an option to partition a table using filters (WHERE conditions) and spawn a validation config for each partition. That way, a distributed system can validate each table partition at the row level rather than the whole table.

For the first iteration of this issue, we can write sample code that shows how to filter a table based on a numeric PK and run validations based on those filters.
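
As a starting point, here is a minimal Python sketch of that idea; the helper name and the even-range split on a numeric PK are assumptions for illustration, not DVT's implementation:

# Hypothetical helper: split a numeric PK range into N WHERE filters.
def generate_pk_filters(min_pk: int, max_pk: int, partitions: int, pk_col: str = "pk"):
    """Return one WHERE clause per partition covering [min_pk, max_pk]."""
    step = (max_pk - min_pk + 1) // partitions or 1
    filters = []
    for i in range(partitions):
        lower = min_pk + i * step
        # The last partition absorbs any remainder so the full range is covered.
        upper = max_pk if i == partitions - 1 else lower + step - 1
        filters.append(f"{pk_col} BETWEEN {lower} AND {upper}")
    return filters

# Example: 5 filters over ids 1..1000, each usable as the filter for one row validation config.
for f in generate_pk_filters(1, 1000, 5):
    print(f)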

nehanene15 changed the title from "Large table row validation partitioning for scale" to "Large table row validation partitioning for memory optimization" on Oct 6, 2022
nehanene15 added the type: feature request and priority: p1 (High priority. Fix may be included in the next release.) labels on Oct 6, 2022
nehanene15 added the priority: p0 (Highest priority. Critical issue. Will be fixed prior to next release.) label and removed priority: p1 on Oct 14, 2022
@sundar-mudupalli-work
Contributor

sundar-mudupalli-work commented Nov 23, 2022

A simpler solution may be to provide an option such as data-validation --splits=5 (or something equivalent) that tells the data validator to split the validation dataset into 5 parts and validate one part at a time. If users want a distributed solution, they can build that themselves. To split the dataset, I used the following statement in Oracle; I am sure similar queries work in Postgres and BigQuery. The min and max from the SQL statement below provide the filters for each split. It works great with a single primary key, and it may also work for a composite primary key.

-- Per-split boundaries: NTILE(5) assigns each row to one of 5 roughly equal buckets by pk.
SELECT MIN(pk),
       MAX(pk),
       COUNT(*),
       nt
FROM   (SELECT pk,
               NTILE(5) OVER (ORDER BY pk ASC) AS nt
        FROM   my_table)
GROUP  BY nt
ORDER  BY nt ASC;
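
For illustration, a minimal Python sketch (the helper name and row shape are assumptions, not part of DVT) of turning the min/max rows returned by that query into per-split WHERE filters:

# Assumed input: (min_pk, max_pk, row_count, ntile) tuples from the NTILE query above.
def filters_from_ntile_rows(rows, pk_col="pk"):
    # One BETWEEN filter per bucket; the boundaries come straight from the query output.
    return [f"{pk_col} BETWEEN {mn} AND {mx}" for (mn, mx, _count, _nt) in rows]

# Example with made-up boundaries for 3 splits.
print(filters_from_ntile_rows([(1, 333, 333, 1), (334, 666, 333, 2), (667, 1000, 334, 3)]))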

@nehanene15
Collaborator Author

Yeah, that is the idea behind the partitioning. We may support a command such as data-validation generate-batches that creates the YAML configs for each filter and dumps them into a directory. The user can then run a for-loop over each config in the folder (for example, data-validation configs run config_1.yaml), or choose to distribute the runs across nodes.

Looks like Ibis supports NTile here: https://github.com/ibis-project/ibis/blob/0f748e08d6af1ef760b0191e5cdd0ae0170fff64/ibis/expr/operations/analytic.py#L211, which can help construct the query to generate the batches.
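
As a rough sketch of the for-loop idea above, assuming generate-batches wrote one YAML per partition into a local partitions/ directory (the directory name is hypothetical; the configs run command is the one quoted above):

import glob
import subprocess

# Run each partition's config sequentially; these calls could instead be fanned out across nodes.
for config in sorted(glob.glob("partitions/*.yaml")):
    subprocess.run(["data-validation", "configs", "run", config], check=True)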

@nehanene15
Collaborator Author

Closed with PR #653
