Large table row validation partitioning for memory optimization #598

Closed
nehanene15 opened this issue Oct 6, 2022 · 3 comments
Labels
priority: p0 (Highest priority. Critical issue. Will be fixed prior to next release.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)

Comments

@nehanene15
Collaborator

nehanene15 commented Oct 6, 2022

For large table row validations, DVT can run into memory/CPU limitations due to the nature of a row-by-row comparison. DVT should have an option to partition a table using filters (WHERE conditions) and spawn a validation config for each partition. That way, a distributed system can validate each table partition at the row level rather than the whole table.

For the first iteration of this issue, we can write sample code that shows how to filter a table based on a numeric PK and run validations based on those filters.
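
As a starting point, here is a minimal Python sketch of that idea; the helper name and the even-range split on a numeric PK are assumptions for illustration, not DVT's implementation:

# Hypothetical helper: split a numeric PK range into N WHERE filters.
def generate_pk_filters(min_pk: int, max_pk: int, partitions: int, pk_col: str = "pk"):
    """Return one WHERE clause per partition covering [min_pk, max_pk]."""
    step = (max_pk - min_pk + 1) // partitions or 1
    filters = []
    for i in range(partitions):
        lower = min_pk + i * step
        # The last partition absorbs any remainder so the full range is covered.
        upper = max_pk if i == partitions - 1 else lower + step - 1
        filters.append(f"{pk_col} BETWEEN {lower} AND {upper}")
    return filters

# Example: 5 filters over ids 1..1000, each usable as the filter for one row validation config.
for f in generate_pk_filters(1, 1000, 5):
    print(f)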

nehanene15 changed the title from "Large table row validation partitioning for scale" to "Large table row validation partitioning for memory optimization" on Oct 6, 2022
nehanene15 added the type: feature request and priority: p1 (High priority. Fix may be included in the next release.) labels on Oct 6, 2022
nehanene15 added the priority: p0 (Highest priority. Critical issue. Will be fixed prior to next release.) label and removed priority: p1 on Oct 14, 2022
@sundar-mudupalli-work
Contributor

sundar-mudupalli-work commented Nov 23, 2022

A simpler solution may be to provide an option such as data-validation --splits=5 (or something equivalent) that tells the data validator to split the validation dataset into 5 parts and validate one part at a time. If users want a distributed solution, they can build that themselves. To split the dataset, I used the following statement in Oracle; I am sure similar queries work in Postgres and BigQuery. The min and max from the SQL statement below provide the filters for each split. It works great with a single primary key, and it may also work for a composite primary key.

-- Per-split boundaries: NTILE(5) assigns each row to one of 5 roughly equal buckets by pk.
SELECT MIN(pk),
       MAX(pk),
       COUNT(*),
       nt
FROM   (SELECT pk,
               NTILE(5) OVER (ORDER BY pk ASC) AS nt
        FROM   my_table)
GROUP  BY nt
ORDER  BY nt ASC;
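
For illustration, a minimal Python sketch (the helper name and row shape are assumptions, not part of DVT) of turning the min/max rows returned by that query into per-split WHERE filters:

# Assumed input: (min_pk, max_pk, row_count, ntile) tuples from the NTILE query above.
def filters_from_ntile_rows(rows, pk_col="pk"):
    # One BETWEEN filter per bucket; the boundaries come straight from the query output.
    return [f"{pk_col} BETWEEN {mn} AND {mx}" for (mn, mx, _count, _nt) in rows]

# Example with made-up boundaries for 3 splits.
print(filters_from_ntile_rows([(1, 333, 333, 1), (334, 666, 333, 2), (667, 1000, 334, 3)]))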

@nehanene15
Collaborator Author

Yeah, that is the idea behind the partitioning. We may support a command such as data-validation generate-batches that creates the YAML configs for each filter and dumps them into a directory. The user can then run a for-loop over each config in the folder (for example, data-validation configs run config_1.yaml), or choose to distribute the runs across nodes.

Looks like Ibis supports NTile here: https://github.com/ibis-project/ibis/blob/0f748e08d6af1ef760b0191e5cdd0ae0170fff64/ibis/expr/operations/analytic.py#L211, which can help construct the query to generate the batches.
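
As a rough sketch of the for-loop idea above, assuming generate-batches wrote one YAML per partition into a local partitions/ directory (the directory name is hypothetical; the configs run command is the one quoted above):

import glob
import subprocess

# Run each partition's config sequentially; these calls could instead be fanned out across nodes.
for config in sorted(glob.glob("partitions/*.yaml")):
    subprocess.run(["data-validation", "configs", "run", config], check=True)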

@nehanene15
Collaborator Author

Closed with PR #653
