Filters are not applied before generating partitions #950

florisvink · 2023-08-22T07:44:11Z

In the documentation is states that the generate partitions command works the same way as validate rows. It seems however that custom filters or a custom sql query source/target are not applied first when generating partitions and so partitions will be generated based on the full source- and target table. In our situation we want to validate the last 6 months of data in a multi billion row dataset so we run the partition generation like:

data-validation generate-table-partitions 
    -sc rs -tc bq \
    -tbls huge_source_table=huge_target_table \ 
    --primary-keys source_order_id,platform_id,order_transaction,source_customer_id
    --comparison-fields restaurant,branch,is_chain,chain_name,account_id,... \
    --partition-num 1000 --filter-status fail \
    --bq-result-handler bq_result_table \
    --filter "order_date_time > '2023-02-15' and platform_id=9"
    -cdir /targetdir_for_partitions

we got partitioned configs like:

...
  filters:
  - source: order_date_time > '2023-02-15' and platform_id=9
    target: order_date_time > '2023-02-15' and platform_id=9
    type: custom
  - source: source_order_id < '447191446' OR source_order_id = '447191446' AND (platform_id
      < 9 OR platform_id = 9 AND (order_transaction < 'Transaction' OR order_transaction
      = 'Transaction' AND (source_customer_id < '1772063')))
    target: source_order_id < '447191446' OR source_order_id = '447191446' AND (platform_id
      < 9 OR platform_id = 9 AND (order_transaction < 'Transaction' OR order_transaction
      = 'Transaction' AND (source_customer_id < '1772063')))
    type: custom
...

As you can see there are two filters:

one for the filtering (order_date_time > '2023-02-15' and platform_id=9)
one for partition windows (source_order_id < '447191446' OR source_order_id = '447191446' .... )
effectively resulting in a partition with very few rows (sometimes even none at all).

I think this is caused by the filter not being in place when running the windowing-query for partition generation

The text was updated successfully, but these errors were encountered:

…bis to turn table expressions into SQL statements. This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed a problem with datetime being rendered by compile - see https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

and #950

* fix: Change in filter tag creation * fix: Change in test files to integrate double quotes in string * fix: Minor Fix * fix: Fixing lint error * fix: Fixing linting issue * fix: typo * Fix: fixing date issue and removing string logic and its tests * fix: suggested changes * feat: Adding integration tests for BQ, Hive, Teradata * Update partition_table_prd.md potential way to address #923 and #950 * Fixed bug with specifying a config file in GCS - this did not work earlier Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index. Tested in a non Kubernetes setup * Updated function to create the filter logic by using the feature of ibis to turn table expressions into SQL statements. This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed a problem with datetime being rendered by compile - see https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f * Changes to fix issues 945 and 950, updated documentation, internal PRD and test cases. Need to check that everything works. * Back out changes made for Kubernetes and by earlier attempts to fix #945 and #950 * Backout some changes that were made in other branches * Modified tests since the where clause is now syntactically slightly different, though semantically the same Cleaned up tests and partition_builder to pass flake8 and black * Fixed issue with SQL Server test * Fix string comparison issues * Fixed flake8 issues * Update Snowflake partition filter and confirmed that it works in local test --------- Co-authored-by: Piyush Sarraf <[email protected]>

* Update partition_table_prd.md potential way to address #923 and #950 * Fixed bug with specifying a config file in GCS - this did not work earlier Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index. Tested in a non Kubernetes setup * Added a Implementation Note on how to scale DVT with Kubernetes Shortened the option to 2 character code -kc * Linted files and added unit tests for config_runner changes. * Updated README on how to run multiple instances concurrently. * Updated README.md * Some Lint fixes * Updated tests to mock build_config_managers_from_yaml. * Fixed reference to directory. * Update README.md Co-authored-by: Neha Nene <[email protected]> * Update README.md Co-authored-by: Neha Nene <[email protected]> * Update README.md Co-authored-by: Neha Nene <[email protected]> * Update docs/internal/kubernetes_jobs.md Co-authored-by: Neha Nene <[email protected]> * Updated docs * Update README.md Co-authored-by: Neha Nene <[email protected]> * Some more doc changes. * Final changes ? * Final typos --------- Co-authored-by: Neha Nene <[email protected]>

helensilva14 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Aug 24, 2023

sundar-mudupalli-work added a commit that referenced this issue Aug 24, 2023

Update partition_table_prd.md potential way to address #923 and #950

7ab2be7

sundar-mudupalli-work mentioned this issue Aug 30, 2023

fix: generate-table-partitions- fixes Issue 945 and Issue 950 #962

Merged

sundar-mudupalli-work added a commit that referenced this issue Aug 30, 2023

Back out changes made for Kubernetes and by earlier attempts to fix #945

0d8296c

and #950

nehanene15 assigned sundar-mudupalli-work Aug 30, 2023

helensilva14 mentioned this issue Aug 30, 2023

generate-table-partitions generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923

Open

sundar-mudupalli-work closed this as completed in #962 Aug 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filters are not applied before generating partitions #950

Filters are not applied before generating partitions #950

florisvink commented Aug 22, 2023

Filters are not applied before generating partitions #950

Filters are not applied before generating partitions #950

Comments

florisvink commented Aug 22, 2023