Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filters are not applied before generating partitions #950

Closed
florisvink opened this issue Aug 22, 2023 · 0 comments · Fixed by #962
Closed

Filters are not applied before generating partitions #950

florisvink opened this issue Aug 22, 2023 · 0 comments · Fixed by #962
Assignees
Labels
priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@florisvink
Copy link

In the documentation is states that the generate partitions command works the same way as validate rows. It seems however that custom filters or a custom sql query source/target are not applied first when generating partitions and so partitions will be generated based on the full source- and target table. In our situation we want to validate the last 6 months of data in a multi billion row dataset so we run the partition generation like:

data-validation generate-table-partitions 
    -sc rs -tc bq \
    -tbls huge_source_table=huge_target_table \ 
    --primary-keys source_order_id,platform_id,order_transaction,source_customer_id
    --comparison-fields restaurant,branch,is_chain,chain_name,account_id,... \
    --partition-num 1000 --filter-status fail \
    --bq-result-handler bq_result_table \
    --filter "order_date_time > '2023-02-15' and platform_id=9"
    -cdir /targetdir_for_partitions

we got partitioned configs like:

...
  filters:
  - source: order_date_time > '2023-02-15' and platform_id=9
    target: order_date_time > '2023-02-15' and platform_id=9
    type: custom
  - source: source_order_id < '447191446' OR source_order_id = '447191446' AND (platform_id
      < 9 OR platform_id = 9 AND (order_transaction < 'Transaction' OR order_transaction
      = 'Transaction' AND (source_customer_id < '1772063')))
    target: source_order_id < '447191446' OR source_order_id = '447191446' AND (platform_id
      < 9 OR platform_id = 9 AND (order_transaction < 'Transaction' OR order_transaction
      = 'Transaction' AND (source_customer_id < '1772063')))
    type: custom
...

As you can see there are two filters:

  • one for the filtering (order_date_time > '2023-02-15' and platform_id=9)
  • one for partition windows (source_order_id < '447191446' OR source_order_id = '447191446' .... )
    effectively resulting in a partition with very few rows (sometimes even none at all).

I think this is caused by the filter not being in place when running the windowing-query for partition generation

@helensilva14 helensilva14 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Aug 24, 2023
sundar-mudupalli-work added a commit that referenced this issue Aug 29, 2023
…bis to turn table expressions into SQL statements.

This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed
a problem with datetime being rendered by compile - see
https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f
sundar-mudupalli-work added a commit that referenced this issue Aug 31, 2023
* fix: Change in filter tag creation

* fix: Change in test files to integrate double quotes in string

* fix: Minor Fix

* fix: Fixing lint error

* fix: Fixing linting issue

* fix: typo

* Fix: fixing date issue and removing string logic and its tests

* fix: suggested changes

* feat: Adding integration tests for BQ, Hive, Teradata

* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Updated function to create the filter logic by using the feature of ibis to turn table expressions into SQL statements.
This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed
a problem with datetime being rendered by compile - see
https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

* Changes to fix issues 945 and 950, updated documentation, internal PRD and test cases. Need to check that everything works.

* Back out changes made for Kubernetes and by earlier attempts to fix #945 and #950

* Backout some changes that were made in other branches

* Modified tests since the where clause is now syntactically slightly different, though semantically the same
Cleaned up tests and partition_builder to pass flake8 and black

* Fixed issue with SQL Server test

* Fix string comparison issues

* Fixed flake8 issues

* Update Snowflake partition filter and confirmed that it works in local test

---------

Co-authored-by: Piyush Sarraf <[email protected]>
sundar-mudupalli-work added a commit that referenced this issue Jan 8, 2024
* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Added a Implementation Note on how to scale DVT with Kubernetes
Shortened the option to 2 character code -kc

* Linted files and added unit tests for config_runner changes.

* Updated README on how to run multiple instances concurrently.

* Updated README.md

* Some Lint fixes

* Updated tests to mock build_config_managers_from_yaml.

* Fixed reference to directory.

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update docs/internal/kubernetes_jobs.md

Co-authored-by: Neha Nene <[email protected]>

* Updated docs

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Some more doc changes.

* Final changes ?

* Final typos

---------

Co-authored-by: Neha Nene <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
3 participants