
generate-table-partitions generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923

Open
florisvink opened this issue Jul 31, 2023 · 15 comments · Fixed by #962
Assignees
Labels
priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design.

Comments

@florisvink

The following command:

data-validation generate-table-partitions -sc rs -tc bq -tbls source.table=target.table --primary-keys _hash_key,_hash_diff,_creation_timestamp,source_last_update_date,delivery_area_id --partition-num 164 --filter-status fail --bq-result-handler pso_data_validator.results -cdir ~/

generates 164 partitioned configs as expected, but none of them can be run using data-validation configs run -c ~/config_X.yaml; each run throws the following:

07/31/2023 09:37:36 AM-ERROR: Error (psycopg2.errors.SyntaxError) syntax error at or near "10"
LINE 3: ...2f114e622b' AND (_creation_timestamp > 2023-05-16 10:48:06.8...

It seems the filter: <sql_where_clause> part of the YAML contains timestamp values without the required quotes (see _creation_timestamp below).

  filters:
  - source: (_hash_key > '04b0a057a5ceccd4fff06984b5ed8e0ba1e90591' OR _hash_key =
      '04b0a057a5ceccd4fff06984b5ed8e0ba1e90591' AND (_hash_diff > 'cba27aebaf720f51987e0d3087213bd50fc3a70c'
      OR _hash_diff = 'cba27aebaf720f51987e0d3087213bd50fc3a70c' AND (_creation_timestamp
      > 2023-04-18 10:39:12.808967 OR _creation_timestamp = 2023-04-18 10:39:12.808967 ...
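
(For comparison, a valid clause would quote the timestamp literal: _creation_timestamp > '2023-04-18 10:39:12.808967'.)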
@helensilva14 helensilva14 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 High priority. Fix may be included in the next release. labels Jul 31, 2023
@helensilva14
Contributor

Hi @florisvink, thanks for reporting this! Could you please help us understand how much this is affecting you at the moment? We have triaged this problem and will tackle it as soon as we can, since we just merged some updates to this feature in #922.

@sundar-mudupalli-work
Contributor

The problem is in two static methods, _less_than_value and _geq_value, in https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/partition_builder.py. They both contain the line

value0 = "'" + values[0] + "'" if isinstance(values[0], str) else str(values[0])

This works correctly if the value is a string or a numeric type (real, integer, etc.). It does not work when the value is a boolean, a datetime, or possibly an interval. The fix is to extend this logic to cover those cases. While the code change is straightforward, we also need test cases with primary keys of those column types, run across all the supported databases.
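
A minimal sketch of what a type-aware replacement could look like (illustrative only; format_literal is a hypothetical helper, not the actual DVT fix):

import datetime

def format_literal(value):
    # Check bool before the numeric fall-through: bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, datetime.datetime):
        return f"'{value.isoformat(sep=' ')}'"
    if isinstance(value, datetime.date):
        return f"'{value.isoformat()}'"
    if isinstance(value, str):
        # Double any embedded single quote (SQL-standard escaping).
        return "'" + value.replace("'", "''") + "'"
    return str(value)  # ints, floats, Decimals pass through unquoted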

Sundar Mudupalli

@sundar-mudupalli-work sundar-mudupalli-work added the good first issue Good issue for new DVT contributors label Aug 1, 2023
@florisvink
Author

Hi @helensilva14, thanks for your reply. We have a workaround for the issue in the short term (a regex string-replace on all partitioned config files that fixes the quotes for datetime values). I'll keep an eye on updates in the develop branch since I understand this can be resolved fairly easily. Let me know if I can be of any help.
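
For reference, the workaround is roughly this (a sketch; it assumes timestamps appear as bare YYYY-MM-DD HH:MM:SS[.ffffff] literals after a comparison operator, so the pattern may need tuning):

import re

# Wrap bare timestamp literals in single quotes, e.g.
# "> 2023-04-18 10:39:12.808967"  ->  "> '2023-04-18 10:39:12.808967'"
TS = re.compile(r"(?<=[<>=] )(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)")

def quote_timestamps(config_text):
    return TS.sub(r"'\1'", config_text)

This gets applied to the contents of each generated config_X.yaml before running it.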

@helensilva14 helensilva14 self-assigned this Aug 2, 2023
@florisvink
Author

I also found that this quoting problem appears when a STRING field is part of the primary key and its value itself contains a quote:

See the 'Papa John's' example in the error below:

Error (psycopg2.errors.SyntaxError) syntax error at or near "s"
LINE 3: ...'2837642736423' AND (chain_name < 'Papa John's' OR chai..
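
(The standard SQL fix here is to double the embedded quote, i.e. 'Papa John''s'. In Python that is a one-liner; sql_quote is a hypothetical helper, not DVT's code:)

def sql_quote(s):
    # 'Papa John's' -> 'Papa John''s'
    return "'" + s.replace("'", "''") + "'"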

@piyushsarraf
Contributor

Hi @florisvink @helensilva14
I have raised PR #945, which solves the problems with timestamps and with strings containing quotes. However, I was facing an issue with the DATE datatype: for some reason the pandas dataframe treats it as a timestamp and adds time formatting to it, like this: (col_date >= '2023-07-03 00:00:00') AND (col_date < '2023-07-07 00:00:00'). I am working on solving this.
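
(For context: pandas has no date-only dtype, so DATE values get promoted to datetime64 timestamps, which is where the spurious midnight comes from. A standalone illustration, not DVT's actual code path:)

import datetime
import pandas as pd

s = pd.to_datetime(pd.Series([datetime.date(2023, 7, 3)]))
print(s.iloc[0])  # 2023-07-03 00:00:00 -- the date gains a midnight time part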

@florisvink
Author

That's very helpful, @piyushsarraf. I was fixing those issues by hand as well, by adding an explicit type-cast on date-time columns.

@florisvink
Author

Many thanks for working on this!
I also find that FLOAT/NUMERIC fields can cause issues when part of a primary key, since they can be NULL and then appear in the filters as nan.

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))
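
(Comparisons against NaN are never true, so a boundary like (amount < nan) silently matches no rows. A small up-front guard could catch this; check_partition_boundary is hypothetical, not part of DVT:)

import math

def check_partition_boundary(column, value):
    # A NULL/NaN boundary value can never partition rows correctly.
    if isinstance(value, float) and math.isnan(value):
        raise ValueError(
            f"column {column!r} contains NULL/NaN and cannot be used as a partition key"
        )
    return value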

@piyushsarraf piyushsarraf self-assigned this Aug 23, 2023
@nehanene15 nehanene15 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. and removed priority: p1 High priority. Fix may be included in the next release. labels Aug 24, 2023
@helensilva14
Contributor

helensilva14 commented Aug 30, 2023

Hi @florisvink! This new PR #962 is the most up-to-date one to fix the filter issue (#950), and our testing is almost finished. However, it only partially addresses this TIMESTAMP issue, because of the dependency version we currently use, SQLAlchemy==1.4.49. This is the whole picture:

SQLAlchemy==1.4.49 (current version)

  • we use it because it's the latest that supports Snowflake (one of our recent data source additions)
  • primary keys: don't support TIMESTAMP (sqlalchemy.exc.CompileError: No literal value renderer is available for literal value "datetime.datetime(2023, 8, 27, 15, 0)" with datatype DATETIME) or DATE (same error, changing only the literal value to 2023, 8, 23, 0, 0)
  • filters: support DATE, but not TIMESTAMP

SQLAlchemy==2.0.20 (recommended and tested version that works!)

  • does not support Snowflake (which we understand is not a problem for you, given your specific Redshift usage)
  • primary keys: support both TIMESTAMP and DATE
  • filters: support DATE, but not TIMESTAMP (ValueError: Unable to parse filter arguments)

Steps that will be needed after #962 is merged:

  1. Go to your virtual env running DVT and check out the develop branch
  2. Execute pip uninstall SQLAlchemy==1.4.49 && pip install SQLAlchemy==2.0.20
  3. Ignore the incompatibility error below, since as far as we know it doesn't affect Redshift on DVT

[screenshot: pip dependency incompatibility error]

  4. Then you can execute your validation command as stated above, within what 2.0.20 supports!

Let us know if you need direct assistance from us for any of this!

@helensilva14 helensilva14 removed the good first issue Good issue for new DVT contributors label Aug 30, 2023
sundar-mudupalli-work added a commit that referenced this issue Aug 31, 2023
* fix: Change in filter tag creation

* fix: Change in test files to integrate double quotes in string

* fix: Minor Fix

* fix: Fixing lint error

* fix: Fixing linting issue

* fix: typo

* Fix: fixing date issue and removing string logic and its tests

* fix: suggested changes

* feat: Adding integration tests for BQ, Hive, Teradata

* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Updated function to create the filter logic by using the feature of ibis to turn table expressions into SQL statements.
This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed
a problem with datetime being rendered by compile - see
https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

* Changes to fix issues 945 and 950, updated documentation, internal PRD and test cases. Need to check that everything works.

* Back out changes made for Kubernetes and by earlier attempts to fix #945 and #950

* Backout some changes that were made in other branches

* Modified tests since the where clause is now syntactically slightly different, though semantically the same
Cleaned up tests and partition_builder to pass flake8 and black

* Fixed issue with SQL Server test

* Fix string comparison issues

* Fixed flake8 issues

* Update Snowflake partition filter and confirmed that it works in local test

---------

Co-authored-by: Piyush Sarraf <[email protected]>
@florisvink
Author

florisvink commented Sep 1, 2023

After pulling the latest develop and upgrading SQLAlchemy as described above, I got the following error when generating partitions:

Command:

generate-table-partitions -sc rs -tc bq -tbls dwh.fact_order_action=just-data-warehouse.core_dwh.fact_order_action --primary-keys order_id --comparison-fields order_id,agent_id,order_date_time,order_confirmed_date_time,order_action_timestamp,order_action_id,order_action_rank,before_confirmation,confirm_type_id,confirm_category_id,is_preorder,restaurant_id,country_id,platform_id,department_id --partition-num 3 --filter-status fail --filters="order_action_timestamp > '2023-02-15'" -l partition_id=20230901_fact_order_action --bq-result-handler just-data-preprod.pso_data_validator.results_dim --format csv -cdir /Users/floris.vink/Documents/dvt-gen-config/dvt_configs

Result:

[SQL: show standard_conforming_strings]
(Background on this error at: https://sqlalche.me/e/20/f405)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/floris.vink/Documents/professional-services-data-validator/main.py", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 584, in main
    partition_and_store_config_files(args)
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 507, in partition_and_store_config_files
    config_managers = build_config_managers_from_args(args, consts.ROW_VALIDATION)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 272, in build_config_managers_from_args
    pre_build_configs_list = cli_tools.get_pre_build_configs(args, validate_cmd)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/cli_tools.py", line 1273, in get_pre_build_configs
    source_client = clients.get_data_client(mgr.get_connection_config(args.source_conn))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/clients.py", line 253, in get_data_client
    raise exceptions.DataClientConnectionFailure(msg)
data_validation.exceptions.DataClientConnectionFailure: Connection Type "Redshift" could not connect: (psycopg2.errors.UndefinedObject) unrecognized configuration parameter "standard_conforming_strings"

[SQL: show standard_conforming_strings]

@helensilva14 helensilva14 reopened this Sep 1, 2023
@nehanene15
Collaborator

@florisvink
Unfortunately, PR #962 introduced the usage of Ibis's to_sql() method, which compiles SQLAlchemy queries. The to_sql() function only supports timestamp/date/datetime literal rendering starting in SQLAlchemy 2.0: https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

The 'standard_conforming_strings' error occurs because the Redshift SQLAlchemy connector doesn't support SQLAlchemy 2.0: sqlalchemy-redshift/sqlalchemy-redshift#264

Due to this package dependency conflict, there isn't a way forward to support Redshift using the to_sql() method; it would probably require reverting the change and adding a condition to format timestamps.
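
(The version difference is easy to see directly in SQLAlchemy's literal rendering, which is what to_sql() relies on. A minimal reproduction sketch; the table and column names are made up:)

import datetime
import sqlalchemy as sa

orders = sa.table("orders", sa.column("created_at", sa.DateTime))
stmt = sa.select(orders).where(
    orders.c.created_at > datetime.datetime(2023, 8, 27, 15, 0)
)

# SQLAlchemy 1.4 raises CompileError: No literal value renderer is available
# for literal value "datetime.datetime(2023, 8, 27, 15, 0)" with datatype DATETIME.
# SQLAlchemy 2.0 renders the datetime as a quoted literal instead.
print(stmt.compile(compile_kwargs={"literal_binds": True}))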

@florisvink
Author

florisvink commented Sep 6, 2023

When I used the latest develop but left SQLAlchemy at 1.4.49, it seemed to work, apart from quoting issues in the filters YAML, like so:

  filters:
  - source: orderdatetime > '2023-02-15'
    target: orderdatetime > '2023-02-15'
    type: custom
  - source: ' orderdatetime > ''2023-02-15'' AND orderid >= ''6dff79b1c7ffe28f29a44f9476a26b0e53edffa76d1fe06a8f7ce6252a7338df''
      AND orderid < ''6e288c21a16d73502110194c2fad6010868a1d58caf3c0e382b4b0a7262bef56'''
    target: ' ( orderdatetime > ''2023-02-15'' ) AND ( `orderid` >= ''6dff79b1c7ffe28f29a44f9476a26b0e53edffa76d1fe06a8f7ce6252a7338df''
      ) AND ( `orderid` < ''6e288c21a16d73502110194c2fad6010868a1d58caf3c0e382b4b0a7262bef56''
      )'
  • issue 1: a single quote is used both as the YAML string delimiter and inside the SQL string
  • issue 2: doubled single quotes in the SQL

I guess I could try to string-replace the '' with ", or doesn't that work in Redshift? Just trying to find a quick workaround. Or could you use double quotes for the YAML strings? Then I'd be able to hack (string-replace) the doubled single quotes back into normal single quotes.

@florisvink
Author

Sorry, ignore the comment above. Although it looks wrong, it does actually work with the doubled single quotes.
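
(This is expected: inside a single-quoted YAML scalar, '' is the escape for one literal quote, so the filter round-trips to valid SQL. A quick check, assuming PyYAML:)

import yaml

doc = "filter: ' orderdatetime > ''2023-02-15'' '"
print(yaml.safe_load(doc)["filter"])  # -> " orderdatetime > '2023-02-15' "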

@sundar-mudupalli-work
Contributor

@florisvink

As Neha mentioned, we are not able to support timestamps in a composite primary key because Redshift does not work with SQLAlchemy 2.0. I am not sure you need to specify a timestamp as part of the composite primary key, though.

In a traditional transactional database, a primary key (or composite primary key) uniquely identifies a row. This is how we use the term primary keys in the data validator; our documentation says the keys uniquely identify the row. Generally, using a timestamp in a composite primary key in a transactional database is not a good idea, as discussed here.

In AWS Redshift, primary keys are informational only; they are used for query optimization, as mentioned in the documentation.

Therefore, the only columns you need to list as primary keys for the data validator are the ones that uniquely identify a row. Hopefully you don't need to specify a timestamp, and therefore this is not an issue for you. At least that is my hope.

Sundar Mudupalli

@helensilva14 helensilva14 removed the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 12, 2023
@helensilva14 helensilva14 added type: feature request 'Nice-to-have' improvement, new feature or different behavior or design. priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Sep 12, 2023
@helensilva14
Contributor

@florisvink Unfortunately, PR #962 introduced the usage of Ibis's to_sql() method, which compiles SQLAlchemy queries. The to_sql() function only supports timestamp/date/datetime literal rendering starting in SQLAlchemy 2.0: https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

The 'standard_conforming_strings' error occurs because the Redshift SQLAlchemy connector doesn't support SQLAlchemy 2.0: sqlalchemy-redshift/sqlalchemy-redshift#264

Due to this package dependency conflict, there isn't a way forward to support Redshift using the to_sql() method; it would probably require reverting the change and adding a condition to format timestamps.

Hi everyone! Updating this issue to reinforce that we ran into the problem quoted above, but since @florisvink is no longer blocked, I changed this to a feature request so we can come back and test the feature once sqlalchemy-redshift/sqlalchemy-redshift#264 is solved and Redshift starts supporting SQLAlchemy 2.0+.

Thanks a lot everyone for all the effort!

@florisvink
Author

Being able to cast values while generating partitions would also solve a lot of problems. For instance, we would be able to cast a DATE to a string which, when sorted, behaves the same way.
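
(This works because ISO-8601 date strings sort lexicographically in the same order as the dates themselves, as a quick check shows:)

import datetime
import random

dates = [datetime.date(2023, 1, 1) + datetime.timedelta(days=random.randrange(1000))
         for _ in range(100)]
# Sorting the strings gives the same order as sorting the dates:
assert sorted(d.isoformat() for d in dates) == [d.isoformat() for d in sorted(dates)]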

sundar-mudupalli-work added a commit that referenced this issue Jan 8, 2024
* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Added an Implementation Note on how to scale DVT with Kubernetes
Shortened the option to 2 character code -kc

* Linted files and added unit tests for config_runner changes.

* Updated README on how to run multiple instances concurrently.

* Updated README.md

* Some Lint fixes

* Updated tests to mock build_config_managers_from_yaml.

* Fixed reference to directory.

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update docs/internal/kubernetes_jobs.md

Co-authored-by: Neha Nene <[email protected]>

* Updated docs

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Some more doc changes.

* Final changes ?

* Final typos

---------

Co-authored-by: Neha Nene <[email protected]>