
generate-table-partitions generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923

Open
florisvink opened this issue Jul 31, 2023 · 15 comments · Fixed by #962
Assignees
Labels
priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. type: feature request 'Nice-to-have' improvement, new feature or different behavior or design.

Comments

@florisvink

The following command:

data-validation generate-table-partitions -sc rs -tc bq -tbls source.table=target.table --primary-keys _hash_key,_hash_diff,_creation_timestamp,source_last_update_date,delivery_area_id --partition-num 164 --filter-status fail --bq-result-handler pso_data_validator.results -cdir ~/

generates 164 partitioned configs as expected, but none of them can be run using data-validation configs run -c ~/config_X.yaml; each run throws the following:

07/31/2023 09:37:36 AM-ERROR: Error (psycopg2.errors.SyntaxError) syntax error at or near "10"
LINE 3: ...2f114e622b' AND (_creation_timestamp > 2023-05-16 10:48:06.8...

It seems the filter: <sql_where_clause> part of the YAML contains timestamp values without the required quotes (see _creation_timestamp below).

  filters:
  - source: (_hash_key > '04b0a057a5ceccd4fff06984b5ed8e0ba1e90591' OR _hash_key =
      '04b0a057a5ceccd4fff06984b5ed8e0ba1e90591' AND (_hash_diff > 'cba27aebaf720f51987e0d3087213bd50fc3a70c'
      OR _hash_diff = 'cba27aebaf720f51987e0d3087213bd50fc3a70c' AND (_creation_timestamp
      > 2023-04-18 10:39:12.808967 OR _creation_timestamp = 2023-04-18 10:39:12.808967 ...
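
(For comparison, a valid clause would quote the timestamp literal: _creation_timestamp > '2023-04-18 10:39:12.808967'.)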
@helensilva14 helensilva14 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 High priority. Fix may be included in the next release. labels Jul 31, 2023
@helensilva14
Contributor

Hi @florisvink, thanks for reporting this! Could you please help us understand how much this is affecting you at the moment? We have triaged this problem and will tackle it as soon as we can, since we just merged some updates to this feature in #922.

@sundar-mudupalli-work
Contributor

The problem is in two static methods, _less_than_value and _geq_value, in https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/partition_builder.py. They both contain the line

value0 = "'" + values[0] + "'" if isinstance(values[0], str) else str(values[0])

This works correctly if the value is a string or a numeric type (real, integer, etc.). It does not work when the value is a boolean, a datetime, or possibly an interval. The fix is to extend this logic to cover those cases. While the code change is straightforward, we also need test cases with primary keys of those column types, run across all the supported databases.
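
A minimal sketch of what a type-aware replacement could look like (illustrative only; format_literal is a hypothetical helper, not the actual DVT fix):

import datetime

def format_literal(value):
    # Check bool before the numeric fall-through: bool is a subclass of int.
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, datetime.datetime):
        return f"'{value.isoformat(sep=' ')}'"
    if isinstance(value, datetime.date):
        return f"'{value.isoformat()}'"
    if isinstance(value, str):
        # Double any embedded single quote (SQL-standard escaping).
        return "'" + value.replace("'", "''") + "'"
    return str(value)  # ints, floats, Decimals pass through unquoted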

Sundar Mudupalli

@sundar-mudupalli-work sundar-mudupalli-work added the good first issue Good issue for new DVT contributors label Aug 1, 2023
@florisvink
Author

Hi @helensilva14, thanks for your reply. We have a workaround for the issue in the short term (a regex string-replace on all partitioned config files that fixes the quotes for datetime values). I'll keep an eye on updates in the develop branch since I understand this can be resolved fairly easily. Let me know if I can be of any help.
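
For reference, the workaround is roughly this (a sketch; it assumes timestamps appear as bare YYYY-MM-DD HH:MM:SS[.ffffff] literals after a comparison operator, so the pattern may need tuning):

import re

# Wrap bare timestamp literals in single quotes, e.g.
# "> 2023-04-18 10:39:12.808967"  ->  "> '2023-04-18 10:39:12.808967'"
TS = re.compile(r"(?<=[<>=] )(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)")

def quote_timestamps(config_text):
    return TS.sub(r"'\1'", config_text)

This gets applied to the contents of each generated config_X.yaml before running it.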

@helensilva14 helensilva14 self-assigned this Aug 2, 2023
@florisvink
Author

I also found that this quoting problem appears when a STRING field is part of the primary key and its value itself contains a quote:

See the 'Papa John's' example in the error below:

Error (psycopg2.errors.SyntaxError) syntax error at or near "s"
LINE 3: ...'2837642736423' AND (chain_name < 'Papa John's' OR chai..
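
(The standard SQL fix here is to double the embedded quote, i.e. 'Papa John''s'. In Python that is a one-liner; sql_quote is a hypothetical helper, not DVT's code:)

def sql_quote(s):
    # 'Papa John's' -> 'Papa John''s'
    return "'" + s.replace("'", "''") + "'"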

@piyushsarraf
Contributor

Hi @florisvink @helensilva14
I have raised PR #945, which solves the problems with timestamps and with strings containing quotes. However, I was facing an issue with the DATE datatype: for some reason the pandas dataframe treats it as a timestamp and adds time formatting to it, like this: (col_date >= '2023-07-03 00:00:00') AND (col_date < '2023-07-07 00:00:00'). I am working on solving this.
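
(For context: pandas has no date-only dtype, so DATE values get promoted to datetime64 timestamps, which is where the spurious midnight comes from. A standalone illustration, not DVT's actual code path:)

import datetime
import pandas as pd

s = pd.to_datetime(pd.Series([datetime.date(2023, 7, 3)]))
print(s.iloc[0])  # 2023-07-03 00:00:00 -- the date gains a midnight time part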

@florisvink
Author

That's very helpful, @piyushsarraf. I was fixing those issues by hand as well, by adding an explicit type-cast on date-time columns.

@florisvink
Author

Many thanks for working on this!
I also find that FLOAT/NUMERIC fields can cause issues when part of a primary key, since they can be NULL and then appear in the filters as nan.

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))
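
(Comparisons against NaN are never true, so a boundary like (amount < nan) silently matches no rows. A small up-front guard could catch this; check_partition_boundary is hypothetical, not part of DVT:)

import math

def check_partition_boundary(column, value):
    # A NULL/NaN boundary value can never partition rows correctly.
    if isinstance(value, float) and math.isnan(value):
        raise ValueError(
            f"column {column!r} contains NULL/NaN and cannot be used as a partition key"
        )
    return value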

@piyushsarraf piyushsarraf self-assigned this Aug 23, 2023
@nehanene15 nehanene15 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. and removed priority: p1 High priority. Fix may be included in the next release. labels Aug 24, 2023
@helensilva14
Contributor

helensilva14 commented Aug 30, 2023

Hi @florisvink! This new PR #962 is the most up-to-date one to fix the filter issue (#950), and our testing is almost finished. However, it only partially addresses this TIMESTAMP issue, because of the dependency version we currently use, SQLAlchemy==1.4.49. This is the whole picture:

SQLAlchemy==1.4.49 (current version)

  • we use it because it's the latest that supports Snowflake (one of our recent data source additions)
  • primary keys: don't support TIMESTAMP (sqlalchemy.exc.CompileError: No literal value renderer is available for literal value "datetime.datetime(2023, 8, 27, 15, 0)" with datatype DATETIME) or DATE (same error, changing only the literal value to 2023, 8, 23, 0, 0)
  • filters: support DATE, but not TIMESTAMP

SQLAlchemy==2.0.20 (recommended and tested version that works!)

  • does not support Snowflake (which we understand is not a problem for you, given your specific Redshift usage)
  • primary keys: support both TIMESTAMP and DATE
  • filters: support DATE, but not TIMESTAMP (ValueError: Unable to parse filter arguments)

Steps that will be needed after #962 is merged:

  1. Go to your virtual env running DVT and check out the develop branch
  2. Execute pip uninstall SQLAlchemy==1.4.49 && pip install SQLAlchemy==2.0.20
  3. Ignore the incompatibility error below, since as far as we know it doesn't affect Redshift on DVT

[screenshot: pip dependency incompatibility error]

  4. Then you can execute your validation command as stated above, within what 2.0.20 supports!

Let us know if you need direct assistance from us for any of this!

@helensilva14 helensilva14 removed the good first issue Good issue for new DVT contributors label Aug 30, 2023
sundar-mudupalli-work added a commit that referenced this issue Aug 31, 2023
* fix: Change in filter tag creation

* fix: Change in test files to integrate double quotes in string

* fix: Minor Fix

* fix: Fixing lint error

* fix: Fixing linting issue

* fix: typo

* Fix: fixing date issue and removing string logic and its tests

* fix: suggested changes

* feat: Adding integration tests for BQ, Hive, Teradata

* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Updated function to create the filter logic by using the feature of ibis to turn table expressions into SQL statements.
This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed
a problem with datetime being rendered by compile - see
https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

* Changes to fix issues 945 and 950, updated documentation, internal PRD and test cases. Need to check that everything works.

* Back out changes made for Kubernetes and by earlier attempts to fix #945 and #950

* Backout some changes that were made in other branches

* Modified tests since the where clause is now syntactically slightly different, though semantically the same
Cleaned up tests and partition_builder to pass flake8 and black

* Fixed issue with SQL Server test

* Fix string comparison issues

* Fixed flake8 issues

* Update Snowflake partition filter and confirmed that it works in local test

---------

Co-authored-by: Piyush Sarraf <[email protected]>
@florisvink
Author

florisvink commented Sep 1, 2023

After pulling the latest develop and upgrading SQLAlchemy as described above, I got the following error when generating partitions:

Command:

generate-table-partitions -sc rs -tc bq -tbls dwh.fact_order_action=just-data-warehouse.core_dwh.fact_order_action --primary-keys order_id --comparison-fields order_id,agent_id,order_date_time,order_confirmed_date_time,order_action_timestamp,order_action_id,order_action_rank,before_confirmation,confirm_type_id,confirm_category_id,is_preorder,restaurant_id,country_id,platform_id,department_id --partition-num 3 --filter-status fail --filters="order_action_timestamp > '2023-02-15'" -l partition_id=20230901_fact_order_action --bq-result-handler just-data-preprod.pso_data_validator.results_dim --format csv -cdir /Users/floris.vink/Documents/dvt-gen-config/dvt_configs

Result:

[SQL: show standard_conforming_strings]
(Background on this error at: https://sqlalche.me/e/20/f405)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/floris.vink/Documents/professional-services-data-validator/main.py", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 584, in main
    partition_and_store_config_files(args)
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 507, in partition_and_store_config_files
    config_managers = build_config_managers_from_args(args, consts.ROW_VALIDATION)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/__main__.py", line 272, in build_config_managers_from_args
    pre_build_configs_list = cli_tools.get_pre_build_configs(args, validate_cmd)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/cli_tools.py", line 1273, in get_pre_build_configs
    source_client = clients.get_data_client(mgr.get_connection_config(args.source_conn))
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/floris.vink/Documents/professional-services-data-validator/data_validation/clients.py", line 253, in get_data_client
    raise exceptions.DataClientConnectionFailure(msg)
data_validation.exceptions.DataClientConnectionFailure: Connection Type "Redshift" could not connect: (psycopg2.errors.UndefinedObject) unrecognized configuration parameter "standard_conforming_strings"

[SQL: show standard_conforming_strings]

@helensilva14 helensilva14 reopened this Sep 1, 2023
@nehanene15
Collaborator

@florisvink
Unfortunately, PR #962 introduced the usage of Ibis's to_sql() method, which compiles SQLAlchemy queries. The to_sql() function only supports timestamp/date/datetime literal rendering starting in SQLAlchemy 2.0: https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

The 'standard_conforming_strings' error occurs because the Redshift SQLAlchemy connector doesn't support SQLAlchemy 2.0: sqlalchemy-redshift/sqlalchemy-redshift#264

Due to this package dependency conflict, there isn't a way forward to support Redshift using the to_sql() method; it would probably require reverting the change and adding a condition to format timestamps.
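
(The version difference is easy to see directly in SQLAlchemy's literal rendering, which is what to_sql() relies on. A minimal reproduction sketch; the table and column names are made up:)

import datetime
import sqlalchemy as sa

orders = sa.table("orders", sa.column("created_at", sa.DateTime))
stmt = sa.select(orders).where(
    orders.c.created_at > datetime.datetime(2023, 8, 27, 15, 0)
)

# SQLAlchemy 1.4 raises CompileError: No literal value renderer is available
# for literal value "datetime.datetime(2023, 8, 27, 15, 0)" with datatype DATETIME.
# SQLAlchemy 2.0 renders the datetime as a quoted literal instead.
print(stmt.compile(compile_kwargs={"literal_binds": True}))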

@florisvink
Author

florisvink commented Sep 6, 2023

When I used the latest develop but left SQLAlchemy at 1.4.49, it seemed to work, apart from quoting issues in the filters YAML, like so:

  filters:
  - source: orderdatetime > '2023-02-15'
    target: orderdatetime > '2023-02-15'
    type: custom
  - source: ' orderdatetime > ''2023-02-15'' AND orderid >= ''6dff79b1c7ffe28f29a44f9476a26b0e53edffa76d1fe06a8f7ce6252a7338df''
      AND orderid < ''6e288c21a16d73502110194c2fad6010868a1d58caf3c0e382b4b0a7262bef56'''
    target: ' ( orderdatetime > ''2023-02-15'' ) AND ( `orderid` >= ''6dff79b1c7ffe28f29a44f9476a26b0e53edffa76d1fe06a8f7ce6252a7338df''
      ) AND ( `orderid` < ''6e288c21a16d73502110194c2fad6010868a1d58caf3c0e382b4b0a7262bef56''
      )'
  • issue 1: a single quote is used both as the YAML string delimiter and inside the SQL string
  • issue 2: doubled single quotes in the SQL

I guess I could try to string-replace the '' with ", or doesn't that work in Redshift? Just trying to find a quick workaround. Or could you use double quotes for the YAML strings? Then I'd be able to hack (string-replace) the doubled single quotes back into normal single quotes.

@florisvink
Author

Sorry, ignore the comment above. Although it looks wrong, it does actually work with the doubled single quotes.
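
(This is expected: inside a single-quoted YAML scalar, '' is the escape for one literal quote, so the filter round-trips to valid SQL. A quick check, assuming PyYAML:)

import yaml

doc = "filter: ' orderdatetime > ''2023-02-15'' '"
print(yaml.safe_load(doc)["filter"])  # -> " orderdatetime > '2023-02-15' "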

@sundar-mudupalli-work
Contributor

@florisvink

As Neha mentioned, we are not able to support timestamps in a composite primary key because Redshift does not work with SQLAlchemy 2.0. I am not sure you need to specify a timestamp as part of the composite primary key, though.

In a traditional transactional database, a primary key (or composite primary key) uniquely identifies a row. This is how we use the term primary keys in the data validator; our documentation says the keys uniquely identify the row. Generally, using a timestamp in a composite primary key in a transactional database is not a good idea, as discussed here.

In AWS Redshift, primary keys are informational only; they are used for query optimization, as mentioned in the documentation.

Therefore, the only columns you need to list as primary keys for the data validator are the ones that uniquely identify a row. Hopefully you don't need to specify a timestamp, and therefore this is not an issue for you. At least that is my hope.

Sundar Mudupalli

@helensilva14 helensilva14 removed the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 12, 2023
@helensilva14 helensilva14 added type: feature request 'Nice-to-have' improvement, new feature or different behavior or design. priority: p3 Lowest priority. This can be something deprioritized or not time sensitive. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Sep 12, 2023
@helensilva14
Contributor

@florisvink Unfortunately, PR #962 introduced the usage of Ibis's to_sql() method, which compiles SQLAlchemy queries. The to_sql() function only supports timestamp/date/datetime literal rendering starting in SQLAlchemy 2.0: https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

The 'standard_conforming_strings' error occurs because the Redshift SQLAlchemy connector doesn't support SQLAlchemy 2.0: sqlalchemy-redshift/sqlalchemy-redshift#264

Due to this package dependency conflict, there isn't a way forward to support Redshift using the to_sql() method; it would probably require reverting the change and adding a condition to format timestamps.

Hi everyone! Updating this issue to reinforce that we ran into the problem quoted above, but since @florisvink is no longer blocked, I changed this to a feature request so we can come back and test the feature once sqlalchemy-redshift/sqlalchemy-redshift#264 is solved and Redshift starts supporting SQLAlchemy 2.0+.

Thanks a lot everyone for all the effort!

@florisvink
Author

Being able to cast values while generating partitions would also solve a lot of problems. For instance, we would be able to cast a DATE to a string which, when sorted, behaves the same way.
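
(This works because ISO-8601 date strings sort lexicographically in the same order as the dates themselves, as a quick check shows:)

import datetime
import random

dates = [datetime.date(2023, 1, 1) + datetime.timedelta(days=random.randrange(1000))
         for _ in range(100)]
# Sorting the strings gives the same order as sorting the dates:
assert sorted(d.isoformat() for d in dates) == [d.isoformat() for d in sorted(dates)]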

sundar-mudupalli-work added a commit that referenced this issue Jan 8, 2024
* Update partition_table_prd.md potential way to address #923 and #950

* Fixed bug with specifying a config file in GCS - this did not work earlier
Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup

* Added an Implementation Note on how to scale DVT with Kubernetes
Shortened the option to 2 character code -kc

* Linted files and added unit tests for config_runner changes.

* Updated README on how to run multiple instances concurrently.

* Updated README.md

* Some Lint fixes

* Updated tests to mock build_config_managers_from_yaml.

* Fixed reference to directory.

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Update docs/internal/kubernetes_jobs.md

Co-authored-by: Neha Nene <[email protected]>

* Updated docs

* Update README.md

Co-authored-by: Neha Nene <[email protected]>

* Some more doc changes.

* Final changes ?

* Final typos

---------

Co-authored-by: Neha Nene <[email protected]>