fix: Change in filter tag creation in generate-table-partitions #945

piyushsarraf · 2023-08-17T10:08:17Z

Fixes issue generate-table-partitions generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923.
Changed the quotes to " " for string datatype, to handle quotes in string value.

piyushsarraf · 2023-08-17T10:08:36Z

/gcbrun

sundar-mudupalli-work · 2023-08-18T05:03:34Z

Piyush,

Thank you for working on this. I did some investigation and here are my findings (based on postgres)

Postgres has time, date, timestamp and interval date/time types and the time and timestamp type can be with or without timezone.
The interval type is not supported by ibis postgres extension - any table with a column with this type will throw an error. I will file a bug on this.
The time type shows up in pandas as an object. So I am not sure we can support this type as a primary key column
The date and timestamp type are both stored as datetime objects in pandas. While I used isinstance, I am now learning there is a better way to check the type. The recommended way for datetime type is to use pandas.api.types.is_datetime64_any_dtype and for strings is pandas.api.types.is_string_dtype
So if you have a date object in pandas, you have to cast the key as timestamp, so like "cast(" + keys[0] + "as timestamp)"
While you are making this change, can you please check how things will work if a column is boolean ?
Thanks.

Sundar Mudupalli

piyushsarraf · 2023-08-18T08:59:18Z

/gcbrun

* fix: support casting PK's to varchar for TD char support * fix: update so that this only applies to ComparisonField casts and doesn't affect regular row validation casting

piyushsarraf · 2023-08-18T09:58:10Z

/gcbrun

piyushsarraf · 2023-08-18T14:00:36Z

/gcbrun

piyushsarraf · 2023-08-18T15:08:28Z

/gcbrun

piyushsarraf · 2023-08-18T15:16:36Z

/gcbrun

sundar-mudupalli-work · 2023-08-18T19:50:33Z

Hi,

This fix only support timestamp fields, but not date fields. We need a fix that supports both - as the customer has already reported the issue with the date field as well. Please see my comments on how to fix the date field.

Thanks.

Sundar Mudupalli

sundar-mudupalli-work · 2023-08-18T21:58:23Z

Looking at the email trail, the customer reported 3 issues: 1. timestamp (this is bug you fixed), 2. Date (this is the bug they are editing the yaml to be cast( as timestamp) and 3. escaping the quote character in the string. The third one is hard to fix and requires a redesign I think. Can you fix the second one as indicated in the notes to the PR. Thanks.

florisvink · 2023-08-21T11:13:18Z

Many thanks for working on this!
I also find that FLOAT/NUMERIC fields can cause issues as being part of a primary key since they can be null and they appear in the filters as nan

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))

florisvink · 2023-08-21T13:07:06Z

data_validation/partition_builder.py

+        if isinstance(values[0], str):
+            value0 = '"' + values[0] + '"'
+        elif isinstance(values[0], pd.Timestamp):
+            value0 = "'" + str(values[0]) + "'"
+        else:
+            value0 = str(values[0])
+


double-quotes (") seem to cause problems in RedShift:

< SELECT count(*) FROM some_table WHERE some_field > 'some_value' > OK < SELECT count(*) FROM some_table WHERE some_field > "some_value" > psycopg2.errors.UndefinedColumn: column "some_value" does not exist in some_table

piyushsarraf · 2023-08-21T15:49:49Z

/gcbrun

sundar-mudupalli-work · 2023-08-21T22:42:02Z

Many thanks for working on this! I also find that FLOAT/NUMERIC fields can cause issues as being part of a primary key since they can be null and they appear in the filters as nan

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))

Floris,

It is strange that the database is allowing you to store null values in a primary key. Which database is this?

Per Oracle (and likely other databases) - A primary key defines the set of columns that uniquely identifies rows in a table. When you create a primary key constraint, none of the columns included in the primary key can have NULL constraints; that is, they must not permit NULL values.. Please see Oracle docs

Sundar Mudupalli

sundar-mudupalli-work

Lines 87 -90 read

      elif isinstance(values[0], pd.Timestamp):
            if primary_key != "cast(" + keys[0] + " as timestamp)":
                primary_key = "cast(" + keys[0] + " as timestamp)"
            value0 = "'" + str(values[0]) + "'"

Should instead be

      elif isinstance(values[0], pd.Timestamp):
            primary_key = "cast(" + keys[0] + " as timestamp)"
            value0 = "'" + str(values[0]) + "'"

It is possible i am misunderstanding your code - so help me here. Similar comment on lines 122-125.

The customer is also saying that the use of " does not work correctly in RedShift. So the following line (121) may create a problem.

          value0 = '"' + values[0] + '"'

Perhaps not use " and use ' instead ?

Also you will need to build a test to demonstrate that primary date and timestamp keys work correctly - I suggest you create a function very similar to test_postgres_generate_table_partitions for each of the 8 data sources. I left the SQL code to generate the table for the postgres test in postgres_test_tables.sql. You can do something similar for all 8 tests.

Thank you.

Sundar Mudupalli

florisvink · 2023-08-22T07:07:42Z

It is strange that the database is allowing you to store null values in a primary key. Which database is this?

Hi @sundar-mudupalli-work. In this case the null value originates from a column that is part of a composite (combined) primary key. Apart from that: In BigQuery (and I think in most analytics databases) a unique primary key is not a necessity (other than relations databases where indeed primary keys have a unique constraint)

piyushsarraf · 2023-08-23T11:15:40Z

/gcbrun

piyushsarraf · 2023-08-23T11:16:37Z

/gcbrun

nehanene15 · 2023-08-24T16:48:55Z

@sundar-mudupalli-work @piyushsarraf I think rather than adding integration tests, this would be better suited for unit tests for the function PartitionBuilder. _less_than_value(). It won't be scalable to create a new integration test + table for every unique primary key data type.

@florisvink As for the 'nan' issue if the PK were to be 'nan' would you want the entire condition AND (amount < nan) to be removed? In that case we can return an empty string "" from the _less_than_value function? Feel free to comment or add your suggestions to this PR.

florisvink · 2023-08-25T16:50:52Z

As for the 'nan' issue if the PK were to be 'nan' would you want the entire condition AND (amount < nan) to be removed? In that case we can return an empty string "" from the _less_than_value function? Feel free to comment or add your suggestions to this PR.

Yeah I think that makes sense. It could to unexpected results (partitions with 'too' many rows..) but at least it's better than creating corrupt SQL.

Please not it's not only 'greater than' statement. I also see filters like: (userid < nan OR userid = nan AND (restaurantid < '1204085'))

…rlier Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index. Tested in a non Kubernetes setup

…rofessional-services-data-validator into K8-indexed

…fessional-services-data-validator into issue923

sundar-mudupalli-work · 2023-08-27T23:11:36Z

@florisvink - let us move the Null / Nan discussion to Issue 951 opened specifically for that issue - I have responded there.

Please keep other formatting issues - Date, Timestamps, String etc - here - we are working on resolving them ASAP.

Thank you for your patience.

Sundar Mudupalli

…bis to turn table expressions into SQL statements. This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed a problem with datetime being rendered by compile - see https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f

and #950

florisvink · 2023-08-30T15:52:32Z

Apart from being able to apply a filter to partition generation it would also be helpful to be able to typecast fields used in the primary key before the partitions gets generated. As far as I know these settings are only editable in the YAML, but not as CLI arguments running generate-table-partitions

I often see errors like cannot cast "2023-02-15 00:00.00" literal to date while validating generated partitions. These errors originate from the partition-filter in the YAML.

fix: Change in filter tag creation

eb71dac

piyushsarraf requested a review from a team as a code owner August 17, 2023 10:08

piyushsarraf mentioned this pull request Aug 17, 2023

generate-table-partitions generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923

Open

nehanene15 requested a review from sundar-mudupalli-work August 17, 2023 12:50

Piyush Sarraf added 2 commits August 18, 2023 08:53

fix: Change in test files to integrate double quotes in string

8252e8b

fix: Minor Fix

5c15379

nehanene15 and others added 2 commits August 18, 2023 14:45

fix: Support casting TD PKs to VARCHAR (#946)

2171532

* fix: support casting PK's to varchar for TD char support * fix: update so that this only applies to ComparisonField casts and doesn't affect regular row validation casting

fix: Fixing lint error

e69f756

release-please bot and others added 2 commits August 18, 2023 18:09

chore(develop): release 4.1.0 (#939)

b73c747

fix: Fixing linting issue

0ce6881

fix: typo

6864008

florisvink suggested changes Aug 21, 2023

View reviewed changes

helensilva14 added the priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. label Aug 21, 2023

Fix: fixing date issue and removing string logic and its tests

191cdfc

sundar-mudupalli-work requested changes Aug 21, 2023

View reviewed changes

fix: suggested changes

fb01960

feat: Adding integration tests for BQ, Hive, Teradata

d2995e6

Update partition_table_prd.md potential way to address #923 and #950

7ab2be7

sundar-mudupalli-work added 3 commits August 26, 2023 00:39

Fixed bug with specifying a config file in GCS - this did not work ea…

a39b9d8

…rlier Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index. Tested in a non Kubernetes setup

Merge branch 'K8-indexed' of https://github.com/GoogleCloudPlatform/p…

6bb41e2

…rofessional-services-data-validator into K8-indexed

Merge branch 'issue923' of https://github.com/GoogleCloudPlatform/pro…

3ac9649

…fessional-services-data-validator into issue923

sundar-mudupalli-work added a commit that referenced this pull request Aug 30, 2023

Back out changes made for Kubernetes and by earlier attempts to fix #945

0d8296c

and #950

sundar-mudupalli-work closed this in c53f2fc Aug 31, 2023

sundar-mudupalli-work mentioned this pull request Aug 9, 2024

generate-partitions does not render strings correctly for teradata backend. #1221

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Change in filter tag creation in generate-table-partitions #945

fix: Change in filter tag creation in generate-table-partitions #945

piyushsarraf commented Aug 17, 2023 •

edited

Loading

piyushsarraf commented Aug 17, 2023

sundar-mudupalli-work commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

sundar-mudupalli-work commented Aug 18, 2023

sundar-mudupalli-work commented Aug 18, 2023

florisvink commented Aug 21, 2023 •

edited

Loading

florisvink Aug 21, 2023

piyushsarraf commented Aug 21, 2023

sundar-mudupalli-work commented Aug 21, 2023

sundar-mudupalli-work left a comment

florisvink commented Aug 22, 2023

piyushsarraf commented Aug 23, 2023

piyushsarraf commented Aug 23, 2023

nehanene15 commented Aug 24, 2023

florisvink commented Aug 25, 2023

sundar-mudupalli-work commented Aug 27, 2023

florisvink commented Aug 30, 2023 •

edited

Loading

fix: Change in filter tag creation in generate-table-partitions #945

fix: Change in filter tag creation in generate-table-partitions #945

Conversation

piyushsarraf commented Aug 17, 2023 • edited Loading

piyushsarraf commented Aug 17, 2023

sundar-mudupalli-work commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

piyushsarraf commented Aug 18, 2023

sundar-mudupalli-work commented Aug 18, 2023

sundar-mudupalli-work commented Aug 18, 2023

florisvink commented Aug 21, 2023 • edited Loading

florisvink Aug 21, 2023

Choose a reason for hiding this comment

piyushsarraf commented Aug 21, 2023

sundar-mudupalli-work commented Aug 21, 2023

sundar-mudupalli-work left a comment

Choose a reason for hiding this comment

florisvink commented Aug 22, 2023

piyushsarraf commented Aug 23, 2023

piyushsarraf commented Aug 23, 2023

nehanene15 commented Aug 24, 2023

florisvink commented Aug 25, 2023

sundar-mudupalli-work commented Aug 27, 2023

florisvink commented Aug 30, 2023 • edited Loading

piyushsarraf commented Aug 17, 2023 •

edited

Loading

florisvink commented Aug 21, 2023 •

edited

Loading

florisvink commented Aug 30, 2023 •

edited

Loading