Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Change in filter tag creation in generate-table-partitions #945

Closed
wants to merge 16 commits into from

Conversation

piyushsarraf
Copy link
Contributor

@piyushsarraf piyushsarraf commented Aug 17, 2023

@piyushsarraf piyushsarraf requested a review from a team as a code owner August 17, 2023 10:08
@piyushsarraf
Copy link
Contributor Author

/gcbrun

@sundar-mudupalli-work
Copy link
Contributor

Piyush,

Thank you for working on this. I did some investigation and here are my findings (based on postgres)

  1. Postgres has time, date, timestamp and interval date/time types and the time and timestamp type can be with or without timezone.
  2. The interval type is not supported by ibis postgres extension - any table with a column with this type will throw an error. I will file a bug on this.
  3. The time type shows up in pandas as an object. So I am not sure we can support this type as a primary key column
  4. The date and timestamp type are both stored as datetime objects in pandas. While I used isinstance, I am now learning there is a better way to check the type. The recommended way for datetime type is to use pandas.api.types.is_datetime64_any_dtype and for strings is pandas.api.types.is_string_dtype
  5. So if you have a date object in pandas, you have to cast the key as timestamp, so like "cast(" + keys[0] + "as timestamp)"
  6. While you are making this change, can you please check how things will work if a column is boolean ?
    Thanks.

Sundar Mudupalli

@piyushsarraf
Copy link
Contributor Author

/gcbrun

nehanene15 and others added 2 commits August 18, 2023 14:45
* fix: support casting PK's to varchar for TD char support

* fix: update so that this only applies to ComparisonField casts and doesn't affect regular row validation casting
@piyushsarraf
Copy link
Contributor Author

/gcbrun

@piyushsarraf
Copy link
Contributor Author

/gcbrun

@piyushsarraf
Copy link
Contributor Author

/gcbrun

1 similar comment
@piyushsarraf
Copy link
Contributor Author

/gcbrun

@sundar-mudupalli-work
Copy link
Contributor

Hi,

This fix only support timestamp fields, but not date fields. We need a fix that supports both - as the customer has already reported the issue with the date field as well. Please see my comments on how to fix the date field.

Thanks.

Sundar Mudupalli

@sundar-mudupalli-work
Copy link
Contributor

Looking at the email trail, the customer reported 3 issues: 1. timestamp (this is bug you fixed), 2. Date (this is the bug they are editing the yaml to be cast( as timestamp) and 3. escaping the quote character in the string. The third one is hard to fix and requires a redesign I think. Can you fix the second one as indicated in the notes to the PR. Thanks.

@florisvink
Copy link

florisvink commented Aug 21, 2023

Many thanks for working on this!
I also find that FLOAT/NUMERIC fields can cause issues as being part of a primary key since they can be null and they appear in the filters as nan

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))

Comment on lines 84 to 90
if isinstance(values[0], str):
value0 = '"' + values[0] + '"'
elif isinstance(values[0], pd.Timestamp):
value0 = "'" + str(values[0]) + "'"
else:
value0 = str(values[0])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double-quotes (") seem to cause problems in RedShift:

< SELECT count(*) FROM some_table WHERE some_field > 'some_value'
> OK

< SELECT count(*) FROM some_table WHERE some_field > "some_value"
> psycopg2.errors.UndefinedColumn: column "some_value" does not exist in some_table

@helensilva14 helensilva14 added the priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. label Aug 21, 2023
@piyushsarraf
Copy link
Contributor Author

/gcbrun

@sundar-mudupalli-work
Copy link
Contributor

Many thanks for working on this! I also find that FLOAT/NUMERIC fields can cause issues as being part of a primary key since they can be null and they appear in the filters as nan

Example:

  filters:
  - source: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    target: (opportunity_id > '132252' OR opportunity_id = '132252' AND (stagename
      > 'In Progress' OR stagename = 'In Progress' AND (stage_change_datetime > '2023-01-02
      11:09:20' OR stage_change_datetime = '2023-01-02 11:09:20' AND (chain_name >
      '(Not available)' OR chain_name = '(Not available)' AND (amount >= nan)))))
      AND (opportunity_id < '134269' OR opportunity_id = '134269' AND (stagename <
      'Onboarding' OR stagename = 'Onboarding' AND (stage_change_datetime < '2020-06-29
      15:04:37' OR stage_change_datetime = '2020-06-29 15:04:37' AND (chain_name <
      '241 Pizza' OR chain_name = '241 Pizza' AND (amount < nan)))))
    type: custom

(see (amount < nan))

Floris,

It is strange that the database is allowing you to store null values in a primary key. Which database is this?

Per Oracle (and likely other databases) - A primary key defines the set of columns that uniquely identifies rows in a table. When you create a primary key constraint, none of the columns included in the primary key can have NULL constraints; that is, they must not permit NULL values.. Please see Oracle docs

Sundar Mudupalli

Copy link
Contributor

@sundar-mudupalli-work sundar-mudupalli-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 87 -90 read

      elif isinstance(values[0], pd.Timestamp):
            if primary_key != "cast(" + keys[0] + " as timestamp)":
                primary_key = "cast(" + keys[0] + " as timestamp)"
            value0 = "'" + str(values[0]) + "'"

Should instead be

      elif isinstance(values[0], pd.Timestamp):
            primary_key = "cast(" + keys[0] + " as timestamp)"
            value0 = "'" + str(values[0]) + "'"

It is possible i am misunderstanding your code - so help me here. Similar comment on lines 122-125.

The customer is also saying that the use of " does not work correctly in RedShift. So the following line (121) may create a problem.

          value0 = '"' + values[0] + '"'

Perhaps not use " and use ' instead ?

Also you will need to build a test to demonstrate that primary date and timestamp keys work correctly - I suggest you create a function very similar to test_postgres_generate_table_partitions for each of the 8 data sources. I left the SQL code to generate the table for the postgres test in postgres_test_tables.sql. You can do something similar for all 8 tests.

Thank you.

Sundar Mudupalli

@florisvink
Copy link

It is strange that the database is allowing you to store null values in a primary key. Which database is this?

Hi @sundar-mudupalli-work. In this case the null value originates from a column that is part of a composite (combined) primary key. Apart from that: In BigQuery (and I think in most analytics databases) a unique primary key is not a necessity (other than relations databases where indeed primary keys have a unique constraint)

@piyushsarraf
Copy link
Contributor Author

/gcbrun

@piyushsarraf
Copy link
Contributor Author

/gcbrun

@nehanene15
Copy link
Collaborator

@sundar-mudupalli-work @piyushsarraf I think rather than adding integration tests, this would be better suited for unit tests for the function PartitionBuilder. _less_than_value(). It won't be scalable to create a new integration test + table for every unique primary key data type.

@florisvink As for the 'nan' issue if the PK were to be 'nan' would you want the entire condition AND (amount < nan) to be removed? In that case we can return an empty string "" from the _less_than_value function? Feel free to comment or add your suggestions to this PR.

@florisvink
Copy link

As for the 'nan' issue if the PK were to be 'nan' would you want the entire condition AND (amount < nan) to be removed? In that case we can return an empty string "" from the _less_than_value function? Feel free to comment or add your suggestions to this PR.

Yeah I think that makes sense. It could to unexpected results (partitions with 'too' many rows..) but at least it's better than creating corrupt SQL.

Please not it's not only 'greater than' statement. I also see filters like: (userid < nan OR userid = nan AND (restaurantid < '1204085'))

…rlier

Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index.
Tested in a non Kubernetes setup
@sundar-mudupalli-work
Copy link
Contributor

@florisvink - let us move the Null / Nan discussion to Issue 951 opened specifically for that issue - I have responded there.

Please keep other formatting issues - Date, Timestamps, String etc - here - we are working on resolving them ASAP.

Thank you for your patience.

Sundar Mudupalli

…bis to turn table expressions into SQL statements.

This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed
a problem with datetime being rendered by compile - see
https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f
@florisvink
Copy link

florisvink commented Aug 30, 2023

Apart from being able to apply a filter to partition generation it would also be helpful to be able to typecast fields used in the primary key before the partitions gets generated. As far as I know these settings are only editable in the YAML, but not as CLI arguments running generate-table-partitions

I often see errors like cannot cast "2023-02-15 00:00.00" literal to date while validating generated partitions. These errors originate from the partition-filter in the YAML.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p0 Highest priority. Critical issue. Will be fixed prior to next release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants