fix: Change in filter tag creation in generate-table-partitions #945
/gcbrun |
Piyush, Thank you for working on this. I did some investigation and here are my findings (based on postgres)
Sundar Mudupalli |
* fix: support casting PK's to varchar for TD char support
* fix: update so that this only applies to ComparisonField casts and doesn't affect regular row validation casting
Hi, This fix only supports timestamp fields, not date fields. We need a fix that supports both, as the customer has already reported the issue with the date field as well. Please see my comments on how to fix the date field. Thanks. Sundar Mudupalli
Looking at the email trail, the customer reported 3 issues: 1. timestamp (this is the bug you fixed), 2. date (this is the bug for which they are editing the YAML to use cast( as timestamp)), and 3. escaping the quote character in the string. The third one is hard to fix and, I think, requires a redesign. Can you fix the second one as indicated in the notes to the PR? Thanks.
Many thanks for working on this! Example:
(see |
data_validation/partition_builder.py
Outdated
if isinstance(values[0], str):
    value0 = '"' + values[0] + '"'
elif isinstance(values[0], pd.Timestamp):
    value0 = "'" + str(values[0]) + "'"
else:
    value0 = str(values[0])
Double quotes (") seem to cause problems in RedShift:
< SELECT count(*) FROM some_table WHERE some_field > 'some_value'
> OK
< SELECT count(*) FROM some_table WHERE some_field > "some_value"
> psycopg2.errors.UndefinedColumn: column "some_value" does not exist in some_table
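In standard SQL, double quotes delimit identifiers and single quotes delimit string literals, which is why the second query is read as a column reference. A minimal sketch of quoting a string value with single quotes follows; `quote_sql_string` and `greater_than_filter` are hypothetical helper names used only for illustration and are not part of the project. Doubling embedded single quotes is the standard SQL escape, though individual engines may need extra care.

```python
def quote_sql_string(value: str) -> str:
    """Wrap a Python string in single quotes for use as a SQL string literal.

    Embedded single quotes are doubled, which is the standard SQL escape.
    Hypothetical helper, not the project's actual code.
    """
    return "'" + value.replace("'", "''") + "'"


def greater_than_filter(column: str, value: str) -> str:
    # Hypothetical helper: builds a WHERE-clause fragment like the one
    # in the Redshift example above.
    return f"{column} > {quote_sql_string(value)}"


print(greater_than_filter("some_field", "some_value"))
# some_field > 'some_value'
print(greater_than_filter("some_field", "O'Brien"))
# some_field > 'O''Brien'
```

Where the driver allows it, bound parameters avoid hand-built literals altogether; the sketch only covers the case where the filter has to be rendered as text.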
Floris, It is strange that the database is allowing you to store null values in a primary key. Which database is this? Per Oracle (and likely other databases) - Sundar Mudupalli |
Lines 87-90 read
    elif isinstance(values[0], pd.Timestamp):
        if primary_key != "cast(" + keys[0] + " as timestamp)":
            primary_key = "cast(" + keys[0] + " as timestamp)"
        value0 = "'" + str(values[0]) + "'"
Should instead be
    elif isinstance(values[0], pd.Timestamp):
        primary_key = "cast(" + keys[0] + " as timestamp)"
        value0 = "'" + str(values[0]) + "'"
It is possible I am misunderstanding your code, so help me here. Similar comment on lines 122-125.
The customer is also saying that the use of " does not work correctly in RedShift, so the following line (121) may create a problem.
    value0 = '"' + values[0] + '"'
Perhaps use ' instead of "?
Also, you will need to build a test to demonstrate that date and timestamp primary keys work correctly. I suggest you create a function very similar to test_postgres_generate_table_partitions for each of the 8 data sources. I left the SQL code to generate the table for the postgres test in postgres_test_tables.sql. You can do something similar for all 8 tests.
Thank you.
Sundar Mudupalli
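To make the review suggestions concrete, here is a rough sketch of a per-value filter builder together with a few checks. `build_filter_term` is a hypothetical stand-in, not the code in partition_builder.py, and it uses the standard library's `datetime` in place of `pd.Timestamp` so it runs without pandas. The `cast(... as date)` branch is an assumption about how a date key might be handled; the thread does not spell that out.

```python
from datetime import date, datetime


def build_filter_term(key, value):
    """Return (column_expression, sql_literal) for one primary-key value.

    Hypothetical stand-in for illustration only.
    """
    # datetime must be checked before date because datetime subclasses date.
    if isinstance(value, datetime):
        # Cast unconditionally, as suggested in the review comment.
        return "cast(" + key + " as timestamp)", "'" + str(value) + "'"
    if isinstance(value, date):
        # Assumption: date keys get an analogous cast.
        return "cast(" + key + " as date)", "'" + str(value) + "'"
    if isinstance(value, str):
        # Single quotes rather than double quotes; embedded quotes are
        # not escaped here.
        return key, "'" + value + "'"
    return key, str(value)


# Checks in the spirit of the requested tests, without a database.
assert build_filter_term("created_at", datetime(2023, 8, 1, 12, 30)) == (
    "cast(created_at as timestamp)",
    "'2023-08-01 12:30:00'",
)
assert build_filter_term("birth_date", date(2023, 8, 1)) == (
    "cast(birth_date as date)",
    "'2023-08-01'",
)
assert build_filter_term("name", "abc") == ("name", "'abc'")
assert build_filter_term("id", 42) == ("id", "42")
```

Checks like these only cover string rendering; whether each of the 8 data sources accepts the generated SQL would still need to be confirmed against the real engines.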
Hi @sundar-mudupalli-work. In this case the null value originates from a column that is part of a composite (combined) primary key. Apart from that: in BigQuery (and I think in most analytics databases) a unique primary key is not a necessity (unlike relational databases, where primary keys do have a unique constraint).
@sundar-mudupalli-work @piyushsarraf I think that, rather than adding integration tests, this would be better suited to unit tests for the function PartitionBuilder._less_than_value(). It won't be scalable to create a new integration test + table for every unique primary key data type. @florisvink As for the 'nan' issue, if the PK were to be 'nan' would you want the entire condition
Yeah, I think that makes sense. It could lead to unexpected results (partitions with 'too' many rows..) but at least it's better than creating corrupt SQL. Please note it's not only the 'greater than' statement. I also see filters like:
…rlier Added functionality to support Kubernetes Indexed jobs - which when provided with a directory will only run the job corresponding to the index. Tested in a non Kubernetes setup
@florisvink - let us move the Null / Nan discussion to Issue 951 opened specifically for that issue - I have responded there. Please keep other formatting issues - Date, Timestamps, String etc - here - we are working on resolving them ASAP. Thank you for your patience. Sundar Mudupalli |
…bis to turn table expressions into SQL statements. This addresses bugs #945 and #950. Unfortunately, we depend on the version of sqlalchemy being 2.0 or later which has fixed a problem with datetime being rendered by compile - see https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-206ec1f2af3a0c93785758c723ba356f
Apart from being able to apply a filter to partition generation, it would also be helpful to be able to typecast fields used in the primary key before the partitions get generated. As far as I know these settings are only editable in the YAML, but not as CLI arguments running I often see errors like
generate-table-partitions
generates invalid SQL in filter clause when a timestamp field is part of the primary-key #923." "
for string datatype, to handle quotes in string value.