Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate schema: TIMESTAMP WITH TIME ZONE fails for Oracle/PostgreSQL validation #706

Closed
nj1973 opened this issue Feb 6, 2023 · 13 comments · Fixed by #919
Closed

validate schema: TIMESTAMP WITH TIME ZONE fails for Oracle/PostgreSQL validation #706

nj1973 opened this issue Feb 6, 2023 · 13 comments · Fixed by #919
Assignees
Labels
priority: p2 Medium priority. Fix may not be included in next release (e.g. minor documentation, cleanup)

Comments

@nj1973
Copy link
Contributor

nj1973 commented Feb 6, 2023

Oracle test table:

create table dvt_test.tab_tstz (
id number(2) not null, col_tstz timestamp(0) with time zone
);
ALTER TABLE dvt_test.tab_tstz ADD PRIMARY KEY (id);
insert into dvt_test.tab_tstz values 
(1,to_timestamp_tz('1973-10-31 23:22:21 +00:00','YYYY-MM-DD HH24:MI:SS TZH:TZM'));
insert into dvt_test.tab_tstz values 
(2,to_timestamp_tz('1973-10-31 23:22:21 +02:00','YYYY-MM-DD HH24:MI:SS TZH:TZM'));
commit;

PostgreSQL test table:

create table dvt_test.tab_tstz (
id decimal(2) not null, col_tstz timestamp(0) with time zone
);
ALTER TABLE dvt_test.tab_tstz ADD PRIMARY KEY (id);
insert into dvt_test.tab_tstz values (1,timestamp'1973-10-31 23:22:21 +00:00');
insert into dvt_test.tab_tstz values (2,timestamp'1973-10-31 23:22:21 +02:00');

DVT command:

data-validation validate schema \
-sc ora_conn -tc pg_conn \
-tbls dvt_test.tab_tstz=dvt_test.tab_tstz

Output (relevant columns only):

╤══════════════════╤══════════════════╤═════════════════╤═════════════════╤═══════╕
│source_column_name│target_column_name│ source_agg_value│ target_agg_value│status │
╪══════════════════╪══════════════════╪═════════════════╪═════════════════╪═══════╡
│id                │id                │ decimal(2, 0)   │ decimal(2, 0)   │success│
┼──────────────────┼──────────────────┼─────────────────┼─────────────────┼───────┤
│col_tstz          │col_tstz          │ timestamp       │ timestamp('UTC')│ fail  │
╧══════════════════╧══════════════════╧═════════════════╧═════════════════╧═══════╛

Try to use --allow-list workaround but validation output is the same as above:

data-validation validate schema \
-sc ora_conn -tc pg_conn \
-tbls dvt_test.tab_tstz=dvt_test.tab_tstz \
--allow-list 'timestamp:timestamp('UTC')'
@nehanene15
Copy link
Collaborator

This should be resolved with Issue #710 - the allowlist will now accept the workaround.

The 'timestamp' data type compared to 'timestamp with timezone' won't be considered a success off the bat since the data types technically are stored differently.

@nj1973
Copy link
Contributor Author

nj1973 commented Feb 9, 2023

In the test tables col_tstz is a timestamp(0) with time zone in both Oracle and PostgreSQL. I think we're losing the time zone when interacting with Oracle.
I tried to use --allow-list as a workaround - but I think there still may be an underlying issue with Oracle TIMESTAMP WITH TIME ZONE.

@nehanene15 nehanene15 reopened this Feb 9, 2023
@nehanene15
Copy link
Collaborator

I see - in that case I think the issue is that we don't support the timestamp with time zone data type in our Oracle Ibis connector. Here we register the TIMESTAMP data type. We should add support for timezone if provided - something like this in third_party/ibis/ibis_oracle/client.py:

@dt.dtype.register(OracleDialect_cx_oracle, sa.dialects.oracle.TIMESTAMP)
def sa_oracle_TIMESTAMP(_, satype, nullable=True):
    if satype.timezone:
          return dt.Timestamp(timestamp=satype.timezone)
    return dt.Timestamp(nullable=nullable)

@nj1973 nj1973 self-assigned this Feb 13, 2023
@nj1973
Copy link
Contributor Author

nj1973 commented Feb 13, 2023

I've been doing some testing of this and notice that my test case passes for Oracle vs BigQuery, both of these systems don't pickup the time zone.
So, if I "fix" Oracle to match PostgreSQL then I'll potentially break some other comparisons.
I wonder if it's correct to drop the time zone because (I think) all comparisons are executed in UTC anyway?
Either that or I need to check all engines to ensure they all pick up time zone.
Probably I need to do that anyway so we fully understand which comparisons work and which don't.

@nj1973
Copy link
Contributor Author

nj1973 commented Feb 27, 2023

Surveyed validate schema agg_value by engine.

BigQuery     timestamp
MySQL        timestamp     
Oracle       timestamp
PostgreSQL   timestamp('UTC')
SQL Server   NotImplementedError
Teradata     timestamp

So it looks like the correct course of action is remove the time zone information from PostgreSQL and not add it Oracle. @nehanene15 - any thoughts?

Also need a new issue for SQL Server support.

@nj1973
Copy link
Contributor Author

nj1973 commented Mar 3, 2023

I'm no longer convinced regarding my statement "the correct course of action is remove the time zone information from PostgreSQL". I've been doing some column and row validation testing with time zoned data and am seeing mismatches there. If we don't know the column is a time zoned timestamp then we don't know to normalise it to UTC in the select phase.
So this needs yet more thought.

@nehanene15 nehanene15 added the priority: p2 Medium priority. Fix may not be included in next release (e.g. minor documentation, cleanup) label Mar 28, 2023
@nj1973
Copy link
Contributor Author

nj1973 commented Jul 10, 2023

Work on this is currently blocked by #317

@nj1973
Copy link
Contributor Author

nj1973 commented Jul 10, 2023

TIMESTAMP WITH LOCAL TIME ZONE is also unsupported:

professional-services-data-validator/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/oracle/base.py:1765: 
SAWarning: Did not recognize type 'TIMESTAMP WITH LOCAL TIME ZONE' of column 'col_tsltz'

@nehanene15
Copy link
Collaborator

This is also an issue for Snowflake.

@nj1973
Copy link
Contributor Author

nj1973 commented Jul 26, 2023

I've pushed fixes for schema validation to this branch for:

  • Oracle
  • BigQuery
  • Teradata

MySQL and SQL Server were fixed by the Ibis upgrade.

It looks like Snowflake might be fixed in the latest Ibis so I can duplicate code from there.

    "TIMESTAMP_LTZ": dt.timestamp,
    "TIMESTAMP_TZ": dt.Timestamp("UTC"),
    "TIMESTAMP_NTZ": dt.timestamp,

@nehanene15 - do you know if there is already another work stream looking at implementing latest Snowflake code?

@nj1973
Copy link
Contributor Author

nj1973 commented Jul 26, 2023

I've kept this issue focused on schema validation and spun off a new issue for column and row validation: #917

@nj1973
Copy link
Contributor Author

nj1973 commented Jul 27, 2023

Issue #907 is implementing column and row validations for Snowflake. Might be better to hold off on working on Snowflake for schema validation until that issue is complete.

@nj1973
Copy link
Contributor Author

nj1973 commented Jul 27, 2023

Change of plan. We'll continue with the PR for this issue without Snowflake support and add schema validation support for Snowflake TIMESTAMP_TZ to the related issue #916.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Medium priority. Fix may not be included in next release (e.g. minor documentation, cleanup)
Projects
None yet
2 participants