Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Oracle to Postgres - Row hash validation fails when Oracle charset is not unicode #875

Closed
ncalero-uy opened this issue May 30, 2023 · 0 comments
Assignees

Comments

@ncalero-uy
Copy link

Summary

I am doing data validation from Oracle to PostgreSQL using row hash.
My Oracle database is remote, using WE8MSWIN1252 charset, while PostgreSQL is local to the Linux machine running DVT, using UTF8 in both Postgres and Linux locale.

The SQL used to compute the hash values in the source database is not converting the data to UTF8, as done in the target to calculate the same hash, which produces an error because the hashes do not match.
After troubleshooting the problem in the code, I was able to fix it by modifying the code that computes the hash in the source.

Below is an example with the data, the issue, the troubleshooting steps, and the proposed solution.
It uses one table, TEST_TBL, with one row having a multibyte char (å) in a column named data_col.

The error

[postgres@pghost:~/professional-services-data-validator]$ data-validation validate row -sc Ora_c -tc pgdb -tbls myschema.TEST_TBL=myschema.test_tbl --filter-status fail --primary-keys col1 --hash data_col --filters "col1='0135'"
╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │ source_agg_value                                                 │ target_agg_value                                                 │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ hash__all         │ Row               │ myschema.TEST_TBL   │ hash__all            │ 3d3e80848ea8b627897248b1ba15fcc833322ad4c9633704ebf8db9d8305dc8e │ 402aadf57ffaa856084c873e667d860ceba6609fff393c5251b8362fa6cc1482 │                  │ fail                │ 47aa6b10-3b20-4c29-8bab-79e603be1921 │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

Validations in source (Oracle)

[postgres@pghost:\~/professional-services-data-validator]$ export NLS_LANG=american_america.AL32UTF8
(venv)
[postgres@pghost:\~/professional-services-data-validator]$ sqlplus /@MYORADB

SQL\*Plus: Release 19.0.0.0.0 - Production on Tue May 23 18:52:12 2023
Version 19.18.0.0.0

Copyright (c) 1982, 2022, Oracle.  All rights reserved.

Last Successful login time: Tue May 23 2023 18:40:22 +02:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.16.0.0.0

SQL> select data_col, upper(data_col)
           ,dump(upper(data_col), 1016)
           ,lower(standard_hash(upper(data_col), 'SHA256'))
           ,lower(standard_hash(upper(convert(data_col,'UTF8')), 'SHA256'))
           ,lower(standard_hash(convert(upper(data_col),'UTF8'), 'SHA256'))
from TEST_TBL
where col1=0135;  

data_col
----------------------------------------------------------------------------------
UPPER(data_col)
----------------------------------------------------------------------------------
DUMP(UPPER(data_col),1016)
----------------------------------------------------------------------------------
LOWER(STANDARD_HASH(UPPER(data_col),'SHA256'))
----------------------------------------------------------------------------------
LOWER(STANDARD_HASH(UPPER(CONVERT(data_col,'UTF8')),'SHA256'))
----------------------------------------------------------------------------------
LOWER(STANDARD_HASH(CONVERT(UPPER(data_col),'UTF8'),'SHA256'))
----------------------------------------------------------------------------------
Råde
RÅDE
Typ=1 Len=4 CharacterSet=WE8MSWIN1252: 52,c5,44,45
3d3e80848ea8b627897248b1ba15fcc833322ad4c9633704ebf8db9d8305dc8e
aeb248a08f60c0076228d29b94b69f35bb47674e79d64dd6d9c53dd636a14f59
402aadf57ffaa856084c873e667d860ceba6609fff393c5251b8362fa6cc1482

SQL> select a.value||'_'||b.value||'.'||c.value as NLS_LANG_SERVER
     from nls_database_parameters a, nls_database_parameters b, nls_database_parameters c
     where a.parameter='NLS_LANGUAGE' and b.parameter='NLS_TERRITORY'
       and c.parameter='NLS_CHARACTERSET';

NLS_LANG_SERVER
---------------------------------------------------------------------------------
AMERICAN_AMERICA.WE8MSWIN1252

SQL> select UNISTR('\20AC') from dual;

UNIS
----
€

Validations in target (PostgreSQL)

[postgres@pghost:\~/professional-services-data-validator]$ psql
psql (15.2)
Type "help" for help.

pgdb=# select data_col
pgdb-#   ,encode(sha256(convert_to(upper(rtrim(coalesce(data_col, 'NULL'))), 'UTF8')), 'hex') AS hash1
pgdb-#   ,encode(sha256(convert_to(upper(data_col), 'UTF8')), 'hex') AS hash2
pgdb-#   ,encode(sha256(convert_to(upper(data_col), 'WIN1252')), 'hex') AS hash3
pgdb-# from myschema.test_tbl
pgdb-# where col1='0135';
 data_col | hash1                                                            | hash2                                                            | hash3
----------+------------------------------------------------------------------+------------------------------------------------------------------+------------------------------------------------------------------
 Råde     | 
 402aadf57ffaa856084c873e667d860ceba6609fff393c5251b8362fa6cc1482 | 402aadf57ffaa856084c873e667d860ceba6609fff393c5251b8362fa6cc1482 | 3d3e80848ea8b627897248b1ba15fcc833322ad4c9633704ebf8db9d8305dc8e
(1 row)

pgdb=# \d+ myschema.test_tbl
                                                  Table "myschema.test_tbl"
 Column   |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
----------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 col1     | character varying(4)  |           | not null |         | extended |             |              |
 data_col | character varying(35) |           |          |         | extended |             |              |
Indexes:
    "test_tbl_pkey" PRIMARY KEY, btree (col1)
Access method: heap

pgdb=# \l
                                                       List of databases
      Name       |  Owner   | Encoding |   Collate   |    Ctype    | ICU Locale | Locale Provider |     Access privileges
-----------------+----------+----------+-------------+-------------+------------+-----------------+----------------------------
 pgdb            | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |            | libc            | =Tc/postgres              +
...

Code changed to fix the problem:

I have added a line before calling the hash function to convert the data to UTF8 (convert function), and also modified the hash call to use that modified source data instead of the original one. This is the code:

[postgres@pghost:\~/professional-services-data-validator]$ grep -A7 sa_format_hashbytes_oracle third_party/ibis/ibis_addon/operations.py
def sa_format_hashbytes_oracle(translator, expr):
    arg, how = expr.op().args
    compiled_arg = translator.translate(arg)
    ## Changes start
    convert = sa.func.convert(compiled_arg, sa.sql.literal_column("'UTF8'"))
    hash_func = sa.func.standard_hash(convert, sa.sql.literal_column("'SHA256'"))
    #hash_func = sa.func.standard_hash(compiled_arg, sa.sql.literal_column("'SHA256'"))
    ## Changes end
    return sa.func.lower(hash_func)
--

Test after the code was changed:

[postgres@pghost:\~/professional-services-data-validator]$ data-validation validate row -sc Ora_c -tc pgdb -tbls myschema.TEST_TBL=myschema.test_tbl --filter-status fail --primary-keys col1 --hash data_col --filters "col1='0135'"
╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤════════════════════╤════════════════════╤══════════════════╤═════════════════════╤══════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │ source_agg_value   │ target_agg_value   │ pct_difference   │ validation_status   │ run_id   │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪════════════════════╪════════════════════╪══════════════════╪═════════════════════╪══════════╡
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧════════════════════╧════════════════════╧══════════════════╧═════════════════════╧══════════╛
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants