
Custom-Query failure for boolean datatype columns #905

Closed
abhilash-JET opened this issue Jul 18, 2023 · 21 comments
Labels
priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@abhilash-JET

Hi Team,

I have a failure using the DVT tool for random row validation with the custom-query method. I generated the configuration file using the command below, where the column is_top_placement is of boolean datatype.

  data-validation validate custom-query row \
  --source-query-file sq_rm.sql \
  --target-query-file tq_rm.sql \
  -sc rs \
  -tc bq \
  --hash '*' \
  --primary-keys _rest_hash_key,visit_dt,is_top_placement,outcode,app_type \
  -rbs 100 \
  -c cq_rm.yaml

When I run the validation for the generated YAML file, I run into the issue shown below.

data-validation configs run -c ./cq_rm.yaml

Note: the validation runs fine when I exclude the boolean datatype column from the configuration file.

07/17/2023 02:38:35 PM-INFO: Currently running the validation for yml file: ./cq_rm.yaml
07/17/2023 02:44:03 PM-ERROR: Error (psycopg2.errors.CannotCoerce) cannot cast type boolean to character varying

Please do let me know if more details are required to debug further.

@sreeti-JET

Hello Team,
Adding to the above: schema and count are the default validations for all tables deployed in PROD. The result table shows that schema validation has passed for table "just-data-warehouse.staging_sa.tenbis_res". However, when I run random row validation for this table, it fails with a datatype mismatch. I believe this should not happen for a table that has passed schema validation.

Command to run the file:
data-validation configs run -c test.yaml

ERROR Message:
07/17/2023 01:18:03 PM-ERROR: Error Arguments with datatype float64 and string are not comparable occurred while running config file test.yaml. Skipping it for now.

@sundar-mudupalli-work
Contributor

Thank you for reaching out, and good to know that the customer wants to use DVT for the comparison. Thank you also for filing an issue; it allows us to collect all the information in one place, prioritize it, and use it in the future for training and tracking. Here is additional information that would be helpful:

  1. The output of running the custom-query with the verbose option, i.e. data-validation --verbose (or -v).
  2. The schema of the source and target tables.
  3. The YAML configuration file that you generated.

These can all be attached to the issue.

Taking a quick look, my question is: do you really need the boolean column as one of the primary keys? As explained in Primary Keys, you only need a column there if it is required to uniquely identify rows in the custom query.

Thanks.

Sundar Mudupalli

@abhilash-JET
Author

Attaching schema details.

Redshift schema:
(screenshot attached: Screenshot 2023-07-18 at 2.13.08 PM)

BigQuery schema:
(screenshot attached: Screenshot 2023-07-18 at 2.12.51 PM)

@abhilash-JET
Author

Regarding the use of a boolean column as a primary key: it was the only way to define a composite primary key that guarantees the uniqueness of the data. However, I have also seen row validation with the hash-all method fail on another table that has a boolean column which is not a primary key.

@abhilash-JET
Author

Please find the generated config file attached.

cq_rm_issue.txt

@nehanene15
Collaborator

Can you provide the verbose output generated via data-validation -v validate custom-query row ...?
My guess is that we're hitting the Redshift error here: https://repost.aws/questions/QUkjLOYKCJSauoX5NlvPcIbg/how-to-prevent-redshift-from-converting-boolean-to-varchar-when-creating-table-as-query-result

By default, DVT casts columns to string so we can then use it in a hash or concatenate the columns.
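To illustrate, the comparison query DVT generates would look roughly like the following. This is a hypothetical sketch using column and table names from this thread; the actual generated SQL, hash function, and separators may differ.

```sql
-- Hypothetical sketch of the hash expression DVT builds (not the exact generated SQL).
SELECT
  MD5(
    CONCAT(
      CAST(_rest_hash_key AS VARCHAR),
      CAST(visit_dt AS VARCHAR),
      CAST(is_top_placement AS VARCHAR),  -- Redshift rejects this: "cannot cast type boolean to character varying"
      CAST(outcode AS VARCHAR)
    )
  ) AS hash__all
FROM public.justeat_pp_restaurantmenuviewed;
```

Every column is stringified before concatenation, which is why a boolean column anywhere in the hash list triggers the error on Redshift.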

@abhilash-JET
Author

Please find the attached verbose output.
cq_rm_verbose.txt

@abhilash-JET
Author

I do not see any difference in the output file generated with and without verbose. Please find the command used below.

  data-validation -v validate custom-query row \
  --source-query-file sq_rm.sql \
  --target-query-file tq_rm.sql \
  -sc rs \
  -tc bq \
  --hash '*' \
  --primary-keys _rest_hash_key,visit_dt,is_top_placement,outcode,app_type \
  -rbs 100 \
  -c cq_rm_verbose.txt

@nehanene15
Collaborator

Since you're saving to a config file, you will need to apply -v when running the validation, i.e. data-validation -v configs run -c cq_rm_verbose.txt

@abhilash-JET
Author

Please find the output of the command above.
verbose_output.log

@nehanene15 nehanene15 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Jul 18, 2023
@nehanene15
Collaborator

Seems like we're casting to TEXT with Redshift, and there's a Redshift issue when casting a BOOL column to TEXT.
This might be solved by casting to VARCHAR instead, but we should manually test on Redshift to confirm that it would fix the issue:
i.e. CAST(t6.is_top_placement AS TEXT) versus CAST(t6.is_top_placement AS VARCHAR)

@abhilash-JET Could you test on Redshift and confirm?

@abhilash-JET
Author

I tried casting the boolean field to both TEXT and VARCHAR; it throws an error in both cases.

(screenshot attached: Screenshot 2023-07-19 at 11.07.52 AM)

@abhilash-JET
Author

abhilash-JET commented Jul 19, 2023

@nehanene15

Issue 2
As suggested by Zain, reporting another issue: hash_all validation fails even though the source and target contain exactly the same data. It looks like special characters are treated differently.

(Target and source data screenshots attached: Screenshot 2023-07-19 at 10.56.37)

@sreeti-JET

Why are we sharing email IDs over GitHub? Couldn't this be a GDPR issue?

@abhilash-JET
Author

@sreeti-JET can you please re-add the example with the data scrubbed? I have deleted the original data.

@sundar-mudupalli-work
Contributor

@abhilash-JET,

If you are not able to cast BOOLEAN to TEXT/VARCHAR, then that is a problem: DVT out of the box will not work for the BOOLEAN type. This may be a problem specific to AWS Redshift; I am able to convert BOOLEAN to TEXT with PostgreSQL (Cloud SQL).

There is an alternative: convert the BOOLEAN type to TEXT/VARCHAR yourself in your query. Since the BOOLEAN type has only 3 possible values (true, false, NULL), this is fortunately not that difficult. Below, the first query is the SQL that you provided; the second is the modified SQL where the BOOLEAN column is converted to TEXT/VARCHAR. Can you try this and see if it works?

SELECT 
  DISTINCT visit_dt, 
  visit_dt_utc, 
  sourceplatform, 
  _rest_hash_key, 
  total_click_count, 
  source_restaurant_id, 
  country, 
  outcode, 
  app_type, 
  is_top_placement 
FROM 
  public.justeat_pp_restaurantmenuviewed 
WHERE 
  visit_dt_utc > '2023-07-13'

SELECT 
  DISTINCT visit_dt, 
  visit_dt_utc, 
  sourceplatform, 
  _rest_hash_key, 
  total_click_count, 
  source_restaurant_id, 
  country, 
  outcode, 
  app_type, 
  CASE is_top_placement
    WHEN true THEN 'true'
    WHEN false THEN 'false'
    ELSE NULL END AS is_top_placement_text 
FROM 
  public.justeat_pp_restaurantmenuviewed 
WHERE 
  visit_dt_utc > '2023-07-13'

You can specify is_top_placement_text as one of the primary keys.

Let us know how that works out.

Sundar Mudupalli

@sreeti-JET

Please find the error log attached.
errorlog.sql.zip

@helensilva14
Contributor

helensilva14 commented Jul 20, 2023

Hi @sreeti-JET @abhilash-JET! Could you please try to use the CASE block for both the source (Redshift) and target (BQ) queries? Otherwise we will still hit the "Arguments with datatype string and boolean are not comparable" error.

 case is_top_placement
    when true then 'true'
    when false then 'false'
    else null END as is_top_placement_text

And make sure to change the primary keys list to --primary-keys _rest_hash_key,visit_dt,is_top_placement_text,outcode,app_type

@sundar-mudupalli-work
Contributor

sundar-mudupalli-work commented Jul 21, 2023

Hi,

As Helen mentioned, you need to make the change on both ends. I am attaching files that create the tables with the same schema, populate them with 4 rows, and run your queries modified with the suggested changes. These run correctly in our environment with the 3.20 release. This may not work if you just cloned the development branch (see issue #909). The sequence of commands we used is shown below.

You will need to edit the files, change their extensions to .sql, and update the commands to match your file and table names and your Redshift and BQ locations:

red_rm.txt
create_tbl_rd.txt
bq_rm.txt
create_tbl_bq.txt

bq query --use_legacy_sql=false < create_tbl_bq.sql
psql -U xxxxxx -d <database-name> -h xxxxxxxxxxxxxx.us-west-2.redshift.amazonaws.com -p 5439 -f create_tbl_rd.sql
data-validation validate custom-query row -tc=bq -sc=Redshift_CONN_mudupalli --target-query-file ../scripts/bq_rm.sql --source-query-file ../scripts/red_rm.sql -pk=_rest_hash_key,is_top_placement_text,outcode,app_type --hash '*'

Please try it out and let us know what you find out.

@sreeti-JET

This is working as expected. Thanks, team, for your help. Could you please let us know when the two issues below will be fixed?

  1. hash_diff is generated differently in RS and BQ when special characters are present.
  2. Datatype mismatch during validation: error "datatype float64 and string are not comparable".

@nehanene15
Collaborator

@sreeti-JET Thanks for the update. In that case, we will close this issue.

As for your other issues,

  1. What special characters cause the mismatch? If you run the SQL commands generated by data-validation -v validate row ... on each respective database, are the results different? A mismatch may be expected if the hash values of the special characters differ between RS and BQ. After testing, if the hash values do match, please open a separate issue for this with the details.

  2. Please open a separate issue for this with the following details: Tool version, command used to reproduce, the data types you are comparing, and stack trace.

Thanks.
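For what it's worth, one common cause of cross-engine hash mismatches on "identical-looking" data is Unicode normalization: the same visible string can be stored as different code-point sequences, which hash differently. A minimal Python sketch (illustrative only, not DVT code):

```python
import hashlib
import unicodedata

# Two strings that both render as "café" but use different code points:
# NFC stores the accented e as one composed character, NFD as "e" + combining accent.
nfc = unicodedata.normalize("NFC", "caf\u00e9")   # 'é' as a single code point
nfd = unicodedata.normalize("NFD", "cafe\u0301")  # 'e' followed by U+0301

print(nfc == nfd)  # False: different code points despite identical display
print(hashlib.sha256(nfc.encode("utf-8")).hexdigest() ==
      hashlib.sha256(nfd.encode("utf-8")).hexdigest())  # False: hashes diverge
```

If the two engines (or the load pipeline) normalize or encode text differently, the concatenated row strings differ byte-for-byte, and so do their hashes, even though the data looks the same in query results.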

5 participants