
Custom-Query failure for boolean datatype columns #905

Closed
abhilash-JET opened this issue Jul 18, 2023 · 21 comments
Labels
priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@abhilash-JET

Hi Team,

I have a failure using the DVT tool for random row validation with the custom-query method. I generated the configuration file using the command below, where the column is_top_placement is of boolean datatype.

  data-validation validate custom-query row \
  --source-query-file sq_rm.sql \
  --target-query-file tq_rm.sql \
  -sc rs \
  -tc bq \
  --hash '*' \
  --primary-keys _rest_hash_key,visit_dt,is_top_placement,outcode,app_type \
  -rbs 100 \
  -c cq_rm.yaml

When I run the validation for the generated YAML file, I run into the issue shown below.

data-validation configs run -c ./cq_rm.yaml

Note: the validation runs fine when I exclude the boolean datatype column from the configuration file.

07/17/2023 02:38:35 PM-INFO: Currently running the validation for yml file: ./cq_rm.yaml
07/17/2023 02:44:03 PM-ERROR: Error (psycopg2.errors.CannotCoerce) cannot cast type boolean to character varying

Please do let me know if more details are required to debug further.

@sreeti-JET

Hello Team,
Adding to the above: schema and count are the default validations for all tables deployed in PROD. The result table shows that schema validation has passed for table "just-data-warehouse.staging_sa.tenbis_res". However, when I run random row validation for this table, it fails with a datatype mismatch. I believe this should not happen for a table that has passed schema validation.

Command to run the file:
data-validation configs run -c test.yaml

ERROR Message:
07/17/2023 01:18:03 PM-ERROR: Error Arguments with datatype float64 and string are not comparable occurred while running config file test.yaml. Skipping it for now.

@sundar-mudupalli-work
Contributor

Thank you for reaching out, and good to know that the customer wants to use DVT for the comparison. Thank you also for filing an issue; it allows us to collect all the information in one place, prioritize it, and use it in the future for training and tracking. Here is additional information that would be helpful:

  1. The output of running the custom-query with the verbose option, i.e. data-validation --verbose (or -v).
  2. The schema of the source and target tables.
  3. The YAML configuration file that you generated.

These can all be attached to the issue.

Taking a quick look, my question is: do you really need the boolean column as one of the primary keys? As explained in Primary Keys, you only need a column there if it is required to uniquely identify rows in the custom query.

Thanks.

Sundar Mudupalli

@abhilash-JET
Author

Attaching schema details.

Redshift schema:
(screenshot attached: Screenshot 2023-07-18 at 2.13.08 PM)

BigQuery schema:
(screenshot attached: Screenshot 2023-07-18 at 2.12.51 PM)

@abhilash-JET
Author

Regarding the use of a boolean column as a primary key: it was the only way to define a composite primary key that guarantees the uniqueness of the data. However, I have also seen row validation with the hash-all method fail on another table that has a boolean column which is not a primary key.

@abhilash-JET
Author

Please find the generated config file attached.

cq_rm_issue.txt

@nehanene15
Collaborator

Can you provide the verbose output generated via data-validation -v validate custom-query row ...?
My guess is that we're hitting the Redshift error here: https://repost.aws/questions/QUkjLOYKCJSauoX5NlvPcIbg/how-to-prevent-redshift-from-converting-boolean-to-varchar-when-creating-table-as-query-result

By default, DVT casts columns to string so we can then use it in a hash or concatenate the columns.
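To illustrate, the comparison query DVT generates would look roughly like the following. This is a hypothetical sketch using column and table names from this thread; the actual generated SQL, hash function, and separators may differ.

```sql
-- Hypothetical sketch of the hash expression DVT builds (not the exact generated SQL).
SELECT
  MD5(
    CONCAT(
      CAST(_rest_hash_key AS VARCHAR),
      CAST(visit_dt AS VARCHAR),
      CAST(is_top_placement AS VARCHAR),  -- Redshift rejects this: "cannot cast type boolean to character varying"
      CAST(outcode AS VARCHAR)
    )
  ) AS hash__all
FROM public.justeat_pp_restaurantmenuviewed;
```

Every column is stringified before concatenation, which is why a boolean column anywhere in the hash list triggers the error on Redshift.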

@abhilash-JET
Author

Please find the attached verbose output.
cq_rm_verbose.txt

@abhilash-JET
Author

I do not see any difference in the output file generated with and without verbose. Please find the command used below.

  data-validation -v validate custom-query row \
  --source-query-file sq_rm.sql \
  --target-query-file tq_rm.sql \
  -sc rs \
  -tc bq \
  --hash '*' \
  --primary-keys _rest_hash_key,visit_dt,is_top_placement,outcode,app_type \
  -rbs 100 \
  -c cq_rm_verbose.txt

@nehanene15
Collaborator

Since you're saving to a config file, you will need to apply -v when running the validation, i.e. data-validation -v configs run -c cq_rm_verbose.txt

@abhilash-JET
Author

Please find the output of the command above.
verbose_output.log

@nehanene15 nehanene15 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Jul 18, 2023
@nehanene15
Collaborator

Seems like we're casting to TEXT with Redshift, and there's a Redshift issue when casting a BOOL column to TEXT.
This might be solved by casting to VARCHAR instead, but we should manually test on Redshift to confirm that it would fix the issue:
i.e. CAST(t6.is_top_placement AS TEXT) versus CAST(t6.is_top_placement AS VARCHAR)

@abhilash-JET Could you test on Redshift and confirm?

@abhilash-JET
Author

I tried casting the boolean field to both TEXT and VARCHAR; it throws an error in both cases.

(screenshot attached: Screenshot 2023-07-19 at 11.07.52 AM)

@abhilash-JET
Author

abhilash-JET commented Jul 19, 2023

@nehanene15

Issue 2
As suggested by Zain, reporting another issue: hash_all validation fails even though the source and target contain exactly the same data. It looks like special characters are treated differently.

(Target and source data screenshots attached: Screenshot 2023-07-19 at 10.56.37)

@sreeti-JET

Why are we sharing email IDs over GitHub? Couldn't this be a GDPR issue?

@abhilash-JET
Author

@sreeti-JET can you please re-add the example with the data scrubbed? I have deleted the original data.

@sundar-mudupalli-work
Contributor

@abhilash-JET,

If you are not able to cast BOOLEAN to TEXT/VARCHAR, then that is a problem: DVT out of the box will not work for the BOOLEAN type. This may be a problem specific to AWS Redshift; I am able to convert BOOLEAN to TEXT with PostgreSQL (Cloud SQL).

There is an alternative: convert the BOOLEAN type to TEXT/VARCHAR yourself in your query. Since the BOOLEAN type has only 3 possible values (true, false, NULL), this is fortunately not that difficult. Below, the first query is the SQL that you provided; the second is the modified SQL where the BOOLEAN column is converted to TEXT/VARCHAR. Can you try this and see if it works?

SELECT 
  DISTINCT visit_dt, 
  visit_dt_utc, 
  sourceplatform, 
  _rest_hash_key, 
  total_click_count, 
  source_restaurant_id, 
  country, 
  outcode, 
  app_type, 
  is_top_placement 
FROM 
  public.justeat_pp_restaurantmenuviewed 
WHERE 
  visit_dt_utc > '2023-07-13'

SELECT 
  DISTINCT visit_dt, 
  visit_dt_utc, 
  sourceplatform, 
  _rest_hash_key, 
  total_click_count, 
  source_restaurant_id, 
  country, 
  outcode, 
  app_type, 
  CASE is_top_placement
    WHEN true THEN 'true'
    WHEN false THEN 'false'
    ELSE NULL END AS is_top_placement_text 
FROM 
  public.justeat_pp_restaurantmenuviewed 
WHERE 
  visit_dt_utc > '2023-07-13'

You can specify is_top_placement_text as one of the primary keys.

Let us know how that works out.

Sundar Mudupalli

@sreeti-JET

Please find the error log attached.
errorlog.sql.zip

@helensilva14
Contributor

helensilva14 commented Jul 20, 2023

Hi @sreeti-JET @abhilash-JET! Could you please try to use the CASE block for both the source (Redshift) and target (BQ) queries? Otherwise we will still hit the "Arguments with datatype string and boolean are not comparable" error.

 case is_top_placement
    when true then 'true'
    when false then 'false'
    else null END as is_top_placement_text

And make sure to change the primary keys list to --primary-keys _rest_hash_key,visit_dt,is_top_placement_text,outcode,app_type

@sundar-mudupalli-work
Contributor

sundar-mudupalli-work commented Jul 21, 2023

Hi,

As Helen mentioned, you need to make the change on both ends. I am attaching files that create the tables with the same schema, populate them with 4 rows, and run your queries modified with the suggested changes. These run correctly in our environment with the 3.20 release. This may not work if you just cloned the development branch (see issue #909). The sequence of commands we used is shown below.

You will need to edit the files, change their extensions to .sql, and update the commands to match your file and table names and your Redshift and BQ locations:

red_rm.txt
create_tbl_rd.txt
bq_rm.txt
create_tbl_bq.txt

bq query --use_legacy_sql=false < create_tbl_bq.sql
psql -U xxxxxx -d <database-name> -h xxxxxxxxxxxxxx.us-west-2.redshift.amazonaws.com -p 5439 -f create_tbl_rd.sql
data-validation validate custom-query row -tc=bq -sc=Redshift_CONN_mudupalli --target-query-file ../scripts/bq_rm.sql --source-query-file ../scripts/red_rm.sql -pk=_rest_hash_key,is_top_placement_text,outcode,app_type --hash '*'

Please try it out and let us know what you find out.

@sreeti-JET

This is working as expected. Thanks, team, for your help. Could you please let us know when the two issues below will be fixed?

  1. hash_diff is generated differently in RS and BQ when special characters are present.
  2. Datatype mismatch during validation: error "datatype float64 and string are not comparable".

@nehanene15
Collaborator

@sreeti-JET Thanks for the update. In that case, we will close this issue.

As for your other issues,

  1. What special characters cause the mismatch? If you run the SQL commands generated by data-validation -v validate row ... on each respective database, are the results different? A mismatch may be expected if the hash values of the special characters differ between RS and BQ. After testing, if the hash values do match, please open a separate issue for this with the details.

  2. Please open a separate issue for this with the following details: Tool version, command used to reproduce, the data types you are comparing, and stack trace.

Thanks.
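For what it's worth, one common cause of cross-engine hash mismatches on "identical-looking" data is Unicode normalization: the same visible string can be stored as different code-point sequences, which hash differently. A minimal Python sketch (illustrative only, not DVT code):

```python
import hashlib
import unicodedata

# Two strings that both render as "café" but use different code points:
# NFC stores the accented e as one composed character, NFD as "e" + combining accent.
nfc = unicodedata.normalize("NFC", "caf\u00e9")   # 'é' as a single code point
nfd = unicodedata.normalize("NFD", "cafe\u0301")  # 'e' followed by U+0301

print(nfc == nfd)  # False: different code points despite identical display
print(hashlib.sha256(nfc.encode("utf-8")).hexdigest() ==
      hashlib.sha256(nfd.encode("utf-8")).hexdigest())  # False: hashes diverge
```

If the two engines (or the load pipeline) normalize or encode text differently, the concatenated row strings differ byte-for-byte, and so do their hashes, even though the data looks the same in query results.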

5 participants