Validation throws pandas merge error when source data is empty #1006
Comments
@SwatiT17 Were we able to confirm this is due to lack of data in the dataframes?
More context: this is due to the source dataframe being empty. DVT should add better error handling for these use cases.
We can add the checks once we connect to the dataframes in Ibis here, or in generate_report() in combiner.py.
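A sketch of the kind of pre-merge guard being discussed, assuming plain pandas dataframes at the point where generate_report() would combine them. The function name `check_empty_sources` is hypothetical and not part of DVT:

```python
import pandas as pd

def check_empty_sources(source_df: pd.DataFrame, target_df: pd.DataFrame) -> list:
    """Hypothetical guard: report which side(s) returned no rows
    before any merge is attempted."""
    problems = []
    if source_df.empty:
        problems.append("source query returned 0 rows")
    if target_df.empty:
        problems.append("target query returned 0 rows")
    return problems

# An empty source frame is where trouble starts: pandas infers 'object'
# dtype for its columns, which later conflicts with typed target columns.
src = pd.DataFrame({"pk": []})          # 0 rows
tgt = pd.DataFrame({"pk": [1.0, 2.0]})  # float64 column
print(check_empty_sources(src, tgt))
```

Running a check like this up front would let DVT emit a clear "no data in source" message instead of surfacing a pandas merge error.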
Hi @SwatiT17! Could you please provide updates here about the progress for this issue? Thanks a lot!
@helensilva14 Is the above fix for DVT error handling ready for me to validate? As mentioned above, the issue arises when there is no data in the source, i.e. an empty source dataframe.
@SwatiT17, not yet, I just wanted to check if your team had other updates about it. I think @Raniksingh is taking a look at it; otherwise I can also jump on it on Friday.
I tested with an empty source table and didn't get the error mentioned above. For further analysis, I tested using the same BigQuery connection for both source and target, and didn't get the error either.
@SwatiT17 could you provide the schema of the tables for which the issue occurred? I will use them for testing.
This error may occur when using the "generate-table-partitions" command |
I have the same issue when comparing Teradata and BigQuery with a custom query where both results are empty.
Thanks a lot @sunyanyong and @paolocanaletti for letting us know this problem is still happening. Quoting @nehanene15's comment to recap this issue with our developer team.
I'm unable to reproduce this issue, similar to @dipintimanandhar. I tried a BQ to BQ custom query validation with an empty source table, an empty target table, and both tables empty. Is this error occurring when both tables exist but contain no data? Can you provide an example of the custom query you are running and the schema of the table?
The issue I have is between TD and BQ using a custom query, not BQ to BQ.
Can you provide the full stack trace? I suspect it has to do with table schemas, since the reported error is pandas trying to merge a 'float64' column with an 'object' column. I wasn't able to reproduce with a TD to BQ custom query. I tried TD to BQ with the following TD table definition (NO data added):
DVT command:
source.sql
target.sql
Result, latest DVT version:
If you suspect the issue is with float, I don't understand why you only used varchar and integer in your test. These are the data types in the table. The TD conversion from Timestamp to datetime is mandatory because in this case the source timestamp is not UTC but contains the value in my timezone. The error is in the log:
When instead I compare tables that are not empty, but the custom query returns an empty result on both tables (even though the tables contain only string/varchar columns), I am not able to get the run_id from the resulting dataframe using:
Is this normal?
I did test with float, I just didn't write out all my test cases in the comment.
Target (BQ) SQL:
DVT command:
Can you provide the custom queries themselves and the DVT command? And can you confirm the tables are completely empty, with 0 rows? We may need to get on a call to debug this.
For your second comment: yes, this is normal. If no data matches the custom query criteria, you will get an empty dataframe with no run_id.
I'm able to reproduce this by adding a composite PK with a datetime column as one of the PKs. The issue here is a mismatch in the data types inferred by pandas for the primary key column(s). In this case, BQ inferred a 'datetime64' data type whereas TD inferred an 'object' data type. I was able to fix this issue for date columns in the TD backend. @paolocanaletti
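A minimal pandas illustration of the mismatch described above, and the kind of coercion the fix performs. The coercion shown is an assumption about the general approach (normalizing the object-typed PK to datetime64 before merging), not DVT's actual backend code:

```python
import pandas as pd

# One backend yields a proper datetime64 PK column...
bq = pd.DataFrame({"pk": pd.to_datetime(["2023-01-01"]), "val": [1]})
# ...while the other reads the same dates as strings (object dtype),
# which makes pandas refuse to merge on that key.
td = pd.DataFrame({"pk": ["2023-01-01"], "val": [1]})

# Hypothetical normalization: coerce the object column to datetime64
# so both merge keys share a dtype.
td["pk"] = pd.to_datetime(td["pk"])

merged = bq.merge(td, on="pk", suffixes=("_src", "_tgt"))
print(len(merged))
```

Without the `to_datetime` coercion, the merge raises a ValueError about merging object and datetime64 columns, which matches the family of errors reported in this issue.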
While executing a row validation using a custom query, the validation fails with an unexpected error message.
Error: "You are trying to merge on object and float64 columns. If you wish to proceed you should use pd.concat" occurred while running config file dvt_configs/test.yaml. Skipping it for now.
Command
data-validation -ll ERROR validate custom-query row -sc td_conn -tc bq_conn -sqf dvt_configs/src.sql -tqf dvt_configs/tgt.sql --primary-keys pk -comp-fields=col1,col2,col3,col4,col5 -bqrh test.dvt.bq_validation -c dvt_configs/test.yaml
src.sql
select col1,col2,col3,col4,col5 from src_tbl where pk in(123)
tgt.sql
select col1,col2,col3,col4,col5 from tgt_tbl where pk in(123)