
Error on Snowflake binary primary key random row validations #1146

Closed
nj1973 opened this issue May 28, 2024 · 4 comments · Fixed by #1152
Assignees
Labels
priority: p1 High priority. Fix may be included in the next release.

Comments

@nj1973
Contributor

nj1973 commented May 28, 2024

When validating tables using random rows we build a list of primary key literal values and then inject that list into the validation queries as an IN list, for example:

FROM `db-01.dvt_test.dvt_num_pk` t6
WHERE t6.`ID` IN (1,20,100,500,123456)

For tables with binary primary keys we cannot inject binary literals into a SQL statement and therefore convert the binary values to hex first and cast them back to binary in SQL, for example:

FROM `db-01.dvt_test.dvt_bin_pk` t7
WHERE t7.`BINARY_ID` IN (FROM_HEX('4456542d6b65792d34'), FROM_HEX('4456542d6b65792d313132'), 
  FROM_HEX('4456542d6b65792d3236'), FROM_HEX('4456542d6b65792d3130'), FROM_HEX('4456542d6b65792d3531'))
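The hex round-trip can be sketched in plain Python. This is an illustration only, not DVT's actual code; the helper name is hypothetical and `FROM_HEX` is the BigQuery function shown above (other engines use equivalents such as Snowflake's `TO_BINARY(<hex>, 'HEX')`):

```python
def binary_literal(value: bytes) -> str:
    """Render raw bytes as a SQL expression that casts hex back to binary.

    Sketch under assumptions: hypothetical helper, BigQuery FROM_HEX syntax.
    """
    return f"FROM_HEX('{value.hex()}')"

binary_literal(b"DVT-key-4")  # → "FROM_HEX('4456542d6b65792d34')"
```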

We have an issue on Snowflake for binary primary keys when the random row sample size is > 50:

SQL compilation error: In-list contains more than 50 non-constant values

We need a workaround for Snowflake engines.

cc: @henrihem

@nj1973 nj1973 self-assigned this May 28, 2024
@nj1973 nj1973 added the priority: p1 High priority. Fix may be included in the next release. label May 28, 2024
@nj1973 nj1973 linked a pull request May 30, 2024 that will close this issue
@nj1973
Contributor Author

nj1973 commented May 31, 2024

Hi @henrihem, the fix for this issue breaks up the problematic Snowflake IN clauses into a series of IN/OR combinations. I was unable to find a way to use array manipulation, as you suggested, via the framework we use. Plus, this solution is applicable to other SQL engines should we discover similar limitations elsewhere. Let me know if this fix does not resolve your issue.
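The chunking idea can be sketched in plain Python (hypothetical helper, SQL strings for illustration; the real fix operates on Ibis expressions rather than strings):

```python
def chunked_in_predicate(column: str, literals: list, max_items: int = 50) -> str:
    """Split a long IN list into chunks of at most max_items, joined with OR.

    Snowflake rejects IN lists with more than 50 non-constant values,
    so each chunk stays within that cap.
    """
    chunks = [literals[i:i + max_items] for i in range(0, len(literals), max_items)]
    preds = [f"{column} IN ({', '.join(chunk)})" for chunk in chunks]
    return "(" + " OR ".join(preds) + ")"

# 120 expression literals become three IN lists of at most 50 items each.
literals = [f"FROM_HEX('{i:02x}')" for i in range(120)]
predicate = chunked_in_predicate("t7.BINARY_ID", literals)
```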

@henrihem

henrihem commented Jun 3, 2024

Hi @nj1973

There are a few bugs in this fix. I'll report those below.

Once I was able to bypass those bugs, the fix for the described issue works as intended. I was able to run random row validations with 1,000 rows, so the fix itself seems to work.

Issue 1

venv/lib/python3.10/site-packages/data_validation/clients.py", line 27, in <module>
    from third_party.ibis.ibis_cloud_spanner.api import spanner_connect
ModuleNotFoundError: No module named 'third_party'

When clean installing from the develop branch into a venv, third_party is missing completely. I was able to add third_party manually, and after that it works.

Issue 2

Not sure if this is an issue or just an intended limitation for random row validations, but with this fix the limit of random rows looks to be somewhere between 5,000 and 10,000 rows.
With 5,000 random rows it works just fine, but with 10,000 rows it throws:

RecursionError: maximum recursion depth exceeded while calling a Python object

If there is some maximum amount of random rows, that should be documented.

@helensilva14
Contributor

Hi @henrihem, thanks a lot for your detailed comment and continued collaboration with us!

Issue 1 was a side-effect of a configuration file switch that we attempted but have already reverted in #1156. You should not face it anymore; we very much appreciate you letting us know.

About Issue 2, we just merged a bug fix at #1158 that @nj1973 worked on. If you could please take a look and test it, that would be great, thank you!

@nj1973
Copy link
Contributor Author

nj1973 commented Jun 5, 2024

Sorry, to be clear, my bug fix does not solve Issue 2. I just noticed a problem with my original fix while investigating it.

We are hitting a Python recursion limit due to the way Ibis builds OR clauses, and the high number of them caused by Snowflake's 50-item IN list limit when the items are expressions. We won't have this problem for regular IN lists, only ones on binary columns. I think we might have to accept this as a limitation and think about how we document it.
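A minimal stand-in (plain Python, no Ibis) shows why many chained OR clauses can blow the recursion limit: compiling a left-deep expression tree consumes at least one stack frame per OR node, and Python's default limit is 1,000 frames. The class and function names below are toy constructs, not Ibis APIs:

```python
# Toy stand-in for an Ibis-style expression tree (not Ibis itself):
# chaining N predicates left-deep yields a tree of depth N.
class Or:
    def __init__(self, left, right):
        self.left, self.right = left, right

def chain_ors(n: int):
    expr = "in_chunk_0"
    for i in range(1, n):
        expr = Or(expr, f"in_chunk_{i}")
    return expr

def compile_expr(node) -> str:
    # Recursive compilation uses at least one stack frame per OR node;
    # real compilers typically spend several frames per node, so the
    # effective chunk limit sits well below sys.getrecursionlimit().
    if isinstance(node, Or):
        return f"({compile_expr(node.left)} OR {compile_expr(node.right)})"
    return node

sql = compile_expr(chain_ors(100))   # 100 chunks compile fine
# compile_expr(chain_ors(10_000))    # raises RecursionError
```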
