row validation: Long text is being cropped to 30 characters for hash validation on SQL Server #990
This will involve updating the following line to reference VARCHAR(max) instead: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/third_party/ibis/ibis_mssql/datatypes.py#L34
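A minimal sketch of what such a mapping change might look like. This is illustrative only; the function name and signature are hypothetical and not the actual DVT code at that line:

```python
def mssql_cast_target(length=None):
    # Hypothetical helper: choose a SQL Server cast target for string data.
    # A bare VARCHAR in T-SQL defaults to a length of 30, silently cropping
    # longer values, so prefer VARCHAR(max) when no explicit length is given.
    if length is None:
        return "VARCHAR(max)"
    return f"VARCHAR({length})"
```

The key point is simply that the emitted type should never be the bare `VARCHAR`.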
I've attempted to reproduce this with these columns:
And I do not see any CAST to VARCHAR in a row validation SQL:
Hi @nj1973, attaching the DDL which has the source and target table definitions. Fields like Description and shortDescription are where we saw issues. Also attaching the command I used and the source and target queries.
I've been able to reproduce the problem, thanks for the DDL.
If I validate a column with a NOT NULL constraint the problem is triggered, presumably because the data type is then prefixed with `!`.
The cast is bypassed for the NOT NULL column in this code:
In the test case this affects both columns, which makes me think we actually have two problems here:
I think @renzokuken was on the right lines with the change on this branch, based on this comment in the SQLAlchemy code:
Therefore it is strange that the change is not having the desired effect. However, our casts appear to go through a different code path, and I've not yet figured out why this is.
I have added data to my test case and a BigQuery table:
BigQuery:
I then ran a row validation and it succeeded, with data > 100 characters:
And concat shows the full data returned:
I do still see incorrect SQL on screen, though.
I've added a SQL Server-specific cast override which ensures we include the MAX keyword. But seeing as I cannot reproduce the issue myself, I've asked the original reporter to test.
As for the different behaviour for nullable vs non-null columns, this is caused by this expression:
We could fix this with a simple hack like this:
but this feels like a bit of a hack. I wonder if the exclamation mark should be dropped earlier in the process; we only use it for schema validation anyway. Therefore I propose not to make the above hacky change and would like some input from @nehanene15.
I think `op.to` would be equivalent to the `target_type` provided in the function parameters: `def cast(self, target_type: dt.DataType) -> Value`. So I think the issue lies within our cast to 'string' when we do row hash validation here rather than to '!string' if the original column is required. We assume that the cast will account for checking the same type regardless of nullability. Due to this, I think your approach makes the most sense.
Thanks Neha. Just that we are still seeing rows failing while using concat, and we could not find the differences during manual comparison. It would help if we knew why it's failing on concat; or if we could resolve the hash issue, that would help too. The client will be using the tool on their own during their prod migration, and it would help them if this gets resolved so we could have them use concat or hash.
Divya Veerapandian
Cloud Data Architect
Global Delivery Center
I think validating this data is going to be very challenging. Looking through the validation failures, there are plenty of multibyte characters to deal with, and it looks like this was intentional based on the surrounding text. For example, ID 1125 has a check mark. I can see that we would like to support the full range of Unicode symbols, but to do that we need to take a step back and consider our comparison expressions across all SQL engines, e.g. BigQuery, Spanner, Oracle, MySQL, etc., in addition to SQL Server and PostgreSQL. I think I need to split this issue into two issues:
There is also a "fix" on this branch for the initially reported issue of the missing MAX keyword.
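To illustrate the multibyte-character concern from the previous comment: the same logical string can hash differently depending on the byte encoding each engine feeds into its hash function, which is one reason cross-engine hash comparison is hard. A small Python illustration:

```python
import hashlib

text = "check \u2713"  # contains a multibyte check mark character

# Hash the same logical string under two different byte encodings.
utf8_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
utf16_hash = hashlib.sha256(text.encode("utf-16-le")).hexdigest()

print(utf8_hash == utf16_hash)  # False: identical text, different digests
```

Any cross-engine comparison expression would need to pin down (or normalize away) the encoding before hashing.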
Spun off the null vs not-null change to this issue: #1036
Abandoning the issue for the time being as it is not clear if we need the MAX keyword in casts or not. I could not reproduce a problem.
When running hash validations, columns are cast to VARCHAR, then concatenated, and finally a hash is produced.
We've had a customer issue reported for nvarchar(500) and nvarchar(2000) columns. The cast to varchar uses a length of 30 by default, so it is trimming the text. We should consider casting to varchar(max) instead (or another technique if that works out to be easier).
Varchar docs: https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16
Cast docs: https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver16
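The cast-concat-hash pipeline described above can be sketched as follows. This is a simplified illustration, not DVT's actual implementation:

```python
import hashlib

def row_hash(values) -> str:
    # Cast each column value to a string, concatenate, then hash --
    # mirroring the CAST ... CONCAT ... hash expression used in row
    # validation. If the cast target were a bare VARCHAR (default
    # length 30), each piece would be silently truncated before hashing,
    # producing spurious mismatches for long text.
    concatenated = "".join("" if v is None else str(v) for v in values)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()
```

Because the hash is computed over the concatenated strings, any truncation during the cast step changes the digest even though the underlying rows match.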
Examples from sqlcmd.
Notice the `x` (at position 31) is cropped:
Notice the `x` (at position 31) is included:
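The cropping behaviour from the sqlcmd examples can be simulated in Python. This models T-SQL's documented default of 30 characters for `CAST(value AS VARCHAR)` with no explicit length; the helper name is illustrative:

```python
def cast_varchar(value: str, length=None) -> str:
    # Simulates T-SQL CAST(value AS VARCHAR[(length)]): when no length is
    # given, SQL Server defaults to 30 characters and silently truncates.
    if length == "max":
        return value
    return value[: 30 if length is None else length]

s = "a" * 30 + "x"                   # 31 characters; "x" sits at position 31
print(len(cast_varchar(s)))          # 30 -- the trailing "x" is cropped
print(cast_varchar(s, "max")[-1])    # x  -- preserved with VARCHAR(max)
```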