row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

nj1973 · 2023-09-13T17:51:11Z

When running hash validations columns are cast to VARCHAR, then concatenated and finally a hash is produced.

We've had a customer issue reported for nvarchar(500) and nvarchar(2000) columns. The cast to varchar uses a length of 30 by default so is trimming the text. We should consider casting to varchar(max) instead (or another technique if that works out to be easier).

Varchar docs: https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16
Cast docs: https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver16

Examples from sqlcmd

Notice the x (at position 31) is cropped:

1> select concat(cast('123456789012345678901234567890x' as varchar),cast('123456789012345678901234567890x' as varchar));
2> go

------------------------------------------------------------
123456789012345678901234567890123456789012345678901234567890

Notice the x (at position 31) is included:

1> select concat(cast('123456789012345678901234567890x' as varchar(max)),cast('123456789012345678901234567890x' as varchar(max)));
2> go

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
123456789012345678901234567890x123456789012345678901234567890x

The text was updated successfully, but these errors were encountered:

nehanene15 · 2023-09-14T21:00:17Z

This will involve updating the following line to reference VARCHAR(max) instead: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/third_party/ibis/ibis_mssql/datatypes.py#L34

nj1973 · 2023-09-15T10:05:27Z

I've attempted to reproduce this with these columns:

,   col_nvarchar      nvarchar(max)
,   col_nvarchar_500  nvarchar(500)

And I do not see any CAST to VARCHAR in a row validation SQL:

data-validation -v validate row -sc sqlserver -tc sqlserver -tbls dvt_test.dvt_mssql2pg_types \
--primary-keys=id --hash='id,col_nvarchar_500,col_nvarchar'
...
09/15/2023 10:03:11 AM-INFO: -- ** Source Query ** --
09/15/2023 10:03:11 AM-INFO: WITH t0 AS
(SELECT t6.id AS id, t6.col_varchar AS col_varchar, t6.col_nvarchar AS col_nvarchar, t6.col_nvarchar_500 AS col_nvarchar_500, t6.col_nchar_2 AS col_nchar_2, t6.col_ntext AS col_ntext, t6.col_smallmoney AS col_smallmoney, t6.col_money AS col_money, t6.col_smalldatetime AS col_smalldatetime, t6.col_time AS col_time, t6.col_uuid AS col_uuid, CAST(t6.id AS VARCHAR) AS cast__id, t6.col_nvarchar AS cast__col_nvarchar, t6.col_nvarchar_500 AS cast__col_nvarchar_500
FROM dvt_test.dvt_mssql2pg_types AS t6),
t1 AS
...

divyapandian5 · 2023-09-15T10:24:32Z

Hi @nj1973 Attaching the ddl which has source and target table definitions. Fields like Description , shortDescription is where we saw issues. Also attaching command i used and the source and target queries

toReplicate.zip

nj1973 · 2023-09-18T10:28:31Z

I've been able to reproduce the problem, thanks for the DDL.
I created this table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);

If I validate ShortDescription I get the cast to VARCHAR. If I validate ShortDescription2 I do not.

The presence of the NOT NULL constraint is triggering the problem, presumably because the data type is then prefixed with !. I haven't yet been able to figure out where in the code the cast is included/bypassed based on data type.

nj1973 · 2023-10-04T09:19:56Z

Using dbo.tblIssue990 I saved output to a config file and the contents were identical for ShortDescription vs ShortDescription2, except for column names, therefore the problem is not in the construction of the Ibis expressions, it must come later in the process.

nj1973 · 2023-10-04T11:14:32Z

The cast is bypassed for the not null column in this code in ibis_addon/api.py:

def cast(self, target_type: dt.DataType) -> Value:
    """Override ibis.expr.api's cast method.
    This allows for Timestamp-typed columns to be cast to Timestamp, since Ibis interprets some similar but non-equivalent types (eg. DateTime) to Timestamp (GitHub issue #451).
    """
    # validate
    op = ops.Cast(self, to=target_type)

    if op.to == self.type() and not op.to.is_timestamp():
        # noop case if passed type is the same
        return self
...

In the test case both op.to and self.type() are "string" for ShortDescription2. But for ShortDescription self.type() is "!string" which introduces an unnecessary cast.

This makes me think we actually have two problems here:

Nullable and non-nullable should not be treated differently just because there's a ! in the data type
Irrespective of 1 we still need a MAX in the cast to VARCHAR because, in theory, we could be casting a numeric value with more than 30 digits to VARCHAR which would also be truncated

nj1973 · 2023-10-04T12:14:21Z

I think @renzokuken was on the right lines with the change on this branch based on this comment in SQL Alchemy code in sqlalchemy/dialects/mssql/base.py:

MAX on VARCHAR / NVARCHAR
-------------------------

SQL Server supports the special string "MAX" within the
:class:`_types.VARCHAR` and :class:`_types.NVARCHAR` datatypes,
to indicate "maximum length possible".   The dialect currently handles this as
a length of "None" in the base type, rather than supplying a
dialect-specific version of these types, so that a base type
specified such as ``VARCHAR(None)`` can assume "unlengthed" behavior on
more than one backend without using dialect-specific types.

Therefore it is strange that the change is not having the desired effect.

In sqlalchemy/dialects/mssql/base.py class MSTypeCompiler there is a visit_VARCHAR method which uses "max" if a type has no length.

However, our casts appear to go through sqlalchemy/sql/compiler.py class GenericTypeCompiler.visit_VARCHAR which does not support "max".

I've not yet figured out why this is.

nj1973 · 2023-10-06T10:45:02Z

I have added data to my test case and a BigQuery table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);
INSERT INTO [dbo].[tblIssue990] ([ShortDescription],[ShortDescription2])
VALUES ('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

BigQuery:

create table dvt_test.tblissue990
(id int64,ShortDescription string,ShortDescription2 string);
insert into dvt_test.tblissue990
values (1,'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

I then ran a row validation and it succeeded, with data > 100 characters:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --hash='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │ source_agg_value                                                 │ target_agg_value                                                 │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ hash__all         │ Row               │ dbo.tblIssue990     │ hash__all            │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │                  │ success             │ 39153d4d-fbe3-4a67-a479-d28fccc6d14c │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

And concat shows the full data returned:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --concat='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │                                                                                      source_agg_value │                                                                                      target_agg_value │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ concat__all       │ Row               │ dbo.tblIssue990     │ concat__all          │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │                  │ success             │ ced960b8-d3d2-4c87-bc08-00a9933b75d0 │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

I do still see incorrect SQL on screen, CAST(t6."ShortDescription" AS VARCHAR) but I think that is a different issue. We need to go back to the beginning and try to re-understand the actual problem.

nj1973 · 2023-10-11T12:22:01Z

I've added a SQL Server specific cast override which ensures we include the MAX keyword. But seeing as I cannot reproduce the issue myself I've asked the original reporter to test.

nj1973 · 2023-10-11T14:59:47Z

As for the different behaviour for nullable vs non-null columns, this is caused by this expression in ibis_addon/api.py:

    if op.to == self.type() and not op.to.is_timestamp():
         # noop case if passed type is the same
         return self

self.type() is an Ibis type expression that internally holds a value prefixed with ! for not-null columns, e.g. !string for String columns.
op.to is always the nullable variant.
Because of this only nullable columns drop into the no-op branch and bypass a cast to string.

We could fix this with a simple hack like this:

     """
+    def same_type(from_type, to_type) -> bool:
+        # The data type for Non-nullable columns if prefixed with "!", this is causing deviations
+        # between nullable and non-nullable columns. The second comparison below is catering for this.
+        return bool(
+            from_type == to_type
+            or str(from_type).lstrip("!") == str(to_type).lstrip("!")
+        )
+
     # validate
     op = ops.Cast(self, to=target_type)

-    if op.to == self.type() and not op.to.is_timestamp():
+    if same_type(op.to, self.type()) and not op.to.is_timestamp():
         # noop case if passed type is the same

but this feels a bit of a hack. I wonder if the exclamation mark should be dropped earlier in the process, we only use it for schema validation anyway.

Therefore I propose not to make the above hacky change and would like some input from @nehanene15.

nehanene15 · 2023-10-16T17:55:01Z

I think op.to would be equivalent to the target_type provided in the function parameters:
def cast(self, target_type: dt.DataType) -> Value

So I think the issue lies within our cast to 'string' when we do row hash validation here rather than to '!string ' if the original column is required. We assume that the cast will account for checking the same type regardless of nullability.

Due to this, I think your approach makes the most sense.

divyapandian5 · 2023-10-17T06:25:01Z

Thanks Neha, Just that, we are still seeing rows failing while using concat and we could not find the differences during manual comparison. Would help if we know why it's failing on concat. OR if we could resolve the hash, that would help too. The client will be using the tool on their own during their prod migration and would help them if this gets resolved, and we could have them use concat or hash. Divya Veerapandian Cloud Data Architect Global Delivery Center

…

On Mon, Oct 16, 2023 at 11:25 PM Neha Nene ***@***.***> wrote: I think op.to would be equivalent to the target_type provided in the function parameters: def cast(self, target_type: dt.DataType) -> Value So I think the issue lies within our cast to 'string' when we do row hash validation here <https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/data_validation/query_builder/query_builder.py#L378> rather than to '!string ' if the original column is required. We assume that the cast will account for checking the same type regardless of nullability. Due to this, I think your approach makes the most sense. — Reply to this email directly, view it on GitHub <#990 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AU2O4V4NWFPVR5F52TYG2ODX7VYIBAVCNFSM6AAAAAA4WZ7WD2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONRUHE4DSNZXGY> . You are receiving this because you commented.Message ID: <GoogleCloudPlatform/professional-services-data-validator/issues/990/1764989776 @github.com>

nj1973 · 2023-10-19T10:18:21Z

I think validating this data is going to be very challenging.

Looking through the validation failures there are plenty of multibyte characters to deal with in there, it looks like this was intentional based on the surrounding text. For example ID 1125 has a check mark ✓ which is being lost on the SQL Server side, presumably when we cast to VARCHAR. ID 1192 has a pumpkin emoji, I can't imagine what that looks like when cast to VARCHAR.

I can see that we would like to support the full range of unicode symbols but to do that we need to take a step back and consider our comparison expressions across all SQL engines, e.g. BigQuery, Spanner, Oracle, MySQL, etc in addition to SQL Server and PostgreSQL.

I think I need to split this issue into two issues:

Inconsistency between nullable and non-nullable columns. There is a fix for this on this branch.
Full unicode support and testing

There is also a "fix" on this branch for the initially reported issue of the missing MAX keyword when casting to VARCHAR. I'm not sure if that is actually needed or not.

nj1973 · 2023-11-08T14:18:23Z

Spun off the null vs not-null change to this issue: #1036

nj1973 · 2023-11-10T14:34:03Z

Abandoning the issue for the time being as it is not clear if we need the MAX keyword in casts or not. I could not reproduce a problem.

nj1973 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 13, 2023

nehanene15 added the priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. label Sep 14, 2023

helensilva14 assigned renzokuken Sep 21, 2023

helensilva14 added priority: p1 High priority. Fix may be included in the next release. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Oct 3, 2023

helensilva14 assigned nj1973 and unassigned renzokuken Oct 3, 2023

helensilva14 mentioned this issue Oct 10, 2023

row validation: SQL Server --hash on text column throws exception #834

Open

helensilva14 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. and removed priority: p1 High priority. Fix may be included in the next release. labels Oct 19, 2023

helensilva14 added priority: p1 High priority. Fix may be included in the next release. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Nov 3, 2023

nj1973 mentioned this issue Nov 8, 2023

fix: Cast treats nullable and non-nullable columns in the same way #1037

Merged

nj1973 closed this as completed Nov 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

nj1973 commented Sep 13, 2023

nehanene15 commented Sep 14, 2023

nj1973 commented Sep 15, 2023 •

edited

Loading

divyapandian5 commented Sep 15, 2023

nj1973 commented Sep 18, 2023

nj1973 commented Oct 4, 2023

nj1973 commented Oct 4, 2023 •

edited

Loading

nj1973 commented Oct 4, 2023

nj1973 commented Oct 6, 2023

nj1973 commented Oct 11, 2023

nj1973 commented Oct 11, 2023

nehanene15 commented Oct 16, 2023

divyapandian5 commented Oct 17, 2023 via email

nj1973 commented Oct 19, 2023

nj1973 commented Nov 8, 2023

nj1973 commented Nov 10, 2023

row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

Comments

nj1973 commented Sep 13, 2023

nehanene15 commented Sep 14, 2023

nj1973 commented Sep 15, 2023 • edited Loading

divyapandian5 commented Sep 15, 2023

nj1973 commented Sep 18, 2023

nj1973 commented Oct 4, 2023

nj1973 commented Oct 4, 2023 • edited Loading

nj1973 commented Oct 4, 2023

nj1973 commented Oct 6, 2023

nj1973 commented Oct 11, 2023

nj1973 commented Oct 11, 2023

nehanene15 commented Oct 16, 2023

divyapandian5 commented Oct 17, 2023 via email

nj1973 commented Oct 19, 2023

nj1973 commented Nov 8, 2023

nj1973 commented Nov 10, 2023

nj1973 commented Sep 15, 2023 •

edited

Loading

nj1973 commented Oct 4, 2023 •

edited

Loading