Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

row validation: Long text are being cropped to 30 characters for hash validation on SQL Server #990

Closed
nj1973 opened this issue Sep 13, 2023 · 15 comments
Assignees
Labels
priority: p1 High priority. Fix may be included in the next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@nj1973
Copy link
Contributor

nj1973 commented Sep 13, 2023

When running hash validations columns are cast to VARCHAR, then concatenated and finally a hash is produced.

We've had a customer issue reported for nvarchar(500) and nvarchar(2000) columns. The cast to varchar uses a length of 30 by default so is trimming the text. We should consider casting to varchar(max) instead (or another technique if that works out to be easier).

Varchar docs: https://learn.microsoft.com/en-us/sql/t-sql/data-types/char-and-varchar-transact-sql?view=sql-server-ver16
Cast docs: https://learn.microsoft.com/en-us/sql/t-sql/functions/cast-and-convert-transact-sql?view=sql-server-ver16

Examples from sqlcmd

Notice the x (at position 31) is cropped:

1> select concat(cast('123456789012345678901234567890x' as varchar),cast('123456789012345678901234567890x' as varchar));
2> go

------------------------------------------------------------
123456789012345678901234567890123456789012345678901234567890

Notice the x (at position 31) is included:

1> select concat(cast('123456789012345678901234567890x' as varchar(max)),cast('123456789012345678901234567890x' as varchar(max)));
2> go

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
123456789012345678901234567890x123456789012345678901234567890x
@nj1973 nj1973 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Sep 13, 2023
@nehanene15 nehanene15 added the priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. label Sep 14, 2023
@nehanene15
Copy link
Collaborator

This will involve updating the following line to reference VARCHAR(max) instead: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/third_party/ibis/ibis_mssql/datatypes.py#L34

@nj1973
Copy link
Contributor Author

nj1973 commented Sep 15, 2023

I've attempted to reproduce this with these columns:

,   col_nvarchar      nvarchar(max)
,   col_nvarchar_500  nvarchar(500)

And I do not see any CAST to VARCHAR in a row validation SQL:

data-validation -v validate row -sc sqlserver -tc sqlserver -tbls dvt_test.dvt_mssql2pg_types \
--primary-keys=id --hash='id,col_nvarchar_500,col_nvarchar'
...
09/15/2023 10:03:11 AM-INFO: -- ** Source Query ** --
09/15/2023 10:03:11 AM-INFO: WITH t0 AS
(SELECT t6.id AS id, t6.col_varchar AS col_varchar, t6.col_nvarchar AS col_nvarchar, t6.col_nvarchar_500 AS col_nvarchar_500, t6.col_nchar_2 AS col_nchar_2, t6.col_ntext AS col_ntext, t6.col_smallmoney AS col_smallmoney, t6.col_money AS col_money, t6.col_smalldatetime AS col_smalldatetime, t6.col_time AS col_time, t6.col_uuid AS col_uuid, CAST(t6.id AS VARCHAR) AS cast__id, t6.col_nvarchar AS cast__col_nvarchar, t6.col_nvarchar_500 AS cast__col_nvarchar_500
FROM dvt_test.dvt_mssql2pg_types AS t6),
t1 AS
...

@divyapandian5
Copy link

Hi @nj1973 Attaching the ddl which has source and target table definitions. Fields like Description , shortDescription is where we saw issues. Also attaching command i used and the source and target queries

toReplicate.zip

@nj1973
Copy link
Contributor Author

nj1973 commented Sep 18, 2023

I've been able to reproduce the problem, thanks for the DDL.
I created this table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);

If I validate ShortDescription I get the cast to VARCHAR. If I validate ShortDescription2 I do not.

The presence of the NOT NULL constraint is triggering the problem, presumably because the data type is then prefixed with !. I haven't yet been able to figure out where in the code the cast is included/bypassed based on data type.

@helensilva14 helensilva14 added priority: p1 High priority. Fix may be included in the next release. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Oct 3, 2023
@helensilva14 helensilva14 assigned nj1973 and unassigned renzokuken Oct 3, 2023
@nj1973
Copy link
Contributor Author

nj1973 commented Oct 4, 2023

Using dbo.tblIssue990 I saved output to a config file and the contents were identical for ShortDescription vs ShortDescription2, except for column names, therefore the problem is not in the construction of the Ibis expressions, it must come later in the process.

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 4, 2023

The cast is bypassed for the not null column in this code in ibis_addon/api.py:

def cast(self, target_type: dt.DataType) -> Value:
    """Override ibis.expr.api's cast method.
    This allows for Timestamp-typed columns to be cast to Timestamp, since Ibis interprets some similar but non-equivalent types (eg. DateTime) to Timestamp (GitHub issue #451).
    """
    # validate
    op = ops.Cast(self, to=target_type)

    if op.to == self.type() and not op.to.is_timestamp():
        # noop case if passed type is the same
        return self
...

In the test case both op.to and self.type() are "string" for ShortDescription2. But for ShortDescription self.type() is "!string" which introduces an unnecessary cast.

This makes me think we actually have two problems here:

  1. Nullable and non-nullable should not be treated differently just because there's a ! in the data type
  2. Irrespective of 1 we still need a MAX in the cast to VARCHAR because, in theory, we could be casting a numeric value with more than 30 digits to VARCHAR which would also be truncated

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 4, 2023

I think @renzokuken was on the right lines with the change on this branch based on this comment in SQL Alchemy code in sqlalchemy/dialects/mssql/base.py:

MAX on VARCHAR / NVARCHAR
-------------------------

SQL Server supports the special string "MAX" within the
:class:`_types.VARCHAR` and :class:`_types.NVARCHAR` datatypes,
to indicate "maximum length possible".   The dialect currently handles this as
a length of "None" in the base type, rather than supplying a
dialect-specific version of these types, so that a base type
specified such as ``VARCHAR(None)`` can assume "unlengthed" behavior on
more than one backend without using dialect-specific types.

Therefore it is strange that the change is not having the desired effect.

In sqlalchemy/dialects/mssql/base.py class MSTypeCompiler there is a visit_VARCHAR method which uses "max" if a type has no length.

However, our casts appear to go through sqlalchemy/sql/compiler.py class GenericTypeCompiler.visit_VARCHAR which does not support "max".

I've not yet figured out why this is.

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 6, 2023

I have added data to my test case and a BigQuery table:

CREATE TABLE [dbo].[tblIssue990](
    [ID] [bigint] IDENTITY(1,1) NOT FOR REPLICATION NOT NULL,
    [ShortDescription] [nvarchar](1000) NOT NULL,
    [ShortDescription2] [nvarchar](1000)
);
INSERT INTO [dbo].[tblIssue990] ([ShortDescription],[ShortDescription2])
VALUES ('1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

BigQuery:

create table dvt_test.tblissue990
(id int64,ShortDescription string,ShortDescription2 string);
insert into dvt_test.tblissue990
values (1,'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890',
'1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890');

I then ran a row validation and it succeeded, with data > 100 characters:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --hash='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │ source_agg_value                                                 │ target_agg_value                                                 │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ hash__all         │ Row               │ dbo.tblIssue990     │ hash__all            │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │ e25df97cf5142ab3afe61be0e6c14870bccc709499ff8c43888afabba212ed0d │                  │ success             │ 39153d4d-fbe3-4a67-a479-d28fccc6d14c │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

And concat shows the full data returned:

data-validation -v validate row -sc sqlserver -tc bq -tbls dbo.tblIssue990=dvt_test.tblissue990 --primary-keys=id --concat='ID,ShortDescription'

╒═══════════════════╤═══════════════════╤═════════════════════╤══════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════╤══════════════════╤═════════════════════╤══════════════════════════════════════╕
│ validation_name   │ validation_type   │ source_table_name   │ source_column_name   │                                                                                      source_agg_value │                                                                                      target_agg_value │ pct_difference   │ validation_status   │ run_id                               │
╞═══════════════════╪═══════════════════╪═════════════════════╪══════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════╪══════════════════╪═════════════════════╪══════════════════════════════════════╡
│ concat__all       │ Row               │ dbo.tblIssue990     │ concat__all          │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │ 11234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 │                  │ success             │ ced960b8-d3d2-4c87-bc08-00a9933b75d0 │
╘═══════════════════╧═══════════════════╧═════════════════════╧══════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════╧══════════════════╧═════════════════════╧══════════════════════════════════════╛

I do still see incorrect SQL on screen, CAST(t6."ShortDescription" AS VARCHAR) but I think that is a different issue. We need to go back to the beginning and try to re-understand the actual problem.

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 11, 2023

I've added a SQL Server specific cast override which ensures we include the MAX keyword. But seeing as I cannot reproduce the issue myself I've asked the original reporter to test.

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 11, 2023

As for the different behaviour for nullable vs non-null columns, this is caused by this expression in ibis_addon/api.py:

    if op.to == self.type() and not op.to.is_timestamp():
         # noop case if passed type is the same
         return self

self.type() is an Ibis type expression that internally holds a value prefixed with ! for not-null columns, e.g. !string for String columns.
op.to is always the nullable variant.
Because of this only nullable columns drop into the no-op branch and bypass a cast to string.

We could fix this with a simple hack like this:

     """
+    def same_type(from_type, to_type) -> bool:
+        # The data type for Non-nullable columns if prefixed with "!", this is causing deviations
+        # between nullable and non-nullable columns. The second comparison below is catering for this.
+        return bool(
+            from_type == to_type
+            or str(from_type).lstrip("!") == str(to_type).lstrip("!")
+        )
+
     # validate
     op = ops.Cast(self, to=target_type)

-    if op.to == self.type() and not op.to.is_timestamp():
+    if same_type(op.to, self.type()) and not op.to.is_timestamp():
         # noop case if passed type is the same

but this feels a bit of a hack. I wonder if the exclamation mark should be dropped earlier in the process, we only use it for schema validation anyway.

Therefore I propose not to make the above hacky change and would like some input from @nehanene15.

@nehanene15
Copy link
Collaborator

I think op.to would be equivalent to the target_type provided in the function parameters:
def cast(self, target_type: dt.DataType) -> Value

So I think the issue lies within our cast to 'string' when we do row hash validation here rather than to '!string ' if the original column is required. We assume that the cast will account for checking the same type regardless of nullability.

Due to this, I think your approach makes the most sense.

@divyapandian5
Copy link

divyapandian5 commented Oct 17, 2023 via email

@nj1973
Copy link
Contributor Author

nj1973 commented Oct 19, 2023

I think validating this data is going to be very challenging.

Looking through the validation failures there are plenty of multibyte characters to deal with in there, it looks like this was intentional based on the surrounding text. For example ID 1125 has a check mark which is being lost on the SQL Server side, presumably when we cast to VARCHAR. ID 1192 has a pumpkin emoji, I can't imagine what that looks like when cast to VARCHAR.

I can see that we would like to support the full range of unicode symbols but to do that we need to take a step back and consider our comparison expressions across all SQL engines, e.g. BigQuery, Spanner, Oracle, MySQL, etc in addition to SQL Server and PostgreSQL.

I think I need to split this issue into two issues:

  1. Inconsistency between nullable and non-nullable columns. There is a fix for this on this branch.
  2. Full unicode support and testing

There is also a "fix" on this branch for the initially reported issue of the missing MAX keyword when casting to VARCHAR. I'm not sure if that is actually needed or not.

@helensilva14 helensilva14 added priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. and removed priority: p1 High priority. Fix may be included in the next release. labels Oct 19, 2023
@helensilva14 helensilva14 added priority: p1 High priority. Fix may be included in the next release. and removed priority: p0 Highest priority. Critical issue. Will be fixed prior to next release. labels Nov 3, 2023
@nj1973
Copy link
Contributor Author

nj1973 commented Nov 8, 2023

Spun off the null vs not-null change to this issue: #1036

@nj1973
Copy link
Contributor Author

nj1973 commented Nov 10, 2023

Abandoning the issue for the time being as it is not clear if we need the MAX keyword in casts or not. I could not reproduce a problem.

@nj1973 nj1973 closed this as completed Nov 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p1 High priority. Fix may be included in the next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

5 participants