Add support for model contracts in v1.5 #148

jtcohen6 · 2023-04-20T09:01:59Z

Overview

There are two main components of "model contracts" in v1.5:

1. A "pre-flight" check that dbt will perform, before creating a view or table, to validate that column names and data types for the query match the ones defined in yaml
2. Support for platform-specific constraints, included in the templated DDL during table creation

(2) is pretty trivial to implement, since DuckDB supports constraints with a standard syntax. What's very cool is that DuckDB actually enforces primary/unique key constraints, which most columnar databases / analytical platforms really don't. It also enforces row-level check constraints. So this could be a true swap-out for not_null, unique, accepted_values tests.

(1) was a bit trickier. I reimplemented the get_column_schema_from_query method, in lieu of implementing data_type_code_to_name, because the duckdb Python cursor returns the data type as a string, rather than an integer code. The other unfortunate thing is that some of the type precision seems to be lost:

> cursor.description
('my_int', 'NUMBER', None, None, None, None, None)
('my_decimal', 'NUMBER', None, None, None, None, None)
('my_bool', 'bool', None, None, None, None, None)
('my_timestamp', 'DATETIME', None, None, None, None, None)
('my_timestamp_tz', 'DATETIME', None, None, None, None, None)
('my_list_of_strings', 'list', None, None, None, None, None)
('my_list_of_numbers', 'list', None, None, None, None, None)
('my_struct', 'dict', None, None, None, None, None)

That will still catch when users implicitly switch out, e.g., a has_first_order column from being 0/1 to being true/false. But it can't catch subtler distinctions like int/decimal, timestamp with/without tz, and the contents of lists/structs. That degree of precision is maintained by most other databases' connection cursors. Very open to ideas - I'm sure others know more about this than I do.
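To make the limitation concrete, here is a minimal Python sketch of a check that only sees the coarse type names from the cursor.description output above. The mapping table and the contract_matches helper are hypothetical and written purely for illustration; they are not the code in this PR.

```python
# Hypothetical mapping from declared (yaml) types to the coarse names that
# cursor.description reported in the output above. Illustration only.
COARSE_TYPE = {
    "int": "NUMBER",
    "decimal": "NUMBER",
    "bool": "bool",
    "timestamp": "DATETIME",
    "timestamptz": "DATETIME",
    "text[]": "list",
    "int[]": "list",
}


def contract_matches(declared_type: str, cursor_type: str) -> bool:
    """Return True when the declared type collapses to the cursor's coarse type."""
    return COARSE_TYPE.get(declared_type.lower()) == cursor_type


# A 0/1 column that silently became true/false is caught...
print(contract_matches("int", "bool"))  # False
# ...but int vs. decimal is not, because both collapse to 'NUMBER'.
print(contract_matches("decimal", "NUMBER"))  # True
print(contract_matches("int", "NUMBER"))  # True
```

Any check built on the coarse names has this blind spot, regardless of how the comparison itself is written.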

Example

select

    1 as my_int,
    2.0 as my_decimal,
    true as my_bool,
    '2013-11-03 00:00:00-07'::timestamp as my_timestamp,
    '2013-11-03 00:00:00-07'::timestamptz as my_timestamp_tz,
    ['a','b','c'] as my_list_of_strings,
    [1,2,3] as my_list_of_numbers,
    {'bar': 'baz', 'balance': 7.77, 'active': false} as my_struct

models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: my_int
        data_type: int  # decimal also works here - cursor returns as 'NUMBER'
      - name: my_decimal
        data_type: decimal  # int also works here - cursor returns as 'NUMBER'
      - name: my_bool
        data_type: bool
      - name: my_timestamp
        data_type: timestamp  # timestamptz also works here - cursor returns as 'DATETIME'
      - name: my_timestamp_tz
        data_type: timestamptz  # timestamp/datetime also works here - cursor returns as 'DATETIME'
      - name: my_list_of_strings
        data_type: text[]     # int[] also works here - cursor returns as 'list'
      - name: my_list_of_numbers
        data_type: int[]    # text[] also works here - cursor returns as 'list'
      - name: my_struct
        data_type: struct(bar text, balance decimal, active boolean)  # struct(any text) also works here - cursor returns as 'dict'

jtcohen6 · 2023-04-20T10:02:05Z

dbt/adapters/duckdb/impl.py

+            # DuckDB doesn't support 'foreign key' as an alias
+            return f"references {constraint.expression}"


Actually references seems like the more standard syntax. And we're not really doing a good job of supporting FK constraints, anyway:

[CT-2519] [CT-2442] [Bug] Foreign key constraint templating dbt-labs/dbt-core#7417

[CT-2441] [Feature] Support expression for all constraint types dbt-labs/dbt-core#7416

The majority of columnar db's don't enforce them, so it hasn't felt like a priority. It's neat that DuckDB actually does.
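For readers following along, a rough sketch of how column-level constraints could be rendered into inline DDL, with foreign keys emitted as references. The ColumnConstraint dataclass and render_column_constraint function are hypothetical stand-ins, not dbt's or dbt-duckdb's actual classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnConstraint:
    # Hypothetical stand-in for a constraint object; only the fields used here.
    type: str  # e.g. "not_null", "unique", "primary_key", "foreign_key", "check"
    expression: Optional[str] = None


def render_column_constraint(constraint: ColumnConstraint) -> str:
    """Render one column-level constraint as inline DDL text."""
    if constraint.type == "foreign_key":
        # Inline column constraints use `references <table> (<col>)`.
        return f"references {constraint.expression}"
    if constraint.type == "check":
        return f"check ({constraint.expression})"
    # not_null -> "not null", primary_key -> "primary key", unique -> "unique"
    return constraint.type.replace("_", " ")


print(render_column_constraint(ColumnConstraint("foreign_key", "other_tbl (id)")))
# references other_tbl (id)
```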

jtcohen6 · 2023-04-20T10:24:38Z

tests/functional/adapter/test_constraints.py

+            # DuckDB's cursor doesn't seem to distinguish between:
+            #   INT and NUMERIC/DECIMAL -- both just return as 'NUMBER'
+            #   TIMESTAMP and TIMESTAMPTZ -- both just return as 'DATETIME'
+            #   [1,2,3] and ['a','b','c'] -- both just return as 'list'


What would happen in these cases?

Context: dbt is running an "empty" version of the model query (where false limit 0), and extracting column names + data types from the cursor description, before it actually materializes the model. If it detects a diff, it will raise an error like:

08:58:44 Compilation Error in model my_model (models/my_model.sql)
08:58:44 This model has an enforced contract that failed.
08:58:44 Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
08:58:44
08:58:44 | column_name | definition_type | contract_type | mismatch_reason    |
08:58:44 | ----------- | --------------- | ------------- | ------------------ |
08:58:44 | my_col      | NUMBER          | TEXT          | data type mismatch |
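As a rough illustration of that diff step, here is a minimal Python sketch that compares the yaml-declared columns with the columns reported for the empty query. The contract_mismatches function is hypothetical, and the two "missing in ..." reason strings are placeholders; only "data type mismatch" appears in the error output above.

```python
from typing import Dict, List, Tuple


def contract_mismatches(
    contract: Dict[str, str], actual: Dict[str, str]
) -> List[Tuple[str, str, str, str]]:
    """Compare yaml-declared columns against the columns the empty query returned.

    Each row is (column_name, definition_type, contract_type, mismatch_reason),
    mirroring the columns of the error table above.
    """
    rows = []
    for name in sorted(set(contract) | set(actual)):
        definition_type = actual.get(name, "")
        contract_type = contract.get(name, "")
        if name not in actual:
            rows.append((name, "", contract_type, "missing in definition"))
        elif name not in contract:
            rows.append((name, definition_type, "", "missing in contract"))
        elif definition_type.upper() != contract_type.upper():
            rows.append((name, definition_type, contract_type, "data type mismatch"))
    return rows


print(contract_mismatches({"my_col": "TEXT"}, {"my_col": "NUMBER"}))
# [('my_col', 'NUMBER', 'TEXT', 'data type mismatch')]
```

An empty result would correspond to the case where the contract passes and materialization proceeds.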

If no mismatch is detected, and it's a table, dbt will now materialize it via:

create table {model_name} (my_col int);
insert into {model_name} (select {model_sql});

The database's type system is flexible by design, and is willing to do a lot of creative coercion:

D create table my_tbl (id int);
D insert into my_tbl (select 2.1 as id);
D select * from my_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│     2 │
└───────┘

It will still catch the more egregious mismatches, including for structs and lists:

D create or replace table other_tbl (struct_col struct(num int));
D insert into other_tbl (select {'num': 'text'} as struct_col);
Error: Conversion Error: Could not convert string 'text' to INT32
D create or replace table other_tbl (list_of_int int[]);
D insert into other_tbl (select ['some', 'strings'] as list_of_int);
Error: Conversion Error: Could not convert string 'some' to INT32

jwills · 2023-04-20T14:29:52Z

Hey Jeremy, thanks for taking a pass at this! ngl that when I told Sung that I was capacity constrained atm I didn't think you were going to be the one to attempt this!

So you've outlined the macro problem here very well-- DuckDB did a very literal interpretation of the DB API spec and does not provide the more detailed type information that ~ every other major database does. The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

There is something of a way around it though, although it's some work to do, and given some other stuff I have going on right now I didn't want to start in on it under a deadline of having it ready in a week. I originally ran into this problem (I needed a more detailed description field from the cursor) when I started working on Buena Vista, my Presto/Postgres Python Proxy Project. After starting out by doing the dumb thing (i.e., just looking at the actual values returned by the fetch calls), I hit upon a better idea: I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

jtcohen6 · 2023-04-20T17:43:21Z

> I didn't think you were going to be the one to attempt this!

I woke up a bit too early this morning, caught the thread, and had some fun. Always good for me to duck around a bit & feel out what it's like to extend our new stuff.

> The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

Right on - I'm glad we're on the same page. I didn't realize you'd already had a chance to look into this, given the related work over in BV. I won't be offended if we need to scrap a lot of / all of the code that's here.

> I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

This looks cool!! where there's a Wills, there's a way

Ok - I'm happy to poke around your BV implementation, or let you take it from here - let me know

jwills · 2023-04-20T18:08:47Z

Yeah I will eventually get to it, it's really just a question of when, and it's not realistically going to be in the next couple of weeks.

I feel bad about this, as I know folks are starting to actually rely on me to do things with this project (which remember, started as a fun little toy project for me to learn about how databases worked!) in a semi-timely fashion. But for the moment, it's gonna have to do.

jtcohen6 · 2023-04-20T19:14:10Z

Heard - and totally agree with your call to ship minimal v1.5.0 compat next week, without model contracts! I still may poke around, for my own fun.

jwills · 2023-04-21T21:59:02Z

@Mause apologies for pulling you in here, please feel free to respond next week when you're back at work, but I was wondering if y'all had given any thought to revising the level of detailed type information that the description field returns on the PyConnection? I'm not sure if you saw the hack I came up with to get at this in Buena Vista using the Arrow schema returned on the RecordBatchReader, and I'm debating doing that same trick here in dbt-duckdb vs. maybe just going into the DuckDB python code and trying to come up with a fix there. 🙇

jwills · 2023-04-23T19:10:23Z

dbt/adapters/duckdb/impl.py

+    @available.parse(lambda *a, **k: [])
+    def get_column_schema_from_query(self, sql: str) -> List[Column]:
+        """Get a list of the Columns with names and data types from the given sql."""
+        _, cursor = self.connections.add_select_query(sql)


Going to talk out loud here a bit-- I've been reading through this code and the code in dbt-core to understand how this is intended to work and how I'm going to square it with DuckDB and with my future plan for environments to enable both local and remote execution of models in dbt-duckdb.

So I'm thinking that I'm going to override both the add_select_query and the data_type_code_to_name methods on the DuckDBConnectionManager class, and they will in turn delegate their work to the (singleton) Environment instance that is associated with the run. The consequence of making that choice is that, by overriding those two methods, we should not have to override the impl of get_column_schema_from_query in the adapter impl as we are here.

Right now, I have two environment impls in the repo: a local one (the default), and a Buena Vista one that uses the Postgres protocol to talk to a Python server running DuckDB. As we discussed, Buena Vista already does (some of) the work required to translate the data types returned from DuckDB into Postgres-compatible type descriptions using the Arrow interface, which provides more detailed type information than the DBAPI implementation in DuckDB. I will need to flesh out those conversions a bit more so as to support all of the conversions specified in the data_types fixture in the adapter functional tests, and then copy an edited version of that code from the Buena Vista repo into dbt-duckdb repo for the LocalEnvironment implementation to use. I'm not wild about doing this kind of horrible copy-pasta, but it's the only way I can see having both 1) a proper implementation of model contracts in dbt-duckdb and 2) keeping the environments construct that I have totally fallen in love with in-place. The right longer term solution is going to be better/richer type information included in the description field returned by DuckDB's DBAPI implementation, but that's going to be months away in the best case, and patience is not one of my virtues.

Additionally, there is some more source-side plugin work I need to complete (I'm slowly coming to terms with the fact that the beautiful dream of write-side plugins in #143 isn't going to happen until at least 1.5.1) and some other stuff I have going on this week, so completing that body of work and having everything tested to my satisfaction by Friday is going to be difficult to pull off.

There might be another (more naive) approach, if need be. Rather than trying to pull this off with a single (empty) query, and pull the data types off the cursor description, we could opt for two queries instead:

1. create a temp table from the empty query
2. adapter.get_columns_in_relation (a.k.a. information_schema.columns) - which should return more precise data types

Something like:

{% macro duckdb__get_column_schema_from_query(select_sql) -%}
    {% set tmp_relation = ... %}
    {% call statement('create_tmp_relation') %}
        {{ create_table_as(temporary=True, relation=tmp_relation, compiled_code=get_empty_subquery_sql(select_sql)) }}
    {% endcall %}
    {% set column_schema = adapter.get_columns_in_relation(tmp_relation) %}
    {{ return(column_schema) }}
{% endmacro %}

(It looks like we didn't dispatch that macro - an oversight that we could quickly correct!)

jwills · 2023-04-23T19:12:53Z

@jtcohen6 I tremendously appreciate the head start here, and the constraints stuff as-specified looks 💯 . If you don't mind, I'd be happy to hack on the stuff I spec'd out in the comments on your branch, or if you prefer I can grab the patch of this PR and take things in the direction I want them to go in on a net-new branch.

jtcohen6 · 2023-04-23T20:35:09Z

Mr Wills, you have the helm

jwills

Okay, I think this is ready to go; I may make the 1.5.0 release day after all!

jwills · 2023-04-26T04:31:10Z

dbt/adapters/duckdb/impl.py

+
+        # Taking advantage of yet another amazing DuckDB SQL feature right here: the
+        # ability to DESCRIBE a query instead of a relation
+        describe_sql = f"DESCRIBE ({sql})"


@jtcohen6 after a truly circuitous journey, this ended up being simple and delightful to implement (and in an environment-independent way to boot!)

whoa! this is very handy

this is pretty sweet
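A minimal sketch of the DESCRIBE-based idea, for anyone skimming the thread. It assumes a DB-API style cursor and that the first two fields of each DESCRIBE row are the column name and column type; the FakeCursor class and its canned rows are hypothetical, included only so the snippet runs without DuckDB installed. This is not the adapter's actual implementation.

```python
from typing import Any, List, Sequence, Tuple


def get_column_schema_from_query(cursor: Any, sql: str) -> List[Tuple[str, str]]:
    """Return (column_name, column_type) pairs by describing the query.

    `cursor` is any object with DB-API style execute()/fetchall(). The sketch
    assumes each DESCRIBE row starts with the column name and column type.
    """
    cursor.execute(f"DESCRIBE ({sql})")
    return [(row[0], row[1]) for row in cursor.fetchall()]


class FakeCursor:
    """Stand-in cursor so the sketch runs without DuckDB installed."""

    def __init__(self, rows: Sequence[Sequence[Any]]):
        self._rows = rows
        self.executed: List[str] = []

    def execute(self, sql: str) -> "FakeCursor":
        self.executed.append(sql)
        return self

    def fetchall(self) -> Sequence[Sequence[Any]]:
        return self._rows


# Canned rows in the shape (column_name, column_type, ...); illustrative values.
cursor = FakeCursor(
    [("my_int", "INTEGER", "YES"), ("my_ts", "TIMESTAMP WITH TIME ZONE", "YES")]
)
print(get_column_schema_from_query(cursor, "select 1 as my_int, now() as my_ts"))
# [('my_int', 'INTEGER'), ('my_ts', 'TIMESTAMP WITH TIME ZONE')]
```

Unlike the cursor description, the type strings here keep distinctions such as timestamp with or without time zone, which is what the contract check needs.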

jwills · 2023-04-26T04:34:04Z

tests/functional/adapter/test_constraints.py

+            ["1", schema_int_type, int_type],
+            ["'1'", string_type, string_type],
+            ["true", "bool", "BOOL"],
+            ["'2013-11-03 00:00:00-07'::timestamp", "TIMESTAMP", "TIMESTAMP"],


@jtcohen6 and @sungchun12: noting that I could not for the life of me figure out how to configure this so that it would work with the timestamptz typing; the resulting data type in DuckDB will be TIMESTAMP WITH TIMEZONE and I'm not sure if the spaces are messing things up, or something like that?

If either of you have a moment to take a look and figure out how it's supposed to work, that would be great, but if not nbd-- I think we can live without it for 1.5.0.

@jwills When I add this line, the tests are passing for me locally:

["'2013-11-03 00:00:00-07'::timestamptz", "TIMESTAMPTZ", "TIMESTAMP WITH TIME ZONE"]

order being:

1. SQL to return a value of this type
2. type name to put in the yaml file that should match
3. type name that we expect DuckDB to return (from describe) if there's a mismatch

(I agree this test could do with a clearer assert!)

gotcha-- thank you, will give it a shot locally now!

jwills · 2023-04-26T17:51:33Z

having mixed feelings about the fact that my baby has grown to the point where it has a flaky integration test 😭

jwills · 2023-04-26T22:31:25Z

Thank you so much for all of the help on this @jtcohen6! Looking forward to getting this in folks' hands tomorrow or Friday!

Add support for model contracts in v1.5

b2c74d1

jtcohen6 mentioned this pull request Apr 20, 2023

upgrade to support dbt-core v1.5.0 #147

Closed

4 tasks

jtcohen6 added 2 commits April 20, 2023 11:39

Add references alias for foreign_key

6626430

Fixup create_table_as logic

64f351d

jtcohen6 commented Apr 20, 2023

Add back the test imports I foolishly removed

b5b488e

Reorder logic, fix tests

32a9dfa

jwills reviewed Apr 23, 2023

jwills added 8 commits April 23, 2023 21:20

First up: this is a bit nicer

40e56f1

Just trying to see where this leads me

6f70104

fix that

2aeb1ff

stopping here for the night; need to do most of the localenv stuff tmrw

6fd767c

Renaming/refactoring some things

2465dba

now with format fixes

7d81bd9

let's try whistling this

8d5b133

simplify this back down

259fd3b

jwills reviewed Apr 26, 2023

add in tz-aware timestamp test

5a76e45

jwills merged commit b45982b into duckdb:jwills_rc150 Apr 26, 2023
