Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for model contracts in v1.5 #148

Merged
merged 14 commits into from
Apr 26, 2023

Conversation

jtcohen6
Copy link
Contributor

@jtcohen6 jtcohen6 commented Apr 20, 2023

Overview

There are two main components of "model contracts" in v1.5:

  1. A "pre-flight" check that dbt will perform, before creating a view or table, to validate that column names and data types for the query match the ones defined in yaml
  2. Support for platform-specific constraints, included in the templated DDL during table creation

(2) is pretty trivial to implement, since DuckDB supports constraints with a standard syntax. What's very cool is that DuckDB actually enforces primary/unique key constraints, which most columnar databases / analytical platforms really don't. It also enforces row-level check constraints. So this could be a true swap-out for not_null, unique, accepted_values tests.

(1) was a bit trickier. I reimplemented get_column_schema_from_query method, in lieu of implementing data_type_to_code, because the duckdb Python cursor returns the data type as a string, rather than an integer code. The other unfortunate thing is that some of the type precision seems to be lost:

> cursor.description
('my_int', 'NUMBER', None, None, None, None, None)
('my_decimal', 'NUMBER', None, None, None, None, None)
('my_bool', 'bool', None, None, None, None, None)
('my_timestamp', 'DATETIME', None, None, None, None, None)
('my_timestamp_tz', 'DATETIME', None, None, None, None, None)
('my_list_of_strings', 'list', None, None, None, None, None)
('my_list_of_numbers', 'list', None, None, None, None, None)
('my_struct', 'dict', None, None, None, None, None)

That will still catch when users implicitly switch out, e.g., a has_first_order column from being 0/1 to being true/false. But it can't catch subtler distinctions like int/decimal, timestamp with/without tz, and the contents of lists/structs. That degree of precision is maintained by most other database's connection cursors. Very open to ideas - I'm sure others know more about this than I do.

Example

select

    1 as my_int,
    2.0 as my_decimal,
    true as my_bool,
    '2013-11-03 00:00:00-07'::timestamp as my_timestamp,
    '2013-11-03 00:00:00-07'::timestamptz as my_timestamp_tz,
    ['a','b','c'] as my_list_of_strings,
    [1,2,3] as my_list_of_numbers,
    {'bar': 'baz', 'balance': 7.77, 'active': false} as my_struct
models:
  - name: my_model
    config:
      contract:
        enforced: true
    columns:
      - name: my_int
        data_type: int  # decimal also works here - cursor returns as 'NUMBER'
      - name: my_decimal
        data_type: decimal  # int also works here - cursor returns as 'NUMBER'
      - name: my_bool
        data_type: bool
      - name: my_timestamp
        data_type: timestamp  # timestamptz also works here - cursor returns as 'DATETIME'
      - name: my_timestamp_tz
        data_type: timestamptz  # timestamp/datetime also works here - cursor returns as 'DATETIME'
      - name: my_list_of_strings
        data_type: text[]     # int[] also works here - cursor returns as 'list'
      - name: my_list_of_numbers
        data_type: int[]    # text[] also works here - cursor returns as 'list'
      - name: my_struct
        data_type: struct(bar text, balance decimal, active boolean)  # struct(any text) also works here - cursor returns as 'dict'

@jtcohen6 jtcohen6 mentioned this pull request Apr 20, 2023
4 tasks
Comment on lines +207 to +208
# DuckDB doesn't support 'foreign key' as an alias
return f"references {constraint.expression}"
Copy link
Contributor Author

@jtcohen6 jtcohen6 Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually references seems like the more standard syntax. And we're not really doing a good job of supporting FK constraints, anyway:

The majority of columnar db's don't enforce them, so it hasn't felt like a priority. It's neat that DuckDB actually does.

Comment on lines 16 to 19
# DuckDB's cursor doesn't seem to distinguish between:
# INT and NUMERIC/DECIMAL -- both just return as 'NUMBER'
# TIMESTAMP and TIMESTAMPTZ -- both just return as 'DATETIME'
# [1,2,3] and ['a','b','c'] -- both just return as 'list'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would happen in these cases?

Context: dbt is running an "empty" version of the model query (where false limit 0), and extracting column names + data types from the cursor description, before it actually materializes the model. If it detects a diff, it will raise an error like:

08:58:44  Compilation Error in model my_model (models/my_model.sql)
08:58:44    This model has an enforced contract that failed.
08:58:44    Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
08:58:44
08:58:44    | column_name | definition_type | contract_type | mismatch_reason    |
08:58:44    | ----------- | --------------- | ------------- | ------------------ |
08:58:44    | my_col      | NUMBER          | TEXT          | data type mismatch |

If no mismatch is detected, and it's a table, dbt will now materialize it via:

create table {model_name} (my_col int);
insert into table (select {model_sql});

The database's type system is flexible by design, and is willing to do a lot of creative coercion:

D create table my_tbl (id int);
D insert into my_tbl (select 2.1 as id);
D select * from my_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│     2 │
└───────┘

It will still catch the more egregious mismatches, including for structs and lists:

D create or replace table other_tbl (struct_col struct(num int));
D insert into other_tbl (select {'num': 'text'} as struct_col);
Error: Conversion Error: Could not convert string 'text' to INT32
D insert into other_tbl (select ['some', 'strings'] as list_of_int);
D create or replace table other_tbl (list_of_int int[]);
Error: Conversion Error: Could not convert string 'some' to INT32

@jwills
Copy link
Collaborator

jwills commented Apr 20, 2023

Hey Jeremy, thanks for taking a pass at this! ngl that when I told Sung that I was capacity constrained atm I didn't think you were going to be the one to attempt this!

So you've outlined the macro problem here very well-- DuckDB did a very literal interpretation of the DB API spec and does not provide the more detailed type information that ~ every other major database does. The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

There is something of a way around it though, although it's some work to do, and given some other stuff I have going on right now I didn't want to start in on it under a deadline of having it ready in a week. I originally ran into this problem (I needed a more detailed description field from the cursor) when I started working on Buena Vista, my Presto/Postgres Python Proxy Project. After starting out by doing the dumb thing (i.e., just looking at the actual values returned by the fetch calls), I hit upon a better idea: I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

@jtcohen6
Copy link
Contributor Author

I didn't think you were going to be the one to attempt this!

I woke up a bit too early this morning, caught the thread, and had some fun. Always good for me to duck around a bit & feel out what it's like to extend our new stuff.

The consequence of this is that dbt-duckdb would provide a qualitatively worse implementation of model contracts than any of the other adapters, which is unsatisfying for me in about a half-dozen different ways.

Right on - I'm glad we're on the same page. I didn't realize you'd already had a chance to look into this, given the related work over in BV. I won't be offended if we need to scrap a lot of / all of the code that's here.

I could fetch the results using DuckDB's support for Arrow, which does allow me to get the detailed type information I need.

This looks cool!! where there's a Wills, there's a way

Ok - I'm happy to poke around your BV implementation, or let you take it from here - let me know

@jwills
Copy link
Collaborator

jwills commented Apr 20, 2023

Yeah I will eventually get to it, it's really just a question of when, and it's not realistically going to be in the next couple of weeks.

I feel bad about this, as I know folks are starting to actually rely on me to do things with this project (which remember, started as a fun little toy project for me to learn about how databases worked!) in a semi-timely fashion. But for the moment, it's gonna have to do.

@jtcohen6
Copy link
Contributor Author

Heard - and totally agree with your call to ship minimal v1.5.0 compat next week, without model contracts! I still may poke around, for my own fun.

@jwills
Copy link
Collaborator

jwills commented Apr 21, 2023

@Mause apologies for pulling you in here, please feel free to respond next week when you're back at work, but I was wondering if y'all had given any thought to revising the level of detailed type information that the description field returns on the PyConnection? I'm not sure if you saw the hack I came up with to get at this in Buena Vista using the Arrow schema returned on the RecordBatchReader, and I'm debating doing that same trick here in dbt-duckdb vs. maybe just going into the DuckDB python code and trying to come up with a fix there. 🙇

@available.parse(lambda *a, **k: [])
def get_column_schema_from_query(self, sql: str) -> List[Column]:
"""Get a list of the Columns with names and data types from the given sql."""
_, cursor = self.connections.add_select_query(sql)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to talk out loud here a bit-- I've been reading through this code and the code in dbt-core to understand how this is intended to work and how I'm going to square it with DuckDB and with my future plan for environments to enable both local and remote execution of models in dbt-duckdb.

So I'm thinking that I'm going to override both the add_select_query and the data_type_code_to_name method on the DuckDBConnectionManager class, and they will in turn delegate their work to the (singleton) Environment instance associated that is associated with the run. The consequence of making that choice is that, by overriding those two methods, we should not have to override the impl of get_column_schema_from_query in the adapter impl as we are here.

Right now, I have two environment impls in the repo: a local one (the default), and a Buena Vista one that uses the Postgres protocol to talk to a Python server running DuckDB. As we discussed, Buena Vista already does (some of) the work required to translate the data types returned from DuckDB into Postgres-compatible type descriptions using the Arrow interface, which provides more detailed type information than the DBAPI implementation in DuckDB. I will need to flesh out those conversions a bit more so as to support all of the conversions specified in the data_types fixture in the adapter functional tests, and then copy an edited version of that code from the Buena Vista repo into dbt-duckdb repo for the LocalEnvironment implementation to use. I'm not wild about doing this kind of horrible copy-pasta, but it's the only way I can see having both 1) a proper implementation of model contracts in dbt-duckdb and 2) keeping the environments construct that I have totally fallen in love with in-place. The right longer term solution is going to be better/richer type information included in the description field returned by DuckDB's DBAPI implementation, but that's going to be months away in the best case, and patience is not one of my virtues.

Additionally, there is some more source-side plugin work I need to complete (I'm slowly coming to terms with the fact that the beautiful dream of write-side plugins in #143 isn't going to happen until at least 1.5.1) and some other stuff I have going on this week that is going to make completing that body and having everything tested to my satisfaction by Friday is going to be difficult to pull off.

Copy link
Contributor Author

@jtcohen6 jtcohen6 Apr 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be another (more naive) approach, if need be. Rather than trying to pull this off with a single (empty) query, and pull the data types off the cursor description, we could opt for two queries instead:

  • create a temp table from the empty query
  • adapter.get_columns_in_relation (a.k.a. information_schema.columns) - which should return more precise data types

Something like:

{% macro duckdb__get_column_schema_from_query(select_sql) -%}
    {% set tmp_relation = ... %}
    {% call statement('create_tmp_relation') %}
       {{ create_table_as(temporary=True, relation=tmp_relation, compiled_code=get_empty_subquery_sql(select_sql)) }}
    {% endcall %}
    {% set column_schema = adapter.get_columns_in_relation(tmp_relation) %}
    {{ return(column_schema) }}
{% endmacro %}

(It looks like we didn't dispatch that macro - an oversight that we could quickly correct!)

@jwills
Copy link
Collaborator

jwills commented Apr 23, 2023

@jtcohen6 I tremendously appreciate the head start here, and the constraints stuff as-specified looks 💯 . If you don't mind, I'd be happy to hack on the stuff I spec'd out in the comments on your branch, or if you prefer I can grab the patch of this PR and take things in the direction I want them to go in on a net-new branch.

@jtcohen6
Copy link
Contributor Author

Mr Wills, you have the helm

Copy link
Collaborator

@jwills jwills left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think this is ready to go; I may make the 1.5.0 release day after all!


# Taking advantage of yet another amazing DuckDB SQL feature right here: the
# ability to DESCRIBE a query instead of a relation
describe_sql = f"DESCRIBE ({sql})"
Copy link
Collaborator

@jwills jwills Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 after a truly circuitous journey, this ended up being simple and delightful to implement (and in an environment-independent way to boot!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoa! this is very handy

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty sweet

["1", schema_int_type, int_type],
["'1'", string_type, string_type],
["true", "bool", "BOOL"],
["'2013-11-03 00:00:00-07'::timestamp", "TIMESTAMP", "TIMESTAMP"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 and @sungchun12: noting that I could not for the life of me figure out how to configure this so that it would work with the timestamptz typing; the resulting data type in DuckDB will be TIMESTAMP WITH TIMEZONE and I'm not sure if the spaces are messing things up, or something like that?

If either of you have a moment to take a look and figure out how it's supposed to work, that would be great, but if not nbd-- I think we can live without it for 1.5.0.

Copy link
Contributor Author

@jtcohen6 jtcohen6 Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jwills When I add this line, the tests are passing for me locally:

["'2013-11-03 00:00:00-07'::timestamptz", "TIMESTAMPTZ", "TIMESTAMP WITH TIME ZONE"]

order being:

  1. SQL to return a value of this type
  2. type name to put in the yaml file that should match
  3. type name that we expect DuckDB to return (from describe) if there's a mismatch

(I agree this test could do with a clearer assert!)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha-- thank you, will give it a shot locally now!

@jwills
Copy link
Collaborator

jwills commented Apr 26, 2023

having mixed feelings about the fact that my baby has grown to the point where it has a flaky integration test 😭

@jwills jwills merged commit b45982b into duckdb:jwills_rc150 Apr 26, 2023
@jwills
Copy link
Collaborator

jwills commented Apr 26, 2023

Thank you so much for all of the help on this @jtcohen6! Looking forward to getting this in folks hands tomorrow or Friday!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants