Add support for model contracts in v1.5 #148

Merged
14 commits, Apr 26, 2023
dbt/adapters/duckdb/impl.py (33 additions, 0 deletions)

@@ -8,12 +8,15 @@

from dbt.adapters.base import BaseRelation
from dbt.adapters.base.column import Column
from dbt.adapters.base.impl import ConstraintSupport
from dbt.adapters.base.meta import available
from dbt.adapters.duckdb.connections import DuckDBConnectionManager
from dbt.adapters.duckdb.glue import create_or_update_table
from dbt.adapters.duckdb.relation import DuckDBRelation
from dbt.adapters.sql import SQLAdapter
from dbt.contracts.connection import AdapterResponse
from dbt.contracts.graph.nodes import ColumnLevelConstraint
from dbt.contracts.graph.nodes import ConstraintType
from dbt.exceptions import DbtInternalError
from dbt.exceptions import DbtRuntimeError

@@ -22,6 +25,14 @@ class DuckDBAdapter(SQLAdapter):
ConnectionManager = DuckDBConnectionManager
Relation = DuckDBRelation

CONSTRAINT_SUPPORT = {
ConstraintType.check: ConstraintSupport.ENFORCED,
ConstraintType.not_null: ConstraintSupport.ENFORCED,
ConstraintType.unique: ConstraintSupport.ENFORCED,
ConstraintType.primary_key: ConstraintSupport.ENFORCED,
ConstraintType.foreign_key: ConstraintSupport.ENFORCED,
}

@classmethod
def date_function(cls) -> str:
return "now()"
@@ -176,6 +187,28 @@ def get_rows_different_sql(
)
return sql

@available.parse(lambda *a, **k: [])
def get_column_schema_from_query(self, sql: str) -> List[Column]:
"""Get a list of the Columns with names and data types from the given sql."""
_, cursor = self.connections.add_select_query(sql)
Collaborator

Going to talk out loud here a bit-- I've been reading through this code and the code in dbt-core to understand how this is intended to work and how I'm going to square it with DuckDB and with my future plan for environments to enable both local and remote execution of models in dbt-duckdb.

So I'm thinking that I'm going to override both the add_select_query and the data_type_code_to_name methods on the DuckDBConnectionManager class, and have them delegate their work to the (singleton) Environment instance associated with the run. The consequence of that choice is that, by overriding those two methods, we should not have to override get_column_schema_from_query in the adapter impl as we are doing here.

Right now, I have two environment impls in the repo: a local one (the default), and a Buena Vista one that uses the Postgres protocol to talk to a Python server running DuckDB. As we discussed, Buena Vista already does (some of) the work required to translate the data types returned from DuckDB into Postgres-compatible type descriptions using the Arrow interface, which provides more detailed type information than the DBAPI implementation in DuckDB. I will need to flesh out those conversions a bit more so as to support all of the conversions specified in the data_types fixture in the adapter functional tests, and then copy an edited version of that code from the Buena Vista repo into dbt-duckdb repo for the LocalEnvironment implementation to use. I'm not wild about doing this kind of horrible copy-pasta, but it's the only way I can see having both 1) a proper implementation of model contracts in dbt-duckdb and 2) keeping the environments construct that I have totally fallen in love with in-place. The right longer term solution is going to be better/richer type information included in the description field returned by DuckDB's DBAPI implementation, but that's going to be months away in the best case, and patience is not one of my virtues.

Additionally, there is some more source-side plugin work I need to complete (I'm slowly coming to terms with the fact that the beautiful dream of write-side plugins in #143 isn't going to happen until at least 1.5.1), and some other stuff I have going on this week is going to make finishing that work, with everything tested to my satisfaction, by Friday difficult to pull off.

Contributor Author
@jtcohen6 (Apr 23, 2023)

There might be another (more naive) approach, if need be. Rather than trying to pull this off with a single (empty) query, and pull the data types off the cursor description, we could opt for two queries instead:

  • create a temp table from the empty query
  • adapter.get_columns_in_relation (a.k.a. information_schema.columns) - which should return more precise data types

Something like:

{% macro duckdb__get_column_schema_from_query(select_sql) -%}
    {% set tmp_relation = ... %}
    {% call statement('create_tmp_relation') %}
       {{ create_table_as(temporary=True, relation=tmp_relation, compiled_code=get_empty_subquery_sql(select_sql)) }}
    {% endcall %}
    {% set column_schema = adapter.get_columns_in_relation(tmp_relation) %}
    {{ return(column_schema) }}
{% endmacro %}

(It looks like we didn't dispatch that macro - an oversight that we could quickly correct!)

columns = [
# duckdb returns column_type as a string, rather than int (code)
self.Column.create(column_name, column_type_name)
# https://peps.python.org/pep-0249/#description
for column_name, column_type_name, *_ in cursor.description
]
return columns
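The unpacking above follows the PEP 249 cursor.description shape: a sequence of 7-item tuples whose first two fields are the column name and type code. A runnable stand-in using stdlib sqlite3 (used here only because duckdb may not be installed; DuckDB's DBAPI puts a type-name string in the second field, where sqlite3 leaves it as None):

```python
import sqlite3

# sqlite3 is also a PEP 249 DBAPI, so cursor.description has the same shape
# as DuckDB's: 7-item tuples per column, (name, type_code, ...).
conn = sqlite3.connect(":memory:")
# "where 0" returns no rows but still populates the description,
# mirroring dbt's empty-query trick.
cursor = conn.execute("select 1 as id, 'a' as name where 0")

columns = [
    (column_name, column_type_name)
    for column_name, column_type_name, *_ in cursor.description
]
print(columns)  # [('id', None), ('name', None)] -- sqlite3 reports no type_code
```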

@classmethod
def render_column_constraint(cls, constraint: ColumnLevelConstraint) -> Optional[str]:
"""Render the given constraint as DDL text. Should be overriden by adapters which need custom constraint
rendering."""
if constraint.type == ConstraintType.foreign_key:
# DuckDB doesn't support 'foreign key' as an alias
return f"references {constraint.expression}"
Comment on lines +209 to +210
Contributor Author
@jtcohen6 (Apr 20, 2023)

Actually, references seems like the more standard syntax. And we're not really doing a good job of supporting FK constraints anyway:

The majority of columnar db's don't enforce them, so it hasn't felt like a priority. It's neat that DuckDB actually does.

else:
return super().render_column_constraint(constraint)
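To make the override above concrete, here is a dbt-free sketch (the ConstraintType and ColumnLevelConstraint classes below are stand-in stubs, not the real dbt-core types) of how a foreign-key column constraint renders under DuckDB's `references` syntax:

```python
from dataclasses import dataclass
from enum import Enum


class ConstraintType(str, Enum):
    # stub with just the members the sketch needs
    foreign_key = "foreign_key"
    not_null = "not_null"


@dataclass
class ColumnLevelConstraint:
    type: ConstraintType
    expression: str = ""


def render_column_constraint(constraint):
    # Mirror of the adapter override: DuckDB doesn't accept
    # "foreign key" as a column-level alias, so render "references".
    if constraint.type == ConstraintType.foreign_key:
        return f"references {constraint.expression}"
    elif constraint.type == ConstraintType.not_null:
        return "not null"
    return None


fk = ColumnLevelConstraint(ConstraintType.foreign_key, "other_table (id)")
print(render_column_constraint(fk))  # references other_table (id)
```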


# Change `table_a/b` to `table_aaaaa/bbbbb` to avoid duckdb binding issues when relation_a/b
# is called "table_a" or "table_b" in some of the dbt tests
dbt/include/duckdb/macros/adapters.sql (29 additions, 0 deletions)

@@ -28,17 +28,42 @@
{{ return(run_query(sql)) }}
{% endmacro %}

{% macro get_column_names() %}
{# loop through user_provided_columns to get column names #}
{%- set user_provided_columns = model['columns'] -%}
(
{% for i in user_provided_columns %}
{% set col = user_provided_columns[i] %}
{{ col['name'] }} {{ "," if not loop.last }}
{% endfor %}
)
{% endmacro %}
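The macro above just renders a parenthesized, comma-separated list of the contract's column names, which pins the column order of the subsequent insert. An equivalent sketch in plain Python (the model dict shape is an assumption based on the macro's `model['columns']` loop):

```python
def get_column_names(model):
    # model["columns"] maps column key -> column metadata, matching the
    # user_provided_columns iteration in the macro
    names = [col["name"] for col in model["columns"].values()]
    return "(" + ", ".join(names) + ")"


model = {"columns": {"id": {"name": "id"}, "ts": {"name": "ts"}}}
print(get_column_names(model))  # (id, ts)
```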


{% macro duckdb__create_table_as(temporary, relation, compiled_code, language='sql') -%}
{%- if language == 'sql' -%}
{% set contract_config = config.get('contract') %}
{% if contract_config.enforced %}
{{ get_assert_columns_equivalent(compiled_code) }}
{% endif %}
{%- set sql_header = config.get('sql_header', none) -%}

{{ sql_header if sql_header is not none }}

create {% if temporary: -%}temporary{%- endif %} table
{{ relation.include(database=(not temporary and adapter.use_database()), schema=(not temporary)) }}
{% if contract_config.enforced and not temporary %}
{#-- DuckDB doesn't support constraints on temp tables --#}
{{ get_table_columns_and_constraints() }} ;
insert into {{ relation }} {{ get_column_names() }} (
{{ compiled_code }}
);
{%- set compiled_code = get_select_subquery(compiled_code) %}
{% else %}
as (
{{ compiled_code }}
);
{% endif %}
{%- elif language == 'python' -%}
{{ py_write_table(temporary=temporary, relation=relation, compiled_code=compiled_code) }}
{%- else -%}
@@ -62,6 +87,10 @@ def materialize(df, con):
{% endmacro %}

{% macro duckdb__create_view_as(relation, sql) -%}
{% set contract_config = config.get('contract') %}
{% if contract_config.enforced %}
{{ get_assert_columns_equivalent(sql) }}
{%- endif %}
{%- set sql_header = config.get('sql_header', none) -%}

{{ sql_header if sql_header is not none }}
tests/functional/adapter/test_constraints.py (71 additions, 0 deletions)

@@ -0,0 +1,71 @@
import pytest

from dbt.tests.adapter.constraints.test_constraints import (
    BaseTableConstraintsColumnsEqual,
    BaseViewConstraintsColumnsEqual,
    BaseIncrementalConstraintsColumnsEqual,
    BaseConstraintsRuntimeDdlEnforcement,
    BaseConstraintsRollback,
    BaseIncrementalConstraintsRuntimeDdlEnforcement,
    BaseIncrementalConstraintsRollback,
    BaseModelConstraintsRuntimeEnforcement,
)

class DuckDBColumnEqualSetup:
@pytest.fixture
def int_type(self):
return "NUMBER"

@pytest.fixture
def schema_int_type(self):
return "INT"

@pytest.fixture
def data_types(self, schema_int_type, int_type, string_type):
# sql_column_value, schema_data_type, error_data_type
return [
# DuckDB's cursor doesn't seem to distinguish between:
# INT and NUMERIC/DECIMAL -- both just return as 'NUMBER'
# TIMESTAMP and TIMESTAMPTZ -- both just return as 'DATETIME'
# [1,2,3] and ['a','b','c'] -- both just return as 'list'
Contributor Author

What would happen in these cases?

Context: dbt is running an "empty" version of the model query (where false limit 0), and extracting column names + data types from the cursor description, before it actually materializes the model. If it detects a diff, it will raise an error like:

08:58:44  Compilation Error in model my_model (models/my_model.sql)
08:58:44    This model has an enforced contract that failed.
08:58:44    Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.
08:58:44
08:58:44    | column_name | definition_type | contract_type | mismatch_reason    |
08:58:44    | ----------- | --------------- | ------------- | ------------------ |
08:58:44    | my_col      | NUMBER          | TEXT          | data type mismatch |

If no mismatch is detected, and it's a table, dbt will now materialize it via:

create table {model_name} (my_col int);
insert into {model_name} (select {model_sql});

The database's type system is flexible by design, and is willing to do a lot of creative coercion:

D create table my_tbl (id int);
D insert into my_tbl (select 2.1 as id);
D select * from my_tbl;
┌───────┐
│  id   │
│ int32 │
├───────┤
│     2 │
└───────┘

It will still catch the more egregious mismatches, including for structs and lists:

D create or replace table other_tbl (struct_col struct(num int));
D insert into other_tbl (select {'num': 'text'} as struct_col);
Error: Conversion Error: Could not convert string 'text' to INT32
D create or replace table other_tbl (list_of_int int[]);
D insert into other_tbl (select ['some', 'strings'] as list_of_int);
Error: Conversion Error: Could not convert string 'some' to INT32


["1", schema_int_type, int_type],
["'1'", string_type, string_type],
["true", "bool", "bool"],
["'2013-11-03 00:00:00-07'::timestamptz", "timestamptz", "DATETIME"],
["'2013-11-03 00:00:00-07'::timestamp", "timestamp", "DATETIME"],
["ARRAY['a','b','c']", "text[]", "list"],
["ARRAY[1,2,3]", "int[]", "list"],
#["'1'::numeric", "numeric", "DECIMAL"], -- no distinction, noted above
["""{'bar': 'baz', 'balance': 7.77, 'active': false}""", "struct(bar text, balance decimal, active boolean)", "dict"],
]
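The pre-flight check described in the review comment above can be sketched schematically (this is not dbt's actual implementation; it only illustrates the definition-vs-contract comparison that produces the mismatch table in the error output):

```python
def find_mismatches(definition, contract):
    """Compare detected column types against contracted types, returning one
    row per mismatched column: (name, definition_type, contract_type, reason)."""
    rows = []
    for name, def_type in definition.items():
        con_type = contract.get(name)
        if con_type is not None and con_type != def_type:
            rows.append((name, def_type, con_type, "data type mismatch"))
    return rows


definition = {"my_col": "NUMBER"}  # types pulled from the empty-query cursor description
contract = {"my_col": "TEXT"}      # types declared in the model's contract
print(find_mismatches(definition, contract))
# [('my_col', 'NUMBER', 'TEXT', 'data type mismatch')]
```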


class TestTableConstraintsColumnsEqual(DuckDBColumnEqualSetup, BaseTableConstraintsColumnsEqual):
pass


class TestViewConstraintsColumnsEqual(DuckDBColumnEqualSetup, BaseViewConstraintsColumnsEqual):
pass


class TestIncrementalConstraintsColumnsEqual(DuckDBColumnEqualSetup, BaseIncrementalConstraintsColumnsEqual):
pass


class TestTableConstraintsRuntimeDdlEnforcement(DuckDBColumnEqualSetup, BaseConstraintsRuntimeDdlEnforcement):
pass


class TestTableConstraintsRollback(BaseConstraintsRollback):
@pytest.fixture(scope="class")
def expected_error_messages(self):
return ["NOT NULL constraint failed"]


class TestIncrementalConstraintsRuntimeDdlEnforcement(
BaseIncrementalConstraintsRuntimeDdlEnforcement
):
@pytest.fixture(scope="class")
def expected_error_messages(self):
return ["NOT NULL constraint failed"]


class TestIncrementalConstraintsRollback(BaseIncrementalConstraintsRollback):
@pytest.fixture(scope="class")
def expected_error_messages(self):
return ["NOT NULL constraint failed"]

class TestModelConstraintsRuntimeEnforcement(BaseModelConstraintsRuntimeEnforcement):
@pytest.fixture(scope="class")
def expected_error_messages(self):
return ["NOT NULL constraint failed"]