
fix: update BigQuery dependencies to fix group-by results handler #64

Merged: 3 commits merged into develop from issue50-group-by-nan on Aug 19, 2020

Conversation

@tswast (Collaborator) commented on Jul 23, 2020

  • Update dependencies to pick up the NaN handling fix from google-cloud-bigquery (the failure mode is sketched below).
  • Add an integration test for the BigQuery results handler to ensure that NaN values are handled correctly.

Closes #50
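
For context on the failure mode: `NaN` is not a legal JSON token, so when a results DataFrame containing NaN values is serialized into an `insertAll` payload the API returns 400 Bad Request (see the test output below). A minimal sketch of the client-side sanitization one could apply if pinned to an older client library; the column name here is illustrative, not the handler's actual schema:

```python
import pandas

df = pandas.DataFrame({"difference": [-1.0, float("nan")]})

# json.dumps(float("nan")) emits the bare token NaN, which the BigQuery
# streaming insert endpoint rejects as invalid JSON. Casting to object and
# replacing missing values with None lets them serialize as JSON null.
sanitized = df.astype(object).where(pandas.notnull(df), None)

print(sanitized.to_dict(orient="records"))
# [{'difference': -1.0}, {'difference': None}]
```

This PR takes the simpler route of bumping google-cloud-bigquery to a release that includes the NaN handling fix, so the handler itself does not need to sanitize.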

@tswast (Collaborator, Author) commented on Jul 23, 2020

Funnily enough, both tests actually fail with google-cloud-bigquery 1.25.0 (I think because pandas converts None to NaN). It probably makes sense to just delete the second "none" test in that case.
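
To illustrate the pandas behavior mentioned above (a general observation, independent of this handler): in a float column, None is coerced to NaN when the DataFrame is constructed, so the "none" test ends up producing the same NaN payload as the "nan" test:

```python
import pandas

# In a float column, None is coerced to NaN at construction time, so
# [-1.0, None] and [-1.0, float("nan")] produce identical row payloads.
df = pandas.DataFrame({"difference": [-1.0, None]})
print(df["difference"].dtype)                 # float64
print(pandas.isna(df["difference"].iloc[1]))  # True
```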

Successfully installed google-cloud-bigquery-1.25.0
(pso-data-validator) 
# swast @ swast-macbookpro2 in ~/src/pso-data-validator on git:issue50-group-by-nan o [10:10:11] 
$ pytest tests/system/result_handlers/test_bigquery.py   
=========================================== test session starts ===========================================
platform darwin -- Python 3.6.10, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /Users/swast/src/pso-data-validator
collected 2 items                                                                                         

tests/system/result_handlers/test_bigquery.py FF                                                    [100%]

================================================ FAILURES =================================================
__________________________________________ test_execute_with_nan __________________________________________

bigquery_client = <google.cloud.bigquery.client.Client object at 0x7fc07eef2278>
bigquery_dataset_id = 'swast-scratch.data_validator_tests_202007231010oqmhuw'

    def test_execute_with_nan(bigquery_client, bigquery_dataset_id):
        table_id = f"{bigquery_dataset_id}.test_execute_with_nan"
        object_under_test = get_handler(bigquery_client, table_id)
        create_bigquery_results_table(bigquery_client, table_id)
        end = get_now()
        start = end - datetime.timedelta(minutes=1)
        df = pandas.DataFrame(
            {
                "run_id": ["grouped-test"] * 6,
                "start_time": [start] * 6,
                "end_time": [end] * 6,
                "source_table_name": [
                    "test_source",
                    "test_source",
                    _NAN,
                    _NAN,
                    "test_source",
                    "test_source",
                ],
                "source_column_name": [
                    "source_column",
                    "source_column",
                    _NAN,
                    _NAN,
                    "source_column",
                    "source_column",
                ],
                "target_table_name": [
                    "test_target",
                    "test_target",
                    "test_target",
                    "test_target",
                    _NAN,
                    _NAN,
                ],
                "target_column_name": [
                    "target_column",
                    "target_column",
                    "target_column",
                    "target_column",
                    _NAN,
                    _NAN,
                ],
                "validation_type": ["GroupedColumn"] * 6,
                "aggregation_type": ["count"] * 6,
                "validation_name": ["count"] * 6,
                "source_agg_value": ["2", "4", _NAN, _NAN, "6", "8"],
                "target_agg_value": ["1", "3", "5", "7", "8", "9"],
                "group_by_columns": [
                    '{"grp_a": "a", "grp_i": "0"}',
                    '{"grp_a": "a", "grp_i": "1"}',
                    '{"grp_a": "b", "grp_i": "0"}',
                    '{"grp_a": "b", "grp_i": "1"}',
                    '{"grp_a": "c", "grp_i": "0"}',
                    '{"grp_a": "c", "grp_i": "1"}',
                ],
                "difference": [-1.0, -1.0, _NAN, _NAN, _NAN, _NAN],
                "pct_difference": [-50.0, -25.0, _NAN, _NAN, _NAN, _NAN],
            }
        )
>       object_under_test.execute(None, df)

tests/system/result_handlers/test_bigquery.py:138: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data_validation/result_handlers/bigquery.py:59: in execute
    table, result_df
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2577: in insert_rows_from_dataframe
    result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2531: in insert_rows
    return self.insert_rows_json(table, json_rows, **kwargs)
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2675: in insert_rows_json
    timeout=timeout,
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:558: in _call_api
    return call()
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/api_core/retry.py:286: in retry_wrapped_func
    on_error=on_error,
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/api_core/retry.py:184: in retry_target
    return target()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery._http.Connection object at 0x7fc07dd0b550>, method = 'POST'
path = '/projects/swast-scratch/datasets/data_validator_tests_202007231010oqmhuw/tables/test_execute_with_nan/insertAll'
query_params = None
data = '{"rows": [{"json": {"run_id": "grouped-test", "validation_name": "count", "validation_type": "GroupedColumn", "start_...rget_agg_value": "9", "difference": NaN, "pct_difference": NaN}, "insertId": "916ef0a7-5007-45a1-b6a7-11e26a201e94"}]}'
content_type = 'application/json', headers = None, api_base_url = None, api_version = None
expect_json = True, _target_object = None, timeout = None

    def api_request(
        self,
        method,
        path,
        query_params=None,
        data=None,
        content_type=None,
        headers=None,
        api_base_url=None,
        api_version=None,
        expect_json=True,
        _target_object=None,
        timeout=_DEFAULT_TIMEOUT,
    ):
        """Make a request over the HTTP transport to the API.
    
        You shouldn't need to use this method, but if you plan to
        interact with the API using these primitives, this is the
        correct one to use.
    
        :type method: str
        :param method: The HTTP method name (ie, ``GET``, ``POST``, etc).
                       Required.
    
        :type path: str
        :param path: The path to the resource (ie, ``'/b/bucket-name'``).
                     Required.
    
        :type query_params: dict or list
        :param query_params: A dictionary of keys and values (or list of
                             key-value pairs) to insert into the query
                             string of the URL.
    
        :type data: str
        :param data: The data to send as the body of the request. Default is
                     the empty string.
    
        :type content_type: str
        :param content_type: The proper MIME type of the data provided. Default
                             is None.
    
        :type headers: dict
        :param headers: extra HTTP headers to be sent with the request.
    
        :type api_base_url: str
        :param api_base_url: The base URL for the API endpoint.
                             Typically you won't have to provide this.
                             Default is the standard API base URL.
    
        :type api_version: str
        :param api_version: The version of the API to call.  Typically
                            you shouldn't provide this and instead use
                            the default for the library.  Default is the
                            latest API version supported by
                            google-cloud-python.
    
        :type expect_json: bool
        :param expect_json: If True, this method will try to parse the
                            response as JSON and raise an exception if
                            that cannot be done.  Default is True.
    
        :type _target_object: :class:`object`
        :param _target_object:
            (Optional) Protected argument to be used by library callers. This
            can allow custom behavior, for example, to defer an HTTP request
            and complete initialization of the object at a later time.
    
        :type timeout: float or tuple
        :param timeout: (optional) The amount of time, in seconds, to wait
            for the server response.
    
            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.
    
        :raises ~google.cloud.exceptions.GoogleCloudError: if the response code
            is not 200 OK.
        :raises ValueError: if the response content type is not JSON.
        :rtype: dict or str
        :returns: The API response payload, either as a raw string or
                  a dictionary if the response is valid JSON.
        """
        url = self.build_api_url(
            path=path,
            query_params=query_params,
            api_base_url=api_base_url,
            api_version=api_version,
        )
    
        # Making the executive decision that any dictionary
        # data will be sent properly as JSON.
        if data and isinstance(data, dict):
            data = json.dumps(data)
            content_type = "application/json"
    
        response = self._make_request(
            method=method,
            url=url,
            data=data,
            content_type=content_type,
            headers=headers,
            target_object=_target_object,
            timeout=timeout,
        )
    
        if not 200 <= response.status_code < 300:
>           raise exceptions.from_http_response(response)
E           google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/swast-scratch/datasets/data_validator_tests_202007231010oqmhuw/tables/test_execute_with_nan/insertAll: Invalid JSON payload received. Unexpected token.
E           : NaN, "target_table_n
E             ^

../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/_http.py:423: BadRequest
_________________________________________ test_execute_with_none __________________________________________

bigquery_client = <google.cloud.bigquery.client.Client object at 0x7fc07eef2278>
bigquery_dataset_id = 'swast-scratch.data_validator_tests_202007231010oqmhuw'

    def test_execute_with_none(bigquery_client, bigquery_dataset_id):
        table_id = f"{bigquery_dataset_id}.test_execute_with_none"
        object_under_test = get_handler(bigquery_client, table_id)
        create_bigquery_results_table(bigquery_client, table_id)
        end = get_now()
        start = end - datetime.timedelta(minutes=1)
        df = pandas.DataFrame(
            {
                "run_id": ["grouped-test"] * 6,
                "start_time": [start] * 6,
                "end_time": [end] * 6,
                "source_table_name": [
                    "test_source",
                    "test_source",
                    None,
                    None,
                    "test_source",
                    "test_source",
                ],
                "source_column_name": [
                    "source_column",
                    "source_column",
                    None,
                    None,
                    "source_column",
                    "source_column",
                ],
                "target_table_name": [
                    "test_target",
                    "test_target",
                    "test_target",
                    "test_target",
                    None,
                    None,
                ],
                "target_column_name": [
                    "target_column",
                    "target_column",
                    "target_column",
                    "target_column",
                    None,
                    None,
                ],
                "validation_type": ["GroupedColumn"] * 6,
                "aggregation_type": ["count"] * 6,
                "validation_name": ["count"] * 6,
                "source_agg_value": ["2", "4", None, None, "6", "8"],
                "target_agg_value": ["1", "3", "5", "7", "8", "9"],
                "group_by_columns": [
                    '{"grp_a": "a", "grp_i": "0"}',
                    '{"grp_a": "a", "grp_i": "1"}',
                    '{"grp_a": "b", "grp_i": "0"}',
                    '{"grp_a": "b", "grp_i": "1"}',
                    '{"grp_a": "c", "grp_i": "0"}',
                    '{"grp_a": "c", "grp_i": "1"}',
                ],
                "difference": [-1.0, -1.0, None, None, None, None],
                "pct_difference": [-50.0, -25.0, None, None, None, None],
            }
        )
>       object_under_test.execute(None, df)

tests/system/result_handlers/test_bigquery.py:204: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data_validation/result_handlers/bigquery.py:59: in execute
    table, result_df
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2577: in insert_rows_from_dataframe
    result = self.insert_rows(table, rows_chunk, selected_fields, **kwargs)
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2531: in insert_rows
    return self.insert_rows_json(table, json_rows, **kwargs)
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:2675: in insert_rows_json
    timeout=timeout,
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/bigquery/client.py:558: in _call_api
    return call()
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/api_core/retry.py:286: in retry_wrapped_func
    on_error=on_error,
../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/api_core/retry.py:184: in retry_target
    return target()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <google.cloud.bigquery._http.Connection object at 0x7fc07dd0b550>, method = 'POST'
path = '/projects/swast-scratch/datasets/data_validator_tests_202007231010oqmhuw/tables/test_execute_with_none/insertAll'
query_params = None
data = '{"rows": [{"json": {"run_id": "grouped-test", "validation_name": "count", "validation_type": "GroupedColumn", "start_...rget_agg_value": "9", "difference": NaN, "pct_difference": NaN}, "insertId": "1cfec2d8-2715-41df-88ae-3152db100b06"}]}'
content_type = 'application/json', headers = None, api_base_url = None, api_version = None
expect_json = True, _target_object = None, timeout = None

    def api_request(
        self,
        method,
        path,
        query_params=None,
        data=None,
        content_type=None,
        headers=None,
        api_base_url=None,
        api_version=None,
        expect_json=True,
        _target_object=None,
        timeout=_DEFAULT_TIMEOUT,
    ):
        """Make a request over the HTTP transport to the API.
    
        You shouldn't need to use this method, but if you plan to
        interact with the API using these primitives, this is the
        correct one to use.
    
        :type method: str
        :param method: The HTTP method name (ie, ``GET``, ``POST``, etc).
                       Required.
    
        :type path: str
        :param path: The path to the resource (ie, ``'/b/bucket-name'``).
                     Required.
    
        :type query_params: dict or list
        :param query_params: A dictionary of keys and values (or list of
                             key-value pairs) to insert into the query
                             string of the URL.
    
        :type data: str
        :param data: The data to send as the body of the request. Default is
                     the empty string.
    
        :type content_type: str
        :param content_type: The proper MIME type of the data provided. Default
                             is None.
    
        :type headers: dict
        :param headers: extra HTTP headers to be sent with the request.
    
        :type api_base_url: str
        :param api_base_url: The base URL for the API endpoint.
                             Typically you won't have to provide this.
                             Default is the standard API base URL.
    
        :type api_version: str
        :param api_version: The version of the API to call.  Typically
                            you shouldn't provide this and instead use
                            the default for the library.  Default is the
                            latest API version supported by
                            google-cloud-python.
    
        :type expect_json: bool
        :param expect_json: If True, this method will try to parse the
                            response as JSON and raise an exception if
                            that cannot be done.  Default is True.
    
        :type _target_object: :class:`object`
        :param _target_object:
            (Optional) Protected argument to be used by library callers. This
            can allow custom behavior, for example, to defer an HTTP request
            and complete initialization of the object at a later time.
    
        :type timeout: float or tuple
        :param timeout: (optional) The amount of time, in seconds, to wait
            for the server response.
    
            Can also be passed as a tuple (connect_timeout, read_timeout).
            See :meth:`requests.Session.request` documentation for details.
    
        :raises ~google.cloud.exceptions.GoogleCloudError: if the response code
            is not 200 OK.
        :raises ValueError: if the response content type is not JSON.
        :rtype: dict or str
        :returns: The API response payload, either as a raw string or
                  a dictionary if the response is valid JSON.
        """
        url = self.build_api_url(
            path=path,
            query_params=query_params,
            api_base_url=api_base_url,
            api_version=api_version,
        )
    
        # Making the executive decision that any dictionary
        # data will be sent properly as JSON.
        if data and isinstance(data, dict):
            data = json.dumps(data)
            content_type = "application/json"
    
        response = self._make_request(
            method=method,
            url=url,
            data=data,
            content_type=content_type,
            headers=headers,
            target_object=_target_object,
            timeout=timeout,
        )
    
        if not 200 <= response.status_code < 300:
>           raise exceptions.from_http_response(response)
E           google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/swast-scratch/datasets/data_validator_tests_202007231010oqmhuw/tables/test_execute_with_none/insertAll: Invalid JSON payload received. Unexpected token.
E            "5", "difference": NaN, "pct_difference
E                               ^

../../miniconda3/envs/pso-data-validator/lib/python3.6/site-packages/google/cloud/_http.py:423: BadRequest
========================================= short test summary info =========================================
FAILED tests/system/result_handlers/test_bigquery.py::test_execute_with_nan - google.api_core.exceptions...
FAILED tests/system/result_handlers/test_bigquery.py::test_execute_with_none - google.api_core.exception...
============================================ 2 failed in 2.46s ============================================
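
For reference, the call site in the traceback (data_validation/result_handlers/bigquery.py:59) calls `Client.insert_rows_from_dataframe`, as visible in the stack above. A minimal sketch of that pattern, where everything other than the client methods is illustrative:

```python
from google.cloud import bigquery
import pandas


def write_results(client: bigquery.Client, table_id: str, result_df: pandas.DataFrame) -> None:
    """Stream a results DataFrame into an existing BigQuery table."""
    table = client.get_table(table_id)
    # insert_rows_from_dataframe returns one list of row errors per chunk;
    # an empty list for every chunk means all rows were accepted.
    chunk_errors = client.insert_rows_from_dataframe(table, result_df)
    if any(chunk_errors):
        raise RuntimeError(f"BigQuery rejected some rows: {chunk_errors}")
```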

@tswast added the kokoro:rebuild Re-run tests label on Jul 23, 2020
@tswast requested a review from dhercher on Jul 23, 2020 at 15:13
@tswast changed the title from "add integration test for BigQuery results handler" to "fix: update BigQuery dependencies to fix group-by results handler" on Jul 23, 2020
@tswast added and removed the kokoro:rebuild Re-run tests label on Jul 23, 2020
@cloud-pso-bot removed the kokoro:rebuild Re-run tests label on Jul 23, 2020
@tswast added the kokoro:rebuild Re-run tests label on Jul 30, 2020
@cloud-pso-bot removed the kokoro:rebuild Re-run tests label on Jul 30, 2020
@tswast added the kokoro:rebuild Re-run tests label on Aug 18, 2020
@cloud-pso-bot removed the kokoro:rebuild Re-run tests label on Aug 18, 2020
@dhercher merged commit 5861514 into develop on Aug 19, 2020
@dhercher deleted the issue50-group-by-nan branch on Aug 19, 2020 at 21:25
Successfully merging this pull request may close these issues.

BUG: tries to insert NaN for source_table_name in GroupedQuery count