Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for generic series limit #16660

Merged
merged 7 commits into from
Sep 16, 2021

Conversation

villebro
Copy link
Member

@villebro villebro commented Sep 10, 2021

SUMMARY

This is the first in a series of multiple PRs that aims to add generic x-axis support. The long-term goal is the following:

  • deprecate is_timeseries - a plugin should explicitly add the temporal column to the groupby if it is used as a dimension. This behavior is not changed in this PR, but removes the need for enabling is_timeseries to be able to limit series count.
  • deprecate timeseries_limit and timeseries_limit_metric with series_columns, series_limit and series_limit_metric.
  • Deprecate groupby in favor of columns.

Existing charts should not be affected - legacy charts will work as before, but V1 charts will emit deprecation warnings when requesting data using the deprecated field names.

Related PR on superset-ui: apache-superset/superset-ui#1356

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov
Copy link

codecov bot commented Sep 10, 2021

Codecov Report

Merging #16660 (40764d8) into master (092ef5b) will decrease coverage by 0.02%.
The diff coverage is 86.15%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16660      +/-   ##
==========================================
- Coverage   76.99%   76.96%   -0.03%     
==========================================
  Files        1007     1007              
  Lines       54133    54155      +22     
  Branches     7374     7374              
==========================================
+ Hits        41678    41681       +3     
- Misses      12215    12234      +19     
  Partials      240      240              
Flag Coverage Δ
hive 81.31% <86.15%> (-0.05%) ⬇️
javascript 71.30% <ø> (ø)
mysql 81.66% <86.15%> (-0.05%) ⬇️
postgres 81.76% <86.15%> (-0.05%) ⬇️
presto 81.65% <86.15%> (-0.01%) ⬇️
python 82.27% <86.15%> (-0.06%) ⬇️
sqlite 81.33% <86.15%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/common/query_actions.py 92.95% <ø> (-0.10%) ⬇️
superset/common/query_context.py 90.74% <ø> (ø)
superset/db_engine_specs/base.py 88.39% <ø> (ø)
superset/connectors/sqla/models.py 87.95% <75.75%> (-1.76%) ⬇️
superset/common/query_object.py 90.62% <96.55%> (-0.05%) ⬇️
superset/charts/schemas.py 100.00% <100.00%> (ø)
superset/db_engine_specs/presto.py 89.95% <0.00%> (-0.42%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 092ef5b...40764d8. Read the comment docs.

@@ -1087,7 +1096,6 @@ def get_sqla_query( # pylint: disable=too-many-arguments,too-many-locals,too-ma
metrics_exprs = []

time_range_endpoints = extras.get("time_range_endpoints")
groupby_exprs_with_timestamp = OrderedDict(groupby_exprs_sans_timestamp.items())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrderedDict is no longer necessary, as dict is guaranteed to be ordered as of Python 3.7.

image

https://docs.python.org/3.8/library/stdtypes.html

and groupby_exprs_sans_timestamp
and dttm_col
):
if db_engine_spec.allows_subqueries and series_limit and groupby_series_columns:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bycatch: I noticed we weren't checking if the engine allows subqueries

subq = subq.where(and_(*(where_clause_and + [inner_time_filter])))
inner_time_filter = []

if dttm_col and not db_engine_spec.time_groupby_inline:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what db_engine_spec.time_groupby_inline does, but I can only assume it relates to temporal filters in subqueries. If this turns out not to be the case I can improve the documentation in a subsequent PR.

Comment on lines -77 to +102
granularity: Optional[str]
columns: List[str]
datasource: Optional[BaseDatasource]
extras: Dict[str, Any]
filter: List[QueryObjectFilterClause]
from_dttm: Optional[datetime]
to_dttm: Optional[datetime]
granularity: Optional[str]
inner_from_dttm: Optional[datetime]
inner_to_dttm: Optional[datetime]
is_rowcount: bool
is_timeseries: bool
time_shift: Optional[timedelta]
groupby: List[str]
metrics: Optional[List[Metric]]
row_limit: int
row_offset: int
filter: List[QueryObjectFilterClause]
timeseries_limit: int
timeseries_limit_metric: Optional[Metric]
order_desc: bool
extras: Dict[str, Any]
columns: List[str]
orderby: List[OrderBy]
post_processing: List[Dict[str, Any]]
datasource: Optional[BaseDatasource]
metrics: Optional[List[Metric]]
result_type: Optional[ChartDataResultType]
is_rowcount: bool
row_limit: int
row_offset: int
series_columns: List[str]
series_limit: int
series_limit_metric: Optional[Metric]
time_offsets: List[str]
time_shift: Optional[timedelta]
to_dttm: Optional[datetime]
post_processing: List[Dict[str, Any]]
Copy link
Member Author

@villebro villebro Sep 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In hindsight I should not have reordered these in this PR to keep it as easily reviewable as possible (my sincere apologies to reviewers!), but having these non-ordered annoyed me so much I couldn't help myself. The point here is I've added series_columns, series_limit and series_limit_metric and removed timeseries_limit and timeseries_limit_metric as they're picked up by kwargs and handled in the deprecated mapping methods.

Comment on lines 960 to 973
"columns": columns,
"from_dttm": from_dttm.isoformat() if from_dttm else None,
"groupby": groupby,
"metrics": metrics,
"row_limit": row_limit,
"row_offset": row_offset,
"to_dttm": to_dttm.isoformat() if to_dttm else None,
"table_columns": [col.column_name for col in self.columns],
"filter": filter,
"columns": [col.column_name for col in self.columns],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I'm renaming columns to table_columns, as columns should refer to the actual query object property (I doubt the old columns property is really being actively used).

Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this! This is long overdue. I'm super excited on seeing what we can do with Superset charts by "downgrading" time columns to a place it really deserves---to be treated the same like any other columns.

elif is_timeseries and metrics:
self.series_columns = columns
else:
self.series_columns = []
Copy link
Member

@ktmud ktmud Sep 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.series_columns = series_columns or (columns if is_timeseries and metrics else [])

Would this be more Pythonic?

I'm also wondering whether we should have another layer of parameter consolidation before QueryObject to handle all the special overrides & fallbacks for legacy/deprecated parameters (e.g. timeseries_limit vs series_limit, columns vs groupby, sortby vs orderby). By isolating parameter consolidation in the Flask view handler layer, we reduce the number of parameters for QueryObject itself and simply all downstream functions, which may help cleaning up deprecated code faster---all without affecting backward compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.series_columns = series_columns or (columns if is_timeseries and metrics else [])

Would this be more Pythonic?

IMO the current implementation is more pythonic/readable, as it doesn't require unnesting the inline logic when reading it. But I'm ok with the proposed snippet, too. I'm curious to hear what @dpgaspar thinks - it would be nice to add something on this to the code style guide.

I'm also wondering whether we should have another layer of parameter consolidation before QueryObject so to handle all the special overrides & fallbacks for legacy/deprecated parameters (e.g. timeseries_limit vs series_limit, columns vs groupby, sortby vs orderby). By isolating parameter consolidation in the Flask view handler layer, we reduce the number of parameters for QueryObject itself and simply all downstream functions, which may help cleaning up deprecated code faster---all without affecting backward compatibility.

I wanted to restrict this PR to the bare minimum amount of changes to contain the blast radius (whenever we're touching this type of core logic there's serious risk of regressions). I absolutely agree we should consolidate more (and should finish the consolidation work before cutting 2.0), but I'd propose doing it across multiple PRs to avoid making one big bang PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong opinions here, but, I do like to use simple ternaries. But current implementation seems more readable to me because of the nested ternary.

@villebro villebro changed the title [WIP] feat: add support for generic series limit feat: add support for generic series limit Sep 15, 2021
series_limit_metric: Optional[Metric] = None,
row_limit: Optional[int] = None,
row_offset: Optional[int] = None,
timeseries_limit: Optional[int] = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timeseries_limit previously defaulted to 15. However, this was always defaulted to 0 in both viz.py and QueryObject, so we never actually used this magic number.

Copy link
Member

@zhaoyongjie zhaoyongjie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@villebro
Copy link
Member Author

Merging this to unblock superset-ui development

@villebro villebro merged commit 836b5e2 into apache:master Sep 16, 2021
@villebro villebro deleted the villebro/generic-series-limit branch September 16, 2021 09:09
@john-bodley
Copy link
Member

john-bodley commented Sep 23, 2021

@villebro any reason you marked these fields as deprecated—my understanding is both the old and new name may coexist in the form-data (thought the deprecated fields are renamed)—and not just rename them and add a migration?

@villebro
Copy link
Member Author

@villebro any reason you marked these fields as deprecated—my understanding is both the old and new name may coexist in the form-data (thought the deprecated fields are renamed)—and not just rename them and add a migration?

This is mostly in preparation for Superset 2.0. One reason for deprecating these is to avoid introducing breaking changes right now (I assume some people may be calling both the old and new chart data endpoints for custom use cases). Another reason I didn't yet want to add a db migration for these was to avoid having to do extensive refactoring on viz.py. But I fully intend for us to do this later this year when we start preparing in earnest for 2.0.

@@ -254,6 +264,14 @@ def validate(
"""Validate query object"""
error: Optional[QueryObjectValidationError] = None
all_labels = self.metric_names + self.column_names
missing_series = [col for col in self.series_columns if col not in self.columns]
if missing_series:
_(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro you forgot to raise the exception here

#16946 fix it

opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
* feat: add support for generic series limit

* refine series_columns logic

* update docs

* bump superset-ui

* add note to UPDATING.md

* remove default value for timeseries_limit
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 28, 2021
* feat: add support for generic series limit

* refine series_columns logic

* update docs

* bump superset-ui

* add note to UPDATING.md

* remove default value for timeseries_limit
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.4.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels preset-io size/L 🚢 1.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants