Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support "dataframe.query-planning" config in dask.dataframe #15027

Open
19 of 28 tasks
rjzamora opened this issue Feb 12, 2024 · 0 comments
Open
19 of 28 tasks

[FEA] Support "dataframe.query-planning" config in dask.dataframe #15027

rjzamora opened this issue Feb 12, 2024 · 0 comments
Assignees
Labels
dask Dask issue feature request New feature or request

Comments

@rjzamora
Copy link
Member

rjzamora commented Feb 12, 2024

PSA

To unblock CI failures related to the dask-expr migration, down-stream RAPIDS libraries can set the following environment variable in CI (before dask.dataframe/dask_cudf is ever imported):

export DASK_DATAFRAME__QUERY_PLANNING=False

If you do this, please be sure to comment on the change, and link it to this meta issue. (So I can make the necessary changes/fixes, and turn query-planning back on)


Background

The 2024.2.0 release of Dask has deprecated the "legacy" dask.dataframe API. Given that dask-cudf (and much of RAPIDS) is tightly integrated with dask.dataframe, it is critical that dask_cudf be updated to use the new dask_expr backend smoothly.

Most of the heavy lifting is already being done in #14805. However, there will also be some follow-up work to expand coverage/examples/documentation/benchmarks. We will also need to update dask-cuda/explicit-comms.

Action Items

Basics (to be covered by #14805):

  • Add dask-expr DataFrameBackendEntrypoint entrypoint for "cudf"
  • Align top-level dask_cudf imports with dask.dataframe for "dataframe.query-planning" support

Expected Follow-up:

cuDF / Dask cuDF doc build:

cuML support:

cuxfilter support:

cugraph support:

Dask CUDA:

Dask SQL:

  • Migrate predicate pushdown to dask-expr

NeMo Curator:

  • Migrate custom-graph code and test against latest dask/cudf

Merlin:

  • Port Merlin/NVTabular (Heavy lift - Aiming for 24.08)
@rjzamora rjzamora added feature request New feature or request dask Dask issue labels Feb 12, 2024
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this issue Feb 13, 2024
Dask CUDA must use the deprecated `dask.dataframe` API until #1311 and rapidsai/cudf#15027 are both closed. This means that we must explicitly filter the following deprecation warning to avoid nighlty CI failures:

```
DeprecationWarning: The current Dask DataFrame implementation is deprecated. 
In a future release, Dask DataFrame will use new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues 

  import dask.dataframe as dd
```

This PR adds the (temporarily) necessary warning filter.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1312
rapids-bot bot pushed a commit that referenced this issue Mar 11, 2024
Mostly addresses #15027

dask/dask-expr#728 exposed the necessary mechanisms for us to define a custom dask-expr backend for `cudf`. The new dispatching mechanisms are effectively the same as those in `dask.dataframe`. The only difference is that we are now registering/implementing "expression-based" collections.

This PR does the following:
- Defines a basic `DataFrameBackendEntrypoint` class for collection creation, and registers new collections using `get_collection_type`.
- Refactors the `dask_cudf` import structure to properly support the `"dataframe.query-planning"` configuration.
- Modifies CI to test dask-expr support for some of the `dask_cudf` tests. This coverage can be expanded in follow-up work.

~**Experimental Change**: This PR patches `dask_expr._expr.Expr.__new__` to enable type-based dispatching. This effectively allows us to surgically replace problematic `Expr` subclasses that do not work for cudf-backed data. For example, this PR replaces the upstream `TakeLast` expression to avoid using `squeeze` (since this method is not supported by cudf). This particular fix can be moved upstream relatively easily. However, having this kind of "patching" mechanism may be valuable for more complicated pandas/cudf discrepancies.~

## Usage example

```python
from dask import config
config.set({"dataframe.query-planning": True})
import dask_cudf

df = dask_cudf.DataFrame.from_dict(
    {"x": range(100), "y":  [1, 2, 3, 4] * 25, "z": ["1", "2"] * 50},
    npartitions=10,
)
df["y2"] = df["x"] + df["y"]
agg = df.groupby("y").agg({"y2": "mean"})["y2"]
agg.simplify().pprint()
```
Dask cuDF should now be using dask-expr for "query planning":
```
Projection: columns='y2'
  GroupbyAggregation: arg={'y2': 'mean'} observed=True split_out=1'y'
    Assign: y2=
      Projection: columns=['y']
        FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
      Add:
        Projection: columns='x'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
        Projection: columns='y'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
```

## TODO

- [x] Add basic tests
- [x] Confirm that general design makes sense

**Follow Up Work**:

- Expand dask-expr test coverage
- Fix local and upstream bugs
- Add documentation once "critical mass" is reached

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Lawrence Mitchell (https://github.com/wence-)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Ray Douglass (https://github.com/raydouglass)

URL: #14805
@rjzamora rjzamora self-assigned this Mar 19, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 1, 2024
Addresses parts of #15027 (json and s3 testing).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15408
rapids-bot bot pushed a commit that referenced this issue Apr 9, 2024
Related to orc and text support in #15027

Follow-up work can to enable predicate pushdown and column projection with ORC, but the goal of this PR is basic functionality (and parity with the legacy API).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #15439
younseojava pushed a commit to ROCm/dask-cuda-rocm that referenced this issue Apr 16, 2024
Dask CUDA must use the deprecated `dask.dataframe` API until rapidsai#1311 and rapidsai/cudf#15027 are both closed. This means that we must explicitly filter the following deprecation warning to avoid nighlty CI failures:

```
DeprecationWarning: The current Dask DataFrame implementation is deprecated. 
In a future release, Dask DataFrame will use new implementation that
contains several improvements including a logical query planning.
The user-facing DataFrame API will remain unchanged.

The new implementation is already available and can be enabled by
installing the dask-expr library:

    $ pip install dask-expr

and turning the query planning option on:

    >>> import dask
    >>> dask.config.set({'dataframe.query-planning': True})
    >>> import dask.dataframe as dd

API documentation for the new implementation is available at
https://docs.dask.org/en/stable/dask-expr-api.html

Any feedback can be reported on the Dask issue tracker
https://github.com/dask/dask/issues 

  import dask.dataframe as dd
```

This PR adds the (temporarily) necessary warning filter.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: rapidsai#1312
rapids-bot bot pushed a commit that referenced this issue May 1, 2024
Related to #15027

Adds a minor tokenization fix, and adjusts testing for categorical-accessor support.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)

URL: #15591
rapids-bot bot pushed a commit that referenced this issue May 9, 2024
Related to #15027

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #15639
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Dask issue feature request New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

1 participant