Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split up CUDA-suffixed dependencies in dependencies.yaml #4552

Merged
merged 6 commits into from
Jul 30, 2024

Conversation

jameslamb
Copy link
Member

@jameslamb jameslamb commented Jul 24, 2024

Description

Contributes to rapidsai/build-planning#31

In short, RAPIDS DLFW builds want to produce wheels with unsuffixed dependencies, e.g. cudf depending on rmm, not rmm-cu12.

This PR is part of a series across all of RAPIDS to try to support that type of build by setting up CUDA-suffixed and CUDA-unsuffixed dependency lists in dependencies.yaml.

For more details, see:

Notes for Reviewers

Why target 24.08?

This is targeting 24.08 because:

  1. it should be very low-risk
  2. getting these changes into 24.08 prevents the need to carry around patches for every library in DLFW builds using RAPIDS 24.08

@jameslamb jameslamb added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 24, 2024
@jameslamb jameslamb changed the title WIP: split up CUDA-suffixed dependencies in dependencies.yaml split up CUDA-suffixed dependencies in dependencies.yaml Jul 24, 2024
@jameslamb jameslamb marked this pull request as ready for review July 24, 2024 05:46
@jameslamb jameslamb requested a review from a team as a code owner July 24, 2024 05:46
@jameslamb jameslamb requested a review from bdice July 24, 2024 05:46
@alexbarghi-nv
Copy link
Member

I think we need this applied to rapidsai/cugraph-gnn as well

@jameslamb
Copy link
Member Author

Sure @alexbarghi-nv I can do that! You're right, I'd missed that because cugraph-gnn wasn't in the initial list on rapidsai/build-planning#31. I'll put up a PR with changes like this there.

@jameslamb
Copy link
Member Author

I've updated this based on the suggestions from rapidsai/cudf#16183.

Ran ci/release/update-version.sh '24.10.00' and did not see any other places that need updates.

@jameslamb
Copy link
Member Author

@alexbarghi-nv I created the following in cugraph-gnn:

dependencies.yaml Show resolved Hide resolved
Comment on lines 906 to 908
# NOTE: this is not broken into groups by a 'cuda_suffixed' selector like
# other packages with -cu{nn}x suffixes in this file because devcontainers-based
# builds that build it from source expect a package with a `-cuda{nn}x` suffix
Copy link
Contributor

@bdice bdice Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that build it from source

The antecedent is a little unclear here. "it" refers to cugraph, not cupy here. We never build cupy from source in RAPIDS. Maybe we can simplify the comment like:

Suggested change
# NOTE: this is not broken into groups by a 'cuda_suffixed' selector like
# other packages with -cu{nn}x suffixes in this file because devcontainers-based
# builds that build it from source expect a package with a `-cuda{nn}x` suffix
# cupy is not built from source in devcontainers, so cuda_suffixed isn't needed

or even

Suggested change
# NOTE: this is not broken into groups by a 'cuda_suffixed' selector like
# other packages with -cu{nn}x suffixes in this file because devcontainers-based
# builds that build it from source expect a package with a `-cuda{nn}x` suffix
# cupy is not built from source so cuda_suffixed is not needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My choice to say "devcontainers-based builds" was a bad one. It's very unclear.

I should have said "DLFW builds".

We do not build cupy from source in RAPIDS devcontainers, but we DO build it from source in the RAPIDS DLFW builds. And there, pass the following through to pip wheel.

--build-option="--cupy-package-name=cupy-cuda$(cut -d'.' -f1 <<< "${CUDA_VERSION}")x"

Here's where cupy exposes that option in its configuration:

https://github.com/cupy/cupy/blob/9381230cc55a5bf96bf838a6c6cc077b38fe68d1/install/cupy_builder/_context.py#L81-L83

So it's important, in the DFLW context, that cugraph depend on e.g. "cupy-cuda12x", not "cupy".

Maybe we should change that in DLFW in the future, or maybe the cupy dependency should be broken out into a separate block called like depends_on_cupy here in dependencies.yaml, so that these cuda_suffixed: "false" entries could be removed.

But with code freeze for 24.08 so close, I wasn't looking to change any behavior... just to reflect the current state in the smallest, least controversial changeset possible to get these merged in.

Given all that... would you support leaving this as-is and instead having a comment like the one I'm proposing on cuml's cupy dependency over in rapidsai/cuml#5974?

# NOTE: cupy still has a "-cuda12x" suffix here, because it's suffixed
#       in DLFW builds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are on the same page with everything but the wording of this comment -- the behavior here is what I expect, but I don't think the comment needs to describe only DLFW.

To the best of my knowledge, there is no such pip package called cupy (public or internal/DLFW), so there's never a case where we want cupy to be unsuffixed in requirements or pyproject lists. cupy is only ever used as the conda package name.

Note that for RAPIDS, devcontainers (including outside of DLFW) need the unsuffixed name because that matches the exclusion list.

So either way, nothing here is DLFW-specific, if I understand correctly. We should be able to reword this to not mention DLFW and just say devcontainers. Perhaps the "from source" piece of my previous suggestion is irrelevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think about it and merge whatever you think fits best for the comment. The behavior is what I expect, so I don't want to get stuck on the comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thanks!

I was thinking cugraph went into code freeze today, but I realize that wasn't correct. So we do have a bit more time with this one to make it look the way we long-term want it to. I'll push a change here (tomorrow) with a separated-out depends_on_cupy list and a less-confusing comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need the unsuffixed name because that matches the exclusion list.

And just to note... the exclusion list there includes both the suffixed and unsuffixed versions. That's what this line there does:

pip_noinstall+=("${pkg}" "${pkg}-cu.*");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, cupy is not captured by that exclusion list -- that list is constructed from RAPIDS packages (things in manifest.yaml). cupy is handled in a special case here, where we transform it into:
cupy-cuda[0-9]+x -> cupy-cuda${cuda_version_major}x

This is to work around issues in previous releases' dependencies.yaml that always output cupy-cuda11x instead of outputting the correct version based on the cuda matrix entry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying that, appreciated!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed, updating the comment to this:

# NOTE: This is intentionally not broken into groups by a 'cuda_suffixed' selector like
#       other packages with -cu{nn}x suffixes in this file.
#       All RAPIDS wheel builds (including in devcontainers) expect cupy to be suffixed.

Hopefully that more clearly captures the intent.

...maybe the cupydependency should be broken out into a separate block called like depends_on_cupy here in dependencies.yaml

🤦🏻 that's already how it works here in cugraph, sorry... too many very-similar-looking PRs yesterday.

@jameslamb
Copy link
Member Author

Cancelling CI here (and will on other commits in the next 24 hours) to save some CI capacity for other projects that are about to enter code freeze.

@jameslamb
Copy link
Member Author

conda tests here are failing like this:

/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/raft_dask/common/comms.py:28: in <module>
    from .comms_utils import (
E   ImportError: /opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/raft_dask/common/comms_utils.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN4ucxx6Worker19registerGenericPostESt8functionIFvvEE
full stacktrace (click me)
ImportError while loading conftest '/__w/cugraph/cugraph/python/cugraph-pyg/cugraph_pyg/tests/conftest.py'.
tests/conftest.py:20: in <module>
    from cugraph.dask.comms import comms as Comms
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/__init__.py:14: in <module>
    from cugraph.community import (
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/community/__init__.py:14: in <module>
    from cugraph.community.louvain import louvain
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/community/louvain.py:15: in <module>
    from cugraph.structure import Graph
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/structure/__init__.py:14: in <module>
    from cugraph.structure.graph_classes import (
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/structure/graph_classes.py:15: in <module>
    from .graph_implementation import (
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/structure/graph_implementation/__init__.py:14: in <module>
    from .simpleGraph import simpleGraphImpl
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleGraph.py:14: in <module>
    from cugraph.structure import graph_primtypes_wrapper
graph_primtypes_wrapper.pyx:26: in init cugraph.structure.graph_primtypes_wrapper
    ???
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/dask/__init__.py:16: in <module>
    from .link_analysis.pagerank import pagerank
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/dask/link_analysis/pagerank.py:30: in <module>
    import cugraph.dask.comms.comms as Comms
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/cugraph/dask/comms/comms.py:19: in <module>
    from raft_dask.common.comms import Comms as raftComms
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/raft_dask/common/__init__.py:16: in <module>
    from .comms import Comms, local_handle
/opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/raft_dask/common/comms.py:28: in <module>
    from .comms_utils import (
E   ImportError: /opt/conda/envs/test_cugraph_pyg/lib/python3.9/site-packages/raft_dask/common/comms_utils.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN4ucxx6Worker19registerGenericPostESt8functionIFvvEE

(build link)

Manually retriggering all jobs so builds will be done against latest ucxx / raft nightlies, let's see if that helps.

@alexbarghi-nv
Copy link
Member

@jameslamb this is a known issue. I think @nv-rliu is looking into it.

@jameslamb
Copy link
Member Author

@jameslamb this is a known issue. I think @nv-rliu is looking into it.

Ah ok thanks @alexbarghi-nv ! Then I'll cancel the CI run here, to give back some capacity while it's being worked on.

@nv-rliu if you have an issue or PR, could you link it here so I can follow along?

@jameslamb
Copy link
Member Author

restarted all CI here, to hopefully pick up new artifacts with the fixes from rapidsai/raft#2394 and rapidsai/raft#2398

@jameslamb
Copy link
Member Author

I've cancelled the 3 conda-python-tests jobs, to release some CI capacity back to the rest of RAPIDS.

These have been stuck for about 2 hours. There's some issue causing 1 test to not complete (unrelated to this PR), and it's being temporarily skipped in #4559.

@jameslamb
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 4fb6470 into rapidsai:branch-24.08 Jul 30, 2024
131 checks passed
@jameslamb jameslamb deleted the suffix-split branch July 30, 2024 05:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants