
add build wheel script and accompanying version info #167

Merged (86 commits) on Jan 31, 2024

Conversation

@msarahan (Contributor) commented Jan 17, 2024:

This PR adds wheel builds for ucxx. A follow-up PR will add wheel builds for distributed-ucxx.

Closes #145

@msarahan msarahan requested review from a team as code owners January 17, 2024 19:18
@msarahan msarahan requested a review from a team as a code owner January 17, 2024 19:57
Review threads: python/CMakeLists.txt, cpp/CMakeLists.txt, ci/build_wheel.sh
@pentschev (Member) commented:
@vyasr do we have a guideline on what we normally test (or should test) for wheel builds? Are other projects running the full test suite, or only a subset of what runs for conda? For now I've only added a subset of the more common uses as a reference.

Comment on lines +22 to +26
# TODO: We need distributed installed in developer mode to provide test utils;
# it still needs to match the `rapids-dask-dependency` version.
rapids-logger "Install Distributed in developer mode"
git clone https://github.com/dask/distributed /tmp/distributed
python -m pip install -e /tmp/distributed
Member:
@vyasr we don't need to address this immediately, but do you think it would work if we had optional dependencies for tests in https://github.com/rapidsai/rapids-dask-dependency/blob/ac821e6a3e396340f65fe79dc834f5b711d3b0cb/pip/rapids-dask-dependency/pyproject.toml#L14-L17, where we essentially pip install -e the packages? I don't know whether that's possible (I suspect it isn't), but I'd like to know if you have any ideas about how this could be done. We would also need something similar for conda packages; I currently pip install -e in conda CI as well, which isn't a good long-term solution.
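For reference, a PEP 508 requirement inside an optional extra can point at a VCS source, but it has no way to request an editable install, which is likely the reason this approach falls short. A hypothetical sketch of what such an extra could look like in rapids-dask-dependency's pyproject.toml (the `test` extra name is an assumption):

```toml
[project.optional-dependencies]
test = [
    # A direct reference can pin a VCS source, but PEP 508 cannot express
    # an editable (pip install -e) install, so the source-checkout layout
    # that exposes distributed's test utilities cannot be requested here.
    "distributed @ git+https://github.com/dask/distributed.git",
]
```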

@bdice (Contributor) commented Jan 29, 2024:
I had a question about this as well. Let's discuss in this thread.

@pentschev Can you explain a bit more about this requirement, and how you would resolve the TODO here? It looks like this is required to run the distributed-ucxx tests, but isn't available in the distributed package itself? Depending on installing distributed from source (as opposed to something shipped in the wheel) might make it hard to reproduce CI failures in a local development environment.

Member:
Also adding here what I wrote in the other thread:

Yes, we need some testing dependencies from distributed, specifically

from distributed.comm.tests.test_comms import check_deserialize

I don't know of a good solution for this problem either though and I very much don't want to copy-paste the code in here because that will eventually cause us to get out-of-sync with Distributed code and essentially mean we're not testing it properly anymore.
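One way to act on the "limit or skip" option mentioned later in this thread, without copy-pasting any Distributed code, is to probe whether the internal test module is importable and skip the dependent tests otherwise. This is only a sketch, not part of the PR; the module path is the one quoted above, and the helper name is made up:

```python
import importlib.util


def has_distributed_test_utils() -> bool:
    """Return True if distributed's internal test helpers are importable.

    They are typically only present in a source/editable install of
    distributed, not in the published wheel.
    """
    try:
        # find_spec locates the module without executing it.
        spec = importlib.util.find_spec("distributed.comm.tests.test_comms")
        return spec is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. distributed itself) is missing.
        return False
```

A test could then carry something like `pytest.mark.skipif(not has_distributed_test_utils(), reason="requires a source install of distributed")`, so on a plain wheel install it degrades to a skip instead of an ImportError.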

Contributor:

Do you think dask devs would be open to moving some of the functionality of distributed.comm.tests.test_comms into a utility module that is actually shipped with the package? It sounds like everything that we're doing here is to work around the fact that we're trying to use functionality that they explicitly and intentionally do not ship. Before we look into strange workarounds like this one, is there an alternative that involves them actually shipping the necessary APIs?

Member:

You're right that they intentionally don't ship those utilities, since they're only used in that particular test file. I'd very much like to avoid modifying things there: the file is reasonably large, many functions would need to move, and we would have to bear the burden of properly documenting/annotating functions whose internals I'm not even familiar with, and eventually some of the maintenance as well, which we currently don't have the bandwidth for. If we can't find a reasonably simple way to install Distributed in dev mode, I think I'd prefer to limit or skip that test for now.

Contributor:

I'd rather limit or skip. This logic:

I very much don't want to copy-paste the code in here because that will eventually cause us to get out-of-sync with Distributed code and essentially mean we're not testing it properly anymore.

feels directly contradictory to

we would have to bear the burden of properly documenting/annotating functions whose internals I'm not even familiar with, and eventually some of the maintenance as well, which we currently don't have the bandwidth for

If we don't want to bear the maintenance burden then I'd much rather not try and take on some partial maintenance burden via a hacky installation of distributed here that forces us to stay partially in sync.

Member:

I don't see the contradiction, as the statements are about two different things:

  1. Copy-pasting those functions into distributed-ucxx means they will eventually become out of sync with the actual implementation in Distributed, which is the source of truth;
  2. The maintenance burden I'm referring to is that of those functions in Distributed: we would acquire "some authorship" by requesting that they live as part of the distributed package, and would have to document and properly type-annotate them, etc.

In other words, we still want to use them here for testing purposes only, but I'd very much like to avoid touching much of Distributed's code unless we absolutely need to.

Contributor:

Right, but right now we implicitly incur part of the maintenance burden of point 2, because if dask decides to change the behavior of that function in some way we'll see it in our test. We're not maintaining the function, but we are effectively signing up to keep track of an implementation detail of dask's tests. Conversely, if it were public there would be a higher maintenance burden (because we'd have to maintain docs/annotations/etc., as you say), but we would also be less likely to be broken without warning (of course, given dask's low stability it could still happen, but it reduces the odds). The current approach also makes it harder for devs to understand why the tests fail if they don't know that they need to download the source of distributed to access this functionality.

In any case, I don't think there is a good solution for installing the editable version from source. Every solution I can think of is equally hacky, so for now I think we can leave this as is.

Member:

if dask decides to change the behavior of that function in some way we'll see it in our test.

Yes, we definitely want to see any of that. If the behavior changed, there's a high chance we'd need to make adjustments on our end, which we must catch ASAP.

We're not maintaining the function, but we are effectively signing up to keep track of an implementation detail of dask's tests. [...] given dask's low stability it could still happen, but it reduces the odds [...]

I agree with you here, but for now this is a price I'm willing to pay for an extra 1% of confidence that we won't be broken with little or no warning by something we've skipped tests for.

The current approach also makes it harder for devs to understand why the tests fail if they don't know that they need to download the source of distributed to access this functionality.

I absolutely agree with you, and it's annoying. What if I split that one test (and perhaps others in the future) into a tests_internals directory, ran everything else against "vanilla" Distributed, and had essentially the following:

pytest python/distributed-ucxx/distributed_ucxx/tests/
pip install -e ...
pytest python/distributed-ucxx/distributed_ucxx/tests_internals/

Does that sound like a reasonable tradeoff to you? I just want to test as much as possible, and sometimes that means testing bits that are not part of Distributed's public API, like this one.

In any case, I don't think there is a good solution for installing the editable version from source. Every solution I can think of is equally hacky, so for now I think we can leave this as is.

Agreed, hopefully what I'm proposing above is a reasonable tradeoff for now.

@vyasr (Contributor) commented Jan 31, 2024:

I think the split you proposed for getting extra coverage in CI makes sense. Let's make that change in a follow-up PR where you apply it to both conda and pip testing.

@bdice (Contributor) left a review:

Just a few comments to address, which need some input from others.

Review threads: ci/test_wheel_distributed_ucxx.sh, ci/test_wheel_ucxx.sh
@msarahan (Contributor, author) left a review:

I'm amazed and horrified by the library manipulation that has to be done, but needs must. I'm glad that the three of you are so knowledgeable. I can't approve this, but I don't have anything left to add that would improve this.

@vyasr (Contributor) commented Jan 30, 2024:

I'm amazed and horrified by the library manipulation that has to be done, but needs must. I'm glad that the three of you are so knowledgeable. I can't approve this, but I don't have anything left to add that would improve this.

Yeah, it's pretty horrific. At least this time around I already knew what was needed; the first time was an adventure figuring out why the cuml dask tests were segfaulting 😅, especially since we don't have MG CI for wheel tests, so it was a bit of a last-minute discovery for wheels.

The full history is in this ucx-py PR. A couple of long live-debugging sessions between me and Ben Z to get this working, IIRC.

@vyasr (Contributor) left a review:

This is very close now, we should be able to merge soon.

Review threads: .github/workflows/build.yaml, .github/workflows/pr.yaml, ci/build_wheel.sh, ci/build_wheel_distributed_ucxx.sh, ci/build_wheel_ucxx.sh, ci/wheel_smoke_test_distributed_ucxx.py, ci/wheel_smoke_test_ucxx.py, cpp/CMakeLists.txt
@pentschev (Member) left a review:

I think I've addressed all of your comments, @vyasr; please have another look.

Review threads: ci/build_wheel.sh, ci/wheel_smoke_test_distributed_ucxx.py, ci/test_common.sh, ci/wheel_smoke_test_ucxx.py, cpp/CMakeLists.txt
@vyasr (Contributor) left a review:

A couple of minor leftover items then we should be good to go.

Review threads: ci/build_wheel_distributed_ucxx.sh, ci/build_wheel_ucxx.sh

@bdice (Contributor) left a review:

LGTM! Thanks for all the teamwork on this!

@bdice bdice removed the DO NOT MERGE Hold off on merging; see PR for details label Jan 31, 2024
@bdice (Contributor) commented Jan 31, 2024:

/merge

@rapids-bot rapids-bot bot merged commit 5d50ef9 into rapidsai:branch-0.37 Jan 31, 2024
47 checks passed
@msarahan msarahan deleted the wheel-build-ci branch May 29, 2024 16:40
Linked issue: Build and publish wheels for UCXX