
Demo metal kernel build for pytorch & executorch #385

Draft · wants to merge 6 commits into base: main
Conversation

larryliu0820 (Contributor) commented Jun 16, 2024

PyTorch access to this kernel:

Same flow: python setup.py install

import torchao
import torch
int4mv = torch.ops.torchao.int4mv.default
import torch._C
print(int4mv.has_kernel_for_dispatch_key(torch._C.DispatchKey.MPS))
> True

ExecuTorch access to this kernel:

Currently you still need to call cmake manually:

cmake torchao/csrc/executorch -Bcmake-out/torchao/csrc/executorch
cmake --build cmake-out/torchao/csrc/executorch

The dylib can be found at:

cmake-out/torchao/csrc/executorch/libexecutorch_kernels.dylib

TODO: add support for python setup.py install to install this dylib, so that it can go into the pip wheel (a rough sketch of one possible approach is below).
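For reference, one possible shape for that setup.py integration, sketched with an illustrative build-command subclass; the class name, paths, and package layout here are assumptions, not what this PR implements:

# Hypothetical sketch: drive the ExecuTorch cmake build from setup.py and
# bundle the resulting dylib into the wheel. Names/paths are illustrative.
import os
import subprocess
from setuptools import setup
from setuptools.command.build_ext import build_ext

class BuildExecuTorchKernels(build_ext):
    def run(self):
        src = "torchao/csrc/executorch"
        out = "cmake-out/torchao/csrc/executorch"
        subprocess.check_call(["cmake", src, "-B" + out])
        subprocess.check_call(["cmake", "--build", out])
        # Place the dylib inside the package dir so package_data ships it.
        dst_dir = os.path.join(self.build_lib, "torchao")
        os.makedirs(dst_dir, exist_ok=True)
        self.copy_file(os.path.join(out, "libexecutorch_kernels.dylib"),
                       os.path.join(dst_dir, "libexecutorch_kernels.dylib"))
        super().run()

setup(
    name="torchao",
    cmdclass={"build_ext": BuildExecuTorchKernels},
    package_data={"torchao": ["*.dylib"]},
)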

pytorch-bot commented Jun 16, 2024

🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/385

✅ As of commit b8cd4de with merge base 8841094: no failures so far.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Jun 16, 2024
larryliu0820 marked this pull request as draft June 16, 2024
msaroufim (Member) commented Jun 16, 2024

Apologies if you're already considering this, but keep in mind that I ran into tons of issues when going from "local CUDA binaries work" to "they work in CI" to "the binaries built on CI work locally" (https://github.com/pytorch/ao/tree/main/.github/workflows), so we'd need that to get a clearer idea of the end-to-end story.

Also, the need for cmake is kind of clunky, and I'm wondering whether we can instead have the extension creation code live in core https://github.com/pytorch/ao/blob/demo/setup.py#L33 - cc @zou3519 in case he has some thoughts on this. Our fp6 kernels don't really have build scripts.

gau-nernst (Collaborator)

Seems like the ExecuTorch extension needs cmake to make some of the headers available. As a temp solution, maybe we can add ExecuTorch as a git submodule (and add appropriate include flags to setup.py)? (I tested locally and I could compile executorch/int4mv_executorch.mm directly with setup.py, but I don't know how to check whether it runs correctly.)

zou3519 (Contributor) commented Jun 17, 2024

Also, the need for cmake is kind of clunky, and I'm wondering whether we can instead have the extension creation code live in core https://github.com/pytorch/ao/blob/demo/setup.py#L33 - cc @zou3519 in case he has some thoughts on this. Our fp6 kernels don't really have build scripts.

My two cents:

  • if it works, we could pre-build a .dylib and ship it with the pip wheels. My questions here are if cross-compilation works and if there are any versioning concerns (i.e. if the dylib was built with executorch version X, does it work with an executorch runtime with version Y)
  • torchao would need to take executorch as an optional build dependency.

larryliu0820 (Contributor Author)

@zou3519 re the cross-compile question: for platforms dev infra supports (mac, linux, etc.) we would like to ship the corresponding .dylib/.so along with the pip wheel. For platforms dev infra doesn't support, we let users run cmake with their own toolchains. For example, if a user wants to build a custom kernel for ExecuTorch on Android, they will pull ao and call cmake with the Android toolchain, roughly as sketched below.
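To make that concrete, the Android flow could look roughly like the following; the kernel path is a placeholder and the NDK flags are the standard toolchain-file ones, not commands from this PR:

# Illustrative only: cross-compile a torchao custom kernel for ExecuTorch on
# Android using the NDK's standard CMake toolchain file.
cmake path/to/torchao/custom_kernel \
  -Bcmake-out/android/custom_kernel \
  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-23
cmake --build cmake-out/android/custom_kernel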

@msaroufim re cmake vs setup.py (ninja): CMake here serves as a build script that pulls in ExecuTorch lazily; it's only needed when someone wants to build for ExecuTorch from source, and for pip wheel packaging. If we still go through the setup.py flow for int4mv_executorch.mm, then ao will have to unconditionally depend on ExecuTorch as a git submodule, as @gau-nernst was suggesting.

From the edge side we are more than happy (we prefer :)) to go with the git submodule approach, but I think it's ao's call whether we want a hard dependency.

deallocator:nil];
}
static inline void checkSupportsBFloat16() {
assert(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS) &&
Review comment:
Assert is a no-op in release builds and should not be used in production code (TORCH_CHECK, unlike assert, is a valid way to do a runtime check and error out).
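For illustration, the suggested replacement might look like this (the error message is only an example):

// TORCH_CHECK stays active in release builds, unlike assert():
static inline void checkSupportsBFloat16() {
  TORCH_CHECK(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS),
              "BFloat16 int4mv requires macOS 14.0 or newer");
}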

id<MTLBuffer> getMTLBufferStorage(const uint8_t * data, size_t nbytes) {
return [MPSDevice::getInstance()->device() newBufferWithBytesNoCopy:(void*)data
length:nbytes
options:0
Review comment:
From the function signature, it looks like you are trying to allocate a read-only buffer, but at the same time the options are null.

Also, from an API perspective, passing a raw pointer with no lifetime guarantees feels wrong. Can it be converted to, say, a shared_ptr, so the deallocator can dec-ref it?
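A rough sketch of what the reviewer's suggestion could look like, assuming callers can hand over shared ownership; the storage mode and alignment note are general Metal requirements, not taken from this PR:

// Keep the backing memory alive for as long as Metal holds the buffer by
// capturing a shared_ptr in the deallocator block. Note that
// newBufferWithBytesNoCopy requires page-aligned data on macOS.
id<MTLBuffer> getMTLBufferStorage(std::shared_ptr<uint8_t> data, size_t nbytes) {
  auto holder = data;  // copied into the block below
  return [MPSDevice::getInstance()->device()
      newBufferWithBytesNoCopy:(void*)data.get()
                        length:nbytes
                       options:MTLResourceStorageModeShared
                   deallocator:^(void* pointer, NSUInteger length) {
                     // The captured copy of `holder` is destroyed with the
                     // block, dropping the refcount once Metal is done.
                     (void)holder; (void)pointer; (void)length;
                   }];
}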

Reply (Contributor Author):
I think I'll end up just calling PyTorch's mtl_setBuffer().

Reply (Contributor):
Is the code in OperationUtils.mm specific to pytorch?

Comment on lines 234 to 235
_stream =
new MPSStream();
Review comment:
Is there a linter in AO? Why the line break here?

Suggested change
_stream =
new MPSStream();
_stream = new MPSStream();

dispatch_sync(_serialQueue, ^() {
@autoreleasepool {
endKernelCoalescing();
if (@available(iOS 13.0, *)) {
malfet commented Jun 17, 2024:
I'm a bit rusty with the @available macro. Wouldn't it make the copy unavailable on macOS?

Also, what is the else path here? Just pretend it happened? (IMO it should error out.)

And please note that @available does not really work for shared libraries (read: the Python/PyTorch use case); this is why the ugly isMacOS13Plus macro exists, which just checks for selectors that were added in a specific macOS/iOS version.
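A minimal sketch of the selector-probing pattern being described; the selector name below is purely a placeholder for one that first shipped in the target OS release:

// Placeholder sketch: detect the OS feature by probing for a selector that
// first shipped in the target macOS/iOS version, instead of @available.
static inline bool isMacOS14OrNewer() {
  id device = MPSDevice::getInstance()->device();
  return [device respondsToSelector:@selector(placeholderSelectorAddedInMacOS14)];
}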

Reply (Contributor Author):
Yeah, this is copied from ExecuTorch's MPSStream.mm. I think I'll refactor it so that it uses ExecuTorch's version directly.

@@ -0,0 +1,81 @@
#include "metal/int4mv_kernel.h"
#include "metal/mps/MPSStream.h"
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
Review comment (Member):
n00b q: what do the executorch includes do exactly? Still a beginner on executorch so could use more context

Reply (Contributor Author):
This serves the same purpose as <torch/library.h>: it lets us use EXECUTORCH_LIBRARY (similar to the TORCH_LIBRARY macro), which registers the kernel into the ExecuTorch op registry; a rough sketch is below.
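Roughly, the registration it enables looks like this; the signature and op name string are abbreviated and assumed for illustration, not copied from the PR:

#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
#include <executorch/runtime/kernel/kernel_includes.h>

using exec_aten::Tensor;
using torch::executor::RuntimeContext;

// Unboxed ExecuTorch kernel entry point (declaration only, schema assumed).
Tensor& int4mv_out(RuntimeContext& ctx, const Tensor& A, const Tensor& B,
                   int64_t groupSize, const Tensor& scalesAndZeros, Tensor& out);

// Boxes int4mv_out and registers it in the ExecuTorch op registry under
// "torchao::int4mv.out", the ET analogue of TORCH_LIBRARY registration.
EXECUTORCH_LIBRARY(torchao, "int4mv.out", int4mv_out);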

#endif
)METAL_QUANTIZED";

void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type);
Review comment (Contributor):
Maybe add strides too, as a general API? We may use the same signature for CPU too, and then we can check whether the expected tensor is supposed to be contiguous or not.

Reply (Contributor Author):
@kimishpatel this function probably needs to be refactored, given that we want to use PyTorch's and ExecuTorch's own MPSDevice and MPSStream.

Reply (Contributor):
Ok, but those changes will be on top of what I am saying, right?

Also, the MPSDevice/MPSStream point makes sense. This is something I had in mind, as I think we will probably have to resurrect the delegate-specific custom op to avoid using a separate stream.

kimishpatel (Contributor)

If we still go through the setup.py flow for int4mv_executorch.mm, then ao will have to unconditionally depend on ExecuTorch as a git submodule, as @gau-nernst was suggesting.

@larryliu0820 why is this the case though? Unless we are building executorch as part of the pip wheel. If executorch support is mainly for cross-compiling to different platforms, then we don't need to build it for the host platform, right?

msaroufim (Member) commented Jun 17, 2024

A combination of comments and n00b questions

  1. Personally I prefer the cmake setup over a git submodule; it makes things more opt-in.
  2. I'm still not clear on when a custom op should be registered as a PT vs ET op. Are they branching paths? How does ET handle existing PT ops or PT custom ops? For code reuse, shouldn't the kernel here be registered as both?
  3. In some ways the main controversial thing about this PR is the executorch includes; if those didn't exist then this would be an identical setup to the way we package existing kernels. We register them as torch ops. So why should executorch ops be created outside of executorch?

I agree with the goal of this PR: we should figure out a way for higher code reuse so the same kernel can be deployed on many devices. This is an important problem, so I would appreciate longer-form explanations.

larryliu0820 (Contributor Author) commented Jun 17, 2024

@msaroufim please see inline comments

A combination of comments and n00b questions

  1. Personally I prefer the cmake setup over a git submodule; it makes things more opt-in.
  2. I'm still not clear on when a custom op should be registered as a PT vs ET op. Are they branching paths? How does ET handle existing PT ops or PT custom ops? For code reuse, shouldn't the kernel here be registered as both?

Any application that uses ExecuTorch needs the code in int4mv_executorch.mm. There are two ways of using it:

  • Through pybind. On the user's host machine (normally Linux/macOS), in a Python environment, they can load an ExecuTorch model containing that particular operator (in our case int4mv) and run the model. At runtime ExecuTorch will look for the kernel registered through int4mv_executorch.mm. We want to support this flow by packaging prebuilt libraries into the pip package.
  • On device. For this to work on device, we have to cross-compile it into an arm64 binary, an iOS binary, or whatever the target is. It doesn't make sense to package shared objects for all these platforms into the pip package, so we provide a CMake file that lets users build it with their own toolchain.

Both of these use cases require an ExecuTorch model containing the custom op to be available. Only after this operator is registered into PT can we export the eager model into an ExecuTorch model. Therefore it is crucial that the schema (or calling convention) is the same across PT and ET, so that the exported model matches eager mode.
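As a concrete, simplified example of what keeping the schema in sync means on the PT side (the schema string and impl name here are assumptions, not the PR's exact signature):

#include <torch/library.h>

// Define the op schema once for PyTorch; the ExecuTorch registration in
// int4mv_executorch.mm must use the same calling convention.
TORCH_LIBRARY_FRAGMENT(torchao, m) {
  m.def("int4mv(Tensor A, Tensor B, int groupSize, Tensor scalesAndZeros) -> Tensor");
}

TORCH_LIBRARY_IMPL(torchao, MPS, m) {
  m.impl("int4mv", &torchao::int4mv_mps);  // hypothetical MPS wrapper
}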

  3. In some ways the main controversial thing about this PR is the executorch includes; if those didn't exist then this would be an identical setup to the way we package existing kernels. We register them as torch ops. So why should executorch ops be created outside of executorch?

Like I mentioned above, it's crucial to make sure the ET op matches the PT op definition. The only way to guarantee that is to have them share the same kernel, guarded by a CI job. This is also because there's no easy way for PT to "share out" a kernel without linking libtorch.

Let's say the kernel (e.g., int4mv.mm) lives in torchao but the ET kernel registration (int4mv_executorch.mm) lives in ET instead of torchao. The kernel calling convention immediately becomes a BC/FC surface and we have to take care of cross-repo compatibility. If this pattern grows it would be hard to maintain the code. Hope this makes sense.

I agree with the goal of this PR: we should figure out a way for higher code reuse so the same kernel can be deployed on many devices. This is an important problem, so I would appreciate longer-form explanations.

gau-nernst (Collaborator)

Re-posting my comment here from discussion with @msaroufim

Overall I do feel bundling executorch code into the torchao wheel is a bit strange.

One thing I don't know is how an executorch user would consume a custom kernel from torchao. From what I understand (I just skimmed through the executorch docs), they need to do some compiling with cmake anyway (maybe there are some pre-compiled components, I'm not sure). So maybe it's easier to provide a top-level CMake folder (e.g. cmake/executorch/int4mv.mm) that users can opt into (all ops or a specific op only) when they need to build for executorch. I.e. people write their own CMake, then add torchao as a "CMake library" (not sure if I'm using the terminology correctly).

Or maybe when exporting to executorch, torchao could emit a CMake file for a consumer to integrate into their executorch model (the CMake file itself must still live somewhere, though, or we emit CMake code programmatically, which feels like overkill for now). In general executorch is not really a Python library (I think), so bundling executorch-related stuff into the torchao wheel is kind of strange.

larryliu0820 (Contributor Author)

(Quoting @gau-nernst's comment above.)

We do have a pybind API though, so we want to support the Python use case for this custom op in ET. Please see my comment above: #385 (comment). Would it be helpful if I provide a notebook example showing how it is used by ET in Python? How can I explain this better?

gau-nernst (Collaborator)

Yes that would be great. I'm not familiar with ET in general.

Noob question. If a user wants to run a model in Python, why would he use ET for that instead of just using PyTorch directly?

#include "mps/OperationUtils.h"
namespace torchao {

void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) {
Review comment (Contributor):
Why not just do void*?

#include "mps/OperationUtils.h"
namespace torchao {

void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) {
Review comment (Contributor):
I know this is a draft, but whenever you update, describe the args. E.g. nbytes seems to suggest the size per "tensor" arg; in that case why use std::array? (Also pass by ref.)

larryliu0820 (Contributor Author)

Yes that would be great. I'm not familiar with ET in general.

Noob question. If a user wants to run a model in Python, why would he use ET for that instead of just using PyTorch directly?

An ET user's end goal is deploying the model to devices. After exporting a PyTorch eager model to ET, we provide a pybind API for users to validate the exported model against PyTorch eager before production deployment; a rough sketch of that flow is below.
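A rough sketch of that validation flow in Python; the module path and method names follow my understanding of ExecuTorch's pybind API, and the model/library paths and eager_model reference are placeholders:

import torch

# Load the custom-kernel dylib so the int4mv kernel is registered (placeholder path).
torch.ops.load_library("cmake-out/torchao/csrc/executorch/libexecutorch_kernels.dylib")

from executorch.extension.pybindings.portable_lib import _load_for_executorch

et_module = _load_for_executorch("model_with_int4mv.pte")  # exported ET model (placeholder)

example_inputs = (torch.randn(1, 4096, dtype=torch.float16),)
eager_out = eager_model(*example_inputs)        # eager_model: the original PyTorch model
et_out = et_module.forward(example_inputs)[0]   # run the exported model via pybind

torch.testing.assert_close(et_out, eager_out, rtol=1e-2, atol=1e-2)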

Comment on lines 1 to 4
//
// Copyright (c) 2023 Apple Inc. All rights reserved.
// Provided subject to the LICENSE file in the top level directory.
//
Review comment (Contributor):
nit: wrong header

kimishpatel (Contributor)

  3. In some ways the main controversial thing about this PR is the executorch includes; if those didn't exist then this would be an identical setup to the way we package existing kernels. We register them as torch ops. So why should executorch ops be created outside of executorch?

To be sure @msaroufim, pytorch ops are created outside of pytorch here as well. That is the whole story of custom ops: you can create them out-of-tree and include them as deps for the runtime that needs them.

Comment on lines +49 to +57
id<MTLComputePipelineState> quantizedPSO = [device newComputePipelineStateWithFunction:customQuantizedLinearFunction error:nil];
[computeEncoder setComputePipelineState:quantizedPSO];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(A) offset:0 atIndex:0];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(B) offset:0 atIndex:1];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(scalesAndZeros) offset:0 atIndex:2];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(C) offset:0 atIndex:3];
[computeEncoder setBytes:sizes.data() length:sizeof(uint32_t) * sizes.size() atIndex:4];
[computeEncoder dispatchThreads:MTLSizeMake(N / 4 * 32, 1, M)
threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
Review comment (Contributor):
This code should be shared between et and pytorch @larryliu0820

RuntimeContext& ctx,
const Tensor& A,
const Tensor& B,
int64_t groupSize,
Review comment:

Why is groupSize signed?
