Demo metal kernel build for pytorch & executorch #385
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/385
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit b8cd4de with merge base 8841094.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Apologies if you're already considering this, but keep in mind that I ran into tons of issues when trying to go from CUDA binaries that work locally, to binaries that work in CI, to CI-built binaries that work locally: https://github.com/pytorch/ao/tree/main/.github/workflows. So that would be needed so we have a clearer idea of the end-to-end story. Also, the need for a cmake is kind of clunky, and I'm wondering whether we can instead have the extension creation code live in core: https://github.com/pytorch/ao/blob/demo/setup.py#L33 (cc @zou3519 in case he has some thoughts on this). Our fp6 kernels don't really have build scripts.
Seems like the ExecuTorch extension needs cmake to make some of the headers available. As a temp solution, maybe we can add ExecuTorch as a git submodule (and add appropriate include flags to the build).
My two cents:
@zou3519 re cross-compile question: for platforms dev infra supports (mac, linux, etc.) we would like to ship the corresponding .dylib/.so along with the pip wheel. For platforms dev infra doesn't support, we allow users to use cmake to leverage their own toolchains. For example, if a user wants to build a custom kernel for ExecuTorch on Android, they will pull ao and call cmake commands with the Android toolchain.

@msaroufim re cmake vs setup.py (ninja): CMake here serves as a build script to pull in ExecuTorch lazily; it's only needed when someone wants to build for ExecuTorch from source, and for pip wheel packaging. If we still go through the setup.py flow for this, from the edge side we are more than happy (we prefer :)) to go with that approach.
                                      deallocator:nil];
}

static inline void checkSupportsBFloat16() {
  assert(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS) &&
Assert is a no-op and should not be used in production code (`TORCH_CHECK`, unlike `assert`, is a valid way to do a runtime check and error out).
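A minimal sketch of what that fix might look like, reusing the condition from the hunk above (the error message text is an assumption):

```cpp
#include <c10/util/Exception.h>  // provides TORCH_CHECK

static inline void checkSupportsBFloat16() {
  // Unlike assert(), TORCH_CHECK survives release builds (NDEBUG) and throws
  // a catchable c10::Error with the given message when the condition fails.
  TORCH_CHECK(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS),
              "MPS bfloat16 support requires a newer macOS version");
}
```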
id<MTLBuffer> getMTLBufferStorage(const uint8_t * data, size_t nbytes) {
  return [MPSDevice::getInstance()->device() newBufferWithBytesNoCopy:(void*)data
                                                               length:nbytes
                                                              options:0
From the function signature, it looks like you are trying to allocate a read-only buffer, but at the same time the options are null.
Also, from an API perspective, passing a raw pointer with no lifetime guarantees feels wrong. Can it be converted to, say, a `shared_ptr`, so that the deallocator will dec-ref it?
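A minimal sketch of that suggestion, assuming the caller can hand over a `std::shared_ptr` (not the PR's actual code; note the real `newBufferWithBytesNoCopy` additionally requires page-aligned storage):

```objc
#include <memory>

id<MTLBuffer> getMTLBufferStorage(std::shared_ptr<uint8_t> data, size_t nbytes) {
  // Heap-allocate a copy of the shared_ptr so the deallocator block can hold
  // a reference for as long as the MTLBuffer is alive.
  auto* held = new std::shared_ptr<uint8_t>(std::move(data));
  return [MPSDevice::getInstance()->device()
      newBufferWithBytesNoCopy:(void*)held->get()
                        length:nbytes
                       options:MTLResourceStorageModeShared
                   deallocator:^(void* pointer, NSUInteger length) {
                     delete held;  // dec-ref; frees the memory if last owner
                   }];
}
```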
I think I'll end up just calling PyTorch's `mtl_setBuffer()`.
Is the code in `OperationUtils.mm` specific to PyTorch?
torchao/csrc/metal/mps/MPSStream.mm
Outdated
_stream =
    new MPSStream();
Is there a linter in AO? Why the line break here?
Suggested change:
  _stream = new MPSStream();
torchao/csrc/metal/mps/MPSStream.mm
Outdated
dispatch_sync(_serialQueue, ^() {
  @autoreleasepool {
    endKernelCoalescing();
    if (@available(iOS 13.0, *)) {
I'm a bit rusty with the `@available` macro. Wouldn't it make this code unavailable on macOS? Also, what is the `else` path here? Just pretend it happened? (IMO it should error out.)
And please note that `@available` does not really work for shared libraries (read: the Python/PyTorch use case; this is why the ugly `isMacOS13Plus` macro exists, which just checks for the selectors that were added in a specific macOS/iOS version).
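For reference, a sketch of that selector-probing pattern; the probed `MPSGraph` selector is an illustrative assumption, not necessarily the one PyTorch actually checks:

```objc
#import <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>

static bool isMacOS13Plus() {
  // Probe for an MPSGraph method that first shipped in macOS 13 instead of
  // using @available, which misbehaves when the code lives in a dlopen'ed
  // shared library (the Python/PyTorch case).
  static bool result = [NSClassFromString(@"MPSGraph")
      instancesRespondToSelector:@selector(cumulativeSumWithTensor:axis:name:)];
  return result;
}
```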
Yeah, this is copied from ExecuTorch's `MPSStream.mm`. I think I'll refactor it so that it uses ExecuTorch's version directly.
@@ -0,0 +1,81 @@
#include "metal/int4mv_kernel.h"
#include "metal/mps/MPSStream.h"
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
n00b q: what do the executorch includes do exactly? Still a beginner on executorch so could use more context
This serves the same purpose as `<torch/library.h>`, which allows us to do `EXECUTORCH_LIBRARY` (similar to the `TORCH_LIBRARY` macro). Basically it registers the kernels into the ExecuTorch op registry.
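For extra context, a hedged sketch of what such a registration could look like; the function name, schema, and body here are assumptions for illustration, not this PR's actual code:

```cpp
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torchao {
using torch::executor::RuntimeContext;
using torch::executor::Tensor;

// Out-variant kernel entry point; a real body would encode the Metal kernel.
Tensor& int4mv_out(RuntimeContext& ctx, const Tensor& A, const Tensor& B,
                   int64_t groupSize, const Tensor& scalesAndZeros,
                   Tensor& out) {
  (void)ctx;  // sketch only
  return out;
}
} // namespace torchao

// Boxes torchao::int4mv_out and registers it as "torchao::int4mv.out" in
// ExecuTorch's operator registry, analogous to TORCH_LIBRARY's m.impl.
EXECUTORCH_LIBRARY(torchao, "int4mv.out", torchao::int4mv_out);
```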
#endif
)METAL_QUANTIZED";

void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type);
Maybe add strides too, as a more general API? We may use the same signature for CPU too, and then we can check whether the expected tensor is supposed to be contiguous or not.
@kimishpatel this function probably needs to be refactored, given we want to use PyTorch's and ExecuTorch's own `MPSDevice` and `MPSStream`.
Ok, but those changes will be on top of what I am saying, right?
Also, for `MPSDevice`/`MPSStream`, that makes sense. This is something I had in mind, as I think we are probably gonna have to resurrect the delegate-specific custom op to avoid using a separate stream.
@larryliu0820 why is this the case though? Unless we are building ExecuTorch as part of the pip wheel. If ExecuTorch support is mainly for cross-compiling for different platforms, then we don't need to build it for the host platform, right?
A combination of comments and n00b questions.
I agree with the goal of this PR: we should figure out a way to get higher code reuse so the same kernel can be deployed on many devices. This is an important problem, so I would appreciate longer-form explanations.
@msaroufim please see inline comments
For any application that uses ExecuTorch, it needs to use the code in `int4mv_executorch.mm`. There are 2 ways of using it:

Both of these use cases require an ExecuTorch model that contains the custom op to be available. Only after this operator is registered into PT can we export the eager model into an ExecuTorch model. Therefore it is crucial that the schema (or calling convention) is the same across PT and ET, so that the model matches eager mode.

Like I mentioned above, it's crucial to make sure the ET op matches the PT op definition. The only way to guarantee this is to make sure they share the same kernel, guarded by a CI job. This is also because there's no easy way for PT to "share out" a kernel without linking against it.
Re-posting my comment here from a discussion with @msaroufim. Overall I do feel bundling ExecuTorch code into the torchao wheel is a bit strange. One thing I don't know is how an ExecuTorch user would consume a custom kernel from torchao. From what I understand (just skimmed through the ExecuTorch stuff), they need to do some compiling with cmake anyway (maybe there are some pre-compiled components, I'm not sure), so maybe it's easier to provide a top-level CMake folder that they can integrate into their ExecuTorch build. Or maybe when exporting to ExecuTorch, torchao will emit a CMake file for a consumer to integrate into his ExecuTorch model (the CMake file itself must still live somewhere though, or we emit CMake code programmatically, which feels like overkill for now). In general ExecuTorch is not really a Python library (I think), so bundling ExecuTorch-related stuff into the torchao wheel is kinda strange.
We do have a pybind API though, so we want to support the Python use case for this custom op in ET. Please see my comment above: #385 (comment). Would it be helpful if I provide a notebook example showing how it is being used by ET in Python? How can I explain this better?
Yes, that would be great. I'm not familiar with ET in general. Noob question: if a user wants to run a model in Python, why would they use ET for that instead of just using PyTorch directly?
torchao/csrc/metal/int4mv_kernel.mm
Outdated
#include "mps/OperationUtils.h" | ||
namespace torchao { | ||
|
||
void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) { |
Why not just do `void*`?
torchao/csrc/metal/int4mv_kernel.mm
Outdated
#include "mps/OperationUtils.h" | ||
namespace torchao { | ||
|
||
void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) { |
I know this is a draft, but whenever you update, describe the args. `nbytes` seems to suggest a size per "tensor" arg. In that case, why use `std::array`? (Also pass by ref.)
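Something like the following is presumably what's being asked for; the per-argument meanings are inferred from the surrounding discussion and may not match the PR exactly:

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hedged sketch of a documented signature (argument roles are assumptions),
// with the arrays passed by const reference per the review comment.
void int4mv(
    const uint8_t* A,                       // input activation buffer
    const uint8_t* B,                       // int4-packed weight buffer
    uint32_t groupSize,                     // quantization group size
    const uint8_t* scalesAndZeros,          // per-group scales and zero points
    uint8_t* C,                             // output buffer
    const std::array<uint32_t, 4>& sizes,   // problem dimensions
    const std::array<uint64_t, 4>& nbytes,  // byte size of each tensor argument
    const std::string& A_scalar_type);      // dtype tag for A, e.g. "bfloat16"
```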
An ET user's end goal would be deploying the models to devices. After exporting a PyTorch eager model to ET, we provide a pybind API for users to validate the exported model against PyTorch eager, before their production deployment.
torchao/csrc/metal/mps/MPSDevice.h
Outdated
//
// Copyright (c) 2023 Apple Inc. All rights reserved.
// Provided subject to the LICENSE file in the top level directory.
//
nit: wrong header
To be sure @msaroufim, here PyTorch ops are created outside of PyTorch as well. That is the whole story of custom ops: you can create them out of tree and include them as deps for the runtime that needs them.
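A minimal sketch of that out-of-tree story (the schema string and function name are assumptions): define the op schema once, then register a backend-specific implementation via the dispatcher:

```cpp
#include <torch/library.h>

namespace torchao {
// Implemented elsewhere (e.g. in an .mm file that encodes the Metal kernel).
at::Tensor int4mv_mps(const at::Tensor& A, const at::Tensor& B,
                      int64_t group_size, const at::Tensor& scales_and_zeros);
} // namespace torchao

// Declare the op schema in the torchao namespace.
TORCH_LIBRARY_FRAGMENT(torchao, m) {
  m.def("int4mv(Tensor A, Tensor B, int group_size, Tensor scales_and_zeros) -> Tensor");
}

// Route calls to the MPS implementation when inputs live on an MPS device.
TORCH_LIBRARY_IMPL(torchao, MPS, m) {
  m.impl("int4mv", torchao::int4mv_mps);
}
```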
id<MTLComputePipelineState> quantizedPSO = [device newComputePipelineStateWithFunction:customQuantizedLinearFunction error:nil];
[computeEncoder setComputePipelineState:quantizedPSO];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(A) offset:0 atIndex:0];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(B) offset:0 atIndex:1];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(scalesAndZeros) offset:0 atIndex:2];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(C) offset:0 atIndex:3];
[computeEncoder setBytes:sizes.data() length:sizeof(uint32_t) * sizes.size() atIndex:4];
[computeEncoder dispatchThreads:MTLSizeMake(N / 4 * 32, 1, M)
          threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
This code should be shared between ET and PyTorch @larryliu0820
    RuntimeContext& ctx,
    const Tensor& A,
    const Tensor& B,
    int64_t groupSize,
Why is `groupSize` signed?
PyTorch access to this kernel:
Same flow: `python setup.py install`

ExecuTorch access to this kernel:
Currently we still need to manually call cmake:

The dylib can be found:

TODO: add support for `python setup.py install` to install this dylib, so that it can go into the pip wheel.