Demo metal kernel build for pytorch & executorch #385
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/385
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit b8cd4de with merge base 8841094.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Apologies if you're already considering this, but keep in mind that I ran into tons of issues when trying to go from CUDA binaries that work locally, to binaries that work in CI, to CI-built binaries that work locally: https://github.com/pytorch/ao/tree/main/.github/workflows. So that would be needed so we have a clearer idea of the end-to-end story. Also, the need for a cmake is kind of clunky, and I'm wondering whether we can instead have the extension creation code live in core: https://github.com/pytorch/ao/blob/demo/setup.py#L33 (cc @zou3519 in case he has some thoughts on this). Our fp6 kernels don't really have build scripts.
Seems like the ExecuTorch extension needs cmake to make some of the headers available. As a temp solution, maybe we can add ExecuTorch as a git submodule (and add appropriate include flags to the build).
My two cents:
@zou3519 re cross-compile question: for platforms dev infra supports (mac, linux, etc.) we would like to ship the corresponding .dylib/.so along with the pip wheel. For platforms dev infra doesn't support, we allow users to use cmake to leverage their own toolchains. For example, if a user wants to build a custom kernel for ExecuTorch on Android, they will pull ao and call cmake commands with the Android toolchain.

@msaroufim re cmake vs setup.py (ninja): CMake here serves as a build script to pull in ExecuTorch lazily; it's only needed when someone wants to build for ExecuTorch from source, and for pip wheel packaging. If we still go through the setup.py flow for this, from the edge side we are more than happy (we prefer :)) to go with that approach.
                                      deallocator:nil];
}

static inline void checkSupportsBFloat16() {
  assert(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS) &&
Assert is a no-op and should not be used in production code (`TORCH_CHECK`, unlike `assert`, is a valid way to do a runtime check and error out).
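A minimal sketch of what that fix might look like, reusing the condition from the hunk above (the error message text is an assumption):

```cpp
#include <c10/util/Exception.h>  // provides TORCH_CHECK

static inline void checkSupportsBFloat16() {
  // Unlike assert(), TORCH_CHECK survives release builds (NDEBUG) and throws
  // a catchable c10::Error with the given message when the condition fails.
  TORCH_CHECK(isMacOS13OrNewer(MacOSVersion::MACOS_VER_14_0_PLUS),
              "MPS bfloat16 support requires a newer macOS version");
}
```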
id<MTLBuffer> getMTLBufferStorage(const uint8_t * data, size_t nbytes) {
  return [MPSDevice::getInstance()->device() newBufferWithBytesNoCopy:(void*)data
                                                               length:nbytes
                                                              options:0
From the function signature, it looks like you are trying to allocate a read-only buffer, but at the same time the options are null.
Also, from an API perspective, passing a raw pointer with no lifetime guarantees feels wrong. Can it be converted to, say, a `shared_ptr`, so that the deallocator will dec-ref it?
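A minimal sketch of that suggestion, assuming the caller can hand over a `std::shared_ptr` (not the PR's actual code; note the real `newBufferWithBytesNoCopy` additionally requires page-aligned storage):

```objc
#include <memory>

id<MTLBuffer> getMTLBufferStorage(std::shared_ptr<uint8_t> data, size_t nbytes) {
  // Heap-allocate a copy of the shared_ptr so the deallocator block can hold
  // a reference for as long as the MTLBuffer is alive.
  auto* held = new std::shared_ptr<uint8_t>(std::move(data));
  return [MPSDevice::getInstance()->device()
      newBufferWithBytesNoCopy:(void*)held->get()
                        length:nbytes
                       options:MTLResourceStorageModeShared
                   deallocator:^(void* pointer, NSUInteger length) {
                     delete held;  // dec-ref; frees the memory if last owner
                   }];
}
```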
I think I'll end up just calling PyTorch's `mtl_setBuffer()`.
Is the code in `OperationUtils.mm` specific to PyTorch?
torchao/csrc/metal/mps/MPSStream.mm
Outdated
_stream =
    new MPSStream();
Is there a linter in AO? Why the line break here?
Suggested change:
  _stream = new MPSStream();
torchao/csrc/metal/mps/MPSStream.mm
Outdated
dispatch_sync(_serialQueue, ^() {
  @autoreleasepool {
    endKernelCoalescing();
    if (@available(iOS 13.0, *)) {
I'm a bit rusty with the `@available` macro. Wouldn't it make this code unavailable on macOS? Also, what is the `else` path here? Just pretend it happened? (IMO it should error out.)
And please note that `@available` does not really work for shared libraries (read: the Python/PyTorch use case; this is why the ugly `isMacOS13Plus` macro exists, which just checks for the selectors that were added in a specific macOS/iOS version).
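For reference, a sketch of that selector-probing pattern; the probed `MPSGraph` selector is an illustrative assumption, not necessarily the one PyTorch actually checks:

```objc
#import <MetalPerformanceShadersGraph/MetalPerformanceShadersGraph.h>

static bool isMacOS13Plus() {
  // Probe for an MPSGraph method that first shipped in macOS 13 instead of
  // using @available, which misbehaves when the code lives in a dlopen'ed
  // shared library (the Python/PyTorch case).
  static bool result = [NSClassFromString(@"MPSGraph")
      instancesRespondToSelector:@selector(cumulativeSumWithTensor:axis:name:)];
  return result;
}
```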
Yeah, this is copied from ExecuTorch's `MPSStream.mm`. I think I'll refactor it so that it uses ExecuTorch's version directly.
@@ -0,0 +1,81 @@
#include "metal/int4mv_kernel.h"
#include "metal/mps/MPSStream.h"
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
n00b q: what do the executorch includes do exactly? Still a beginner on executorch so could use more context
This serves the same purpose as `<torch/library.h>`, which allows us to do `EXECUTORCH_LIBRARY` (similar to the `TORCH_LIBRARY` macro). Basically it registers the kernels into the ExecuTorch op registry.
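For extra context, a hedged sketch of what such a registration could look like; the function name, schema, and body here are assumptions for illustration, not this PR's actual code:

```cpp
#include <executorch/extension/kernel_util/make_boxed_from_unboxed_functor.h>
#include <executorch/runtime/kernel/kernel_includes.h>

namespace torchao {
using torch::executor::RuntimeContext;
using torch::executor::Tensor;

// Out-variant kernel entry point; a real body would encode the Metal kernel.
Tensor& int4mv_out(RuntimeContext& ctx, const Tensor& A, const Tensor& B,
                   int64_t groupSize, const Tensor& scalesAndZeros,
                   Tensor& out) {
  (void)ctx;  // sketch only
  return out;
}
} // namespace torchao

// Boxes torchao::int4mv_out and registers it as "torchao::int4mv.out" in
// ExecuTorch's operator registry, analogous to TORCH_LIBRARY's m.impl.
EXECUTORCH_LIBRARY(torchao, "int4mv.out", torchao::int4mv_out);
```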
#endif
)METAL_QUANTIZED";

void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type);
Maybe add strides too, as a more general API? We may use the same signature for CPU too, and then we can check whether the expected tensor is supposed to be contiguous or not.
@kimishpatel this function probably needs to be refactored, given we want to use PyTorch's and ExecuTorch's own `MPSDevice` and `MPSStream`.
Ok, but those changes will be on top of what I am saying, right?
Also, for `MPSDevice`/`MPSStream`, that makes sense. This is something I had in mind, as I think we are probably gonna have to resurrect the delegate-specific custom op to avoid using a separate stream.
@larryliu0820 why is this the case though? Unless we are building ExecuTorch as part of the pip wheel. If ExecuTorch support is mainly for cross-compiling for different platforms, then we don't need to build it for the host platform, right?
A combination of comments and n00b questions.
I agree with the goal of this PR: we should figure out a way to get higher code reuse so the same kernel can be deployed on many devices. This is an important problem, so I would appreciate longer-form explanations.
@msaroufim please see inline comments
For any application that uses ExecuTorch, it needs to use the code in `int4mv_executorch.mm`. There are 2 ways of using it:

Both of these use cases require an ExecuTorch model that contains the custom op to be available. Only after this operator is registered into PT can we export the eager model into an ExecuTorch model. Therefore it is crucial that the schema (or calling convention) is the same across PT and ET, so that the model matches eager mode.

Like I mentioned above, it's crucial to make sure the ET op matches the PT op definition. The only way to guarantee this is to make sure they share the same kernel, guarded by a CI job. This is also because there's no easy way for PT to "share out" a kernel without linking against it.
Re-posting my comment here from a discussion with @msaroufim. Overall I do feel bundling ExecuTorch code into the torchao wheel is a bit strange. One thing I don't know is how an ExecuTorch user would consume a custom kernel from torchao. From what I understand (just skimmed through the ExecuTorch stuff), they need to do some compiling with cmake anyway (maybe there are some pre-compiled components, I'm not sure), so maybe it's easier to provide a top-level CMake folder that they can integrate into their ExecuTorch build. Or maybe when exporting to ExecuTorch, torchao will emit a CMake file for a consumer to integrate into his ExecuTorch model (the CMake file itself must still live somewhere though, or we emit CMake code programmatically, which feels like overkill for now). In general ExecuTorch is not really a Python library (I think), so bundling ExecuTorch-related stuff into the torchao wheel is kinda strange.
We do have a pybind API though, so we want to support the Python use case for this custom op in ET. Please see my comment above: #385 (comment). Would it be helpful if I provide a notebook example showing how it is being used by ET in Python? How can I explain this better?
Yes, that would be great. I'm not familiar with ET in general. Noob question: if a user wants to run a model in Python, why would they use ET for that instead of just using PyTorch directly?
torchao/csrc/metal/int4mv_kernel.mm
Outdated
#include "mps/OperationUtils.h" | ||
namespace torchao { | ||
|
||
void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) { |
Why not just do `void*`?
torchao/csrc/metal/int4mv_kernel.mm
Outdated
#include "mps/OperationUtils.h" | ||
namespace torchao { | ||
|
||
void int4mv(const uint8_t * A, const uint8_t * B, uint32_t groupSize, const uint8_t * scalesAndZeros, uint8_t * C, std::array<uint32_t, 4> sizes, std::array<uint64_t, 4> nbytes, std::string A_scalar_type) { |
I know this is a draft, but whenever you update, describe the args. `nbytes` seems to suggest a size per "tensor" arg. In that case, why use `std::array`? (Also pass by ref.)
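Something like the following is presumably what's being asked for; the per-argument meanings are inferred from the surrounding discussion and may not match the PR exactly:

```cpp
#include <array>
#include <cstdint>
#include <string>

// Hedged sketch of a documented signature (argument roles are assumptions),
// with the arrays passed by const reference per the review comment.
void int4mv(
    const uint8_t* A,                       // input activation buffer
    const uint8_t* B,                       // int4-packed weight buffer
    uint32_t groupSize,                     // quantization group size
    const uint8_t* scalesAndZeros,          // per-group scales and zero points
    uint8_t* C,                             // output buffer
    const std::array<uint32_t, 4>& sizes,   // problem dimensions
    const std::array<uint64_t, 4>& nbytes,  // byte size of each tensor argument
    const std::string& A_scalar_type);      // dtype tag for A, e.g. "bfloat16"
```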
An ET user's end goal would be deploying the models to devices. After exporting a PyTorch eager model to ET, we provide a pybind API for users to validate the exported model against PyTorch eager, before their production deployment.
torchao/csrc/metal/mps/MPSDevice.h
Outdated
//
// Copyright (c) 2023 Apple Inc. All rights reserved.
// Provided subject to the LICENSE file in the top level directory.
//
nit: wrong header
To be sure @msaroufim, here PyTorch ops are created outside of PyTorch as well. That is the whole story of custom ops: you can create them out of tree and include them as deps for the runtime that needs them.
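A minimal sketch of that out-of-tree story (the schema string and function name are assumptions): define the op schema once, then register a backend-specific implementation via the dispatcher:

```cpp
#include <torch/library.h>

namespace torchao {
// Implemented elsewhere (e.g. in an .mm file that encodes the Metal kernel).
at::Tensor int4mv_mps(const at::Tensor& A, const at::Tensor& B,
                      int64_t group_size, const at::Tensor& scales_and_zeros);
} // namespace torchao

// Declare the op schema in the torchao namespace.
TORCH_LIBRARY_FRAGMENT(torchao, m) {
  m.def("int4mv(Tensor A, Tensor B, int group_size, Tensor scales_and_zeros) -> Tensor");
}

// Route calls to the MPS implementation when inputs live on an MPS device.
TORCH_LIBRARY_IMPL(torchao, MPS, m) {
  m.impl("int4mv", torchao::int4mv_mps);
}
```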
id<MTLComputePipelineState> quantizedPSO = [device newComputePipelineStateWithFunction:customQuantizedLinearFunction error:nil];
[computeEncoder setComputePipelineState:quantizedPSO];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(A) offset:0 atIndex:0];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(B) offset:0 atIndex:1];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(scalesAndZeros) offset:0 atIndex:2];
[computeEncoder setBuffer:mps::delegate::getMTLBufferStorage(C) offset:0 atIndex:3];
[computeEncoder setBytes:sizes.data() length:sizeof(uint32_t) * sizes.size() atIndex:4];
[computeEncoder dispatchThreads:MTLSizeMake(N / 4 * 32, 1, M)
          threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
This code should be shared between ET and PyTorch @larryliu0820
    RuntimeContext& ctx,
    const Tensor& A,
    const Tensor& B,
    int64_t groupSize,
Why is `groupSize` signed?
PyTorch access to this kernel:
Same flow: `python setup.py install`

ExecuTorch access to this kernel:
Currently we still need to manually call cmake:

The dylib can be found:

TODO: add support for `python setup.py install` to install this dylib, so that it can go into the pip wheel.