
[memories] Transfer to pinned_host fast path in async_serialize #22114

Open

gspschmid wants to merge 1 commit into main from gschmid/async_serialize-transfer-pinned
Conversation

gspschmid (Contributor)

Adds a fast path to jax.experimental.array_serialization.serialization.async_serialize that avoids XLA's regular device-to-host transfer and instead uses a single device-to-pinned-host transfer per _write_array(arr) invocation. This gets us much closer to ideal transfer bandwidths in practice. For comparison, the existing approach stages copies through a fixed-size 128 MB intermediate buffer and hence requires sizeof(arr)/128MB alternations between D2H and H2H copies.

Note that the np.array(data, copy=False) call is not strictly necessary, as the tensorstore invocation t.write(...) immediately performs the C-API equivalent of np.array(data, copy=None). We expect all of these to be zero-copy, so explicitly calling np.array(data, copy=False) provides some extra safety: it would fail if jax.Array's implementation changed and no longer permitted zero-copying its private numpy array _value. The latter check is not fool-proof, however: for example, prior to XLA#14089 the construction of the jax.Array from the device buffer also forced a copy.

Depends on XLA#14087, XLA#14088, XLA#14089, XLA#14090.
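
For illustration, here is a minimal sketch (under assumptions, not the exact code of this PR) of the fast path plus the zero-copy guard described above; shard stands for one of the array's addressable shards, as in _write_array:

import jax
import numpy as np

def transfer_shard_to_pinned_host(shard):
    # Single device-to-pinned-host copy instead of staging through XLA's
    # fixed-size intermediate buffer.
    sharding = jax.sharding.SingleDeviceSharding(shard.device,
                                                 memory_kind="pinned_host")
    data = jax.jit(lambda x: x, out_shardings=sharding)(shard.data)
    data.block_until_ready()
    # copy=False acts as a safety check: it fails if materializing the numpy
    # view of the pinned-host buffer would require a copy. The result is then
    # handed to tensorstore's t.write(...) as before.
    return np.array(data, copy=False)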

# If available, transfer to pinned host memory
sharding = jax.sharding.SingleDeviceSharding(shard.device,
                                             memory_kind="pinned_host")
data = jax.jit(lambda x: x, out_shardings=sharding)(data)
Member

Use jax.device_put instead?

Contributor Author

Unfortunately jax.device_put doesn't work yet with memory_kind="pinned_host", and based on our attempts we believe this will require a longer tail of fixes to XLA. @jaro-sevcik can elaborate.

Member

It should! We have tests in memories_test.py that show it does work. It would be nice not to run an XLA computation for a transfer of this kind.

Contributor

We need a couple more patches in XLA for non-jitted device_put to work: first openxla/xla#14089 (already submitted for review), and then the last commit from https://github.com/jaro-sevcik/xla/tree/device-put-memory-kind-sharding (to be submitted once openxla/xla#14089 lands).

Contributor Author

Switched over to using device_put in this PR.

Contributor

For completeness, this PR now depends on the XLA PR openxla/xla#14268 (that enables copying buffers to a different memory space).
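
For reference, a minimal runnable sketch of the device_put-based transfer this thread converged on (assuming the XLA changes referenced above; the array and device here are just placeholders):

import jax
import jax.numpy as jnp

x = jnp.arange(1 << 20, dtype=jnp.float32)  # stand-in for a shard's data
pinned = jax.sharding.SingleDeviceSharding(jax.devices()[0],
                                           memory_kind="pinned_host")
# One device-to-pinned-host copy, without dispatching a compiled computation.
x_host = jax.device_put(x, pinned)
x_host.block_until_ready()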

                                             memory_kind="pinned_host")
data = jax.jit(lambda x: x, out_shardings=sharding)(data)
# Allow other transfers to be scheduled simultaneously
await asyncio.sleep(0)
Member

Why do we need this sleep? await should schedule it concurrently anyway, right?

Contributor Author

(Deferred to a separate commit)
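
For context, a toy, self-contained illustration (not the PR's code) of what the explicit yield buys: asyncio.sleep(0) is a yield point, so every coroutine gets to issue its transfer before any of them moves on to the slower write step.

import asyncio

async def write_shard(name):
    print(f"{name}: issue transfer")  # synchronous set-up work
    await asyncio.sleep(0)            # yield so the other shards can issue theirs
    print(f"{name}: wait for write")  # resumes once rescheduled by the loop

async def main():
    await asyncio.gather(write_shard("shard0"), write_shard("shard1"))

asyncio.run(main())
# Both "issue transfer" lines print before either "wait for write" line.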

@yashk2810 (Member)

Do you have some benchmarks where this is super fast and helpful? (something that you ran locally or in your runs?)

@gspschmid force-pushed the gschmid/async_serialize-transfer-pinned branch 2 times, most recently from 976610e to 4488182 on June 27, 2024, 10:46
@gspschmid (Contributor Author)

@yashk2810

Do you have some benchmarks where this is super fast and helpful? (something that you ran locally or in your runs?)

Here's a self-contained example that doesn't quite behave like the E2E workload mentioned before, but illustrates the effects and is a good candidate for profiling: https://gist.github.com/gspschmid/52a1062916c7030a513b0581bd56c5be

The first improvement corresponds to this PR (along with its XLA dependencies); the second corresponds to #22169. Note that after applying the first improvement, other overheads in tensorstore's t.write(data) begin to dominate. Attached below are some screenshots of nsys profiles corresponding to the last iteration of each variant.

Baseline: [nsys profile screenshot]

Device-to-pinned-host transfer: [nsys profile screenshot]

Device-to-pinned-host transfer + overlap shard transfers: [nsys profile screenshot]
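
For anyone who wants a quick local sanity check, here is a hypothetical micro-benchmark in the spirit of the gist above (not the same code; shapes and devices are arbitrary) that compares a plain device-to-host copy against a device-to-pinned-host transfer:

import time

import jax
import jax.numpy as jnp
import numpy as np

x = jnp.ones((256, 1024, 1024), dtype=jnp.float32)  # ~1 GiB on device
x.block_until_ready()

def report(label, seconds):
    print(f"{label}: {x.nbytes / seconds / 1e9:.2f} GB/s")

# Regular D2H path (staged through XLA's fixed-size intermediate buffer).
start = time.perf_counter()
np.asarray(x)
report("device-to-host", time.perf_counter() - start)

# Pinned-host path (requires the XLA changes this PR depends on).
pinned = jax.sharding.SingleDeviceSharding(jax.devices()[0],
                                           memory_kind="pinned_host")
start = time.perf_counter()
jax.device_put(x, pinned).block_until_ready()
report("device-to-pinned-host", time.perf_counter() - start)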

@gspschmid force-pushed the gschmid/async_serialize-transfer-pinned branch from 4488182 to 62940cf on July 1, 2024, 09:15
@gspschmid (Contributor Author)

@yashk2810 Not sure what CI you run for JAX contributions, but now that the remaining XLA PRs are in (openxla/xla#14089 and openxla/xla#14268), this should be ready to test.
