
Optimize cache update. #151

Merged · 57 commits · Aug 6, 2024

Conversation

Collaborator

@wang2yn84 commented Jul 19, 2024

We used to insert the cache inside attention and then use the updated cache for the calculation. With the help of flash attention/ragged attention, we can delay the cache insertion to the end of each step. By switching to a left-aligned stacked cache, we minimize the data transfer to HBM and therefore improve performance. The decode step time dropped from 52ms to 42ms. The left-aligned cache also improves insert efficiency. Overall, benchmark performance is boosted by 15%.
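
For illustration, a minimal hedged sketch of the layout the description refers to (shapes and names are illustrative, not the repo's actual API): the cache is one left-aligned, stacked array with a leading layer axis, and the step's new k/v is written once at the end of the step instead of inside every attention call.

  import jax
  import jax.numpy as jnp

  # Hypothetical shapes: (layers, batch, kv_heads, cache_len, head_dim)
  layers, batch, heads, cache_len, head_dim = 2, 1, 4, 16, 8
  cache_k = jnp.zeros((layers, batch, heads, cache_len, head_dim), dtype=jnp.bfloat16)

  def insert_at_end_of_step(cache, new_k, pos):
      # new_k holds the current step's key for every layer at once:
      # (layers, batch, kv_heads, 1, head_dim). One left-aligned write per step,
      # rather than one write per layer inside attention.
      return jax.lax.dynamic_update_slice(cache, new_k, (0, 0, 0, pos, 0))

  new_k = jnp.ones((layers, batch, heads, 1, head_dim), dtype=jnp.bfloat16)
  cache_k = insert_at_end_of_step(cache_k, new_k, pos=5)  # this step's slot in the cache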

) -> tuple[jax.Array, tuple[jax.Array, jax.Array]]:
"""Ragged multi query attention."""
with jax.named_scope("ragged_mqa"):
batch_size, num_heads, head_dim = q.shape
seq_len = k.shape[1]
batch_size, time, head_dim = q.shape
Collaborator

Any reason to change num_heads to time?

Collaborator Author

After vmap, the number-of-heads dimension is gone, so this is indeed the sequence-length dimension, which we can also call "time".
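
As a standalone sketch of that point (illustrative, not the repo's kernel): once jax.vmap maps a per-head function over the heads axis, the function body only ever sees (batch, time, head_dim).

  import jax
  import jax.numpy as jnp

  def per_head_attn(q, k, v):
      # q: (batch, time, head_dim); k, v: (batch, seq_len, head_dim) -- no heads axis here
      scores = jnp.einsum("btd,bsd->bts", q, k)
      return jnp.einsum("bts,bsd->btd", jax.nn.softmax(scores, axis=-1), v)

  mha = jax.vmap(per_head_attn, in_axes=(1, 1, 1), out_axes=1)  # map over the heads axis

  q = jnp.ones((2, 8, 1, 16))       # (batch, heads, time=1, head_dim)
  k = v = jnp.ones((2, 8, 64, 16))  # (batch, heads, seq_len, head_dim)
  out = mha(q, k, v)                # (2, 8, 1, 16)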

Collaborator

I feel "Time" is kind of misleading variable name here. Can we use q_seq_len instead of time?

If we are only using ragged attention in decode sate, do we need this query seq len as it always be 1?

seq_len = k.shape[-2]

stacked = False
if k.ndim == 5:
Collaborator

Can you share an example where k.ndim is 5 (with block quantization)?

Collaborator Author

I'm not sure why the block quantization matters. If the cache is stacked, it will have these 5 dimensions, layer, batch, number of heads, time, head dim, whether or not it's quantized.

Collaborator

Thanks for the clarification!

normalize_var: bool,
quantized: bool,
):
"""Pallas kernel for flash attention."""
Collaborator

Replace "flash" with "ragged"?

Collaborator Author

Thanks, updated!

def run():
q = q_ref[...].astype(jnp.float32)
k = k_ref[...].astype(jnp.float32)
v = v_ref[...].astype(jnp.float32)
Collaborator

Do we have to convert to fp32? Can we use bf16?

Collaborator Author

All the arithmetic operations only support f32, and it reports an error if forced to bf16. Confirmed the constraint with the XLA team: b/340263269 and b/341729764.

return layer_ref[0], b_next, i_next, 0
return b_next, i_next, 0

def kv_scale_index_map(b, i, layer_ref, start_ref, end_ref, *_):
Collaborator

Can you share why i_next is assigned to a different position in kv_index_map versus kv_scale_index_map?

Collaborator Author

i_next doesn't get a different value. It's in a different position because the scale has the shape (batch, 1, kv_length), and grid[1] applies to the last dimension here. That's why we put i_next in that dimension.
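
To make the layout difference concrete, here is an illustrative pair of index maps (hypothetical, simplified from the actual kernel): the same grid index i lands on whichever dimension is the sequence axis of that operand.

  # k/v blocks are laid out as (batch, block_of_seq, head_dim): i picks the sequence block.
  def kv_index_map(b, i, *_):
      return b, i, 0

  # the scale is (batch, 1, kv_length): the sequence axis is last, so i goes there.
  def kv_scale_index_map(b, i, *_):
      return b, 0, i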

Collaborator

Combined with precompute_ragged_block_indices, for a given decode step with start = jnp.asarray([11, 0, 10]), input_pos = jnp.asarray([15, 9, 8]), cache_len = 16, and block_size = 4, can you share what the expected kv index map is?

@@ -310,15 +586,78 @@ def dense_attention(xq, keys, values, k_scaler=None, v_scaler=None, mask=None):
return output


def flash_attention(
Collaborator

Flash attention uses blocked q, k, v to do tiled compute. Is this function just vanilla attention?

Collaborator Author

Flash attention's key capability is computing the local softmax blockwise, which is exactly what we are doing here. How to divide the blocks is up to the user; we leveraged this to split the attention calculation between the existing cache and the new cache. So this is indeed flash attention.

Collaborator

Is there any upper-level function that calls this local attention? If this function is only for each q_block, k_block and v_block in the for loop, should we rename it block_attention?

In general, flash attention needs to dynamically select the max and rescale the softmax. The code below, from the ragged_mqa... function, looks like flash attention:

  m_curr = jax.lax.broadcast_in_dim(m_curr, m_prev.shape, (0,))
  m_next = jnp.maximum(m_prev, m_curr)
  alpha = jnp.exp(m_prev - m_next)
  beta = jnp.exp(m_curr - m_next)
  l_next = alpha * l_prev + beta * l_curr
  l_next_safe = jnp.where(l_next == 0.0, 1.0, l_next)

Correct me if I'm wrong.
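
For reference, a self-contained sketch of the two-block online-softmax combination both comments describe (standalone, illustrative names, not the kernel's exact code): each block produces an unnormalized output plus a running max and denominator, and the blocks are merged afterwards.

  import jax.numpy as jnp

  def block_stats(q, k, v):
      # Per-block statistics: unnormalized output o, running max m, denominator l.
      s = jnp.einsum("bhqd,bhkd->bhqk", q, k)
      m = jnp.max(s, axis=-1)
      p = jnp.exp(s - m[..., None])
      l = jnp.sum(p, axis=-1)
      o = jnp.einsum("bhqk,bhkd->bhqd", p, v)
      return o, m, l

  def merge_blocks(o1, m1, l1, o2, m2, l2):
      # Combine e.g. the "existing cache" block with the "new cache" block.
      m = jnp.maximum(m1, m2)
      alpha, beta = jnp.exp(m1 - m), jnp.exp(m2 - m)
      l = alpha * l1 + beta * l2
      o = alpha[..., None] * o1 + beta[..., None] * o2
      l_safe = jnp.where(l == 0.0, 1.0, l)
      return o / l_safe[..., None]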

self.input_pos,
)

def update(self, key, value, layer_id: int):
Collaborator

Great implementation! But in general, I feel the logic is too complex to maintain. Can we have different KVCacheGenerate classes to handle ring buffer, ragged attention, and stacked vs. not?

Collaborator Author

I was thinking about merging Int8KVCacheGenerate and KVCacheGenerate, because there is a lot of shared code. I can combine all 4 additional flags (lazy_cache_update, generate_cache_stacked, new_cache_stacked, flash_attention) into 1 to simplify the logic, since these flags only help my experimentation and should not be exposed to the user. Wdyt?

Collaborator

Thanks. My main concern is that the current code logic is too complex to read and maintain. The cache manager was a very straightforward implementation before, but right now the logic is very complex. Let's only keep the most optimized code in the repo.

required=False,
)
flags.DEFINE_bool(
"generate_cache_stacked",
Collaborator

What are the benefits of cache_stacked?

Collaborator Author

It reduces the DMA transfer time; minimizing the number of DMA transfers helps.

Collaborator Author

Also, XLA handles cache insertion for all the layers much more efficiently than the user iterating over the layer dimension.
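
As a hedged illustration of that point (assumed shapes, not the repo's code): compare a per-layer Python loop, which issues one update per layer, with a single update over a stacked cache.

  import jax
  import jax.numpy as jnp

  def insert_per_layer(caches, new_kvs, pos):
      # caches: a list of (batch, heads, cache_len, head_dim) arrays, one per layer.
      return [jax.lax.dynamic_update_slice(c, n, (0, 0, pos, 0))
              for c, n in zip(caches, new_kvs)]

  def insert_stacked(cache, new_kv, pos):
      # cache: (layers, batch, heads, cache_len, head_dim) -- one update covers every layer.
      return jax.lax.dynamic_update_slice(cache, new_kv, (0, 0, 0, pos, 0))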

"Whether to enable ring buffer",
required=False,
)
flags.DEFINE_bool(
"flash_attention",
Collaborator

Do you plan to enable flash_attention by itself without ragged attention?

Collaborator Author

No, ragged attention has better performance than flash attention. As I indicated in the description, it only takes effect in test mode, which means the user cannot directly enable it in interactive, offline, or server mode.

input_specs=(*([qkv_pspec] * 3), *([others_pspec] * 4)),
output_specs=(qkv_pspec, (others_pspec, others_pspec)),
sharding_axis=self.shard_axis,
input_specs=(q_pspec, q_pspec, q_pspec, *([others_pspec] * 7)),
Collaborator

Correct me if I'm wrong: ragged_attention_new doesn't support generate_cache_stacked.

Collaborator Author

ragged_attention_new is for the new cache from the current step, which has a length of 1, so there is nothing to stack.

@FanhaiLu1
Collaborator

We used to insert the cache inside attention and then use the updated cache for the calculation. With the help of flash attention/ragged attention, we can delay the cache insertion to the end of each step. By switching to a left-aligned stacked cache, we minimize the data transfer to HBM and therefore improve performance. The decode step time dropped from 52ms to 42ms. Overall, benchmark performance is boosted by 15%.

we can delay the cache insertion to the end of each step

15% improvement is a great achievement! I assume the tests use the left-aligned stacked cache + ragged attention; do you have any performance numbers for left aligned (without stacking) + ragged attention?

@wang2yn84
Collaborator Author

We used to insert the cache inside attention and then use the updated cache for the calculation. With the help of flash attention/ragged attention, we can delay the cache insertion to the end of each step. By switching to a left-aligned stacked cache, we minimize the data transfer to HBM and therefore improve performance. The decode step time dropped from 52ms to 42ms. Overall, benchmark performance is boosted by 15%.

we can delay the cache insertion to the end of each step

15% improvement is a great achievement! I assume the tests use the left-aligned stacked cache + ragged attention; do you have any performance numbers for left aligned (without stacking) + ragged attention?


When the cache is left aligned but unstacked, the data transfer overhead is non-negligible. I tried flash attention, which takes 90ms per step. That overhead has nothing to do with which attention you are using.

@wang2yn84 closed this Jul 19, 2024
@wang2yn84 reopened this Jul 19, 2024
self.new_v_scaler,
]
(
self.cache_k._elem,
Collaborator

._elem seems spurious, as _elem is already a jax array.

So it's either x._elem = foo(jax_array_inputs) OR x = call_jax(foo, torch_tensor_inputs).

Collaborator Author

Makes sense; I decided to remove _elem since it violates lint anyway.

@@ -367,9 +367,25 @@ def apply_rotary_emb(
def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
"""torch.repeat_interleave(x, dim=2, repeats=n_rep)."""

bs, n_kv_heads, slen, head_dim = x.shape
bs, n_kv_heads, slen, head_dim = (
Collaborator

bs, n_kv_heads, slen, head_dim, *_ = x.shape

Collaborator Author

I see. Should it be *_, bs, n_kv_heads, slen, head_dim = x.shape?

x.shape[-2],
x.shape[-1],
)
if x.ndim == 5:
Collaborator

stacked = x.ndim == 5

Collaborator Author

Better! Thanks!

if n_rep == 1:
return x
if stacked:
Collaborator

Or just put the ndim == 5 check here and remove the stacked var.

Collaborator Author

I'd probably prefer to keep stacked to make the code clearer.
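
For concreteness, a hedged sketch of the variant being discussed (assumed to mirror the diff, not copied from it): repeat_kv accepts either the usual 4-D tensor or a stacked 5-D tensor with a leading layer axis.

  import torch

  def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
      """Repeat KV heads n_rep times; x is (bs, n_kv_heads, slen, head_dim) or stacked 5-D."""
      *_, bs, n_kv_heads, slen, head_dim = x.shape
      stacked = x.ndim == 5
      if n_rep == 1:
          return x
      if stacked:
          layers = x.shape[0]
          return (
              x[:, :, :, None, :, :]
              .expand(layers, bs, n_kv_heads, n_rep, slen, head_dim)
              .reshape(layers, bs, n_kv_heads * n_rep, slen, head_dim)
          )
      return (
          x[:, :, None, :, :]
          .expand(bs, n_kv_heads, n_rep, slen, head_dim)
          .reshape(bs, n_kv_heads * n_rep, slen, head_dim)
      )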

@wang2yn84
Collaborator Author

Fixed based on your comments, and fixed all the unit test and lint errors. Please let me know if you have any other comments/suggestions. @qihqi @FanhaiLu1

@qihqi
Collaborator

qihqi commented Jul 20, 2024

There are some updates to deps/Jetstream; is that intentional?


seq_len = k.shape[-2]

stacked = False
if k.ndim == 4:
Collaborator

The vmap reduces the head dim, so the stacked ndim becomes 4 instead of 5. Correct me if I'm wrong.

I'm also wondering: do we need a vmap in ragged attention at all? The shard_map already did the first reduction, cutting the head dim from 32 to 4 (taking Llama2 7B on v5e-8 as an example). Can we process 4 heads in a single kernel? Is there a performance regression when using multiple heads in ragged attention compared with single-head attention?

Collaborator Author

It's not a must. By reducing the number-of-heads dimension to 1, the MHA becomes MQA. That's just for compatibility.
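
A hedged sketch of the setup the question describes (mesh/axis names and shapes are illustrative): shard_map splits the heads across devices, and vmap maps the per-head kernel over whatever heads remain local.

  import numpy as np
  import jax
  import jax.numpy as jnp
  from jax.experimental.shard_map import shard_map
  from jax.sharding import Mesh, PartitionSpec as P

  mesh = Mesh(np.array(jax.devices()), axis_names=("x",))

  def per_head(q, k, v):  # q: (batch, time, head_dim)
      s = jnp.einsum("btd,bsd->bts", q, k)
      return jnp.einsum("bts,bsd->btd", jax.nn.softmax(s, axis=-1), v)

  # Inside each shard, vmap handles the local heads (e.g. 32 heads / 8 devices = 4 per shard).
  local_mha = jax.vmap(per_head, in_axes=(1, 1, 1), out_axes=1)

  sharded_mha = shard_map(
      local_mha, mesh=mesh,
      in_specs=(P(None, "x"), P(None, "x"), P(None, "x")),
      out_specs=P(None, "x"),
  )

  q = jnp.ones((2, 8, 1, 16))       # (batch, heads, time, head_dim); heads divisible by device count
  k = v = jnp.ones((2, 8, 64, 16))
  out = sharded_mha(q, k, v)        # (2, 8, 1, 16)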


jnp.array([layer]),
start,
end,
end, # line_end, not actually used
Collaborator

Thanks for clarifying!



@FanhaiLu1
Collaborator

Fixed based on your comments, and fixed all the unit test and lint errors. Please let me know if you have any other comments/suggestions. @qihqi @FanhaiLu1

There are new lint errors, can you fix them?

@wang2yn84
Collaborator Author

Fixed based on your comments, and fixed all the unit test and lint errors. Please let me know if you have any other comments/suggestions. @qihqi @FanhaiLu1

There are new lint errors, can you fix them?

Fixed all the lint issues.

@wang2yn84
Collaborator Author

I will remove precompute_ragged_block_indices, clean up the ragged attention implementation (e.g. remove the one for the ring buffer), and simplify the flags for the non-ring-buffer case, thereby simplifying the cache manager, in a subsequent PR. I will push this PR first since it's been standing alone for a while.

@wang2yn84 merged commit ee040a4 into main Aug 6, 2024
3 of 4 checks passed