
How to handle embedding layernorm #91

Open
xiongqisong opened this issue Jul 9, 2024 · 10 comments

@xiongqisong

Some models apply a layernorm to the embedding after the embedding layer and then send it to the attention layers. When facing this type of model, do I need to add the embedding layernorm to EAGLE, or is there any other trick I need to make EAGLE output the right tokens?
I also don't know why the -2 is needed when generating training data for Llama, or how to change that -2 for another model in my own ge_data script. So far I have tried generating data without the -2, and training EAGLE both with and without the embedding layernorm; neither gives good results in parallel decoding, and I'm confused. The model is BlueLM-7B-Chat, thanks for helping me!

@Liyuhui-12
Collaborator

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.
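Concretely, the draft model's input path looks roughly like this (a simplified sketch rather than the exact code; `embed_tokens`, `fc`, and the shapes follow the structure of the draft model in the repo):

```python
import torch
import torch.nn as nn

class DraftModelInput(nn.Module):
    """Simplified sketch of how the EAGLE draft model consumes its two inputs."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Fuses the token embedding with the target model's hidden state.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.embed_tokens(input_ids)  # (batch, seq, hidden)
        # hidden_states is taken from the target model *after* its final norm,
        # which is why no extra norm layer precedes the attention here.
        return self.fc(torch.cat((inputs_embeds, hidden_states), dim=-1))
```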

@Liyuhui-12
Collaborator

What do you mean by -2?

@xiongqisong
Author

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

@xiongqisong
Author

What do you mean by -2?

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

@Liyuhui-12
Collaborator

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

@Liyuhui-12
Collaborator

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.
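For example, a check along these lines (a hypothetical helper; `input_ids` and `loss_mask` are the tensors your ge_data script produces, and I assume positions with `loss_mask == 0` are the masked-out instruction tokens):

```python
import torch

def check_loss_mask(tokenizer, input_ids: torch.Tensor, loss_mask: torch.Tensor) -> str:
    """Decode the masked-out positions; the result should be the human
    instruction part of the conversation, offset by one token."""
    masked_pos = torch.nonzero(loss_mask == 0, as_tuple=False).squeeze(-1)
    return tokenizer.decode(input_ids[masked_pos])

# Usage: print(check_loss_mask(tokenizer, input_ids, loss_mask))
# If the decoded text is shifted, adjust the hard-coded offset (the -2 in the
# Llama script) to match your model's chat template and tokenizer.
```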

@xiongqisong
Author

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make EAGLE's structure similar to the original model's. I find that with the embedding layernorm EAGLE works well, and if I remove the embedding layernorm, EAGLE works badly. I don't know why; I'm just reporting the observation to you~
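Roughly, the change I made looks like this (a sketch rather than my exact code; `embed_layer_norm` is the module I added to mirror BlueLM's post-embedding LayerNorm, and the other names follow the EAGLE draft model):

```python
import torch
import torch.nn as nn

class DraftModelInputWithEmbedNorm(nn.Module):
    """Draft-model input path with an embedding layernorm added."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Added: mirrors the base model's LayerNorm applied right after embedding.
        self.embed_layer_norm = nn.LayerNorm(hidden_size)
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.embed_layer_norm(self.embed_tokens(input_ids))
        return self.fc(torch.cat((inputs_embeds, hidden_states), dim=-1))
```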

@xiongqisong
Author

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.

Thanks for your reply, now I know how to estimate this value! It's very helpful~

@xiongqisong
Author

I find that EAGLE has many details in the code implementation, so if it's convenient, I would appreciate more comments or a doc about the code design. I have already added some comments in my fork to help me understand EAGLE's complex logic, including some details that aren't mentioned in the paper.

@fousdfrf

fousdfrf commented Aug 22, 2024

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make EAGLE's structure similar to the original model's. I find that with the embedding layernorm EAGLE works well, and if I remove the embedding layernorm, EAGLE works badly. I don't know why; I'm just reporting the observation to you~

Hello, I would like to ask how much performance improvement can be achieved by adding this norm layer? Is it added during training, with EAGLE then retrained with it, or is it only added during inference?
