
How to handle embedding layernorm #91

Open
xiongqisong opened this issue Jul 9, 2024 · 10 comments

@xiongqisong

Some models apply a layernorm to the embedding after the embedding layer and then send it to the attention layers. When facing this type of model, do I need to add the embedding layernorm to EAGLE, or is there any other trick I need to make EAGLE output the right tokens?
I also don't know why the -2 is needed when generating training data for Llama, or how to change that -2 for another model in my own ge_data script. So far I have tried generating data without the -2, and training EAGLE both with and without the embedding layernorm; neither gives good results in parallel decoding, and I'm confused. The model is BlueLM-7B-Chat, thanks for helping me!

@Liyuhui-12
Collaborator

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.
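Concretely, the draft model's input path looks roughly like this (a simplified sketch rather than the exact code; `embed_tokens`, `fc`, and the shapes follow the structure of the draft model in the repo):

```python
import torch
import torch.nn as nn

class DraftModelInput(nn.Module):
    """Simplified sketch of how the EAGLE draft model consumes its two inputs."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Fuses the token embedding with the target model's hidden state.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.embed_tokens(input_ids)  # (batch, seq, hidden)
        # hidden_states is taken from the target model *after* its final norm,
        # which is why no extra norm layer precedes the attention here.
        return self.fc(torch.cat((inputs_embeds, hidden_states), dim=-1))
```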

@Liyuhui-12
Collaborator

What do you mean by -2?

@xiongqisong
Author

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

@xiongqisong
Author

What do you mean by -2?

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

@Liyuhui-12
Collaborator

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

@Liyuhui-12
Collaborator

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.
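For example, a check along these lines (a hypothetical helper; `input_ids` and `loss_mask` are the tensors your ge_data script produces, and I assume positions with `loss_mask == 0` are the masked-out instruction tokens):

```python
import torch

def check_loss_mask(tokenizer, input_ids: torch.Tensor, loss_mask: torch.Tensor) -> str:
    """Decode the masked-out positions; the result should be the human
    instruction part of the conversation, offset by one token."""
    masked_pos = torch.nonzero(loss_mask == 0, as_tuple=False).squeeze(-1)
    return tokenizer.decode(input_ids[masked_pos])

# Usage: print(check_loss_mask(tokenizer, input_ids, loss_mask))
# If the decoded text is shifted, adjust the hard-coded offset (the -2 in the
# Llama script) to match your model's chat template and tokenizer.
```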

@xiongqisong
Author

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make EAGLE's structure similar to the original model's. I find that with the embedding layernorm EAGLE works well, and if I remove the embedding layernorm, EAGLE works badly. I don't know why; I'm just reporting the observation to you~
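Roughly, the change I made looks like this (a sketch rather than my exact code; `embed_layer_norm` is the module I added to mirror BlueLM's post-embedding LayerNorm, and the other names follow the EAGLE draft model):

```python
import torch
import torch.nn as nn

class DraftModelInputWithEmbedNorm(nn.Module):
    """Draft-model input path with an embedding layernorm added."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # Added: mirrors the base model's LayerNorm applied right after embedding.
        self.embed_layer_norm = nn.LayerNorm(hidden_size)
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        inputs_embeds = self.embed_layer_norm(self.embed_tokens(input_ids))
        return self.fc(torch.cat((inputs_embeds, hidden_states), dim=-1))
```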

@xiongqisong
Author

In the data generation Python script, the EAGLE code comment shows the following:
[screenshot of the code comment showing the -2 offset]
I don't know whether I need to change this number when I implement EAGLE for another model?

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.

Thanks for your reply, now I know how to estimate this value! It's very helpful~

@xiongqisong
Author

I find that EAGLE has many details in the code implementation, so if it's convenient, I would appreciate more comments or a doc about the code design. I have already added some comments in my fork to help me understand EAGLE's complex logic, including some details that aren't mentioned in the paper.

@fousdfrf

fousdfrf commented Aug 22, 2024

I don't mean the hidden state, I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, and EAGLE embeds the input tokens. Does EAGLE then need to apply a layernorm to the embedding before concatenating the embedding with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make EAGLE's structure similar to the original model's. I find that with the embedding layernorm EAGLE works well, and if I remove the embedding layernorm, EAGLE works badly. I don't know why; I'm just reporting the observation to you~

Hello, I would like to ask how much performance improvement can be achieved by adding this norm layer? Is it added during training, with EAGLE then retrained with it, or is it only added during inference?
