reproducing eagle on mistral-7b-v0.3-instruct #79

Open

alcholiclg opened this issue Jun 4, 2024 · 5 comments

Comments

@alcholiclg

Dear Eagle Team:

Hello, and thank you very much for your excellent work for the community. Recently, while attempting to replicate Eagle, I encountered some issues that I have been unable to resolve, and I would greatly appreciate your insights into the possible reasons behind them.

My goal is to replicate the effects of Eagle on mistral-7b-v0.3-instruct.

Here are the settings I used:

  1. For data generation, I employed the ge_data_all_llama2chat.py script, modifying the LLM selection to mistral-7b-v0.3-instruct. Additionally, I altered the conversation template used, removing the system_message component.

  2. During the training phase, I utilized a small model configuration with a batch size (bsz) of 12, 8xH100, and a learning rate (lr) of 18e-5. The training metrics were aligned with the official code, and the training progress is detailed below.

  3. In the testing phase, I initially evaluated the consistency on 80 questions from the vicuna_questions.jsonl file in the qlora codebase. Specifically, I compared the token_id outputs between the LLM and Eagle to assess their alignment. Surprisingly, the consistency was less than 10%. As a benchmark, I conducted tests using the officially provided Vicuna and Llama models, which yielded consistency rates of approximately 87% and 96%, respectively. These figures are significantly higher than my own test results.

Given the above, could you please provide me with some suggestions? I would be extremely grateful for any assistance you can offer. Thank you very much.

  • train settings: (screenshot attached)
  • config for eagle head: (screenshot attached)
  • loss and acc during training: (screenshot attached)
  • about the alignment metric:

ssm_ids = [1, 2, 3, 4, 5, 6], llm_ids = [1, 2, 4, 5, 6, 7], alignment = 33.333%
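For reference, a minimal sketch of the position-wise consistency check described in point 3 above (the function name and exact counting rule are my own; the original evaluation script is not shown here):

```python
def alignment_rate(ssm_ids, llm_ids):
    """Position-wise agreement between EAGLE output tokens and the base
    LLM's output tokens for the same prompt."""
    n = min(len(ssm_ids), len(llm_ids))
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(ssm_ids[:n], llm_ids[:n]))
    return matches / n

# Example from above: the two sequences diverge after the second token.
print(alignment_rate([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7]))  # 0.333...
```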

@Liyuhui-12
Collaborator

It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.

@alcholiclg
Author

It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.

Thank you very much for your answer!

  1. After careful investigation, I found that the main problem in my reproduction was an incorrectly constructed tree mask in my custom modeling_mistral.py (copied from transformers and modified following your instructions). After fixing this, the output consistency rate reaches 82%.
  2. Another finding is that the outputs of eagle-mistral/vicuna/llama-7B-chat do not seem to align exactly with those produced by running mistral/vicuna/llama-7B-chat directly (via model.generate() or token-by-token forward passes). With all models loaded in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this comes from numerical differences between the tree decoding process and the vanilla autoregressive process. A sketch of the comparison I ran is shown after this list.
  3. In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93 over mistral-7b-v0.3-instruct. Based on the training curves, do you think there may be a consistency problem between the base model and the draft model? The test setting is 8x H100 80G with fp16 precision.
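As an illustration of the comparison in point 2, here is a rough sketch of how I check discrete-token consistency against vanilla greedy decoding; the EAGLE-side generation is left as a placeholder, and the model path and prompt are only examples:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.3"  # same base weights for both runs
tok = AutoTokenizer.from_pretrained(model_path)
base = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
base.cuda().eval()

prompt = "[INST] Explain speculative decoding in one sentence. [/INST]"
inputs = tok(prompt, return_tensors="pt").to(base.device)

# Reference: vanilla greedy decoding with the base model alone.
with torch.no_grad():
    out = base.generate(**inputs, do_sample=False, max_new_tokens=64)
vanilla_ids = out[0, inputs.input_ids.shape[1]:].tolist()

# eagle_ids should be the continuation produced by the EAGLE tree-decoding
# path under test (obtained from your own generation loop, not shown here);
# agreement can then be measured position-wise, as in the earlier sketch.
```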

@alcholiclg alcholiclg changed the title replicate eagle on mistral-7b-v0.3-instruct reproducing eagle on mistral-7b-v0.3-instruct Jun 14, 2024
@ShivangiAg

ShivangiAg commented Jun 18, 2024

Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you have made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?

@Liyuhui-12
Collaborator

@alcholiclg

Another finding is that the outputs of eagle-mistral/vicuna/llama-7B-chat do not seem to align exactly with those produced by running mistral/vicuna/llama-7B-chat directly (via model.generate() or token-by-token forward passes). With all models loaded in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this comes from numerical differences between the tree decoding process and the vanilla autoregressive process.

Floating-point addition is not associative, so reordering a sum (e.g., a + b + c vs. a + c + b) can change the result. The final distribution can therefore be affected by the GPU kernels used and their reduction order. If the probabilities of two tokens are very close, the chosen token may differ. However, in our tests under fp32 precision, vanilla generation and EAGLE generation on MT-bench are completely consistent at the discrete token level, apart from differences caused by different truncation strategies and maximum lengths. Is your inconsistency due to this?
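A quick illustration of this with ordinary Python doubles (explicit grouping makes the effect easy to see; on a GPU, the grouping is determined by the kernel's reduction order):

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```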

In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93 over mistral-7b-v0.3-instruct. Based on the training curves, do you think there may be a consistency problem between the base model and the draft model? The test setting is 8x H100 80G with fp16 precision.

In our experiments, when the draft model (LLaMA structure) is inconsistent with the base model (Mixtral 8x7B, MoE structure), the acceptance rate drops significantly. I believe the reason might be the structural inconsistency between the draft model and the base model.

@alcholiclg
Author

alcholiclg commented Jul 6, 2024

Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you have made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?

Sorry for the delayed reply; I could not respond earlier for personal reasons.
First, the changes for Mistral mainly follow the detailed guidance provided by @Liyuhui-12 (thanks for their patience), which you can cross-check against the sections marked [modified] in modeling_llama_kv.py.

  1. Make sure you import the correct libraries, classes, and functions.
  2. Make sure you use the authors' customized kv_cache correctly; its data structure differs from the original Llama kv_cache, especially in how entries are indexed and in the data types involved.
  3. Make sure you build the correct attention mask for inference: based on the model's tree_mask attribute, you need to decide whether to supplement the causal_mask with the tree_mask. This is not flagged very clearly in my cloned version of the code, which can lead to inference errors. Specifically, follow the authors' approach of adding a branch that checks for the tree mask when generating the attention mask (see the sketch after this list). Since Mistral builds its attention mask differently from Llama, you may need to proofread carefully.
  4. Another thing to consider is whether to use GQA in the EAGLE head. From my observation, you may get better results without GQA rather than using it to keep the structure consistent with the base model, though I have not verified this carefully.
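For point 3, here is a rough sketch of the kind of branch meant there, i.e. folding the tree mask into the already-built causal attention mask when the model carries a tree_mask attribute. Shapes and names follow the EAGLE-style modeling code only loosely; treat this as an illustration rather than the exact upstream implementation:

```python
import torch

def merge_tree_mask(causal_mask, tree_mask, dtype=torch.float32):
    """causal_mask: additive mask of shape (bsz, 1, q_len, kv_len) in which
    disallowed positions already hold a large negative value.
    tree_mask:   0/1 mask of shape (bsz, 1, q_len, q_len) describing which
    draft-tree tokens may attend to which; during tree decoding the tree
    tokens are both the queries and the last q_len key positions."""
    if tree_mask is None:
        return causal_mask
    tree_len = tree_mask.size(-1)
    min_value = torch.finfo(dtype).min
    block = causal_mask[:, :, -tree_len:, -tree_len:]
    causal_mask[:, :, -tree_len:, -tree_len:] = block.masked_fill(tree_mask == 0, min_value)
    return causal_mask
```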

Second, 1.93 does not refer to tokens per second; it is the speedup ratio obtained by comparing generation speed against vanilla autoregressive decoding.
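In other words, assuming both runs generate completions for the same prompts, the figure is a throughput ratio; a minimal sketch of the computation (my own helper, not the benchmark script):

```python
def speedup_ratio(eagle_new_tokens, eagle_seconds, vanilla_new_tokens, vanilla_seconds):
    """EAGLE throughput divided by vanilla autoregressive throughput."""
    return (eagle_new_tokens / eagle_seconds) / (vanilla_new_tokens / vanilla_seconds)
```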
