
error 'accept_length' in Eagle1 or 2? #95

Closed
haiduo opened this issue Jul 15, 2024 · 13 comments

Comments


haiduo commented Jul 15, 2024

best_candidate, accept_length,sample_p = evaluate_posterior(

This looks like a bug: the calculation of accept_length only considers the last step of the last conversation, not the average over all steps. The results I reproduce are basically accept_length = 2.
Experimental settings: LLM = vicuna-7b, test set: MT-bench
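For reference, a minimal sketch of the metric as it seems it should be computed (toy numbers, not EAGLE's actual API): average the per-step accept lengths over every step of every conversation, instead of keeping only the final value.

```python
# Illustrative sketch: the mean accept length should be taken over every
# decoding step of every conversation, not just the last step of the last
# conversation. Numbers below are toy values, not EAGLE's actual API.
# per_step[i][j] = accept_length at decoding step j of conversation i.
per_step = [
    [3, 2, 4, 2],  # conversation 0
    [2, 2, 3],     # conversation 1
]

all_steps = [length for conv in per_step for length in conv]
mean_accept_length = sum(all_steps) / len(all_steps)
print(f"mean accept length: {mean_accept_length:.2f}")  # 2.57 here

# Keeping only the last step's value (the suspected bug) would instead
# report per_step[-1][-1] == 3.
```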

@Lucas-TY

They didn't release the average accept length script; you can simply dump the per-step accept lengths into a jsonl file and compute the average yourself.
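For example, a minimal sketch of that approach (the file name and record schema here are hypothetical):

```python
import json

# Dump: inside the generation loop, append one record per decoding step.
with open("accept_lengths.jsonl", "a") as f:
    f.write(json.dumps({"accept_length": 3}) + "\n")

# Average: read all records back and compute the mean over every step.
with open("accept_lengths.jsonl") as f:
    lengths = [json.loads(line)["accept_length"] for line in f]
print(f"average accept length: {sum(lengths) / len(lengths):.2f}")
```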


haiduo commented Jul 16, 2024

> They didn't release the average accept length script; you can simply dump the per-step accept lengths into a jsonl file and compute the average yourself.

Yes, I used a modified script to calculate the average accept length, but the result was still 2.0, whereas the paper reports different numbers, as follows:
[image]
I also suspect that the calculation of n-α has similar problems. These are my reproduced results:
[image]

@yanjunplay
Contributor

Hi @haiduo, curious, did you have any update on this? :-)


haiduo commented Jul 24, 2024

> Hi @haiduo, curious, did you have any update on this? :-)

Hello. I tried other devices (different GPUs) and the conclusions are basically the same as mine. I also looked at the implementations of other open-source frameworks, and they match mine. So I have reason to believe that the author either didn't open-source the complete test code or the paper's results are problematic.
In addition, the baseline and EAGLE results in the table above are reversed, and the unit is tokens/s.


yanjunplay commented Jul 24, 2024

Thanks @haiduo for replying to me. I just checked the Spec-Bench leaderboard https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md from the link you shared. Their "Accepted Tokens" numbers for EAGLE are all 3+. Do you mean the logic there is correct? Then I am a bit confused how they get 3+ while we can only get ~2 here for EAGLE-2. Have you tried the Spec-Bench scripts? Spec-Bench benchmarked EAGLE-1, but I would be surprised if EAGLE-2 were so much worse than EAGLE-1. I would also like to debug this together.


haiduo commented Jul 24, 2024

Thank you for your comment @yanjunplay. I haven't reproduced the Spec-Bench results yet (but I am going to); I have only looked at that code's experiments. In addition, the results above are problematic when using the EAGLE-2 code to test the accept length directly, so I used the EAGLE-1 code (following the author's earlier reply in an issue). If there is a problem with my reproduction, it may be that I didn't use chain speculation (which the paper uses) but only EAGLE-1's "gen_ea_alpha_vicuna.py". In fact, that test uses a tree with 26 nodes, i.e. tree speculation. But even so, why would the accept-rate values (0-α, 1-α, 2-α) still be correct? I am confused. BTW, the author's open-source EAGLE-2 can currently only train and test the speedup results, so my reproduced results are based on EAGLE-1 without any changes.
Finally, I need to ask you about the question I raised before. The author implements this differently in EAGLE-2 than in EAGLE-1, and I don't know whether q(x)=1 still satisfies the distribution-preserving assumption of speculative sampling.


yanjunplay commented Jul 24, 2024

@haiduo do you use wechat? Maybe we can quickly discuss a bit.


haiduo commented Jul 24, 2024

> @haiduo do you use wechat? Maybe we can quickly discuss a bit. My wechat account: macazi

That's good!


qwedaq commented Aug 8, 2024

> Thank you for your comment @yanjunplay. I haven't reproduced the Spec-Bench results yet (but I am going to); I have only looked at that code's experiments. In addition, the results above are problematic when using the EAGLE-2 code to test the accept length directly, so I used the EAGLE-1 code (following the author's earlier reply in an issue). If there is a problem with my reproduction, it may be that I didn't use chain speculation (which the paper uses) but only EAGLE-1's "gen_ea_alpha_vicuna.py". In fact, that test uses a tree with 26 nodes, i.e. tree speculation. But even so, why would the accept-rate values (0-α, 1-α, 2-α) still be correct? I am confused. BTW, the author's open-source EAGLE-2 can currently only train and test the speedup results, so my reproduced results are based on EAGLE-1 without any changes.
> Finally, I need to ask you about the question I raised before. The author implements this differently in EAGLE-2 than in EAGLE-1, and I don't know whether q(x)=1 still satisfies the distribution-preserving assumption of speculative sampling.

Hi @haiduo, were you able to get any answer as to why q(x)=1.0 in EAGLE2?


haiduo commented Aug 8, 2024

Hi @qwedaq, although we didn't receive a reply from the author, we later deduced that in the non-repeat (without-replacement) sampling mode of EAGLE-2, q(x)=1.0 is a special case of speculative decoding that only applies to EAGLE-2; it would not be sound for EAGLE-1. So in theory EAGLE-2 should have no problem doing this, but I haven't had time to check the actual generation quality. You could try other benchmarks to measure its score.
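For context, the standard speculative-sampling acceptance rule (from the papers cited below) accepts a draft token x with probability min(1, p(x)/q(x)), where p is the target model and q the draft model; with q(x) = 1.0 this reduces to accepting with probability p(x). A toy sketch with illustrative values, not EAGLE's actual code:

```python
import random

def accept_draft(p_x: float, q_x: float) -> bool:
    """Accept draft token x with probability min(1, p(x) / q(x)),
    where p is the target model and q is the draft model."""
    return random.random() < min(1.0, p_x / q_x)

# Ordinary speculative sampling: q(x) is the draft model's probability of x.
accept_draft(p_x=0.30, q_x=0.60)  # accepted with probability 0.5

# The q(x) = 1.0 special case discussed above: the token is then
# accepted with probability p(x) itself.
accept_draft(p_x=0.30, q_x=1.0)   # accepted with probability 0.3
```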


qwedaq commented Aug 8, 2024

> Hi @qwedaq, although we didn't receive a reply from the author, we later deduced that in the non-repeat (without-replacement) sampling mode of EAGLE-2, q(x)=1.0 is a special case of speculative decoding that only applies to EAGLE-2; it would not be sound for EAGLE-1. So in theory EAGLE-2 should have no problem doing this, but I haven't had time to check the actual generation quality. You could try other benchmarks to measure its score.

Thank you for your quick response. I am a bit new to speculative decoding; can you please elaborate on what you mean by the non-repeat sampling mode of EAGLE-2?


haiduo commented Aug 8, 2024

> Hi @qwedaq, although we didn't receive a reply from the author, we later deduced that in the non-repeat (without-replacement) sampling mode of EAGLE-2, q(x)=1.0 is a special case of speculative decoding that only applies to EAGLE-2; it would not be sound for EAGLE-1. So in theory EAGLE-2 should have no problem doing this, but I haven't had time to check the actual generation quality. You could try other benchmarks to measure its score.

> Thank you for your quick response. I am a bit new to speculative decoding; can you please elaborate on what you mean by the non-repeat sampling mode of EAGLE-2?

Firstly, you may need to read two papers: "Fast Inference from Transformers via Speculative Decoding" and "Accelerating Large Language Model Decoding with Speculative Sampling". Secondly, my understanding of "non-repeat" sampling is the same as sampling without replacement in probability and statistics: each time a sample is drawn, whether it is accepted or not, it is excluded from the pool before the next round of sampling. Hope this helps.
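A toy illustration of that idea (purely illustrative, not EAGLE's implementation): each drawn token is removed from the pool, and the remaining probabilities are renormalized before the next draw.

```python
import random

# Toy token distribution; values are illustrative.
probs = {"the": 0.5, "a": 0.3, "an": 0.2}

drawn = []
while probs:
    tokens, weights = zip(*probs.items())
    token = random.choices(tokens, weights=weights)[0]
    drawn.append(token)
    del probs[token]  # exclude the drawn token from the next round
    # random.choices renormalizes the remaining weights automatically

print(drawn)  # e.g. ['the', 'an', 'a']
```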


qwedaq commented Aug 8, 2024

Got it. Will read the papers you mentioned. Thank you again :)

haiduo closed this as completed Aug 8, 2024