
Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range #5721

Merged — ikawrakow merged 2 commits into master from ik/iq2_s_new2 on Feb 26, 2024

Conversation

ikawrakow
Contributor

@ikawrakow ikawrakow commented Feb 26, 2024

This PR adds two new quantization types, IQ2_S and IQ2_M, to complete the coverage of the 2-3 bit quantization range.

Why? The reason for having all these new quantization types is best explained with the following graph, which shows the quantization error defined as PPL(Q)/PPL(fp16)-1 as a function of bits-per-weight (bpw). The bpw is for the complete model, including output.weight and token_embd.weight tensors. The data is for LLaMA-v2-13B, but other models show a very similar behavior.
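For concreteness, here is a minimal C++ sketch (not part of this PR) of how the two axes of the graph are obtained. The perplexity values, file size, and parameter count are hypothetical placeholders, not measured numbers:

```cpp
// Minimal sketch of the graph's two axes: quantization error and bits per weight.
// All numeric values below are placeholders.
#include <cstdint>
#include <cstdio>

int main() {
    const double ppl_q    = 5.41; // perplexity of the quantized model (placeholder)
    const double ppl_fp16 = 4.88; // perplexity of the fp16 reference  (placeholder)

    // Quantization error as defined above: PPL(Q)/PPL(fp16) - 1
    const double quant_error = ppl_q / ppl_fp16 - 1.0;

    // Bits per weight for the complete model, including output.weight and
    // token_embd.weight: total bits in the model file divided by parameter count.
    const uint64_t file_bytes = 5429348352ULL;  // placeholder GGUF size in bytes
    const uint64_t n_params   = 13015864320ULL; // ~13B parameters (placeholder)
    const double bpw = 8.0 * (double) file_bytes / (double) n_params;

    printf("bpw = %.3f  quantization error = %.4f\n", bpw, quant_error);
    return 0;
}
```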

The black/blue symbols show the results for k-/legacy quants using 668b31f, which is the last commit before I started adding i-quants and imatrix stuff. The red symbols represent the new i-quants and updated k-quants, including IQ2_S and IQ2_M added by this PR; magenta circles are for legacy quants (with all i-, k-, and legacy quants using an imatrix from wiki.train.raw). So, in a nutshell:

  • We now have several quantization options in the sub-3-bit range. Why do we need several? Because the only reason to go to sub-3-bit quantization is to squeeze a large model into the limited RAM/VRAM available, and having several quantization types allows one to select the quantization type with the lowest quantization error that can be used with the available computing platform (fits in RAM/VRAM, has acceptable performance when partially offloading to the GPU, etc.)
  • We have a much lower quantization error in the 3-4 bpw quantization range (note that the y-axis is logarithmic, so the reduction in quantization error is in the 50%-100% range). Alternatively, if we were satisfied with the generation quality of the former 3-bit quantization, we can now have the same with a ~10% smaller model
  • We now have a lower quantization error in the 4+ bit range for k- and legacy quants (and Q4_1 behaves as expected instead of having a higher quantization error than Q4_0, as was often the case).
  • I think this graph will make it easy to see the rough quantization error correspondence between k- and i-quants: Q2_K -> IQ3_XXS, Q3_K_S -> IQ3_XS, Q3_K_M -> IQ3_S, Q3_K_L -> IQ3_M

[Figure: legacy_vs_iq_l2_13 — quantization error PPL(Q)/PPL(fp16)-1 vs. bits per weight for LLaMA-v2-13B]

Interestingly enough, the IQ2_XXS...IQ3_M quantization error can be described with a simple fit of the form a * exp(-b * bpw). The 1.5 bpw quantization IQ1_S (not shown here, to avoid too large a y-axis range) falls nearly onto the same fit. If we were able to keep this rate of quantization error reduction beyond 4 bpw, we would get Q6_K performance at about 5.3 bpw.
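As an illustration of that fit (again not part of this PR): taking logs turns it into a straight line, ln(err) = ln(a) - b·bpw, so an ordinary least-squares line fit on (bpw, ln(err)) recovers a and b. The data points below are made-up placeholders, not the numbers behind the graph:

```cpp
// Sketch of the simple fit mentioned above: err(bpw) ~ a * exp(-b * bpw).
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // (bpw, quantization error) pairs -- illustrative placeholders only
    std::vector<std::pair<double, double>> pts = {
        {2.06, 0.58}, {2.20, 0.46}, {2.43, 0.33}, {2.70, 0.22},
        {3.06, 0.13}, {3.44, 0.085}, {3.66, 0.065},
    };

    // Least-squares line fit on (bpw, ln(err)): slope = -b, intercept = ln(a)
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const auto & [x, err] : pts) {
        const double y = std::log(err);
        sx += x; sy += y; sxx += x*x; sxy += x*y;
    }
    const double n         = (double) pts.size();
    const double slope     = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    const double intercept = (sy - slope*sx) / n;

    const double a = std::exp(intercept);
    const double b = -slope;
    printf("fit: err(bpw) ~ %.3f * exp(-%.3f * bpw)\n", a, b);

    // Extrapolating the fit: the bpw at which it would reach a target error
    // level (the target below is a placeholder, not a measured Q6_K value).
    const double target_err = 0.01;
    printf("fit reaches err = %.3f at bpw ~ %.2f\n", target_err, std::log(a / target_err) / b);
    return 0;
}
```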

To me it looks like we need a quantization type with about 4 bpw to close the gap between IQ3_M and Q4_K.

Owner

@ggerganov ggerganov left a comment


To me it looks like we need a quantization type with about 4 bpw to close the gap between IQ3_M and Q4_K.

Yes, I agree

Review comment on examples/quantize/quantize.cpp (outdated, resolved)
@sorasoras

@ikawrakow

Q5_K_S:
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_1:   40 tensors
llama_model_loader: - type q5_K:  161 tensors
llama_model_loader: - type q6_K:    1 tensors

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.
There is a significant improvement from replacing those 40 tensors from Q4_0 with IQ4_NL, with no difference in size, in Qwen1 at least.
Anyway, thanks for the hard work.

@ikawrakow
Contributor Author

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.

At 5 bits and above there isn't much gain from alternative quantization, at least not for the models that I'm using for testing where, once you use an imatrix, Q5_0 is basically as good as Q5_K.

@sorasoras

sorasoras commented Feb 26, 2024

The question is: could we expect new NL quants on the way, like IQ4_NL? IQ5_NL and IQ6_NL in particular.

At 5 bits and above there isn't much gain from alternative quantization, at least not for the models that I'm using for testing where, once you use an imatrix, Q5_0 is basically as good as Q5_K.

Fun fact: Q5_K_S beats Q5_K_M for my use case with imatrix. The difference is Q6_K vs. Q8_0: Q5_K_M uses Q8_0 where Q5_K_S uses Q6_K, in my use case.

@dranger003
Contributor

@ikawrakow Thanks for the amazing work. While testing IQ3_S/IQ3_M from #5676, I'm getting a segfault when using more than 2 threads with quantize on some models. I'll test this PR later today to see if the same issue is present. All other quant types work fine, so I'm not sure what is different with these that could be thread-related.

I added the output here #5676 (comment).

@ikawrakow
Copy link
Contributor Author

@dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.

@dranger003
Contributor

dranger003 commented Feb 26, 2024

@dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.

@ikawrakow Yes, although you might hate me quite a bit given its size. See here.

EDIT: Adding details here as I find out more; hopefully this helps. Another finding is that it crashes using 8 or 12 threads but not using 2 or 16 threads. I have devtools installed and can debug the code if you need me to look up something specific, but I don't know where to look otherwise without some guidance.

EDIT2: I think this may be a race condition and not directly tied to the thread count. For example, if I run quantize several times in a row with the same thread count, say 12, then after a number of failed attempts one of the runs goes through fine. Also, I just tested IQ2_S/IQ2_M and get the same behavior. I have been quantizing several models and I only get this issue with the new IQ3/IQ2 quant types.

@ikawrakow ikawrakow merged commit a33e6a0 into master Feb 26, 2024
60 of 61 checks passed
@ikawrakow ikawrakow deleted the ik/iq2_s_new2 branch February 26, 2024 16:28
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
…on range (ggerganov#5721)

* Adding IQ2_S and IQ2_M as a single cumulative commit

* Update examples/quantize/quantize.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
@mofosyne mofosyne added Tensor Encoding Scheme https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 25, 2024