
IQ4_XS: a 4.25 bpw quantization #5747

Merged: 11 commits merged into master from ik/iq4_nl_xs on Feb 27, 2024
Conversation

@ikawrakow (Contributor) commented Feb 27, 2024

This is basically the same as IQ4_NL, but in super-blocks of 256 with 6-bit scales for the blocks of 32 weights. It looks pretty good on the quantization error vs quantized model size curve:

[figure: legacy_vs_iq_l2_13, quantization error vs quantized model size for legacy and IQ quants]

It is possible to move the point closer to the IQ2_XXS...IQ3_M fit line by using IQ3_S for the attn_k and attn_q tensors. This reduces the quantized model size to about 4.1 bpw at the expense of a ~0.3% increase in PPL. But given that currently CPU performance for IQ3_S is pretty bad, I decided against this. Speaking of performance, it is excellent on all platforms where I can test except Metal (as usual):

  • 133.7 t/s on CUDA (RTX-4080) vs 128.8 t/s for Q4_0
  • 15.8 t/s on AVX2 (Ryzen-7950X) vs 14.5 t/s for Q4_0
  • 28.8 t/s on ARM_NEON (M2 Max CPU) vs 28.2 t/s for Q4_0
  • 53.9 t/s on Metal (30-core M2 Max GPU) vs 63.1 t/s for Q4_0
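As a rough illustration of where the 4.25 bpw figure comes from, here is a sketch of what such a super-block layout could look like. The struct and field names are assumptions for illustration, not taken from this PR:

```c
#include <stdint.h>

#define QK_K 256  // weights per super-block

// Hypothetical IQ4_XS-style super-block: one fp16 super-block scale,
// 6-bit scales for each of the 8 blocks of 32 weights, and 4 bits per weight.
typedef struct {
    uint16_t d;                  // fp16 super-block scale              (2 bytes)
    uint16_t scales_h;           // high 2 bits of the 8 block scales   (2 bytes)
    uint8_t  scales_l[QK_K/64];  // low 4 bits of the 8 block scales    (4 bytes)
    uint8_t  qs[QK_K/2];         // 4-bit indices, two per byte       (128 bytes)
} block_iq4_xs_sketch;

// 136 bytes for 256 weights -> 136 * 8 / 256 = 4.25 bits per weight.
```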

@Nexesenex (Contributor)

Great work!

Why not make an IQ4_XXS by using IQ3_S for attn_k and attn_q?
At 4.1 bpw, that fits the bill!

@sorasoras

> This is basically the same as IQ4_NL, but in super-blocks of 256 with 6-bit scales for the blocks of 32 weights. [...]
@ikawrakow
I think I found a bug:

get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for iq4_xs
llama_model_quantize: failed to quantize: Unsupported tensor size encountered

Shouldn't it fall back to IQ4_NL?

@ikawrakow (Contributor, Author)

> Why not make an IQ4_XXS by using IQ3_S for attn_k and attn_q? At 4.1 bpw, that fits the bill!

Because I need to fix IQ3_S performance on the CPU first. With attn_q and attn_k quantized with IQ3_S (these two tensors contain about 16% of the model weights for a 7B LLaMA model), performance on my M2 Max CPU drops from 28.8 t/s to 21 t/s. On a Ryzen-7950X, CPU performance goes down from 15.8 t/s (achieved with 4 threads) to 12.5 t/s (4 threads) or 14.5 t/s (8 threads). I think IQ4_XS is a really nice alternative to Q4_0, with a ~6% smaller model size combined with better inference performance (except on Metal), so I don't want to destroy the performance benefit. Let me look more into the best way to get to 4 bpw quants.
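A rough back-of-envelope check of the ~4.1 bpw figure mentioned above, assuming IQ3_S costs roughly 3.44 bpw and that attn_q and attn_k hold about 16% of the weights:

$$0.84 \times 4.25 + 0.16 \times 3.44 \approx 4.12 \ \text{bpw}$$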

@ikawrakow (Contributor, Author)

@sorasoras Thanks! I keep forgetting this check. It should be fixed now.
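A sketch of the kind of divisibility fallback involved; names follow llama.cpp/ggml conventions, but this is an assumption about what the fix does, not the actual diff:

```c
#include <stdint.h>

// Hypothetical helper illustrating the missing check: IQ4_XS packs
// super-blocks of 256 weights, so a row length that is not a multiple of
// 256 (e.g. 13696 % 256 == 128 in the report above) has to fall back to
// IQ4_NL, which uses plain blocks of 32.
enum sketch_quant_type { SKETCH_TYPE_IQ4_XS, SKETCH_TYPE_IQ4_NL };

static enum sketch_quant_type pick_4bit_type(int64_t n_cols) {
    const int64_t QK_K = 256;  // IQ4_XS super-block size
    return (n_cols % QK_K == 0) ? SKETCH_TYPE_IQ4_XS : SKETCH_TYPE_IQ4_NL;
}
```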

@sorasoras commented Feb 27, 2024

> @sorasoras Thanks! I keep forgetting this check. It should be fixed now.

It's working.

| Quant  | PPL               | Size    |
|--------|-------------------|---------|
| Q4_K_M | 4.6321            | 8.79 GB |
| Q3_K_XS| 4.6299 ± 0.04409  | 6.12 GB |
| IQ4_NL | 4.6048 ± 0.04419  | 7.61 GB |
| IQ4_XS | 4.5885 ± 0.04395  | 7.30 GB |
| Q6_K   | 4.5787 ± 0.04407  | 11.4 GB |
| Q5_K_S | 4.5761 ± 0.04412  | 9.33 GB |

That's great!

| model                          | size       | params     | backend    | ngl | test       | t/s              |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| qwen 13B IQ4_NL - 4.5 bpw      |   7.61 GiB |    14.17 B | ROCm       |  99 | pp 512     |  1488.78 ± 11.45 |
| qwen 13B IQ4_NL - 4.5 bpw      |   7.61 GiB |    14.17 B | ROCm       |  99 | tg 128     |     73.13 ± 0.18 |
| qwen 13B IQ4_XS - 4.25 bpw     |   7.30 GiB |    14.17 B | ROCm       |  99 | pp 512     |   1547.23 ± 9.30 |
| qwen 13B IQ4_XS - 4.25 bpw     |   7.30 GiB |    14.17 B | ROCm       |  99 | tg 128     |     76.88 ± 0.78 |

@CyborgArmy83

@ikawrakow Thanks a lot for your hard work! It is very much appreciated. Do you think that we can fix the slower Metal speeds with better kernels or does it require a whole new quantisation type? I am wondering why there is such a difference. Is it because of the additional overhead/calculations that are required for the new IQ quant methods?

@Artefact2 (Collaborator) commented Feb 27, 2024

KL-divergence data for Mistral-7B


| Quant      | Bits per weight | KL-divergence median | KL-divergence q99 | Top tokens differ | ln(PPL(Q)/PPL(base)) |
|------------|-----------------|----------------------|-------------------|-------------------|----------------------|
| IQ1_S      | 1.78            | 0.5495               | 5.5174            | 0.3840            | 0.9235               |
| IQ2_XXS    | 2.20            | 0.1751               | 2.4983            | 0.2313            | 0.2988               |
| IQ2_XS     | 2.43            | 0.1146               | 1.7693            | 0.1943            | 0.2046               |
| IQ2_S      | 2.55            | 0.0949               | 1.6284            | 0.1806            | 0.1722               |
| IQ2_M      | 2.76            | 0.0702               | 1.0935            | 0.1557            | 0.1223               |
| Q2_K_S     | 2.79            | 0.0829               | 1.5111            | 0.1735            | 0.1600               |
| Q2_K       | 3.00            | 0.0588               | 1.0337            | 0.1492            | 0.1103               |
| IQ3_XXS    | 3.21            | 0.0330               | 0.5492            | 0.1137            | 0.0589               |
| IQ3_XS     | 3.32            | 0.0296               | 0.4550            | 0.1071            | 0.0458               |
| Q3_K_S     | 3.50            | 0.0304               | 0.4481            | 0.1068            | 0.0511               |
| IQ3_S      | 3.52            | 0.0205               | 0.3018            | 0.0895            | 0.0306               |
| IQ3_M      | 3.63            | 0.0186               | 0.2740            | 0.0859            | 0.0268               |
| Q3_K_M     | 3.89            | 0.0171               | 0.2546            | 0.0839            | 0.0258               |
| Q3_K_L     | 4.22            | 0.0152               | 0.2202            | 0.0797            | 0.0205               |
| **IQ4_XS** | 4.32            | 0.0088               | 0.1082            | 0.0606            | 0.0079               |
| IQ4_NL     | 4.56            | 0.0085               | 0.1077            | 0.0605            | 0.0074               |
| Q4_K_S     | 4.57            | 0.0083               | 0.1012            | 0.0600            | 0.0081               |
| Q4_K_M     | 4.83            | 0.0075               | 0.0885            | 0.0576            | 0.0060               |
| Q5_K_S     | 5.52            | 0.0045               | 0.0393            | 0.0454            | 0.0005               |
| Q5_K_M     | 5.67            | 0.0043               | 0.0368            | 0.0444            | 0.0005               |
| Q6_K       | 6.57            | 0.0032               | 0.0222            | 0.0394            | −0.0008              |

Very nice, this seems to be a solid replacement for Q4_K_S, which was my default recommendation.
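For context, here is a minimal C sketch of the per-token statistic behind these columns: the KL divergence between the base model's and the quantized model's next-token distributions (this is not necessarily how the measurement tool computes it):

```c
#include <math.h>
#include <stddef.h>

// KL(P || Q) for one token position, where p[] comes from the base (fp16)
// model and q[] from the quantized model. Both are softmax outputs over the
// vocabulary, so entries are non-negative and sum to 1. The table above
// reports the median and 99th percentile of this value over all tokens.
static double kl_divergence(const float * p, const float * q, size_t n_vocab) {
    double kl = 0.0;
    for (size_t i = 0; i < n_vocab; ++i) {
        if (p[i] > 0.0f && q[i] > 0.0f) {
            kl += (double) p[i] * log((double) p[i] / (double) q[i]);
        }
    }
    return kl;
}
```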

@ikawrakow (Contributor, Author)

> Do you think that we can fix the slower Metal speeds with better kernels or does it require a whole new quantisation type?

The quantization in this PR is non-linear, hence it requires a table lookup. If you compare to Q4_0, there are two quants packed in one `uint8_t`, so getting these is just a matter of `q & 0xf` and `q >> 4`. Here we need `lookup_table[q & 0xf]` and `lookup_table[q >> 4]`. On the other platforms this makes zero difference: at least on my GPU the calculation is almost always memory bound, so this one additional lookup doesn't matter, and on the CPU there are vector shuffle instructions that are very fast, so the cost is negligible too. But for some reason the Apple GPU very much dislikes this additional memory load. I'm already putting the lookup table in shared memory (which gave a ~30% boost in performance compared to having the lookup table in constant memory), so I'm not sure what else can be done. I would very much appreciate it if someone more knowledgeable than me in Apple GPU matters could find a better approach than my implementation.
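To illustrate the difference described above, a minimal C sketch of the two dequantization paths. The 16-entry grid shown is illustrative only and is not claimed to be the exact non-linear table used by llama.cpp:

```c
#include <stdint.h>

// Q4_0-style (linear): the 4-bit code maps to a value directly, no table.
static inline float dequant_linear(uint8_t byte, int hi, float d) {
    int q = hi ? (byte >> 4) : (byte & 0xf);
    return d * (float)(q - 8);              // just a shift/mask and an offset
}

// IQ4_NL/IQ4_XS-style (non-linear): the 4-bit code indexes a lookup table
// tuned to the weight distribution. Illustrative values, not the real grid.
static const int8_t kvalues_nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

static inline float dequant_nonlinear(uint8_t byte, int hi, float d) {
    int q = hi ? (byte >> 4) : (byte & 0xf);
    return d * (float)kvalues_nl[q];        // one extra memory load per value
}
```

The extra load is essentially free on CUDA (memory bound) and on CPUs with vector shuffle instructions, but it is apparently what hurts the Metal kernel.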

@ikawrakow ikawrakow merged commit 0becb22 into master Feb 27, 2024
60 of 61 checks passed
@ikawrakow ikawrakow deleted the ik/iq4_nl_xs branch February 27, 2024 14:34
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* Try IQ4_NL with blocks of 64 - does not look good

* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32

* iq4_xs: CUDA works - 133.2 t/s

* iq4_xs: AVX2 dot product

* iq4_xs: ARM_NEON dot product

* iq4_nl: Metal implementation

As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix

* iq4_xs: shrink by using IQ3_S for attn_k and attn_q

* iq4_xs: revert using IQ3_S for attn_k and attn_v

PPL vs size is good, but CPU performance suffers: on M2 Max
TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI

* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024 (same commit message as above)
@sorasoras

@ikawrakow
IQ quants don't seem to support forced DMMV.
Forced DMMV is about 7-8% faster for Q5_K_S. Do you have any plans to implement it in the future to further improve the performance of IQ quants?

@mofosyne added the labels "Review Complexity: High" and "Tensor Encoding Scheme" on May 25, 2024