
Tokenizer fixes #8379

Draft · jaime-m-p wants to merge 9 commits into master
Conversation

@jaime-m-p (Collaborator) commented Jul 8, 2024

More tokenizer fixes.



Examples of vocab differences:

INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR  detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR  detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR  detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
ERROR  detokenize=False id=32003 expected='<extra_id_96>' result='[PAD32003]'
ERROR  detokenize=False id=32004 expected='<extra_id_95>' result='[PAD32004]'
ERROR  detokenize=False id=32005 expected='<extra_id_94>' result='[PAD32005]'
ERROR  detokenize=False id=32006 expected='<extra_id_93>' result='[PAD32006]'
ERROR  detokenize=False id=32007 expected='<extra_id_92>' result='[PAD32007]'
ERROR  detokenize=False id=32008 expected='<extra_id_91>' result='[PAD32008]'
ERROR  detokenize=False id=32009 expected='<extra_id_90>' result='[PAD32009]'
INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR  detokenize=True id=100002 expected='�' result='ø'
ERROR  detokenize=True id=100003 expected='�' result='ö'
ERROR  detokenize=True id=100004 expected='�' result='ú'
ERROR  detokenize=True id=100005 expected='�' result='ÿ'
ERROR  detokenize=True id=100006 expected='�' result='õ'
ERROR  detokenize=True id=100007 expected='�' result='÷'
ERROR  detokenize=True id=100008 expected='�' result='û'
ERROR  detokenize=True id=100009 expected='�' result='ý'
ERROR  detokenize=True id=100010 expected='�' result='À'
ERROR  detokenize=True id=100011 expected='�' result='ù'
INFO VOCABFILE: './models/ggml-vocab-command-r.gguf'
ERROR  detokenize=True id=264 expected='\u200d' result='[UNK_BYTE_0xe2808d\u200d]'
ERROR  detokenize=True id=265 expected='‼' result='[UNK_BYTE_0xe280bc‼]'
ERROR  detokenize=True id=266 expected='⁉' result='[UNK_BYTE_0xe28189⁉]'
ERROR  detokenize=True id=267 expected='⃣' result='[UNK_BYTE_0xe283a3⃣]'
ERROR  detokenize=True id=268 expected='™' result='[UNK_BYTE_0xe284a2™]'
ERROR  detokenize=True id=269 expected='ℹ' result='[UNK_BYTE_0xe284b9ℹ]'
ERROR  detokenize=True id=270 expected='↔' result='[UNK_BYTE_0xe28694↔]'
ERROR  detokenize=True id=271 expected='↕' result='[UNK_BYTE_0xe28695↕]'
ERROR  detokenize=True id=272 expected='↖' result='[UNK_BYTE_0xe28696↖]'
ERROR  detokenize=True id=273 expected='↗' result='[UNK_BYTE_0xe28697↗]'
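The comparison producing these logs walks every token id and detokenizes it with both the reference (HF) tokenizer and llama.cpp. A rough sketch of the idea, with assumed helper callables rather than the PR's actual code:

import logging

logger = logging.getLogger(__name__)

def compare_vocabs(hf_decode, llama_detokenize, max_token_id: int) -> None:
    # hf_decode / llama_detokenize: callables mapping a list of ids -> str
    # (assumed to wrap the HF tokenizer and the cffi-backed llama.cpp API).
    for tok_id in range(max_token_id + 1):
        expected = hf_decode([tok_id])
        result = llama_detokenize([tok_id])
        if expected != result:
            logger.error("detokenize=True id=%d expected=%r result=%r",
                         tok_id, expected, result)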

jaime-m-p added 5 commits July 9, 2024 00:55
Some models ('jais' and 'command-r') copy the original UTF-8 bytes on error.
Others ('deepseek') seem to use the replacement character 0xFFFD.
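A minimal sketch of these two behaviors in plain Python (stdlib only; illustrative, not the PR's implementation):

raw = b"\xe2\x80"  # a truncated UTF-8 sequence

# 'deepseek'-style: undecodable bytes become U+FFFD REPLACEMENT CHARACTER.
print(raw.decode("utf-8", errors="replace"))

# 'jais'/'command-r'-style: keep the original bytes recoverable so they can
# be copied through verbatim (surrogateescape round-trips the raw bytes).
kept = raw.decode("utf-8", errors="surrogateescape")
assert kept.encode("utf-8", errors="surrogateescape") == raw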
@jaime-m-p marked this pull request as draft July 8, 2024 23:34
@github-actions bot added the 'testing' (Everything test related) and 'python' (python script changes) labels Jul 8, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
Fix pyparse problems: gcc inline functions

Test l/r-strip for more than 4 spaces

Improve mismatch range localization

Compare vocabs

Options to manage token text decoding errors:

Some models ('jais' and 'command-r') copy the original UTF-8 bytes on error.
Others ('deepseek') seem to use the replacement character 0xFFFD.
max_token_id = max(self.model.get_vocab().values())
if detokenize:
    ids = list(range(max_token_id + 1))
    vocab = self.model.batch_decode(ids, skip_special_tokens=False)
@compilade (Collaborator) commented Jul 9, 2024


Do you think this should be used in the convert script(s) instead of directly getting the strings from tokenizer.vocab?

EDIT: this might be a bad idea, since the tokenizer merges won't directly match with the strings from the vocab if that's done

@@ -36,7 +36,7 @@ def __init__(self, path_llama_h: str = None, path_includes: list[str] = [], path
         self.lib.llama_backend_init()

     def _load_libllama_cffi(self, path_llama_h: str, path_includes: list[str], path_libllama: str):
-        cmd = ["gcc", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
+        cmd = ["gcc", "-O0", "-fno-inline", "-E", "-P", "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)="]
Collaborator

I think -fno-inline is redundant with -O0. And -O0 alone works, while -fno-inline alone doesn't.

Anyway, I suggest resolving the conflict with master.
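For context, a hedged sketch of the preprocessing step this diff touches (the header path and subprocess wiring are assumptions): llama.h is run through the C preprocessor so cffi's cdef() never sees GCC-specific constructs.

import subprocess

# -O0 keeps system headers from expanding to GCC inline function bodies
# that the cffi parser cannot handle; the -D defines neutralize GCC keywords.
cmd = ["gcc", "-O0", "-E", "-P",
       "-D__restrict=", "-D__attribute__(x)=", "-D__asm__(x)=",
       "./llama.h"]
header_source = subprocess.run(cmd, capture_output=True, text=True,
                               check=True).stdout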

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
@compilade (Collaborator) commented Jul 9, 2024

INFO VOCABFILE: './models/ggml-vocab-deepseek-llm.gguf'
ERROR  detokenize=True id=100002 expected='�' result='ø'
ERROR  detokenize=True id=100003 expected='�' result='ö'
ERROR  detokenize=True id=100004 expected='�' result='ú'
ERROR  detokenize=True id=100005 expected='�' result='ÿ'
ERROR  detokenize=True id=100006 expected='�' result='õ'
ERROR  detokenize=True id=100007 expected='�' result='÷'
ERROR  detokenize=True id=100008 expected='�' result='û'
ERROR  detokenize=True id=100009 expected='�' result='ý'
ERROR  detokenize=True id=100010 expected='�' result='À'
ERROR  detokenize=True id=100011 expected='�' result='ù'

These are part of the added_tokens of deepseek-llm, and they match result exactly. Not sure where expected gets its tokens, but it is not correct if it doesn't take the added_tokens into account.
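A quick way to confirm that these ids come from added_tokens rather than the base vocab (the repo id is an assumption; requires the transformers package):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-llm-7b-base")
added = tok.get_added_vocab()  # maps token string -> id
print({i: s for s, i in added.items() if 100002 <= i <= 100011})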

INFO VOCABFILE: './models/ggml-vocab-t5.gguf'
ERROR  detokenize=False id=32000 expected='<extra_id_99>' result='[PAD32000]'
ERROR  detokenize=False id=32001 expected='<extra_id_98>' result='[PAD32001]'
ERROR  detokenize=False id=32002 expected='<extra_id_97>' result='[PAD32002]'
...

These are also part of the added tokens (of t5), but in this case it's llama.cpp which is wrong. This does seem useful for debugging the convert script(s)!

compilade added a commit that referenced this pull request Jul 13, 2024
* test-tokenizer-random : add a failing edge case for falcon
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 13, 2024
test-tokenizer-random : reduce potential conflicts with ggerganov#8379
@mofosyne added the 'Review Complexity : Low' label (Trivial changes to code that most beginner devs (or those who want a break) can tackle, e.g. UI fix) Jul 13, 2024
compilade added a commit that referenced this pull request Jul 14, 2024
* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece; see the sketch after this commit message)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from #8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with #8379

* test-tokenizer-random : add a failing edge case for falcon
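Regarding the Gemma spacing behavior noted in the commit message above, a hedged way to observe the difference (the repo id and local tokenizer.model path are assumptions; outputs not reproduced here):

from transformers import AutoTokenizer
import sentencepiece as spm

hf = AutoTokenizer.from_pretrained("google/gemma-2b")
print(hf.tokenize(" " * 20))  # HF reportedly prefers the 16-space token

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # from the same repo
print(sp.encode(" " * 20, out_type=str))  # SentencePiece splits differently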
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 25, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 27, 2024
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jul 27, 2024
Labels: python (python script changes), Review Complexity : Low (Trivial changes to code that most beginner devs (or those who want a break) can tackle, e.g. UI fix), testing (Everything test related)
Projects: None yet
Development: None yet (successfully merging this pull request may close these issues)
3 participants