
llama : Added support for Tekken pre-tokenizer (#8577) #8579

Merged (4 commits) on Jul 20, 2024

Conversation

@m18coppola (Contributor) commented Jul 18, 2024

Adding Tekken pre-tokenizer support for Mistral Nemo models.

  • Added tokenizer type for Mistral-Nemo-Base-2407 in `convert-hf-to-gguf-update.py`
  • Added the chkhsh for Mistral-Nemo-Base-2407 in `convert-hf-to-gguf.py`
  • Added `LLAMA_VOCAB_PRE_TYPE_TEKKEN` enum to `llama.h`
  • Added pre-tokenizer regex for `LLAMA_VOCAB_PRE_TYPE_TEKKEN` to `llama.cpp`
  • Ran `./tests/test-tokenizer-0 ./models/ggml-vocab-tekken.gguf`. Tests passed.

Partially addresses issue #8577
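
For context, the convert scripts identify which pre-tokenizer a model needs by hashing the token IDs the HF tokenizer produces for a fixed probe string. A minimal sketch of that mechanism (paraphrased; the real convert-hf-to-gguf-update.py uses a specific, much longer chktxt, so the hash below will differ):

```python
# Minimal sketch of the chkhsh fingerprinting done by convert-hf-to-gguf-update.py:
# encode a fixed probe text and hash the resulting token-ID list.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "Hello 🦙, world! 1234567890"  # placeholder probe text, not the real one

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # convert-hf-to-gguf.py matches this hash against its known list
```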

@m18coppola changed the title from “llama : Added support for Viking pre-tokenizer (#8577)” to “llama : Added support for Tekken pre-tokenizer (#8577)” on Jul 18, 2024
@HanClinto (Collaborator)

Nice work!!

@HanClinto (Collaborator)

I've reviewed what I can, but I'm not familiar enough with the process of adding new tokenizers to the system to fully verify this, and I'd like a corroborating review if we can get one.

Overall, I think this is probably good enough to merge so we can work on the next step of Mistral NeMo support?

Should this new tokenizer be added to the tests/test-tokenizer-0 target in the Makefile as well?

@m18coppola (Contributor, Author) commented Jul 19, 2024

I do think it's "good enough", as it covers all the test cases in test-tokenizer-0, but I suspect a test case could be manufactured that causes my implementation to fail. The regex engine used by llama.cpp does not support Unicode categories (\p{$CATEGORY}). There are workarounds, which I cannot completely wrap my head around, for the categories L (Letters), N (Numbers) and P (Punctuation). The original Tekken pre-tokenizer uses the subcategories Lu/Ll/Lt/Lo (uppercase/lowercase/titlecase/other letters) and the M (Mark) category. To work around this, I used only the workaround categories that are implemented and essentially "subtracted" the characters that should not be included. For instance:

  • `[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^a-z]))`
  • `[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^A-Z]))`

In these cases, I matched all letters with L and subtracted the A-Z/a-z letters using a lookahead assertion. This technically ignores all the Unicode characters in the M category. Despite this, I was unable to create a test case that makes these regex changes fail, but I am not certain it can't be done. If someone can manage to create a string that pokes at these edges, it would be helpful; there's also the possibility that the regex provided by the Tekken tokenizer is simply overly verbose and my changes are functionally equivalent.
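
A brute-force way to hunt for such strings is to compare the two character classes over every codepoint with Python's third-party `regex` module, which (unlike the stdlib `re`) supports `\p{...}`. A minimal sketch, exhaustive but slow:

```python
# Compare the original character class against the rewritten one for every
# codepoint, using the third-party `regex` module (stdlib `re` lacks \p{...}).
import sys
import regex

original = regex.compile(r"[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]")
rewritten = regex.compile(r"(?=[\p{L}])[^a-z]")

diffs = [
    cp
    for cp in range(sys.maxunicode + 1)
    if bool(original.fullmatch(chr(cp))) != bool(rewritten.fullmatch(chr(cp)))
]
# Disagreements come from two directions: marks (\p{M}) match the original
# class but not the rewrite, while non-ASCII lowercase letters such as 'µ'
# match the rewrite but not the original.
print(f"{len(diffs)} codepoints differ, e.g. {[hex(cp) for cp in diffs[:5]]}")
```

Any codepoints it reports are candidates for building exactly the kind of adversarial test string described above.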

Removed unneeded `vocab.tokenizer_clean_spaces` assignment
@compilade (Collaborator) commented Jul 19, 2024

If someone can manage to create a string that pokes at these edges it would be helpful

@m18coppola I managed to find a few failing cases (though not many, impressively) with tests/test-tokenizer-random.py using generator_random_unicodes, but I haven't managed to extract the source strings yet.

INFO:test-tokenizer-random:generator_random_unicodes: ini
ERROR:test-tokenizer-random: Expected: [1236, 1143, 1163, 1406, 1133, 1010, 1236, 2265]
ERROR:test-tokenizer-random:   Result: [1236, 1143, 4898, 1133, 1010, 1236, 2265, 1240]
ERROR:test-tokenizer-random: encode_errors=1
ERROR:test-tokenizer-random: Expected: [1224, 1183, 11933, 1166, 4760, 1240, 1145, 1170]
ERROR:test-tokenizer-random:   Result: [1224, 1183, 1158, 1688, 1166, 4760, 1240, 1145]
ERROR:test-tokenizer-random: encode_errors=2
ERROR:test-tokenizer-random: Expected: [1013, 1218, 84610, 1240, 1146, 1145, 1140, 1240]
ERROR:test-tokenizer-random:   Result: [1013, 1218, 1132, 36167, 1240, 1146, 1145, 1140]
ERROR:test-tokenizer-random: encode_errors=3
...

To work around this, I used only the workaround categories that are implemented and essentially "subtracted" the characters that should not be included. For instance:

* `[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^a-z]))`

* `[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^A-Z]))`

This isn't comprehensive for all of Unicode, and might be what is causing the above problems. For example, this is the full regex needed to match lowercase characters (as in Ll):

[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķ-ĸĺļľŀłńņň-ʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌ-ƍƒƕƙ-ƛƞơƣƥƨƪ-ƫƭưƴƶƹ-ƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜ-ǝǟǡǣǥǧǩǫǭǯ-ǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿ-ɀɂɇɉɋɍɏ-ʓʕ-ʯͱͳͷͻ-ͽΐά-ώϐ-ϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻ-ϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎ-ӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯՠ-ֈ-------ḿṿ-ếỿ---------------------------ⲿ--------ꮿ---𐐨-𐑏𐓘-𐓻𐖗-𐖡𐖣-𐖱𐖳-𐖹𐖻-𐖼𐳀-𐳲𑣀-𑣟𖹠-𖹿𝐚-𝐳𝑎-𝑔𝑖-𝑧𝒂-𝒛𝒶-𝒹𝒻𝒽-𝓃𝓅-𝓏𝓪-𝔃𝔞-𝔷𝕒-𝕫𝖆-𝖟𝖺-𝗓𝗮-𝘇𝘢-𝘻𝙖-𝙯𝚊-𝚥𝛂-𝛚𝛜-𝛡𝛼-𝜔𝜖-𝜛𝜶-𝝎𝝐-𝝕𝝰-𝞈𝞊-𝞏𝞪-𝟂𝟄-𝟉𝟋𝼀-𝼉𝼋-𝼞𝼥-𝼪𞤢-𞥃]

While for uppercase (as in Lu):

[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶͿΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-Ֆ----Ჿ----Ἷ----------------------𐐀-𐐧𐒰-𐓓𐕰-𐕺𐕼-𐖊𐖌-𐖒𐖔-𐖕𐲀-𐲲𑢠-𑢿𖹀-𖹟𝐀-𝐙𝐴-𝑍𝑨-𝒁𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒵𝓐-𝓩𝔄-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔸-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕬-𝖅𝖠-𝖹𝗔-𝗭𝘈-𝘡𝘼-𝙕𝙰-𝚉𝚨-𝛀𝛢-𝛺𝜜-𝜴𝝖-𝝮𝞐-𝞨𝟊𞤀-𞤡]

And Lt is:

[DžLjNjDz---]
Script used to generate the above:
#!/usr/bin/env python3

ranges: list[tuple[int, int]] = []

# You need this file from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
with open("UnicodeData.txt", "r") as f:
    for line in f:
        fields = line.split(';')
        # Replace this with the category for which ranges should be found
        if fields[2] == "Lu":
            cp = int(fields[0], 16)  # codepoint of this entry
            if len(ranges) == 0:
                ranges.append((cp, cp))
            elif ranges[-1][1] == cp - 1:
                # Extend the previous range when the codepoints are contiguous
                ranges[-1] = (ranges[-1][0], cp)
            else:
                ranges.append((cp, cp))

print("[", end="")
for lo, hi in ranges:
    if lo == hi:
        print(chr(lo), end="")
    else:
        print(f"{chr(lo)}-{chr(hi)}", end="")
print("]")

I think this is fine for now with the shorter ASCII-only exclusions, but this should eventually be fixed by properly supporting at least some sub-categories in gen-unicode-data.py and unicode.h. There is a limit to how many flags can fit in struct codepoint_flags, but from counting the categories in pcresyntax(3), I think it's possible to fit them all in 32 bits (see the sketch after the category listing below).

cc @jaime-m-p

(More evidence for the hope of fitting all the flags in 32 bits)
$ cut -d';' UnicodeData.txt -f3 | sort | uniq
Cc
Cf
Co
Cs
Ll
Lm
Lo
Lt
Lu
Mc
Me
Mn
Nd
Nl
No
Pc
Pd
Pe
Pf
Pi
Po
Ps
Sc
Sk
Sm
So
Zl
Zp
Zs

$ cut -d';' UnicodeData.txt -f3 | sort | uniq | wc -l
29
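
As a sketch of that packing (all names here are hypothetical; nothing below exists in gen-unicode-data.py or unicode.h), the 29 categories fit in a 32-bit word with bits to spare for helper flags:

```python
# Hypothetical sketch: one bit per Unicode general category in a 32-bit word.
# With 29 categories, 3 bits remain for helper flags such as is_whitespace.
CATEGORIES = [
    "Cc", "Cf", "Co", "Cs",
    "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps",
    "Sc", "Sk", "Sm", "So",
    "Zl", "Zp", "Zs",
]
CATEGORY_BIT = {cat: 1 << i for i, cat in enumerate(CATEGORIES)}
IS_WHITESPACE = 1 << 29  # example helper flag in one of the spare bits

def pack_flags(cats: set[str], is_whitespace: bool = False) -> int:
    """Pack category memberships (plus one helper flag) into a 32-bit mask."""
    word = IS_WHITESPACE if is_whitespace else 0
    for cat in cats:
        word |= CATEGORY_BIT[cat]
    return word

assert pack_flags(set(CATEGORIES), True) < 2**32  # everything fits in 32 bits
```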

I do think it's "good enough"

I agree with this too. Keep it simple at first, especially since it seems to work for most cases. What's nice is that fixing the regex will not require reconverting the model, so it can safely be done later.

@maziyarpanahi
Thanks @m18coppola for this PR.
I am trying to use your PR, but I am getting this error:

line 396, in apply_metadata_heuristic
    model_full_name_component, org_component, basename, finetune, version, size_label = Metadata.get_model_id_components(model_id, total_params)
...
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
                                      ~~~~^^^
IndexError: string index out of range

@ggerganov (Owner) left a comment:


Most tokenization tests that I did are passing, though I found a few that fail:

src: 'Thai : สพรั ่ ง กั'
res: 'Thai : สพรั ่ ง กั'
tok: 2438 2464 1737 18430 13119 4026 4739 1032 5004 3341 1135 18031 4739 
main : failed test:    'Thai : สพรั ่ ง กั'
main : detokenized to: 'Thai : สพรั ่ ง กั' instead of 'Thai : สพรั ่ ง กั'
main : expected tokens:   2438 'Th',   2464 'ai',   1737 ' :',  18430 ' ส',  13119 'พ',  43134 'รั',   1032 ' ',   5004 '่',   3341 ' ',   1135 '',  18031 ' ก',   4739 'ั', 
main : got tokens:        2438 'Th',   2464 'ai',   1737 ' :',  18430 ' ส',  13119 'พ',   4026 'ร',   4739 'ั',   1032 ' ',   5004 '่',   3341 ' ',   1135 '',  18031 ' ก',   4739 'ั', 

This one also fails with the bert-bge model.

I think it is fine to merge and resolve later
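
For anyone wanting to poke at this, a minimal sketch for obtaining the reference tokenization of the failing string from the upstream HF tokenizer (assuming access to the model repo), to compare against llama.cpp's output:

```python
# Get the reference tokenization of the failing Thai string from the HF
# tokenizer, for side-by-side comparison with llama.cpp's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
text = "Thai : สพรั ่ ง กั"
ids = tokenizer.encode(text, add_special_tokens=False)
print(ids)
print([tokenizer.decode([i]) for i in ids])  # per-token pieces
```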

@mofosyne added the Review Complexity : Low label (trivial changes that most beginner devs, or those who want a break, can tackle, e.g. UI fixes) on Jul 19, 2024
@jaime-m-p (Collaborator)

@compilade

there is a limit to how many flags can fit in struct codepoint_flags

Currently it is set to 16 bits to save memory, but we can use 64 bits (Unicode categories plus helper flags: is_whitespace, etc.), or we can drop some unusual categories until everything fits in 32 bits, saving half the memory.

I'm experimenting with gen-unicode-data.py: adding all Unicode categories grows the range list from ~2000 to ~4000 entries.
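
A rough way to see that growth (a sketch; like the range-printing script above, it ignores the <..., First>/<..., Last> pair entries in UnicodeData.txt, so counts are approximate):

```python
# Count contiguous codepoint ranges per general category in UnicodeData.txt,
# to estimate how the generated range tables grow as categories are added.
from collections import defaultdict

ranges: dict[str, list[tuple[int, int]]] = defaultdict(list)
with open("UnicodeData.txt") as f:
    for line in f:
        cp_hex, _, cat = line.split(";")[:3]
        cp = int(cp_hex, 16)
        r = ranges[cat]
        if r and r[-1][1] == cp - 1:
            r[-1] = (r[-1][0], cp)  # extend a contiguous range
        else:
            r.append((cp, cp))

total = sum(len(r) for r in ranges.values())
print(f"{total} ranges over {len(ranges)} categories")
```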

@danielhanchen (Contributor)

Just a heads-up: the tokenizer in the HF repo is a bit different from the original first upload from Mistral (so best to re-download it):

  1. No EOS token appended by default (helped fix this with HF 22 hours ago)
  2. `clean_up_tokenization_spaces=False` now, not `True`
  3. Now `PreTrainedTokenizerFast`, not `GPT2Tokenizer`
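
A quick way to confirm a re-downloaded tokenizer carries these settings (a sketch using the transformers library):

```python
# Verify the three changes above on the re-downloaded HF tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
print(type(tok).__name__)                # expect PreTrainedTokenizerFast
print(tok.clean_up_tokenization_spaces)  # expect False
ids = tok.encode("hello")
print(ids[-1] == tok.eos_token_id)       # expect False: no EOS appended
```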

@HanClinto (Collaborator)

Just a heads-up: the tokenizer in the HF repo is a bit different from the original first upload from Mistral (so best to re-download it):

1. No EOS token appended by default (helped fix this with HF 22 hours ago)
2. `clean_up_tokenization_spaces=False` now, not `True`
3. Now `PreTrainedTokenizerFast`, not `GPT2Tokenizer`

Thanks for the alert. I just re-ran the conversion and can confirm that the chkhsh of the latest tokenizer version on HF is different from what it was before. I was getting the same chkhsh as this PR with yesterday's version; with the new version I'm getting a chkhsh of 63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e.

We'll need to get this PR updated with the latest before we merge it in.

@m18coppola (Contributor, Author)

"Add EOS" is disabled by default, changed clean_up_tokenization_spaces to false. Updated chkhsh.

Let me know if I should squash this into a single commit.

@HanClinto (Collaborator)

Let me know if I should squash this into a single commit.

The default merge strategy for the repo is squash, so no need to squash in your PR -- it's fine (even ideal) to keep it separate here.

Y'all think we're good to merge?

@ggerganov ggerganov merged commit 9403622 into ggerganov:master Jul 20, 2024
55 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed unneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces

* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@ayttop commented Aug 20, 2024

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'tekken'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
main: error: unable to load model

(a) C:\Users\ArabTech\Desktop\2\llama-b3419-bin-win-openblas-x64>

llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '../Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
main: error: unable to load model

(a) C:\Users\ArabTech\Desktop\2\llama-b3604-bin-win-openblas-x64 (1)>cd..

(a) C:\Users\ArabTech\Desktop\2>wasmedge --dir .:. llama-api-server.wasm --model-path Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --prompt-template mistral-instruct --ctx-size 128000
'wasmedge' is not recognized as an internal or external command,
operable program or batch file.

It does not run with llama.cpp or LlamaEdge.

Labels: python (python script changes), Review Complexity : Low

10 participants