
llama : Added support for Tekken pre-tokenizer (#8577) #8579

Merged (4 commits) on Jul 20, 2024

Conversation

@m18coppola (Contributor) commented Jul 18, 2024

Adding Tekken pre-tokenizer support for Mistral Nemo models.

  • Added tokenizer type for Mistral-Nemo-Base-2407 in `convert-hf-to-gguf-update.py`
  • Added the chkhsh for Mistral-Nemo-Base-2407 in `convert-hf-to-gguf.py`
  • Added `LLAMA_VOCAB_PRE_TYPE_TEKKEN` enum to `llama.h`
  • Added pre-tokenizer regex for `LLAMA_VOCAB_PRE_TYPE_TEKKEN` to `llama.cpp`
  • Ran `./tests/test-tokenizer-0 ./models/ggml-vocab-tekken.gguf`. Tests passed.

Partially addresses issue #8577
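
For context, the convert scripts identify which pre-tokenizer a model needs by hashing the token IDs the HF tokenizer produces for a fixed probe string. A minimal sketch of that mechanism (paraphrased; the real convert-hf-to-gguf-update.py uses a specific, much longer chktxt, so the hash below will differ):

```python
# Minimal sketch of the chkhsh fingerprinting done by convert-hf-to-gguf-update.py:
# encode a fixed probe text and hash the resulting token-ID list.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "Hello 🦙, world! 1234567890"  # placeholder probe text, not the real one

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # convert-hf-to-gguf.py matches this hash against its known list
```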

@m18coppola changed the title from “llama : Added support for Viking pre-tokenizer (#8577)” to “llama : Added support for Tekken pre-tokenizer (#8577)” on Jul 18, 2024
@HanClinto (Collaborator)

Nice work!!

@HanClinto (Collaborator)

I've reviewed what I can, but I'm not familiar enough with the process of adding new tokenizers to the system to fully verify this, and I'd like a corroborating review if we can get one.

Overall, I think this is probably good enough to merge so we can work on the next step of Mistral NeMo support?

Should this new tokenizer be added to the tests/test-tokenizer-0 target in the Makefile as well?

@m18coppola (Contributor, Author) commented Jul 19, 2024

I do think it's "good enough", as it covers all the test cases in test-tokenizer-0, but I suspect a test case could be manufactured that causes my implementation to fail. The regex engine used by llama.cpp does not support Unicode categories (\p{$CATEGORY}). There are workarounds, which I cannot completely wrap my head around, for the categories L (Letters), N (Numbers) and P (Punctuation). The original Tekken pre-tokenizer uses the subcategories Lu/Ll/Lt/Lo (uppercase/lowercase/titlecase/other letters) and the M (Mark) category. To work around this, I used only the workaround categories that are implemented and essentially "subtracted" the characters that should not be included. For instance:

  • `[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^a-z]))`
  • `[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^A-Z]))`

In these cases, I matched all letters with L and subtracted the A-Z/a-z letters using a lookahead assertion. This technically ignores all the Unicode characters in the M category. Despite this, I was unable to create a test case that makes these regex changes fail, but I am not certain it can't be done. If someone can manage to create a string that pokes at these edges, it would be helpful; there's also the possibility that the regex provided by the Tekken tokenizer is simply overly verbose and my changes are functionally equivalent.
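
A brute-force way to hunt for such strings is to compare the two character classes over every codepoint with Python's third-party `regex` module, which (unlike the stdlib `re`) supports `\p{...}`. A minimal sketch, exhaustive but slow:

```python
# Compare the original character class against the rewritten one for every
# codepoint, using the third-party `regex` module (stdlib `re` lacks \p{...}).
import sys
import regex

original = regex.compile(r"[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]")
rewritten = regex.compile(r"(?=[\p{L}])[^a-z]")

diffs = [
    cp
    for cp in range(sys.maxunicode + 1)
    if bool(original.fullmatch(chr(cp))) != bool(rewritten.fullmatch(chr(cp)))
]
# Disagreements come from two directions: marks (\p{M}) match the original
# class but not the rewrite, while non-ASCII lowercase letters such as 'µ'
# match the rewrite but not the original.
print(f"{len(diffs)} codepoints differ, e.g. {[hex(cp) for cp in diffs[:5]]}")
```

Any codepoints it reports are candidates for building exactly the kind of adversarial test string described above.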

Removed unneeded `vocab.tokenizer_clean_spaces` assignment
@compilade (Collaborator) commented Jul 19, 2024

If someone can manage to create a string that pokes at these edges it would be helpful

@m18coppola I managed to find a few failing cases (though not many, impressively) with tests/test-tokenizer-random.py using generator_random_unicodes, but I haven't managed to extract the source strings yet.

INFO:test-tokenizer-random:generator_random_unicodes: ini
ERROR:test-tokenizer-random: Expected: [1236, 1143, 1163, 1406, 1133, 1010, 1236, 2265]
ERROR:test-tokenizer-random:   Result: [1236, 1143, 4898, 1133, 1010, 1236, 2265, 1240]
ERROR:test-tokenizer-random: encode_errors=1
ERROR:test-tokenizer-random: Expected: [1224, 1183, 11933, 1166, 4760, 1240, 1145, 1170]
ERROR:test-tokenizer-random:   Result: [1224, 1183, 1158, 1688, 1166, 4760, 1240, 1145]
ERROR:test-tokenizer-random: encode_errors=2
ERROR:test-tokenizer-random: Expected: [1013, 1218, 84610, 1240, 1146, 1145, 1140, 1240]
ERROR:test-tokenizer-random:   Result: [1013, 1218, 1132, 36167, 1240, 1146, 1145, 1140]
ERROR:test-tokenizer-random: encode_errors=3
...

To work around this, I used only the workaround categories that are implemented and essentially "subtracted" the characters that should not be included. For instance:

* `[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^a-z]))`

* `[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]` was changed to `((?=[\\p{L}])([^A-Z]))`

This isn't comprehensive for all of Unicode, and might be what is causing the above problems. For example, this is the full regex needed to match lowercase characters (as in Ll):

[a-zµß-öø-ÿāăąćĉċčďđēĕėęěĝğġģĥħĩīĭįıijĵķ-ĸĺļľŀłńņň-ʼnŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷźżž-ƀƃƅƈƌ-ƍƒƕƙ-ƛƞơƣƥƨƪ-ƫƭưƴƶƹ-ƺƽ-ƿdžljnjǎǐǒǔǖǘǚǜ-ǝǟǡǣǥǧǩǫǭǯ-ǰdzǵǹǻǽǿȁȃȅȇȉȋȍȏȑȓȕȗșțȝȟȡȣȥȧȩȫȭȯȱȳ-ȹȼȿ-ɀɂɇɉɋɍɏ-ʓʕ-ʯͱͳͷͻ-ͽΐά-ώϐ-ϑϕ-ϗϙϛϝϟϡϣϥϧϩϫϭϯ-ϳϵϸϻ-ϼа-џѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧҩҫҭүұҳҵҷҹһҽҿӂӄӆӈӊӌӎ-ӏӑӓӕӗәӛӝӟӡӣӥӧөӫӭӯӱӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԛԝԟԡԣԥԧԩԫԭԯՠ-ֈ-------ḿṿ-ếỿ---------------------------ⲿ--------ꮿ---𐐨-𐑏𐓘-𐓻𐖗-𐖡𐖣-𐖱𐖳-𐖹𐖻-𐖼𐳀-𐳲𑣀-𑣟𖹠-𖹿𝐚-𝐳𝑎-𝑔𝑖-𝑧𝒂-𝒛𝒶-𝒹𝒻𝒽-𝓃𝓅-𝓏𝓪-𝔃𝔞-𝔷𝕒-𝕫𝖆-𝖟𝖺-𝗓𝗮-𝘇𝘢-𝘻𝙖-𝙯𝚊-𝚥𝛂-𝛚𝛜-𝛡𝛼-𝜔𝜖-𝜛𝜶-𝝎𝝐-𝝕𝝰-𝞈𝞊-𝞏𝞪-𝟂𝟄-𝟉𝟋𝼀-𝼉𝼋-𝼞𝼥-𝼪𞤢-𞥃]

While for uppercase (as in Lu):

[A-ZÀ-ÖØ-ÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢĤĦĨĪĬĮİIJĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸ-ŹŻŽƁ-ƂƄƆ-ƇƉ-ƋƎ-ƑƓ-ƔƖ-ƘƜ-ƝƟ-ƠƢƤƦ-ƧƩƬƮ-ƯƱ-ƳƵƷ-ƸƼDŽLJNJǍǏǑǓǕǗǙǛǞǠǢǤǦǨǪǬǮDZǴǶ-ǸǺǼǾȀȂȄȆȈȊȌȎȐȒȔȖȘȚȜȞȠȢȤȦȨȪȬȮȰȲȺ-ȻȽ-ȾɁɃ-ɆɈɊɌɎͰͲͶͿΆΈ-ΊΌΎ-ΏΑ-ΡΣ-ΫϏϒ-ϔϘϚϜϞϠϢϤϦϨϪϬϮϴϷϹ-ϺϽ-ЯѠѢѤѦѨѪѬѮѰѲѴѶѸѺѼѾҀҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҪҬҮҰҲҴҶҸҺҼҾӀ-ӁӃӅӇӉӋӍӐӒӔӖӘӚӜӞӠӢӤӦӨӪӬӮӰӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԚԜԞԠԢԤԦԨԪԬԮԱ-Ֆ----Ჿ----Ἷ----------------------𐐀-𐐧𐒰-𐓓𐕰-𐕺𐕼-𐖊𐖌-𐖒𐖔-𐖕𐲀-𐲲𑢠-𑢿𖹀-𖹟𝐀-𝐙𝐴-𝑍𝑨-𝒁𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒵𝓐-𝓩𝔄-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔸-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕬-𝖅𝖠-𝖹𝗔-𝗭𝘈-𝘡𝘼-𝙕𝙰-𝚉𝚨-𝛀𝛢-𝛺𝜜-𝜴𝝖-𝝮𝞐-𝞨𝟊𞤀-𞤡]

And Lt is:

[DžLjNjDz---]
Script used to generate the above:
#!/usr/bin/env python3

ranges: list[tuple[int, int]] = []

# You need this file from https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
with open("UnicodeData.txt", "r") as f:
    for line in f:
        fields = line.split(';')
        # Replace this with the category for which ranges should be found
        if fields[2] == "Lu":
            cp = int(fields[0], 16)  # codepoint of this entry
            if len(ranges) == 0:
                ranges.append((cp, cp))
            elif ranges[-1][1] == cp - 1:
                # Extend the previous range when the codepoints are contiguous
                ranges[-1] = (ranges[-1][0], cp)
            else:
                ranges.append((cp, cp))

print("[", end="")
for lo, hi in ranges:
    if lo == hi:
        print(chr(lo), end="")
    else:
        print(f"{chr(lo)}-{chr(hi)}", end="")
print("]")

I think this is fine for now with the shorter ASCII-only exclusions, but this should eventually be fixed by properly supporting at least some sub-categories in gen-unicode-data.py and unicode.h. There is a limit to how many flags can fit in struct codepoint_flags, but from counting the categories in pcresyntax(3), I think it's possible to fit them all in 32 bits (see the sketch after the category listing below).

cc @jaime-m-p

(More evidence for the hope of fitting all the flags in 32 bits)
$ cut -d';' UnicodeData.txt -f3 | sort | uniq
Cc
Cf
Co
Cs
Ll
Lm
Lo
Lt
Lu
Mc
Me
Mn
Nd
Nl
No
Pc
Pd
Pe
Pf
Pi
Po
Ps
Sc
Sk
Sm
So
Zl
Zp
Zs

$ cut -d';' UnicodeData.txt -f3 | sort | uniq | wc -l
29
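
As a sketch of that packing (all names here are hypothetical; nothing below exists in gen-unicode-data.py or unicode.h), the 29 categories fit in a 32-bit word with bits to spare for helper flags:

```python
# Hypothetical sketch: one bit per Unicode general category in a 32-bit word.
# With 29 categories, 3 bits remain for helper flags such as is_whitespace.
CATEGORIES = [
    "Cc", "Cf", "Co", "Cs",
    "Ll", "Lm", "Lo", "Lt", "Lu",
    "Mc", "Me", "Mn",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps",
    "Sc", "Sk", "Sm", "So",
    "Zl", "Zp", "Zs",
]
CATEGORY_BIT = {cat: 1 << i for i, cat in enumerate(CATEGORIES)}
IS_WHITESPACE = 1 << 29  # example helper flag in one of the spare bits

def pack_flags(cats: set[str], is_whitespace: bool = False) -> int:
    """Pack category memberships (plus one helper flag) into a 32-bit mask."""
    word = IS_WHITESPACE if is_whitespace else 0
    for cat in cats:
        word |= CATEGORY_BIT[cat]
    return word

assert pack_flags(set(CATEGORIES), True) < 2**32  # everything fits in 32 bits
```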

I do think it's "good enough"

I agree with this too. Keep it simple at first, especially since it seems to work for most cases. What's nice is that fixing the regex will not require reconverting the model, so it can safely be done later.

@maziyarpanahi
Thanks @m18coppola for this PR.
I am trying to use your PR, but I am getting this error:

line 396, in apply_metadata_heuristic
    model_full_name_component, org_component, basename, finetune, version, size_label = Metadata.get_model_id_components(model_id, total_params)
...
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
                                      ~~~~^^^
IndexError: string index out of range

@ggerganov (Owner) left a comment:


Most tokenization tests that I did are passing, though I found a few that fail:

src: 'Thai : สพรั ่ ง กั'
res: 'Thai : สพรั ่ ง กั'
tok: 2438 2464 1737 18430 13119 4026 4739 1032 5004 3341 1135 18031 4739 
main : failed test:    'Thai : สพรั ่ ง กั'
main : detokenized to: 'Thai : สพรั ่ ง กั' instead of 'Thai : สพรั ่ ง กั'
main : expected tokens:   2438 'Th',   2464 'ai',   1737 ' :',  18430 ' ส',  13119 'พ',  43134 'รั',   1032 ' ',   5004 '่',   3341 ' ',   1135 '',  18031 ' ก',   4739 'ั', 
main : got tokens:        2438 'Th',   2464 'ai',   1737 ' :',  18430 ' ส',  13119 'พ',   4026 'ร',   4739 'ั',   1032 ' ',   5004 '่',   3341 ' ',   1135 '',  18031 ' ก',   4739 'ั', 

This one also fails with the bert-bge model.

I think it is fine to merge and resolve later
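
For anyone wanting to poke at this, a minimal sketch for obtaining the reference tokenization of the failing string from the upstream HF tokenizer (assuming access to the model repo), to compare against llama.cpp's output:

```python
# Get the reference tokenization of the failing Thai string from the HF
# tokenizer, for side-by-side comparison with llama.cpp's tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
text = "Thai : สพรั ่ ง กั"
ids = tokenizer.encode(text, add_special_tokens=False)
print(ids)
print([tokenizer.decode([i]) for i in ids])  # per-token pieces
```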

@mofosyne added the Review Complexity : Low label (trivial changes that most beginner devs, or those who want a break, can tackle, e.g. UI fixes) on Jul 19, 2024
@jaime-m-p (Collaborator)

@compilade

there is a limit to how many flags can fit in struct codepoint_flags

Currently it is set to 16 bits to save memory, but we can use 64 bits (Unicode categories plus helper flags: is_whitespace, etc.), or we can drop some unusual categories until everything fits in 32 bits, saving half the memory.

I'm experimenting with gen-unicode-data.py: adding all Unicode categories grows the range list from ~2000 to ~4000 entries.
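
A rough way to see that growth (a sketch; like the range-printing script above, it ignores the <..., First>/<..., Last> pair entries in UnicodeData.txt, so counts are approximate):

```python
# Count contiguous codepoint ranges per general category in UnicodeData.txt,
# to estimate how the generated range tables grow as categories are added.
from collections import defaultdict

ranges: dict[str, list[tuple[int, int]]] = defaultdict(list)
with open("UnicodeData.txt") as f:
    for line in f:
        cp_hex, _, cat = line.split(";")[:3]
        cp = int(cp_hex, 16)
        r = ranges[cat]
        if r and r[-1][1] == cp - 1:
            r[-1] = (r[-1][0], cp)  # extend a contiguous range
        else:
            r.append((cp, cp))

total = sum(len(r) for r in ranges.values())
print(f"{total} ranges over {len(ranges)} categories")
```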

@danielhanchen (Contributor)

Just a heads-up: the tokenizer in the HF repo is a bit different from the original first upload from Mistral (so best to re-download it):

  1. No EOS token appended by default (helped fix this with HF 22 hours ago)
  2. `clean_up_tokenization_spaces=False` now, not `True`
  3. Now `PreTrainedTokenizerFast`, not `GPT2Tokenizer`
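
A quick way to confirm a re-downloaded tokenizer carries these settings (a sketch using the transformers library):

```python
# Verify the three changes above on the re-downloaded HF tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")
print(type(tok).__name__)                # expect PreTrainedTokenizerFast
print(tok.clean_up_tokenization_spaces)  # expect False
ids = tok.encode("hello")
print(ids[-1] == tok.eos_token_id)       # expect False: no EOS appended
```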

@HanClinto (Collaborator)

Just a heads-up: the tokenizer in the HF repo is a bit different from the original first upload from Mistral (so best to re-download it):

1. No EOS token appended by default (helped fix this with HF 22 hours ago)
2. `clean_up_tokenization_spaces=False` now, not `True`
3. Now `PreTrainedTokenizerFast`, not `GPT2Tokenizer`

Thanks for the alert. I just re-ran the conversion and can confirm that the chkhsh of the latest tokenizer version on HF is different from what it was before. I was getting the same chkhsh as this PR with yesterday's version; with the new version I'm getting a chkhsh of 63b97e4253352e6f357cc59ea5b583e3a680eaeaf2632188c2b952de2588485e.

We'll need to get this PR updated with the latest before we merge it in.

@m18coppola (Contributor, Author)

"Add EOS" is disabled by default, changed clean_up_tokenization_spaces to false. Updated chkhsh.

Let me know if I should squash this into a single commit.

@HanClinto (Collaborator)

Let me know if I should squash this into a single commit.

The default merge strategy for the repo is squash, so no need to squash in your PR -- it's fine (even ideal) to keep it separate here.

Y'all think we're good to merge?

@ggerganov ggerganov merged commit 9403622 into ggerganov:master Jul 20, 2024
55 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed unneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces

* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@ayttop commented Aug 20, 2024

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'tekken'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '../Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
main: error: unable to load model

(a) C:\Users\ArabTech\Desktop\2\llama-b3419-bin-win-openblas-x64>

llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model '../Mistral-Nemo-Instruct-2407-Q4_K_M.gguf'
main: error: unable to load model

(a) C:\Users\ArabTech\Desktop\2\llama-b3604-bin-win-openblas-x64 (1)>cd..

(a) C:\Users\ArabTech\Desktop\2>wasmedge --dir .:. llama-api-server.wasm --model-path Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --prompt-template mistral-instruct --ctx-size 128000
'wasmedge' is not recognized as an internal or external command,
operable program or batch file.

It does not run with llama.cpp or LlamaEdge.

Labels: python (python script changes), Review Complexity : Low

10 participants