
gguf_dump.py: fix markddown kv array print #8588

Merged

Conversation

mofosyne
Collaborator

The initial gguf dump didn't match the output of main.cpp, so the script must have been reading the KV arrays wrong. I adjusted the Python script until the output matched — see the dump below.
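
For reference, a minimal sketch of how a dump like the one below might be regenerated. It assumes gguf_dump.py exposes a --markdown flag (implied by this PR's subject but not shown in this thread) and that the model file sits in the working directory:

```python
# Hedged sketch: run the dump script the way a user would from the shell.
# The --markdown flag and the model filename are assumptions for illustration.
import subprocess

subprocess.run(
    ["python3", "gguf-py/scripts/gguf_dump.py", "--markdown", "Tinyllama-4.6M-v0.0-F16.gguf"],
    check=True,
)
```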

POS TYPE Count Key Value
1 UINT32 1 GGUF.version 3
2 UINT64 1 GGUF.tensor_count 75
3 UINT64 1 GGUF.kv_count 33
4 STRING 1 general.architecture 'llama'
5 STRING 1 general.type 'model'
6 STRING 1 general.name 'TinyLLama'
7 STRING 1 general.author 'Maykeye'
8 STRING 1 general.version 'v0.0'
9 STRING 1 general.description 'This gguf is ported from a first version of Maykeye attempt '
10 STRING 1 general.quantized_by 'Mofosyne'
11 STRING 1 general.size_label '4.6M'
12 STRING 1 general.license 'apache-2.0'
13 STRING 1 general.url 'https://huggingface.co/mofosyne/TinyLLama-v0-llamafile'
14 STRING 1 general.source.url 'https://huggingface.co/Maykeye/TinyLLama-v0'
15 [STRING] 5 general.tags [ 'text generation', 'transformer', 'llama', 'tiny', 'tiny model' ]
16 [STRING] 1 general.languages [ 'en' ]
17 [STRING] 2 general.datasets [ 'https://huggin...GPT4-train.txt', 'https://huggin...GPT4-valid.txt' ]
18 UINT32 1 llama.block_count 8
19 UINT32 1 llama.context_length 2048
20 UINT32 1 llama.embedding_length 64
21 UINT32 1 llama.feed_forward_length 256
22 UINT32 1 llama.attention.head_count 16
23 FLOAT32 1 llama.attention.layer_norm_rms_epsilon 1e-06
24 UINT32 1 general.file_type 1
25 UINT32 1 llama.vocab_size 32000
26 UINT32 1 llama.rope.dimension_count 4
27 STRING 1 tokenizer.ggml.model 'llama'
28 STRING 1 tokenizer.ggml.pre 'default'
29 [STRING] 32000 tokenizer.ggml.tokens [ '', '', '', '<0x00>', '<0x01>', ... ]
30 [FLOAT32] 32000 tokenizer.ggml.scores [ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ]
31 [INT32] 32000 tokenizer.ggml.token_type [ 2, 3, 3, 6, 6, 6, 6, ... ]
32 UINT32 1 tokenizer.ggml.bos_token_id 1
33 UINT32 1 tokenizer.ggml.eos_token_id 2
34 UINT32 1 tokenizer.ggml.unknown_token_id 0
35 UINT32 1 tokenizer.ggml.padding_token_id 0
36 UINT32 1 general.quantization_version 2

From main.cpp:

llama_model_loader: loaded meta data with 33 key-value pairs and 75 tensors from Tinyllama-4.6M-v0.0-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = TinyLLama
llama_model_loader: - kv   3:                             general.author str              = Maykeye
llama_model_loader: - kv   4:                            general.version str              = v0.0
llama_model_loader: - kv   5:                        general.description str              = This gguf is ported from a first vers...
llama_model_loader: - kv   6:                       general.quantized_by str              = Mofosyne
llama_model_loader: - kv   7:                         general.size_label str              = 4.6M
llama_model_loader: - kv   8:                            general.license str              = apache-2.0
llama_model_loader: - kv   9:                                general.url str              = https://huggingface.co/mofosyne/TinyL...
llama_model_loader: - kv  10:                         general.source.url str              = https://huggingface.co/Maykeye/TinyLL...
llama_model_loader: - kv  11:                               general.tags arr[str,5]       = ["text generation", "transformer", "l...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                           general.datasets arr[str,2]       = ["https://huggingface.co/datasets/ron...
llama_model_loader: - kv  14:                          llama.block_count u32              = 8
llama_model_loader: - kv  15:                       llama.context_length u32              = 2048
llama_model_loader: - kv  16:                     llama.embedding_length u32              = 64
llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 256
llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv  19:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                          general.file_type u32              = 1
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 4
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  26:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  32:               general.quantization_version u32              = 2

@mofosyne mofosyne added the Review Complexity : Low label on Jul 19, 2024
@github-actions github-actions bot added the python label on Jul 19, 2024
@mofosyne mofosyne force-pushed the gguf-dump-fix-markdown-kv-array-print branch from 834de1a to d99a34b on July 19, 2024 13:36
@mofosyne mofosyne added the bugfix label on Jul 19, 2024
@mofosyne mofosyne requested a review from compilade July 20, 2024 02:30
@mofosyne
Collaborator Author

@compilade thanks. This is how it will look now:

POS TYPE Count Key Value
1 UINT32 1 GGUF.version 3
2 UINT64 1 GGUF.tensor_count 75
3 UINT64 1 GGUF.kv_count 33
4 STRING 1 general.architecture "llama"
5 STRING 1 general.type "model"
6 STRING 1 general.name "TinyLLama"
7 STRING 1 general.author "Maykeye"
8 STRING 1 general.version "v0.0"
9 STRING 1 general.description "This gguf is ported from a first version of Maykeye attempt "
10 STRING 1 general.quantized_by "Mofosyne"
11 STRING 1 general.size_label "4.6M"
12 STRING 1 general.license "apache-2.0"
13 STRING 1 general.url "https://huggingface.co/mofosyne/TinyLLama-v0-llamafile"
14 STRING 1 general.source.url "https://huggingface.co/Maykeye/TinyLLama-v0"
15 [STRING] 5 general.tags [ "text generation", "transformer", "llama", "tiny", "tiny model" ]
16 [STRING] 1 general.languages [ "en" ]
17 [STRING] 2 general.datasets [ "https://hugging...-GPT4-train.txt", "https://hugging...-GPT4-valid.txt" ]
18 UINT32 1 llama.block_count 8
19 UINT32 1 llama.context_length 2048
20 UINT32 1 llama.embedding_length 64
21 UINT32 1 llama.feed_forward_length 256
22 UINT32 1 llama.attention.head_count 16
23 FLOAT32 1 llama.attention.layer_norm_rms_epsilon 1e-06
24 UINT32 1 general.file_type 1
25 UINT32 1 llama.vocab_size 32000
26 UINT32 1 llama.rope.dimension_count 4
27 STRING 1 tokenizer.ggml.model "llama"
28 STRING 1 tokenizer.ggml.pre "default"
29 [STRING] 32000 tokenizer.ggml.tokens [ "<unk>", "<s>", "</s>", "<0x00>", "<0x01>", ... ]
30 [FLOAT32] 32000 tokenizer.ggml.scores [ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ]
31 [INT32] 32000 tokenizer.ggml.token_type [ 2, 3, 3, 6, 6, 6, 6, ... ]
32 UINT32 1 tokenizer.ggml.bos_token_id 1
33 UINT32 1 tokenizer.ggml.eos_token_id 2
34 UINT32 1 tokenizer.ggml.unknown_token_id 0
35 UINT32 1 tokenizer.ggml.padding_token_id 0
36 UINT32 1 general.quantization_version 2

@@ -249,21 +249,29 @@ def dump_markdown_metadata(reader: GGUFReader, args: argparse.Namespace) -> None
 if len(field.types) == 1:
     curr_type = field.types[0]
     if curr_type == GGUFValueType.STRING:
-        value = repr(str(bytes(field.parts[-1]), encoding='utf-8')[:60])
+        value = "\"`{strval}`\"".format(strval=str(bytes(field.parts[-1]), encoding='utf-8')[:60])
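
A hedged illustration of what the new formatting line produces for a sample value (the sample string is assumed, purely for demonstration):

```python
# Sketch only: show how the quoted inline-code wrapping renders a sample string.
strval = "llama"  # assumed sample value
value = "\"`{strval}`\"".format(strval=strval)
print(value)  # prints "`llama`" -> in Markdown, literal quotes around an inline code span
```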

Collaborator

The quotes render a bit weird, and what if the string contains `? I suggest removing the quotes or including them inside the inline code blocks. And... hmm, I'm not sure how to escape ` except by adding more surrounding ` than the longest inner run, and separating the delimiters with spaces if the string happens to start or end with `.

I don't know if there's a limit, let's see: ```````````````````` (20 inner, 21 outer `) seems to work, so there might be no limit.
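
A minimal Python sketch of that escaping approach (the helper name matches the one added later in this PR, but the exact merged implementation may differ):

```python
import re

def escape_markdown_inline_code(value_string: str) -> str:
    # Use one more backtick than the longest run of backticks inside the string,
    # so every inner backtick stays literal inside the inline code span.
    longest_run = max((len(run) for run in re.findall(r'`+', value_string)), default=0)
    delimiter = '`' * (longest_run + 1)
    # If the string starts or ends with a backtick, pad with spaces so the
    # delimiter does not fuse with the content.
    if value_string.startswith('`') or value_string.endswith('`'):
        return f'{delimiter} {value_string} {delimiter}'
    return f'{delimiter}{value_string}{delimiter}'

# Behaviour matching the doctest shown further down in this thread:
assert escape_markdown_inline_code("hello world") == '`hello world`'
assert escape_markdown_inline_code("hello ` world") == '``hello ` world``'
```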

@mofosyne mofosyne Jul 20, 2024
Collaborator Author

Added the inline code blocks because <unk> was rendering weirdly... I'm inclined to just remove the " characters.

            else:
                array_elements.append(value_string)
+           value_array_inner = ["\"`{strval}`\"".format(strval=strval) for strval in array_elements]
+           value = f'[ {", ".join(value_array_inner).strip()}{", ..." if total_elements > len(array_elements) else ""} ]'
Collaborator

This is good, but conditionally appending "..." to value_array_inner might be better than inserting the string ", ..." after.
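
A hedged sketch of that suggestion, reusing the names from the diff above with assumed sample data (not the merged code):

```python
# Sketch only: append the ellipsis as a regular list element before joining,
# instead of splicing ", ..." into the already-joined string afterwards.
array_elements = ["text generation", "transformer", "llama"]  # truncated preview (assumed)
total_elements = 5                                            # full array length (assumed)

value_array_inner = ['"`{strval}`"'.format(strval=strval) for strval in array_elements]
if total_elements > len(array_elements):
    value_array_inner.append("...")
value = f'[ {", ".join(value_array_inner)} ]'
print(value)  # [ "`text generation`", "`transformer`", "`llama`", ... ]
```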

@compilade compilade left a comment (Collaborator)

The changes look reasonable, but you might want to fix escaping and/or change the truncation of inner strings in lists of strings.

@mofosyne
Collaborator Author

The changes look reasonable, but you might want to fix escaping and/or change the truncation of inner strings in lists of strings.

POS TYPE Count Key Value
1 UINT32 1 GGUF.version 3
2 UINT64 1 GGUF.tensor_count 75
3 UINT64 1 GGUF.kv_count 33
4 STRING 1 general.architecture llama
5 STRING 1 general.type model
6 STRING 1 general.name TinyLLama
7 STRING 1 general.author Maykeye
8 STRING 1 general.version v0.0
9 STRING 1 general.description This gguf is ported from a fir...M but using Llama architecture
10 STRING 1 general.quantized_by Mofosyne
11 STRING 1 general.size_label 4.6M
12 STRING 1 general.license apache-2.0
13 STRING 1 general.url https://huggingface.co/mofosyne/TinyLLama-v0-llamafile
14 STRING 1 general.source.url https://huggingface.co/Maykeye/TinyLLama-v0
15 [STRING] 5 general.tags [ text generation, transformer, llama, tiny, tiny model ]
16 [STRING] 1 general.languages [ en ]
17 [STRING] 2 general.datasets [ https://hugging...-GPT4-train.txt, https://hugging...-GPT4-valid.txt ]
18 UINT32 1 llama.block_count 8
19 UINT32 1 llama.context_length 2048
20 UINT32 1 llama.embedding_length 64
21 UINT32 1 llama.feed_forward_length 256
22 UINT32 1 llama.attention.head_count 16
23 FLOAT32 1 llama.attention.layer_norm_rms_epsilon 1e-06
24 UINT32 1 general.file_type 1
25 UINT32 1 llama.vocab_size 32000
26 UINT32 1 llama.rope.dimension_count 4
27 STRING 1 tokenizer.ggml.model llama
28 STRING 1 tokenizer.ggml.pre default
29 [STRING] 32000 tokenizer.ggml.tokens [ <unk>, <s>, </s>, <0x00>, <0x01>, ... ]
30 [FLOAT32] 32000 tokenizer.ggml.scores [ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ]
31 [INT32] 32000 tokenizer.ggml.token_type [ 2, 3, 3, 6, 6, 6, 6, ... ]
32 UINT32 1 tokenizer.ggml.bos_token_id 1
33 UINT32 1 tokenizer.ggml.eos_token_id 2
34 UINT32 1 tokenizer.ggml.unknown_token_id 0
35 UINT32 1 tokenizer.ggml.padding_token_id 0
36 UINT32 1 general.quantization_version 2

How about this?

@mofosyne
Collaborator Author

FYI, I'm pretty happy with this now. If you are happy with the adjustments, you can press merge whenever.

>>> escape_markdown_inline_code("hello world")
'`hello world`'
>>> escape_markdown_inline_code("hello ` world")
'``hello ` world``'
@mofosyne mofosyne added the merge ready label on Jul 20, 2024
@mofosyne mofosyne merged commit c3776ca into ggerganov:master Jul 20, 2024
8 checks passed
@mofosyne mofosyne deleted the gguf-dump-fix-markdown-kv-array-print branch July 20, 2024 07:35
@mofosyne
Collaborator Author

mofosyne commented Jul 20, 2024

On a side note, I added the dump to https://huggingface.co/mofosyne/TinyLLama-v0-5M-F16-llamafile/blob/main/TinyLLama-4.6M-v0.0-F16.dump.md so you can see how it appears on Hugging Face as well.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* gguf_dump.py: fix markddown kv array print

* Update gguf-py/scripts/gguf_dump.py

Co-authored-by: compilade <[email protected]>

* gguf_dump.py: refactor kv array string handling

* gguf_dump.py: escape backticks inside of strings

* gguf_dump.py: inline code markdown escape handler added

>>> escape_markdown_inline_code("hello world")
'`hello world`'
>>> escape_markdown_inline_code("hello ` world")
'``hello ` world``'

* gguf_dump.py: handle edge case about backticks on start or end of a string

---------

Co-authored-by: compilade <[email protected]>