adding vocab_size consistency #1012

Merged · 1 commit · May 22, 2024
16 changes: 8 additions & 8 deletions python/sentencepiece_python_module_example.ipynb
@@ -250,7 +250,7 @@
"- **user defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence. \n",
"- **control symbol**: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. User needs to insert ids explicitly after encoding.\n",
"\n",
"For experimental purpose, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
"For experimental purposes, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
]
},
{
@@ -273,7 +273,7 @@
"\n",
"# ids are reserved in both mode.\n",
"# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4\n",
"# user defined symbols allow these symbol to apper in the text.\n",
"# user defined symbols allow these symbols to appear in the text.\n",
"print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))\n",
"print(sp_user.piece_to_id('<sep>')) # 3\n",
"print(sp_user.piece_to_id('<cls>')) # 4\n",
@@ -605,7 +605,7 @@
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')\n",
"\n",
"# Can obtain different segmentations per request.\n",
"# There are two hyperparamenters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"# There are two hyperparameters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"for n in range(10):\n",
" print(sp.sample_encode_as_pieces('hello world', -1, 0.1))\n",
"\n",
@@ -760,7 +760,7 @@
"Sentencepiece supports character and word segmentation with **--model_type=char** and **--model_type=character** flags.\n",
"\n",
"In `word` segmentation, sentencepiece just segments tokens with whitespaces, so the input text must be pre-tokenized.\n",
"We can apply different segmentation algorithm transparently without changing pre/post processors."
"We can apply different segmentation algorithms transparently without changing pre/post processors."
]
},
{
@@ -775,7 +775,7 @@
},
"cell_type": "code",
"source": [
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')\n",
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')\n",
"\n",
"sp_char = spm.SentencePieceProcessor()\n",
"sp_char.load('m_char.model')\n",
@@ -884,7 +884,7 @@
"cell_type": "markdown",
"source": [
"The normalization is performed with user-defined string-to-string mappings and leftmost longest matching.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalziation rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalization rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"\n",
"Here's the example of custom normalization. The TSV file is fed with **--normalization_rule_tsv=&lt;FILE&gt;** flag."
]
@@ -921,7 +921,7 @@
"sp = spm.SentencePieceProcessor()\n",
"# m.model embeds the normalization rule compiled into an FST.\n",
"sp.load('m.model')\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalzied to `I am busy'\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalized to `I am busy'\n",
"print(sp.encode_as_pieces(\"I don't know it.\")) # normalized to 'I do not know it.'"
],
"execution_count": 0,
@@ -995,7 +995,7 @@
"source": [
"## Vocabulary restriction\n",
"\n",
"We can encode the text only using the tokens spececified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
"We can encode the text only using the tokens specified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
]
},
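Not part of this diff: a minimal sketch of the `set_vocabulary` call described in the cell above. The piece list is purely illustrative, and `reset_vocabulary` is assumed to be the matching call for clearing the restriction.

```python
sp = spm.SentencePieceProcessor()
sp.load('m.model')

print(sp.encode_as_pieces('this is a test'))

# Restrict encoding to a hypothetical whitelist of pieces; the text is then
# segmented using only pieces from this list (plus required special pieces).
sp.set_vocabulary(['▁this', '▁is', '▁a', '▁t', 'est'])
print(sp.encode_as_pieces('this is a test'))

# Assumed counterpart that lifts the restriction again.
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test'))
```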
{