Merge pull request #1012 from Cassini-chris/patch-2
adding vocab_size consistency
taku910 committed May 22, 2024
2 parents 33b01c8 + d7a25aa commit 58b5508
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions python/sentencepiece_python_module_example.ipynb
@@ -250,7 +250,7 @@
"- **user defined symbols**: Always treated as one token in any context. These symbols can appear in the input sentence. \n",
"- **control symbol**: We only reserve ids for these tokens. Even if these tokens appear in the input text, they are not handled as one token. User needs to insert ids explicitly after encoding.\n",
"\n",
"For experimental purpose, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
"For experimental purposes, user defined symbols are easier to use since user can change the behavior just by modifying the input text. However, we want to use control symbols in the production setting in order to avoid users from tweaking the behavior by feeding these special symbols in their input text."
]
},
{
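The hunk below exercises a model called `sp_user`; the training cells that create it are collapsed in this view. A minimal sketch of how the two symbol types are typically declared at training time — the `--user_defined_symbols` and `--control_symbols` trainer flags are real, while the `m_user`/`m_ctrl` prefixes and the `botchan.txt` input are assumptions carried over from the surrounding cells:

```python
import sentencepiece as spm

# User-defined symbols: always kept as single tokens, even when they occur in raw text.
spm.SentencePieceTrainer.train(
    '--input=botchan.txt --model_prefix=m_user --vocab_size=2000 '
    '--user_defined_symbols=<sep>,<cls>')

# Control symbols: ids are reserved, but the surface strings are not merged
# into single tokens when they appear in the input text.
spm.SentencePieceTrainer.train(
    '--input=botchan.txt --model_prefix=m_ctrl --vocab_size=2000 '
    '--control_symbols=<sep>,<cls>')

sp_user = spm.SentencePieceProcessor()
sp_user.load('m_user.model')
```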
@@ -273,7 +273,7 @@
"\n",
"# ids are reserved in both mode.\n",
"# <unk>=0, <s>=1, </s>=2, <sep>=3, <cls>=4\n",
"# user defined symbols allow these symbol to apper in the text.\n",
"# user defined symbols allow these symbols to appear in the text.\n",
"print(sp_user.encode_as_pieces('this is a test<sep> hello world<cls>'))\n",
"print(sp_user.piece_to_id('<sep>')) # 3\n",
"print(sp_user.piece_to_id('<cls>')) # 4\n",
@@ -605,7 +605,7 @@
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m --vocab_size=2000')\n",
"\n",
"# Can obtain different segmentations per request.\n",
"# There are two hyperparamenters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"# There are two hyperparameters for sampling (nbest_size and inverse temperature). see the paper [kudo18] for detail.\n",
"for n in range(10):\n",
" print(sp.sample_encode_as_pieces('hello world', -1, 0.1))\n",
"\n",
@@ -760,7 +760,7 @@
"Sentencepiece supports character and word segmentation with **--model_type=char** and **--model_type=character** flags.\n",
"\n",
"In `word` segmentation, sentencepiece just segments tokens with whitespaces, so the input text must be pre-tokenized.\n",
"We can apply different segmentation algorithm transparently without changing pre/post processors."
"We can apply different segmentation algorithms transparently without changing pre/post processors."
]
},
{
@@ -775,7 +775,7 @@
},
"cell_type": "code",
"source": [
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=400')\n",
"spm.SentencePieceTrainer.train('--input=botchan.txt --model_prefix=m_char --model_type=char --vocab_size=2000')\n",
"\n",
"sp_char = spm.SentencePieceProcessor()\n",
"sp_char.load('m_char.model')\n",
@@ -884,7 +884,7 @@
"cell_type": "markdown",
"source": [
"The normalization is performed with user-defined string-to-string mappings and leftmost longest matching.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalziation rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"We can also define the custom normalization rules as TSV file. The TSV files for pre-defined normalization rules can be found in the data directory ([sample](https://raw.githubusercontent.com/google/sentencepiece/master/data/nfkc.tsv)). The normalization rule is compiled into FST and embedded in the model file. We don't need to specify the normalization configuration in the segmentation phase.\n",
"\n",
"Here's the example of custom normalization. The TSV file is fed with **--normalization_rule_tsv=&lt;FILE&gt;** flag."
]
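The hunk below loads a model that already embeds such a rule. One way the TSV could be written and compiled in, sketched under the assumption that each line maps space-separated hex code points to hex code points (as in the linked nfkc.tsv sample); the helper and the rule itself are illustrative:

```python
import sentencepiece as spm

def to_codepoints(s):
    # Space-separated hex code points, the format used by the sample TSV.
    return ' '.join('%X' % ord(c) for c in s)

# One illustrative rule: rewrite "I'm" as "I am" during normalization.
with open('normalization_rule.tsv', 'w') as f:
    f.write(to_codepoints("I'm") + '\t' + to_codepoints('I am') + '\n')

spm.SentencePieceTrainer.train(
    '--input=botchan.txt --model_prefix=m --vocab_size=2000 '
    '--normalization_rule_tsv=normalization_rule.tsv')
```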
@@ -921,7 +921,7 @@
"sp = spm.SentencePieceProcessor()\n",
"# m.model embeds the normalization rule compiled into an FST.\n",
"sp.load('m.model')\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalzied to `I am busy'\n",
"print(sp.encode_as_pieces(\"I'm busy\")) # normalized to `I am busy'\n",
"print(sp.encode_as_pieces(\"I don't know it.\")) # normalized to 'I do not know it.'"
],
"execution_count": 0,
@@ -995,7 +995,7 @@
"source": [
"## Vocabulary restriction\n",
"\n",
"We can encode the text only using the tokens spececified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
"We can encode the text only using the tokens specified with **set_vocabulary** method. The background of this feature is described in [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt)."
]
},
{
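A hedged sketch of the restriction itself, assuming the `m.model` from earlier cells; the whitelist below is illustrative and would normally come from a frequency-filtered vocabulary, as the subword-nmt advice describes:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('m.model')

# Only pieces in this whitelist may be used; everything else must be composed
# from the allowed pieces (single characters here, so any text still encodes).
allowed = ['▁', 't', 'h', 'i', 's', 'a', 'e', '.']
sp.set_vocabulary(allowed)
print(sp.encode_as_pieces('this is a test.'))

# Drop the restriction again.
sp.reset_vocabulary()
print(sp.encode_as_pieces('this is a test.'))
```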