How do I run NER on other languages like Chinese? #31

Open
SuMarsss opened this issue Jul 9, 2019 · 18 comments

Comments

@SuMarsss

SuMarsss commented Jul 9, 2019

I have pretrained XLNet on a large Chinese corpus, but how do I run ner.py, and what is label.vocab?
Here are the parameters I used to train the SentencePiece model:

spm_train \
	--input=data/wiki_all.txt \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.9995 \
	--model_type=char \
	--control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
	--user_defined_symbols='<eop>,。' \
	--shuffle_input_sentence \
	--input_sentence_size=10000000

This is my pretraining result:

I0708 01:51:08.929747 140337454118720 train_gpu.py:300] [99500] | gnorm 5.37 lr 0.000000 | loss 2.08 | pplx    8.01, bpc  3.0017
I0708 01:52:52.577970 140337454118720 train_gpu.py:300] [99600] | gnorm 4.98 lr 0.000000 | loss 2.03 | pplx    7.60, bpc  2.9265
I0708 01:54:36.169189 140337454118720 train_gpu.py:300] [99700] | gnorm 5.21 lr 0.000000 | loss 2.04 | pplx    7.73, bpc  2.9500
I0708 01:56:19.727979 140337454118720 train_gpu.py:300] [99800] | gnorm 5.06 lr 0.000000 | loss 2.05 | pplx    7.79, bpc  2.9625
I0708 01:58:03.187680 140337454118720 train_gpu.py:300] [99900] | gnorm 5.06 lr 0.000000 | loss 2.01 | pplx    7.47, bpc  2.9009
I0708 01:59:46.560450 140337454118720 train_gpu.py:300] [100000] | gnorm 5.51 lr 0.000000 | loss 2.00 | pplx    7.38, bpc  2.8840

So should label.vocab look like this?

<cls>
<sep>
<pad>
<mask>
<eod>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
O
@stevezheng23
Owner

stevezheng23 commented Jul 9, 2019

@SuMarsss great to see that you have trained a Chinese XLNet model and built your own SentencePiece model.

To prepare your label.vocab (which is different from your SentencePiece control_symbols), you can use the following one:

<pad>
O
X
<cls>
<sep>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
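
A minimal sketch of how a label.vocab like the one above could be turned into an id mapping, assuming one label per line and the line index as the label id. The helper name `load_label_vocab` is hypothetical, and whether run_ner.py assigns ids exactly this way is an assumption.

```python
# Hypothetical helper; the real run_ner.py may read label.vocab differently.
LABEL_VOCAB_TEXT = "\n".join([
    "<pad>", "O", "X", "<cls>", "<sep>",
    "B-AnatomyPart", "I-AnatomyPart",
    "B-Diagnosis", "I-Diagnosis",
    "B-Drug", "I-Drug",
    "B-Lab", "I-Lab",
    "B-Procedure", "I-Procedure",
    "B-Radiology", "I-Radiology",
])


def load_label_vocab(text):
    """Map each non-empty line to its position in the file."""
    labels = [line.strip() for line in text.splitlines() if line.strip()]
    if len(set(labels)) != len(labels):
        raise ValueError("label.vocab contains a duplicate label")
    return {label: idx for idx, label in enumerate(labels)}


label_to_id = load_label_vocab(LABEL_VOCAB_TEXT)
id_to_label = {idx: label for label, idx in label_to_id.items()}

print(len(label_to_id))      # 17
print(label_to_id["<pad>"])  # 0
```

With this layout `<pad>` gets id 0, so zero-padded label sequences map to the padding label.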

@stevezheng23
Owner

You should also make sure that special_vocab_list in run_ner.py aligns with your SentencePiece control_symbols:
self.special_vocab_list = ["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"]
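
A rough sketch of one way to check that alignment, assuming the `.vocab` file written by spm_train has one piece per line followed by whitespace and a score. The helper names and the inline sample lines are illustrative only and are not part of run_ner.py.

```python
SPECIAL_VOCAB_LIST = ["<unk>", "<s>", "</s>", "<cls>", "<sep>",
                      "<pad>", "<mask>", "<eod>", "<eop>"]


def read_vocab_pieces(lines):
    """Return the piece column of '<piece><whitespace><score>' lines."""
    pieces = []
    for line in lines:
        if not line.strip():
            continue
        # Split from the right so only the trailing score is removed.
        piece, _score = line.rsplit(None, 1)
        pieces.append(piece)
    return pieces


def missing_special_tokens(vocab_pieces, special_tokens):
    """List the special tokens that do not appear in the vocabulary."""
    present = set(vocab_pieces)
    return [tok for tok in special_tokens if tok not in present]


# Illustrative sample; in practice, read the lines from the .vocab file.
sample_lines = ["<unk>\t0", "<s>\t0", "</s>\t0", "<cls>\t0", "<sep>\t0",
                "<pad>\t0", "<mask>\t0", "<eod>\t0", "<eop>\t0",
                "\u2581\t-3.45567"]
pieces = read_vocab_pieces(sample_lines)
print(missing_special_tokens(pieces, SPECIAL_VOCAB_LIST))  # []
```

An empty list means every entry of special_vocab_list exists as a piece in the vocabulary.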

@SuMarsss
Author

special_vocab_list

When I tried the label.vocab as you suggested, another error occurred.

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at xlnet/model_utils.py:147) ]]
[[node replica_1/loss/truediv (defined at run_ner.py:608) ]]

xlnet/model_utils.py:147:
clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)

run_ner.py:608:
loss = tf.reduce_sum(cross_entropy * label_mask) / tf.reduce_sum(tf.reduce_max(label_mask, axis=-1))
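
One way an expression of this shape can yield NaN is a zero denominator: if every row of label_mask in a batch is all zeros, the float division becomes 0/0. The thread does not establish that this is the root cause here. The sketch below mirrors the formula in plain Python with a guarded denominator; the function name, the `eps` value, and the assumption that both inputs are batch-by-sequence nested lists are illustrative.

```python
def masked_mean_loss(cross_entropy, label_mask, eps=1e-12):
    """Masked loss in the style of run_ner.py:608, with a guarded divisor."""
    # Sum of per-token losses at unmasked positions (the numerator).
    numerator = sum(
        ce * m
        for ce_row, mask_row in zip(cross_entropy, label_mask)
        for ce, m in zip(ce_row, mask_row)
    )
    # Number of sequences with at least one unmasked token (the denominator).
    denominator = sum(max(mask_row) for mask_row in label_mask)
    # Guard: never divide by anything smaller than eps.
    return numerator / max(denominator, eps)


print(masked_mean_loss([[1.0, 2.0], [3.0, 4.0]], [[1, 1], [0, 0]]))  # 3.0
print(masked_mean_loss([[1.0, 2.0]], [[0, 0]]))                      # 0.0
```

In TensorFlow the same guard could be expressed by wrapping the denominator in tf.maximum. Note that a guard like this does not help if the per-token cross entropy itself already contains NaN.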

@stevezheng23
Owner

stevezheng23 commented Jul 10, 2019 via email

@SuMarsss
Author

I have fixed the bug, but now I want to output the F1 score and precision.

@stevezheng23
Owner

@SuMarsss, you can run the following commands to get the precision/recall/F1 scores:

python tool/convert_token.py \
--input_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.json \
--output_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt

python tool/eval_token.py \
< ${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt \
> ${OUTPUTDIR}/data/predict.${PREDICTTAG}.token
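
For reference, a simplified sketch of how entity-level precision, recall, and F1 can be computed from BIO tag sequences. This is not the implementation of tool/eval_token.py; the function names are hypothetical, it scores a single sequence, and it ignores an I- tag that does not continue a matching span.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and ent_type == tag[2:]
        if continues_span:
            continue
        if ent_type is not None:
            entities.add((start, i, ent_type))
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        else:
            start, ent_type = None, None
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Exact-match entity scores for one gold/predicted sequence pair."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-Drug", "I-Drug", "O", "B-Lab", "O"]
pred = ["B-Drug", "I-Drug", "O", "O", "B-Lab"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Only the Drug span matches exactly in this example, so one of two predicted entities and one of two gold entities is correct.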

@SuMarsss
Author

Sorry, I thought I had fixed the gradient exploding issue, but it occurred again.
2019-07-11 10:06:26.659641: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f65eb46c500 = {1, 0} Found Inf or NaN global norm.
I think there is a problem with my SentencePiece model or Chinese tokenizer. Here is my tokenized output:
[image]
I think the result "_" "缘" "_" "于" is wrong: it splits "_" from "缘", whereas the correct result is probably "_缘" "_于", because the English tokenized result is "_EU" "_reject".
[image]

Finally, I don't know how to share the full vocab list, which is too large a text file, or the SentencePiece model, which is a binary file. I can only provide a sample like this.

Sample of the full vocab list:

<unk>   0
<s>     0
</s>    0
<cls>   0
<sep>   0
<pad>   0
<mask>  0
<eod>   0
<eop>   0
。      0
,       -3.29251
▁       -3.45567
的      -3.76215
1       -4.30766
0       -4.54219
年      -4.64991
2       -4.74569
、      -4.8037
一      -4.90536
在      -4.91364
为      -4.94451
是      -5.03084
中      -5.04317
9       -5.05516
国      -5.06382
)       -5.0947
(       -5.09492
人      -5.09874
于      -5.26198

@stevezheng23
Owner

stevezheng23 commented Jul 11, 2019

@SuMarsss, yes, I think it should be "_于" instead of "_" and "于".

I have never trained a Chinese SentencePiece model before; maybe you can refer to this post for more insight.
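
As an untested illustration only, a post-processing step could attach each standalone "▁" piece to the piece that follows it, giving the "▁缘" "▁于" shape discussed above. The function name is hypothetical, nothing in this thread confirms that this resolves the NaN error, and the merged pieces may not exist in the SentencePiece vocabulary, so this is suited to label alignment rather than id lookup.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the SentencePiece whitespace marker


def merge_lone_boundary(pieces):
    """Attach each standalone boundary marker to the next piece."""
    merged = []
    pending = False
    for piece in pieces:
        if piece == WORD_BOUNDARY:
            # Remember the marker; consecutive markers collapse into one.
            pending = True
            continue
        merged.append(WORD_BOUNDARY + piece if pending else piece)
        pending = False
    if pending:
        # Keep a trailing marker instead of silently dropping it.
        merged.append(WORD_BOUNDARY)
    return merged


print(merge_lone_boundary(["\u2581", "缘", "\u2581", "于"]))  # ['▁缘', '▁于']
```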

@charlesXu86

Sorry, I thought I had fixed the gradient exploding issue, but it occurred again.

How did you fix this problem?

@stevezheng23
Owner

stevezheng23 commented Sep 28, 2019

@charlesXu86 actually, I couldn't reproduce this issue, so I have no clue how to resolve it.

@youbingchenyoubing

Sorry, I thought I had fixed the gradient exploding issue, but it occurred again.

Have you already fixed this issue? I got this problem too.

@stevezheng23
Owner

stevezheng23 commented Nov 2, 2019

@youbingchenyoubing no fix has been applied yet, since I couldn't reproduce this issue. Could you provide more details about your problem?

@youbingchenyoubing


  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
    return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
    estimator_spec = self._model_fn(features=features, **kwargs)
  File "/home/chenyoubing/nlp/resume_entity/entity_model/build_model/xlnet_model.py", line 135, in model_fn
    train_op, _, _ = model_utils.get_train_op(self.args, loss)
  File "/home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py", line 147, in get_train_op
    clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at /home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py:147) ]]
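
One input-side check that may be worth running when this error appears with a custom dataset is whether every label in the data exists in label.vocab. TensorFlow documents that tf.nn.sparse_softmax_cross_entropy_with_logits returns NaN on GPU for label ids outside the class range, but whether that op is involved here, or whether unknown labels are the cause, is an assumption. The helper name and the sample labels below are hypothetical.

```python
def find_unknown_labels(label_sequences, label_to_id):
    """Return the sorted labels that are missing from the label vocabulary."""
    unknown = set()
    for labels in label_sequences:
        for label in labels:
            if label not in label_to_id:
                unknown.add(label)
    return sorted(unknown)


# Illustrative vocabulary and data; replace with the real label.vocab and corpus.
label_to_id = {"<pad>": 0, "O": 1, "X": 2, "<cls>": 3, "<sep>": 4,
               "B-Drug": 5, "I-Drug": 6}
label_sequences = [
    ["O", "B-Drug", "I-Drug"],
    ["B-Name", "I-Name", "O"],  # "Name" is absent from the vocabulary above
]
print(find_unknown_labels(label_sequences, label_to_id))  # ['B-Name', 'I-Name']
```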

@stevezheng23
Owner

@youbingchenyoubing Sorry, based on the error message, I can't figure out how run_ner.py is used by your pipeline. BTW, which dataset does this experiment run with? English or Chinese?

@youbingchenyoubing


A Chinese resume NER dataset is used in my experiment.

@youbingchenyoubing

Can XLNet support a non-fixed context?

@stevezheng23
Owner

@SuMarsss / @charlesXu86 / @youbingchenyoubing, sorry, I still can't reproduce this issue on the CoNLL2003 dataset, and I don't think I'll support Chinese NER in the near future.

@youbingchenyoubing


Awesome, thanks.
