How do I run NER on other languages like Chinese? #31

SuMarsss · 2019-07-09T06:47:41Z

I have pretrained XLNet on a large Chinese corpus, but how do I run the ner.py, and what is label.vocab?
Here are the parameters I used to train the Sentence Piece model:

spm_train \
	--input=data/wiki_all.txt \
	--model_prefix=sp10m.cased.v3 \
	--vocab_size=32000 \
	--character_coverage=0.9995 \
	--model_type=char \
	--control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
	--user_defined_symbols='<eop>,。' \
	--shuffle_input_sentence \
	--input_sentence_size=10000000

This is my pretraining result:

I0708 01:51:08.929747 140337454118720 train_gpu.py:300] [99500] | gnorm 5.37 lr 0.000000 | loss 2.08 | pplx    8.01, bpc  3.0017
I0708 01:52:52.577970 140337454118720 train_gpu.py:300] [99600] | gnorm 4.98 lr 0.000000 | loss 2.03 | pplx    7.60, bpc  2.9265
I0708 01:54:36.169189 140337454118720 train_gpu.py:300] [99700] | gnorm 5.21 lr 0.000000 | loss 2.04 | pplx    7.73, bpc  2.9500
I0708 01:56:19.727979 140337454118720 train_gpu.py:300] [99800] | gnorm 5.06 lr 0.000000 | loss 2.05 | pplx    7.79, bpc  2.9625
I0708 01:58:03.187680 140337454118720 train_gpu.py:300] [99900] | gnorm 5.06 lr 0.000000 | loss 2.01 | pplx    7.47, bpc  2.9009
I0708 01:59:46.560450 140337454118720 train_gpu.py:300] [100000] | gnorm 5.51 lr 0.000000 | loss 2.00 | pplx    7.38, bpc  2.8840

So should label.vocab look like this?

<cls>
<sep>
<pad>
<mask>
<eod>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
O

stevezheng23 · 2019-07-09T16:34:55Z

@SuMarsss great to see you have trained a Chinese XLNet model and built your own Sentence Piece model.

To prepare your label.vocab (which is different from your Sentence Piece control_symbols), you can use the following:

<pad>
O
X
<cls>
<sep>
B-AnatomyPart
I-AnatomyPart
B-Diagnosis
I-Diagnosis
B-Drug
I-Drug
B-Lab
I-Lab
B-Procedure
I-Procedure
B-Radiology
I-Radiology
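
A minimal sketch of how a label list in this layout could be generated, in case the entity types change. The helper names `build_label_vocab` and `write_label_vocab` are hypothetical (not part of run_ner.py), and the one-label-per-line file format is assumed from the listing above.

```python
# Hypothetical helpers; not part of run_ner.py.
SPECIAL_LABELS = ["<pad>", "O", "X", "<cls>", "<sep>"]
ENTITY_TYPES = ["AnatomyPart", "Diagnosis", "Drug", "Lab", "Procedure", "Radiology"]


def build_label_vocab(entity_types):
    """Return the special labels followed by a B-/I- pair for each entity type."""
    labels = list(SPECIAL_LABELS)
    for entity in entity_types:
        labels.append("B-" + entity)
        labels.append("I-" + entity)
    return labels


def write_label_vocab(path, labels):
    """Write one label per line (assumed format of label.vocab)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(labels) + "\n")


LABELS = build_label_vocab(ENTITY_TYPES)
```

Calling `write_label_vocab("label.vocab", LABELS)` would then write the 17 labels in the order shown above.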

stevezheng23 · 2019-07-09T16:39:43Z

You should also make sure the special_vocab_list in run_ner.py aligns with your Sentence Piece control_symbols:
self.special_vocab_list = ["<unk>", "<s>", "</s>", "<cls>", "<sep>", "<pad>", "<mask>", "<eod>", "<eop>"]
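
One way to check this alignment is to compare the list against the `.vocab` text file that spm_train writes next to the model. The following is a rough sketch, assuming each line of that file holds a piece followed by its score; the function names and the inline sample are illustrative only.

```python
# Hypothetical check; assumes "<piece><whitespace><score>" on each vocab line.
SPECIAL_VOCAB_LIST = ["<unk>", "<s>", "</s>", "<cls>", "<sep>",
                      "<pad>", "<mask>", "<eod>", "<eop>"]

# Illustrative sample in the assumed .vocab layout.
SAMPLE_VOCAB_TEXT = "\n".join([
    "<unk>\t0", "<s>\t0", "</s>\t0", "<cls>\t0", "<sep>\t0",
    "<pad>\t0", "<mask>\t0", "<eod>\t0", "<eop>\t0", "\u3002\t0",
])


def read_vocab_pieces(vocab_text):
    """Return the piece (first column) from every non-empty vocab line."""
    pieces = []
    for line in vocab_text.splitlines():
        if line.strip():
            # Split off the trailing score; the piece is everything before it.
            pieces.append(line.rsplit(None, 1)[0])
    return pieces


def missing_special_symbols(pieces, special_vocab_list):
    """Return the special symbols that do not appear in the vocab."""
    present = set(pieces)
    return [s for s in special_vocab_list if s not in present]


MISSING = missing_special_symbols(read_vocab_pieces(SAMPLE_VOCAB_TEXT),
                                  SPECIAL_VOCAB_LIST)
```

An empty `MISSING` list means every symbol in special_vocab_list exists in the Sentence Piece vocab; it does not verify that the IDs are in the order run_ner.py expects.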

SuMarsss · 2019-07-10T07:18:41Z

When I tried the label.vocab as you said, another error occurred.

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at xlnet/model_utils.py:147) ]]
[[node replica_1/loss/truediv (defined at run_ner.py:608) ]]

xlnet/model_utils.py:147:
clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)

run_ner.py:608:
loss = tf.reduce_sum(cross_entropy * label_mask) / tf.reduce_sum(tf.reduce_max(label_mask, axis=-1))
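
One thing that may be worth ruling out (a guess, not something confirmed in this thread): the denominator of that loss counts the sequences that have at least one unmasked label, so a batch in which every label is masked gives 0/0. The pure-Python sketch below mirrors the expression on nested lists; `masked_loss` and its `eps` parameter are hypothetical names, not part of run_ner.py.

```python
import math


def masked_loss(cross_entropy, label_mask, eps=0.0):
    """Mirror of sum(ce * mask) / sum(max(mask, axis=-1)) on nested lists.

    cross_entropy and label_mask are [batch][seq_len] lists of floats.
    eps is a hypothetical guard added to the denominator.
    """
    numerator = sum(
        ce * m
        for ce_row, mask_row in zip(cross_entropy, label_mask)
        for ce, m in zip(ce_row, mask_row)
    )
    # One count per sequence that has at least one unmasked position.
    denominator = sum(max(mask_row) for mask_row in label_mask) + eps
    if denominator == 0.0:
        # Floating-point 0/0 in TensorFlow yields NaN; Python would raise instead.
        return float("nan")
    return numerator / denominator


ce = [[1.0, 2.0], [3.0, 4.0]]
normal = masked_loss(ce, [[1.0, 1.0], [0.0, 0.0]])
all_masked = masked_loss(ce, [[0.0, 0.0], [0.0, 0.0]])
guarded = masked_loss(ce, [[0.0, 0.0], [0.0, 0.0]], eps=1e-6)
```

If an all-masked batch is possible in the data, a small epsilon in the denominator (or skipping such batches) avoids this particular NaN; other causes, such as a learning rate that is too high, are not covered by this sketch.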

stevezheng23 · 2019-07-10T16:15:45Z

This looks like a gradient explosion issue. Could you provide more details (e.g. the full vocab list, hyperparameters, Sentence Piece model, etc.) for debugging?

-- Best, Mingzhi

SuMarsss · 2019-07-11T02:38:00Z

I have fixed the bug, but I want to output the F1 score and precision.

stevezheng23 · 2019-07-11T03:07:17Z

@SuMarsss, you can run the following commands to get the precision/recall/F1 score:

python tool/convert_token.py \
--input_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.json \
--output_file=${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt

python tool/eval_token.py \
< ${OUTPUTDIR}/data/predict.${PREDICTTAG}.txt \
> ${OUTPUTDIR}/data/predict.${PREDICTTAG}.token
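
For readers who want to see what an entity-level precision/recall/F1 computation looks like, here is a small independent sketch over BIO tag sequences. It is not the implementation of tool/eval_token.py, and the function names are made up for illustration.

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from a BIO tag sequence.

    An I- tag that does not continue a span of the same type is ignored.
    """
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                entities.add((start, i, etype))
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
            else:
                start, etype = None, None
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Score predicted spans against gold spans using exact span matching."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["B-Drug", "I-Drug", "O", "B-Lab"]
pred = ["B-Drug", "I-Drug", "O", "O"]
p, r, f1 = precision_recall_f1(gold, pred)
```

Here the prediction finds one of the two gold entities and nothing spurious, so precision is 1.0 and recall is 0.5.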

SuMarsss · 2019-07-11T09:15:24Z

Sorry, I thought I had fixed the gradient exploding issue, but it occurred again.
2019-07-11 10:06:26.659641: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f65eb46c500 = {1, 0} Found Inf or NaN global norm.
I think there are some problems with my Sentence Piece model or Chinese tokenizer. Here is my tokenized result.

I think the result __ "缘" __ "于" is wrong: it splits __ and "缘" into separate pieces, and the correct result may be "_缘" "_于", because the English tokenized result is "_EU" "_reject".

Lastly, I don't know how to provide the full vocab list, which is too large a text file, or the Sentence Piece model, which is a binary file. I can only provide details like this.

Sample of the full vocab list:

<unk>   0
<s>     0
</s>    0
<cls>   0
<sep>   0
<pad>   0
<mask>  0
<eod>   0
<eop>   0
。      0
,       -3.29251
▁       -3.45567
的      -3.76215
1       -4.30766
0       -4.54219
年      -4.64991
2       -4.74569
、      -4.8037
一      -4.90536
在      -4.91364
为      -4.94451
是      -5.03084
中      -5.04317
9       -5.05516
国      -5.06382
)       -5.0947
(       -5.09492
人      -5.09874
于      -5.26198

stevezheng23 · 2019-07-11T17:31:05Z

@SuMarsss, yes, I think it should be _于 instead of _ and 于.

I have never trained a Chinese Sentence Piece model before; maybe you can refer to this post for more insight.
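
A quick way to spot this symptom in tokenizer output is to look for pieces that consist of the word-boundary marker alone. The sketch below assumes the `__` / `_` shown in this thread stands for the "▁" (U+2581) marker listed in the vocab sample; the function name is hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker Sentence Piece uses for whitespace


def standalone_marker_positions(pieces):
    """Return the indices of pieces that are just the word-boundary marker."""
    return [i for i, piece in enumerate(pieces) if piece == WORD_BOUNDARY]


chinese_positions = standalone_marker_positions(
    [WORD_BOUNDARY, "缘", WORD_BOUNDARY, "于"])
english_positions = standalone_marker_positions(
    [WORD_BOUNDARY + "EU", WORD_BOUNDARY + "reject"])
```

A non-empty result means the marker was emitted as its own token. With `--model_type=char` this may simply be expected behaviour, since each character tends to become its own piece, so training with a different model type could be worth trying.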

charlesXu86 · 2019-09-18T03:28:05Z

@SuMarsss how did you fix the gradient exploding ("Found Inf or NaN global norm") problem?

stevezheng23 · 2019-09-28T16:39:37Z

@charlesXu86 actually I couldn't reproduce this issue, no clue how to resolve it

youbingchenyoubing · 2019-11-02T02:06:26Z

Has the gradient exploding ("Found Inf or NaN global norm") issue been fixed yet? I ran into the same problem.

stevezheng23 · 2019-11-02T03:16:37Z

@youbingchenyoubing no fix has been applied yet, since I couldn't reproduce this issue. Could you provide more details about your problem?

youbingchenyoubing · 2019-11-02T10:58:00Z

File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1323, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1593, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "/home/chenyoubing/nlp/resume_entity/entity_model/build_model/xlnet_model.py", line 135, in model_fn
train_op, _, _ = model_utils.get_train_op(self.args, loss)
File "/home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py", line 147, in get_train_op
clipped, gnorm = tf.clip_by_global_norm(gradients, FLAGS.clip)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
return verify_tensor_all_finite_v2(t, msg, name)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
verify_input = array_ops.check_numerics(x, message=message)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/chenyoubing/virtualplace/xlnet/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node VerifyFinite/CheckNumerics (defined at /home/chenyoubing/nlp/resume_entity/entity_model/xlnet/model_utils.py:147) ]]

stevezheng23 · 2019-11-02T17:15:12Z

@youbingchenyoubing Sorry, based on the error message, I can't figure out how run_ner.py is used by your pipeline. By the way, which dataset does this experiment run on, English or Chinese?

youbingchenyoubing · 2019-11-03T12:22:49Z

I used a Chinese resume NER dataset in my experiment.

youbingchenyoubing · 2019-11-05T00:09:35Z

Can XLNet support a non-fixed context length?

stevezheng23 · 2019-11-06T21:37:29Z

@SuMarsss / @charlesXu86 / @youbingchenyoubing, sorry, I still can't reproduce this issue on the CoNLL2003 dataset, and I don't think I'll support Chinese NER in the near future.

youbingchenyoubing · 2019-11-08T01:18:15Z

Awesome, thanks.

stevezheng23 closed this as completed Jul 15, 2019

stevezheng23 reopened this Aug 29, 2019
