Training with Sejong Treebank corpus #4
Okay, for the full corpus it seems like c2d was failing to create a deptree file in build_tree() because of an unaligned sentence. Not sure what's going on there. When I manually copied some files around and ran with just the sample corpus, I finally got something going again, but I'm running into what seems to be the same error.
before executing
Thanks for your quick response! When I use SyntaxNet for English, the result has a root word. In the training data, the root of each training sentence is clearly specified. However, in the Korean training data for the Sejong Treebank, I don't see how the root is specified. How does that work? English SyntaxNet training data (the root is clearly indicated on node 3)
Where is the root indicated in the training data here, and what is the difference between the v2 and v3 deptree? I guess v3 goes into SyntaxNet for training, but I'm not sure about v2, and your sample output seems closer to v2 (10 nodes). deptree.txt.v2.training
deptree.txt.v3.training
sample output
How is 열렸다 chosen as the root when SyntaxNet trains on deptree v3? Is VV just automatically considered the root, or is there some other mechanism? Thanks for your hard work!
The pipeline is: (1) the original Sejong constituent tree is filtered by tree2con() in c2d.py; (2) the filtered constituent tree is converted to a dependency tree by tree2dep() in c2d.py; (3) then align.py converts the eojeol-based dependency tree to a morpheme-based dependency tree; (4) now we have training data (deptree.txt.v3.*) for SyntaxNet. I thought your question was 'where does ROOT come from?' The answer is: from the tree2dep() function. https://github.com/dsindex/blog/wiki/%5Bparsing%5D-visualizer-for-the-Sejong-Tree-Bank
In this case, the head of '틀이' is '있다', but check_vx_rule() determines that '있다' is an auxiliary verb.
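Where ROOT comes from in a constituency-to-dependency conversion can be illustrated with a toy head-percolation sketch. This is a simplification, not the actual tree2dep() logic in c2d.py: trees are (label, children) pairs, leaves carry a token index, and each constituent's head is taken from its last child (a common heuristic for head-final Korean). The token whose governor stays 0 is the ROOT.

```python
# Toy constituency-to-dependency conversion by head percolation.
# NOT the real tree2dep(): a hedged sketch of the general idea only.
def head_of(tree):
    """Return the token index of the lexical head of a (label, children) tree."""
    label, children = tree
    if isinstance(children, int):   # leaf: children is the token index
        return children
    return head_of(children[-1])    # head-final heuristic: last child heads

def tree_to_deps(tree, gov=0, deps=None):
    """Collect {token_index: governor_index}; governor 0 marks the ROOT."""
    if deps is None:
        deps = {}
    label, children = tree
    if isinstance(children, int):
        deps[children] = gov
        return deps
    h = head_of(tree)
    for child in children:
        ch = head_of(child)
        # the head child inherits the parent's governor; siblings attach to it
        tree_to_deps(child, gov if ch == h else h, deps)
    return deps
```

For a tree like `('S', [('NP_SBJ', 1), ('VP', 2)])`, token 2 ends up with governor 0, i.e. it becomes the ROOT, analogous to how 열렸다 (the final verb) becomes the root of its sentence.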
Thanks for your response! Your explanation makes sense. However, I don't understand how to use the model. test.sh requires tagger-params(?), which I don't have, and I would like to enter a sentence, like with the English SyntaxNet, and see a tree. I am wondering if this is possible with this treebank model. Do I need to use some other model first to get the part-of-speech tags, and then plug them into this model? (How did you test your treebank model?)
Unfortunately, it is not possible to tag sentences without a Korean morphological analyzer. If '먹은' has several possible segmentations in the corpus, such as '먹다+은' and '먹+은', it can't be trained directly from the Sejong tagged corpus using SyntaxNet. So I recommend using an available morphological analyzer: http://eunjeon.blogspot.sg/ (it is worth using). By the way, I assume you have your own copy of the Sejong corpus, and it may differ from mine.
@dsindex I got the Sejong Treebank corpus (구문분석말뭉치?) from the sejong.or.kr miscellaneous files section by combining 15 different files (BGAA0001.txt, BGAA0164.txt) into a single UTF-8 file, and then I wrote a script to try to remove invalid or weird things in the files. I think the number you are asking about is the accuracy: 0.886990.
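A minimal sketch of that combining step, assuming the downloaded files are UTF-16 encoded (a common encoding for Sejong distribution files; adjust SRC_ENCODING if yours differ). The glob pattern and output name are illustrative:

```python
# Combine several Sejong corpus files into a single UTF-8 file.
# Assumption: source files are UTF-16; change SRC_ENCODING otherwise.
import glob
import io

SRC_ENCODING = "utf-16"

def combine(pattern, out_path):
    """Concatenate all files matching `pattern` into one UTF-8 file."""
    with io.open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            with io.open(path, "r", encoding=SRC_ENCODING) as f:
                out.write(f.read())
```

Called as `combine("BGAA*.txt", "sejong_treebank.txt.v1")`, this would produce the single UTF-8 input file; the invalid-line cleanup mentioned above would still be a separate pass.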
I have used KoNLPy before. I am going to try to develop something to connect the morphological analyzer to this structured syntactic analyzer. If I have the input tagged with POS tags like you mentioned, how do I run the model or which command should I use for creating the tree? I was experimenting with the following command, but I'm confused about what the input needs to be if I have '가계부/NNG+의/JKG 틀/NNG+이/JKO 달라지다/VV+고/EC 있다/VX+다/EP'
From analyzing English SyntaxNet, I thought I should input something like this:
But then, the resulting beam-parsed-test-corpus has some weird output like:
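To make the input question concrete, here is a sketch that splits a morpheme-tagged sentence like '가계부/NNG+의/JKG 틀/NNG+이/JKO ...' into one row per morpheme. The column layout (id, form, tag, eojeol index) is illustrative only, not the exact CoNLL layout SyntaxNet expects:

```python
# Split a Komoran/Sejong-style tagged sentence into per-morpheme rows.
# Output columns (id, form, tag, eojeol index) are a hypothetical layout.
def tagged_to_conll(tagged):
    rows = []
    idx = 0
    for eoj_id, eojeol in enumerate(tagged.split()):
        for morph in eojeol.split('+'):
            idx += 1
            form, _, tag = morph.rpartition('/')  # split '틀/NNG' into form and tag
            rows.append((idx, form, tag, eoj_id))
    return rows
```

Each eojeol keeps its own index, which later makes it possible to tell which morphemes belonged to the same original word.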
I have made a modification for testing the Korean parser. If you make the input as the... And for convenience, using the Komoran tagger in konlpy, you can input a raw sentence.
Thanks! I eventually figured this out. And, I ended up using Komoran too. I have another question for you actually. After running on a custom corpus, I made a program to recombine parts of each eojeol, but as you can see, there is a problem with the word "위해" --> (위하아 VP). Komoran splits it into
I can't figure out why it wants to do this. 위해 may be 위하+ㅏ or 위하+ㅓ, but python-jamo doesn't want to make 해 anyway. All I want to do is change this into "위해" in a consistent way. I made a script to regenerate the tree based on parts. I think you regenerated the parts based on the corpus, but because this is a new sentence I wasn't able to do this. I had to recombine output from the POS tagger tree. $ echo "수영할 때 눈을 보호하기 위해 쓰는 물안경은 렌즈의 굴절력과 고무 밴드의 내구성이 제품 선택의 포인트다." | ./demo.sh
The intermediate step was this:
which gets converted to:
@xtknight There are also rules for combining a root and a functional word (eomi), but those are difficult to implement.
Yes, I tried to combine them, but it's not working great. At komoran-2.0-master/KOMORAN_2.0_beta/corpus_build/dic.irregular there is a list of irregular forms. It fixed "위하+아", but it doesn't handle inflection (conjugation) of verbs. I've dealt with conjugation before, so I can probably figure it out, but I'm curious why the original information about the complete word is destroyed. What about training the model on the 어절 instead of all the individual parts? I mean training the model on "선택의" (NP_MOD) instead of "선택/NNG + 의/JKG", for example. Is there a good reason to separate them? I am curious why they decided to separate them all in the corpus. I made a sample here that attempts recombination, but it sometimes fails because of the eoj problems or disagreement between the Komoran POS tagger and the Sejong corpus. My next project is to hack the Komoran tagger to return the eoj index.
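The recombination step can be sketched with a small table of irregular contractions. The table here is a toy stand-in with two entries; a real one would be built from a resource like Komoran's dic.irregular file:

```python
# Recombine split morphemes into a surface eojeol, contracting known
# irregular pairs. CONTRACTIONS is a toy table, not Komoran's real data.
CONTRACTIONS = {
    (u'위하', u'아'): u'위해',
    (u'하', u'아'): u'해',
}

def recombine(morphs):
    """Join a list of morphemes, contracting known irregular pairs."""
    out = []
    i = 0
    while i < len(morphs):
        pair = tuple(morphs[i:i + 2])
        if pair in CONTRACTIONS:
            out.append(CONTRACTIONS[pair])
            i += 2
        else:
            out.append(morphs[i])
            i += 1
    return u''.join(out)
```

A lookup table handles lexicalized contractions like 위하+아 → 위해, but, as noted above, regular vowel-harmony conjugation would still need separate jamo-level rules.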
@xtknight
Okay, that makes sense. There is another thing I am curious about. When eval.py runs, it does not consider the phrase-structure tag 'ptst' (like NP, NP_OBJ, or VP) as part of the matching criteria. It seems like only seq, analyzed, and gov are important. (sejong/eval.py)
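The matching criterion just described can be sketched as follows. Field names follow the discussion; the actual sejong/eval.py may differ in detail:

```python
# Sketch of the described criterion: nodes match if seq, analyzed and
# gov agree; the phrase-structure tag (ptst) is deliberately ignored.
def nodes_match(gold, pred):
    keys = ('seq', 'analyzed', 'gov')
    return all(gold[k] == pred[k] for k in keys)

def accuracy(gold_nodes, pred_nodes):
    """Fraction of aligned node pairs that match on seq/analyzed/gov."""
    hits = sum(nodes_match(g, p) for g, p in zip(gold_nodes, pred_nodes))
    return hits / float(len(gold_nodes)) if gold_nodes else 0.0
```

Under this criterion, a node tagged NP in the prediction but MOD in the gold still counts as correct as long as its governor is right, which is exactly the ptst question raised below.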
What is the reason for this? Is the ptst tag returned unimportant? Because when I run the model on the following sentence I'm having trouble with the ptst tag: 서울에 4일 밤부터 100㎜ 넘는 폭우가 쏟아져 곳곳에서 도로함몰, 교통사고 등 비 피해가 속출했다.
I get this output from the model, but 도로 is marked as NP, and it seems like it should be MOD. It enters the model as NNG,NNG,SP. For my tree generation I expected that everything in a tagged eojeol would be marked as MOD except for the last part (for example, MOD,MOD,NP_CNJ for 도로함몰). But I don't know why 도로 is being separated as NP. I wasn't aware the model would reclassify those parts.
The labeled attachment score (LAS) is less important than the unlabeled attachment score (UAS), for practical reasons. And, as you may know, Korean is flexible about spacing within compound nouns; for example, both '도로함몰' and '도로 함몰' are acceptable. That is why the trained model often misclassifies 'MOD' as 'NP'. But I think that is not important: if we get the eoj index from the POS tagger, we can correct the classification.
Yes, compound nouns seem to be the primary issue I run into. So, if that compound noun's first piece (도로) gets classified as NP, we can just change it to MOD according to the Komoran POS tagger Eojeol group and everything should be fine. I wonder if this misclassification can be prevented while training the model if we use the POS Eojeol data. But what happens if 도로's HEAD value is also classified wrong? Then the tree would be broken as the compound noun's pieces would be far away in the tree? Should I fix the HEAD and DEPREL values based on the POS tagger (modify HEAD=18, DEPREL=MOD)? I'm not sure how I should solve the problem when it happens even if I have the eoj index. And I am curious what happens if multiple parts of an Eojeol get misclassified or something that is not a compound noun... (Possible misclassification example)
I have updated my online example to show logs and also the PSG tree. Although I don't fully understand everything yet, I appreciate your explanations and hope there is some way for me to contribute to your project, especially to the Korean-language taggers and parsers.
@xtknight : I'm not sure :) but since SyntaxNet uses NN classifiers, not rules, we cannot drive accuracy to 100% for such a label ('MOD').
: As you mentioned, you should fix the inner-eoj relation ('MOD') based on the eoj index. Only the governor of the last morpheme in an eojeol is our concern, and that classification is solely the parser's (SyntaxNet's) responsibility.
: No problem :)
: It will be very... By the way, would you mind if I add your `psg_tree.htm' to README.md?
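The inner-eoj correction described above can be sketched as a post-processing pass: within each eojeol, every morpheme except the last is forced to depend on the next morpheme with label 'MOD', and only the last morpheme keeps the parser's decision. The node tuple format (id, head, deprel, eoj index) is illustrative:

```python
# Post-correct inner-eojeol dependencies using the eoj index.
# Nodes are hypothetical (id, head, deprel, eoj_index) tuples,
# assumed sorted by id with contiguous eojeols.
def fix_inner_eoj(nodes):
    fixed = []
    for i, (idx, head, deprel, eoj) in enumerate(nodes):
        is_last = (i + 1 == len(nodes)) or (nodes[i + 1][3] != eoj)
        if is_last:
            # last morpheme of the eojeol: trust the parser's output
            fixed.append((idx, head, deprel, eoj))
        else:
            # inner morpheme: always attach to the next morpheme as MOD
            fixed.append((idx, idx + 1, 'MOD', eoj))
    return fixed
```

This also answers the broken-tree worry above in part: even if the parser gives an inner morpheme like 도로 a wrong head, the pass overwrites both its head and its label, so only the final morpheme's attachment remains the parser's responsibility.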
I am still trying to fix the tree issues via the POS tagger. I have made a hack to Komoran to always split words with spaces so that I can match the eoj index to the original sentence, but I decided it was probably the wrong way to fix it. So now I am trying to properly return the eoj index. For parser_eval.... Sure, you can add a link to my psg_tree.htm (I will still be fixing some of the errors using the POS tagger eoj). Basically, it just uses Python to run the demo.sh script on the backend and returns the tree as JSON along with the error log.
I added some features, and the link to the PSG tree changed because my EC2 server crashed: http://sejongpsg.ddns.net/syntaxnet/psg_tree.htm I tried to investigate implementing what you were talking about with parser_eval, but unfortunately I have no idea where to begin.
@xtknight I can print out 'python_path' and 'program, args' in:

```python
try:
    sys.stdout.flush()
    sys.stderr.write('[python_path] : ' + python_path + '\n')
    sys.stderr.write('[execv()] : ' + program + ' ' + ' '.join(args) + '\n')
    os.execv(program, args)
except EnvironmentError as e:
    # This exception occurs when os.execv() fails for some reason.
    if not getattr(e, 'filename', None):
        e.filename = program  # Add info to error message
    raise
```

So I got the information needed to execute parser_eval.py.
And I thought I could modify parser_eval.py:

```python
def Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims):
  ...
  parser.AddSaver(FLAGS.slim_model)
  sess.run(parser.inits.values())
  parser.saver.restore(sess, FLAGS.model_path)
  # -------> model initialization ends here

  # read conll input from stdin
  # can we change the logic below to a while-loop style?
  sink_documents = tf.placeholder(tf.string)
  sink = gen_parser_ops.document_sink(sink_documents,
                                      task_context=FLAGS.task_context,
                                      corpus_name=FLAGS.output)
  t = time.time()
  num_epochs = None
  num_tokens = 0
  num_correct = 0
  num_documents = 0
  while True:
    tf_eval_epochs, tf_eval_metrics, tf_documents = sess.run([
        parser.evaluation['epochs'],
        parser.evaluation['eval_metrics'],
        parser.evaluation['documents'],
    ])
    if len(tf_documents):
      logging.info('Processed %d documents', len(tf_documents))
      num_documents += len(tf_documents)
      sess.run(sink, feed_dict={sink_documents: tf_documents})
    num_tokens += tf_eval_metrics[0]
    num_correct += tf_eval_metrics[1]
    if num_epochs is None:
      num_epochs = tf_eval_epochs
    elif num_epochs < tf_eval_epochs:
      break
  ...

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  with tf.Session() as sess:
    feature_sizes, domain_sizes, embedding_dims, num_actions = sess.run(
        gen_parser_ops.feature_size(task_context=FLAGS.task_context,
                                    arg_prefix=FLAGS.arg_prefix))
  with tf.Session() as sess:
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)

if __name__ == '__main__':
  tf.app.run()
```

But it is not tested yet ;;)
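The while-loop idea can be sketched independently of TensorFlow: read blank-line-separated CoNLL sentences from stdin and hand each block to a parse callback. `parse_sentence` here is a hypothetical hook standing in for the session.run() call:

```python
# Streaming server loop sketch: one CoNLL sentence per blank-line-
# separated block. `parse_sentence` is a hypothetical callback.
import sys

def read_conll_blocks(stream):
    """Yield lists of CoNLL lines, one list per sentence."""
    block = []
    for line in stream:
        line = line.rstrip('\n')
        if not line:
            if block:
                yield block
                block = []
        else:
            block.append(line)
    if block:          # last sentence may lack a trailing blank line
        yield block

def serve(stream, parse_sentence):
    """Feed every sentence from `stream` to the parse callback."""
    for block in read_conll_blocks(stream):
        parse_sentence(block)
```

With the model initialized once up front, `serve(sys.stdin, ...)` would avoid paying the startup cost for every sentence, which is the point of restructuring Eval() into a loop.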
@xtknight
Oh, thank you for the links. I'm going to try the PR #250 one first and see how easy it is to use. On a side note, do you have any idea how to set up CUDA with SyntaxNet, or do you have it working? Mine seems not to work at all with SyntaxNet, yet other TensorFlow examples work fine with CUDA. I don't know why that is. I guess using the GPU could make not only training faster but evaluation as well?
I checked GPU allocation for parser_eval.py:

```python
def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  with tf.Session() as sess:
    feature_sizes, domain_sizes, embedding_dims, num_actions = sess.run(
        gen_parser_ops.feature_size(task_context=FLAGS.task_context,
                                    arg_prefix=FLAGS.arg_prefix))
  # log where each op is placed, to check GPU usage
  config = tf.ConfigProto(log_device_placement=True)
  with tf.Session(config=config) as sess:
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
```

It seems that parser_eval.py does not use the GPU.
Well, despite my best efforts, I keep running into weird different errors and environment problems with bazel. And it recompiles and redownloads a million packages even if I just want to compile one file. It's sometimes a different error each time because the build order is different. It's a disaster. Otherwise, I think I'm going to try to rework the Tensorflow Serving code and make something myself. I got the Serving code for MNIST digits working fine but can't compile the parsey api... |
I agree with you. I didn't investigate the code deeply, but I expect that we can merge the code below :)
@dsindex I finally had some luck getting parsey_api to compile thanks to the help of the author. Now my next project is to 'export???' the Sejong model. Not sure what export is actually doing because I thought the exported model was already being used by SyntaxNet but maybe I don't understand the terminology. I'll investigate it! |
@xtknight Good job! You may refer to this: (I wrote it ;;) I think we need to modify
Now I see that dmansfield has already done the model exporting via... and now
I think syntaxnet/tensorflow_serving is such a mess ;; I am not sure whether I will hate products that use bazel or not :)
@xtknight @dmansfield (https://github.com/dmansfield) Thanks for your instructions and dmansfield's work. Detailed instructions:
But, as I mentioned before, we need to change the parsey_api protocol.
A great success!! Currently there is a way to get it to work without protocol changes, I think. But I don't understand the significance of the protocol. Isn't it just going to match CoNLL?
Client side index.js:
parsey_mcparseface.py (changing corpus name to stdin-conll is crucial! (and model path to latest-model)). And for task_context I just put 'context', not context.pbtxt. Also I removed brain-tagger which we don't have.
$ bazel-bin/tensorflow_serving/example/parsey_mcparseface --export_path=exported
great!! a long way to here :) |
But the problem is, you're right... I think the protocol does need changing. The fields don't match what I see from my test website. I can see the CoNLL output properly in parsey_api.cc, but I have no idea how it transforms into the JSON format... :\ Any clue what this does?
I made a Python client program.
See: https://github.com/dsindex/syntaxnet/blob/master/README_api.md And I couldn't understand exactly what you mean by the 'mismatch' between the output fields and your test website ;;)
@dsindex But actually, it seems like the HEAD in the node.js output is always 1 less than the HEAD shown by my normal demo.sh and website. Maybe that means it's working, but just as a 0-based index. "내 가 집 에 가 ㄴ다 ." Yes, protobuf_json seems like it will be useful! In the meantime, I have been trying to train the Korean POS tagger for SyntaxNet. I trained a little, but I noticed the main problem is figuring out the word components to input into the tagger (e.g., splitting "수영할" into "수영", "하", "ㄹ"). Apparently Komoran does this entirely statistically and takes the best match across the whole sentence by probability. I didn't know even the word components were split probabilistically. But this means it will probably be difficult to train SyntaxNet for the POS tagger. What do you think?
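If the API really does emit 0-based head indices, converting back to the 1-based CoNLL convention (where 0 is reserved for ROOT) is a one-liner. This sketch assumes the API marks the root's head as -1, which is unverified:

```python
# Convert 0-based head indices (assumed: -1 marks the root) to the
# 1-based CoNLL convention where head 0 means ROOT.
def to_conll_heads(api_heads):
    return [0 if h < 0 else h + 1 for h in api_heads]
```

Comparing the converted heads against demo.sh output for the same sentence would confirm whether the off-by-one is really just an indexing convention.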
I got it. I am also confused about where the conversion takes place ;; As I mentioned before, training a Korean POS tagger using SyntaxNet is very tricky, but there is an unusual way to do the tagging. In the training step,
In the evaluation step,
But notice that a tagger following the above steps is likely to be weak on unknown eojeols (e.g., '의상디자이너', '엠마누엘웅가로가') and on eojeols with grammatical errors like incorrect spacing.
@dsindex Seems like SyntaxNet hasn't considered agglutinative languages like Korean. Do you know if SyntaxNet can perform anything beyond dependency parsing, like semantic role labeling? I haven't been able to find much information about it. And not sure where to obtain resources for semantic role labeling for Korean (that might be beyond the scope of this Issue but..) Maybe this is a dumb question but I am curious if it is possible to convert between the Sejong corpus format and the output of the dependency parser. I was analyzing this sentence. And it seems like it is organized like this in the corpus ("엠마누엘 웅가로는"):
And then there is the example of "실내 장식품을 디자인할때"
But when I run the dependency tagger it seems like tags without leaves are missing, like I labeled. Did I do something wrong? Maybe I am just getting dizzy from looking at the tree. I notice that the leaf-less nodes NP_AJT,VP_MOD,NP_OBJ are always shown in the last leaf node. Is that always the case?? Also it seems like 웅가로는 is becoming a child of 엠마누엘, instead of being on the same level. Is this intentional or is it just variation from the model? Recombined
Individual
Full Sejong tree
P.S. Also, for VP_CMP, which contains VP and X_CMP, I don't see VP_CMP in the tree.
(an example of semantic information for '가다')
I see! Thank you for the detailed information! Well, these days I am off trying doc2vec and other technologies for Korean. I haven't really found an application for the dependency parsing itself. It seems like I need to get a licensed semantically labeled corpus through my university, but I'm not sure how that would be trained anyway.
Hello,
I was trying to train using the sejong_treebank.sample file, so I ran the following commands:
$ ./sejong/split.sh
$ ./sejong/c2d.sh
$ ./train_sejong.sh
But I had an error (the same as the one below: "Assign requires shapes of both tensors to match").
So then I tried downloading a larger treebank corpus from sejong.or.kr (it seems to be the full version of the sejong_treebank.sample in your repository, but then again I'm not sure...). But the same thing happened.
My input file (I tried both the sample and the full corpus) is just a long stream of the following in UTF-8, just like your sample Sejong file. Is there somewhere else I need to put this? Or is there something else I need to do other than saving this as sejong/sejong_treebank.txt.v1 and running the scripts?
Here are the logs with all the verbose options.