sample data, pre-trained word embedding #3

Closed
hpduong opened this issue Jul 20, 2017 · 36 comments

@hpduong commented Jul 20, 2017

I'm getting this issue when I run training on the a08 entity network and a06 seq2seq models.

Can I get or train this file?

zhihu-word2vec-title-desc.bin-100

[screenshot of the error]

Also, do you have sample datasets compatible with these models?

@brightmart (Owner) commented Jul 25, 2017

  1. For zhihu-word2vec-title-desc.bin-100, please find it at:
     https://pan.baidu.com/s/1jIP9e6q

  2. [OLD, TO BE DELETED]
     For sample data (multi-label; file name: test-zhihu6-title-desc.txt), you can find it at:
     https://pan.baidu.com/s/1gf49auB

  3. [OLD, TO BE DELETED]
     train-zhihu4-only-title-all.txt (single-label):
     https://pan.baidu.com/s/1jI7R4X4

  4. [NEW, try using this, updated 2018-08-12]
     zhihu-title-desc-multiple-label-v6.txt.zip:
     https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA
     It contains three files:
     1. train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
     2. test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
     3. train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

@ghost commented Jul 25, 2017

Could you please upload it in a different service? I am having a hard time downloading the data and word2vec model. Thanks!

@brightmart (Owner)

You can follow two steps to get the file.
Step 1: click the download link.
[screenshot of step 1]

Step 2: download the file.
[screenshot of step 2]

brightmart changed the title from "ImportError: No module named data_util_zhihu" to "sample data, pre-trained word embedding" on Aug 3, 2017
Repository owner deleted a comment from yinjianhong Aug 7, 2017
@Cauchyzhou

When I run the TextRNN model, the terminal reports:
IOError: [Errno 2] No such file or directory: '../zhihu-word2vec.bin-100'
There is no link for zhihu-word2vec.bin-100.

@brightmart (Owner) commented Sep 5, 2017

You may use zhihu-word2vec-title-desc.bin-100 or a file of your own.
A .bin file is just a word embedding file trained with word2vec.
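
If the Baidu link is unreachable, here is a minimal sketch of producing a compatible embedding yourself with gensim. The corpus path is a placeholder, and the 100-dimension setting is only an assumption inferred from the "-100" suffix; this is not the exact script used to build the original file.

```python
# Hedged sketch: train a replacement word2vec embedding with gensim and save it
# in the classic word2vec binary format that a "*.bin-100" file name suggests.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Assumption: "corpus.txt" is your own whitespace-tokenized text, one document per line.
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=100,   # assumed dimensionality; use `size=100` on gensim < 4.0
    window=5,
    min_count=1,
    workers=4,
)

# Save in binary word2vec format under the file name the training scripts expect.
model.wv.save_word2vec_format("zhihu-word2vec-title-desc.bin-100", binary=True)
```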

@jacky20172017 commented Nov 10, 2017

Where is the file test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'

@jacky20172017

And also: where is train-zhihu6-title-desc.txt?

@jacky20172017

There are two data_util_zhihu.py files, in the folders aa1_data_util and a07_Transformer.
Which one should be imported?

@jacky20172017

What is the data_type argument?

train, test, _ = load_data(vocabulary_word2index, vocabulary_word2index_label,data_type='train')
TypeError: load_data() got an unexpected keyword argument 'data_type'
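
Not an answer from the maintainer, but a quick, hedged way to check which data_util_zhihu.py was actually imported and which arguments its load_data accepts (a generic diagnostic, not code from this repository):

```python
# Hedged diagnostic sketch: confirm which module copy was imported and what
# parameters its load_data() actually takes before calling it.
import inspect
import data_util_zhihu

print(data_util_zhihu.__file__)                       # which copy was imported
print(inspect.signature(data_util_zhihu.load_data))   # its accepted parameters

# If 'data_type' is not in the printed signature, call load_data without that
# keyword, matching whatever parameters the signature shows.
```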

@pmahend1 commented Nov 30, 2017

I can't see the files at the links mentioned. They give the error below (in Chinese):

Sorry, the page you visited does not exist.

Possible reasons:

  1. The address entered in the address bar is wrong.

  2. The link you clicked has expired.

#########################

for zhihu-word2vec-title-desc.bin-100, please find it in:
https://pan.baidu.com/s/1kVgdDD9

for sample data (multi-label. file name:test-zhihu6-title-desc.txt), you can find it in:
https://pan.baidu.com/s/1gf49auB

train-zhihu4-only-title-all.txt (single-label):
https://pan.baidu.com/s/1jI7R4X4

#########################################

Could you please upload them to https://github.com/brightmart/text_classification in a sample data folder?

@pmahend1 commented Nov 30, 2017

Never mind. After a few tries the above links worked.

But the bin-100 file is not downloading for some reason.

@deatherving

@pmahend1 Same. The bin100 is not downloading even in China.

@brightmart (Owner)

@pmahend1 @deatherving
for zhihu-word2vec-title-desc.bin-100, please use this:
https://pan.baidu.com/s/1jIP9e6q

@deatherving

@brightmart Thanks. The file is accessible.

@searchlink commented Dec 6, 2017

@brightmart
When I run p5_fastTextB_predict.py, it fails with the error below:
FileNotFoundError: [Errno 2] No such file or directory: 'test-zhihu-forpredict-v4only-title.txt'
In addition, where is zhihu-word2vec-multilabel.bin-100?

@brightmart (Owner) commented Dec 6, 2017 via email

@pmahend1 commented Dec 7, 2017

@brightmart I could download the file now. Thanks 👍

@behnazeslami commented Dec 10, 2017

@pmahend1
@brightmart Hi,
After running p8_TextRNN_train.py I've got this error:

File "./p8_TextRNN_train.py", line 117, in main
    test_loss, test_acc = do_eval(sess, textRNN, testX, testY, batch_size,vocabulary_index2word_label)
  File "./p8_TextRNN_train.py", line 167, in do_eval
    return eval_loss/float(eval_counter),eval_acc/float(eval_counter)
ZeroDivisionError: float division by zero

How can I solve this issue?
I checked the do_eval function in the training script; the for loop inside it never executes.
I have attached screenshots below.
[screenshots of the do_eval code and the error]
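
The traceback suggests the evaluation loop produced zero batches, so eval_counter stays 0. Below is a generic, hedged guard against this (variable names mirror the traceback, but the batching logic is assumed, not copied from the repository); the usual root cause is an empty or unloaded test set.

```python
# Hedged sketch of a do_eval-style loop that cannot divide by zero: if no
# evaluation batch is produced (e.g. the test set is empty or smaller than
# batch_size), return neutral values instead of crashing.
def do_eval_safe(run_batch, testX, testY, batch_size):
    eval_loss, eval_acc, eval_counter = 0.0, 0.0, 0
    for start in range(0, len(testX), batch_size):
        end = start + batch_size
        loss, acc = run_batch(testX[start:end], testY[start:end])
        eval_loss, eval_acc, eval_counter = eval_loss + loss, eval_acc + acc, eval_counter + 1
    if eval_counter == 0:
        # Nothing was evaluated: most likely testX is empty because the test
        # file was not found or not parsed; check the data-loading step first.
        return 0.0, 0.0
    return eval_loss / float(eval_counter), eval_acc / float(eval_counter)
```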

@iqrasafder

The above links to download 'zhihu-word2vec-title-desc.bin-100' are not working.
Please share some working links to download the data.

Thanks

@iqrasafder

@brightmart Please share links to download the dataset.

@arunarn2

@brightmart Do I need an account on pan.baidu.com to download the dataset? Can you please upload the data to the repo?

@brightmart (Owner)

No account needed.

@parahaoer

Which directory should I put the file 'zhihu-word2vec-title-desc.bin-100' in?
Thanks!

@liangtianxin

thank you so much

@JaeZheng

@parahaoer I think you should put it in the same directory as the *_train.py script you run. For example, when you use TextCNN in the directory a02_TextCNN, you run p7_TextCNN_train.py to train the model, so you should put the file 'zhihu-word2vec-title-desc.bin-100' in that same directory.
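
A small, hedged sanity check (generic Python, not part of the repository) to confirm the embedding file is visible from where the training script is launched, since relative paths are resolved against the current working directory:

```python
# Hedged sketch: verify the embedding file can be found from the directory
# where you launch p7_TextCNN_train.py / p8_TextRNN_train.py.
import os

embedding_path = "zhihu-word2vec-title-desc.bin-100"  # or '../...' as configured in the flags
print("working directory:", os.getcwd())
print("embedding found:", os.path.isfile(embedding_path))
print("absolute path tried:", os.path.abspath(embedding_path))
```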

@harirajeev

Could you please upload the data file 'zhihu-word2vec-title-desc.bin-100' as well? The links do not work. Any quick response is appreciated.

@kevinsay commented Jul 27, 2018

@brightmart I load zhihu-word2vec-title-desc.bin-100 as the word vector file and train-zhihu4-only-title-all.txt as the training file, with multi_label_flag=False and use_embedding=True.
The models a01_FastText, a03_TextRNN, a04_TextRCNN, a05_HierarchicalAttentionNetwork, and a06_Seq2seqWithAttention all run, but the accuracy is very low and I don't know why.
For prediction, also with multi_label_flag=False and use_embedding=True, I get more than one predicted label. I need your help, thanks.
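
One possible reason for seeing several labels in single-label mode is how the logits are decoded at prediction time. The sketch below contrasts the two decodings in plain numpy; it is only an illustration of the general technique, not the repository's predict script, and the numbers are made up.

```python
# Hedged sketch: decoding logits for single-label vs multi-label prediction.
import numpy as np

logits = np.array([2.1, -0.3, 0.7, 1.5])  # one example, 4 classes (made-up numbers)

# Single-label (multi_label_flag=False): take exactly one class, the argmax.
single_label = int(np.argmax(logits))

# Multi-label (multi_label_flag=True): threshold each class independently,
# e.g. sigmoid(logit) > 0.5, which can yield several labels per example.
probs = 1.0 / (1.0 + np.exp(-logits))
multi_labels = np.where(probs > 0.5)[0].tolist()

print(single_label)   # 0
print(multi_labels)   # [0, 2, 3]
```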

@brightmart (Owner) commented Aug 12, 2018

Hi, thanks for your feedback. As long as the training and validation loss decrease during training, it should be fine. The previously reported F1 score is not a correct indicator of accuracy; I am updating the way the F1 score is computed today.

It is good to see that you can make it work for these models. Can you commit your version to this repository as a new branch?
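
For reference, a minimal sketch of micro-averaged F1 over multi-label predictions (a generic formulation; the repository's updated computation may differ):

```python
# Hedged sketch: micro-averaged F1 for multi-label predictions.
# y_true / y_pred are lists of label-id sets, one set per example.
def micro_f1(y_true, y_pred):
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))   # correctly predicted labels
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))   # predicted but wrong
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))   # missed labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Example: two samples with label-id sets.
print(micro_f1([{1, 2}, {3}], [{1}, {3, 4}]))  # 0.666...
```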

brightmart reopened this Aug 12, 2018
@hlshao commented Aug 24, 2018

Respected Sir:

It is a good project!
Could you please provide the file "zhihu-word2vec-title-desc.bin-100" somewhere?
The link below is out of date too...
Many thanks if you can help.

[NEW, try using this, updated 2018-08-12]
zhihu-title-desc-multiple-label-v6.txt.zip, it contains three files:
https://pan.baidu.com/s/1mHgELJUHewQZ9zHDo_uhmA
train-zhihu-title-desc-multiple-label-v6.txt (around 3 million training examples, multiple labels)
test-zhihu-title-desc-multiple-label-v6.txt (around 70k validation/test examples, multiple labels)
train-zhihu-title-desc-multiple-label-200k-v6.txt (200k training examples, multiple labels; a subset of file one)

@brightmart (Owner) commented Aug 24, 2018 via email

@peterHeuz commented Aug 28, 2018

I am using the TextRNN (a03) and cannot find this flag. The downloads are not working either.

I have changed the following in p8_TextRNN_train.py:
tf.app.flags.DEFINE_boolean("use_embedding", False, "whether to use embedding or not.")

but the error is still the same:
IOError: [Errno 2] No such file or directory: 'zhihu-word2vec.bin-100'

I have also changed this flag in a02_TextCNN, since the TextRNN uses code (data_util_zhihu.py) from that part. The error is still the same.

Can you please share the pretrained embeddings or point me to the right place?

Edit: this one seems to be up to date: https://pan.baidu.com/s/1jIP9e6q. I am using your instructions to download, @brightmart. After step 1 I get a window that tells me to download the netdisk client from Baidu:

[screenshot of the Baidu netdisk prompt]

The installer is in Chinese, which I don't speak, and there is no English version.
Could anyone who has the file please upload it with another service, like Google Drive, Dropbox, or OneDrive? @deatherving @liangtianxin and anyone else who has it. It would be appreciated.
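
For anyone who does manage to download the file, a hedged way to verify it is a valid word2vec binary before pointing the training scripts at it (using gensim; the expected dimensionality of 100 is an assumption based on the "-100" suffix):

```python
# Hedged sketch: check that a downloaded word2vec binary loads and has the
# expected dimensionality.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("zhihu-word2vec-title-desc.bin-100", binary=True)
print("vocabulary size:", len(kv.index_to_key))   # use kv.index2word on gensim < 4.0
print("vector size:", kv.vector_size)             # expected to be 100 given the suffix
```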

@Gunjitbedi commented Sep 6, 2018

@brightmart Can't download the file 'test-zhihu-forpredict-title-desc-v6.txt'; please share it on another platform.

@brightmart (Owner)

Re-generated the data and saved it as a cached file, available to download. Check this section in README.md:

#Sample data: cached file

@wuliuyuedetian

Where is the file test-zhihu-forpredict-title-desc-v6.txt?
When running a8_predict.py:
IOError: [Errno 2] No such file or directory: '../test-zhihu-forpredict-title-desc-v6.txt'
Me too.

@Aivi001 commented Feb 14, 2020

(quoting @peterHeuz's comment above about the missing word2vec embedding and the Baidu netdisk download)

Hi, I'm facing the same problem. How did you solve it? Thanks in advance.

@litomvv commented Aug 18, 2020

Hi, where is the file test-zhihu-forpredict-title-desc-v6.txt?
