
data process #3

Open
chrislouis0106 opened this issue Nov 25, 2021 · 2 comments

Comments

@chrislouis0106

Hi there,
Could you release the code for constructing the NELL23K and WD-singer datasets?
Also, did you download the Wikidata data directly from the official dump at “https://www.wikidata.org/wiki/Wikidata:Database_download/en”? And did you then create the triples by integrating the entities with their corresponding concepts?
If so, such a dataset seems rather sloppy!

@davidlvxin
Member

This work was done early last year, and I can't find the original generation code, but I still remember the processing idea.

We built the Wikidata dataset based on KACC-large. Specifically, we selected concepts whose labels contain the word "singer", and then identified the entities belonging to those concepts as the seed entity set. However, there are few direct connecting edges between these entities, so the seed set alone does not fully reflect singer-related knowledge, such as a singer's birthplace. Thus, we randomly added some of the high-frequency entities connected to the seed entities into the set. The ratio of newly added entities to the original seed entities is about 2:5. After that, we formed the relation set by keeping only the higher-frequency relations between these entities. Finally, we used the entity and relation sets to extract the corresponding triples from the KACC-large entity triples as our dataset.
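Since the original generation code is lost, here is a minimal sketch of the steps described above. The file names, input formats, and the frequency threshold are assumptions for illustration only, not the actual script.

```python
# Hypothetical sketch of the WD-singer construction steps described above.
# File names, formats, and MIN_REL_FREQ are assumptions; only the overall
# procedure (seed concepts -> seed entities -> 2:5 neighbor expansion ->
# frequent relations -> triple extraction) follows the description.
from collections import Counter

def load_triples(path):
    """Load tab-separated (head, relation, tail) triples; format is assumed."""
    with open(path) as f:
        return [tuple(line.strip().split("\t")) for line in f]

entity_triples = load_triples("kacc_large_entity_triples.tsv")
instance_of = load_triples("kacc_large_instance_of.tsv")  # (entity, instanceOf, concept)
concept_labels = dict(
    line.strip().split("\t") for line in open("kacc_large_concept_labels.tsv")
)

# Step 1: concepts whose label contains the word "singer".
singer_concepts = {c for c, label in concept_labels.items() if "singer" in label.lower()}

# Step 2: entities belonging to those concepts form the seed set.
seeds = {e for e, _, c in instance_of if c in singer_concepts}

# Step 3: add high-frequency neighbors of the seeds, roughly 2 new entities
# for every 5 seed entities (the 2:5 ratio mentioned above).
neighbor_freq = Counter()
for h, r, t in entity_triples:
    if h in seeds and t not in seeds:
        neighbor_freq[t] += 1
    elif t in seeds and h not in seeds:
        neighbor_freq[h] += 1
n_extra = int(len(seeds) * 2 / 5)
entities = seeds | {e for e, _ in neighbor_freq.most_common(n_extra)}

# Step 4: keep only the higher-frequency relations among the selected entities.
rel_freq = Counter(r for h, r, t in entity_triples if h in entities and t in entities)
MIN_REL_FREQ = 50  # assumed threshold
relations = {r for r, c in rel_freq.items() if c >= MIN_REL_FREQ}

# Step 5: extract the final triples using the entity and relation sets.
dataset = [(h, r, t) for h, r, t in entity_triples
           if h in entities and t in entities and r in relations]
```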

We acknowledge that constructing the domain dataset in this way is rather crude. However, the construction of the dataset is not the main contribution of our work. We aim to validate the effectiveness of our model on sparse knowledge graphs. Finally, we call for more work to construct accurate datasets with manual quality control to advance the field.

@chrislouis0106
Author

Thanks to your clear explanation and after reading the KACC paper, I now understand the dataset construction process. In fact, most open-domain knowledge graphs are sparse, and Wikidata-based knowledge graphs in particular are also sparse. I guess you extracted only singer-related knowledge at the time just to reduce the graph size; without adding entities to the extracted graph, the sparsity would be so high that there would be no way to model the reasoning process.
Thank you very much.
