
data process #3

Open
chrislouis0106 opened this issue Nov 25, 2021 · 2 comments

Comments

@chrislouis0106

Hi there,
Could you release the code for constructing the NELL23K and WD-singer datasets?
Also, did you download the Wikidata data directly from the official dump at “https://www.wikidata.org/wiki/Wikidata:Database_download/en”? And did you then create the triples by integrating the entities with their corresponding concepts?
If so, such a dataset seems rather sloppy!

@davidlvxin
Member

This work was done early last year, and I can't find the original generation code, but I still remember the processing idea.

We built the Wikidata dataset based on KACC-large. Specifically, we selected concepts whose labels contain the word "singer", and then identified the entities belonging to those concepts as the seed entity set. However, there are few direct connecting edges between these entities, so the seed set alone does not fully reflect singer-related knowledge, such as a singer's birthplace. Thus, we randomly added some of the high-frequency entities connected to the seed entities into the set. The ratio of newly added entities to the original seed entities is about 2:5. After that, we formed the relation set by keeping only the higher-frequency relations between these entities. Finally, we used the entity and relation sets to extract the corresponding triples from the KACC-large entity triples as our dataset.
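Since the original generation code is lost, here is a minimal sketch of the steps described above. The file names, input formats, and the frequency threshold are assumptions for illustration only, not the actual script.

```python
# Hypothetical sketch of the WD-singer construction steps described above.
# File names, formats, and MIN_REL_FREQ are assumptions; only the overall
# procedure (seed concepts -> seed entities -> 2:5 neighbor expansion ->
# frequent relations -> triple extraction) follows the description.
from collections import Counter

def load_triples(path):
    """Load tab-separated (head, relation, tail) triples; format is assumed."""
    with open(path) as f:
        return [tuple(line.strip().split("\t")) for line in f]

entity_triples = load_triples("kacc_large_entity_triples.tsv")
instance_of = load_triples("kacc_large_instance_of.tsv")  # (entity, instanceOf, concept)
concept_labels = dict(
    line.strip().split("\t") for line in open("kacc_large_concept_labels.tsv")
)

# Step 1: concepts whose label contains the word "singer".
singer_concepts = {c for c, label in concept_labels.items() if "singer" in label.lower()}

# Step 2: entities belonging to those concepts form the seed set.
seeds = {e for e, _, c in instance_of if c in singer_concepts}

# Step 3: add high-frequency neighbors of the seeds, roughly 2 new entities
# for every 5 seed entities (the 2:5 ratio mentioned above).
neighbor_freq = Counter()
for h, r, t in entity_triples:
    if h in seeds and t not in seeds:
        neighbor_freq[t] += 1
    elif t in seeds and h not in seeds:
        neighbor_freq[h] += 1
n_extra = int(len(seeds) * 2 / 5)
entities = seeds | {e for e, _ in neighbor_freq.most_common(n_extra)}

# Step 4: keep only the higher-frequency relations among the selected entities.
rel_freq = Counter(r for h, r, t in entity_triples if h in entities and t in entities)
MIN_REL_FREQ = 50  # assumed threshold
relations = {r for r, c in rel_freq.items() if c >= MIN_REL_FREQ}

# Step 5: extract the final triples using the entity and relation sets.
dataset = [(h, r, t) for h, r, t in entity_triples
           if h in entities and t in entities and r in relations]
```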

We acknowledge that constructing the domain dataset in this way is rather crude. However, the construction of the dataset is not the main contribution of our work. We aim to validate the effectiveness of our model on sparse knowledge graphs. Finally, we call for more work to construct accurate datasets with manual quality control to advance the field.

@chrislouis0106
Author

Thanks to your clear explanation and after reading the KACC paper, I now understand the dataset construction process. In fact, most open-domain knowledge graphs are sparse, and Wikidata-based knowledge graphs in particular are also sparse. I guess you extracted only singer-related knowledge at the time just to reduce the graph size; without adding entities to the extracted graph, the sparsity would be so high that there would be no way to model the reasoning process.
Thank you very much.
