This repository contains files and information about step 4 of Kaphta Architecture: System for the retrieval and ranking of indexed information, using the R language.

ramongsilva/System-for-the-retrieval-and-ranking-of-indexed-information

System for the retrieval and ranking of indexed information

This repository contains files and information about step 4 of the Kaphta Architecture: System for the retrieval and ranking of indexed information. This stage presents five algorithms, one for each search type: search for polyphenol, search for cancer, search for gene, search for polyphenol-cancer, and search for polyphenol-gene. According to the type of search performed, the system retrieves the abstracts indexed in the previous stage (Indexing of Extracted Information step) and submits them to the ranking algorithm, which returns five scores (S1, S2, S3, S4, and S5) for each abstract. In the algorithm, the calculation of S1, S2, and S3 varies according to the type of search. Scores S1 and S2 are calculated from the number and type of sentences containing recognized entities and rules. The sentence types are PC (polyphenol-cancer sentences), PG (polyphenol-gene sentences), P (polyphenol-only sentences), C (cancer-only sentences), and G (gene-only sentences). Each sentence type is assigned a different number of points.

For more information about this and other steps of the Kaphta Architecture, see the sections of the Kaphta Web Tool, available at https://portal.ifsuldeminas.edu.br/kaphtawebtool/.

The table of points and the algorithm for ranking indexed PubMed abstracts are presented below.

Table of points for S1 and S2 score calculation in the retrieval and ranking algorithm

Description of the algorithm for ranking indexed PubMed abstracts

The algorithm performs the retrieval and ranking of indexed articles based on the user's search type: search for polyphenol, search for cancer, search for gene, search for polyphenol-cancer, or search for polyphenol-gene. From there, the execution continues:

Search for polyphenol

  • Input: id of the Polyphenol (P) searched;
  • Retrieval of indexed PubMed abstracts related to the searched P, from the df_polyphenol_individual_indexation.tsv file;
  • Loop start: for each PubMed abstract, the following scores are calculated:
    • S1 = s1_score_calc_PC() + s1_score_calc_P(); // where P refers to the searched polyphenol entity, and C can be any cancer entity
    • S2 = s2_score_calc(); // where P refers to the searched polyphenol entity, and C and G can be any cancer and gene entities
    • S3 = sum of P entities recognized in PubMed abstract;
  • Loop end
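
The retrieval step and the S3 count for a polyphenol search can be sketched in R as follows. This is a minimal illustration, not the repository's implementation: the layout of df_polyphenol_individual_indexation.tsv is not described here, so the column names `pmid`, `entity_id`, and `n_mentions` and the helper `s3_for_polyphenol()` are assumptions, and a small in-memory data frame stands in for the file. The sketch also assumes that S3 counts mentions of the searched polyphenol only.

```r
# Hypothetical stand-in for the indexation file; the real column names
# may differ (pmid, entity_id and n_mentions are assumptions).
index_p <- data.frame(
  pmid       = c("111", "111", "222", "333"),
  entity_id  = c("P1", "P2", "P1", "P2"),
  n_mentions = c(3L, 1L, 2L, 5L),
  stringsAsFactors = FALSE
)

# Keep the rows indexed for the searched polyphenol, then sum its
# mentions per abstract to obtain S3.
s3_for_polyphenol <- function(index, polyphenol_id) {
  hits <- index[index$entity_id == polyphenol_id, , drop = FALSE]
  if (nrow(hits) == 0) {
    return(data.frame(pmid = character(0), S3 = integer(0),
                      stringsAsFactors = FALSE))
  }
  s3 <- tapply(hits$n_mentions, hits$pmid, sum)
  data.frame(pmid = names(s3), S3 = as.integer(s3),
             stringsAsFactors = FALSE)
}

res_p1 <- s3_for_polyphenol(index_p, "P1")
print(res_p1)
```

With the real data, the data frame would typically be loaded with `read.delim()` before calling the helper.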

Search for Cancer

  • Input: id of the Cancer (C) searched;
  • Retrieval of indexed PubMed abstracts related to the searched C, from the df_cancers_individual_indexation.tsv file;
  • Loop start: for each PubMed abstract, the following scores are calculated:
    • S1 = s1_score_calc_PC() + s1_score_calc_P(); // where C refers to the searched cancer entity, and P can be any polyphenol entity
    • S2 = s2_score_calc(); // where C refers to the searched cancer entity, and P and G can be any polyphenol and gene entities
    • S3 = sum of C entities recognized in PubMed abstract;
  • Loop end

Search for Gene

  • Input: id of the Gene (G) searched;
  • Retrieval of indexed PubMed abstracts related to the searched G, from the df_genes_individual_indexation.tsv file;
  • Loop start: for each PubMed abstract, the following scores are calculated:
    • S1 = s1_score_calc_PC() + s1_score_calc_P(); // where P and C can be any polyphenol and cancer entities
    • S2 = s2_score_calc(); // where P and C can be any polyphenol and cancer entities, and G refers to the searched gene entity
    • S3 = sum of G entities recognized in PubMed abstract;
  • Loop end

Search for polyphenol-cancer

  • Input: ids of the Polyphenol (P) and Cancer (C) searched;
  • Retrieval of indexed PubMed abstracts related to the searched P and C, from the df_cross_indexation_polyphenol_cancer_association.tsv file;
  • Loop start: for each PubMed abstract, the following scores are calculated:
    • S1 = s1_score_calc_PC() + s1_score_calc_P(); // where P and C refer to the searched entities
    • S2 = s2_score_calc(); // where P and C refer to the searched entities, and G can be any gene entity
    • S3 = sum of P and C entities recognized in PubMed abstract;
  • Loop end

Search for polyphenol-gene

  • Input: ids of the Polyphenol (P) and Gene (G) searched;
  • Retrieval of indexed PubMed abstracts related to the searched P and G, from the df_cross_indexation_gene_polyphenol_association.tsv file;
  • Loop start: for each PubMed abstract, the following scores are calculated:
    • S1 = s1_score_calc_PC() + s1_score_calc_P(); // where P refers to the searched polyphenol entity, and C can be any cancer entity
    • S2 = s2_score_calc(); // where P and G refer to the searched entities, and C can be any cancer entity
    • S3 = sum of P and G entities recognized in PubMed abstract;
  • Loop end
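
Each of the five search types above reads a different indexation file. One way to express that dispatch in R is sketched below. The file names are the ones listed above; the function name `index_file_for_search()` and the search-type labels are hypothetical and chosen only for illustration.

```r
# Map a search type to the indexation file it retrieves abstracts from.
# The labels on the left are illustrative; the file names come from the
# algorithm description above.
index_file_for_search <- function(search_type) {
  switch(search_type,
    "polyphenol"        = "df_polyphenol_individual_indexation.tsv",
    "cancer"            = "df_cancers_individual_indexation.tsv",
    "gene"              = "df_genes_individual_indexation.tsv",
    "polyphenol-cancer" = "df_cross_indexation_polyphenol_cancer_association.tsv",
    "polyphenol-gene"   = "df_cross_indexation_gene_polyphenol_association.tsv",
    stop("Unknown search type: ", search_type)  # default branch
  )
}

gene_file <- index_file_for_search("gene")
print(gene_file)
```

Failing early on an unknown search type keeps a typo from silently selecting the wrong index.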

// Final processing, after the loop

Normalization of S1, S2 and S3 scores (0 to 1);

S4 → (5*S1 + 2*S2 + 3*S3) / 10;

S5 → result of text classification based on ensemble;

Result (output) → list of the PubMed abstracts with extracted information and ranking scores calculated.
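
The final processing can be illustrated with a short R sketch. The S4 formula is the one given above. The normalization method is not specified beyond the 0 to 1 range, so min-max scaling is an assumption here, as are the helper name `normalize_01()`, the example scores, and the choice to sort the output by S4. S5 is omitted because it comes from the separate text-classification ensemble.

```r
# Min-max scaling to [0, 1] (assumed normalization method).
normalize_01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))  # constant scores: avoid division by zero
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up raw scores for three abstracts.
scores <- data.frame(
  pmid = c("111", "222", "333"),
  S1   = c(10, 4, 0),
  S2   = c(2, 6, 4),
  S3   = c(1, 3, 5),
  stringsAsFactors = FALSE
)

scores$S1 <- normalize_01(scores$S1)
scores$S2 <- normalize_01(scores$S2)
scores$S3 <- normalize_01(scores$S3)

# Weighted combination from the algorithm description.
scores$S4 <- (5 * scores$S1 + 2 * scores$S2 + 3 * scores$S3) / 10

# Sorting by S4 is an illustrative choice, not stated in the source.
ranked <- scores[order(scores$S4, decreasing = TRUE), ]
print(ranked)
```

Because the weights 5, 2, and 3 sum to 10, S4 also falls between 0 and 1 once S1, S2, and S3 are normalized.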

Algorithm for search of polyphenol-cancer (example of part of the algorithm)

The files listed below are an example of the algorithm for ranking indexed PubMed abstracts for a polyphenol-cancer search. The complete implementation of the algorithm is available in the Kaphta Web Tool.

  • polyphenol_cancer_search_ranking_algorithm_gh.R: R script with an example of the algorithm for retrieval and ranking of indexed PubMed abstracts for a polyphenol-cancer search.
  • functions.R: script with auxiliary functions. Save this file in the same folder as the polyphenol_cancer_search_ranking_algorithm_gh.R script, which needs it to run.
  • db_total_project.db: SQLite database needed to execute all R scripts of the Kaphta Architecture steps. This database contains tables with the entity dictionary, the total PubMed abstracts textual corpus, and the PubMed abstracts classified as positive in text classification. Save this file in the same folder as the polyphenol_cancer_search_ranking_algorithm_gh.R script, which needs it to run.
  • entities-recognized: folder with the files resulting from the NER task, containing extracted information about the named entities (polyphenols, cancers, and genes) recognized in PubMed abstracts and indexed in the previous stage (Indexing of Extracted Information step). Save this folder, with its files, in the same folder as the polyphenol_cancer_search_ranking_algorithm_gh.R script, which needs it to run.
  • Rule_associations_recognized.rar: compressed file resulting from the AR task in the Information Extraction stage, containing the PubMed abstract sentences in which at least one rule from the rules dictionary was recognized. Save this file in the same folder as the polyphenol_cancer_search_ranking_algorithm_gh.R script, which needs it to run.
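
Since the example script expects all of these items in its own folder, a quick check before running it can save a failed run. The sketch below uses only base R; the helper name `missing_inputs()` is hypothetical, and the check assumes the working directory is the folder that holds the script.

```r
# Files and folder the example script expects next to it (names taken
# from the list above).
required_inputs <- c(
  "functions.R",
  "db_total_project.db",
  "entities-recognized",
  "Rule_associations_recognized.rar"
)

# Return the required inputs that are absent from a folder.
# file.exists() also returns TRUE for directories.
missing_inputs <- function(folder = ".") {
  required_inputs[!file.exists(file.path(folder, required_inputs))]
}

absent <- missing_inputs()
if (length(absent) > 0) {
  message("Missing inputs: ", paste(absent, collapse = ", "))
} else {
  message("All inputs found.")
}
```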
