Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement of gap-filling in refineGEMs #52

Open
18 of 22 tasks
GwennyGit opened this issue Jan 19, 2023 · 27 comments · Fixed by #61, #66 or #73
Open
18 of 22 tasks

Improvement of gap-filling in refineGEMs #52

GwennyGit opened this issue Jan 19, 2023 · 27 comments · Fixed by #61, #66 or #73
Assignees
Labels
enhancement New feature or request

Comments

@GwennyGit
Copy link
Collaborator

GwennyGit commented Jan 19, 2023

In this issue all current gap-filling tools implemented in refineGEMs are summarised and possible enhancements explored.

Current gap-filling modules:

  • genecomp (now: kegg_analysis):
    ⇾ Extracts KEGG gene identifiers from model
    ⇾ Compares KEGG genes in model with the strain-specific ones in KEGG
    ⇾ Extracts RefSeq IDs (GPR) from the .gff file
    ⇾ Maps BiGG to KEGG IDs
    ⇒ Returns a table containing missing reactions with locus tag, EC number, KEGG ID, BiGG ID and RefSeq ID (GPR)
  • curate:
    1. With option gapfill
      ⇾ Adds reactions with the corresponding IDs, stoichiometric coefficients, educts, products, upper & lower bound to the model from a manually obtained table
    2. With option metabs
      ⇾ Adds metabolites with the corresponding IDs, formulae, and name to the model from a manually obtained table
      ⇾ Synchronises the metabolite information over all compartments

Creation of gapfill module for BioCyc (& Adjustment of genecomp to gapfill):

  • Add modules gapfill, entities & analysis_biocyc
  • Rename genecomp module to analysis_kegg
  • Adjust code in analysis_kegg
  • Adjust code in curate
  • Extract all functions that could be used for analysis_kegg & analysis_biocyc into analysis_db
  • Add the following functions to gapfill:
    • Function I: Obtains missing genes, metabolites & reactions (→ Function gapfill_analysis)
      • By comparing genes in model with genes from BioCyc (→ Module analysis_biocyc)
        • Adds charges and chemical formulas from the CHEBI API to the metabolites
        • Retrieves stoichiometric values from the BiGG API
      • By comparing genes in the model with KEGG genes (→ Module analysis_kegg, only obtains reactions & genes/proteins)
      • By comparing genes in the model with KEGG, then with BioCyc and combining the results (Currently, no complete merge!)
      • Filter missing reactions by metabolite compartments & missing metabolites by compartment
    • Function II: Adds the missing genes, metabolites & reactions (→ Module entities, See script: https://github.com/draeger-lab/py4gems/blob/main/Reihaneh/1.%20BioCyc_Comparison.ipynb)

Further improvements:

  • Retrieve missing metabolites via 'bigg_reaction' column from reactions table instead of from the 'Reactants' and 'Products' columns
  • Add functionality to apply BioCyc comparison to models where organism does not occurr in any database:
    • From the GFF & FASTA files with DIAMOND & BioCyc SmartTables (→ lab_strain) obtain missing genes → Already fulfilled now by GeneGapFiller
  • Adjust kegg_analysis to also return tables with missing genes & metabolites
    → Then the result from kegg_analysis can also be added to a model.
    → Already fulfilled now by KEGGapFiller
  • Do a complete merge on all BioCyc & KEGG tables on user request
  • Make update_annottions_from_table part of GapFiller.fill_model
  • Add a check similar to verifyGapfilledReactions to GapFiller.fill_model

Further ideas:

  • a type of template gap filling that utilises a genome and a template model to extract information of homologous genes and reactions to improve the current model
@GwennyGit GwennyGit added the enhancement New feature or request label Jan 19, 2023
@famosab
Copy link
Member

famosab commented Jan 31, 2023

Regarding Function I: we already have a parser for gff files integrated (it is in the function get_locus_gpr from genecomp). Maybe we can expand from there - the only obstacle would be to make sure we have similar IDs that can be compared to each other. At the moment I have the problem that the GPRs in my models cannot be found in my gff file because the naming is completely different.

@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 1, 2023

Regarding function I: For strains that are not in KEGG but in BioCyc I think it will be better to use the BioCyc SmartTables as reference. However, for lab strains this function could be still useful. 🤔 For the BioCyc option I will add a comment to Function I. Maybe for that the script from Reihaneh (@Biomathsys) could be used (or maybe adjusted), see Code here: https://github.com/draeger-lab/py4gems/blob/main/Reihaneh/1.%20BioCyc_Comparison.ipynb.

@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 2, 2023

The current module curate does only add genes and reactions to the model. However, for gapfill the metabolites should be added too. In Reihaneh's script COBRA is used and the implementation to add reactions and metabolites is already properly implemented. Thus, I will use her approach for that. In addition, I will use the function to add the missing genes from curate and extend the tables from KEGG/BioCyc/GFF file with the BiGG identifiers for each reaction and metabolite, respectively.


As the function to add reactions will not be used from curate this module will be kept as such. However, the following three new modules will be generated in addition to the gapfill module: entities, analysis_kegg and analysis_biocyc.

GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
GwennyGit added a commit that referenced this issue Feb 2, 2023
Removed a function that was generalised to work with KEGG and BioCyc and is now in gapfill.
@GwennyGit
Copy link
Collaborator Author

GwennyGit commented Feb 2, 2023

For the comparison between already existing metabolites & reactions I realised that if I add the BiGG identifiers to the table the checks from Reihaneh's script are not necessary. Thus, I will extend the functionality of entities.


In analysis_kegg the function get_locus_gpr (line 167) can be adjusted to get the protein IDs from the Genbank GFF/FASTA (CarveMe input) file. The currently retrieved GPRs are basically the RefSeq identifiers from the RefSeq GFF file. Additionally, the function should be moved to gapfill as it can be used for analysis_kegg and analysis_biocyc.
Note to myself: Are there maybe more functions that can be used for both modules? Potentially the function retrieving the BiGG IDs?

@GwennyGit
Copy link
Collaborator Author

Extracting the functions required for analysis_kegg and analysis_biocyc from analysis_kegg into gapfill created a cycle as these functions were called in each of the other two modules and these modules in return were called within the gapfill module. To still reduce redundancy another module analysis_db is created that now contains the overlapping functions for analysis_kegg and analysis_biocyc.

@famosab
Copy link
Member

famosab commented Feb 3, 2023

Maybe the following publication / code is of interest NICEgame. They mention in their manuscript that they also worked with Python, however in the gh repo I only found Matlab scripts.

@GwennyGit
Copy link
Collaborator Author

From the paper I understand that the authors use media for which it is known that the bacterium should grow on to fill gaps in the model. This approach would be similar to the one from the CarveMe documentation or also the gap filling approach from COBRApy. This would be a nice addition to the gap filling via the genes I think. I already considered adding the call for the CarveMe gap filling after using the gap filling from the genes. However, as far as I understood these programs the user needs to know exactly on which media the bacterium would grow. Thus, I find it rather difficult to use any of the tools as we have strain-specific models. For which I suppose that not every strain of e.g. Staphylococcus haemolyticus grows on the same media, especially, if microbiome media are used like SNM3. 🤔

@famosab
Copy link
Member

famosab commented Feb 7, 2023

We can use requests to access the BiGG database.

Here is an example how to use it with BiGG:

import requests
import refinegems as rg

reac_url = 'http://bigg.ucsd.edu/api/v2/universal/reactions/'
metab_url = 'http://bigg.ucsd.edu/api/v2/universal/metabolites/'

mod = rg.load.load_model_cobra('../../models/Cstr_14.xml')

# requests.get(metab_url+'o2').json()['charges']

for metab in mod.metabolites:
    id = metab.id[:-2]
    print(id, requests.get(metab_url+id).json()['name'])

For metabolites these field can be accessed ['database_links', 'bigg_id', 'formulae', 'old_identifiers', 'charges', 'name', 'compartments_in_models']. Metabolites need to be entered without the compartment information and the beginning M so instead of M_o2_c use o2.

For reactions ['models_containing_reaction', 'reaction_string', 'metabolites', 'database_links', 'bigg_id', 'old_identifiers', 'name', 'pseudoreaction'].

@GwennyGit
Copy link
Collaborator Author

To have all parsing functions combined the module parse was created. However, not all functions that would potentially fit into this module have been added yet.


The function add_charges_chemical_formulae_to_metabs in the module analysis_biocyc currently causes a KeyError which should be solved in the next commit.

cb-Hades added a commit that referenced this issue Jul 19, 2024
cb-Hades added a commit that referenced this issue Jul 24, 2024
cb-Hades added a commit that referenced this issue Jul 29, 2024
cb-Hades added a commit that referenced this issue Jul 30, 2024
added get_missing_reactions to GeneGapFiller
cb-Hades added a commit that referenced this issue Jul 31, 2024
cb-Hades added a commit that referenced this issue Jul 31, 2024
cb-Hades added a commit that referenced this issue Aug 1, 2024
cb-Hades added a commit that referenced this issue Aug 7, 2024
cb-Hades added a commit that referenced this issue Aug 7, 2024
cb-Hades added a commit that referenced this issue Aug 8, 2024
first version of KEGG,MNX,BiGG reconstruction of metabolites and reactions
cb-Hades added a commit that referenced this issue Aug 9, 2024
Added functions for adding reac / metabs per database id
cb-Hades added a commit that referenced this issue Aug 12, 2024
some clean up, some new docstrings, started func for fill_model
GwennyGit added a commit that referenced this issue Aug 14, 2024
GwennyGit added a commit that referenced this issue Aug 15, 2024
cb-Hades added a commit that referenced this issue Aug 19, 2024
changed missing genes and reacs to attributes instead of return values
GwennyGit added a commit that referenced this issue Aug 20, 2024
GwennyGit added a commit that referenced this issue Aug 21, 2024
- Adjusted due to new db_access set up
- Started writing code for mapping from BioCyc IDs to other databases for the BioCycGapFiller
GwennyGit added a commit that referenced this issue Aug 21, 2024
cb-Hades added a commit that referenced this issue Aug 22, 2024
GwennyGit added a commit that referenced this issue Aug 22, 2024
GwennyGit added a commit that referenced this issue Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants