This repository has been archived by the owner on Sep 28, 2023. It is now read-only.
patrikkaura/quadtree


Quadtree



Quadtree is a gradient-boosted decision tree model that predicts guanine quadruplexes (G-quadruplexes) in DNA sequences. It is built on top of the LightGBM Python library. Each base of the sequence is encoded according to a fixed encoding scheme. The model was trained for use with a sliding window, so a sequence of any length is analysed window by window. The model can be used as a Python script or through the preview website quadtree.vercel.app.
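
As a rough illustration of the sliding-window idea described above, the sketch below encodes each base as an integer and cuts the sequence into overlapping fixed-width windows. The encoding table `BASE_CODES`, the window width of 8, the step of 1 and the helper names `encode` and `sliding_windows` are all assumptions made for illustration. The encoding scheme and window size that Quadtree actually uses are not stated here and may differ.

```python
from typing import List

# Hypothetical encoding (an assumption, not Quadtree's real scheme):
# one integer per base, with unknown symbols such as N mapped to 0.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(sequence: str) -> List[int]:
    """Encode a DNA string as a list of integers, one per base."""
    return [BASE_CODES.get(base, 0) for base in sequence.upper()]


def sliding_windows(encoded: List[int], width: int = 8, step: int = 1) -> List[List[int]]:
    """Return every window of `width` encoded bases, moving `step` bases at a time.

    A sequence shorter than `width` yields no windows.
    """
    last_start = len(encoded) - width
    return [encoded[start:start + width] for start in range(0, last_start + 1, step)]


# Each window would become one feature row for the classifier.
windows = sliding_windows(encode("GGGAGGGCACGGGTTGGG"), width=8)
print(len(windows), windows[0])
```

In a setup like this, every window is scored independently, which is why the length of the input sequence does not need to be limited.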

Repository structure

quadtree
    ├─ web -> preview website source code
    └─ python
          ├─ model -> LightGBM model parameters
          ├─ train -> example files showing how training was performed
          └─ quadtree.py -> predictor

Requirements

  • lightgbm==3.3.2
  • numpy==1.21.2

Install dependencies

Before using Quadtree, install the requirements:

  pip install -r requirements.txt

Usage

Create model instance

  from quadtree import Quadtree
  
  model = Quadtree()

Run analysis - algorithm inputs

  • sequence as a string (there is no maximum length)
  • score threshold (the recommended value is 0.2)
  • path to the quadnet model file

  # Call analyse() on the instance created above.
  result = model.analyse(
      sequence='ATTAATACTTTTAACAATTGTAGTATATAAAAAAGGGAGTAACC...',
      model_path='/path/to/quadnet_model.txt',
      score_threshold=0.1
  )

Results are returned in a form that can be loaded directly into a pandas DataFrame.

  import pandas as pd

  df = pd.DataFrame(result)

|   | index | position | sequence | length |
|---|-------|----------|----------|--------|
| 0 | 0 | 907 | GCAACAATGGCTGATCCAGAAGGTACAGACGGGGAGGGCACGGGTTGTAACGGCTGGTTTTATGTACAAGCTATTGTAGACAAAAAAACAGGAGATGTAATATCA | 105 |
| 1 | 1 | 1184 | GAGGCAGCACAGAAAACAGTCCATTAGGGGAGCGGCTGGAGGTGGATACAGAGTTAAGTCCACGGTTACAAGAAATATCTTTAAATAGTGGGCAGA | 96 |
| 2 | 2 | 1389 | ATGTAGTGGCGGCAGTACGGAGGCTATAGACAACGGGGGCACAGAGGGCAACAACAGCAGTGTAGACGGTACAAGTGACAATAGCAATATAGAAAATGTAAATCCAC | 107 |
| 3 | 3 | 1635 | AGATTGGGTTACAGCTATATTTGGAGTAAACCCAACAATAGCAGAAGGATTTAAAACACTAATACAGCCATTTAT | 75 |
| 4 | 4 | 2229 | AATAGATGAAGGGGGAGATTGGAGACCAATAGTGCAATTCCTGCGATACCAACAAATAGAGTTTATAACATTTTTAG | 77 |
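
To work with individual hits, it can help to turn a row into explicit coordinates. The sketch below assumes that `position` is the start offset of the hit in the input sequence and that `length` counts bases. Neither is documented here, and whether `position` is 0-based or 1-based is not stated. The helper `hit_span` and the row literal are illustrative only.

```python
from typing import Dict, Tuple, Union

Row = Dict[str, Union[int, str]]


def hit_span(row: Row) -> Tuple[int, int]:
    """Return (start, end) for a hit, with `end` exclusive.

    Assumes `position` is the start offset and `length` is the number of bases.
    """
    start = int(row["position"])
    return start, start + int(row["length"])


# One row copied from the sample output above.
row: Row = {
    "index": 3,
    "position": 1635,
    "sequence": "AGATTGGGTTACAGCTATATTTGGAGTAAACCCAACAATAGCAGAAGGATTTAAAACACTAATACAGCCATTTAT",
    "length": 75,
}

print(hit_span(row))  # (1635, 1710)
```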

Model scheme

[Figure: Quadtree model scheme, laid out left to right]

Training parameters

These parameters were used to train the LightGBM model:

| LGBM Classifier | value |
|-----------------|-------|
| colsample bytree | 0.817574864502621 |
| learning rate | 0.03744835808549148 |
| max bin | 127 |
| min child sample | 3 |
| number of estimators | 1000 |
| number of leaves | 74 |
| regularization alpha | 0.0033803043003857677 |
| regularization lambda | 0.7013136087939289 |
| objective | binary |
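
For anyone who wants to reproduce a similar setup, the values above can be collected into keyword arguments for LightGBM's scikit-learn style classifier. This is a minimal sketch: the mapping from the row labels to keyword names (for example "min child sample" to `min_child_samples`) is an assumption, and `build_classifier` is a hypothetical helper, not part of this repository.

```python
# Values copied from the training parameters listed above. The keyword names
# are an assumed mapping onto LightGBM's scikit-learn API, not confirmed here.
LGBM_PARAMS = {
    "colsample_bytree": 0.817574864502621,
    "learning_rate": 0.03744835808549148,
    "max_bin": 127,
    "min_child_samples": 3,
    "n_estimators": 1000,
    "num_leaves": 74,
    "reg_alpha": 0.0033803043003857677,
    "reg_lambda": 0.7013136087939289,
    "objective": "binary",
}


def build_classifier():
    """Create an untrained classifier with the parameters above.

    lightgbm is imported lazily so the dictionary can be inspected
    even when the library is not installed.
    """
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**LGBM_PARAMS)
```

Calling `build_classifier().fit(X, y)` on encoded windows and binary labels would then train a model with these settings. The training data itself is not described in this README.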

License

This project is licensed under the MIT License - see the LICENSE file for details.
