ETL of the workflow to create a mutation-centric index in Elasticsearch 6.7.x

See clin-mutation-centric.json for the index definition.

= 0..1 > mutation = 0..n > donor

The donors field is of the Elasticsearch "nested" type.

For description of the annotations

ElasticSearch Set-up

To create the 'mutations' index, run the following commands from a host where Elasticsearch is reachable (through an ssh -L tunnel or an ssh session into the environment).

The DELETE calls are needed because DevOps owns the Elasticsearch instance and pre-configures settings that make index creation fail.

curl -XDELETE "http://localhost:9200/mutations"
curl -XDELETE "http://localhost:9200/genes"
curl -XPUT "http://localhost:9200/genes" -H 'Content-Type: application/json' -d @clin-genes-centric.json
curl -XPUT "http://localhost:9200/mutations" -H 'Content-Type: application/json' -d @clin-mutation-centric.json

ETL Genomic Algorithm

Initiate Spark
Open VCF and split into partitions
For each partition:
    Read a line / Prepare ES payload from line
        clinical data fetching (per specimen/donor)
        Read the VCF lines
        Read the VEP annotations (comma; pipe; &)
        external db analysis
        transcripts grouping (gene, aaChange, consequence, CDNAChange, strand)
        VEP impact scoring 0-4
        Gene analysis (cellbase/Redis)
        donor/specimen analysis (FHIR)
        internal & laboratory frequencies
        family & transmission analysis (AD, AR, DeNovo, XD, XR)
    Group the payloads into batches of ES transactions (batch size set by parameters)
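
The three separators named in the VEP annotation step (comma; pipe; &) can be illustrated with a minimal sketch. It assumes the annotation string follows the usual VEP CSQ layout: entries separated by commas, fields inside an entry separated by pipes, and multiple values inside a field joined by "&". The function name parse_csq and the four-field list are hypothetical; the real field order depends on how VEP was run and is not given in this document.

```python
def parse_csq(csq, fields):
    """Split a VEP-style annotation string into one dict per entry."""
    entries = []
    for entry in csq.split(","):        # comma: one entry per annotated feature
        values = entry.split("|")       # pipe: one value per field
        record = {}
        for name, value in zip(fields, values):
            # '&' joins several values inside a single field; empty field -> []
            record[name] = value.split("&") if value else []
        entries.append(record)
    return entries


# Hypothetical field order, for illustration only.
FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL"]
csq = "T|missense_variant&splice_region_variant|MODERATE|BRCA2,T|intron_variant|MODIFIER|"
parsed = parse_csq(csq, FIELDS)
```

Each dict in parsed maps a field name to a list of values, so a field carrying several consequences stays grouped with its entry.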
“benign”, “possibly damaging”, “probably damaging”

  • B : benign
  • P : possibly damaging
  • D : probably damaging


  • N : predicted N(eutral)
  • D : predicted D(eleterious)
  • U : U(nknown)


  • T : T(olerated)
  • D : D(amaging)


Frequencies are computed for the internal cohort and by laboratory (LDx), and eventually by study

  • Any result with a dot (.), i.e. ./. or 1/., is discarded
  • PN is the number of patients that have a mutation on the allele in question (1/0, 0/1, 1/1)
  • AC is the count of mutated alleles found (1 only)
  • AN is the total number of alleles found (0 or 1)
  • HC is the total number of homozygous individuals (1/1; 1|1)
  • AF = AC / AN

Nota bene

  • The VCFs are normalized, so a patient's genotype no longer contains values greater than 1
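
As an illustration of the definitions above, here is a minimal sketch that computes PN, AC, AN, HC and AF from a list of genotype strings. The function name and the choice to return None for AF when AN is zero are assumptions; the actual ETL code may differ.

```python
def allele_frequencies(genotypes):
    """Compute PN, AC, AN, HC and AF from genotype strings such as '0/1' or '1|1'."""
    pn = ac = an = hc = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue                                  # ./. or 1/. is discarded
        alt = sum(1 for a in alleles if a == "1")     # mutated alleles (1 only)
        an += sum(1 for a in alleles if a in ("0", "1"))
        ac += alt
        if alt > 0:
            pn += 1                                   # patient carries the mutation
        if alleles == ["1", "1"]:
            hc += 1                                   # homozygous individual
    af = ac / an if an else None                      # assumption: no AF when AN == 0
    return {"PN": pn, "AC": ac, "AN": an, "HC": hc, "AF": af}


stats = allele_frequencies(["0/1", "1/1", "0/0", "./.", "1/.", "1|1", "1/0"])
```

In this example the two genotypes containing a dot are skipped, so the five remaining patients contribute ten alleles to AN.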

Donor annotations

  • adAlt : Allelic depths for the alt alleles
  • adTotal : Total Allelic depths for the ref and alt alleles
  • gq : Genotype Quality (integer)
  • gt : Genotype
  • adFreq : Ratio between adAlt and adTotal
  • qd : Variant Confidence/Quality by Depth (float)
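
The relation between adAlt, adTotal and adFreq can be sketched as follows. This assumes the allelic depths arrive as a comma-separated "ref,alt" string, as in the VCF AD field; the function name and the None returned for a zero total depth are assumptions.

```python
def donor_depth_annotations(ad):
    """Derive adAlt, adTotal and adFreq from an AD string such as '12,8'."""
    depths = [int(d) for d in ad.split(",")]
    ad_alt = sum(depths[1:])          # depths of the alt allele(s)
    ad_total = sum(depths)            # ref + alt depths
    ad_freq = ad_alt / ad_total if ad_total else None  # assumption: None when depth is 0
    return {"adAlt": ad_alt, "adTotal": ad_total, "adFreq": ad_freq}


annotations = donor_depth_annotations("12,8")
```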


  • Ensembl_transcriptid (from dbNSFP) : Ensembl transcript ids (multiple entries separated by ";")
  • FeatureId (from VEP) : Feature, the Ensembl stable ID of the feature


  • Feb 21, 2020: we do not keep the ensemblTranscriptId coming from vep/dbNSFP; decision made by Vincent on Slack. Remove ensemblTranscriptID in the next version of the index (to check). Since we do not use the Picked field, we should remove it. We will try to trim the index a bit

  • March 24, 2020:

Alex DL  10:44 AM
I confirm with you: X-linked dominant applies to girls only
I updated two transmissions in my file, requiring that they apply to girls only
(“0/1”, “0/0”, “0/1”) -> x_linked_dominant [if female proband with affected mother and unaffected father]
(“0/1”, “0/1”, “0/1”) -> x_linked_dominant [if female proband with both parents affected]
So boys can only be recessive on the X

Alex DL  10:51 AM
Yes, I had a good discussion with Fadi about this. We need to distinguish between the transmission of the disease and that of the genotype.

Alex DL  10:51 AM
A disease can be classified as recessive but be transmitted in a dominant fashion
in short, on our side, for now we categorize genotype transmissions
since boys have only one copy of the X, it is therefore treated as recessive (as if both alleles were affected)
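
The X-linked rules from the discussion above can be sketched as follows. The trio order (proband, father, mother) is inferred from the first rule; the function name and the "x_linked_recessive" label are hypothetical. Labelling every male proband that carries the alternate allele as recessive, whatever the parental genotypes, is a simplification, since the full transmission table is not given here.

```python
# Trios (proband, father, mother) listed in the discussion as X-linked dominant.
X_LINKED_DOMINANT_TRIOS = {
    ("0/1", "0/0", "0/1"),   # affected mother, unaffected father
    ("0/1", "0/1", "0/1"),   # both parents affected
}


def x_linked_transmission(proband_sex, proband_gt, father_gt, mother_gt):
    """Return a genotype transmission label for a variant on the X chromosome."""
    trio = (proband_gt, father_gt, mother_gt)
    if proband_sex == "female" and trio in X_LINKED_DOMINANT_TRIOS:
        return "x_linked_dominant"
    carries_alt = "1" in proband_gt.replace("|", "/").split("/")
    if proband_sex == "male" and carries_alt:
        # Boys have a single copy of the X, so the genotype is treated as recessive.
        return "x_linked_recessive"   # hypothetical label
    return None                       # no X-linked rule from the discussion applies
```

For example, x_linked_transmission("female", "0/1", "0/0", "0/1") yields "x_linked_dominant", while the same trio with a male proband yields "x_linked_recessive".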

To run the ETL

To compile and build runtime:

mvn clean install

Step 0a cellbase indexing (only once)

To run the ETL for cellbase, make sure cellbase is available (port 6379):

java -jar target/ExtractTLoad-1.0-SNAPSHOT-jar-with-dependencies.jar Homo_sapiens.gene_info.txt 9201

Step 1a edit file if necessary

The default value is:
annotationTool=VEP 97

Step 1b indexing

To run the ETL on VCFs extracted into column-delimited files (the corresponding pedigree file must be available):

~/bin/spark-2.4.3/bin/spark-submit --class org.chusj.VEPSparkDriverProgram --deploy-mode client --master 'local[*]' \
target/ExtractTLoad-1.0-SNAPSHOT-jar-with-dependencies.jar vcf.txt true local 'local[12]' 12g 12 51 pedigreeTest1.ped 9201

Step 2 Exomiser

To index the Exomiser report for a proband:

~/bin/spark-2.4.3/bin/spark-submit --class org.chusj.ExomiserETL --deploy-mode client --master 'local[*]' \
target/ExtractTLoad-1.0-SNAPSHOT-jar-with-dependencies.jar exomiser/FAM_C3_92.json SP00011 6 45 9201