Discussion on Masking the regions with coverage less than 20X #35

Open
Rohit-Satyam opened this issue Nov 10, 2022 · 6 comments

Comments

Rohit-Satyam commented Nov 10, 2022

Dear @priesgo

While going through the pipeline, I realized that VCF2FASTA does not mask regions of low coverage. In amplicon sequencing data there can be regions deprived of reads. I think it makes sense to mask regions covered by fewer than 20 reads with Ns, since we cannot say anything about the variant status there; leaving the reference sequence in such regions wrongly suggests that there is no variant at those positions.
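
Just to illustrate what I mean, here is a rough sketch of how the masking could be plugged in. This is not the pipeline's actual code: the process name, the inputs, params.reference and the 20-read cutoff are only illustrative, and it assumes bedtools and bcftools are on the PATH and the VCF is bgzipped and indexed.

```groovy
process VCF2FASTA_MASKED {

    input:
    tuple val(name), path(bam), path(vcf), path(vcf_tbi)

    output:
    path("${name}.consensus.fasta")

    script:
    """
    # sketch only: intervals covered by fewer than 20 reads (zero-coverage gaps included)
    bedtools genomecov -ibam ${bam} -bga | awk '\$4 < 20' | cut -f1-3 > low_coverage.bed

    # apply the variants and hard-mask the low-coverage intervals with Ns
    bcftools consensus -f ${params.reference} -m low_coverage.bed ${vcf} > ${name}.consensus.fasta
    """
}
```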

Besides, could you enable the pipeline to also copy these FASTA assemblies to the results directory? I tried using --keep_intermediate but it didn't help. I can see there is no publishDir "${params.output}", mode: "copy" for the VCF2FASTA process.
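
In case it helps, I imagine it would only need the directive added to the process header, something like this (sketch only, the rest of the process left unchanged):

```groovy
process VCF2FASTA {

    publishDir "${params.output}", mode: "copy"

    // ... existing inputs, outputs and script unchanged
}
```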

EDIT 1: Also, I think the code takes the normalized VCFs and, without filtering out the low frequency and subclonal variants, includes all of them in the assembly. Is that assembly good for GISAID submission? I ask because your documentation says "Other low quality mutations are removed from the output."
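
For what it's worth, if the low frequency and subclonal calls end up flagged in the FILTER column (I am only assuming that from the documentation, so treat this as a sketch with a made-up process name), restricting the consensus input to PASS records could be a small extra step:

```groovy
process FILTER_PASS {

    input:
    tuple val(name), path(vcf), path(vcf_tbi)

    output:
    tuple val(name), path("${name}.pass.vcf.gz"), path("${name}.pass.vcf.gz.tbi")

    script:
    """
    # illustrative only: keep PASS calls before they reach the consensus step
    bcftools view -f PASS -Oz -o ${name}.pass.vcf.gz ${vcf}
    tabix -p vcf ${name}.pass.vcf.gz
    """
}
```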

@Rohit-Satyam
Author

I came across a recent collaborative preprint by NCBI and several other institutes on best practices for generating consensus assemblies (see here). They recommend the following protocol:

[screenshot of the recommended protocol from the preprint]

@Rohit-Satyam
Author

I also observed that you call variants based only on VAF and not on allele depth. For example, a variant might sit at a position covered by just 10 reads, of which 9 support its presence. In that case the VAF will be 0.9, but the depth at that position is not enough to make the call with confidence.

Another example from real data

MN908947.3	4184	.	G	A	42.872	PASS	DP=69;DPS=30,39;Pool=2;vafator_af=0.88406;vafator_ac=61;vafator_dp=69;vafator_eaf=0.5;vafator_pu=1;vafator_pw=1;vafator_k=4;vafator_bq=4.5,13;vafator_mq=60,60;vafator_pos=182,175;vafator_rsmq=-0.055;vafator_rsmq_pv=0.95646;vafator_rsbq=2.088;vafator_rsbq_pv=0.03677;vafator_rspos=0.491;vafator_rspos_pv=0.62317	GT:GQ	1:43

Here we have DP=69, so if we were to set a depth cutoff of 100 reads (as suggested by the paper above) alongside a VAF cutoff, this variant would not pass.
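
To make it concrete, a depth-aware filter on top of the vafator annotations could be as simple as the sketch below (the process name is made up, the 100x cutoff is the one suggested by the paper above, and the 0.8 VAF cutoff is only a placeholder). Under these cutoffs the DP=69 record above would be dropped:

```groovy
process FILTER_LOW_SUPPORT {

    input:
    tuple val(name), path(vcf), path(vcf_tbi)

    output:
    tuple val(name), path("${name}.filtered.vcf.gz")

    script:
    """
    # illustrative only: exclude calls whose depth or allele frequency falls below the chosen cutoffs
    bcftools view -e 'INFO/vafator_dp < 100 || INFO/vafator_af < 0.8' -Oz -o ${name}.filtered.vcf.gz ${vcf}
    """
}
```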

priesgo commented Nov 18, 2022

I've just added the FASTA sequences to the output in #37.

I will follow up on the rest; the filtering by depth requires a thorough analysis on our side. The whole dataset of mutations in tabular format is publicly available at https://covigator.tron-mainz.de/download/variant_observation.csv.gz. It contains the VAF, DP and AC annotations for every variant call, so the impact of such filtering could be evaluated there.

@Rohit-Satyam
Author

Hi @priesgo. Were you able to check the impact of such filtering? I realized that using the criteria INFO/vafator_af < 0.5 || INFO/vafator_dp < 100 || INFO/vafator_ac < 50 affects clade assignment in some samples and filters out a lot of reversions (private mutations) according to Nextclade.

priesgo commented Jul 21, 2023

Apologies for the ultra-late response @Rohit-Satyam, we have not looked into this yet. We will eventually get to it.

FYI @johausmann

@corneliusroemer

Stumbled upon this here. As a consensus sequence consumer, I'm very aware that some pipelines have a strong tendency to erroneously call the reference allele where, based on phylogenetic placement, we know that is unlikely to be the case.

What @Rohit-Satyam describes appears to be a possible mechanism behind that.

@priesgo do you have a rough indication of how many sequences in GISAID were produced with this pipeline? Are there some labs that you know are using it? Unfortunately, submitters don't usually point out the pipeline they used.
