Discussion on Masking the regions with coverage less than 20X #35

Open
Rohit-Satyam opened this issue Nov 10, 2022 · 6 comments

Comments

Rohit-Satyam commented Nov 10, 2022

Dear @priesgo

While going through the pipeline, I realized that VCF2FASTA does not mask regions of low coverage. In amplicon sequencing data there can be regions deprived of reads. I think it makes sense to mask regions covered by fewer than 20 reads with Ns, since we cannot say anything about the variant status there; leaving the reference sequence in such regions wrongly suggests that there is no variant at those positions.
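
Just to illustrate what I mean, here is a rough sketch of how the masking could be plugged in. This is not the pipeline's actual code: the process name, the inputs, params.reference and the 20-read cutoff are only illustrative, and it assumes bedtools and bcftools are on the PATH and the VCF is bgzipped and indexed.

```groovy
process VCF2FASTA_MASKED {

    input:
    tuple val(name), path(bam), path(vcf), path(vcf_tbi)

    output:
    path("${name}.consensus.fasta")

    script:
    """
    # sketch only: intervals covered by fewer than 20 reads (zero-coverage gaps included)
    bedtools genomecov -ibam ${bam} -bga | awk '\$4 < 20' | cut -f1-3 > low_coverage.bed

    # apply the variants and hard-mask the low-coverage intervals with Ns
    bcftools consensus -f ${params.reference} -m low_coverage.bed ${vcf} > ${name}.consensus.fasta
    """
}
```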

Besides, could you enable the pipeline to also copy these FASTA assemblies to the results directory? I tried using --keep_intermediate but it didn't help. I can see there is no publishDir "${params.output}", mode: "copy" for the VCF2FASTA process.
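
In case it helps, I imagine it would only need the directive added to the process header, something like this (sketch only, the rest of the process left unchanged):

```groovy
process VCF2FASTA {

    publishDir "${params.output}", mode: "copy"

    // ... existing inputs, outputs and script unchanged
}
```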

EDIT 1: Also, I think the code takes the normalized VCFs and, without filtering out the low frequency and subclonal variants, includes all of them in the assembly. Is that assembly good for GISAID submission? I ask because your documentation says "Other low quality mutations are removed from the output."
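
For what it's worth, if the low frequency and subclonal calls end up flagged in the FILTER column (I am only assuming that from the documentation, so treat this as a sketch with a made-up process name), restricting the consensus input to PASS records could be a small extra step:

```groovy
process FILTER_PASS {

    input:
    tuple val(name), path(vcf), path(vcf_tbi)

    output:
    tuple val(name), path("${name}.pass.vcf.gz"), path("${name}.pass.vcf.gz.tbi")

    script:
    """
    # illustrative only: keep PASS calls before they reach the consensus step
    bcftools view -f PASS -Oz -o ${name}.pass.vcf.gz ${vcf}
    tabix -p vcf ${name}.pass.vcf.gz
    """
}
```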

@Rohit-Satyam
Author

I came across a recent collaborative preprint by NCBI and several other institutes on best practices for generating consensus assemblies (see here). They recommend the following protocol:

[screenshot of the recommended protocol from the preprint]

@Rohit-Satyam
Author

I also observed that you call variants based only on VAF and not on allele depth. For example, a variant might sit at a position covered by just 10 reads, of which 9 support its presence. In that case the VAF will be 0.9, but the depth at that position is not enough to make the call with confidence.

Another example from real data

MN908947.3	4184	.	G	A	42.872	PASS	DP=69;DPS=30,39;Pool=2;vafator_af=0.88406;vafator_ac=61;vafator_dp=69;vafator_eaf=0.5;vafator_pu=1;vafator_pw=1;vafator_k=4;vafator_bq=4.5,13;vafator_mq=60,60;vafator_pos=182,175;vafator_rsmq=-0.055;vafator_rsmq_pv=0.95646;vafator_rsbq=2.088;vafator_rsbq_pv=0.03677;vafator_rspos=0.491;vafator_rspos_pv=0.62317	GT:GQ	1:43

Here we have DP=69, so if we were to set a depth cutoff of 100 reads (as suggested by the paper above) alongside a VAF cutoff, this variant would not pass.
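
To make it concrete, a depth-aware filter on top of the vafator annotations could be as simple as the sketch below (the process name is made up, the 100x cutoff is the one suggested by the paper above, and the 0.8 VAF cutoff is only a placeholder). Under these cutoffs the DP=69 record above would be dropped:

```groovy
process FILTER_LOW_SUPPORT {

    input:
    tuple val(name), path(vcf), path(vcf_tbi)

    output:
    tuple val(name), path("${name}.filtered.vcf.gz")

    script:
    """
    # illustrative only: exclude calls whose depth or allele frequency falls below the chosen cutoffs
    bcftools view -e 'INFO/vafator_dp < 100 || INFO/vafator_af < 0.8' -Oz -o ${name}.filtered.vcf.gz ${vcf}
    """
}
```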

priesgo commented Nov 18, 2022

I've just added the FASTA sequences to the output in #37.

I will follow up on the rest; the filtering by depth requires a thorough analysis on our side. The whole dataset of mutations in tabular format is publicly available at https://covigator.tron-mainz.de/download/variant_observation.csv.gz. It contains the VAF, DP and AC annotations for every variant call, so the impact of such filtering could be evaluated there.

@Rohit-Satyam
Author

Hi @priesgo. Were you able to check the impact of such filtering? I realized that using the criteria INFO/vafator_af < 0.5 || INFO/vafator_dp < 100 || INFO/vafator_ac < 50 affects clade assignment in some samples and filters out a lot of reversions (private mutations) according to Nextclade.

priesgo commented Jul 21, 2023

Apologies for the ultra-late response @Rohit-Satyam, we have not looked into this yet. We will eventually get to it.

FYI @johausmann

@corneliusroemer

Stumbled upon this here. As a consensus sequence consumer, I'm very aware that some pipelines have a strong tendency to erroneously call the reference allele where, based on phylogenetic placement, we know that is unlikely to be the case.

What @Rohit-Satyam describes appears to be a possible mechanism behind that.

@priesgo do you have a rough indication of how many sequences in GISAID were produced with this pipeline? Are there some labs that you know are using it? Unfortunately, submitters don't usually point out the pipeline they used.
