Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input data metrics explanation, % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA_follow up #222

Open
AliBasuony2022 opened this issue Feb 6, 2024 · 8 comments

Comments

@AliBasuony2022
Copy link

Dear Nicolas,

Just a follow up question for the issue #216

How do I know the right % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA? I'm doing a comparison between the performance of NOVOPlasty and other de novo assemblies and this information is so important.

When I used adifferent memmory settings (all other settings are fixed), I have got the same lenght of the largest contig, but with differnt number for assembled reads, aligned and total reads.

Does the subsampled fraction: 99.99 % when setting the Max memory= Null is right? if so, the number of total reads is over the number of reads in the raw data. I'm still confused, sorry.

max memory Null
log_mito_1_375_12_6_max memory Null.txt

max memory 100
log_mito_1_375_12_3_max memory 100.txt

memory 64
log_mito_1_375_12_max memory 64.txt

Kind regards,
Ali

@ndierckx
Copy link
Owner

ndierckx commented Feb 9, 2024

Not sure what your question is, the 99,99% is the amount of reads used, without the max memory it is usually 100%, but sometimes the reads are too short or have missing read pairs and then get discarded. So no subsampling was done in this case

Forward reads without pair: 41965
Reverse reads without pair: 17442

This means some reads had not matching pair...

How many reads are there in total, less than than 434743096?

@AliBasuony2022
Copy link
Author

Hi Nicolas,

raw data: 218,591,573 read pairs (adapter-free)
trimmed data: 216,237,628 read pairs (adapter and low qulaity reads free).

Both raw and trimmed data produced the same largest contig,16521 bp, but % of organelle genome is 0.43 for raw and 0.44 for trimmed data. see below log files

log_mito_1_375_raw data.txt
log_mito_1_375_trimmed data.txt

Is it expected that % of recovered organelle genome in case if NOVOPlasty overcomes that from reference-based ones (% 0.39)?

Below is the pipeline for reference-based approach:

Approaches 3 and 4: Reference-based read mapping was performed using two different parameter settings (see below), by aligning the trimmed sequencing reads against the V. vulpes reference genome (assembly version: GCF_003160815.1_VulVul2.2; Kukekova et al., 2018) using BWA-MEM v0.7.17 (Li and Durbin, 2009) with default parameters. We then used SAMtools v1.10 (Li et al., 2009) to obtain sorted bam files, followed by using GATK v4.2.2.0 (https://gatk.broadinstitute.org/hc/en-us) to remove PCR duplicates using MARKDUPLICATESSPARK and to filter out bad read mates, reads with mapping quality zero and reads which mapped ambiguously (Nater et al., 2017), and again SAMtools to extract the mitochondrial reads that mapped to the mtDNA scaffold (NC_008434, Arnason et al. 2006) of the reference genome. HAPLOTYPECALLER in GATK was used to call variants, using two different parameter settings, using as values for the flag --sample-ploidy: 1 for haploid (ploidy 1; Approach 3), and 2 for diploid (ploidy 2; Approach 4), each yielding a separate VCF file. Finally, FastaAlternateReferenceMaker from GATK was used to convert the two VCF files from Approaches 3 and 4 to FASTA format.

Kind regards,
Ali

@ndierckx
Copy link
Owner

ndierckx commented Feb 9, 2024

Do you have a problem with the 0.01 difference? :)
If you subsample there will be some fluctuations and it is just an estimation. If your other method gave 0.39%, then I have to say the estimation is pretty good, what does it matter if it is 0.39 or 0.43 ...

@AliBasuony2022
Copy link
Author

No, I don't have a problem with this difference- I'm joust surperising how de novo (NOVOPlasty) produce more % of mtDNA reads than reference-based approach.

I'm doing a comparison between the different assemblies, and I should give some explecit recommondation for the performence of each method, hene any diffrence even if it's samll is importatnt.

I think, we must leave max memory empty, thus, it will not subsample and then use all the reads, is this right?

I apprecite your time and help.

Regards,
Ali

@ndierckx
Copy link
Owner

It is a de novo assembler, I didn't put much effort in those statistics, I just added them because I thought they could be informative for the users, but it is still an estimation. Does the mt% have to be that accurate? Max memory setting is to reduce memory usage and increase the speed, the assembly will still be as good, because you have very high coverage. Maybe %mt will be a bit less accurate, but I thought nobody would care about that.. And in your case you have some small contigs, this could explain the elevated %, because the reads used for assembling those are also included and maybe they are not mitochondrial. Although it is more probable that the assembly is incomplete because of some repetitive region in the control region, in those cases you often see those small contigs. Other graph based assemblers will collapse that region, which seems that the assembly is complete but it isn't. Hence, a lot of mt genomes on NCBI are not fully assembled (collapsed repetitive regions)... You can send me your log files and contigs, I can have a look if this is the case

@AliBasuony2022
Copy link
Author

Thanks for this explanation, below are the log files

But, you didn't response to this question, yet. " I think, we must leave max memory empty, thus, it will not subsample and then use all the reads, is this right?"

log_mito_1_375_12_3_max memory 100.txt
log_mito_1_375_12_max memory 64.txt
log_mito_1_375_12_6_max memory Null.txt

@ndierckx
Copy link
Owner

If max memory is empty it doesn't subsample, but not sure what you mean by we must.., you can do whatever you prefer

And I need the extended log to see, but since you didn't use that option you can send me the Contigs fiel, probaply can see it from there

@AliBasuony2022
Copy link
Author

Ok, I meant by "must." to get use of all the reads, then the estimation of recovered mtDNA reads will be an accuarte than when you subsample.

Below is the extended log files

log_extended_mito_1_375_12_3_max memory 100.txt
log_extended_mito_1_375_12_max memory 64.txt

log_extended_mito_1_375_12_6_max memory Null.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants