Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abundance | samtools view: failed to add PG line to the header #164

Open
kunaljaani opened this issue Jul 1, 2024 · 2 comments
Open

abundance | samtools view: failed to add PG line to the header #164

kunaljaani opened this issue Jul 1, 2024 · 2 comments
Assignees
Labels
question Further information is requested

Comments

@kunaljaani
Copy link

kunaljaani commented Jul 1, 2024

Hi! Francisco
While running the -t abundance I am running into a problem with the sam to bam conversation due to a duplicate header. When I checked the grep "NODE_1753_length_683_cov_1.602310" FDSGM6.sam -c I found more than one occurrence.

Converting SAM to BAM with samtools view ...
[W::sam_hdr_create] Duplicated sequence "NODE_1753_length_683_cov_1.602310" in file "FDSGM6.sam"
[E::sam_hrecs_update_hashes] Duplicate entry "NODE_1753_length_683_cov_1.602310" in sam header
samtools view: failed to add PG line to the header

I am also surprised to see that 2 of 35 samples ran fine. and if there was some issue it would have created problems for all the samples. So I am not able to figure out the issue. Could you please suggest a solution?

Thank you
Kunal

@kunaljaani kunaljaani added the question Further information is requested label Jul 1, 2024
@franciscozorrilla
Copy link
Owner

Hi Kunal, thanks again for reporting this issue and apologies for my delay in responding.

I believe that this problem occurs due to the fact that contig headers are not guaranteed to be unique across MAGs. It is unlikely, but evidently it is possible that two different contigs are assembled with the same node_length_coverage values across different genomes from the same sample, in which case you will see those errors you have reported.
I will have to think about the best way to deal with this in the future, but for now you can simply add a unique MAG ID at the start of each contig header when concatenating MAGs to avoid this issue.

metaGEM/workflow/Snakefile

Lines 1066 to 1067 in 57cfd60

echo -e "\nConcatenating all bins into one FASTA file ... "
cat *.fa > $(basename {output}).fa

Instead of using cat, use a loop to append MAG ID at the start of each contig e.g. assuming your mags for a given sample are in subfolder mags with .fa extension:

ls mags|grep ".fa"|while read genome;do paste mags/$genome| sed "s/>/>${genome%.*}_/g";done > sample.fa

Best wishes,
Francisco

@kunaljaani
Copy link
Author

Hi Francisco,

Thank you very much for your reply and the detailed explanation.
Thanks a lot
Kunal

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants