Question about SAM output tags #25

Closed
npavlovikj opened this issue Sep 9, 2017 · 22 comments

@npavlovikj

Hi,

I generated a SAM file using minimap2 and noticed some additional tags in the output compared to the standard SAM format, e.g. "tp:A:P cm:i:8 s1:i:46 s2:i:0 NM:i:264 ms:i:194 AS:i:194 nn:i:0". Could you tell me what the tags "tp", "cm", "ms" and "nn" mean?
Next, I want to extract the number of matches and mismatches from the CIGAR string. However, the SAM file contains only the M operation, which stands for both matches and mismatches, so I was wondering whether any of these additional tags can help me differentiate between the two. If not, can you please suggest a way to do that?

Thank you,
Natasha

@lh3
Owner

lh3 commented Sep 9, 2017

For tags, see the table at the bottom of man ./minimap2.1.

NM gives the number of mismatches and gaps. You can count the number of gaps from CIGAR. The difference gives the number of mismatches.

@npavlovikj
Author

Thank you @lh3 , that worked perfectly!

lh3 closed this as completed Sep 10, 2017
lh3 added the question label Oct 12, 2017
@Blosberg

Blosberg commented Nov 11, 2017

Hi Dr. Li:

You can count the number of gaps from CIGAR

I initially assumed this meant N from the CIGAR. However, when I took NM-N I obtained an array with negative values, so clearly I was mistaken. I am now assuming that you are referring to D from the CIGAR ('Deletion', in which the nucleotide is present in the reference but not in the read).

This would imply that # mismatches = NM-D

If that's incorrect, could you please set me straight? Thank you.

answered below: # mismatches = NM-D-I-"ambiguous bases"

@npavlovikj
Author

@Blosberg, according to my understanding, the number of gaps is the number of insertions and deletions from the CIGAR output, so I ended up using #mismatches=NM-I-D.

@lh3
Owner

lh3 commented Nov 11, 2017

I forgot what the SAM spec says, but for minimap2:

NM = #mismatches + #I + #D + #ambiguous_bases

NM does not count reference skip N.
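
The relation above can be rearranged to recover the mismatch count from a SAM record. The following is a minimal Python sketch, not part of minimap2. It assumes that #I and #D are the total numbers of inserted and deleted bases (the summed lengths of the I and D operations in the CIGAR), and that the ambiguous-base count is taken from the nn tag. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_mismatches(cigar, nm, n_ambiguous=0):
    """Rearrange NM = mismatches + I + D + ambiguous to get mismatches.

    Assumption: I and D are counted in bases (summed operation lengths).
    """
    inserted = deleted = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "I":
            inserted += int(length)
        elif op == "D":
            deleted += int(length)
        # N (reference skip) is ignored on purpose: NM does not count it.
    return nm - inserted - deleted - n_ambiguous


def mismatches_from_sam_line(line):
    """Read CIGAR (column 6) plus the NM and nn tags from one SAM line."""
    fields = line.rstrip("\n").split("\t")
    cigar = fields[5]
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return count_mismatches(cigar, int(tags["NM"]), int(tags.get("nn", 0)))
```

For example, a record with CIGAR 10M2I5M3D20M and NM:i:7 nn:i:0 would give 7 - 2 - 3 - 0 = 2 mismatches under these assumptions.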

@Blosberg

@npavlovikj , @lh3 ,
ok. Thank you both very much for your help.

@tseemann
Contributor

For people who encounter this issue via Google in the future, here is the table of tags as of minimap2 2.10:

                        ┌────┬──────┬───────────────────────────────────────────────────────┐
                        │Tag │ Type │                      Description                      │
                        ├────┼──────┼───────────────────────────────────────────────────────┤
                        │ tp │  A   │ Type of aln: P/primary, S/secondary and I,i/inversion │
                        │ cm │  i   │ Number of minimizers on the chain                     │
                        │ s1 │  i   │ Chaining score                                        │
                        │ s2 │  i   │ Chaining score of the best secondary chain            │
                        │ NM │  i   │ Total number of mismatches and gaps in the alignment  │
                        │ MD │  Z   │ To generate the ref sequence in the alignment         │
                        │ AS │  i   │ DP alignment score                                    │
                        │ ms │  i   │ DP score of the max scoring segment in the alignment  │
                        │ nn │  i   │ Number of ambiguous bases in the alignment            │
                        │ ts │  A   │ Transcript strand (splice mode only)                  │
                        │ cg │  Z   │ CIGAR string (only in PAF)                            │
                        │ cs │  Z   │ Difference string                                     │
                        └────┴──────┴───────────────────────────────────────────────────────┘
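
Each tag in the table is written as TAG:TYPE:VALUE. In SAM the tags start at column 12, in PAF at column 13. As an illustration only, here is a small Python sketch that turns such fields into a dictionary, converting type i to int and type f to float and leaving A and Z as strings. The helper name parse_tags is hypothetical, and the input is the tag string quoted in the first post.

```python
def parse_tags(fields):
    """Convert TAG:TYPE:VALUE strings into a {tag: value} dictionary."""
    converters = {"i": int, "f": float}  # A and Z stay as plain strings
    tags = {}
    for item in fields:
        # Split at most twice so that Z values containing ':' stay intact.
        tag, type_code, value = item.split(":", 2)
        tags[tag] = converters.get(type_code, str)(value)
    return tags


example = "tp:A:P cm:i:8 s1:i:46 s2:i:0 NM:i:264 ms:i:194 AS:i:194 nn:i:0"
tags = parse_tags(example.split())
print(tags["tp"], tags["NM"], tags["AS"])
```

In a real SAM file the fields are tab-separated, so one would pass line.rstrip("\n").split("\t")[11:] instead of the whitespace-split example string.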

@armintoepfer
Contributor

Tag dv:f:

Approximately estimate per-base sequence divergence (i.e. 1-identity) without performing base-level alignment, using a MashMap-like method. The estimate is written to a new dv:f tag.

lh3 added a commit that referenced this issue Apr 24, 2018
Also added asm20 to command line help (#151)
@lh3
Owner

lh3 commented Apr 24, 2018

Just added "dv" to the man page via aef7b07. BTW, you can also find the table here.

@stopalopa

Hi,
I noticed that some alignments don't have the dv tag. Under what circumstances will this tag not be present?

Thanks,
Natasha

@aaronphillips7493

I also cannot see the dv tag in my whole-genome alignments. Have there been any updates on why this occurs?

Thanks,
Aaron

@lh3
Owner

lh3 commented Jul 28, 2021

dv is approximate. It is only output when you don't perform base alignment.

@Raanaroohanitaziani

Raanaroohanitaziani commented Jul 12, 2022

How do I decode SAM file output? What does this header mean? There was no explanation in the manual.

15e6db9f-7f75-400a-9e8e-a49e5892710d 256 gi|253771435|ref|NC_012947.1| 3013928 0

I hope someone can help me.

@lh3
Owner

lh3 commented Jul 14, 2022

@Raanaroohanitaziani please read the SAM spec.

@Raanaroohanitaziani

@Raanaroohanitaziani please read the SAM spec.

Thanks for your reply. I am not sure how I can find the SAM spec for minimap2. I would appreciate it if you could help me with this.
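
SAM is a general alignment format rather than something specific to minimap2, so the leading columns of the record quoted above follow the standard SAM layout: QNAME, FLAG, RNAME, POS, MAPQ. As a rough sketch, the Python below names those five fields and expands the FLAG bits using the bit meanings from the SAM specification. The function names are hypothetical, and the record is split on whitespace only because it was pasted that way in this thread; real SAM lines are tab-separated.

```python
# FLAG bits as defined by the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag):
    """List the meaning of every bit that is set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


def decode_leading_fields(record):
    """Name the first five SAM columns and expand the FLAG."""
    names = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ"]
    fields = dict(zip(names, record.split()[:5]))
    for key in ("FLAG", "POS", "MAPQ"):
        fields[key] = int(fields[key])
    fields["FLAG_MEANING"] = decode_flag(fields["FLAG"])
    return fields


record = ("15e6db9f-7f75-400a-9e8e-a49e5892710d 256 "
          "gi|253771435|ref|NC_012947.1| 3013928 0")
print(decode_leading_fields(record))
```

Here the FLAG value 256 is 0x100, which marks a secondary alignment; POS is the 1-based leftmost mapping position on the reference, and MAPQ is the mapping quality.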

@chucknordy

I just noticed that there is a mysterious SAM annotation tag ("zd:i", with small integer values) present in some of my alignments, which doesn't seem to be documented in the man page. (I have minimap2 2.24.) What does this "zd" tag represent?

@jdmontenegro

I just noticed this today too, and I cannot find the "zd" tag documented anywhere. Maybe it has something to do with the z-drop score?

@CharlesARoy

@lh3 is there a way to disable the output of the Minimap2-specific tags? They can cause issues in other tools such as GATK, e.g.:
https://gatk.broadinstitute.org/hc/en-us/community/posts/11440622639387-Unable-to-trim-uncertain-bases-without-flow-order-information

@lh3
Owner

lh3 commented Nov 2, 2023

That was a GATK bug and has been fixed.

@CharlesARoy

Thank you @lh3, I had seen that that was a GATK bug that had been fixed -- I mostly provided that example to illustrate one reason why someone might want to disable the minimap2-specific tags in the output. I take it that's not an option?

@lh3
Owner

lh3 commented Nov 2, 2023

Not an option. It is not worth increasing tech debt for such rare bugs in downstream tools. Complicating the code base for these will make minimap2 harder to maintain in future.

@alisamatisse

Hello, sorry for this question, but how are supplementary alignments determined (what is the algorithm)? I know they indicate chimeric reads, but I was really surprised to have 60% chimeric reads in my samples 🤔
