Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alternate Contig Alignments and Mapping Quality Zero #15

Open
DarioS opened this issue Feb 15, 2022 · 12 comments
Open

Alternate Contig Alignments and Mapping Quality Zero #15

DarioS opened this issue Feb 15, 2022 · 12 comments

Comments

@DarioS
Copy link

DarioS commented Feb 15, 2022

The majority of reads (particularly over exons) in a region of chr17 appear to have MAPQ being zero.

KANSL1alignments

Copying and pasting one of their sequences into UCSC BLAT, it becomes apparent that they also map perfectly to ALT contigs.

QUERY  SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
-------------------------------------------------------------------------------------------------------------
KANSL1   151     1   151   151   100.0%  chr17_GL000258v2_alt  -      721003    721153    151
KANSL1   151     1   151   151   100.0%  chr17_KI270908v1_alt  +      769645    769795    151
KANSL1   151     1   151   151   100.0%  chr17                 +    46067546  46067696    151

But, the MAPQ there is also all zero (reads are white, not dark grey).

image

Isn't bwa-postalt.js supposed to lift-over these alignments to the main chromosomes and increase their MAPQ?

It reads all potential hits reported in the XA tag, lifts ALT hits to the chromosomal positions using the ALT-to-ref alignment, groups them based on overlaps between their lifted positions, and then re-estimates mapQ across the best scoring hit in each group.

It looks like SNVs in hg38 regions with alternate contigs might always be missed. @r0sies @tracychew can you investigate?

@calizilla
Copy link
Collaborator

Hi Dario

I have checked your data and can see that alt-aware mapping was carried out. You have the .alt index file in place, the alt contigs have the AH tag present in the SAM header indicating alt-aware handling, and you have reads with pa tags indicating alt aware handling.

I have checked the alt contig in question and its corresponding location on chr17 primary reference sequence, and there are primary reads with non-zero mapping quality that also map to the alt contig.

Have a look at section 3 here https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/tutorials/8017-how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38 for an example of reads that map to both the primary and alt contig, and a description of the PA tag.

It looks like SNVs in hg38 regions with alternate contigs might always be missed

No this is not correct. You can verify this by counting the number of SNVs in your final VCF in a hg38 primary region with alt contig.

@DarioS
Copy link
Author

DarioS commented Feb 18, 2022

I still think that this is a real issue. Let's use ST-E00127:578:HJKY7ALXX:4:2106:25723:5089 from CSCC_0004-M1.final.bam as an example. It has three perfect alignments; chr17 and two alternate contigs. You can quickly verify its hg38 alignments using BLAT.

Read sequence

AGCTCAGTACAGGACGTGTCCGGGCTGCCACACAGGTGCCATCAGATGATGAAGAGACGAGATTCAGTCGTTGCTTCTTATGTAATTGTTCCTCAGCATCAGAGCTGTCACCTGGAATGTGGTCTGCCAAGACAGGCTGAAGACTATACAA

Alignments of read to genome

image

It seems to be lifted back to chr17, but it has a MAPQ of 0 on both chr17 and the alt. So, it wouldn't be used for SNV calling nor copy number.

Compare

Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 147
----------------------
Mapping = Primary @ MAPQ 0
Reference span = chr17:46,066,590-46,066,740 (-) = 151bp
Cigar = 151M

to

Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 2195
----------------------
Mapping = Supplementary @ MAPQ 0
Reference span = chr17_KI270908v1_alt:768,689-768,839 (-) = 151bp
Cigar = 151M

It's missing from the third perfect mapping location on chr17_GL000258v2_alt.

We have found problems with copy number variant calling and there would also be a problem with SNVs here in cancer samples which had SNVs in a gene such as KANSL1. I think this explains the enrichment of MAPQ 0 reads over exons (exons are more conserved than introns and more likely to result in score ties) and MAPQ > 0 reads enriched in introns. But, reads like the one above would be fine if hg19 or a transcriptome FASTA reference was used. Do you agree that it is a problem?

@calizilla
Copy link
Collaborator

It does appear odd that the mapping quality of the primary alignment has not been assigned > 0. I am also interested to note that both alt locations are listed in the XA tag, but only the alignment to chr17_KI270908v1_alt is present in the BAM file. I would suggest reporting this to https://github.com/lh3/bwa/tree/master/bwakit for the developer's comment.

@DarioS
Copy link
Author

DarioS commented Feb 20, 2022

O.K. issue reported to Heng Li.

@georgiesamaha
Copy link
Member

Hi Dario,
While we wait to hear back from the developers, we suggest taking a closer look at other potential regions in the BAMs that have an elevated number of reads with post-alt processing tags and MAPQ 0, if you are interested. You can then perform variant calling separately on these regions.

@Calliza suggests doing this by extracting the reads that map to the chr17 region from the BAM file, convert them back to fastq format and remap them using bwa-mem without post-alt processing.

Let us know how you go, we're keen to understand the impact this issue might have.

@georgiesamaha georgiesamaha reopened this Mar 3, 2022
@DarioS
Copy link
Author

DarioS commented Mar 10, 2022

I tried out the suggestion with GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (as recommended by Heng Li) and now all of the reads have high mapping quality (i.e. MAPQ 60) across the entire region. So, the MAPQ 0 problem seems to be caused simply by having alternate contigs (with identical DNA stretches) in the reference and bwa-postalt.js seemingly being buggy.
image

@calizilla
Copy link
Collaborator

Let's wait to hear what Heng Li has to say regarding lh3/bwa#342 before assuming the bwa-kit is buggy. There may be a sound reason as to why the MAPQ of these reads were not lifted during post-alt processing.

@DarioS
Copy link
Author

DarioS commented Nov 23, 2022

Could create_project.bash offer an easy way to switch between reference genomes with and without alternate contigs? I am working on a new project where accurate copy numbers are a higher priority than HLA typing, for instance.

@calizilla
Copy link
Collaborator

Thanks for the suggestion Dario. The simplest solution would be to clone a new repo and re-align using the different reference genome. As the mapping qualities and tags will be different for reads that have aligned to 'alt', simply removing the alt component of the refseq won't do the trick. Back-tracking to calculate those values without alt contigs is beyond the scope of this pipeline, but sounds like a useful utility to have. Perhaps add it to the bwa-kit github as a suggestion?

@DarioS
Copy link
Author

DarioS commented Nov 10, 2023

Heng Li now recommends that people do not use hs38DH.

Name Coordinate Brief description
hs37 GRCh37 1000 Genomes Project (1000G) reference in 2010
hs37d5 GRCh37 hs37 with decoy (recommended GRCh37)
hs38 GRCh38 GRCh38 no-alt analysis set (recommended GRCh38)
hs38DH GRCh38 GRCh38 with ALT, decoy and HLA genes (not recommended)
chm13v2 CHM13v2 T2T CHM13 plus HG002 chrY (recommended CHM13)

hs38DH: This genome includes GRCh38 alternate contigs, GRCh38 decoy sequences and HLA alleles. As GRCh38 is more complete than GRCh37, GRCh38 decoy sequences are not as important as GRCh37 decoy. Furthermore, most tools would not work well with this version as they are not ALT-aware. Improper use of this genome would hurt variant sensitivity.

@calizilla
Copy link
Collaborator

Hi Dario. Heng Li has recomemnded against this genome ONLY IF the user is not using ALT-aware tools. this repository contains a workflow that uses an alt-aware tool (bwa-kit) so use of hs38DH is OK.

We have not forgotten about this issue, and do have plans to investigate further. I apologise its taking so long, but as you can understand we have a large workload with a small team.

I will let you know once we have an answer for you regarding the original issue.

@DarioS
Copy link
Author

DarioS commented Dec 3, 2023

Excellent. There is certainly a bioinformatics staff shortage everywhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants