Merge VAT TSV files into single bgzipped file [VS-304] #7848

Merged: rsasch merged 16 commits into ah_var_store from rsa_merge_vat_tsvs on May 13, 2022

@rsasch rsasch commented May 12, 2022

codecov bot commented May 12, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ah_var_store@4e7b1f8).
The diff coverage is n/a.

@@               Coverage Diff                @@
##             ah_var_store     #7848   +/-   ##
================================================
  Coverage                ?   86.304%           
  Complexity              ?     35189           
================================================
  Files                   ?      2170           
  Lines                   ?    164837           
  Branches                ?     17775           
================================================
  Hits                    ?    142261           
  Misses                  ?     16252           
  Partials                ?      6324           

done

echo_date "making header.gz"
echo "vid transcript contig position ref_allele alt_allele gvs_all_ac gvs_all_an gvs_all_af gvs_all_sc gvs_max_af gvs_max_ac gvs_max_an gvs_max_sc gvs_max_subpop gvs_afr_ac gvs_afr_an gvs_afr_af gvs_afr_sc gvs_amr_ac gvs_amr_an gvs_amr_af gvs_amr_sc gvs_eas_ac gvs_eas_an gvs_eas_af gvs_eas_sc gvs_eur_ac gvs_eur_an gvs_eur_af gvs_eur_sc gvs_mid_ac gvs_mid_an gvs_mid_af gvs_mid_sc gvs_oth_ac gvs_oth_an gvs_oth_af gvs_oth_sc gvs_sas_ac gvs_sas_an gvs_sas_af gvs_sas_sc gene_symbol transcript_source aa_change consequence dna_change_in_transcript variant_type exon_number intron_number genomic_location dbsnp_rsid gene_id gene_omim_id is_canonical_transcript gnomad_all_af gnomad_all_ac gnomad_all_an gnomad_failed_filter gnomad_max_af gnomad_max_ac gnomad_max_an gnomad_max_subpop gnomad_afr_ac gnomad_afr_an gnomad_afr_af gnomad_amr_ac gnomad_amr_an gnomad_amr_af gnomad_asj_ac gnomad_asj_an gnomad_asj_af gnomad_eas_ac gnomad_eas_an gnomad_eas_af gnomad_fin_ac gnomad_fin_an gnomad_fin_af gnomad_nfr_ac gnomad_nfr_an gnomad_nfr_af gnomad_sas_ac gnomad_sas_an gnomad_sas_af gnomad_oth_ac gnomad_oth_an gnomad_oth_af revel splice_ai_acceptor_gain_score splice_ai_acceptor_gain_distance splice_ai_acceptor_loss_score splice_ai_acceptor_loss_distance splice_ai_donor_gain_score splice_ai_donor_gain_distance splice_ai_donor_loss_score splice_ai_donor_loss_distance omim_phenotypes_id omim_phenotypes_name clinvar_classification clinvar_last_updated clinvar_phenotype" | gzip > header.gz
Contributor

Not for this PR, but for the future: I hate that this is hardcoded, but I don't see a way around it, since it's also hardcoded for the export query (also not good). Maybe run the query twice, once with a limit of 0, and just grab the header? I dunno.

Contributor

Can we generate the TSVs with a header line in the EXPORT command, and then you can get this header from the first TSV instead (and grep it out of the others when you concatenate)?

Author

I could, but it adds complexity that I wasn't sure would be worth much. If you have the bash code on hand to do it, I'd be happy to add it 😉.
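
For reference, the header-from-the-first-shard idea could look roughly like the sketch below. It is not code from this PR: it assumes the export writes the same header line at the top of every gzipped shard, and `merge_shards_with_header` is a hypothetical helper name.

```shell
# Hypothetical helper (not from the PR): print the merged TSV to stdout.
# Assumes every shard is a gzipped TSV whose first line is the same header.
merge_shards_with_header() {
  keep_header=1
  for shard in "$@"; do
    # Emit line 1 only for the first shard; emit the data lines of every shard.
    gunzip -c "$shard" | awk -v keep="$keep_header" 'NR > 1 || keep == 1'
    keep_header=0
  done
}
```

In the task this might be called as `merge_shards_with_header $files | bgzip > vat_complete.bgz.tsv.gz`, which would replace both the hardcoded `header.gz` and the separate gunzip step. The trade-off is that every shard is decompressed once, where the current `cat` concatenation works on the compressed bytes directly.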

scripts/variantstore/wdl/GvsCreateVAT.wdl (outdated)
echo_date "concatenating $files"
cat $(echo $files) > vat_complete.tsv.gz
echo_date "bgzipping concatenated file"
cat vat_complete.tsv.gz | gunzip | bgzip > vat_complete.bgz.tsv.gz
Collaborator

Should the output be named 'vat_complete.tsv.bgz'?

Author

I thought so, too, but everything I saw that showed the handling of bgzipped files had them with a .gz suffix.
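
Some background on why the plain `cat` step works and why a `.gz` suffix is defensible: the gzip format allows several compressed members back to back in one file, and `gunzip` decompresses them as a single stream. bgzip output is itself a series of small gzip members, so standard gzip tools can read it. The sketch below demonstrates the concatenation behaviour with `gzip` alone; the file names and rows are made up for illustration.

```shell
# Illustration only: file names and row contents are invented.
workdir=$(mktemp -d)
printf 'vid\tcontig\n' | gzip > "$workdir/header.gz"
printf '1-100-A-T\tchr1\n' | gzip > "$workdir/shard_0.tsv.gz"
printf '2-200-G-C\tchr2\n' | gzip > "$workdir/shard_1.tsv.gz"

# Byte-level concatenation of the compressed files, no decompression needed.
cat "$workdir/header.gz" "$workdir/shard_0.tsv.gz" "$workdir/shard_1.tsv.gz" \
  > "$workdir/merged.tsv.gz"

# gunzip reads all three members as one continuous stream.
merged_lines=$(gunzip -c "$workdir/merged.tsv.gz" | wc -l)
echo "merged file has $merged_lines lines"
```

The later `gunzip | bgzip` pass is still needed when block-level indexing is wanted, since plain concatenated gzip members do not carry the block-size information that BGZF adds.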

scripts/variantstore/wdl/GvsCreateVAT.wdl
echo_date "concatenating $files"
cat $(echo $files) > vat_complete.tsv.gz
echo_date "bgzipping concatenated file"
cat vat_complete.tsv.gz | gunzip | bgzip > vat_complete.bgz.tsv.gz
Collaborator

I would have thought this would end up vat_complete.tsv.bgz?

Author

I thought so, too, but everything I saw that showed the handling of bgzipped files had them with a .gz suffix.
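
Since the suffix alone does not say whether a file is block-gzipped, one way to check is to look at the first bytes. The sketch below assumes the layout bgzip writes in practice (gzip magic, deflate, the FEXTRA flag, and a `BC` extra subfield placed first); `is_bgzf` is a hypothetical helper name, not something from this PR.

```shell
# Hypothetical helper: succeed if the file starts with a BGZF block header.
is_bgzf() {
  # Bytes 0-3: gzip magic 1f 8b, deflate (08), FLG with only FEXTRA set (04).
  # Bytes 12-13: extra-subfield identifier 'B' 'C' (42 43).
  header_hex=$(head -c 14 "$1" | od -An -tx1 | tr -d ' \n')
  case "$header_hex" in
    1f8b0804????????????????4243) return 0 ;;
    *) return 1 ;;
  esac
}

# Example call (file name taken from the task above):
#   is_bgzf vat_complete.bgz.tsv.gz && echo "BGZF"
```

A plain `gzip` file fails this check because its flag byte and extra field differ, so the helper can distinguish the two even when both carry a `.gz` suffix.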

scripts/variantstore/wdl/GvsCreateVAT.wdl
@rsasch rsasch merged commit 900651f into ah_var_store May 13, 2022
@rsasch rsasch deleted the rsa_merge_vat_tsvs branch May 13, 2022 14:15
This was referenced Mar 17, 2023