Block compressed gVCF file is rejected by GenomicsDBImport --bypass-feature-reader due to non-standard file extension #7691

Closed
1 task
lessdata opened this issue Feb 23, 2022 · 1 comment · Fixed by #7692

lessdata commented Feb 23, 2022

Bug Report

Affected tool(s) or class(es)

GenomicsDBImport

Affected version(s)

  • Latest public release version [4.2.5.0]

Description

My gVCF files are block compressed and indexed, but they use the file extension ".gvcf.gz" rather than ".vcf.gz". When I run GenomicsDBImport with --bypass-feature-reader, the ".gvcf.gz" files are not recognized as block-compressed VCFs, because GenomicsDBImport decides whether an input is block compressed by checking that its name ends with ".vcf.gz":

    private static void assertVariantFileIsCompressedAndIndexed(final Path path) {
        if (!path.toString().toLowerCase().endsWith(FileExtensions.COMPRESSED_VCF)) {
            throw new UserException("Input variant files must be block compressed vcfs when using " +
                BYPASS_FEATURE_READER + ", but " + path.toString() + " does not appear to be");
        }
        Path indexPath = path.resolveSibling(path.getFileName() + FileExtensions.COMPRESSED_VCF_INDEX);
        IOUtils.assertFileIsReadable(indexPath);
    }

I understand that this is an issue on my side, since I did not name my gVCF files with the standard ".vcf.gz" extension. Would it be possible to make this check less stringent in a future release? For example, it could accept any ".gz" or ".bgz" file, or use the presence of the ".tbi" index file to identify block compression (an existing index typically means the file is block compressed and indexed).
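To make the suggestion concrete, here is a minimal sketch of a relaxed check. It is not GATK code: the class name, the method names, and the list of accepted suffixes are all hypothetical, and it uses only the JDK rather than `FileExtensions` or `IOUtils`. It accepts any ".gz" or ".bgz" name and then requires a readable ".tbi" sibling, built the same way as `indexPath` in the snippet above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RelaxedCompressedVcfCheck {
    // Hypothetical list of accepted suffixes (an assumption, not GATK's list).
    static final List<String> ACCEPTED_SUFFIXES = Arrays.asList(".gz", ".bgz");
    // Tabix index suffix, appended to the full file name.
    static final String TABIX_SUFFIX = ".tbi";

    // True if the path ends with one of the accepted suffixes, ignoring case.
    static boolean hasAcceptedSuffix(final Path path) {
        final String name = path.toString().toLowerCase(Locale.ROOT);
        return ACCEPTED_SUFFIXES.stream().anyMatch(name::endsWith);
    }

    // Sibling index path, e.g. sample.gvcf.gz -> sample.gvcf.gz.tbi
    static Path tabixIndexFor(final Path path) {
        return path.resolveSibling(path.getFileName() + TABIX_SUFFIX);
    }

    // Name-based check plus presence of a readable index; never opens the data file.
    static boolean looksCompressedAndIndexed(final Path path) {
        return hasAcceptedSuffix(path) && Files.isReadable(tabixIndexFor(path));
    }

    public static void main(String[] args) {
        final Path input = Paths.get(args.length > 0 ? args[0] : "sample.gvcf.gz");
        System.out.println(input + " accepted: " + looksCompressedAndIndexed(input));
    }
}
```

This still relies only on file names and one existence check for the index, so it does not read the contents of the variant file itself.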

Thank you.


droazen (Collaborator) commented Feb 23, 2022

@lessdata In many cases we need to rely on the file extensions to check the file format, because actually opening the files and reading the first few bytes to determine the format gets expensive when the files are hosted in the cloud and there are many VCFs. I do agree that this error message could be improved, however -- it should mention the file extensions that are allowed.
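For reference, the content-based alternative mentioned above, reading the first few bytes, could look roughly like the sketch below. It is an illustration under stated assumptions, not how GATK or htsjdk implement it: the class and method names are hypothetical, and it assumes the BGZF "BC" extra subfield is the first one in the gzip header, which is the usual layout. Each call needs one read per input file, which is the cost that adds up for many cloud-hosted VCFs.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class BgzfSniffer {
    // Bytes needed to reach the "BC" subfield identifier in a BGZF block header.
    static final int HEADER_PREFIX_LENGTH = 14;

    // Checks the gzip magic, the deflate method, the FEXTRA flag, and the "BC"
    // subfield id. Assumes "BC" is the first extra subfield (a simplification).
    static boolean isBgzfHeader(final byte[] h) {
        return h != null && h.length >= HEADER_PREFIX_LENGTH
                && (h[0] & 0xFF) == 0x1F   // gzip ID1
                && (h[1] & 0xFF) == 0x8B   // gzip ID2
                && h[2] == 8               // CM = deflate
                && (h[3] & 0x04) != 0      // FLG.FEXTRA set
                && h[12] == 'B'            // subfield SI1
                && h[13] == 'C';           // subfield SI2
    }

    // Reads only the header prefix from the stream; returns false on short input.
    static boolean startsWithBgzfHeader(final InputStream in) throws IOException {
        final byte[] header = new byte[HEADER_PREFIX_LENGTH];
        try {
            new DataInputStream(in).readFully(header);
        } catch (EOFException e) {
            return false;
        }
        return isBgzfHeader(header);
    }

    public static void main(String[] args) throws IOException {
        // First 16 bytes of the standard empty BGZF end-of-file block.
        final byte[] bgzf = {0x1f, (byte) 0x8b, 0x08, 0x04, 0, 0, 0, 0,
                             0, (byte) 0xff, 0x06, 0x00, 0x42, 0x43, 0x02, 0x00};
        System.out.println(startsWithBgzfHeader(new ByteArrayInputStream(bgzf)));
    }
}
```

A plain gzip file fails this test because its header normally lacks the FEXTRA flag and the "BC" subfield, so the check distinguishes block compression from ordinary gzip, which a file extension alone cannot do.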
