[Idea] Experiment with indexing transcriptomics data on NeMO #73

Open
rly opened this issue May 25, 2024 · 0 comments
Labels
category: proposal (proposed enhancements or new features)

rly commented May 25, 2024

Transcriptomics data on the NeMO archive are often stored as ASCII text files (FASTQ, FASTA, MEX) that are sometimes tarballed and sometimes gzipped. I have also found tarballed BAM files (binary).

The files inside a tarball can be indexed by byte range using the tar headers: each member is preceded by a header that records its name and size. Supposedly, gzipped files can also be indexed so that byte ranges of them can be decompressed on their own.
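
The tarball half of this can be sketched with Python's standard `tarfile` module. This is only a minimal sketch, assuming an uncompressed tar; the file names and contents are made up for the demo, and `index_tar_members` and `build_demo_tar` are hypothetical helpers, not part of any existing tool. It does not cover gzipped tarballs, where the offsets would refer to the decompressed stream.

```python
import io
import tarfile


def index_tar_members(fileobj):
    """Map each regular file in an uncompressed tar to (data offset, size) in bytes."""
    index = {}
    # mode="r:" opens the archive strictly without compression
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.isfile():
                # offset_data is where the member's payload starts in the archive
                index[member.name] = (member.offset_data, member.size)
    return index


def build_demo_tar():
    """Build a small in-memory tarball with made-up names and contents (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("sample/barcodes.tsv", b"AAACCC-1\nAAACGG-1\n"),
            ("sample/reads.fastq", b"@read1\nACGT\n+\nIIII\n"),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


tar_bytes = build_demo_tar()
tar_index = index_tar_members(io.BytesIO(tar_bytes))

# Pull one member out of the raw archive bytes using only its byte range
offset, size = tar_index["sample/reads.fastq"]
fastq_bytes = tar_bytes[offset:offset + size]
```

Against a remote tarball, the same `(offset, size)` pair would translate into an HTTP `Range` request, assuming the server honors range requests, which I have not checked for NeMO.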

Example BICCN data:
https://data.nemoarchive.org/biccn/grant/u01_lein/lein/transcriptome/sncell/10x_v3/
https://data.nemoarchive.org/biccn/grant/u01_lein/linnarsson/transcriptome/sncell/10x_v2/human/processed/CellRanger5/
https://data.nemoarchive.org/biccn/grant/u19_huang/arlotta/transcriptome/sncell/10x_v2/mouse/processed/align/

Some of these data files can be very large, and a user may want to access only particular elements of the data file without having to download the entire file. I wonder if we can use LINDI to create an efficient JSON index of specific data elements within a NeMO-hosted dataset for streaming and local access. Just an idea right now as we brainstorm for the grant proposal.
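
To make the idea of a JSON index a bit more concrete, here is a rough sketch of what one could look like. The layout (a `refs` map from element name to `[url, offset, size]`) is an assumption loosely modeled on kerchunk-style reference files, not taken from LINDI's actual schema; the URL, element names, and byte offsets are placeholders, and `make_reference_index` and `range_header` are hypothetical helpers.

```python
import json


def make_reference_index(url, byte_ranges):
    """Serialize {name: (offset, size)} as a JSON index of [url, offset, size] entries."""
    refs = {name: [url, offset, size] for name, (offset, size) in byte_ranges.items()}
    return json.dumps({"version": 1, "refs": refs}, indent=2)


def range_header(offset, size):
    """Build the HTTP Range header that would fetch exactly one indexed element."""
    # HTTP byte ranges are inclusive on both ends, hence the -1
    return {"Range": f"bytes={offset}-{offset + size - 1}"}


# Placeholder URL and made-up byte ranges, for illustration only
demo_url = "https://example.org/dataset.tar"
demo_ranges = {
    "sample/barcodes.tsv": (512, 18),
    "sample/reads.fastq": (1536, 19),
}

index_json = make_reference_index(demo_url, demo_ranges)

# A client would look up an element in the index, then request only its bytes
url, offset, size = json.loads(index_json)["refs"]["sample/reads.fastq"]
headers = range_header(offset, size)
```

A client holding only this JSON file could then stream a single element without downloading the whole archive, provided the host supports range requests.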

BDBags can be used to index and download particular files of a dataset, but I don't know whether this works within a tarball or within a FASTQ file.

rly added the "category: proposal" label May 25, 2024