Input reads have incorrect file format #206

Open
cebos opened this issue Apr 19, 2023 · 10 comments

Comments

@cebos

cebos commented Apr 19, 2023

Hi Nicolas,
I'm using NOVOPlasty to de novo assemble mitochondrial genomes for a dataset of 29 sets of paired-end short-read data.
For over a third of my samples, I've gotten the following error:

THE INPUT READS HAVE AN INCORRECT FILE FORMAT! PLEASE SEND ME THE ID STRUCTURE!

I've attached an example of some of the reads from one sample below; please let me know if you need any other information. I filtered the raw data for the entire dataset with fastp using default parameters and am giving NOVOPlasty the filtered forward and reverse read files generated by fastp. Thank you for your time, your help and advice are greatly appreciated!
Best,
cebos

Example:
zless Microrhombophryne_Ca39_ZCMV-12404_L001_R1.out.fastq.gz

@J00138:141:HN23TBBXX:5:1101:23520:1068 2:N:0:ATAGCGAC+ATTACTCG
CCCTGAATGTCTACGTGGCTCTTTGTTACTATAAACTTGATTACTATGATGTGTCACAGGAAGTTCTTGCAGTATATTTGCAACAGGTTCCTGACAGTACGATTGCTCTTAATTTGAAGGCCTGCAATCATTTTCGTCTTTACAATGGGAA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJ--AFJFJJ7JF-F-FAJFFAFFJF7AJFJJFJ7-AFFJJJJJJJJAJJFJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJFJJJJJJJJJJAJFAAAJJFJJJ7AJJJJJ<7
@J00138:141:HN23TBBXX:5:1101:30076:1068 2:N:0:ATAGCGAG+ATTACTCG
GCCAACAAAAGGTATCGCCTTATTTCTTCACTTTTCTATTGAATTCAATGGCCAACACGGTACAACACATCACTGCTACATATCGAATAGATAGCTTGGCCGTAGGCCTGTGTGTTTGGGGAAGGGCTGATCAGAACCCATCGGGATAGCT
+
A<AFFJJJJJJJJJJJJFJJJJ-F7FFJJJJJF7J7FJF-<FJ<F7JJFA--7--AFF7-AFJJFFJ<7JFJJF7FFJJJFJA<FJJFFFJJJJJJJFJFJ7JF<FFFAFJ--AFJAFFFJJJJJJFJAF-F7FFJJJA7-))7-FFF<F<
@J00138:141:HN23TBBXX:5:1101:18873:1103 2:N:0:ATAGCGAC+ATTACTCG
TGGATACTGGAGAAGATTCGAGTGGTAGATTCTATTCAGAACCTTGGAGATGATCTCACTGCAGTCATGTCAATTCAGAGAAAACTCTGTGGCATTGAGAAAGATCTTGGTGCCATTGAGTCTAAACTTGTAAGTCTACAAGAAGAGGCAA
+
AAAFFJJJJJJFFJJJJJJJJJJJJJFJFFJJJJJJJJJJJFJJJJJJFJFJJJJAAJFFJJAJ7FJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJ-77AJJFJJFFJJJJJJJJJJJJFJJJFFJJFF<JJJJFF<JJJJJJJJJJJJJ
@J00138:141:HN23TBBXX:5:1101:27965:1103 2:N:0:ATAGCGAG+ATTACTCG
AGGTTGGCAATGTGGAATCAGGCAGAGTGTGCAATGGCAAGCAAGGTT
+
AAFFFJJJJJJJAFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFAF
@J00138:141:HN23TBBXX:5:1101:25276:1121 2:N:0:ATAGCGAC+ATTACTCG

@cebos
Author

cebos commented Apr 21, 2023

I also have an unrelated NOVOPlasty question: what do the optional config.txt parameters
Insert Range = 1.9 and Insert Range strict = 1.3 do?

@ndierckx
Owner

Hi,
Although these are supposed to be the forward reads, the ID contains a 2: "2:N:0:"
Shouldn't these be the reverse reads?
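
One way to catch this kind of mislabeling before starting an assembly is to sample the headers of each file and count the read-number field. The following is only a rough sketch, not part of NOVOPlasty: it assumes Illumina-style headers like the ones in the example above (an ID, a space, then `read:filter:control:index`) and plain four-line FASTQ records. The function names `read_numbers` and `check_file` are made up for illustration.

```python
import gzip
import re
from collections import Counter

# Header comment as in the example reads: "<read>:<filter Y/N>:<control>:<index>"
READ_FIELD = re.compile(r"^([12]):[YN]:\d+:")


def read_numbers(lines, max_records=1000):
    """Count the read-number field (1 or 2) in the first FASTQ headers.

    Assumes four-line records, so every fourth line is a header.
    Headers that do not match the expected layout are counted as "unknown".
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i // 4 >= max_records:
            break
        if i % 4 != 0:
            continue
        parts = line.rstrip("\n").split(" ", 1)
        match = READ_FIELD.match(parts[1]) if len(parts) == 2 else None
        counts[match.group(1) if match else "unknown"] += 1
    return counts


def check_file(path, expected):
    """Return True if every sampled header carries the expected read number."""
    with gzip.open(path, "rt") as handle:
        counts = read_numbers(handle)
    return set(counts) == {expected}
```

For a file that is supposed to hold forward reads, `check_file(path, "1")` should come back true; a result of false for an `_R1` file would point to the swap described here.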

And why are you filtering the reads? You can just use the complete dataset.

Greets,

Nicolas

@ndierckx
Owner

The insert range doesn't need to be changed. You can make it larger when you use a library with strongly fluctuating insert sizes, but that is almost never the case.

@cebos
Author

cebos commented Apr 25, 2023

Hi Nicolas,
Thanks for your prompt response! You were right: some of the files had their forward and reverse labels swapped. They are working fine now.
What's the difference between the default settings of

Optional:
Insert size auto = yes
Use Quality Scores =
Output path =
versus adding Insert Range = 1.9 and Insert Range strict = 1.3?

I also have a question about the Store Hash = Yes option. I'm testing 1000s of seeds on my dataset to see if I can find optimal ones; if I enable this option, should that reduce the computation time? My understanding is that the hash table is created based on the read information, and the seed is applied after.
However, once I added the Store Hash = Yes option to my script, NOVOPlasty appears to take as long as or longer than it did beforehand. The slurm output for the run appears to show that a new hash table is being stored for each run, and the output directory also has a new hash file for each project run.
Reading Input......OK
Scan reference sequence......OK
Building Hash Table......OK
Subsampled fraction: 100.00 %
Retrieve Seed...BUILD2

I've written a script to create a batch file for each individual that provides a new project name for the seed + sample combination, and the other standard information, so that this structure is iterated through the file (until all seed combinations have been included):
Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${input_dir}${sample}_R1.fq.gz
${input_dir}${sample}_R2.fq.gz

From the above, it appears that the Store Hash option stores a separate hash for each new project, even if the read data being provided is the same. Is there a way I can store the hash table to use across many projects? I want to compare contig lengths across seeds, so it's important to keep the project naming conventions since that's how the output fasta files are named. Thanks a lot for your help!!
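
For readers following along, the batch file described above could be generated roughly like this. It is only a sketch of the four-line record structure quoted in this comment (project name, seed file, forward reads, reverse reads); the function names are made up, and the blank line between records is an assumption that should be checked against the NOVOPlasty wiki page on the batch function.

```python
from itertools import product


def batch_records(samples, seeds, input_dir, seed_dir):
    """Yield one four-line record per sample/seed pair.

    Line order follows the structure quoted in the comment above.
    Directories are concatenated as-is, so they must end with "/".
    """
    for sample, seed in product(samples, seeds):
        yield [
            f"Project_{sample}_{seed}",
            f"{seed_dir}{seed}.fasta",
            f"{input_dir}{sample}_R1.fq.gz",
            f"{input_dir}{sample}_R2.fq.gz",
        ]


def render_batch(records):
    """Join records into batch-file text (blank-line separator is an assumption)."""
    return "\n\n".join("\n".join(record) for record in records) + "\n"
```

Something like `render_batch(batch_records(samples, seeds, "/data/reads/", "/data/seeds/"))` would then give the text to write to the batch file; the paths here are placeholders.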

@deyuanyang

Hi,

I also have the same question. If I changed the name of the projects, the seeds would not work.

@ndierckx
Owner

ndierckx commented May 4, 2023

@cebos

Insert size auto means that it will automatically calculate the insert size; the range determines how much a read pair can differ from that insert size. No need to change anything there, it won't change much.

About the store hash, have you read the wiki:
https://github.com/ndierckx/NOVOPlasty/wiki/Store-hashes-locally

You need to run store hash only once, and then you need to use the stored hashes instead of the reads. It will speed up the first phase by a lot, especially for larger datasets.

Why are you using 1000s of seeds? If you have a WGS dataset, one seed should be enough. The seed is only needed to initiate the assembly and should be quite flexible.

@ndierckx
Owner

ndierckx commented May 4, 2023

@deyuanyang

Not sure what you mean by the seeds won't work...

@ndierckx
Owner

ndierckx commented May 4, 2023

There is also a batch function: you can check the wiki

https://github.com/ndierckx/NOVOPlasty/wiki/Batch-function

It is easy to use, and this way you can run many samples with the changes you want per run.

@cebos
Author

cebos commented May 4, 2023

@ndierckx
Thanks for sharing the specific info on the store hash function. So, if I understand correctly, there is not a way to generate and store the hash within the same run (since the config file must first have Store Hash = yes and then Store Hash = path/to/hash/file)? If possible, I would like to be able to generate the hash and then immediately call it for subsequent project runs within the same batch file, like so (incorporating a 5th line for the HASH_project.txt file):
Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${input_dir}${sample}_R1.fq.gz
${input_dir}${sample}_R2.fq.gz

Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${output_dir}HASH2B_Project_${sample}_${seed}.txt
${output_dir}HASH2C_Project_${sample}_${seed}.txt
${output_dir}HASH_Project_${sample}_${seed}.txt
I assume this isn't possible since two separate arguments are required in the config file to first generate the hash and then call it later. However, I can still use the batch function to call saved hashes, correct?

@ndierckx
Owner

ndierckx commented May 4, 2023

You can use the hash files directly for subsequent runs, because you know in advance what the hash files will be named.
You can use the batch mode for it too...
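
Since the hash file names are predictable, one possible approach is to build the hash once per sample under a fixed project name and point every seed run at those files. The sketch below only computes the names; it assumes the `HASH2B_`, `HASH2C_` and `HASH_` prefixes quoted earlier in this thread, with the first two standing in for the forward and reverse reads and the third given to Store Hash. The `Project_<sample>_hash` name and both function names are hypothetical, and whether the batch file accepts a hash path as an extra line is not confirmed here, so check the wiki before relying on it.

```python
def hash_paths(output_dir, hash_project):
    """Return the stored-hash file paths for the project that built the hash.

    Naming pattern taken from the file names quoted in this thread;
    output_dir is concatenated as-is and must end with "/".
    """
    return {
        "forward": f"{output_dir}HASH2B_{hash_project}.txt",
        "reverse": f"{output_dir}HASH2C_{hash_project}.txt",
        "store_hash": f"{output_dir}HASH_{hash_project}.txt",
    }


def seed_runs(sample, seeds, output_dir):
    """Plan one run per seed, all sharing the hash of a single build run."""
    hash_project = f"Project_{sample}_hash"  # hypothetical fixed name for the build run
    paths = hash_paths(output_dir, hash_project)
    return [
        {"project": f"Project_{sample}_{seed}", "seed": seed, **paths}
        for seed in seeds
    ]
```

Each seed run keeps its own `Project_<sample>_<seed>` name, so the output fasta files stay distinguishable while the hash is built only once per sample.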
