QuickTips

This repository shares quick tips for better data management and terminal efficiency on servers and HPC systems. It also provides a collection of commands and scripts commonly used when analyzing sequencing data. The content is drawn from my daily usage and experience.

  1. How I Organize My Data on the Server
  2. How to calculate the number of reads in a FASTQ file?
  3. How to calculate the average read depth after aligning reads to a reference genome?
  4. How to fetch mapped and unmapped reads from a BAM file using samtools?
  5. How to Remove Lines in a Text File Above a Specific Sentence using sed in Bash? [Example sentence: ">>>>>>> Coverage per contig"]
  6. How to check the delimiter of a file in Bash?
  7. How to verify data integrity using md5sum?
  8. How to convert multi-line FASTA to single-line FASTA?
  9. How to convert BAM to FASTQ and split into paired reads?

How I Organize My Data on the Server

Step 1: Create Unique Project Folders

I create a separate project folder for each project. Each project folder contains a set of subfolders that organize the different stages of analysis. This structured approach keeps all data and results for a specific project together, making the project easier to manage, access, and navigate throughout its lifecycle.

Example: Ma_gasm (genome assembly project), Ma_popgen (population genomics project)

Step 2: Use Numerical Prefixes

Using numerical prefixes in folder names helps maintain order and enables efficient use of the TAB key for quick access to any folder in the terminal.
Example: 01_Ma_gasm, 02_Ma_popgen

Step 3: Avoid Spaces in Folder Names

Using spaces in folder names can lead to issues with command-line operations and scripts. Instead, use underscores (_) as word separators.
Example: Genome Assembly should be Genome_Assembly.

Step 4: Use Short Forms

I prefer short forms over lengthy titles wherever possible.
Example: 01_Maethiopoides_genome_assembly (very long) becomes 01_Ma_gasm (short and simple)

Step 5: Organize Analysis Steps Within Project Folders

Within each project folder, I organize the analysis steps as individual folders. Raw data comes first, followed by subsequent analysis steps.
Example:
Within 01_Ma_gasm:

  • 01_raw_data
  • 02_Base_calling
  • 03_ncgnm_asm

Further within 03_ncgnm_asm:

  • 01_flye
  • 02_medaka

Within flye, each sample is placed in individual folders, starting with a QC folder:

  • 00_QC
  • 01_Magm1
  • 02_Magm2
  • 03_Magm3

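To set this structure up quickly, the whole tree above can be created in one command with mkdir -p and Bash brace expansion (a minimal sketch using the example names):

mkdir -p 01_Ma_gasm/{01_raw_data,02_Base_calling,03_ncgnm_asm/{01_flye/{00_QC,01_Magm1,02_Magm2,03_Magm3},02_medaka}}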

Step 6: Separate QC Files

Quality control (QC) files are those generated while assessing the output of a particular dataset. For example, FASTQ files may undergo QC with tools such as FastQC and MultiQC. The reports generated by these tools should be organized into 00_QC folders and separated into individual folders for each sample within the QC folder. This keeps QC results clearly organized and easily accessible.

Example of QC File Organization:

Within 02_Illumina:

  • 00_QC
    • 01_Sample1
      • fastqc_report.html
      • multiqc_report.html
    • 02_Sample2
      • fastqc_report.html
      • multiqc_report.html
    • 03_Sample3
      • fastqc_report.html
      • multiqc_report.html

How to calculate the number of reads in a FASTQ file?

In a FASTQ file, each read is represented by four lines, and each record's header line begins with the "@" symbol. The commands below count lines that start with "@". Note that "@" is also a valid quality-score character, so a quality line can occasionally start with "@" and inflate the count; a safer line-count alternative is shown after the examples.

For a single file:

grep -c "^@" input.fastq

For multiple files:

grep -c "^@" *.fastq > read_count.txt

For gzipped files (this reports a single combined count across all files):

zcat *fastq.gz | grep -c "^@" > read_count.txt
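
Because a quality line can also begin with "@", a safer alternative (assuming well-formed four-line records) is to count lines and divide by 4:

awk 'END {print NR/4}' input.fastq
zcat input.fastq.gz | awk 'END {print NR/4}'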

How to calculate the average read depth after aligning reads to a reference genome?

Depth refers to the number of times a particular nucleotide position in the genome is read during sequencing, i.e., the number of reads aligned over that position. Evaluating read depth tells you whether enough reads cover a given base position, which is crucial for downstream analysis. Coverage, by contrast, refers to the percentage of the genome that has reads aligned to it. Checking read depth after alignment is a common way to evaluate sample quality.

Step 1: Convert SAM file to BAM file

samtools view -bS --threads <number_of_threads> input.sam > output.bam

Step 2: Sort BAM file based on position

samtools sort input.bam -o output_sorted.bam -@ <number_of_threads>

Step 3: Calculate the average depth
You can use either samtools depth or mosdepth.

samtools depth -a output_sorted.bam |  awk '{sum+=$3} END { print "Average = ",sum/NR}'

OR

Index the sorted BAM file and use mosdepth to calculate the average read depth:

samtools index input_sorted.bam -@ <number_of_threads>
mosdepth sample_depth input_sorted.bam
awk '$1=="total"{print $4}' sample_depth.mosdepth.summary.txt
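
For reference, the steps can also be chained so no intermediate SAM file sits on disk; a minimal sketch assuming a hypothetical input aln.sam and 8 threads:

samtools view -b --threads 8 aln.sam | samtools sort -@ 8 -o aln_sorted.bam -
samtools index aln_sorted.bam
samtools depth -a aln_sorted.bam | awk '{sum+=$3} END {print "Average =", sum/NR}'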

How to fetch mapped and unmapped reads from a BAM file using samtools?

To fetch mapped reads:

samtools view -b -F 4 input.bam > output.bam

The -F 4 flag excludes unmapped reads, and -b produces BAM output.

To fetch unmapped reads:

samtools view -b -f 4 input.bam > output.bam

The -f 4 flag keeps only unmapped reads.

OR

To fetch mapped reads from a BAM file and output them directly as FASTQ:

samtools fastq -F 4 <input.bam> > <output.fq>
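
To sanity-check either split, samtools flagstat reports how many reads in a BAM file are mapped versus unmapped:

samtools flagstat input.bam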

How to Remove Lines in a Text File Above a Specific Sentence using sed in Bash? [Example sentence: ">>>>>>> Coverage per contig"]

  1. Edit the file in place:
sed -i '1,/>>>>>>> Coverage per contig/d' yourfile.txt
  2. Write the edits to a new file:
sed '1,/>>>>>>> Coverage per contig/d' yourfile.txt > newfile.txt

Note that the address range 1,/pattern/ deletes everything from the first line through the matching line itself, inclusive.
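
A small demonstration on a throwaway file, showing that the matching line itself is also deleted:

printf 'drop me\ndrop me too\n>>>>>>> Coverage per contig\ncontig1 30x\n' > demo.txt
sed '1,/>>>>>>> Coverage per contig/d' demo.txt
# prints only: contig1 30x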

How to check the delimiter of a file in Bash?

  1. Save the following code as "delim_check.awk":
BEGIN {
    # Candidate delimiters to test ("\\|" is a pipe, escaped for the regex)
    sep[","]   = "comma"
    sep["\\|"] = "pipe"
    sep["\t"]  = "tab"
}

{
    # Count each candidate's occurrences on this line;
    # c occurrences of a delimiter imply c+1 columns
    for (x in sep) {
        c = gsub(x, "&", $0)
        if (c) cnt[sep[x] " " (c+1)]++
    }
}

END {
    # Report the delimiter/column-count pair seen on the most lines
    for (x in cnt) {
        if (max == "" || cnt[x] > max) {
            max = cnt[x]
            est = x
        }
    }
    print est
}
  2. Run the script as follows:
awk -f delim_check.awk <file_to_check>

This prints the delimiter type (tab, pipe, or comma) followed by the number of columns.
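
For example, on a small tab-separated file:

printf 'a\tb\tc\n1\t2\t3\n' > sample.tsv
awk -f delim_check.awk sample.tsv
# prints: tab 3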

How to verify data integrity using md5sum?

Syntax

md5sum -c <list_of_files.md5>

Example: FASTQ files downloaded from SRA.

  1. Compute checksums for the FASTQ files and store them in a ".md5" file:
md5sum *.fastq.gz > md5sum_check.md5

md5sum_check.md5 now contains one checksum per line, followed by the corresponding file name.

  2. Verify the files:
md5sum -c md5sum_check.md5

Results: files reported as "OK" passed the integrity check.
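
Because md5sum -c exits non-zero when any file fails, the check can gate a pipeline step; a minimal sketch:

if md5sum -c md5sum_check.md5; then
    echo "All files intact, safe to proceed"
else
    echo "Checksum mismatch detected, re-download the affected files" >&2
    exit 1
fi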

How to convert multi-line FASTA to single-line FASTA?

Usage:

multi2singlefa.pl <input_multi.fasta> > <output_single.fasta>

Perl script

Save the following code as multi2singlefa.pl:

#!/usr/bin/perl -w
use strict;

# Usage: multi2singlefa.pl <input_multi.fasta> > <output_single.fasta>
my $input_fasta = $ARGV[0];
open(my $in, '<', $input_fasta) or die "Error opening $input_fasta: $!";

# Print the first line (expected to be a header) as-is
my $line = <$in>;
print $line;

while ($line = <$in>) {
    chomp $line;
    if ($line =~ m/^>/) {
        # New header: terminate the previous sequence, then print the header
        print "\n", $line, "\n";
    }
    else {
        # Sequence line: print without a newline to join wrapped lines
        print $line;
    }
}

# Final newline after the last sequence
print "\n";
close($in);
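
The same conversion can also be done with a short awk one-liner (an equivalent sketch, not part of the original script):

awk '/^>/ {if (seq) print seq; print; seq=""} !/^>/ {seq = seq $0} END {if (seq) print seq}' input_multi.fasta > output_single.fasta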

How to convert BAM to FASTQ and split into paired reads?

Step 1: Convert BAM to FASTQ

Use samtools to convert your BAM file to a FASTQ file.

samtools bam2fq SAMPLE.bam > SAMPLE.fastq

Step 2: Split Paired-End Reads

With the command above, paired-end read names typically have /1 or /2 added to the end. To split a single FASTQ file of paired-end reads into two separate files:

  1. Extract Reads Ending with /1 (Forward Reads):
cat SAMPLE.fastq | grep '^@.*/1$' -A 3 --no-group-separator > SAMPLE_r1.fastq
  2. Extract Reads Ending with /2 (Reverse Reads):
cat SAMPLE.fastq | grep '^@.*/2$' -A 3 --no-group-separator > SAMPLE_r2.fastq
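
Alternatively, recent samtools versions can split pairs directly. This sketch name-sorts the BAM first (samtools sort -n) so that mates are grouped, then writes the two files in one pass:

samtools sort -n SAMPLE.bam -o SAMPLE_namesorted.bam
samtools fastq -1 SAMPLE_r1.fastq -2 SAMPLE_r2.fastq -0 /dev/null -s /dev/null -n SAMPLE_namesorted.bam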
