Methodology (Application)

FASTQ file is the raw data of sequencing. It stores the nucleotide sequence and its corresponding quality score in a text format. In this project, we applied the novel DBG assembler on the data Trigoniulus_corallinus_genomic.fastq (29.0 GB), which is provided in the millipede genomes dataset.

1. Data preprocessing

The raw data Trigoniulus_corallinus_genomic.fastq is a large file. Each read in the file has 302 bp with two N (an unidentified nucleotide due to basecalling problem during sequencing). The number of reads was counted by applying the following CML:

grep -c "^@" Trigoniulus_corallinus_genomic.fastq 
#> 1593150 reads

a. Trim to 150 bp and remove N – CML

In the raw FASTQ file, two sequencing reads in 151 bp are put together to form reads with 302 bp. Because of the large file size, only the first read of each data was collected. To obtain the reads, they were trimmed to 150 bp and Ns were cut.

seqtk trimfq -b 1 -e 151 Trigoniulus_corallinus_genomic.fastq > Trigoniulus_corallinus_genomic_150bp_withoutN.fastq

The options used in the command are:
b 1: Trim the first base of each read
e 151: Trim the last 151 bases of each read
Trigoniulus_corallinus_genomic.fastq: The input FASTQ file that will be trimmed
Trigoniulus_corallinus_genomic_150bp_withoutN.fastq: The output FASTQ file that will contain the trimmed reads
The trimmed reads will be saved in the output file.

b. Random sampling – CML

1000 reads were picked randomly from the trimmed file. After random sampling, the data for testing contained 1.5 x 10⁵ bp.

seqtk sample Trigoniulus_corallinus_genomic_150bp_withoutN.fastq 1000 > T_c_1000.fastq

c. Quality check – FASTQC

FASTQC, a FASTQ file quality check tool, was used to analyse the randomly sampled reads. It was to ensure the sequencing data were suitable for genome assembly. Per base sequence quality was examined. The higher the quality score, the lower the probability of incorrect basecalling is.

fastqc T_c_1000.fastq

d. Filtering – CML

To increase the accuracy of genome assembly, reads having bases with a Phred quality score lower than 30 (error rate < 0.001) were filtered out. This error rate is considered low enough for high-quality assembly. Meanwhile, a large enough number of reads are still included after the filtering step.

seqtk seq -q30 T_c_1000.fastq > T_c_1000_filter.fastq

e. FASTA file conversion – CML

FASTA file is required for further steps in genome assembly. Only the nucleotide sequences without quality scores are shown in the FASTA file. With its text format property, all the bases have to be recorded as uppercase or lowercase characters. Therefore, for better visualization, all the reads were converted to uppercase characters. The FASTQ file was then changed to a FASTA file.

seqtk seq -A T_c_1000_filter_U.fastq > T_c_1000.fasta

2. Reads clustering – CD-HIT

Cluster Database at High Identity with Tolerance (CD-HIT) is a program that compares protein or nucleotide sequences and clusters sequences with high similarity. Similar reads have higher probability of obtaining from the same or neighboring DNA fragments. Total CPU time 0.50 seconds was needed to perform CD-HIT on 1000 reads in FASTA file with 80% identity threshold.

#80%
cd-hit -i T_c_1000_filter_U -o T_c_1000_filter_CDH -c 0.8 -n 5 -T 0

i specifies the input file name.
o specifies the output file name.
c sets the sequence identity threshold, which in this case is 80% (0.8).
n sets the word length used for the sequence comparison, which in this case is 5.
T sets the number of threads to use, which in this case is 0 (meaning use all available threads).

3. DBG for each cluster – Python

With subsetting done, DBG was performed for each single cluster. The optimal k-mer size would be generated automatically and specific for the clusters. Multiple k-mer sizes were applied to the whole dataset.

DBG was applied twice in this step. DBG was first run for each cluster. After the generation of contigs from the available clusters, the contigs were used in DBG again. All the data was involved and the final contig, i.e. final results, could be observed.

Input the CD-HIT result file.
Change the name of the original FASTA file in the code for searching the sequences (The .clstr file only provides the ID of the reads).
Run the code for performing DBG with testing k-mers
Final contigs from all clusters are output as a FASTA file