Genome Assembly Results
A. Trials
Sequences from Helicorthomorpha_holstii_genomic.fna were used to test the DBG assembler. To examine the effect of k-mer optimization, both our modified DBG assembler and the previous DBG assembler from Data Analytics Practice Opportunity 2021/22 were applied. Two sequences in FASTA format, SczTNLB_3789 (1735 bp) [Sequence A] and SczTNLB_2600 (1066 bp) [Sequence B], were chosen as input for DBG assembly.
1. Without k-mer optimization
Four k-mer sizes (10, 15, 20 and 31) were manually selected and tested. In both cases, k-mer = 15 generated the shortest contig. None of the four k-mer sizes produced the ideal contig, which should be similar in length to the original sequence, with only end-to-end overlap. Selecting k-mer sizes manually for high-quality genome assembly is also time-consuming.
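One reason the k-mer size matters can be sketched on toy data: a smaller k makes it more likely that a (k-1)-mer repeats within the sequence, and every repeat creates a branching node that breaks the contig. The Python sketch below is not the assembler used in this report. The function names `build_dbg` and `branching_nodes`, the toy sequence and the small k values are all assumptions chosen for illustration.

```python
from collections import defaultdict

def build_dbg(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)

# Toy data (hypothetical), not taken from the genome used in this report.
sequence = "ATGGCGTACGTTAGC"
reads = [sequence[0:9], sequence[4:13], sequence[7:15]]

for k in (3, 4, 5):
    print(k, branching_nodes(build_dbg(reads, k)))
# prints:
# 3 2
# 4 1
# 5 0
```

In this toy case the graph becomes a single unbranched path only at k = 5, which is why a k-mer size that is too small yields short contigs.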
Modified DBG assembler
Among the k-mer sizes tested, k-mer = 31 gave the best contig. Fig. 1 shows that half of the expected sequence (the combination of Sequences A and B) appears in full in the result.
Fig. 1 Modified DBG on Sequence A and B without k-mer optimization
Previous DBG assembler
Using k-mer = 10, 20 or 31 generated identical contigs (100% identity). As Fig. 2 shows, the contigs had the same length (around 1200 bp) and the same nucleotide at every position.
Fig. 2 Previous DBG on Sequence A and B without k-mer optimization
2. With k-mer optimization
K-mer optimization was performed automatically, followed by DBG assembly. Both steps ran within a single Python script.
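The idea of optimizing k and then assembling in one script can be sketched as follows. This is a minimal illustration, not the script used in this report. It assumes that the best k is the one giving the longest contig, and that a contig is a maximal non-branching path of the graph. The names `longest_contig` and `optimize_k` and the toy reads are hypothetical.

```python
from collections import defaultdict

def longest_contig(reads, k):
    """Return the longest maximal non-branching path of the DBG for size k."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(set)      # (k-1)-mer -> set of successor (k-1)-mers
    indegree = defaultdict(int)   # (k-1)-mer -> number of distinct predecessors
    for read in reads:
        for i in range(len(read) - k + 1):
            prefix, suffix = read[i:i + k - 1], read[i + 1:i + k]
            if suffix not in graph[prefix]:
                graph[prefix].add(suffix)
                indegree[suffix] += 1

    def is_simple(node):
        # Exactly one way in and one way out: the path continues unambiguously.
        return indegree.get(node, 0) == 1 and len(graph.get(node, ())) == 1

    best = ""
    for start in set(graph) | set(indegree):
        if is_simple(start):
            continue  # contigs start only at sources and branch points
        for node in graph.get(start, ()):
            contig = start + node[-1]
            while is_simple(node):
                (node,) = graph[node]
                contig += node[-1]
            if len(contig) > len(best):
                best = contig
    return best  # isolated cycles are ignored in this sketch

def optimize_k(reads, k_values):
    """Return (k, contig) for the longest contig; ties go to the earlier k."""
    best_k, best = None, ""
    for k in k_values:
        contig = longest_contig(reads, k)
        if len(contig) > len(best):
            best_k, best = k, contig
    return best_k, best

# Toy data (hypothetical), not taken from the genome used in this report.
sequence = "ATGGCGTACGTTAGC"
reads = [sequence[0:9], sequence[4:13], sequence[7:15]]
best_k, contig = optimize_k(reads, range(2, 10))
print(best_k, len(contig))  # prints: 5 15
```

On this toy input, k values below 5 break the contig at repeated (k-1)-mers, while k values above 6 lose the overlaps between reads, so an intermediate k reconstructs the full sequence.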
Modified DBG assembler
K-mer sizes from 2 to 100 were examined, and the best k-mer was 30. Among the contigs from k-mer = 20, 25, 30 and 40, k-mer = 30 produced the longest sequence (Fig. 3). By a rough estimate, the optimal contig was the same as the expected sequence.
Fig. 3 Modified DBG on Sequence A and B with k-mer optimization
Fig. 4 DBG of the contig produced from k-mer = 30
To compare the results base by base, a sequence alignment was performed. It gave 100% consensus across all bases, indicating that the contig produced with the optimal k-mer of 30 perfectly matches the expected sequence (Fig. 5).
Fig. 5 Comparisons between the expected sequence and the contig from k-mer = 30 in a base-to-base format
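A base-to-base comparison of this kind can be approximated with a short check. The sketch below is a simplified, gap-free comparison of two sequences of equal length. It is not the alignment tool used for Fig. 5, and the function name `percent_identity` and the example sequences are assumptions.

```python
def percent_identity(expected, contig):
    """Percentage of positions where both sequences carry the same base."""
    if len(expected) != len(contig):
        raise ValueError("gap-free comparison needs sequences of equal length")
    if not expected:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(expected, contig))
    return 100.0 * matches / len(expected)

# Hypothetical example sequences.
print(percent_identity("ACGTAC", "ACGTAC"))            # prints: 100.0
print(round(percent_identity("ACGTAC", "ACGTTC"), 1))  # prints: 83.3
```

A result of 100.0 corresponds to the situation in Fig. 5, where every base of the contig agrees with the expected sequence. Sequences of different lengths would need a real alignment with gaps.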
B. Application
The dataset T_c_1000.fasta, generated by random sampling, was used for the following cases. The cases were compared to examine whether clustering facilitates high-quality genome assembly.
1. Without clustering + modified DBG
The best k-mer was 25, and the longest contig produced was 729 bp long. Because of the random sampling, a single whole contig could not be generated: the sampled sequences were too scattered, so overlapping regions were limited. DBG assembly took 586.5 seconds.
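The effect of random sampling on overlap can be illustrated with a small simulation. The sketch below assumes reads of fixed length drawn from uniformly random start positions. The actual procedure used to generate the dataset is not described here, so the function names, the read length, the read count and the random toy genome are all hypothetical.

```python
import random

def sample_reads(sequence, n_reads, read_length, seed=0):
    """Draw reads of fixed length from uniformly random start positions."""
    if read_length > len(sequence):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)
    starts = [rng.randint(0, len(sequence) - read_length) for _ in range(n_reads)]
    return starts, [sequence[s:s + read_length] for s in starts]

def coverage(sequence_length, starts, read_length):
    """Count how many reads cover each position of the sequence."""
    depth = [0] * sequence_length
    for start in starts:
        for pos in range(start, start + read_length):
            depth[pos] += 1
    return depth

# Hypothetical toy genome and sampling parameters.
rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(500))
starts, reads = sample_reads(genome, n_reads=20, read_length=50, seed=2)
depth = coverage(len(genome), starts, 50)
print("uncovered bases:", depth.count(0))
```

Any position with a depth of zero is a gap that no contig can cross, which is one way a randomly sampled dataset can fail to assemble into a single contig.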
2. Without clustering + previous DBG
k-mer = 25 was applied to the previous DBG assembler. The resulting contig was 634 bp long, and the run took 27.7 seconds.
3. With clustering + modified DBG
DBG was first run on the CD-HIT results, and all the contigs from the different clusters were then used to run DBG again. The best k-mer was 11, producing the longest contig at 865 bp. The first DBG run (using CD-HIT results) took 58.2 seconds, while the second (using the contigs from the first run) took 297.5 seconds.
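The figures reported for the modified DBG assembler with and without clustering can be compared with simple arithmetic. The sketch below uses only the values stated in this section. It covers the two DBG runs only, since the run time of CD-HIT itself is not reported here, and the variable names are illustrative.

```python
# Values reported in this section for the modified DBG assembler.
time_no_cluster = 586.5        # seconds, without clustering
time_cluster = 58.2 + 297.5    # seconds, first + second DBG run with clustering
len_no_cluster = 729           # bp, longest contig without clustering
len_cluster = 865              # bp, longest contig with clustering

time_saving = 100 * (time_no_cluster - time_cluster) / time_no_cluster
length_gain = 100 * (len_cluster - len_no_cluster) / len_no_cluster

print(f"{time_cluster:.1f} s in total, {time_saving:.1f}% less DBG time, "
      f"{length_gain:.1f}% longer contig")
# prints: 355.7 s in total, 39.4% less DBG time, 18.7% longer contig
```

Under these assumptions, the two DBG runs with clustering took about 355.7 seconds combined, compared with 586.5 seconds without clustering, while the longest contig grew from 729 bp to 865 bp.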