Methodology (Code Development)
To test whether the DBG assembler works well, two sequences were used as a trial. The assembler normally gives reasonable results with short reads (around 150 base pairs) because the path is short. We used long sequences from the dataset to ensure that it also functions with long paths. Sequences with end-to-end overlaps were targeted so that the sequence expected from the DBG assembly could be predicted in advance.
1. BLASTN
The Nucleotide Basic Local Alignment Search Tool (BLASTN) finds regions of local similarity by comparing nucleotide query sequences against a sequence database. We ran BLASTN locally using the following command lines (CML):
# make database
$ makeblastdb -in Helicorthomorpha_holstii_genomic.fna -dbtype nucl -parse_seqids
# run blastn
$ blastn -query try1.fasta -db Helicorthomorpha_holstii_genomic.fna \
    -outfmt "6 qseqid qstart qend qlen sseqid sstart send slen pident length evalue bitscore sstrand" -out output.csv
Subsets were created from the large millipede genome dataset. The following criteria were set to select sequences with end-to-end overlaps:
– pident (percentage of identical matches) = 100; AND either
– sstart (start of alignment in subject) = 1 AND qend (end of alignment in query) = qlen (query sequence length), i.e. the end of the query overlaps the start of the subject; OR
– qstart (start of alignment in query) = 1 AND send (end of alignment in subject) = slen (subject sequence length), i.e. the start of the query overlaps the end of the subject
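The criteria above can be applied to the tabular BLASTN output with a short script. The following is a minimal sketch rather than the code used in this work: it assumes the column order given in the -outfmt string above and relies on the fact that -outfmt 6 writes tab-separated values. The function names is_end_to_end and filter_hits, and the example rows, are hypothetical.

```python
import csv
import io

# Column order assumed to match the -outfmt string used for blastn.
COLUMNS = ["qseqid", "qstart", "qend", "qlen", "sseqid", "sstart", "send",
           "slen", "pident", "length", "evalue", "bitscore", "sstrand"]


def is_end_to_end(hit):
    """Return True if a hit is 100% identical and overlaps end to end."""
    if float(hit["pident"]) != 100.0:
        return False
    # End of the query aligned to the start of the subject.
    query_end_on_subject_start = (
        int(hit["sstart"]) == 1 and int(hit["qend"]) == int(hit["qlen"])
    )
    # Start of the query aligned to the end of the subject.
    query_start_on_subject_end = (
        int(hit["qstart"]) == 1 and int(hit["send"]) == int(hit["slen"])
    )
    return query_end_on_subject_start or query_start_on_subject_end


def filter_hits(handle):
    """Read tab-separated BLASTN rows and keep the end-to-end overlaps."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [row for row in reader if is_end_to_end(row)]


# Made-up rows for illustration only (not real BLASTN results).
example = io.StringIO(
    "q1\t91\t150\t150\ts1\t1\t60\t200\t100.000\t60\t1e-25\t111\tplus\n"
    "q1\t1\t60\t150\ts2\t141\t200\t200\t100.000\t60\t1e-25\t111\tplus\n"
    "q1\t10\t60\t150\ts3\t1\t51\t200\t98.000\t51\t1e-18\t90\tplus\n"
    "q2\t20\t80\t150\ts4\t30\t90\t200\t100.000\t61\t1e-25\t113\tplus\n"
)
hits = filter_hits(example)
print([hit["sseqid"] for hit in hits])
```

For a real run, the file written by blastn can be passed instead of the example rows, e.g. with open("output.csv", newline="") as handle. Note that a sequence aligned against itself also satisfies these criteria, so such self-hits may need to be excluded separately.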
2. K-mer optimization
With the modified DBG code, the k-mer size suitable for a specific set of data can be found automatically. K-mer sizes from 2 to 100 were examined, and the length of the assembled sequence was reported for each k-mer size. The k-mer size that gave the longest sequence was reported as the best k-mer, and its corresponding sequence was taken as the final output.
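The scan over k-mer sizes can be illustrated with a small sketch. This is not the modified DBG code itself, which is not shown here; it is a simplified stand-in that builds a de Bruijn graph from distinct k-mers and spells out a path by greedily following unused edges. All function names (build_graph, walk, assemble, find_best_k), the toy reads, and the tie-breaking rule (the smallest k wins when lengths are equal) are assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Build a de Bruijn graph: each distinct k-mer links two (k-1)-mers."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # count each k-mer once, even in overlaps
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unused edges greedily from start and spell out the sequence."""
    used = set()
    node, seq = start, start
    while True:
        nxt = next(
            (t for t in graph.get(node, []) if (node, t) not in used), None
        )
        if nxt is None:
            return seq
        used.add((node, nxt))
        seq += nxt[-1]
        node = nxt


def assemble(reads, k):
    """Return the longest sequence spelled from any start node for this k."""
    graph = build_graph(reads, k)
    if not graph:
        return ""
    indegree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1
    # Prefer nodes without incoming edges; otherwise fall back to the first.
    starts = [n for n in graph if indegree.get(n, 0) == 0] or [next(iter(graph))]
    return max((walk(graph, s) for s in starts), key=len)


def find_best_k(reads, k_min=2, k_max=100):
    """Try every k in the range and keep the one giving the longest sequence."""
    best_k, best_seq = None, ""
    for k in range(k_min, k_max + 1):
        seq = assemble(reads, k)
        if len(seq) > len(best_seq):  # ties keep the smaller k (assumption)
            best_k, best_seq = k, seq
    return best_k, best_seq


# Toy reads with a 7-base end-to-end overlap (illustrative only).
reads = ["ATGGCGTGCA", "GCGTGCAATC"]
best_k, best_seq = find_best_k(reads)
print(best_k, len(best_seq), best_seq)
```

In this sketch, very small k-mer sizes collapse repeated k-mers into a tangled graph, while k-mer sizes larger than the overlap leave the two reads disconnected, so the longest sequence appears at an intermediate k.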
3. Checking
The DBG result was aligned with the predicted sequence. We used BLASTN and Jalview to check whether the two sequences matched 100%. A 100% identity indicates that the DBG assembler remains accurate even when long sequences are examined. Otherwise, the assembler may produce wrong paths when applied to large data sets.
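Before running BLASTN or opening the alignment in Jalview, a quick exact-match test can flag obvious mismatches. The sketch below is an optional pre-check and not part of the procedure described above; the function names are hypothetical, and accepting a reverse-complement match is an assumption that may not apply to every data set.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def matches_prediction(assembled, predicted):
    """True if the assembly equals the prediction on either strand."""
    a, p = assembled.upper(), predicted.upper()
    # Assumption: a match on the opposite strand also counts as identical.
    return a == p or reverse_complement(a) == p


print(matches_prediction("ATGGCGTGCAATC", "ATGGCGTGCAATC"))
```

A False result only says that the sequences differ somewhere; locating the difference still requires the alignment.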