The De Bruijn graph algorithm
Genome assembly can be summarized as ‘concatenating strings that have overlapping substring of certain length in one’s end and one’s beginning’. Considering the large amount of short reads that are available, we also need a fast method.
There are many proposed methods for genome assembly. The most powerful method among them is the de Bruijn graph, which is also used in state-of-the-art sequence assembly machines.
De Bruijn graph is a directed graph formed by breaking the input sequence(s) into nodes labeled by k-mers (sub-sequences of length k where k is a hyper-parameter), each edge of which indicates a connection between the sub-sequences in one of the reads.
For example, with the input short read is ‘ATCG’ and k equals to 3, we are going to slice the short read into 3-mers, which are
‘ATC’ and
‘TCG’ respectively, putting a directed arrow indicating their ordering relationship, the De Bruijn graph created will look like this:
Fig. 1. The nodes and its connection in de Bruijn graph, generated from the sequence ‘ATCG’.
To Concatenate the short reads, we can utilize the de Bruijn graph. Suppose we have 2 short reads–
‘ATCGGAC’ and
‘CGGACTGT’, and we set the hyper-parameter k as 4.
Fig. 2. De Bruijn graph generated from the 2 sequences: ‘ATCGGAC’ and ‘CGGACTGT’.
Example Usage of the De Bruijn Graph Algorithm
Traversing the graph, we can obtain the final sequence by appending the last character of the next node. The resulting string is ATCGGACTGT.
This method can work on any number of short reads. The quality of results tends to increase as k increases because with the longer the sub-string, the less likely the 2 short reads having the exact same sub-sequence by random chances, instead of them being a continuous part in the original DNA sequence.
Here is an example of the algorithm. The line plot is the alignment between the input short reads and the output contigs, which is located at the bottom line.
a = 'ATCGGAC'
b = 'CGGACTGT'
c = 'TCTTTCTCGC'
d = 'CTCGCTACCCCCG'
reads = [a, b, c, d]
contigs, vertices, edges = assembly(reads, k=4) plot_contigs(contigs)
Fig. 3. De Bruijn graph generated with k=4, and the corresponding short reads-contigs mapping from the 4 input sequences a, b, c and d as defined in above code.
The choice of k is important in the algorithm. If k is too small, the nodes from different sequences could look the same in the de Bruijn graph. For instance, the node TCG appears in short reads a, b, c in the sample, causing the 2 lines we see in fig. 3 to merge together, although short read a is unrelated to b, c. This could lead to potential mis-assembling unrelated segments of short reads.
Fig, 4. De Bruijn graph generated using the same sample dataset with k=3.
Here is another example applying on the sample DNA string data achieved from a github repository. We can generate a de Bruijn graph for reconstructing a contig by tracing the graph with the assembly algorithm described above.
Fig. 5. De Bruijn graph generated with k=12 and a short reads-contigs mapping from the sequences in the g2000reads.fa file retrieved from the above mentioned github repository.