Applying De Bruijn Graph to the Millipede Dataset

Applying de Bruijn Graph to the Millipede Genome Dataset

Applying the de Bruijn graph algorithm onto the millipede dataset generates a corresponding de Bruijn graph, which is used for tracing and reconstructing a fitting sequence. The sequence links the short reads that have overlapping area at its head/tail.

In this project website, we have demonstrated how a contig is formed from only one short read in the dataset. How does the underlying de Bruijn graph look like? What causes the contigs to consist of only 1 short read?

In this project, we implemented a de Bruijn graph visualization tool on the Digital Scholarship Tools platform (to be available soon), allowing users to understand the de Bruijn graph algorithm interactively with the pre-installed sample datasets or a user supplied dataset. After a dataset is chosen, user can choose a suitable hyper-parameter k for the system to run the de Bruijn graph algorithm and generate a set of contigs. User can also choose a layout for the de Bruijn graph visualization. 

Two shorter sequences from the millipede genomes dataset are chosen to demonstrate the application of the algorithm on real life data.

>JAAFCE010000011.1 Trigoniulus corallinus isolate tc20190418 Scaffold_14, whole genome shotgun sequence GACTGTTTTGGCTTTTGCCGCTTTTATTGTTATGTCTGTTTTGGCTTTTGCCACTTTTATTGTTATGTCT GTTTTGGCGTTTGCCACTTATATTGTTATGTCTGTTCTCATTTTTGCTGCCTTTTCTTTTATTTCTATTT TGTCTTTTGCCTATAATTATTCCTTGGTATTTATTTGGCTATTGCCTTTGTACATGTTAGTTTTAGTTGG CTTTTGCCATTATTTATTGCCATCATGGTGGAAAGGCCATTTACACCCCCACCATGATGATTGTCTTTCT GGGCCCTAATGACCATTGTAGTTGATGGGCCCTAAAACATCAACAAACAAACAAACAAACAAGCAACAAC AAACAACAACAACAAACAGTGCCAGAGCCAAGTTTAAATTTAATTCTTGCAGCCTTACCCAAAACCAGTA CCAATAGGTCATTGTACCAACAAGAGTTCAGACGCCTCTCTTGGACCTATCGAGGCCATGTTTTTATTTA CACGGATGGGTCGAAGGATGGTTCCTATGTGGGTTGGTCAGTCTTTACAACTGGTCATCTCATACGGGGC CGCCTTCCAGGCGTTGCCAGTGTCTTAACTGCAGAACTCACAGCTCTGCACAGGGCACTGGAACTCATCC TAGTATTGCATAACAGACACTTTGTTGTTGCGACTGACTCCCTATGCTCCCTTCAGCTGTTGCACACCCT GTATGGCAGACATGCCATCGTCCAGGGTGTGCACAGATTGCTGCTTGTGATTCTTAAGCAGGGCAAGACA GTGTTGTTTATGTGGGTCCCGGGGCATGTTGGCATCGTGGGCAACGAGATGGCAGATGCCCATGCCCGTC AGGTGGCCAACATGCCAGAAATTCCTGACACTGCTGTCCCCTCCTCCGACGCTCTACCGATGGTGCGGCG GGCCGTCCATGACCAATGGTGGGCCCAGTGGTTTGGGGAGACCACGAACAAACTCCACACACTGAAGGAC ACTCCAGTGCACTGGGTGTCCTCATGCCGCCAATCGCGGAGGGAGGAGGTTGTGCTTGCTCGGCTGCGAA TCGGTCACACCTTCCTCACCCACTCCTACATTCTACGGGTTGAGGATCAGCCATGGTGTGACTTCTGTTG GGAGCCTCTCACAGTGGCGCACATCTTTAGTCACTGTGAGAGATACCGAGAACTTAGGTGGCGACTGGGA GTCAGCGCAGAGCTCTCCACATCACTTGGGGATAACCCTCAGGCAGTGGAGAGCTCGCTAGTATTTGTGA 
AAGAGGCTGGCCTCTTTCA 

>JAAFCE010000012.1 Trigoniulus corallinus isolate tc20190418 Scaffold_15, whole genome shotgun sequence AGTTAACGCACAACACTTATGGACGACCCTTGACAGATGCTACACTAACAGACGGCAGACCTCCATGGCA ACTTTATTTAGAATCAATACCAATTGCACACACAACAAAAAAACGATAAAAAAAAGGGGAAGAACCCCGT CAAAAAGTAAAATTTTGCTACTAACGTGGTCACCAAATCCTAATGAGAAAAACAGTAAAAAAGGAAGATA AAAAACGTTAAAATATTACCCAAAAAACATAAAAATACAAGCAACGACTTCACATGCATATGCATACACC AATGTTGCCAGTAGATTCAATGCTGTAAGCATGTCACATGGATGCTTCTTTTTCTCAAAAACTATAAAAG CTACAGACATGCCGTTAGGAAACTTTGTAGAGTACAACATTCTCTACATTCTGGCCAAGATTTAGCGATC TGTAATGTATAGAAGAGAAAATTTAGCACAAATAAAGATTTTCTCATGCTTTTCTATGGGAGAACATTCG TTTTTGCAATATCTCAAAAACTCTTTAAGATAGAGCTTTAAAACTTAGCAAATGTGTAGAAAACCGCCTA TCTTACCTACTTGGAAAGTTTTACACATGTGGTTTTCATAAAAGTTTAGAAATTTCAAAAAACATGTGTA ATGACAAAAAGTGCTGGAAATCGCTTTTCTGCTCAATGAGTAAGCATGTCACCTAGACGATTCTTTTTCT CAAAAACTATATAAGCTACAGACATGCCGTTTGGCATCGTTGTAGAGTACCACATTCTTAACATTCTGGC CATGATATAGCGTTCTGTGATGTATAGAAGAGAAAATATCGCACAAAACATGATTTTCACATACTTTTTT ATGGGAGAACATTCGTTTTTGCAATATCTCAAAAACTATTAAAGATAGAGCTTTAAAACTTAGCAAAAGT GTAAAAGACCACTTATTTACCTTACTGGCAAAGTTTTATACATGTGGTTTCCATTGAAGTGAAGAAATTT CACAAAAGATGTGTAATGACAAAAAGTGCTGGAAATCGTGAAAAATCACTTCTTTCCTCAATGAGTAAGC ATGTCACTTAGATGCTTCTTGTTCTCAAATTCTATAAAAGCCACAGACATGCCGTTTGGCAACGTTGTAG AGTACCACATTCTCTAAATTCTGGCCAAGATTAATCGTTCTGTGATTTATAGAAGAAAAAATATCG

Two different representations were generated by the visualization tool. Except fo the de Bruijn graph, a short reads-contigs mapping is also created to show the mapping relationship between the provided short reads in the dataset and the output contigs. Mouse hovering the contigs shows the sequences.

With these graphs, the questions raised above can be answered. In order to have a clear image, we have limited the de Bruijn graph to be formed by 2 short reads and discarded the sequence labels in the graph.

k=37

Fig. 1. The de Bruijn graph generated by 2 short reads from the dataset with k=37.

Video. 1. The expansion process of the de Bruijn graph in Fig. 1. 

Fig. 1 shows that nodes of the 2 sequences are all crumbled together. This is caused by the two unrelated sequences having one or more identical sub-sequence of length=37. Although the algorithm can still reconstruct the sequences, it is possible for the algorithm to follow a wrong path and give a wrong reconstruction.

A common method addressing this issue is to increase the value of k, reducing the likelihood of two unrelated sequences having the exact same sub-sequence of length k. This eliminates the possibility of sharing the same nodes in the two sequences. The following de Bruijn graph generated with k=49 shows that the sequences are now separated into two.

Fig. 2. The de Bruijn graph generated by two short reads from the dataset with k=49, showing that they form a pair of disconnected graphs.

Video. 2. The expansion process of the de Bruijn graph in Fig. 2. 

From Fig. 2, we can see that there are two disconnected graphs where the two components are from the two distinct sequences inputted. When we trace the graph, we can never reach the components from the other graph. This shows the two sequences have no overlapping area when k = 49 and the algorithm can generate two contigs which are identical as the input.