Introduction

De Bruijn Graph in Genome Assembly with Millipede Genomes Dataset

Data Analytics Practice Opportunity 2021/22

In modern biological studies, the DNA sequence of a target organism is often the primary reference. However, obtaining that sequence is not straightforward: in practice, the full DNA sequence of a sample cannot be read in a single pass. Instead, sequencing produces many short sub-sequences, called ‘short reads’, which are extracted from the original DNA sample, and the full sequence is reconstructed from them.

The following example demonstrates how to reconstruct the original DNA sequence from these short reads.
Suppose we have a target sequence:
‘ATCGGACTGTTTTATCTTTC’

And we obtain four short reads:
‘ATCGGAC’,               ‘TCTTTC’,
    ‘CGGACTGT’  and      ‘TTTCTCGC’.

After reconstruction, we should obtain the following results:
‘ATCGGACTGT’ and ‘TCTTTCTCGC’. These results are called contigs (contigs that are further ordered and linked across gaps are called scaffolds), and the process is called genome assembly. In real situations, it is normal to obtain multiple contigs rather than a single sequence, because of noise or missing pieces in the short reads.
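
The toy example above can be reproduced with a few lines of Python. This is only a minimal sketch of a greedy overlap merge, which is simpler than the de Bruijn graph method and is not the project's implementation. The function names `overlap` and `greedy_assemble` and the minimum overlap of 3 bases are assumptions chosen for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no remaining pair overlaps by at least min_len bases
    (min_len is an illustrative assumption, not a value from the project).
    """
    contigs = list(reads)
    while True:
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


reads = ["ATCGGAC", "TCTTTC", "CGGACTGT", "TTTCTCGC"]
print(greedy_assemble(reads))  # ['ATCGGACTGT', 'TCTTTCTCGC']
```

The first two reads overlap on ‘CGGAC’ and the last two on ‘TTTC’. No overlap of three or more bases links the two merged sequences, so the assembly ends with two separate contigs.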

In this project, we attempted to implement the de Bruijn graph method on the millipede genomes dataset to demonstrate DNA assembly.
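
To give a rough idea of the de Bruijn graph approach, the sketch below applies it to the four short reads from the example above. It is a simplified illustration under stated assumptions, not the code used in this project. The k-mer length k = 5 and the helper names `de_bruijn` and `contigs_from_graph` are chosen for illustration only. Each read is cut into overlapping k-mers, each k-mer becomes an edge between its (k-1)-base prefix and its (k-1)-base suffix, and contigs are read off along paths that do not branch.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: each k-mer is an edge prefix -> suffix.

    Nodes are (k-1)-mers. Duplicate k-mers are collapsed, so read
    multiplicity is ignored in this simplified sketch.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def contigs_from_graph(graph):
    """Spell out maximal non-branching paths as contigs.

    Isolated cycles are not reported in this simplified version.
    """
    indeg = defaultdict(int)
    for dsts in graph.values():
        for dst in dsts:
            indeg[dst] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # Exactly one incoming and one outgoing edge.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in nodes:
        if is_simple(node):
            continue  # paths start only at non-simple nodes
        for nxt in graph.get(node, ()):
            path = node + nxt[-1]
            while is_simple(nxt):
                nxt = next(iter(graph[nxt]))
                path += nxt[-1]
            contigs.append(path)
    return sorted(contigs)


reads = ["ATCGGAC", "TCTTTC", "CGGACTGT", "TTTCTCGC"]
graph = de_bruijn(reads, k=5)
print(contigs_from_graph(graph))  # ['ATCGGACTGT', 'TCTTTCTCGC']
```

With k = 5 the graph for these reads consists of two unbranched chains, which spell the same two contigs as in the example. The choice of k matters: with k = 4, for instance, the node ‘TCG’ gains two successors (‘CGG’ and ‘CGC’), so the paths branch and the contigs become shorter.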

Project Team

This project was conducted by John Ching Fung YEUNG (CSE/4) as part of the Data Analytics Practice Opportunity 2021/22.