Introduction

Novel De Bruijn Graph Assembler for Millipede Genomes

Data Analytics Practice Opportunity 2022/23

Background

Studying the genome sequences of different species is a central goal in biology. It helps researchers understand evolution and how genes (the basic units of inheritance) relate to biological systems. By comparing sequences from different organisms, researchers can search for genes that may be linked to diseases. This offers leads for developing new approaches to treating human disease and thereby improving human health.

Genome assembly (the computational process of reconstructing the original sequence from a large number of short DNA sequences, called reads) is required to obtain the sequence of a whole genome. It generally includes read preprocessing, contig construction, scaffold assembly, gap filling and quality assessment. Among these steps, contig construction largely determines the quality of the final assembly. The De Bruijn Graph (DBG) algorithm is a popular modern method for forming contigs from sequencing data: reads are broken into overlapping substrings of length k (k-mers), and contigs are read off as paths through the graph that the k-mers form. It is commonly used in de novo genome assembly, that is, assembly without a reference genome. In this project, the millipede genomes dataset is used for de novo assembly.
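To make the contig construction step concrete, the following is a minimal sketch of a De Bruijn graph built from a few toy reads. It is not the project's assembler: the function names build_de_bruijn and walk_contig are hypothetical, the reads are invented for illustration, and the walk simply follows nodes with a single outgoing edge, ignoring sequencing errors, reverse complements and in-degree checks that a real assembler would need.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a De Bruijn graph: each distinct k-mer becomes one edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # each distinct k-mer contributes a single edge
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_contig(graph, start):
    """Extend a contig from `start` while the current node has exactly
    one successor. Stops at a branch, a dead end, or a revisited node."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # avoid looping forever on a cycle
        visited.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


# Toy reads (hypothetical) that overlap along the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk_contig(graph, "ATG"))  # ATGGCGTGCA
```

With k = 4, the three overlapping reads share enough k-mers for the walk to recover the full ten-base sequence. Shorter k values tolerate shorter overlaps but create more ambiguous branches in repetitive regions, which is one reason the choice of k matters.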

Motivation

Genome sequence information is widely used today, for example in disease diagnosis and environmental sample analysis. Building the sequence databases behind these applications relies on genome assembly. However, assembly is a complicated process because of the large datasets involved, so accuracy and efficiency are the crucial factors for a powerful assembler.

Here, we aim to develop a DBG assembler with high accuracy and efficiency. Basic DBG code is easy to find on the internet, but it is not sufficient for handling large volumes of raw sequencing data. To improve the assembler, we modified the DBG code to use multiple k-mer sizes and parallel computation. Different assembly methods were then applied to the millipede genomes dataset for comparison.
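As a rough illustration of combining multiple k-mer sizes with concurrent execution, the sketch below counts k-mers for several values of k at once. The project's actual parallel strategy is not described in this section, so everything here is an assumption: the function names count_kmers and count_for_many_k are hypothetical, the reads are toy data, and a thread pool from Python's standard library stands in for whatever parallel mechanism the assembler uses.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def count_for_many_k(reads, k_values, max_workers=4):
    """Run one counting task per k value and return {k: Counter}.
    Assumption: a thread pool is used here only for simplicity."""
    k_values = list(k_values)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda k: count_kmers(reads, k), k_values)
        return dict(zip(k_values, results))


# Toy reads (hypothetical), counted for k = 3 and k = 4.
reads = ["ATGGCG", "GGCGTG"]
tables = count_for_many_k(reads, [3, 4])
print(tables[3]["GGC"], tables[4]["GGCG"])  # 2 2
```

In CPython, threads give little speedup for CPU-bound counting because of the global interpreter lock. concurrent.futures.ProcessPoolExecutor offers the same map interface and runs tasks in separate processes, but it requires a picklable top-level function in place of the lambda used above.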

Acknowledgement

Special thanks to Professor Jerome HL Hui’s Research Group for providing the millipede genomes dataset, and to student John Ching Fung YEUNG for conducting the project "De Bruijn Graph in Genome Assembly with Millipede Genomes Dataset" in the Data Analytics Practice Opportunity 2021/22.

Project Team

This project was conducted by a group of students in the Data Analytics Practice Opportunity 2022/23:

  • Anson Tsun On KWOK (CSCI/3)
  • Katie Tsz Yan MOK (BCHE/3)