The Millipede Dataset – Digital Scholarship Projects, CUHK Library

The Millipede Genomes Dataset

In this project, we reuse the dataset from the paper "Millipede genomes reveal unique adaptations during myriapod evolution" (Qu et al., 2020)^[1], deposited in the CUHK Research Data Repository^[2]. We use the Millipede Genomes dataset to demonstrate how the de Bruijn graph algorithm we have developed can be applied to real-life data.

We use the file Trigoniulus_corallinus_genomic.fna, which contains ~9,000 gene sequences. The analysis was run with k=37, a value commonly used in practical genome assembly. The reasoning is statistical: with short reads that are usually between 100 and 200 characters long, the probability that two unrelated sequences share the same slice decreases as the slice length k increases. With k=37, whenever two sequences share a slice of length 37, they were very likely taken from the same contiguous segment of the original genome. At the same time, this length is short enough to limit the impact of noise. Because short reads contain noise and missing pieces, it is normal to obtain multiple contigs in the result.
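The role of k can be illustrated with a minimal sketch of de Bruijn graph construction. This is not the project's implementation: the function names `kmers` and `build_de_bruijn` and the toy reads are hypothetical, and k=4 is used only so the example stays readable (the analysis above uses k=37).

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every length-k slice of seq, from left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it.

    Every k-mer found in a read contributes one edge:
    its first k-1 characters -> its last k-1 characters.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Hypothetical toy reads that overlap in "GGCGT".
toy_reads = ["ATGGCGT", "GGCGTGC"]
toy_graph = build_de_bruijn(toy_reads, k=4)

for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Because the two toy reads share k-mers, their edges join into a single path through the graph. Reads that share no k-mer would produce disconnected paths, and therefore separate contigs.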

Here is the assembly result using 100 reads from the file Trigoniulus_corallinus_genomic.fna in the dataset.

In the figure, the bottom row shows our assembled results, called contigs, ranked from longest to shortest. The top row shows the short reads that contributed to each of those contigs. Each colored segment represents a separate contig string.

Fig. 1. The short reads-contigs mapping generated by running the de Bruijn graph algorithm on the first 100 short reads in the dataset.
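Selecting the first 100 sequences from a FASTA file such as Trigoniulus_corallinus_genomic.fna could look like the sketch below. It assumes a plain-text FASTA layout (a `>` header line followed by one or more sequence lines). The helper names `read_fasta` and `first_n_sequences` are hypothetical and are not taken from the project code.

```python
from itertools import islice

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are joined; case is normalised.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def first_n_sequences(path, n=100):
    """Return the first n sequences (without headers) stored in a FASTA file."""
    with open(path) as handle:
        return [seq for _, seq in islice(read_fasta(handle), n)]
```

For example, `first_n_sequences("Trigoniulus_corallinus_genomic.fna", 100)` would return the sequences used as input for Fig. 1, assuming the file sits in the working directory.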

If we focus on the first 3 read-to-contig mappings, we can observe that every short read maps one-to-one to one of our final assembled contigs. This indicates that none of these short reads shares an overlapping region of length 37 with any other short read. This strongly suggests that the millipede genome dataset is a set of already assembled contigs, rather than a bunch of raw short reads.
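The observation that no two sequences overlap by 37 characters can also be checked directly, without building the graph. The sketch below indexes every k-mer by the sequences that contain it and reports the pairs that share at least one k-mer. The function name `shared_kmer_pairs` and the toy data are hypothetical, and reverse complements are not considered.

```python
from collections import defaultdict
from itertools import combinations

def shared_kmer_pairs(reads, k):
    """Return sorted index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    owners = defaultdict(set)  # k-mer -> indices of the reads that contain it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            owners[read[i:i + k]].add(idx)
    pairs = set()
    for idxs in owners.values():
        pairs.update(combinations(sorted(idxs), 2))
    return sorted(pairs)

# Hypothetical toy data: reads 0 and 1 overlap, read 2 is unrelated.
toy = ["ATGGCGT", "GGCGTGC", "TTTTAAA"]
print(shared_kmer_pairs(toy, 4))  # [(0, 1)]
print(shared_kmer_pairs(toy, 7))  # []
```

An empty result for k=37 on the sequences in the file would be consistent with the one-to-one mapping seen in the figures.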

Fig. 2. A close-up of the first 3 short reads in Fig. 1.

We can observe the same situation with other short reads in the file. Fig. 3 shows the contigs generated from the second hundred and third hundred sequences in the dataset.

Fig. 3. Short reads-contigs mapping generated using the 101st-200th and 201st-300th short reads in the millipede genome dataset.

^[1] Qu, Z., Nong, W., So, W. L., Barton-Owen, T., Li, Y., Leung, T. C. N., Li, C., Baril, T., Wong, A. Y. P., Swale, T., Chan, T. F., Hayward, A., Ngai, S. M., & Hui, J. H. L. (2020). Millipede genomes reveal unique adaptations during myriapod evolution. PLOS Biology, 18(9), e3000636. doi: 10.1371/journal.pbio.3000636.

^[2] Qu, Zhe; Nong, Wenyan; So, Wai Lok; Barton-Owen, Tom; Li, Yiqian; Leung, Thomas C N; Li, Chade; Baril, Tobias; Wong, Annette Y P; Swale, Thomas; Chan, Ting-Fung; Hayward, Alexander; Ngai, Sai-Ming; Hui, Jerome H L, 2021, “Millipede genomes”, https://doi.org/10.48668/LQYII3, CUHK Research Data Repository, V1.