Acceleration of Locality-Sensitive Hashing for Genome Assembly

High-throughput sequencing technologies have substantially changed the way biological research is performed since the early 2000s. These technologies obtain millions of short fragments (sequences) of DNA from a living organism in order to reconstruct the organism’s DNA blueprint (genome). Thanks to these DNA sequencing platforms, we can now investigate human genome diversity between populations, find genomic variants that are likely to cause disease, and even investigate the genomes of ancient hominids.

Algorithms for de novo genome assembly from third-generation (long-read) sequencing technologies have cost and production-time bottlenecks that prevent efficient scaling to large cohorts. De novo assembly of a human genome from nanopore long reads has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid de novo genome assembly, the most time-consuming tasks need to be accelerated in hardware.

In this project, we design a hardware accelerator for the first part of the de novo assembly workflow, which consists of the following steps:

Run-length encoding (RLE) of reads
Marker extraction and marker representation of RLE-encoded reads
MinHashing (calculating hash signatures of RLE-encoded reads in marker representation)
Locality-sensitive hashing (LSH) of read signatures
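
To make the four steps concrete, here is a minimal software sketch in Python. It is only an illustration under stated assumptions, not the project's design: the project text does not define markers, so the sketch assumes a marker is a k-mer of the RLE sequence (optionally restricted to a preselected subset), and it assumes the common banding scheme for locality-sensitive hashing. All function names and parameters (k, num_hashes, bands, rows) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def run_length_encode(read):
    """Step 1: collapse runs of identical bases, e.g. 'AAACCG' -> ('ACG', [3, 2, 1])."""
    bases, counts = [], []
    for base in read:
        if bases and bases[-1] == base:
            counts[-1] += 1
        else:
            bases.append(base)
            counts.append(1)
    return "".join(bases), counts


def extract_markers(rle_bases, k, marker_set=None):
    """Step 2 (assumption): markers are k-mers of the RLE sequence.

    If marker_set is given, only k-mers in that set are kept.
    """
    kmers = [rle_bases[i:i + k] for i in range(len(rle_bases) - k + 1)]
    if marker_set is None:
        return kmers
    return [kmer for kmer in kmers if kmer in marker_set]


def seeded_hash(item, seed):
    """64-bit hash of a marker; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(markers, num_hashes):
    """Step 3: for each hash function, keep the minimum hash over the read's markers."""
    distinct = set(markers)
    if not distinct:
        raise ValueError("read has no markers")
    return [min(seeded_hash(m, seed) for m in distinct) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands, rows):
    """Step 4 (assumed banding scheme): reads sharing any whole band become candidates."""
    buckets = defaultdict(set)
    for read_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(read_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy reads: r1 and r2 differ only in homopolymer run lengths, r3 is unrelated.
reads = {"r1": "AAACCGTTTACGGA", "r2": "ACCCGTACGGGA", "r3": "TTTTGGGGCCCCAAAA"}
signatures = {}
for read_id, read in reads.items():
    rle_bases, _ = run_length_encode(read)
    signatures[read_id] = minhash_signature(extract_markers(rle_bases, k=3), num_hashes=8)
candidates = lsh_candidate_pairs(signatures, bands=4, rows=2)
```

In this toy input, r1 and r2 have the same RLE sequence, so their signatures are identical and LSH reports them as a candidate pair without comparing every read against every other read. The inner loops over hash functions and bands are independent of each other, which is the kind of regular parallelism a hardware accelerator could exploit.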

This is a research project in a very new field of study, and it may lead to further research and scientific publications.

What will we do and learn in the project?

Develop a novel algorithm for the acceleration of DNA / genome assembly
Implement the algorithm in a high-level programming language
Any ideas you will have along the way…

Requirements

Ability to work independently and venture into uncharted territory
Desire to innovate and try new things
No prior VLSI design knowledge is required; this project is an opportunity to learn the subject.