High-throughput sequencing has substantially changed the way biological research is performed since the early 2000s. These sequencing technologies obtain millions of short fragments (sequences) of DNA from a living organism to generate the organism’s DNA blueprint (genome). Thanks to these new DNA sequencing platforms, we can now investigate human genome diversity between populations, find genomic variants that are likely to cause diseases, and even investigate the genomes of ancient hominids.
Algorithms for de-novo genome assembly from third-generation (long-read) sequencing technologies have cost and production-time bottlenecks that prohibit efficient scaling to large cohorts. De-novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid de-novo genome assembly, the most time-consuming tasks need to be accelerated in hardware.
In this project, we design a hardware accelerator for the first part of the de-novo assembly workflow, which consists of the following steps:
- Run-length encoding (RLE) of reads
- Marker extraction and marker representation of RLE-encoded reads
- MinHashing (calculating hash signatures of RLE-encoded reads in marker representation)
- Locality-sensitive hashing (LSH) of read signatures
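To make the four steps above concrete, here is a minimal software sketch in Python. It is not the project's actual algorithm or hardware design. The marker rule (keep a k-mer when its hash falls below a threshold), the parameter values (`k`, `fraction`, `num_hashes`, `bands`) and all function names are assumptions chosen for illustration only, and the LSH step uses the common banding scheme, which may differ from what the accelerator will implement.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def rle_encode(read):
    """Step 1: collapse runs of identical bases, e.g. 'AAACCG' -> ('ACG', [3, 2, 1])."""
    bases, counts = [], []
    for base in read:
        if bases and bases[-1] == base:
            counts[-1] += 1
        else:
            bases.append(base)
            counts.append(1)
    return "".join(bases), counts


def stable_hash(text, seed=0):
    """Deterministic 64-bit hash; the built-in hash() is salted per process."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(text.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def is_marker(kmer, fraction=0.5):
    """Assumed marker rule: keep a k-mer if its hash is in the lowest `fraction` of the range."""
    return stable_hash(kmer, seed=0xFFFF) < fraction * 2**64


def marker_representation(rle_bases, k=4, fraction=0.5):
    """Step 2: list the marker k-mers of an RLE-encoded read, in order of occurrence."""
    kmers = (rle_bases[i:i + k] for i in range(len(rle_bases) - k + 1))
    return [kmer for kmer in kmers if is_marker(kmer, fraction)]


def minhash_signature(markers, num_hashes=16):
    """Step 3: for each seeded hash function, keep the minimum hash over the read's markers."""
    unique = set(markers)
    if not unique:
        return []  # a read without markers gets an empty signature
    return [min(stable_hash(m, seed) for m in unique) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=8):
    """Step 4: banding LSH; reads that share any band bucket become candidate pairs.

    `signatures` maps a read id to its MinHash signature.
    """
    buckets = defaultdict(set)
    for read_id, sig in signatures.items():
        if not sig:
            continue  # skip reads without markers
        if len(sig) % bands:
            raise ValueError("signature length must be a multiple of `bands`")
        rows = len(sig) // bands
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(read_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In use, each read would go through `rle_encode`, `marker_representation` and `minhash_signature` in turn, and the resulting dictionary of signatures would be passed to `lsh_candidate_pairs` to obtain read pairs that are likely to overlap. The per-marker hashing and minimum search in step 3 are independent across hash functions, which is one reason such a pipeline could be a plausible target for hardware parallelism.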
This is a research project in a very new field of study, and it may lead to further research and scientific publications.
What will we do and learn in the project?
- Develop a novel algorithm for accelerating DNA / genome assembly
- Implement the algorithm in a high-level programming language
- Any ideas you will have along the way…
What is required?
- Ability to work independently and venture into uncharted territory
- Desire to innovate and try new things
- No prior VLSI design knowledge is required; this project is an opportunity to learn the subject.