
Ivan Tolstoganov: Multi-context seeds enable fast and high-accuracy read mapping

Time: Wed 2024-11-27, 13.00 - 14.00

Location: Room Cramer

Participating: Ivan Tolstoganov (SU)


A key step in sequence similarity search is to identify seeds that are found in both the query and the reference sequence. A seed is a shorter substring (e.g., a k-mer) or pattern (e.g., a spaced k-mer) constructed from the sequences. A well-known trade-off in applications such as read mapping is that longer seeds give faster searches because they produce fewer spurious matches, but they lower sensitivity in variable regions because longer seeds are more likely to contain mutations. Recent seed constructs include approximate (or fuzzy) seeds such as k-min-mers, strobemers, BLEND, SubSeqHash, TensorSketch, and more, which can match across small mutations and thus suffer less from sensitivity issues in variable regions. Nevertheless, the sensitivity-to-speed trade-off still exists for such constructs. In other applications, such as genome assembly, using multiple k-mer sizes is effective. While this can be achieved in read mapping through, e.g., MEM construction from an FM-index, such seed constructs are typically much slower than hash-based constructs.
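
The sensitivity side of this trade-off can be illustrated with plain k-mer seeds. The following is a minimal sketch, not taken from the talk: the function names and the two toy sequences are made up for illustration, and the query differs from the reference by a single substitution.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_seeds(query: str, reference: str, k: int) -> set[str]:
    """Return the k-mers that occur in both query and reference."""
    return kmers(query, k) & kmers(reference, k)


# Toy sequences (hypothetical): the query carries one substitution (C -> T).
reference = "ACGTTGCAAGCT"
query = "ACGTTGTAAGCT"

short_hits = shared_seeds(query, reference, k=4)  # 5 shared 4-mers
long_hits = shared_seeds(query, reference, k=8)   # no shared 8-mers
print(len(short_hits), len(long_hits))
```

In this toy case every 8-mer overlaps the substituted position, so no long seed survives, while five 4-mers still match. The sketch shows only the loss of sensitivity; the benefit of longer seeds (fewer spurious matches in a large reference) is not visible at this scale.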

To this end, we introduce multi-context seeds (MCS). In brief, MCS are strobemers in which the hashes of the individual strobes occupy separate partitions of the hash value representing the seed. Such partitioning enables a cache-friendly search for both full matches and partial matches on a subset of the strobes. For example, both the full strobemer and the first strobe (a k-mer) can be queried. We show that MCS improve sequence matching statistics over standard strobemers and k-mers without compromising seed uniqueness. We demonstrate the practical applicability of MCS by implementing them in the short-read aligner strobealign. Strobealign with MCS comes at no cost in memory and only a small cost in runtime, while offering increased mapping accuracy over default strobealign on simulated Illumina reads across genomes of various complexity. We also show that strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy on high-variability sequences. MCS provide a fast seed alternative that addresses the trade-off between seed length and alignment accuracy.
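
The partitioning idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the strobealign implementation: the 64-bit hash width, the 32/32 bit split, the BLAKE2b hash, the sorted-list index, and all function names are choices made here for the example, and the strobes are passed in directly rather than selected from a sequence.

```python
import bisect
import hashlib

HASH_BITS = 64                         # total seed hash width (assumed)
PREFIX_BITS = 32                       # bits given to the first strobe (assumed)
SUFFIX_BITS = HASH_BITS - PREFIX_BITS  # bits given to the second strobe


def strobe_hash(strobe: str) -> int:
    """Deterministic 64-bit hash of one strobe (illustrative hash choice)."""
    digest = hashlib.blake2b(strobe.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def mcs_hash(strobe1: str, strobe2: str) -> int:
    """Pack the first strobe's hash into the high bits, the second into the low bits."""
    prefix = strobe_hash(strobe1) >> SUFFIX_BITS               # top PREFIX_BITS bits
    suffix = strobe_hash(strobe2) & ((1 << SUFFIX_BITS) - 1)   # low SUFFIX_BITS bits
    return (prefix << SUFFIX_BITS) | suffix


def build_index(seeds):
    """Build a sorted index from (strobe1, strobe2, position) triples."""
    entries = sorted((mcs_hash(s1, s2), pos) for s1, s2, pos in seeds)
    keys = [key for key, _ in entries]
    positions = [pos for _, pos in entries]
    return keys, positions


def find_full(index, strobe1: str, strobe2: str):
    """Positions whose whole seed hash equals that of (strobe1, strobe2)."""
    keys, positions = index
    key = mcs_hash(strobe1, strobe2)
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return positions[lo:hi]


def find_partial(index, strobe1: str):
    """Positions whose hash prefix matches the first strobe only."""
    keys, positions = index
    prefix = strobe_hash(strobe1) >> SUFFIX_BITS
    # All seeds sharing this prefix sit in one contiguous run of the sorted keys.
    lo = bisect.bisect_left(keys, prefix << SUFFIX_BITS)
    hi = bisect.bisect_left(keys, (prefix + 1) << SUFFIX_BITS)
    return positions[lo:hi]


# Hypothetical seeds: two share the first strobe "ACGT".
index = build_index([("ACGT", "TTGA", 0), ("ACGT", "GGCC", 5), ("CCCC", "TTGA", 9)])
print(find_full(index, "ACGT", "TTGA"))
print(find_partial(index, "ACGT"))
```

Because the first strobe's hash occupies the high bits, seeds that share a first strobe are adjacent in the sorted index. In this sketch a partial query is therefore a range lookup in the same array as the full query, which is one way to read the cache-friendliness mentioned above.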