Kristoffer Sahlin: Large-scale text indexing and search with MinHash and the Jaccard index

Tid: On 2026-03-11 kl 13.00 - 14.00

Plats: Albano, Lärosal 3, floor 2, house 1

Medverkande: Kristoffer Sahlin

Abstract: Modern biology produces large amounts of sequence data, especially DNA sequences that can be viewed as long strings over a four-letter alphabet. Comparing such sequences is essential for tasks like identifying related organisms, detecting mutations, or organizing large genomic databases. However, directly comparing long strings character by character becomes computationally expensive when datasets contain millions or billions of sequences.

This lecture introduces scalable techniques for large-scale text indexing and similarity search based on the Jaccard index and MinHash. Instead of comparing sequences position by position, we treat them as collections of shorter substrings (k-mers). Similarity can then be measured as the overlap between these substring sets using the Jaccard index. MinHash provides a compact probabilistic “fingerprint” of each sequence that allows us to estimate similarity quickly, without storing or comparing entire sequences.

We will explain the key ideas behind these methods, why they work, and what trade-offs they involve in terms of accuracy, speed, and memory usage. Although the examples focus on DNA sequences, the same techniques apply to general text documents, web data, and other large-scale string collections.

By the end of the lecture, participants will understand how probabilistic sketching techniques enable efficient similarity search in massive datasets, and how core concepts from algorithms and data structures can be applied to modern biological data analysis.

Till kalendern