Mikkel Thorup: Fast and Powerful Hashing using Tabulation + Deterministic Edge Connectivity in Near-Linear Time

Time: Mon 2015-10-12 12.00

Location: Room 4523, Lindstedtsvägen 5, KTH CSC

Participating: Mikkel Thorup, University of Copenhagen

Practicalities

Lunch is served at 12:00 noon (register at this doodle at the latest the Saturday before, but preferably earlier). The presentation starts at 12:10 pm and ends at 1 pm. Those of us who wish reconvene after a short break for ca two hours of more technical discussions.

ABSTRACT

The first part of the presentation, titled "Fast and Powerful Hashing using Tabulation," is intended for a general audience.

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees.

Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of c characters and we have precomputed character tables h_1,...,h_c mapping characters to random hash values. A key \(x=(x_1,...,x_c)\) is hashed to \(h_1[x_1] ⊕ h_2[x_2].....⊕ h_c[x_c]\). This schemes is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing.

Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This is also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures.

Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, with high probability, for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space.

While these tabulation schemes are all easy to implement and use, their analysis is not.

After the break, there will be a more technical presentation titled "Deterministic Edge Connectivity in Near-Linear Time," based on joint work with Ken-ichi Kawarabayashi.

We present a deterministic algorithm that computes the edge-connectivity of a graph in near-linear time. This is for a simple undirected unweighted graph G with n vertices and m edges. This is the first o(mn) time deterministic algorithm for the problem. Our algorithm is easily extended to find a concrete minimum edge-cut. In fact, we can construct the classic cactus representation of all minimum cuts in near-linear time.

The previous fastest deterministic algorithm by Gabow from STOC'91 took ~ \(~O(m+k^2 n)\), where k is the edge connectivity, but k could be Omega(n).

At STOC'96 Karger presented a randomized near linear time Monte Carlo algorithm for the minimum cut problem. As he points out, there is no better way of certifying the minimality of the returned cut than to use Gabow's slower deterministic algorithm and compare sizes.

Our main technical contribution is a near-linear time algorithm that contract vertex sets of a simple input graph G with minimum degree d, producing a multigraph with ~\(O(m/d)\) edges which preserves all minimum cuts of G with at least 2 vertices on each side.

In our deterministic near-linear time algorithm, we will decompose the problem via low-conductance cuts found using PageRank à la Brin and Page (1998), as analyzed by Andersson, Chung, and Lang at FOCS'06. Normally such algorithms for low-conductance cuts are randomized Monte Carlo algorithms, because they rely on guessing a good start vertex. However, in our case, we have so much structure that no guessing is needed.

To the calendar