Department of Computer Science and Engineering Chair Martín Farach-Colton helps solve a longstanding conundrum
This month Quanta Magazine published a look at the hash table, a method that allows for storing and accessing large datasets conveniently and efficiently. Developed some seven decades ago by an IBM engineer, hash tables are now ubiquitous in modern computing systems.
The data entered into a hash table is connected to a key that identifies it for retrieval when needed. “Consider the Oxford English Dictionary, which has definitions for more than 600,000 words. If a digital edition relies on a hash table, you can simply use a given word as a key and proceed straight to the definition,” the Quanta piece explains. “Without a hash table, the dictionary would likely rely on a much slower search mechanism, using a process of elimination to eventually converge on the requested definition.”
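To make the dictionary analogy concrete, here is a minimal sketch in Python, whose built-in dict is itself backed by a hash table; the two-word mini-dictionary and the linear_lookup helper are invented for illustration:

```python
# Hypothetical mini-dictionary: each word (the key) maps to its definition.
# Python's dict is implemented as a hash table, so lookup by key is fast.
definitions = {
    "hash": "a function that maps data to a fixed-size value",
    "table": "an arrangement of data in rows and columns",
}

# With a hash table: the key leads almost directly to the definition.
print(definitions["hash"])

# Without one: a slower process of elimination, checking entries one by one.
def linear_lookup(word, entries):
    for candidate, definition in entries:
        if candidate == word:
            return definition

print(linear_lookup("hash", list(definitions.items())))
```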
Despite the widespread adoption of hash tables, computer scientists had long struggled with a fundamental tradeoff: everyone wants to retrieve needed data as quickly as possible, yet the faster a hash table answers queries, the more memory it has traditionally required. It seemed impossible to maximize speed while minimizing space.
In a recent journal article, “On the Optimal Time/Space Tradeoff for Hash Tables,” Martín Farach-Colton, the chair of NYU Tandon’s Department of Computer Science and Engineering, and his co-authors detailed what it would take to create a hash table with the best possible balance between time (speed) and space (memory).
Ironically, Farach-Colton and his fellow researchers had set out to confirm that the optimal balance had already been identified in a 2003 paper. Instead, they showed it was possible to do better. They built a hash table with two data structures: a primary one in which each item is stored according to a ranked system (a preferred number-one location, a next-best number-two location, and so forth), and a secondary structure that records each item’s rank and answers queries accordingly.
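As a rough illustration of that two-structure idea, here is a simplified sketch, not the authors’ actual construction; the class name, probe sequence, and capacity below are all invented:

```python
class RankedHashTable:
    """Sketch: each key has a ranked sequence of candidate slots.
    The primary structure holds the items; the secondary structure
    remembers which rank each key ended up at, so a query knows
    exactly where to look."""

    def __init__(self, capacity=16):
        self.slots = [None] * capacity   # primary structure
        self.rank_of = {}                # secondary structure: key -> rank used

    def _slot(self, key, rank):
        # Hypothetical probe sequence: rank 1 is the preferred spot,
        # rank 2 the next best, and so on.
        return hash((key, rank)) % len(self.slots)

    def insert(self, key, value):
        # Place the item in the best-ranked free slot along its sequence.
        for rank in range(1, 4 * len(self.slots)):  # sketch-level probe limit
            idx = self._slot(key, rank)
            if self.slots[idx] is None:
                self.slots[idx] = (key, value)
                self.rank_of[key] = rank
                return
        raise RuntimeError("no free slot found")

    def get(self, key):
        rank = self.rank_of[key]          # ask the secondary structure first
        stored_key, value = self.slots[self._slot(key, rank)]
        assert stored_key == key
        return value
```

The memory question then becomes how many bits the secondary structure spends recording each item’s rank, which is where the researchers’ key observation comes in.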
The catch was that storing those ranks costs memory: an item in the 100th-best spot in the primary structure gets a tag of 1100100 (100 written in binary), which takes seven bits to store, compared with the single bit needed for an item in the number-one spot. But by shifting items in the primary structure to more preferred locations, much as a librarian might shift books to spots that are more logical or easier to reach, the secondary structure could answer queries quickly without a proportionate increase in memory consumption.
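The arithmetic behind those tag sizes can be checked in a few lines; the tag_bits helper is invented for illustration:

```python
def tag_bits(rank):
    """Bits needed to write a slot's rank in binary (rank >= 1)."""
    return rank.bit_length()

print(bin(100), tag_bits(100))  # 0b1100100 -> 7 bits for the 100th-best spot
print(bin(1), tag_bits(1))      # 0b1       -> 1 bit for the number-one spot
```

Nudging items toward their top-ranked spots keeps the ranks small, so the secondary structure spends very few bits per item on average.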
The paper set a new standard for hash table efficiency, a feat confirmed when a different research team, attempting to improve upon the results, instead proved a matching theoretical lower bound: no possible data structure, now or in the future, can do better than what Farach-Colton and his co-authors achieved.