NSF III-1117829: Efficient Query Processing in Large Search Engines

This page describes research conducted at the NYU Tandon School of Engineering (formerly the Polytechnic School of Engineering) by Prof. Torsten Suel and his research group, funded by the National Science Foundation under grant III-1117829, "Efficient Query Processing in Large Search Engines", and by grants from Google. In the following, we first give a high-level description of the problems addressed and the proposed solutions, written for the non-expert. Then we provide some results, project participants, and research publications and manuscripts.

1. Introduction and Motivation
2. Our Research Efforts
2.1. Faster Early Termination for Simple Ranking Functions
2.2. Approximating Complex Ranking Functions
2.3. Index Pruning, Tiering, and Selective Search
2.4. Fast Index Updates
2.5. Positional Indexing and the Role of Positions in Ranking
2.6. Distance Estimation in Large Graphs
3. Project Participants
4. Project Publications and Manuscripts

1. Introduction and Motivation

As the title suggests, this project focuses on new techniques that allow large web search engines, such as those run by Google, Bing, and others, to more efficiently execute queries posted by users. These engines now receive several billion queries per day that have to be processed over databases of hundreds of billions of web pages and other types of documents. To do so, companies are building many large data centers involving tens of thousands of machines each, with large hardware, maintenance, and energy costs. Reducing these costs by even a moderate amount would have great benefits for the economy and environment.

Reducing the cost of query processing in search engines is of course a very active research area, with many projects underway at companies and universities. Large search engines run on hundreds or thousands of machines, and efficiency improvements are obtained by optimizing latency and throughput on each machine, as well as through proper index partitioning, query routing, and load balancing across machines. We mostly focus on the first aspect, how to optimize query processing within a single machine or core, which is largely orthogonal to the second. Our work studies new algorithmic techniques (i.e., smart algorithms and data structures) for this problem, starting from several properties of current state-of-the-art search engines.

Our main goal is to develop early termination techniques that work for complex ranking functions and provide significant savings in processing costs at little or no reduction in quality. Many previous techniques are heuristic in nature and may work only for certain classes of ranking functions; others are unpublished and closely guarded by the search engine companies. A larger and more ambitious long-term goal is to develop a general framework for optimizing query processing in Information Retrieval systems with complex and often changing ranking functions, partly inspired by the highly successful body of work on query optimization in the world of relational databases.

2. Our Research Efforts:

Our research group is exploring the query processing problem using several different approaches, as discussed in the following. One approach (Section 2.1) focuses on faster query processing on index structures with block-wise score approximations, resulting in significant speedups over the state of the art for disjunctive top-k queries [1][2]. A second approach (Section 2.2) attempts to efficiently approximate complex ranking functions in a cascading model, with recent work [3] showing extremely efficient processing of the first cascade. A third line of work (Section 2.3) looks at techniques for index pruning, index tiering, and selective search [6]. The fourth major activity (Section 2.4) focuses on efficient index updates in the single-node [8] and distributed [9] case. Finally, other work conducted in this project studies new positional indexing techniques that could replace inverted lists in some scenarios (Section 2.5) [4], and distance estimation problems in large graphs (Section 2.6) [5][7].

2.1. Faster Early Termination for Simple Ranking Functions:

Problem Statement: How to more efficiently execute disjunctive top-k queries under simple ranking functions (such as BM25), in particular by improving a recently proposed approach that stores block-wise upper-bounds for posting scores.

Our first set of activities focuses on more efficient safe early termination for simple ranking functions, as used in the first phase of a cascading ranking model. We are building on the recently proposed Block-Max Wand (BMW) approach [Ding/Suel, SIGIR 2011]. Our results in [1] and [2] give an in-depth experimental comparison of a number of possible approaches, and describe a new algorithm that exploits the SIMD features of modern CPUs. We observe speed-ups by a factor of 2-3 over state-of-the-art approaches, at the expense of some extra main memory required to store additional auxiliary index structures.
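To illustrate the block-max idea that BMW and our follow-up work build on, here is a minimal Python sketch (illustrative only, not the code from [1] or [2]): each posting list stores a per-block upper bound on posting scores, and a candidate document is fully scored only if the sum of the relevant block maxima could beat the current top-k threshold.

```python
# Minimal sketch of block-max pruning for disjunctive top-k queries.
# Illustrative only: real implementations (e.g., BMW) advance list
# cursors and skip whole blocks instead of enumerating all candidates.
import heapq
from bisect import bisect_left

class PostingList:
    def __init__(self, postings, block_size=2):
        # postings: list of (doc_id, score), sorted by doc_id
        self.blocks = [postings[i:i + block_size]
                       for i in range(0, len(postings), block_size)]
        self.block_max = [max(s for _, s in b) for b in self.blocks]
        self.block_last = [b[-1][0] for b in self.blocks]
        self.score = dict(postings)

    def upper_bound(self, doc_id):
        # Max score this list could contribute for doc_id: the maximum
        # of the block whose doc_id range covers it (0 if past the end).
        i = bisect_left(self.block_last, doc_id)
        return self.block_max[i] if i < len(self.blocks) else 0.0

def block_max_topk(lists, k):
    heap = []  # min-heap of (score, doc_id) holding the current top-k
    for doc in sorted({d for pl in lists for d in pl.score}):
        # Skip full evaluation if the block maxima already rule doc out.
        if len(heap) == k and sum(pl.upper_bound(doc) for pl in lists) <= heap[0][0]:
            continue
        s = sum(pl.score.get(doc, 0.0) for pl in lists)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc))
    return sorted(heap, reverse=True)
```

In [1] and [2], the analogous bound checks are combined with cursor-based skipping and vectorized using SIMD instructions, which is where the observed speedups come from.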

2.2. Approximating Complex Ranking Functions:

Problem Statement: Given a complex ranking function and a query distribution, design a cascading query processing approach that is much faster than an exhaustive application of the complex ranker, while achieving essentially the same end-to-end result quality.

This work starts from the multi-phase (or cascading) model of ranking that is widely used in today's search engines: first, a simple ranking function is run on a large part of the index, resulting in a set of, say, a few hundred or thousand candidate results that are then filtered in subsequent phases (cascades) using increasingly more complex (i.e., precise but computationally expensive) ranking functions (see, e.g., [Wang/Lin/Metzler, SIGIR 2011]). The ultimate goal here is to come up with fully automatic ways to build and maintain efficient cascades, and associated index structures, given a document collection and a query trace.
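As a toy illustration of this cascading model (the function names are hypothetical; this is not the system of [3]):

```python
def run_cascades(candidates, cheap_score, expensive_score,
                 first_phase_k=1000, final_k=10):
    """Two-phase cascade: a cheap ranker (e.g., a BM25-style function)
    selects first_phase_k candidates, which the expensive ranker
    (e.g., a learned model) then re-ranks down to final_k results."""
    phase1 = sorted(candidates, key=cheap_score, reverse=True)[:first_phase_k]
    return sorted(phase1, key=expensive_score, reverse=True)[:final_k]
```

The total cost is roughly (cheap cost x candidates examined) + (expensive cost x first_phase_k), which is why speeding up the first phase, and shrinking first_phase_k without hurting quality, both matter.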

Our efforts focused on improving the efficiency of the first cascade. A first approach (by M. Christoforaki and C. Dimopoulos) attempted to directly learn better term-wise impact scores from a complex ranker; however, experiments did not show any improvements, most likely because a sufficiently large training trace was not available (large search engine query traces are almost impossible to obtain). A different approach, published in [3], succeeded in getting very significant speed-ups by building additional specialized index access structures based on modeling the query trace and posting quality. The resulting method can perform the first cascade in less than a millisecond on the ClueWeb09B collection, compared to tens of milliseconds for other approaches, while achieving almost the same end-to-end quality as an exhaustive approach.

Follow-up work currently underway in the group is looking at how to get even better results by building additional index structures, including phrase index structures.

2.3. Index Pruning, Tiering, and Selective Search:

Problem Statement: Three techniques, (i) Index Pruning, (ii) Index Tiering, and (iii) Selective Search, aim to make search more efficient by, respectively, (i) completely removing postings that are low quality from the index, (ii) relegating such postings to a lower tier that is only accessed for some of the queries, and (iii) topically clustering the index and only accessing the clusters relevant to the current query. Our goal is to improve these methods, and in particular to explore their relationship and possible hybrids.

Static index pruning is the idea of pruning an inverted index structure by removing some of the less useful postings from the index. The goal is to get a smaller index that allows queries to be processed faster and with less memory, without significant losses in result quality. There has been some amount of work on this approach, mostly based on simple heuristics that remove postings with low rank or impact score. We propose a new approach [6] that tries to estimate the usefulness of a posting by using language modeling and machine learning on a small set of training queries. Current results show that this outperforms previous techniques in terms of pruned index size, while achieving the same or better quality.
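For concreteness, here is what a simple impact-based pruning baseline looks like; the method in [6] would replace the sort key below with a learned per-posting usefulness estimate (this is a hypothetical sketch, not code from the paper):

```python
def prune_index(index, keep_fraction=0.3):
    """Keep, for each term, only the highest-impact postings.
    index: {term: [(doc_id, impact_score), ...]}.
    A simple impact-based heuristic; a learned approach would rank
    postings by a model-predicted usefulness score instead."""
    pruned = {}
    for term, postings in index.items():
        keep = max(1, int(len(postings) * keep_fraction))
        best = sorted(postings, key=lambda p: p[1], reverse=True)[:keep]
        pruned[term] = sorted(best)  # restore doc_id order
    return pruned
```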

More recent and ongoing follow-up work has looked at more general machine-learning approaches to index pruning that keep or remove postings based on a number of features. Another interesting insight from this work is that there is a pronounced trade-off between pruned index size and query speed: methods that minimize index size are far from the fastest, and vice versa. (Previous work focused primarily on size, with the expectation that a smaller index also means faster queries.) Moreover, we are comparing techniques for index pruning and index tiering -- the first usually posting-oriented (a posting is either kept or discarded) and the second document-oriented (a complete document is assigned to a tier) -- to better understand the relationship between pruning and tiering.

Other follow-up work looks at better selective search by integrating ideas from index tiering. Here, documents are clustered as in selective search, but a broker decides not just which clusters to access but also how "deep" to access each cluster that is selected.
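A hypothetical sketch of such a broker decision: given per-cluster relevance estimates, spend a fixed posting budget so that more relevant clusters are searched deeper (an illustration of the idea, not the actual broker):

```python
def plan_access(cluster_scores, budget):
    """Allocate search depth across selected clusters in proportion to
    their estimated relevance to the query, instead of searching every
    selected cluster to the same fixed depth.
    cluster_scores: {cluster_id: relevance estimate}."""
    selected = {c: s for c, s in cluster_scores.items() if s > 0}
    total = sum(selected.values())
    return {c: int(budget * s / total) for c, s in selected.items()}
```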

2.4. Fast Index Updates:

Problem Statement: How to efficiently update inverted index structures under insertions and deletions, in two scenarios: (i) on a single node, when most of the index is on hard disk or SSD, and (ii) in a distributed system, in a scenario where documents need to be clustered by similarity for efficiency.

The basic challenge for the first scenario of index updates is that documents usually contain hundreds of different terms, which for disk-based indexes would result in changes in hundreds of different locations on disk. Thus, inverted index updates, if not properly managed, can result in large numbers of random disk accesses.

While there has been a lot of previous work on index updates, there are still some opportunities for significant improvements. One case is the interaction between queries and updates -- basically, the choice of best update mechanism should depend on the query load, and thus it is not a good idea to study updates in isolation. Also, both updates and queries utilize caching and buffering mechanisms that interact. Finally, different terms in the same document should be handled differently for best performance. Work in [8] looks at these questions in detail and shows how to get more efficient index updates on both hard disks and SSD with policies that adapt to the query load and that try to model future updates and queries.
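The classic remedy, which the adaptive policies of [8] build on and refine, is to buffer new postings in memory and merge them into the disk-resident index in one sequential pass. A minimal sketch, with in-memory dictionaries standing in for the disk index:

```python
class MergeUpdateIndex:
    """Sketch of buffered index updates: new postings accumulate in
    memory and are merged into the main ("disk-resident") index in one
    sequential pass when the buffer fills, avoiding per-posting random
    writes. A textbook merge-based policy, not the adaptive one of [8]."""
    def __init__(self, buffer_limit=1000):
        self.main = {}      # term -> sorted list of doc_ids ("on disk")
        self.buffer = {}    # term -> list of doc_ids (in memory)
        self.buffered = 0
        self.buffer_limit = buffer_limit

    def add_document(self, doc_id, terms):
        for t in set(terms):
            self.buffer.setdefault(t, []).append(doc_id)
            self.buffered += 1
        if self.buffered >= self.buffer_limit:
            self.merge()

    def merge(self):
        # One sequential pass over the affected part of the main index.
        for t, docs in self.buffer.items():
            self.main[t] = sorted(self.main.get(t, []) + docs)
        self.buffer.clear()
        self.buffered = 0

    def postings(self, term):
        # Queries must consult both structures until the next merge.
        return sorted(self.main.get(term, []) + self.buffer.get(term, []))
```

The query-load-dependent policies in [8] then decide, per term, when and how aggressively to merge, rather than using one global buffer threshold as above.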

In the distributed case, our goal is to incorporate index updates while also routing similar documents to the same index shard and the same location within the shard, in order to enable recent approaches that assign consecutive docIDs to similar documents to improve query processing speed. In [9], we propose an approach that assigns incoming documents to shards, and then reorganizes the index inside each shard in an online manner using lazy algorithms with good amortized performance, leading to much faster query processing.

2.5. Positional Indexing and the Role of Positions in Ranking:

Problem Statement: The standard approach of indexing term position data using inverted lists typically increases index size by a factor of 4-5 over the non-positional case. Are there alternative approaches, based on wavelet trees or forward indexes, that do better in some scenarios?

State-of-the-art ranking functions make heavy use of information about the locations (positions) of the query terms in the documents. For example, documents where some of the query terms occur close to, or directly next to, each other are often ranked higher than documents where the terms are far apart. Positions thus allow search engines to return better results, and efficient index structures for proximity can help quickly narrow the search to the most promising results. Unfortunately, index structures for positions and phrases are large and expensive to access, and search engine efficiency heavily depends on how and when position information is utilized during query processing.

In an initial piece of work under this project, undertaken with colleagues in Chile, we looked at how to best store and access positions. Our work in [4] showed that a standard inverted index often does not provide the best trade-off between time and space, and that in fact simply storing the parsed documents in highly compressed form is often better than this or an approach based on wavelet trees. The right approach depends on when and how often positions are accessed -- if positions are heavily accessed during the first cascade of ranking, then a standard positional index is still best, but if accesses are delayed to the second cascade (with the potential exception of accesses to more efficient but specialized phrase index structures in the first cascade), then storing the parsed documents in highly compressed form may perform better. (In fact, subsequent work by researchers at Bing confirmed that positional indexes may in fact not be the best choice for fast query processing.)
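To make the access pattern concrete, here is the kind of proximity computation a positional ranker performs. Under the compressed-document (forward index) option discussed above, positions are recovered by scanning the parsed document, as in this hypothetical helper (not code from [4]); a positional inverted index would instead store these positions per term:

```python
def min_term_distance(doc_tokens, t1, t2):
    """Smallest gap between an occurrence of t1 and one of t2 in a
    parsed document, found in a single scan; returns None if the two
    terms never co-occur. Proximity features like this feed into
    position-aware ranking functions."""
    best = None
    last = {}  # term -> most recent position seen
    for pos, tok in enumerate(doc_tokens):
        if tok == t1 and t2 in last:
            d = pos - last[t2]
            best = d if best is None else min(best, d)
        if tok == t2 and t1 in last:
            d = pos - last[t1]
            best = d if best is None else min(best, d)
        if tok in (t1, t2):
            last[tok] = pos
    return best
```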

2.6. Distance Estimation in Large Graphs:

Problem Statement: Given a very large graph, say a web graph or social network, how can we quickly estimate the distance between a pair of nodes, as precisely as possible, and much faster than by running an optimized shortest-path computation between the nodes?

One of the main online trends of the last decade has been the creation of large social networks and other social structures on the web. This creates an opportunity for search engines to improve the relevance of their results by taking this structure, and in particular the social structure around the user and the candidate results or searched-for entities, into account. However, it also creates a challenge in query processing: while we know a lot about how to search in text data, and about how to process graph data, much less is known about how to deal with the combination of these basic data types, in particular "text graphs" where nodes or edges may contain some textual information. For example, a search engine may want to return results that are both relevant in terms of textual similarity and close to the user in terms of social structure.

One important sub-problem arising in this context is how to quickly answer distance queries between arbitrary pairs of nodes in such a graph, a problem that arises in a number of algorithms for search on text graphs and that has thus been studied by many researchers. We looked at two new approaches for this problem that achieve a better trade-off between the size and speed of the lookup structure and the accuracy of the estimate. One approach [5] that is particularly promising for the types of graphs encountered in social networks uses machine learning to directly learn to estimate the distance; if only very limited space is available, our experiments show that this achieves much better accuracy than a standard landmark-based approach. A second approach, in [7], proposes to extend the landmark approach to three hops, achieving moderate improvements on some types of graphs and distance queries (though not on others).
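For reference, the standard landmark baseline that such approaches are compared against can be sketched as follows: precompute exact distances from a few landmark nodes (e.g., by BFS), then bound d(u,v) by the minimum over landmarks L of d(u,L) + d(L,v), via the triangle inequality (illustrative code, not from [5] or [7]):

```python
from collections import deque

def bfs_distances(graph, source):
    """Exact BFS distances from one node; graph is an adjacency dict."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def landmark_estimate(landmark_dists, u, v):
    """Upper-bound estimate from precomputed landmark distance maps:
    d(u,v) <= min over landmarks L of d(u,L) + d(L,v)."""
    return min(d[u] + d[v] for d in landmark_dists)
```

The estimate is exact whenever some landmark lies on a shortest u-v path, and degrades for pairs far from all landmarks, which is the regime where learned estimators can do better with the same space.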

3. Project Participants

Participants include the PI, Torsten Suel, several PhD students at NYU supported by this and other grants, in particular, PhD students Maria Christoforaki, Costas Dimopoulos, Wei Jiang, Sergey Nepomnyachiy, Jose Rodriguez, and Qi Wang, MS students Yan Krasny, Shi Li, and Abrar Sheikh, and a faculty colleague, Ozgur Ozkan. Our collaborators at other institutions include Diego Arroyuelo, Mauricio Marin, and their team in Santiago de Chile, Alex Markowetz and Pascal Welke at the University of Bonn, and visiting PhD student Naiyong Ao from Nankai University.

4. Project Publications and Manuscripts

Peer-Reviewed Publications:

[1] A Candidate Filtering Mechanism for Fast Top-K Query Processing on Modern CPUs. C. Dimopoulos, S. Nepomnyachiy, and T. Suel. 36th Annual ACM SIGIR Conference, July 2013. PDF
Source code for this paper is available here.

[2] Optimizing Top-k Document Retrieval Strategies for Block-Max Indexes. C. Dimopoulos, S. Nepomnyachiy, and T. Suel. 6th ACM Conference on Web Search and Data Mining, February 2013. PDF
Source code for this paper is available here.

[3] Fast First-Phase Candidate Generation for Cascading Rankers. Q. Wang, C. Dimopoulos, and T. Suel. 39th Annual ACM SIGIR Conference, July 2016. PDF
Source code for this paper is available here.

[4] To Index or not to Index: Time-Space Trade-Offs in Search Engines with Positional Ranking Functions. D. Arroyuelo, S. Gonzalez, M. Marin, M. Oyarzun, and T. Suel. 35th Annual ACM SIGIR Conference, July 2012. PDF

[5] Estimating Pairwise Distances in Large Graphs. M. Christoforaki and T. Suel. IEEE International Conference on Big Data, October 2014. PDF

[6] Improved Methods for Static Index Pruning. W. Jiang, J. Rodriguez, and T. Suel. IEEE International Conference on Big Data, December 2016. PDF

[7] Three-Hop Distance Estimation in Social Graphs. P. Welke, A. Markowetz, T. Suel and M. Christoforaki. IEEE International Conference on Big Data, December 2016 (short paper). PDF

[8] Efficient Index Updates for Mixed Update and Query Loads. S. Nepomnyachiy and T. Suel. IEEE International Conference on Big Data, December 2016 (short paper). PDF

Under Review:

[9] Document Routing and Index Reorganization Strategies in Distributed Search Engines. C. Dimopoulos, O. Ozkan, A. Sheikh, and T. Suel. Submitted for publication, July 2016.