NSF III-1718680: Index Sharding and Query Routing in Distributed Search Engines

This page describes research conducted at the NYU Tandon School of Engineering by Prof. Torsten Suel and his research group, funded by the National Science Foundation under grant III-1718680: Index Sharding and Query Routing in Distributed Search Engines, and by a grant from Amazon. We first give a high-level description of the problems addressed and the studied approaches, written for the non-expert. Then we describe our work in more detail, and list project participants, research publications and manuscripts, and available software and data sets.

1. Introduction and Motivation
2. Our Research Efforts
2.1. Faster Single-Node Query Processing
2.2. Distributed Search Methods
2.3. Load Balancing and Query Routing in Distributed Search Engines
2.4. Software Infrastructure and Reproducibility
2.5. Other Topics
3. Project Participants and Collaborations
4. Project Publications and Manuscripts
5. Software and Datasets

1. Introduction and Motivation

This project focuses on the efficiency of large distributed search engines. By this we mean search systems, such as Google Search, Bing and others, where the document collection is partitioned into subsets called shards that are assigned and replicated across a large number of machines in a data center. Queries in such systems are processed by (1) routing the query to machines holding the needed shard replicas, (2) processing the query locally on those machines, and (3) aggregating the results supplied by the different machines into a Search Engine Result Page (SERP) returned to the user.
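
As a toy illustration of this three-step flow, the following Python sketch routes a single-term query to in-memory "shards", processes it locally on each, and merges the partial results. All names and data are made up for illustration; in a production system each shard lives on a separate machine:

```python
import heapq

# Hypothetical in-memory "shards": each maps a term to a posting list of
# (doc_id, score) pairs. Real systems hold these on separate machines.
SHARDS = [
    {"apple": [(1, 2.0), (4, 1.1)], "pie": [(1, 0.9)]},
    {"apple": [(7, 3.2)], "pie": [(7, 1.5), (9, 0.4)]},
]

def search_shard(shard, term, k):
    """Step (2): local query processing -- top-k postings for a term."""
    return heapq.nlargest(k, shard.get(term, []), key=lambda p: p[1])

def search(term, k=3):
    # Step (1): route the query to every shard holding a needed replica
    # (here, trivially, all shards).
    partial = [search_shard(s, term, k) for s in SHARDS]
    # Step (3): aggregate the partial results into one ranked list (SERP).
    return heapq.nlargest(k, (p for res in partial for p in res),
                          key=lambda p: p[1])
```

For example, `search("apple")` returns the globally best-scoring documents for "apple" across both shards.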

When optimizing the efficiency of a search architecture, we need to balance two objectives, latency and throughput, while of course returning high-quality results. This is often modeled via a Service Level Agreement (SLA) that may, e.g., specify that 90% of all queries must be answered in 100ms, and 99% in 250ms. (More complex SLAs may also specify degradations in search quality that users are willing to accept.) Thus, the goal is to maximize the number of queries per second (the throughput) that can be processed with existing hardware and energy resources, while satisfying the SLA.
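
For instance, the example SLA above can be checked against a sample of measured query latencies with a simple percentile test. The sketch below uses the nearest-rank percentile; the function name and the encoding of the SLA as (fraction, bound) pairs are our own:

```python
import math

def satisfies_sla(latencies_ms, sla=((0.90, 100), (0.99, 250))):
    """Check a latency sample against a percentile SLA, here the example
    above: 90% of queries within 100 ms and 99% within 250 ms."""
    xs = sorted(latencies_ms)
    n = len(xs)
    for fraction, bound_ms in sla:
        rank = max(1, math.ceil(fraction * n))  # nearest-rank percentile
        if xs[rank - 1] > bound_ms:
            return False
    return True
```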

Overall, the query processing efficiency of a search engine mainly depends on the following three aspects: (1) the efficiency of query processing on each individual node, (2) the methods used to partition the collection into shards and to distribute queries over them, and (3) the policies used for load balancing and query routing across machines.

Our research focuses on all three aspects, and on the interplay between them. For single-node query processing, we focus on index reordering methods, faster top-k query processing, and the role of threshold estimation techniques. For distributed search, we study index sharding methods and selective search and index tiering approaches. For load balancing and query routing, we look at the performance of existing and new replica assignment and load balancing methods under different workloads, including selective search workloads, and at policies for reissuing queries and tied requests.

In addition to publishing our research in academic venues, we also aim to create software prototypes that are useful to other researchers, and to foster reproducibility of research results. In particular, this project supports the development of a toolkit for research on indexing and query processing called PISA (Performant Indexes and Search for Academia) [1] in which we implement many of our proposed techniques. More details are provided below.

2. Our Research Efforts

We now provide more details on our research efforts, organized into the three different aspects discussed above. Our research involved four PhD students at NYU Tandon (Antonio Mallia, Juan Rodriguez, Michal Siedlaczek, and Qi Wang), a number of MS and undergraduate researchers, and outside collaborators at several institutions.

2.1. Faster Single-Node Query Processing:

This research thrust builds on previous research by the PI's research group on optimizing the efficiency of query processing on a single node and shard. In particular, we focus on the candidate generation problem, where the goal is to quickly find a few hundred to a few thousand documents that can then be reranked using more complex, often machine-learned, ranking functions. One part of our work looks at the impact of document and index reorderings on query efficiency. In particular, work by PhD student Qi Wang in [2] proposed the first formal model for the efficiency of conjunctive queries on reordered indexes, and described a heuristic reordering algorithm that speeds up conjunctive queries by about 10-20% over the best previously known ordering. In [3] we performed a replication study of the current state-of-the-art reordering method, the BP algorithm of Dhulipala et al. in KDD 2016. This study confirmed the results of the original work, and also produced a complete and highly optimized implementation of the BP algorithm in the PISA toolkit. (The original paper did not make any implementation available.)
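
To make the setting concrete: a conjunctive query boils down to intersecting sorted posting lists of document ids, and a reordering is simply a renumbering of the documents. The sketch below is our own illustration, not the algorithm from [2]:

```python
def intersect(a, b):
    """Two-pointer intersection of sorted doc-id posting lists, the core
    operation of a conjunctive (AND) query."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def rename(postings, perm):
    """Apply a document reordering (a permutation of doc ids) to one
    posting list. A good reordering clusters similar documents, which
    shrinks the doc-id gaps that must be compressed and skipped over."""
    return sorted(perm[d] for d in postings)
```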

Another focus is algorithms for safe disjunctive top-k search, where the goal is to retrieve the k documents with the highest scores (according to a fairly simple ranking function) that contain at least one of the query terms. This problem is particularly relevant for the case of large k, in the hundreds or thousands, where it is commonly used to identify candidates for complex reranking. Our work in [4], led by PhD students Antonio Mallia and Michal Siedlaczek, performed an extensive experimental comparison of existing algorithms under different compression methods. It was published as a reproducibility paper at ECIR, with optimized implementations of these methods integrated into PISA. In work currently underway, Antonio and Michal are looking at improved top-k algorithms that use conjunctions to speed up disjunctive queries [5], and at improved live-block-based methods [6] that follow the approach of Dimopoulos et al. in SIGIR 2013.
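
For reference, the baseline that all such algorithms must match is exhaustive disjunctive evaluation. The sketch below (our own illustration, with a toy additive ranking function) computes exactly the answer that safe pruning methods such as MaxScore or BlockMax WAND return faster:

```python
import heapq
from collections import defaultdict

def topk_disjunctive(index, query_terms, k):
    """Safe (exhaustive) disjunctive top-k: score every document that
    contains at least one query term, then keep the k best. Pruning
    algorithms skip most of this work without changing the result --
    that is what makes them 'safe'."""
    scores = defaultdict(float)
    for t in query_terms:
        for doc, score in index.get(t, []):
            scores[doc] += score  # toy additive ranking function
    # Break score ties in favor of smaller doc ids for determinism.
    return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], -kv[0]))
```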

Several of the above methods require, or benefit from, the use of threshold estimation techniques that estimate the score of the k-th highest-scoring result for an incoming query. In [7] we performed an experimental comparison of the four main approaches that have been proposed, and the two most promising have been integrated into PISA. In other work, PhD student Mallia proposed improvements to the VBMW algorithm for top-k query processing [8], while PhD student Juan Rodriguez explored the potential to trade off space versus speed in index pruning methods [9]. Faster algorithms for index decompression on Graphics Processing Units (GPUs) are described in [10].
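
As a simple example of such an estimator (our own illustration, not one of the four methods compared in [7]): for an additive ranking function, every document scores at least its best single-term contribution, so the k-th largest such contribution is a safe lower bound on the true k-th score and can be used to initialize the top-k heap:

```python
import heapq
from collections import defaultdict

def safe_threshold(index, query_terms, k):
    """Lower-bound estimate of the k-th highest document score for an
    additive ranking function: each document scores at least its best
    single-term posting score, so the k-th best such score is a value
    that at least k documents are guaranteed to reach."""
    best = defaultdict(float)
    for t in query_terms:
        for doc, score in index.get(t, []):
            best[doc] = max(best[doc], score)
    vals = heapq.nlargest(k, best.values())
    return vals[-1] if len(vals) == k else 0.0
```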

2.2. Distributed Search Methods:

The second research focus attempts to improve methods for distributed search, with emphasis on selective search and on index tiering methods.

Selective search is an approach described by Kulkarni and Callan in TOIS 2015 that assigns documents to shards based on topical similarity. Each incoming query is then routed only to those shards deemed most relevant to the query. Selective search can lead to significant reductions in query costs, and also allows trading off quality and efficiency during peak times and under suitable SLAs. Work in [11] by PhD students Michal Siedlaczek and Juan Rodriguez showed how to improve the efficiency of selective search by exploiting global document orderings in addition to topical sharding. This allows exhaustive processing of the most relevant shards, while performing only limited accesses into marginally relevant shards. Ongoing work by MS student Saitong Zhao looks at new methods for creating good topical shards.
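
The routing step of selective search can be sketched as ranking shards by a resource-selection score and keeping the top m. The sketch below uses a simple dot product between the query and per-shard term centroids; the names and the scoring choice are illustrative, and real systems use more sophisticated resource selection methods:

```python
def select_shards(shard_centroids, query_vec, m):
    """Resource-selection sketch: rank topical shards by dot-product
    similarity between the query's term weights and each shard's term
    centroid, and route the query only to the m most promising shards."""
    def sim(centroid):
        return sum(centroid.get(t, 0.0) * w for t, w in query_vec.items())
    ranked = sorted(range(len(shard_centroids)),
                    key=lambda i: sim(shard_centroids[i]), reverse=True)
    return ranked[:m]
```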

Index tiering is a widely used optimization in distributed search engines that divides documents by overall quality into tiers of "high-quality" and "low-quality" documents. All queries are evaluated on the first tier of shards containing high-quality documents, but only some queries are evaluated on the second tier. Most work on index tiering assigns complete documents to tiers. Current work by Juan Rodriguez is looking at the potential of "ragged tiering", where partial documents can be assigned to the first tier.
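
The basic control flow of tiered query evaluation can be sketched as follows. This is a toy illustration with hypothetical tier search functions; real fall-through predicates are more elaborate than a single score cutoff:

```python
import heapq

def tiered_search(search_tier1, search_tier2, query, k, min_score):
    """Two-tier evaluation sketch: every query runs on the first
    (high-quality) tier; only queries whose first-tier results are not
    convincing enough (too few hits, or a weak k-th score) also run on
    the second tier, with results merged by score."""
    hits = search_tier1(query, k)  # list of (doc, score), best first
    if len(hits) < k or hits[-1][1] < min_score:
        hits = heapq.nlargest(k, hits + search_tier2(query, k),
                              key=lambda h: h[1])
    return hits
```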

Another approach currently being investigated is based on predictive indexing, first proposed by Goel et al. in NIPS 2008. Predictive indexing clusters the space of possible queries, and then builds optimized index structures for each cluster. It was originally proposed and evaluated on advertising data and for high-dimensional nearest neighbor search. This work, led by undergraduate researcher Kejian Shi, has two goals: to evaluate the potential of predictive indexing for general web search scenarios, and to study distributed implementations of predictive indexing that integrate it into a selective search architecture -- basically by using the document clustering to induce a clustering on the query space that is then used to build the predictive index.
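
The core idea can be sketched as follows. This is a simplification with made-up names: we prune each query cluster's index to a fixed posting budget per term, whereas the original work precomputes optimized per-cluster orderings:

```python
import heapq

def build_predictive_index(index, cluster_of_term, budget):
    """Predictive-indexing sketch: group terms by the query cluster they
    belong to and keep only the `budget` highest-scoring postings per
    term, so each query cluster is served from its own small, optimized
    index instead of the full one."""
    per_cluster = {}
    for term, postings in index.items():
        bucket = per_cluster.setdefault(cluster_of_term[term], {})
        bucket[term] = heapq.nlargest(budget, postings, key=lambda p: p[1])
    return per_cluster
```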

2.3. Load Balancing and Query Routing in Distributed Search Engines:

The third research focus is on load balancing and query routing mechanisms in distributed search, where the goal is to maximize throughput while satisfying a given SLA. Our work thus far has looked at the following three issues. First, work by undergraduate researcher Mingyang Wang and PhD student Michal Siedlaczek has implemented a simulation of various load balancing policies in the DESMO-J discrete event simulator. Experiments also consider different ways of assigning shard replicas to machines, since a good assignment is important for load balance and robustness to failures.

Second, undergraduate researcher Sofiia Barysheva has studied diffusive methods for load balancing in search engines, inspired in particular by the approach for multi-commodity flow of Awerbuch and Leighton in FOCS 1993. Under such an approach, requests are routed to shard replicas with a simple local control algorithm, and under certain conditions the system will automatically converge to a new routing policy in the presence of machine failures. The challenges here are to prove convergence, and to characterize the situations when a new fully balanced policy is possible.
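
A minimal version of such a local control rule (our own sketch; the cited work analyzes a more general multi-commodity setting) repeatedly moves a fraction of each pairwise load difference along the replica topology:

```python
def diffuse(loads, adj, alpha=0.25, rounds=200):
    """Diffusive load balancing sketch: in each round, every node shifts
    alpha times its load difference toward each neighbor, using only
    local information. For a connected symmetric topology and small
    enough alpha, the loads converge to the global average."""
    loads = list(loads)
    for _ in range(rounds):
        flows = [sum(alpha * (loads[j] - loads[i]) for j in adj[i])
                 for i in range(len(loads))]
        loads = [x + f for x, f in zip(loads, flows)]
    return loads
```

Because each flow out of one node is matched by a flow into its neighbor, the total load is conserved while the imbalance decays geometrically.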

The third, currently underway, line of work focuses on designing good reissue policies for interactive services, including search. The idea underlying reissue policies is that user experience can be adversely impacted by a few requests, say to a particular service or shard replica, that take much longer than average. This problem can be addressed by reissuing such requests to a different replica of the service or shard, as proposed by Dean and Barroso in CACM 2013 (using the term "hedged request"). However, reissuing requests too early or too frequently can increase overall system load, while waiting too long provides little benefit, and thus a good reissue policy has to balance these concerns.
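
This trade-off can be illustrated with a small Monte-Carlo simulation of a single-reissue-point policy (our own sketch; the names and the latency distribution are made up):

```python
import random

def simulate_hedged(latency_sampler, reissue_after_ms, n=100_000, seed=0):
    """Monte-Carlo sketch of a single-reissue-point policy: if the first
    request has not finished after reissue_after_ms, send a duplicate to
    another replica and take whichever finishes first. Returns the mean
    completion time and the total number of requests issued (the extra
    load caused by hedging)."""
    rng = random.Random(seed)
    total_time, issued = 0.0, 0
    for _ in range(n):
        first = latency_sampler(rng)
        issued += 1
        if first > reissue_after_ms:
            second = reissue_after_ms + latency_sampler(rng)
            issued += 1
            first = min(first, second)
        total_time += first
    return total_time / n, issued
```

With a heavy-tailed latency distribution, an early reissue point cuts mean (and especially tail) latency substantially, at the cost of issuing extra requests.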

Our work builds on previous work by Kaler et al. in SPAA 2017, which claimed that a simple class of policies that only reissues requests at a single point in time is at least as powerful as a larger class with multiple reissue points. In work with PhD student Michal Siedlaczek and undergraduate researcher Jingxian Xu, we have shown that this is in fact not true in general, and that policies with multiple reissue points can significantly outperform the simple class in certain scenarios. Also, while finding an optimal policy is NP-complete, there are simple algorithms that can approximate the optimal policy reasonably well. A manuscript is currently in preparation.

Efforts in the remainder of the project will also focus on the following additional questions: (1) How is load balancing impacted by the more skewed workloads generated by selective search and index reordering approaches? (2) How can selective search be applied in conjunction with SLAs that constrain both latencies and quality degradations, thus opening up additional optimizations in load balancing and reissue policies? (3) Can we design good policies for tied requests (where a request is issued to two replicas, but cancelled on one replica once it has been processed on the other), under a suitable model that takes the resulting overheads into account?

2.4. Software Infrastructure and Reproducibility:

One important goal in this project is to contribute software prototypes that can be used by other groups to replicate and extend our work. During the project, we have created and contributed to several open-source projects, in particular PISA, FXT, and CIFF.

PISA (Performant Indexes and Search for Academia) [1] is a C++-based toolkit focusing on indexing and query processing that was initiated by PhD student Antonio Mallia, with contributions from PhD student Michal Siedlaczek and Joel Mackenzie of RMIT University. PISA provides a simple platform for experimenting with new ideas in indexing, index compression, document reordering, and query processing, with a focus on search-engine efficiency research. It provides implementations of many state-of-the-art index compression, index reordering, and query processing algorithms, and many of our codes for this project have been added into PISA. PISA has already been used by several other research groups. A set of Python wrappers called PyPISA is under development.

FXT [12] is a toolkit for feature extraction from large document collections, with focus on learning-to-rank applications, that is spearheaded by Shane Culpepper's group at RMIT University, with contributions from PhD students Antonio Mallia and Michal Siedlaczek. CIFF (Common Index File Format) [13] is a definition of a common format for inverted indexes that allows different search engine prototypes to exchange index data. This facilitates reproducibility as it alleviates the impact of different parsers and parsing settings on the resulting index. CIFF is a collaboration between six institutions, with Antonio and Michal contributing from NYU.

As discussed, many of our codes are available via the PISA toolkit. However, some other codes do not really fit into PISA or have not been integrated, and are distributed via separate repositories, mostly on GitHub.

In addition, we have published several reproducibility studies as part of this project, in particular a comparison of many state-of-the-art index compression and top-k query processing algorithms [4], a reimplementation and evaluation of the BP index reordering algorithm [3], and a comparison of threshold estimation algorithms [7]. PISA was also submitted as an entry to the 2019 Open-Source IR Replicability Challenge (OSIRRC 2019).

2.5. Other Topics:

In addition, there have been other research activities under this project that do not clearly fit into the three major themes. Internships of PhD students Qi Wang and Michal Siedlaczek at Blippar Inc., a company working on augmented reality applications, resulted in a research collaboration on index compression and query processing for instance retrieval systems (systems for recognizing objects in images), published in [14, 15]. Work started by REU researcher Michael Chen, and continued by MS student Yuanhao Han, proposed a new method for incremental positional index updates, while work by REU researcher Rohan Kandi explored language models for search queries. MS Thesis research by Lina Qiu proposed extensions to the open-source Luwak subscription search engine [16]. Research by undergraduate student Zishi Deng, continued after his May 2019 graduation, proposed and evaluated methods for sharding social network graphs [17].

3. Project Participants and Collaborations

Participants at NYU include the PI, Torsten Suel, four PhD students, Antonio Mallia, Juan Rodriguez, Michal Siedlaczek, and Qi Wang, MS students Yuanhao Han, Lina Qiu, Mengyang Sun, and Saitong Zhao, and undergraduate researchers Sofiia Barysheva, Michael Chen, Zishi Deng, Rohan Kandi, Kejian Shi, Jingxian Xu, Mingyang Wang, Yuhong Zhang, and Victor Zheng.

The project has so far produced one PhD graduate, Qi Wang, and two MS Thesis graduates, Lina Qiu and Mengyang Sun. Lina and Mengyang are now PhD students at BU and Tsinghua, while undergraduate researcher Mingyang Wang is entering the PhD program at Duke. Two undergraduate researchers, Michael Chen and Rohan Kandi, received REU support through this project.

Various parts of the project also include outside collaborators from RMIT University and U. of Melbourne (Australia), Blippar Inc., U. of Waterloo (Canada), U. of Otago (New Zealand), and CNR (Italy). Finally, additional funds for parts of the project were provided by a grant from Amazon.

4. Project Publications and Manuscripts

[1] PISA: Performant Indexes and Search for Academia. A. Mallia, M. Siedlaczek, J. Mackenzie, and T. Suel. Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019), July 2019. PDF

[2] Document Reordering for Faster Intersection. Q. Wang and T. Suel. 45th International Conference on Very Large Data Bases, August 2019. PDF

[3] Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. J. Mackenzie, A. Mallia, M. Petri, S. Culpepper, and T. Suel. European Conference on Information Retrieval, April 2019. PDF

[4] An Experimental Study of Index Compression and DAAT Query Processing Methods. A. Mallia, M. Siedlaczek, and T. Suel. European Conference on Information Retrieval, April 2019. PDF

[5] Title omitted for double-blind review. M. Siedlaczek, A. Mallia, and T. Suel. Unpublished Manuscript, January 2020.

[6] Title omitted for double-blind review. A. Mallia, M. Siedlaczek, and T. Suel. Unpublished Manuscript, May 2020.

[7] A Comparison of Top-k Threshold Estimation Techniques for Disjunctive Query Processing. M. Siedlaczek, A. Mallia, T. Suel, and M. Sun. 29th International Conference on Information and Knowledge Management, November 2020, accepted for publication.

[8] Faster BlockMax WAND with Longer Skipping. A. Mallia and E. Porciani. European Conference on Information Retrieval, April 2019. PDF

[9] Exploring Size-Speed Trade-Offs in Static Index Pruning. J. Rodriguez and T. Suel. IEEE International Conference on Big Data, December 2018. PDF

[10] GPU-Accelerated Decoding of Integer Lists. A. Mallia, M. Siedlaczek, T. Suel, and M. Zahran. 28th International Conference on Information and Knowledge Management, November 2019. PDF

[11] Exploiting Global Impact Ordering for Higher Throughput in Selective Search. M. Siedlaczek, J. Rodriguez, and T. Suel. European Conference on Information Retrieval, April 2019. PDF

[12] Feature Extraction for Large-Scale Text Collections. L. Gallagher, A. Mallia, S. Culpepper, T. Suel, and B. Cambazoglu. 29th International Conference on Information and Knowledge Management, Resource Track, November 2020 (to appear).

[13] Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format. J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, and A. de Vries. CoRR abs/2003.08276, March 2020. PDF

[14] Fast Bag-Of-Words Candidate Selection in Content-Based Instance Retrieval. M. Siedlaczek, Q. Wang, Y. Chen, and T. Suel. IEEE International Conference on Big Data, December 2018. PDF

[15] Forward Index Compression for Instance Retrieval in an Augmented Reality Application. Qi Wang, Michał Siedlaczek, Yen-Yu Chen, Michael Gormish, and Torsten Suel. IEEE International Conference on Big Data, December 2019. PDF

[16] Optimizing Subscription Processing in Luwak. L. Qiu. MS Thesis, NYU Tandon School of Engineering, May 2019. PDF

[17] Optimizing Iterative Algorithms for Graph Sharding. Z. Deng and T. Suel. Unpublished Manuscript, May 2020.

[18] Top-K Threshold Estimation in Information Retrieval. M. Sun. MS Thesis, NYU Tandon School of Engineering, May 2019. PDF

5. Software and Datasets

We strive to make codes and data sets developed under this project available for easy access by other groups. Many of our algorithms are implemented as part of the PISA toolkit at https://github.com/pisa-engine, while some other codes are outside of PISA in other github repositories. Following are links to the major code bases that are currently available.