Spatial Approximate String Search
2011
text
Many applications are featured with both text and location information, which leads to a novel type of search: spatial approximate string search (Sas). The Sas is gaining attention from the database community only recently. A large variety of problems remain open. Spatial and text data have been independently studied for decades. Spatial data have many unique features that are drastically different from text data. As a result, most of the existing techniques for string processing are either inapplicable or inefficient when adapted to spatial databases. We have investigated four issues in the general area of Sas. They are: (i) Spatial approximate string search in Euclidean space (Esas); (ii) Spatial approximate string search on road networks (Rsas); (iii) Selectivity Estimation for Esas Range Queries; (iv) Multi-Approximate-Keyword Routing query on road networks. For efficiently answering spatial approximate string queries in Euclidean space, we propose a novel index structure, MhR-tree, which is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. For the Rsas, we propose a novel method, RsasSol, which is superior to the base line algorithm in practice. The RsasSol is a combination technique of inverted list and reference points. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our methods. We also discuss how to estimate range query selectivity in Euclidean space since the query selectivity estimation is always important for the query optimization. Based on the spatial uniformity principle and the minimum number of neighborhoods principle for strings, we define a partitioning metric to facilitate the construction of buckets. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the R-tree, which works well in practice. Lastly, we go further beyond the basic kNN and range queries for Sas and consider an interesting application of Sas on the road network: find the shortest path with keywords constraints on the road network. We propose both exact and approximate solutions for this issue. For small set of query keywords (m ≤ 6), our exact solutions can quickly find the top-k shortest paths. For large set of query keywords, our approximate solutions can still return the answer on real time with good approximate quality.
Routing, R-tree, Selectivity Estimation
June 24, 2011.
A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Includes bibliographical references.
Feifei Li, Professor Directing Thesis; Washington Mio, University Representative; Ashok Srinivasan, Committee Member; Piyush Kumar, Committee Member.
Florida State University
FSU_migr_etd-0988
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s). The copyright in theses and dissertations completed at Florida State University is held by the students who author them.