Skip to main content
UOIT Home
Please use this identifier to cite or link to this item: http://hdl.handle.net/10155/295

Issue Date: 1-Dec-2012
Title: Using Rabin-Karp fingerprints and LevelDB for faster searches
Authors: Deighton, Richard A.
Publisher : UOIT
Degree : Master of Science (MSc)
Department : Computer Science
Supervisor : Pu, Ken
Keywords: Rabin-Karp fingerprints
Text search
LevelDB
Abstract: This thesis represents the results of a study into using fingerprints generated according to the Rabin-Karp Algorithm, and a database LevelDB to achieve Text Search times below GREP, which is a standard command-line UNIX text search tool. Text Search is a set of algorithms that find a string of characters called a Search Pattern in a much larger string of characters in a document we call a text file. The Rabin-Karp Algorithm iterates through a text file converting character strings into fingerprints at each location. A fingerprint numerically represents a window length string of characters to the left of its location. The algorithm compares the calculated fingerprint to the Search Pattern’s fingerprint. When fingerprints are not equal, we can guarantee the corresponding strings will not match. Whereas when fingerprints are, the strings probably match. A verification process confirms matches by checking respective characters. Our application emerges after making the following major changes to the Rabin-Karp Algorithm. First, we employ a two-step technique rather than one. During step 1, the preprocessing step, we calculate and store fingerprints in a LevelDB database called an Index Database. This is our first major change unique to us. Step 2, the matching step, is our second unique change. We use the Index Database to look-up the Search Pattern’s fingerprint and gather its set of locations. Finally, we allow the pattern to be any length relative to the window length. We even created an equation to check if the difference in length is too long for the fingerprint’s number system base. We facilitated our performance experiments by first building our application and testing it against GREP for a wide range of different parameters. Our conclusions and recommendations determine that although we currently only outperform GREP in about half the cases, we identify some promising opportunities to modify some parts of our application so that we can outperform GREP in all instances.
Appears in Collections:Electronic Theses and Dissertations (Public)
Faculty of Science - Master Theses

Files in This Item:

File Description SizeFormat
Deighton_Richard.pdf2.36 MBAdobe PDFView/Open

Items in e-scholar@UOIT are protected by copyright, with all rights reserved, unless otherwise indicated.