Introductory Principles of Indexed Searching
23-Jun-03
Category: DB-General
Language: Delphi 2.x
Author: Jim McKeeth

Describe the principles of an indexed search engine (like Google): how a large 
amount of text is condensed into an index so that a search can be completed in a 
very short time.

Answer:

Introduction 

There are really two main ways to search a large collection of text documents.  The 
simplest method is to load each document and scan through it for the search terms; 
this is referred to as a full text scan.  The second, much faster method is to 
create an index and then search the index.  An index is a list of the terms found 
in a document or set of documents.  Each word appears only once per document, so 
the index is much shorter than the original document.

Creating an index 

Finding the words 
In order to create an index you must first parse the document.  Parsing is the 
process of picking out the individual tokens (terms or words) in a piece of text.  
A parser is a type of state machine.  There are many existing parsing routines 
available; "The Tomes of Delphi: Algorithms and Data Structures" by Julian Bucknall 
contains many very good parsers.  Here is a simple example.

A simple parser would scan through a string of text, starting at the beginning and 
looking at each character.  If the character is a letter or a number then it is 
part of a word; if it is white space or punctuation then it is a separator.  Each 
word is added to a list (e.g. a TStringList) in the order it is found in the 
document.  Typically each word is converted to the same case (upper or lower).
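
A minimal sketch of such a parser might look like the following (the routine name 
ParseWords is made up for illustration; only letters and digits are treated as word 
characters, and everything is lowercased): 

// uses Classes (TStrings), SysUtils (LowerCase)
procedure ParseWords(const Text: string; Words: TStrings);
var
  i: Integer;
  Token: string;
begin
  Token := '';
  for i := 1 to Length(Text) do
    if Text[i] in ['A'..'Z', 'a'..'z', '0'..'9'] then
      Token := Token + Text[i]        // letters and digits build up a word
    else
    begin
      if Token <> '' then
        Words.Add(LowerCase(Token));  // separator reached: store the word
      Token := '';
    end;
  if Token <> '' then
    Words.Add(LowerCase(Token));      // flush the last word
end;

Calling ParseWords with the document text and a freshly created TStringList yields 
the token list in document order. 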

It is really important to consider what you are indexing and how your index will be 
used when creating your index and parsing.  For example, if you are parsing HTML 
then you want to exclude most tags (with the obvious exception of META tags, which 
are handled specially).  Other times you might only want to index summary 
information about each document.

Indexing the Words 
Now that we have a parsed token list we need to index it.  The simplest index is 
just a list of each word found in a document, along with a reference to the 
document.  This reference may be a URL, a document name or any other unique 
identifier (a GUID or a foreign key to another table describing the document).  A 
more complex index may include the number of times the word is found in the 
document, or a ranking for where it appears in the document (in the title, keyword 
section, first paragraph, middle, last, etc.).  This additional information stored 
with each word is part of what differentiates one search engine's performance from 
another's.
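
For instance, the per-document word counts can be accumulated with a sorted 
TStringList, stashing each count in the Objects property (a common Delphi idiom; 
the integer-to-TObject cast is only a storage trick and is never dereferenced): 

procedure CountWords(Words, Counts: TStringList);
var
  i, Idx: Integer;
begin
  Counts.Sorted := True;  // sorted lists make Find a fast binary search
  for i := 0 to Words.Count - 1 do
    if Counts.Find(Words[i], Idx) then
      // Word already seen: increment the count stored alongside it
      Counts.Objects[Idx] := TObject(Integer(Counts.Objects[Idx]) + 1)
    else
      Counts.AddObject(Words[i], TObject(1));
end;

Each resulting (word, count) pair then becomes one row of the index for that 
document. 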

Many times certain words are left out.  These are called stop words.  Stop words 
are common words, words that will not be searched on, or words that do not 
enhance the meaning of a search.  Examples of stop words include "THE, A, AN, 
AND, IF, BUT", words with numbers in them, or anything else you want to filter out. 
 Selecting stop words is another point where search engines' performance differs.

Some web search engines used to leave out words like "HTML" or "WEB" because they 
were so common, while other search engines would include every word.  Other search 
engines start with a dictionary list and only index words found in that dictionary. 
 This leads to trouble when you are indexing names, technical terms or anything 
else not found in your original dictionary.

One time I was creating a search engine for a collection of newsgroup articles.  I 
discovered that there were UUEncoded (similar to MIME or Base64) binaries in the 
articles.  This resulted in my parser finding words that were hundreds of 
characters long and total gibberish.  I decided to omit any word longer than 50 
characters or shorter than 4.  Making the choices about what to include and what to 
omit is an important decision, and will vary based on what content you are 
indexing.
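
A filter along those lines might look like this sketch (the length bounds come from 
the anecdote above; the stop-word list is illustrative and uses words of four or 
more letters, since the length check already removes shorter ones): 

function IsIndexable(const Token: string): Boolean;
const
  // Assumes tokens were already lowercased by the parser
  StopWords: array[0..3] of string = ('with', 'from', 'this', 'that');
var
  i: Integer;
begin
  // Reject gibberish and very short words by length alone
  Result := (Length(Token) >= 4) and (Length(Token) <= 50);
  if Result then
    for i := Low(StopWords) to High(StopWords) do
      if Token = StopWords[i] then
      begin
        Result := False;  // stop word: leave it out of the index
        Exit;
      end;
end;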

So here is an example table structure for your index: 

Table: WordList 

Document: Number (foreign key to Documents table) 
Word : String[20] (if our longest word is 20 characters) 
Count : Number (how many times the word is found) 

The primary key would be a compound of Document and Word since each word is listed 
once per document. 

Table: Documents 

Document : Number(primary key index) 
Title : string (title of document) 
Location : string (URL or full filename and path) 

Optionally you could include the entire document as a blob in this table.  You 
could also have other tables that list terms (from the meta section of the 
document) or authors.  Again, this design choice depends on the type of 
documents you are indexing and the purpose of your search engine.
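
As a sketch of populating these tables with the BDE (assuming a TTable component, 
here called WordTable, already attached to the WordList table, and the Counts list 
from the earlier counting sketch): 

// uses Classes, DBTables; WordTable's fields are Document, Word, Count
procedure StoreIndex(DocID: Integer; Counts: TStringList; WordTable: TTable);
var
  i: Integer;
begin
  for i := 0 to Counts.Count - 1 do
    // One row per distinct word in this document
    WordTable.AppendRecord([DocID, Counts[i], Integer(Counts.Objects[i])]);
end;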

Searching Your Index 

Once all the indexes are stored in a database you need to be able to search the 
index for a document.  A simple SQL statement to search for a document that 
contains a single word could look like this: 

SELECT * 
FROM WordList 
WHERE Word = :v_search_term 
ORDER BY Count DESC 

This returns all documents containing your single search term, ordered by the 
number of times the word is found.  If you want to stay in SQL, then searching on 
multiple terms involves a join for each term.  Instead you could retrieve a list of 
documents for each term and then merge the lists manually.  This is where you would 
support the AND, OR or NOT keywords.
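
For example, AND can be implemented by intersecting the document lists returned for 
each term.  A sketch, assuming each list holds document identifiers as strings: 

function IntersectDocs(A, B: TStringList): TStringList;
var
  i, Idx: Integer;
begin
  Result := TStringList.Create;
  B.Sorted := True;  // note: re-sorts B so Find can do a binary search
  for i := 0 to A.Count - 1 do
    if B.Find(A[i], Idx) then
      Result.Add(A[i]);  // document contains both terms: AND semantics
end;

OR is the union of the two lists and NOT is the difference, built the same way. 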

If you want to allow phrase searching then you could search for each word in the 
phrase and then search those documents for the phrase.  The same technique could be 
used for the NEAR keyword.  There are other more advanced techniques to do this 
that are much quicker, but they are beyond the scope of this document.
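
That verification step can be as simple as a case-insensitive substring test over 
each candidate document (a sketch; faster engines use word positions stored in the 
index instead of rescanning the text): 

// uses SysUtils (LowerCase); Pos is built in
function ContainsPhrase(const DocText, Phrase: string): Boolean;
begin
  Result := Pos(LowerCase(Phrase), LowerCase(DocText)) > 0;
end;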

Once the hits are found and ranked then display the title of each document, 
possibly a summary or the context of the hits, and provide a way for your user to 
reach the document. 

Variations 

One thing Google does a little differently is that it looks at how pages are 
linked.  This works really well with the hyperlinked nature of the web.  For 
example, if you search for Borland, most pages that mention Borland link to 
http://www.borland.com.  This is taken to indicate that www.borland.com is a very 
important site about Borland.  Google also limits the number of hits you get from 
each domain.

Many search engines also rank pages higher if the search term appears in the URL or 
title of the page.  They also look at the description and keywords meta tags for 
ranking.  Some search engines will actually ignore a word if it appears too often 
in a page.  This weeds out sites that try to artificially inflate their rankings.   

Phonetics, or Soundex, is another technique that can be used.  This could be done 
with an additional table similar to the word table, but storing the Soundex value 
of each word instead of the word itself.
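
A minimal version of the classic Soundex algorithm might look like this (it omits 
the refinement that treats letters separated by H or W as adjacent; later Delphi 
versions ship a ready-made Soundex in the StrUtils unit): 

function SoundexCode(const S: string): string;

  function Code(C: Char): Char;
  begin
    case UpCase(C) of
      'B', 'F', 'P', 'V': Result := '1';
      'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z': Result := '2';
      'D', 'T': Result := '3';
      'L': Result := '4';
      'M', 'N': Result := '5';
      'R': Result := '6';
    else
      Result := '0';  // vowels, H, W, Y: not coded
    end;
  end;

var
  i: Integer;
  Prev, Cur: Char;
begin
  Result := '';
  if S = '' then Exit;
  Result := UpCase(S[1]);  // keep the first letter as-is
  Prev := Code(S[1]);
  for i := 2 to Length(S) do
  begin
    Cur := Code(S[i]);
    // Skip uncoded letters and runs of the same code
    if (Cur <> '0') and (Cur <> Prev) then
      Result := Result + Cur;
    Prev := Cur;
  end;
  // Pad with zeros or truncate to the conventional 4 characters
  while Length(Result) < 4 do
    Result := Result + '0';
  Result := Copy(Result, 1, 4);
end;

For example, SoundexCode('Robert') and SoundexCode('Rupert') both yield R163, so a 
search on one would match documents containing the other. 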

Conclusion 

Searching a short and well organized index is much quicker than searching an 
entire document.  Creating and searching an index takes a lot more planning and 
effort up front, but quickly pays off if the text is searched often.  Typically, 
the larger and more complex the index, the more effective the search; but if your 
index gets too large or complex, search speed will degrade.

There are off-the-shelf searching tools available to end users and developers 
alike.  dtSearch ( http://www.dtsearch.com ) makes a wide range of searching tools 
for both end users and all types of developers.  Tamarack Associates' Rubicon 
( http://www.tamaracka.com ) is a search component for Delphi that provides a lot 
of flexibility, especially in storage options.  Both are extremely fast and useful, 
but don't let that stop you from designing your own, especially if you have a 
specialized need.

See also: 

http://www.dtsearch.com/dtsoftware.html#Art_of_The_Text_Query
