Thursday, March 25, 2010

Search Techniques – Part 1: Overview


Since many days I am thinking about different kinds of search requirements and suitability of current available tools. Since it’s a very big topic, I am thinking to look at one search aspect at a time. In this section (part), I would be going through the overview of the search techniques and it would be followed by different search tools available today and their applicability in next sections (parts). There are two kinds of data/information retrievals. In the first kind, the user is aware of the full context of the data i.e. steps to locate the data. For example retrieving a file from a particular folder in your folder system falls into this category. In the second kind, user has some or full idea about the data he is looking for but he is not aware of its context. For example, retrieving a file from a forgotten/unknown folder of your file system falls in this kind of retrieval. By the term “search”, we often mean second kind of data/information retrieval.

There are different ways to categorize the search. We can categorize the searches based on the algorithms e.g. linear, binary etc used in the search process. Also we can categorize the search based on the search object e.g. data or content etc. Another classification could be based on the context of the search e.g. local/specified repository search or virtual space search. Let’s go over these different kinds of classifications first before moving to the different search tools available and their applicability.

Search Categories based on algorithm: There are huge number of search algorithms available today. Some of them are listed below:

1.      Linear Search: This is basic algorithm of search when we compare each source element one by one against the search element. This is simplest and well performing algorithm if the source elements are of limited number e.g. searching a particular card from a set of playing cards. Since there are 52 cards only, my best case scenario could be that I find the desired card as first card and the worst case scenario could be that the desired card is the last card in the stack. But we are not going to have more than 52 comparisons in any case.

2.      Binary: As it’s clear from the linear search section that if the size of source elements is high, performing a linear search could be very expansive. Binary search is an attempt to address the limitation of linear search and is applicable to only elements which could be sorted in some order e.g. numeric data or string based data. Binary search is a two step process. As a first step, we sort the data in a particular order. And then as a second step, we start comparing the search element against the middle element in the source list. If the middle element is bigger/smaller as per the sorting criteria, we limit the search is one half and ignore the second half of the list. This way, at each comparison, we reduce the size of the source by half. This is well performing search algorithm but not the best and also it’s applicable to only sortable elements.


3.      Hash Table Search: This is most interesting search and widely used search technique. This has proved very efficient in cases where the source elements consist of multiple fragments e.g. String, Text Files, and Web Page etc. Hash table is a structure of creating an identifier of the element and linking the element with the identifier. The identifiers are commonly referred as indexes. Under hash table search, we perform the search on identifiers and retrieve the source element of the matching identifier. I will go through it in details when I talk about various search tools.

4.      Tree Search: Tree search is an extended version of binary search. Similar to binary search, it’s a two step process. As a first steps, we arrange the source elements in form of tree (node graph) and as second step, we perform the search by traversing the tree elements and comparing with the search element. It can be further classified based on the tree nature and our traverse mechanism as below:


a.       Linear: As similar to linear search mentioned earlier, linear tree search traversal is a mechanism of traversing every node of the tree one by one starting from the root. Linear traversal can be performed in 4 ways.

                                                               i.    Depth First: This is a traversal mechanism when we visit the child nodes (vertical) before sibling (horizontal) nodes.
                                                             ii.   Breadth First: This is a traversal mechanism when we visit the sibling nodes (horizontal) before sibling nodes (vertical).
                                                            iii.   Left First: This is a traversal mechanism when we visit the left node before right nodes.
                                                           iv.     Right First: This is a traversal mechanism when we visit the right node before left nodes.

b.      Binary: Again as similar to binary search, this is applicable to the elements which can be sorted in some order. In applicable scenario, we make a tree with restriction of every intermediate node having exact two child nodes arranged in the sorted order. This arrangement is called binary tree. To perform a search, we compare the search element with root element. If not matching, we proceed to node which falls under the sorted order side and ignore the other node and hence other part of the tree completely e.g. if the tree is sorted in ascending order (left node has smaller value than right node), if searching element is smaller than intermediate node, we proceed in the left side and ignore the right side completely.

c.       Heap: Similar to binary search, heap search is also applicable to the elements which are sortable. Under this mechanism, we arrange the elements in a tree format with a condition that intermediate node is always bigger/smaller than all it descendent nodes. This way the root node is always biggest/smallest element in the tree. This property is called heap and search process using heap tree is called heap search. There are many variations of the heap search and binary search could be considered as one of its variations.


5.      Heuristic Search: This is most interesting search technique and the search technique involving artificial intelligence. Under this technique, we apply some artificial intelligence in the search process and search result is not required to be completely search criteria satisfying result. We apply extra logic during the searches and also retrieve elements partially satisfying the search criteria with mostly matching result in the top. To get a clear understanding of this process, imagine any online search, product store search. The search result may not exactly satisfy the search term but most applicable result would be on the top. There are logics involved in deriving the priority of matching results. We will go through it in details while going through different search tools. There are multiple variations of heuristic searches. Some of the variations are as follows.

a.      Pattern Matching: Under this technique, the base of heuristic search is the pattern of the searching element and search result is result of all matching and similar patterns. For example, if we are searching for term “good people”, few variations of the search term could be, full both words together, both words but not together, first full word only, and second full word only. All the results matching both words together would appear in top followed by other results. Any online search is typical example of this.

b.      Behavioral Search: Under this technique, the base of heuristic search is the behavioral aspect of the search term. For example, if we are searching form term “good people”, the behavioral aspect of it is taken as basis of the search. Some of the considered behavioral aspects could be good people, good society, good locality, polite people, educated people etc. and search result could be search result of all these behavioral variations.

c.       Contextual Search: Under this mechanism, basis of heuristic search is the context of the search term/element. Other example could be a product search on a product website. Most of the good product website such as Amazon.com will perform contextual search of the desired product and return exact and all other products of the context. E.g. if we are searching for Shaving Blade, the search result might return all other products used in saving. This can be further classified in various categories; some of them are as following:

                                                               i.      Flat Search: Under this mechanism, we device the algorithm of prioritization and all search results are presented in the flat format. Some of the less popular product web sites perform this kind of search against the desired product. User is fully responsible to either refine the search term or traverse through all result elements to further narrow down.
                                                             ii.      Guided Search: Under this mechanism, the most appropriate result is presented along with a navigation/guide to narrow down the specific search object. Considering the above example, we might be presented with different Shaving blades along with some navigational categorization such as type, brand, pricing rage etc. Following the navigational categorization, the user is directed to the specific product.

Search Categories based on search object: There are various ways to classify the searches based on the search object. The most common category is as following.

1.      Data (Numeric): As it’s clear from its name, data search is a category when we perform our search logic over a certain set of data (numbers). For example, if we are maintaining details of an organization’s employees in a tabular format (database) then process of locating details of a particular employee using his employee number (numeric) falls into this category. Performing this kind of search is comparatively simpler than performing content search.

2.      String: This is little complex as compared to the numeric search. Sometimes it’s referred as String Matching Search. There are several algorithms available such as Native String Search, Rabin-Karp Search, Knuth-Morris-Pratt & Boyer-Moore; to perform string search. These all algorithms differ in a sense of comparing the search string against the string source repository elements. The complexity comes in picture because strings are not stored as a single value as compared to numeric values e.g. if I consider a numeric value e.g. “1134”, though it consists 4 digits but it’s stored as a single value while if I consider a string value e.g. “HAPPY”, its stored as 5 different characters. Therefore to perform a string search, we need to compare each character of the search string against the source string. This becomes very complex when we are performing a string search other than exact match.


3.      Content: This is highest complexity search category. As we have a got a sense from string search section, when data is stored in pieces (e.g. characters of string) and search is non-exact match search, it becomes more complex to perform the search. Content can be imagined as bigger form of string and content search is always of non-exact match search. Search a text file using some fragments of contained search falls into this category. Most of the online searches also fall under this category.

Search Categories based on search context: Again there are various ways to categorize the search based on the search context. The most common categorize are as below:

1.      Local: When all the search elements are locally available, we can categorize the search as local search. For example, searching files in files system of your computer, or search any records in your database falls under this category.

2.      Remote Space: When the search elements are remotely available but are definite and properly identified, we can classify the search as remote space search. For example, searching a flight details on service provider website falls under this category.

3.      Virtual Space: When our search domain is wide and unknown, we classify the search as virtual space search. Any online search such as Google search, Yahoo search falls under this category. This most challenging and most wide search category now days. Again we will go in details in next sections (parts).

I would like to stop here as it is pretty wide topic and I would like to start thinking about various available  search tools and their applicability. If interested in search tools, please follow the other blogs with respective details.

No comments:

Post a Comment