Log in ....Tribune--Feature article

Monday, July 30, 2001		Lead Article

SEARCH SMART

Everything is available on the Net, so goes the popular belief. But if you have to find anything, you have to use a "search engine." These "engines" have names like Google, Alta Vista, Excite, Lycos, and all of them use specifically designed software to cull answers to your queries from the Net maze. Kuljit Bains takes a look at the working of search engines and explains how to use them optimally.

HEY, go to the Internet! That’s what everyone tells you when you want information on any common or abstract subject that might come up in your daily activity, or even to settle an argument.

You’ve been told that whether you want to know what’s taken the shine off your dog’s coat, what the latest medical options are for keeping your cholesterol low and your heart going, or even who’s the "top-seeded" actress in Bollywood, it’s all there.

The trouble is finding just where that information is in this blundering monolith of a Web! It is supposed to have well over a billion pages of data available to the general public, and more is being added to it each day, what with all the universities and dot.coms (even if most of them have gone bust) working overtime.

If we did not have the right tools to dig through this pile of information, it would be of no use. Fortunately, there are tools for that and many of us use them. They are called search engines and Web directories. They essentially help you find the Web site on which the information you want may most likely be.

While beginners on computers may not be using them at all, a lot of the old hands also use them only perfunctorily and do not realise even half of the potential that a good search engine can deliver.

To make the best of the money you have invested in the PC and that you spend on the Internet connection, make an effort to learn to use search engines efficiently and save time. As search engines are nothing but some software working on machines placed "out there," you have to realise that they do not have brains to understand what you are looking for. You have to understand how search engines "think" and then present your requirement according to their limited, though magnificent, capabilities. You have to search smart.

HOW DO THEY WORK

A search engine is, in essence, a database of the information available on the Web, kept in a certain order so that it can readily be located and accessed when a common user asks for it through a simple interface on a Web site.

Database: All the data of the Internet in one place? Yes! That’s almost what it is, or at least that’s what every search engine attempts to have. (Google, one of the most popular search engines, is said to have over 1 billion pages in its database.) This Herculean task is performed by software robots called spiders that are sent out to crawl the Web.

These spiders start visiting sites by getting links from server lists (DNS entries), and lists of the most popular or best sites. They then follow the links on these pages to find more links to add to the database. While some databases want the spider to send back only the title and URL (address) of each page it visits, or just some HTML tags, nowadays most want them to send back the entire text of each page along with the information on where it was found.
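The crawl described above can be sketched in a few lines of Python. This is only an illustration, not how any real spider is built: the tiny `WEB` dictionary stands in for the Internet, and the page addresses, the `crawl` function, and the seed list are all made up for the example.

```python
from collections import deque

# Hypothetical miniature "Web": each address maps to (page text, outgoing links).
WEB = {
    "a.example/home": ("welcome to our rose garden", ["a.example/roses", "b.example/"]),
    "a.example/roses": ("roses need fertiliser", ["a.example/home"]),
    "b.example/": ("family tree research", ["c.example/"]),
    "c.example/": ("plumbing joints and pipes", []),
}

def crawl(seeds):
    """Visit pages breadth-first from the seed list, following links to new pages."""
    database = {}                 # address -> full text sent back by the spider
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in database or url not in WEB:
            continue              # already visited, or a dead link
        text, links = WEB[url]
        database[url] = text      # store the whole text along with where it was found
        queue.extend(links)       # follow the links on this page later
    return database

pages = crawl(["a.example/home"])
print(sorted(pages))
```

Starting from a single seed page, the spider reaches all four pages simply by following links, which is why a good starting list matters so much.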

Another source of data is the submissions made to the search engines by the authors of new Web sites.

The information, once acquired, then has to be stored on the servers of the company providing the search engine in an ordered way so that it is useful. The data is indexed so that the engine knows which bit of data was found where. Every word found on a page is given a weight according to where on the page it was found (page title, heading, sub-head, etc.) and how many times it occurs. Using these statistics and other algorithms, search engines try to establish the context and relevance of each word in the database.
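A rough sketch of such a weighted index follows. The weights in `FIELD_WEIGHTS` are assumptions chosen for the example (real engines do not publish theirs), and the two sample pages and the `build_index` function are hypothetical.

```python
from collections import defaultdict

# Assumed weights: a word in the title counts for more than one in the body.
FIELD_WEIGHTS = {"title": 3, "heading": 2, "body": 1}

def build_index(pages):
    """Map each word to {page address: weighted count of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for url, fields in pages.items():
        for field, text in fields.items():
            for word in text.lower().split():
                index[word][url] += FIELD_WEIGHTS[field]
    return index

pages = {
    "p1": {"title": "Rose care", "heading": "Feeding roses",
           "body": "roses need fertiliser and roses need water"},
    "p2": {"title": "Garden notes", "heading": "Misc",
           "body": "a few roses grow here"},
}
index = build_index(pages)
print(dict(index["roses"]))
```

Looking up a word now tells the engine at once which pages contain it and how heavily each page leans on it.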

Engines have basically two ways of searching the database: keyword search and concept search.

Keyword search is the most commonly employed method. Search engines spot and index words that they consider significant. Words mentioned towards the top of a document and those that are repeated in a particular document are taken as more important. While most search engines index each word, others index only part of the document, such as the title, headings, subheadings, hyperlinks to other sites, or the first few lines of text. Some of the search engines may discriminate between upper and lower case while others don’t.

One problem with keyword searching is that it may not be possible to distinguish between two words that are spelled the same way but have different meanings. A search on ‘tree’ may also lead to ‘family trees’ instead of just the horticultural variety.

Concept-based search systems try to understand what you mean. A concept-based search looks for the subject you're exploring, even if the keyword you give does not match the words on the document precisely. This system usually examines words in relation to other words found nearby. It calculates the frequency with which certain words appear. When several words or phrases that are tagged to signal a particular concept appear close to each other, the search engine concludes that the document is "about" a certain subject.

For example, the word joint, when used in the medical context, would more likely be accompanied by such words as bone, fracture, or arthritis. If it appears in a document with other words like pipes, taps, or houses, the search engine gives results on the subject of plumbing. This system is, however, far from perfect. The results are good only when you enter a lot of relevant words. Thus, most search engines use the keyword method.
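The idea of guessing a subject from neighbouring words can be illustrated with a toy example. The word lists in `CONCEPTS` are taken from the joint example above and are far smaller than anything a real system would use; the `guess_concept` function is hypothetical.

```python
# Hypothetical concept vocabularies, built from the "joint" example.
CONCEPTS = {
    "medicine": {"bone", "fracture", "arthritis"},
    "plumbing": {"pipes", "taps", "houses"},
}

def guess_concept(text):
    """Return the concept whose signal words occur most often in the text, or None."""
    words = text.lower().split()
    scores = {name: sum(w in vocab for w in words) for name, vocab in CONCEPTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_concept("the joint between two pipes under the taps"))
print(guess_concept("arthritis in the knee joint near the bone"))
```

A document that mentions joint with no supporting words gives the function nothing to go on, which mirrors the weakness noted above: the results are good only when a lot of relevant words are present.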

Ranking: Notwithstanding the narrowing down of relevant results by the logic search engines use, the results that they deliver for a particular query may still be so large in number that you would not be much better off than before the search. To fix this problem, search engines use some more logic to present the results in a sequence such that, according to them, the most relevant result is at the top followed by other results with diminishing relevance.

Note the expression "according to them." The results are what the engine believes you wanted. It may not be correct. While you may be talking of heart as in romance, the results may be on heart as in heart attacks! This is where your smartness comes in. You have to understand the way search engines’ logic works.

Various parameters are used to guess the relevance. Most search engines assume that if the term you searched for appears more frequently in a particular document, then that document is more relevant, and they show it near the top of the search result list.

The position of the keyword in a particular document is also used to determine relevance. If the keyword appears early in the document, or in the headers, the document is considered more relevant. So hits may be ranked according to how many times the keyword appears in the indices of the document and in which fields it appears (i.e., in headers, titles or text).

Yet another parameter used is the number of links from other pages on the Web to that particular document. It is taken that if other people consider it important, you may, too.

Search engines may use some or all of these ways to assess the relevance of a particular page to your query. There are certain search engines that allow you to even assign relevance weights to your query terms before conducting a search. Although this requires practice, it allows you to have a say in what results you get.
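The three signals just described (frequency, position, and incoming links) could be combined along the following lines. The weights 1, 5 and 2 are arbitrary numbers picked for the illustration, and the sample pages, `score`, and `rank` are hypothetical; no actual engine's formula is being reproduced here.

```python
def score(term, page):
    """Combine frequency, title position and inbound links with illustrative weights."""
    term = term.lower()
    frequency = page["body"].lower().split().count(term)    # how often the term occurs
    in_title = term in page["title"].lower().split()        # a prominent position
    return frequency + 5 * in_title + 2 * page["links_in"]  # links from other pages

def rank(term, pages):
    """Return page names with the highest score first."""
    return sorted(pages, key=lambda name: score(term, pages[name]), reverse=True)

pages = {
    "love-poems": {"title": "Poems", "body": "my heart my heart my heart", "links_in": 0},
    "heart-health": {"title": "Heart attack prevention",
                     "body": "keep your heart healthy with exercise", "links_in": 3},
}
print(rank("heart", pages))
```

With these weights the medical page outranks the poems even though the poems repeat the word more often, so someone searching for heart as in romance would see the cardiology page first.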

Meta-tags: These are unseen words put in the code of a Web page by its author to tell the search engines which concepts the page should be indexed under. This can be helpful particularly when certain words are likely to have more than one meaning. However, certain Web authors, in order to get more hits, also stuff the meta-tags with irrelevant words. This undermines the system. As a result, the latest trend is for the Web-crawling spiders to ignore meta-tags, especially if there are too many of them.
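As an illustration, a spider might read the keywords meta-tag and discard it when it looks stuffed. The sketch below uses Python's standard `html.parser` module; the cut-off `MAX_KEYWORDS` is an assumed figure, and the class and function names are invented for the example.

```python
from html.parser import HTMLParser

MAX_KEYWORDS = 10   # assumed cut-off; real engines do not publish theirs

class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated words from <meta name="keywords" content="...">."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]

def usable_keywords(html):
    """Return the page's meta keywords, or nothing at all if there are too many."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords if len(parser.keywords) <= MAX_KEYWORDS else []

page = '<html><head><meta name="keywords" content="Roses, fertiliser"></head></html>'
print(usable_keywords(page))
```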

HOW TO SEARCH

Now that we know how search engines basically work, we should use this knowledge to think the way search engines think so as to reduce the communication gap between man and machine.

First, decide on the kind of information you want: general information about a broad subject or a very specific bit.

If it is general information that you want, your best strategy would be to go to a Web directory like Yahoo. Yes, Yahoo is not really a search engine, though it can nowadays use the Google search if you want. It is a kind of catalogue, somewhat like the library card systems.

Data is filed under subject heads to which there are links, followed by further links to sub-topics and so on. For example, there may be a series of topics and sub-topics as health>cardiology>treatment>blockage.

While most sites that provide search engines now also offer Web directories, these directories are of limited benefit. You may be led up a lot of irrelevant links. However, some Web directories have specialised search utilities like "people search" or "e-mail search," which may be very useful.

The best bet for specific information is search engines. Using them needs some basic knowledge of the tools they provide to make smart searches as well as some practice initially.

You learn a lot by making a search and studying the results you get. The more you do, the better you become.

While most search engines first give you the option of a quick or basic search on their home page, they all offer "advanced search" options also. If your purpose is not served by the quick search, it is best to go to the advanced option.

Here you will find a lot of terms used and search options asked. The basic aim is to narrow down the results delivered to exactly what you want. The search engine tries to understand your exact requirement through these. To help it do that, you will have to know some of the basic search concepts and terms given below. While most of these options are there in search engines, not all are offered by all.

Boolean search refers to how multiple terms that you may enter are combined in a search:

AND requires that both terms be found while OR lets either term be found. NOT means any record containing the second term will be excluded. ( ) means the Boolean operators can be nested using parentheses.

+ is equivalent to AND; the + should be placed directly in front of the search term.

- is equivalent to NOT and means to exclude the term; the - should also be placed directly in front of the search term.

Operators can be entered in the case shown by the example.

Examples: (mystery and (writer or author)) not novel. Using the symbols, you may instead say +mystery -novel writer author
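Boolean operators map neatly onto set operations, which is one way to picture what the engine does with such a query. In the sketch below, the four sample documents and the `having` helper are invented for the illustration: AND becomes set intersection, OR becomes union, and NOT becomes difference.

```python
# Hypothetical document collection: id -> text.
docs = {
    1: "a mystery writer at work",
    2: "the author of a mystery novel",
    3: "a romance author",
    4: "mystery author interview",
}

def having(term):
    """Return the set of document ids whose text contains the term as a whole word."""
    return {doc_id for doc_id, text in docs.items() if term in text.split()}

# (mystery AND (writer OR author)) NOT novel
result = (having("mystery") & (having("writer") | having("author"))) - having("novel")
print(sorted(result))
```

Document 2 is dropped because it mentions a novel, and document 3 because it never mentions mystery.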

Default operation is what happens when multiple terms are entered for a search using no Boolean operators, + or - symbols, phrase marks, or other special features.

Example: If you enter family tree, different search engines would respond differently. The query could be processed as family AND tree, as family OR tree, or as the exact phrase "family tree".
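How much the default matters can be seen by running the same two-word query under each of the three interpretations. The sample documents and the `search` function below are hypothetical; the `mode` argument simply stands in for whatever default a given engine has chosen.

```python
# Hypothetical document collection: id -> text.
docs = {
    1: "family tree software",
    2: "a tree in the garden",
    3: "tree house for the family",
}

def search(query, mode):
    """Return ids of documents matching the query under the given default operation."""
    terms = query.lower().split()
    hits = set()
    for doc_id, text in docs.items():
        words = text.lower().split()
        if mode == "and":
            found = all(t in words for t in terms)
        elif mode == "or":
            found = any(t in words for t in terms)
        else:  # "phrase": the terms must appear together, in order
            found = f" {' '.join(terms)} " in f" {' '.join(words)} "
        if found:
            hits.add(doc_id)
    return hits

for mode in ("and", "or", "phrase"):
    print(mode, sorted(search("family tree", mode)))
```

The same query returns two, three, or just one document depending on the default, which is why it pays to check each engine's help page.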

Proximity searching refers to the ability to specify how close within a document found terms should be to each other. The most commonly used proximity search option is a "phrase search" that requires terms to be in the exact order specified. The default standard for identifying phrases is to use double quotes (" ") to surround the phrase.

Example: "To be or not to be". This expression could not be found in any other way, as all the words in it are very common and are not taken note of by search engines.

Other proximity operators can specify how close two terms should be to each other. Some also specify the order of the search terms. Each search engine defines them differently and uses different operator names such as NEAR, ADJ, W, or AFTER.
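Both kinds of proximity test can be sketched briefly. The function names and the idea of measuring the gap in words are assumptions made for the illustration; as noted above, each engine defines its own operators and distances.

```python
def contains_phrase(text, phrase):
    """True if the words of the phrase occur consecutively, in order, in the text."""
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def near(text, first, second, max_gap):
    """True if the two terms occur within max_gap word positions of each other."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

line = "to be or not to be that is the question"
print(contains_phrase(line, "not to be"))
print(near("the family planted a tree", "family", "tree", 3))
```

The `near` function here ignores word order; an operator such as AFTER would add a check on which term comes first.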

Truncation is a search technique that lets you search for a part of a word. A symbol such as the asterisk (*) is used to represent the rest of the term. End truncation is where several letters at the beginning of a word are given but the end can vary. In internal truncation, * can represent characters within a word.

Stemming, which is related to truncation, finds grammatical variations of a word such as its plural and singular forms, tenses, etc.

End truncation examples: photo* finds photograph, photography, photographer

Internal truncation examples: wha*ver may find whatever, whatsoever

Stemming: bake may find baked, baking, baker
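These examples can be tried out with Python's standard `fnmatch` module, whose `*` wildcard behaves like the truncation symbol. The word list is made up for the example, and `crude_stem` is a deliberately naive suffix-stripper written for this illustration; real stemmers are considerably more careful.

```python
from fnmatch import fnmatchcase

vocabulary = ["photograph", "photography", "photographer", "phone",
              "whatever", "whatsoever", "whenever"]

def truncation_search(pattern):
    """Return the words that match a pattern in which * stands for any characters."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]

def crude_stem(word):
    """Strip one common ending, keeping at least three letters (illustrative only)."""
    for suffix in ("ing", "ed", "er", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(truncation_search("photo*"))    # end truncation
print(truncation_search("wha*ver"))   # internal truncation
print({w: crude_stem(w) for w in ["bake", "baked", "baking", "baker"]})
```

All four forms of bake reduce to the same stem, so a search on any one of them could find the others.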

Case sensitive: Most search engines are not case sensitive about the keyword(s) you may enter and treat upper case, lower case, and mixed case all the same. However, certain search engines have the capability to match exact case. Entering a search term in lower case will usually find all cases. In a case-sensitive search engine, entering any upper case letter in a search term will invoke the exact case match.
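The rule in the last two sentences is simple enough to write down directly. The `term_matches` function is a hypothetical illustration of how a case-sensitive engine might behave.

```python
def term_matches(term, word):
    """Lower-case terms match any case; a term with a capital demands an exact match."""
    if any(ch.isupper() for ch in term):
        return term == word           # exact case match was invoked
    return term == word.lower()       # all-lower-case term finds all cases

print(term_matches("turkey", "Turkey"))
print(term_matches("Turkey", "turkey"))
```

So a search for turkey would find both the bird and the country, while Turkey would find only pages that capitalise the word.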

Fields searching allows you to designate where a specific search term should appear, i.e., instead of searching for words anywhere on a Web page, you specify parts of a document like the title, the URL, an image tag, or a hypertext link on a Web page. The fields are usually given in a drop-down menu to choose from.

Limits allow you to put restrictions on the search. Commonly available limits are the date limit and the language limit.

The latter would restrict the search results to only those Web pages identified as being in the specified language. (Though no Indian language is offered by the popular search engines, apart from English, that is.)

Stop words are the frequently occurring words like the, a, is, of, be, 1, html, or com that are not searched by most engines.
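Dropping stop words from a query before searching might look like this. The `STOP_WORDS` set holds only the examples listed above; the lists actually used by engines are longer and differ from one engine to another.

```python
# Only the stop words named in the text; real lists are longer.
STOP_WORDS = {"the", "a", "is", "of", "be", "1", "html", "com"}

def searchable_terms(query):
    """Return the query words that would actually be looked up in the index."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

print(searchable_terms("the history of the rose"))
```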

Sorting is the manner in which the results of a search are delivered. Most search engines sort by "relevance" determined by relevance ranking algorithms as described earlier. Other options are to arrange the results by date, alphabetically by title, or by URL or host name.

Using these options to narrow down your search can strike out a lot of irrelevant links that you may have otherwise followed only to be disappointed. You may use any one of them at one time or in combination. But for this you have to be clear about what you want in the results.

QUICK TIPS

Refine the keywords you choose before entering them. If you are looking for fertilisers for roses, don’t say flowers or roses, or fertilisers; say roses AND fertilisers. You may also exclude some links by saying roses AND fertilisers NOT fragrance NOT smell.

To use exclusion more effectively, you should visualise what kinds of irrelevant results you are likely to get for a particular query. Use the required terms to exclude those.

For site-specific information on how Boolean operators work, it is best to read the help files given on all search engines.

As far as possible, use exact phrases, specified within double quotes (" ").

The results that are displayed by search engines are accompanied by short descriptions of the page. Pay attention to them. Every search engine has a different way of arriving at the descriptions. Some pick up the first few lines on the page, others show the meta-tags specified by the author of the page, and still others show the instance of the keyword in bold. Study the way each search engine displays its results so that you know what to make of the results. A search engine is in the end nothing but yet another machine with software and data. Get down to its level, and you’ll communicate.

As a geek would say, garbage in, garbage out. Put in a smart search query and you’ll get an intelligent response; put in a simple (silly) query . . . well.

SEARCH ENGINES

www.altavista.com

www.dogpile.com

www.excite.com

www.alltheweb.com

www.go.com

www.google.com

www.hotbot.com

www.profusion.com

www.looksmart.com

www.lycos.com

http://magellan.excite.com

www.mamma.com

www.metacrawler.com

www.northernlight.com

www.search.com

www.snap.com

www.webcrawler.com

www.yahoo.com

www.about.com

www.askjeeves.com