Grooming The Qualities of Search Engine

Modern web search engines return the same results for a given query to every user. The two main problems with such search engines are, first, the time taken to generate results and, second, the production of irrelevant results. They do not take the user's interests into account, so a system and method for using a user profile to order placed content in search results is needed [1]. The user profile is based on predefined categories, but to employ this technique we need a large categorized database. Manual categorization is obviously not cost effective, so automatic categorization techniques are required. In this paper we propose a hybrid approach to automatic document categorization to improve search engine performance. The paper also focuses on the PSearch domain.

1.    Introduction: Search engines help end users, to some extent, to find the information they need quickly. Modern web search engines, however, produce the same information for a given query for every user; they do not consider the user's interests when answering a query. For example, two users issuing the same query get the same set of results even though they may expect contrasting results. If current web search engines continue to produce the same results for every user while the volume of data keeps growing, they will not satisfy users' individual needs. This is where personalization becomes important. Out of the need to take user interest into account when producing search results, PSearch has been developed. PSearch (Personalized Search) is a domain-specific search engine that aims to provide highly relevant search results by taking the user's interests into consideration. To employ this method we need categorized data, which brings us to text categorization. Text categorization (also known as text classification or topic spotting) is the task of automatically sorting a set of documents into categories from a predefined set. This task has several applications, including automated indexing of scientific articles according to predefined thesauri of technical terms, filing patents into patent directories, selective dissemination of information to information consumers, automated population of hierarchical catalogues of Web resources, spam filtering, identification of document genre, authorship attribution, survey coding, and even automated essay grading. Besides improving PSearch, automated text classification is attractive because it frees organizations from the need to organize document bases manually, which can be too expensive, or simply not feasible given the time constraints of the application or the number of documents involved.

2.    Framework: The two main problems with a search engine are:
1. More time to generate the results
2. Producing irrelevant results.
To solve both problems we can divide our large database into small pieces, in such a manner that a query fired by a user need not be searched over the whole database but is instead looked up in only some pieces of it, since throughput depends directly on the size of the database. The query performance of a standard database as the data size changes is shown in the graph in [12].
By this, both problems can be solved. Since the search space is reduced, results are generated faster; and since the query is now looked up in only a portion of the database, the probability of irrelevant results is reduced to a great extent. For this we use a combination of two approaches.
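As a minimal sketch of this idea in Python, the code below keeps one partition of the database per category, so that a query is scanned against only the partition of its predicted category rather than the whole collection; the add_document and search functions and the sample documents are hypothetical and only illustrate the reduced search space.

from collections import defaultdict

# One partition (piece of the database) per category.
partitions = defaultdict(list)          # category -> list of (doc_id, text)

def add_document(doc_id, text, category):
    # Documents are routed to a partition according to their category.
    partitions[category].append((doc_id, text))

def search(query, category):
    # Only the partition of the predicted category is scanned, instead of the
    # whole database, which shrinks the search space and speeds up retrieval.
    q = query.lower()
    return [doc_id for doc_id, text in partitions[category] if q in text.lower()]

add_document(1, "Local team wins the championship match", "sports")
add_document(2, "New processor architecture announced", "technology")
print(search("match", "sports"))        # -> [1]
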
2.1    Rule based approach: Building the knowledge base: The rule base is specified by means of a series of selected keywords and phrases entered by a user. The phrases are further selected in conjunction with a general information display that enables the user to select various phrases concerning a series of newspaper or media articles which are placed in the knowledge base of the system. By selecting topics such as article type, individual age level, focus of the article, and topics of the article, a user can develop a rule base that enables the inference engine, implemented by a computer, to search the knowledge base and select those articles associated with the particular phrases indicated and selected by the user. Each phrase or keyword detected causes tag words to be provided, which in turn supply categories for the processed article.

Although building a knowledge base of satisfactory quality with this approach is very time consuming, it offers one of the finest approaches to text categorization. Such knowledge-based categorization methods are more powerful and accurate than statistical techniques. However, the phrasal pre-processing and pattern-matching methods that work for categorization have the disadvantage of requiring a fair amount of knowledge encoding by human beings [3]. So whenever some input is given to this module, it checks the words of the input text against our knowledge base and assigns the input text to the category with which the greatest number of words match. In this technique we perform classification by developing a single pooled dictionary of words for a sample set of documents and then generating a decision-tree model, based on the pooled dictionary, for classifying new documents; in other words, we use our knowledge database. This technique yields an "α" measure, which defines how many words match out of the total number of extracted root words.
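A minimal sketch of the keyword-matching part of this module and its α measure is given below (it does not cover the decision-tree model); the category names, keywords, and the simplified stand-in for root-word extraction are illustrative assumptions, not the paper's actual rule base.

# Hypothetical rule base: hand-entered keywords/phrases per category.
RULES = {
    "sports": {"match", "tournament", "player", "score", "team"},
    "technology": {"software", "computer", "network", "algorithm", "data"},
}

def extract_root_words(text):
    # Simplified stand-in for real root-word extraction (stemming/lemmatization).
    return [w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?").isalpha()]

def rule_based_alpha(text):
    # Returns (best_category, alpha), where alpha is the fraction of extracted
    # root words that match the keywords of the best-matching category.
    words = extract_root_words(text)
    if not words:
        return None, 0.0
    best_cat, best_hits = None, 0
    for cat, keywords in RULES.items():
        hits = sum(1 for w in words if w in keywords)
        if hits > best_hits:
            best_cat, best_hits = cat, hits
    return best_cat, best_hits / len(words)

print(rule_based_alpha("The player scored in the final match of the tournament"))
# -> ('sports', 0.3)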

2.2  Statistical methods: Statistical methods for categorization require little or no human customization [3], but they do not offer the benefits of natural language processing, such as the ability to identify relationships and enforce linguistic constraints. This technique also relies on the quality of the corpus: a large and rich corpus yields more reliable probability estimates. In this method, when any input is given to the system, various statistics are computed over it. An info matrix with two columns is created, recording how many times each word appears in the text. From this we can identify the highly probable words, which are the deciding factors for categorization. This technique yields a "β" measure, which defines how many highly probable words are found out of the total number of extracted root words. By combining these parameters we define a threshold, which decides the document's relevant category.
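The sketch below illustrates one possible reading of this step: the per-category lists of highly probable words are placeholders (in practice they would be derived from a large corpus), and the β measure is computed over the same simplified root-word extraction used in the rule-based sketch. The combination of α and β against a threshold is sketched after the algorithm in Section 3.

from collections import Counter

# Hypothetical corpus-derived highly probable words per category; in practice
# these would be estimated from a large, rich corpus rather than listed by hand.
HIGH_PROBABLE = {
    "sports": {"match", "league", "goal", "coach", "player"},
    "technology": {"software", "server", "code", "network", "device"},
}

def extract_root_words(text):
    # Same simplified stand-in for root-word extraction as in the rule-based sketch.
    return [w.strip(".,;:!?").lower() for w in text.split() if w.strip(".,;:!?").isalpha()]

def statistical_beta(text, category):
    # beta = highly probable words found / total number of extracted root words.
    words = extract_root_words(text)
    if not words:
        return 0.0
    info_matrix = Counter(words)        # two-column info matrix: word -> occurrence count
    hits = sum(count for word, count in info_matrix.items()
               if word in HIGH_PROBABLE.get(category, set()))
    return hits / len(words)

print(statistical_beta("The player scored the winning goal for the league", "sports"))
# -> 0.333...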

2.3 Machine Learning: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of pre-classified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains [4].
So whenever a document is classified, we apply the two methods described above to it, so that the system becomes more efficient over time.
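The paper does not commit to a particular learning algorithm; as one possible instantiation of this inductive step, the sketch below trains a small hand-rolled Naive Bayes classifier from a handful of invented pre-classified documents and uses it to label a new one.

import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    # A tiny inductive learner: it builds a classifier from pre-classified documents.
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)   # category -> word frequencies
        self.doc_counts = Counter(labels)         # category -> number of documents
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)          # log prior of the category
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                # Add-one smoothing so unseen words do not rule out a category.
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesClassifier().fit(
    ["the team won the match", "new software update released", "the coach praised the player"],
    ["sports", "technology", "sports"],
)
print(clf.predict("the player missed the match"))          # -> "sports"
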
3.    Algorithm
Input: the sample text to be categorized.
1. The system first performs the rule-based approach and checks the words of the sample text against the predefined rules.
2. The system then applies the statistical method to the sample text, finds the words with a relatively high degree of recurrence, and matches them against the highly probable words of each category.
3. A threshold is calculated to refine the documents.
4. At step three, machine learning techniques are applied (a sketch of this pipeline is given below).
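Putting the steps together, the sketch below composes the hypothetical rule_based_alpha, statistical_beta, and NaiveBayesClassifier helpers from the earlier sketches, under one possible reading of step 4 in which machine learning is applied when the combined rule-based and statistical score falls below the threshold; the equal α/β weights and the 0.3 threshold are placeholder values chosen for illustration, not figures from the paper.

def categorize(text, classifier, w_alpha=0.5, w_beta=0.5, threshold=0.3):
    # Step 1: rule-based check against the predefined keyword rules.
    category, alpha = rule_based_alpha(text)
    # Step 2: statistical check against the highly probable words of that category.
    beta = statistical_beta(text, category) if category else 0.0
    # Step 3: combine the two measures and compare with the threshold.
    score = w_alpha * alpha + w_beta * beta
    if category is not None and score >= threshold:
        return category
    # Step 4: fall back to the machine-learned classifier for borderline documents.
    return classifier.predict(text)

print(categorize("the player scored in the match", clf))   # -> "sports"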

4.     Conclusion and Future Work: In this paper we have tried to come up with a hybrid approach for text categorization. Text categorization is treated as a preliminary requirement for improving search engine performance, and distributing the database is one of the techniques for improving the data retrieval process. In future work we would like to exploit other factors, such as user behaviour and search history, to improve the efficiency of search engines.
5.    References:
1.    Zamir O.E., Korn J.L., Fikes A.B., and Lawrence S.R. Personalization of placed content ordering in search results.
2.    Sebastiani F. Text categorization. In Third International Conference on Artificial Intelligence.
3.    Yang Y. and Pedersen J.P. (1997) Feature selection in statistical learning of text categorization. In The Fourteenth International Conference on Machine Learning, pages 412-420.
4.    Sebastiani F. Machine learning in automated text categorization; Mock K.J. (1996) Hybrid hill-climbing and knowledge-based techniques for intelligent news filtering. In Proceedings of the National Conference on Artificial Intelligence (AAAI'96).
5.    Murata M., Ma Q., Uchimoto K., Ozaku H., Isahara H., and Utiyama M. (2000) Information retrieval using location and category information. Journal of the Association for Natural Language Processing, 7(2).
6.    Radev D.R., Jing H., and Stys-Budzikowska M. (2000) Summarization of multiple documents: clustering, sentence extraction, and evaluation. In Proceedings of ANLP-NAACL Workshop on Automatic Summarization.
7.    Salton G., Wong A., and Yang C.S. (1975) A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613-620.
8.    Salton G., Fox E.A., and Wu H. (1983) Extended Boolean information retrieval. Communications of the ACM, 26(12), pp. 1022-1036.
9.    Salton G. and Buckley C. (1988) Term weighting approaches in automatic text retrieval. Information Processing and Management, 24:513-523.
10.    Yang Y. and Pedersen J.P. (1997) Feature selection in statistical learning of text categorization. In The Fourteenth International Conference on Machine Learning, pages 412-420.
11.    Yang Y., Slattery S., and Ghani R. (2002) A study of approaches to hypertext categorization. Journal of Intelligent Information Systems.
12.    Web resource: http://ossipedia.ipa.go.jp/en/capacity/EV0612250281EN/index.php