Information Extraction From Web Pages Based On Pattern Discovery

Information extraction from Web documents is a critical issue for Software agents on the Internet. Extracting data from the Web requires a wrapper that “wraps" around a Web site and extracts data from Web pages. Wrapper induction is one of such techniques that learn extraction rules by generalizing from training examples. In this project, we utilize a data structure called a PAT tree in which repeated patterns in a given input string can be efficiently identified. A PAT tree is an efficient data structure successfully used in the area of information retrieval for indexing a continuous data stream using this data structure to index an input string, all possible character strings, including their frequency counts and their positions in the original input string can be easily retrieved.

We consider the Web as the largest “knowledge base” ever developed and made available to the public. The problem of information extraction is to transform the contents of input documents into structured data, and the problem of information extraction from a Web page is to apply information extraction to Web pages. Usually Wrapper induction approach is used for the rapid generation of extractors for Web pages.  A training example “teaches” a wrapper induction program how to extract data from a given Web page by providing examples of the following information:  (a) how to partition a Web page (along string of HTML tags and text) into meaningful substrings, and (b) how to group these substrings into data records. Wrapper induction programs usually uses the procedure of clicking and highlighting on a particular web page referred to as “labeling”. Since generating a correct extractor may require many training examples, labeling can be difficult and labor intensive. As a result, wrapper induction only solves the problem partially. We need an approach that minimizes the need of labeling. Our approach eliminates the need of user-labeled training examples. The idea is based on the fact that data on a semi-structured Web page is presented in some particular format with a regular pattern. By discovering such pattern in target Web pages, an extractor can be generated. We design a pattern discovery algorithm that can be applied to any semi-structured web page without training examples. This greatly reduces the cost of extractor construction.

In this project, we present a novel technique that extracts information blocks without training examples using a data structure called a PAT tree. PAT trees allow the system to efficiently recognize repeated patterns in a semi-structured Web page. From these repeated patterns, information blocks can be easily located based on pattern length and repeated counts. Afterwards, those blocks through string alignment algorithm will get ordered result data.

2. PAT Tree Structure : A PAT tree is a Patricia tree (Practical Algorithm to Retrieve Information Coded in Alphanumeric) constructed over all the possible semi-infinite strings [2]. A Patricia tree is a particular implementation of a binary digital tree such that the abstract data type semi-infinite string is represented as a suffix string that ends with a special character not occurring anywhere in the input string. Like a suffix tree [5], the Patricia tree stores all its data at the external nodes and keeps one integer, the bit-index, in each internal nodes as an indication of which bit of a query is to be used for branching. This avoids empty sub trees and guarantees that every internal node will have non-null descendants. For a set of n semi-infinite strings to be indexed, there will be n external nodes in the PAT tree and n-1 internal nodes. This makes the tree O(n) in size. 

 

Figure 1 shows an example of a PAT tree over a sequence of bits. External nodes are indicated by squares that contain a reference to a semi-infinite string, and internal nodes are indicated by a circle and contain a displacement (or index) to the root. More specifically, the Patricia tree for a string S of size n uses an index at each internal node to indicate the bit used for that node’s branching. A “zero” bit will cause a branch to the left subtree, and a “one” bit will cause a branch to the right sub-tree . Conceptually, each edge between two nodes is labeled with a zero and the right edges begin with one. Hence, the concatenation of the edge-labels on the path from the root to a leaf node with reference i exactly spell out the sistring i of S. For example, string 3 is the concatenation of edge-lables,”0”,”110” and “1100…” between nodes 1,2,5, and leaf 3.

 

3. Related Work : On the World Wide Web a large numbers of information is available in the form of semi-structured format. Knowledge discovery in semi-structured document has been recognized as promising task. Since semi- structured document is typically hidden within HTML formatting intended for human viewing the details of which vary widely from site to site and frequent changes made to their formatting so we can’t construct a global schema. Most of the existing system uses hand-coded wrappers to extract information, which is time consuming. An intelligent and automated method is needed for their processing. Thus the task of extraction rule generation can be solved by pattern discovery without user-labeled training examples that are required for previous work in wrapper induction.

 

4. Conclusion and Future work : With the growing amount of online information, the availability of robust, flexible information extraction systems have become a stringent necessity. In this paper, we presented an unsupervised pattern discovery approach to semi- structured information extraction. This approach generates extraction rules for given Web pages. In this project, we present an approach to semi-structured information extraction based on the recognition of repeated patterns in the translated HTML input [1]. The underlying technique is the string manipulation based on the data structure called the PAT tree. The PAT tree has been applied in the field of information retrieval for indexing for a long time [4]. It has also been used in Chinese keyword extraction [6]. The essence of a PAT tree is a binary suffix tree, which has also been applied in bioinformatics for finding repeated substring in genomes [3] etc. In the application of information extraction, we are not only interested in repeats but also repeat that represent interesting information blocks.  The future work will follow two directions. The first one concerns the implementation of text level encoding to extract more delicate information. The second is to improve the filtering process of redundant patterns. Finally, extending our work to help creating Semantic web will be an interesting topic.

References

1. Chia-Hui Chang and Chun-Nan Hsu. “Automatic Extraction of Information Blocks Using PAT Trees,” In Proceedings 1999 National Computer Symposium (NCS1999), Tamkang University, Tamsui, Taiwan.

2. G.H. Gonnet, R.A. Baeza-yates, and T. Snider, “New Indices for Text: Pat Trees and Pat Arrays," Information Retrieval: Data Structures and Algorithms, Prentice Hall, Chapter 5, pp.66-82. 1992.

3. S.Kurtz, C. Schleiermacher, “REPuter: fast computation of maximal repeats in complete genomes,” Bioinformatics, Vol. 15, No.5,pp.426-427.1999.

4. G.H. Gonnet, R.A. Baeza-yates, and T. Snider, “New Indices for Text: Pat Trees and Pat Arrays," Information Retrieval: Data Structures and Algorithms, Prentice Hall, Chapter 5, pp.66-82. 1992.

5. D.Gusfield “Algorithms on strings, trees, and sequences, ” Cambridge. 1997.

6. L.F. Chien, “PAT-tree-based keyword extraction for Chinese information retrieval," In Proceedings of the 20th annual international ACMSIGIR conference on Research and development in information retrieval. Pp.50-58. 1997.