Towards Knowledge Acquisition from Semi-Structured Content
Received: August 02, 2007  Revised: October 16, 2008
Xi Bai, Jigui Sun, Haiyan Che, Lian Shi. Towards Knowledge Acquisition from Semi-Structured Content. International Journal of Software and Informatics, 2008, 2(2): 233-248
Xi Bai  Jigui Sun  Haiyan Che  Lian Shi
Fund: This work is supported by the NNSFC under Grant No. 60496321 and by the European Commission (EASTWEB: Building an Integrated Leading Euro-Asian Higher Education and Research Community in the Field of the Semantic WEB) under Grant No. 111084.
Abstract: A rich family of generic Information Extraction (IE) techniques has been developed in recent years. This paper proposes WebKER, a system that automatically extracts knowledge from semi-structured content on Web pages using wrappers and domain ontologies. During extraction, wrappers are learned through suffix arrays. Domain ontologies then automatically align the raw data extracted by the wrappers, and knowledge is generated by describing the data with Resource Description Framework (RDF) statements. After the merging process, the newly generated knowledge is added to the Knowledge Base (KB), where users can query it regardless of where the resources originated. A prototype of WebKER has been implemented. This paper also gives a performance evaluation of the system and a comparison between querying information in the KB and querying information in a traditional database, indicating the superiority of our system. In addition, evaluations of the outstanding wrapper and of the knowledge-merging method are presented.
Keywords: information extraction  knowledge base  domain ontologies  pattern discovery  suffix array  knowledge merging
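
The abstract does not spell out how wrappers are learned from suffix arrays, so the following is only a minimal sketch of the general idea of suffix-array pattern discovery, not WebKER's actual algorithm. It assumes a page has already been reduced to a sequence of tag tokens, with all text content replaced by a placeholder. The function names, the token sequence, and the `TEXT` placeholder are hypothetical and chosen for illustration.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.

    Naive construction by direct comparison of suffixes; adequate for a
    short illustration, not for large pages.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def longest_repeated_pattern(tokens):
    """Find the longest token run that occurs at least twice.

    Suffixes sharing a long prefix sit next to each other in the suffix
    array, so it is enough to compare adjacent entries.
    """
    sa = build_suffix_array(tokens)
    best = []
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two adjacent suffixes.
        n = 0
        while (a + n < len(tokens) and b + n < len(tokens)
               and tokens[a + n] == tokens[b + n]):
            n += 1
        if n > len(best):
            best = tokens[a:a + n]
    return best


# Hypothetical tokenized page: two list items with the same markup.
page = ["<ul>",
        "<li>", "<b>", "TEXT", "</b>", "</li>",
        "<li>", "<b>", "TEXT", "</b>", "</li>",
        "</ul>"]

pattern = longest_repeated_pattern(page)
print(pattern)
```

A repeated tag pattern of this kind is a candidate template for a record on the page. A wrapper could then, under the same assumptions, extract the content found at the `TEXT` positions of each occurrence.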
© Copyright by Institute of Software, the Chinese Academy of Sciences