First-Order Logic Rule Induction for Information Extraction in Web Resources

José Ignacio Fernández-Villamor, Carlos A. Iglesias & Mercedes Garijo. (2012). First-Order Logic Rule Induction for Information Extraction in Web Resources. International Journal of Artificial Intelligence Tools, 21(6), 1250032-1 to 1250032-2.

Abstract:
Information extraction from web pages, commonly known as screen scraping, is usually performed through wrapper induction, a technique based on the internal structure of HTML documents. The main limitation of such techniques is that a generated wrapper is only useful for the web page it was designed for. To overcome this, we have designed a system that generates first-order logic rules that can be used to extract data from web pages. These rules are based on visual features such as font size, element positioning, or type of content. Thus, they do not depend on a document's structure and can be applied to different sites. The system has been evaluated on a set of web pages, which has served to identify several design patterns used across the Web.
JCR 2012: impact factor 0.25, Q4
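
As an illustration of the idea in the abstract, the sketch below applies one hand-written rule over visual features to two pages with different layouts. It is not the authors' system or rule language: the `Element` fields, the `is_headline` rule, and the sample pages are all hypothetical, and the rule is written by hand rather than induced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Element:
    """Hypothetical visual description of a rendered page element."""
    text: str
    font_size: int      # rendered font size in pixels
    top: int            # vertical position in pixels
    content_type: str   # e.g. "text" or "image"


def is_headline(x: Element, page: List[Element]) -> bool:
    """Hand-written rule, roughly:
    headline(X) :- content_type(X, text),
                   not (content_type(Y, text), font_size(Y) > font_size(X)).
    It only looks at visual features, never at tag names or DOM paths.
    """
    if x.content_type != "text":
        return False
    return all(
        y.font_size <= x.font_size
        for y in page
        if y.content_type == "text"
    )


def extract_headlines(page: List[Element]) -> List[str]:
    """Return the text of every element on the page that satisfies the rule."""
    return [e.text for e in page if is_headline(e, page)]


# Two made-up pages with different layouts; the same rule is applied to both.
page_a = [
    Element("Site menu", 12, 0, "text"),
    Element("Breaking story", 28, 40, "text"),
    Element("photo.jpg", 0, 80, "image"),
]
page_b = [
    Element("Another title", 32, 10, "text"),
    Element("Body copy", 14, 60, "text"),
]

headlines_a = extract_headlines(page_a)
headlines_b = extract_headlines(page_b)
```

Because the rule refers only to font size and content type, it does not need to be rewritten when the underlying HTML structure differs between sites, which is the property the abstract contrasts with wrapper induction.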