Tomaž Kovačič (2012) Evaluating Web Content Extraction Algorithms. EngD thesis.
Abstract
Today's web has proved to be a vast and valuable source of information. A large portion of written content is published as HTML documents on news sites, blogs, forums and other web sites. Automated extraction of such content from an arbitrary HTML structure has proved to be a non-trivial problem, since both the underlying technology and the authors who deliver the content to the end user tend to wrap the main content in boilerplate elements such as navigation, headers, footers, advertisements, site maps and other noisy content. In this diploma thesis we focus on the review and evaluation of algorithms capable of extracting the main textual content from an arbitrary web page. Throughout this work we refer to this category of algorithms as text extractors (TE). The main goal is to provide an overview of the sparse literature on content extraction and to evaluate the performance of text extraction methods against gold standard data. First we provide a detailed survey of methods, which we divide into four categories: wrapper-based, template detection, web page segmentation and text extraction (TE) approaches to content extraction. For the evaluation part of our task we focus only on the last category. We continue by exploring the means for creating an evaluation environment, which consists of metrics and data. We use two datasets, one representing a cross-domain collection of documents and the other containing mostly news-type web pages, and evaluate various text extractors using precision, recall and F1-score. We conclude by presenting the final evaluation results of 17 algorithms on the two datasets. Our analysis of the results includes raw numbers, discussion and various visualizations that help us interpret the results.
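As an illustration of the metrics named above, the following is a minimal sketch of how precision, recall and F1-score could be computed for a single document by comparing an extractor's output with the gold standard text. The abstract does not specify how the texts are compared, so the whitespace tokenization, the lowercasing and the bag-of-words overlap used here are assumptions, and the function name `prf1` is hypothetical.

```python
from collections import Counter


def prf1(extracted: str, gold: str) -> tuple[float, float, float]:
    """Token-level precision, recall and F1 for one document.

    Assumption: both texts are lowercased, split on whitespace and
    compared as bags of words (multisets). This is an illustrative
    choice, not necessarily the comparison used in the thesis.
    """
    extracted_tokens = Counter(extracted.lower().split())
    gold_tokens = Counter(gold.lower().split())

    # Multiset intersection: tokens present in both texts,
    # counted with their minimum multiplicity.
    overlap = sum((extracted_tokens & gold_tokens).values())

    n_extracted = sum(extracted_tokens.values())
    n_gold = sum(gold_tokens.values())

    precision = overlap / n_extracted if n_extracted else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy example: the extractor kept one boilerplate token ("menu")
    # and missed two tokens of the main content.
    print(prf1("main article text menu", "main article text here now"))
```

Under these assumptions, an extractor that keeps boilerplate lowers its precision, while one that drops part of the main content lowers its recall; F1 is the harmonic mean of the two.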