Slavko Žitnik (2014) Iterative semantic information extraction from unstructured text sources. PhD thesis.
Abstract
Nowadays we generate an enormous amount of data, most of it unstructured. Internet users post more than 200,000 text documents and write more than 200 million e-mails every single minute. We would like to access this data in a structured form, which is why this dissertation deals with information extraction from text sources. Information extraction is a type of information retrieval in which the main tasks are named entity recognition, relationship extraction, and coreference resolution. The dissertation consists of four main chapters: each of the first three covers a separate information extraction task, and the last introduces a combination of all three tasks into an iterative method within an end-to-end information extraction system. First, we introduce the task of coreference resolution, whose goal is to merge all mentions that refer to a specific entity. We propose the SkipCor system, which casts the task as a sequence tagging problem for which first-order probabilistic models can be used. To enable the detection of distant coreferent mentions, we propose an innovative transformation into skip-mention sequences and achieve results comparable to or better than other known approaches. We also use a similar transformation for relationship extraction, where different tags and rules enable the extraction of hierarchical relationships. The proposed solution achieves the best result at a challenge on extracting relationships between genes that form a gene regulation network. Lastly, we present the oldest and most thoroughly researched task, named entity recognition. The task deals with tagging one or more words that represent a specific entity type, for example persons. In the dissertation we adapt standard procedures for sequence tagging tasks and achieve seventh rank at the chemical compound and drug name recognition challenge.
We successfully solve all three problems using linear-chain conditional random field models. We combine the tasks in an iterative method that accepts unstructured text as input and returns the extracted entities along with the relationships between them. The output is represented according to a system ontology, which provides better data interoperability. Information extraction for the Slovene language is not yet well researched, which is why we also include a list of translations of selected terms from English to Slovene.
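The skip-mention transformation mentioned in the abstract can be illustrated with a small sketch. The abstract does not define the transformation in detail, so the following is only one plausible reading: for a skip distance s, the ordered list of mentions is split into s + 1 sequences, each keeping every (s + 1)-th mention, so that distant mentions become neighbours and a first-order sequence model can tag them. The function name `skip_mention_sequences`, the example mentions, and the rule of dropping sequences with fewer than two mentions are assumptions made for illustration, not taken from the thesis.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def skip_mention_sequences(mentions: Sequence[T], skip: int) -> List[List[T]]:
    """Split an ordered mention list into sequences that keep every
    (skip + 1)-th mention, one sequence per starting offset.

    Hypothetical helper: an illustrative reading of the transformation,
    not the implementation used in SkipCor.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    sequences = [list(mentions[offset::step]) for offset in range(step)]
    # Assumption: a sequence with a single mention carries no pairwise
    # information for a first-order model, so it is dropped.
    return [seq for seq in sequences if len(seq) >= 2]


if __name__ == "__main__":
    # Made-up mentions in document order.
    mentions = ["John", "He", "Mary", "his", "she"]
    for skip in range(3):
        print(skip, skip_mention_sequences(mentions, skip))
```

With skip 0 the function returns the original sequence; larger skips place mentions such as "John" and "his" next to each other, which is what lets a linear-chain model relate mentions that are far apart in the text.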