Iterative semantic information extraction from unstructured text sources

Slavko Žitnik (2014) Iterative semantic information extraction from unstructured text sources. PhD thesis.

Abstract

We now generate enormous amounts of data, most of it unstructured. Internet users post more than 200,000 text documents and write more than 200 million e-mails every minute. We would like to access this data in a structured form, which is why this dissertation deals with information extraction from text sources. Information extraction is a type of information retrieval whose main tasks are named entity recognition, relationship extraction, and coreference resolution. The dissertation consists of four main chapters: three are each devoted to one information extraction task, and the last combines all three tasks into an iterative method within an end-to-end information extraction system. First, we introduce the task of coreference resolution, whose goal is to merge all mentions that refer to the same entity. We propose the SkipCor system, which casts the task as a sequence tagging problem to which first-order probabilistic models can be applied. To enable the detection of distant coreferent mentions, we propose an innovative transformation into skip-mention sequences and achieve results comparable to or better than other known approaches. We use a similar transformation for relationship extraction, with different tags and rules that enable the extraction of hierarchical relationships. The proposed solution achieved the best result at a relationship extraction challenge on genes that form a gene regulation network. Lastly, we present named entity recognition, the oldest and most thoroughly researched of the three tasks. It deals with tagging one or more words that represent an entity of a specific type, for example a person. In the dissertation we adapt standard procedures for sequence tagging tasks and achieve seventh rank at the chemical compound and drug name recognition challenge.
We solve all three problems using linear-chain conditional random field models. We combine the tasks in an iterative method that accepts unstructured text as input and returns the extracted entities along with the relationships between them. The output is represented according to a system ontology, which provides better data interoperability. Information extraction for the Slovene language is not yet well researched, which is why we also include a list of translations of selected terms from English to Slovene.
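
The skip-mention transformation mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the idea: for a skip distance s, the ordered mentions of a document are split into s + 1 interleaved sequences, each containing every (s + 1)-th mention, so that mentions lying far apart in the text become neighbours that a first-order sequence model can link. The function name `skip_mention_sequences` and the example mentions are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def skip_mention_sequences(mentions: Sequence[T], skip: int) -> List[List[T]]:
    """Split ordered mentions into interleaved sequences for a given skip distance.

    Assumption (not confirmed by the abstract): a skip distance of `skip`
    yields up to `skip + 1` sequences, each taking every (skip + 1)-th mention.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    step = skip + 1
    # One sequence per starting offset; never more sequences than mentions.
    return [list(mentions[offset::step]) for offset in range(min(step, len(mentions)))]


# Hypothetical mentions in document order.
mentions = ["John", "he", "Mary", "him", "she"]

for skip in range(3):
    print(skip, skip_mention_sequences(mentions, skip))
```

With skip distance 0 the original mention order is kept as a single sequence. With skip distance 2, "John" and "him" become adjacent, so a linear-chain model that only looks at neighbouring items could in principle tag them as coreferent.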

Item Type:

Thesis (PhD thesis)

Keywords:

information extraction, coreference resolution, relationship extraction, named entity recognition

Number of Pages:

134

Language of Content:

Slovenian

Mentor / Comentors:

Name and Surname	ID	Function
prof. dr. Marko Bajec	245	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1536169155 )

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

2889

Date Deposited:

11 Dec 2014 11:47

Last Modified:

23 Jan 2015 09:17

URI:

http://eprints.fri.uni-lj.si/id/eprint/2889
