Maja Ramšak (2011) Automatic metadata generation for learning objects. MSc thesis.
One consequence of the modern era is the massive production and use of diverse electronic resources. The number of digital collections, digital libraries, and repositories that offer these resources to users, usually through search mechanisms, is increasing. This is especially evident in scientific research and education. These services manage electronic resources by means of metadata and metadata records. Many authors describe metadata as data about data or information about information, although a better definition has existed for some time: metadata are structured information that describes, explains, locates, or in any other way makes an information source easier to retrieve, use, and manage. Despite widespread awareness of the importance of metadata, their use falls short of the potential they offer. Many resources have metadata of poor quality or none at all. This is because, with the exceptional growth in the number of electronic resources and with changes in hardware, software, and services, managing and creating metadata has become complicated. Metadata creation, or generation, can in general follow one of four approaches: manual generation, automated generation, a combination of manual and automated generation, and conversion from existing metadata. Automated generation can be further divided into two fields: metadata extraction and metadata harvesting. Several tools exist for this purpose. Of particular interest are tools for keyword extraction, since keywords are a subset of metadata elements that gives a comprehensive description of the content of an electronic resource. In the literature, the efficiency of these tools is measured with metrics from information retrieval: precision, recall, and F-measure. This Master's thesis considers metadata as a whole, with most attention devoted to tools for their automated generation.
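The three evaluation metrics named above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name, the case-insensitive exact matching, and the sample keywords are assumptions made purely for illustration.

```python
def precision_recall_f1(extracted, reference):
    """Score extracted keywords against reference (author-assigned) keywords.

    Assumption: two keywords match only if they are identical after
    trimming whitespace and lowercasing.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    reference_set = {k.strip().lower() for k in reference}
    matched = extracted_set & reference_set

    # Precision: share of extracted keywords that are correct.
    precision = len(matched) / len(extracted_set) if extracted_set else 0.0
    # Recall: share of reference keywords that were found.
    recall = len(matched) / len(reference_set) if reference_set else 0.0
    # F-measure (balanced F1): harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data, not from the thesis.
p, r, f = precision_recall_f1(
    ["metadata", "learning objects", "search", "repository"],
    ["metadata", "learning objects", "keyword extraction"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this made-up example two of the four extracted keywords are correct and two of the three reference keywords are found, so precision is 1/2, recall is 2/3, and the F-measure is 4/7.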
Keyword extraction tools usually combine the following approaches and techniques: stemming, phrase boundaries, stop words and stop phrases, evolutionary algorithms, machine learning, and natural language processing. What stemming and the evaluation of extraction efficiency have in common are matching criteria. In the first experiment of the thesis, the efficiency of different keyword extraction tools (Kea, Yahoo! Term Extractor, SAmgI, and TextRank) is examined on two real sets of resources in Slovene (educational resources and conference contributions), using different matching criteria (exact matching, n-cut, soundex, metaphone, and similar text). Different conversions (Apache Tika, pdftotext, copy & paste, and manual conversion) were used to bring the original file into a form that these tools accept. We have shown that conversions influence extraction, but do not always improve the results. The matching criteria gave comparable results, and Kea was significantly better than the other tools. We also observed that extraction from educational resources was worse than extraction from conference contributions, and that extraction from Slovene texts was worse than extraction from English texts. The second experiment addresses multilingual searching of educational resources. Three machine translators (Google Translate, Microsoft Bing, and Amebis Presis) were applied to existing authors' keywords and to resource contents, which then served as input for the keyword extraction tools mentioned above. In addition, the opposite approach was introduced, in which machine translation was applied after keyword extraction. Significantly the best results were obtained by translating the given keywords; when authors' keywords did not exist, a combination of the two approaches was best on average. The best machine translator was Google Translate and the best keyword extraction tool was Kea.
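To give a feel for how the matching criteria of the first experiment differ, here is a hedged Python sketch of three of them. The abstract does not define the criteria, so the details are assumptions: n-cut is read here as comparing only the first n characters of each word, and the similar-text criterion is approximated with `difflib.SequenceMatcher` from the Python standard library, which is not necessarily the measure used in the thesis. The function names, the default n, and the threshold are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(a, b):
    """Exact matching: keywords must be identical, ignoring case."""
    return a.lower() == b.lower()


def n_cut_match(a, b, n=5):
    """Assumed reading of n-cut: compare word by word on the first n characters."""
    words_a, words_b = a.lower().split(), b.lower().split()
    return len(words_a) == len(words_b) and all(
        x[:n] == y[:n] for x, y in zip(words_a, words_b)
    )


def similar_match(a, b, threshold=0.8):
    """Stand-in for a similar-text criterion: similarity ratio above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Two inflected forms of the same Slovene word ("metadata").
pair = ("metapodatki", "metapodatkov")
print(exact_match(*pair), n_cut_match(*pair), similar_match(*pair))
```

Under these assumptions the two inflected forms fail exact matching but are accepted by the n-cut and similarity criteria, which is why looser criteria matter for a highly inflected language such as Slovene.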
The most important conclusions from the experiments are: the most efficient keyword extraction tool is Kea; such a tool needs to be developed for Slovene to achieve efficiency comparable to that on English texts; and, even though the keyword extraction tools are specialized for English texts, the most efficient way to search for resources in foreign languages is to translate the keywords into the search language where keywords exist, and to use a combination of the two approaches where they do not.