ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Intelligent assistant for extracting semi-structured web data

Nik Adžič (2016) Intelligent assistant for extracting semi-structured web data. MSc thesis.


    In the literature we reviewed, we did not identify any existing approach that converts data from semi-structured sources (websites) or unstructured web sources into RDF and then integrates it into the Linked Data cloud. Our objective was therefore to develop an intelligent assistant for extracting semi-structured web data. The assistant should identify and select part of the web data automatically, let a business user without technical skills select the remaining data, and automatically prepare a wrapper for extracting the selected data. We implemented a prototype that automatically identifies the main search form on a website, uses specific algorithms to identify repeated results, and identifies the data inside these results as well as their detail data. It also lets the user select additional data and automatically proposes names for the selected data. The assistant can export the extracted data to RDF. It can also extract data from highly dynamic websites (websites with large amounts of JavaScript and AJAX code), where similar approaches have many problems. We evaluated the assistant by attempting to extract web data from many different kinds of websites, including highly dynamic websites, static websites, and websites protected against extraction. We found that our approach has advantages over others in extracting web data from highly dynamic websites, and that it allows explicit conversion of web data to the fourth or fifth level of the five-star Linked Data ranking, whereas other approaches in most cases reach only the third level. In addition, it automatically identifies repeated results on a website with a specific algorithm, a feature that most other approaches do not offer.
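
To make the idea of automatically identifying repeated results more concrete, the following is a minimal sketch in Python. It is not the algorithm from the thesis, which the abstract does not describe. It assumes a simple heuristic: a result list is the largest group of sibling elements that share the same tag, the same class attribute, and the same sequence of child tags. All names here (`Node`, `TreeBuilder`, `find_repeated_records`, `SAMPLE_HTML`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a very small DOM-like tree."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []

    def signature(self):
        # Assumed similarity measure: tag, class, and tags of direct children.
        child_tags = tuple(child.tag for child in self.children)
        return (self.tag, self.attrs.get("class", ""), child_tags)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_repeated_records(html, min_repeats=3):
    """Return the largest group of structurally identical siblings."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = []
    for parent in iter_nodes(builder.root):
        counts = Counter(child.signature() for child in parent.children)
        if not counts:
            continue
        sig, n = counts.most_common(1)[0]
        # Skip groups that are too small or have no inner structure.
        if n >= min_repeats and sig[2] and n > len(best):
            best = [c for c in parent.children if c.signature() == sig]
    return best


SAMPLE_HTML = """
<html><body>
<ul class="nav"><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div id="results">
  <div class="hit"><h3>A</h3><span>1</span></div>
  <div class="hit"><h3>B</h3><span>2</span></div>
  <div class="hit"><h3>C</h3><span>3</span></div>
  <div class="ad"><img src="x.png"></div>
</div>
</body></html>
"""

records = find_repeated_records(SAMPLE_HTML)
print(len(records), [r.attrs.get("class") for r in records])
```

A heuristic this simple only works on static markup. Highly dynamic pages of the kind the abstract mentions would first have to be rendered, for example in a browser, before their structure could be analysed in this way.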

    Item Type: Thesis (MSc thesis)
    Keywords: intelligent assistant, semi-structured web data, RDF, Linked Data cloud, extracting data, websites, web sources
    Number of Pages: 114
    Language of Content: Slovenian
    Mentor / Comentors:
    Name and Surname         ID     Function
    doc. dr. Dejan Lavbič    302    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1537156547)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3571
    Date Deposited: 13 Sep 2016 14:37
    Last Modified: 30 Sep 2016 09:02
    URI: http://eprints.fri.uni-lj.si/id/eprint/3571
