ePrints.FRI - University of Ljubljana, Faculty of Computer and Information Science

Efficient Natural Language Processing with Python

Matej Martinc (2015) Efficient Natural Language Processing with Python. EngD thesis.


    The thesis compares tools and libraries for natural language processing in the Python programming language. In addition to NLTK, the most popular library for natural language processing, we thoroughly examined less well-known libraries such as spaCy, PyNLPl, Pattern and TextBlob, and compared them on several criteria and practical tasks: tokenization, lemmatization, stemming, part-of-speech tagging, dependency tree building, pattern searching in text, word frequency counting and n-gram building. The libraries were compared by functionality, and their methods and tools for natural language processing were analysed and described in detail. Special attention was paid to the libraries' ability to access different text corpora and to process the Slovenian language. The most common natural language processing methods, such as tokenization, lemmatization and part-of-speech tagging, were compared by speed and accuracy. We then focused on machine-learning-based sentiment analysis of Slovenian tweets and internet comments, testing classifiers from different libraries and comparing them by speed and accuracy. We tried to improve classifier accuracy with several methods, such as part-of-speech filtering, TF-IDF-based filtering, inclusion of n-grams, and removal of URLs and character repetitions in words. The machine learning approach to sentiment analysis was compared with the lexical approach. We concluded that the PyNLPl library lacks basic methods for natural language processing. The NLTK library is slow but well suited for learning because of its extensive documentation and large set of alternative methods. It is also very flexible, which makes it the only library that partially supports processing of the Slovenian language. The spaCy and TextBlob libraries are very fast and accurate, but they lack flexibility and functionality. The Pattern library turned out to be average in terms of speed and accuracy, but it is quite flexible, offers many features and is easy to use.
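    Several of the preprocessing steps named in the abstract (removal of URLs, collapsing of character repetitions, tokenization, n-gram building and word frequency counting) can be illustrated with a minimal sketch that uses only the Python standard library. This is not the code from the thesis: the function names, the regular expressions and the choice to shorten runs of three or more identical characters to two are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Assumed patterns, chosen for illustration only (not taken from the thesis).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")   # three or more of the same character
TOKEN_RE = re.compile(r"\w+")          # Unicode-aware, so it also matches č, š, ž


def normalize(text):
    """Strip URLs, shorten character repetitions to two, and lowercase."""
    text = URL_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return text.lower()


def tokenize(text):
    """Split normalized text into word tokens."""
    return TOKEN_RE.findall(normalize(text))


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_frequencies(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


if __name__ == "__main__":
    tweet = "Toooo dobro!!! http://example.com/x super dobro"
    tokens = tokenize(tweet)
    print(tokens)
    print(ngrams(tokens, 2))
    print(word_frequencies(tokens).most_common(1))
```

    In a sentiment analysis pipeline, tokens and n-grams produced this way would typically serve as features for a classifier; the libraries compared in the thesis provide their own tokenizers, which may behave differently from this regular-expression approach.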

    Item Type: Thesis (EngD thesis)
    Keywords: natural language processing, Python, lemmatization, tokenization, part of speech tagging, sentiment analysis
    Number of Pages: 90
    Language of Content: Slovenian
    Mentor / Comentors:
        Name and Surname          ID     Function
        doc. dr. Matjaž Kukar     267    Mentor
    Link to COBISS: http://www.cobiss.si/scripts/cobiss?command=search&base=51012&select=(ID=1536500931)
    Institution: University of Ljubljana
    Department: Faculty of Computer and Information Science
    Item ID: 3106
    Date Deposited: 14 Sep 2015 19:21
    Last Modified: 23 Sep 2015 13:51
    URI: http://eprints.fri.uni-lj.si/id/eprint/3106
