STATISTIC ANALYSIS OF SLOVENIAN TEXTS

Barbara Suhadolc (2013) STATISTIC ANALYSIS OF SLOVENIAN TEXTS. EngD thesis.

Abstract

The Slovenian language has been analysed many times, but most of this research focuses on grammar, such as the use of gender and declension. Statistical comparisons of letters, letter combinations and other frequency results are lacking. A wider analysis is needed to give a better overview of this information. It should not stop at single-letter frequency, but also cover details such as the position of letters, their combinations, etc. Text is an important part of everyday life, used for describing events and recording speech and thought. It can be seen everywhere, from commercial advertisements to contracts. It is a large part of culture and of the free time people spend reading books, browsing the internet, sending messages and so on. Because of this, the physical structure of text is rarely discussed; attention goes instead to the meaning of what is expressed. Still, all text is pieced together from groups of letters, their combinations, and combinations of those combinations. For these reasons, the research centred on the Slovenian language, and not on the meaning but on the physical make-up of the words that form a text. To this end, a number of graphs and tables were used to make the results easier to display. These were produced by many short programs. The main line of work was chosen in advance, following the large number of existing analyses of other languages. Still, when the analysis reached an interesting point, it examined that point in greater depth. Every graph or table of results was the output of a purpose-written program, which made it easier to check the correctness of the results. For each result, a decision had to be made whether to show it in comparison with others, display it on its own, or make it part of a visually appealing graph. The text was lemmatised and analysed in that form. The end of the thesis outlines ideas for how the results could be used. The corpora and the programs were uploaded to the internet for easier access.

Item Type:

Thesis (EngD thesis)

Keywords:

Statistics, analysis, text, Slovenian, Slovene language, letter, word, n-gram, frequency, corpus, literature, poetry, blog, article, Wikipedia.

Number of Pages:

Language of Content:

Slovenian

Mentor / Co-mentors:

Name and Surname	ID	Function
doc. dr. Dejan Lavbič	302	Mentor

Link to COBISS:

http://www.cobiss.si/scripts/cobiss?command=search&base=50070&select=(ID=10137940)

Institution:

University of Ljubljana

Department:

Faculty of Computer and Information Science

Item ID:

2157

Date Deposited:

14 Sep 2013 15:54

Last Modified:

25 Sep 2013 14:01

URI:

http://eprints.fri.uni-lj.si/id/eprint/2157
