Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12259/57509
Type of publication: Conference theses in non-peer-reviewed publications (T2)
Field of Science: Computer science (N009)
Author(s): Mandravickaitė, Justina;Krilavičius, Tomas
Title: Statistical analysis of word frequency distribution in texts of different genres: comparison of Lithuanian and English
Is part of: Data analysis methods for software systems – DAMSS: 9th International Workshop, Druskininkai, Lithuania, November 30-December 2, 2017 / editor Jolita Bernatavičienė. Vilnius : Vilnius University Institute of Data Science and Digital Technologies, 2017
Extent: p. 31-31
Date: 2017
Keywords: Texts written in different genres;Corpus;Lithuanian Language
ISBN: 9789986680642
Abstract: We report on an ongoing study of the statistical characteristics of texts written in different genres. It has been suggested that genres resonate with people because they provide familiarity and a shorthand for communication. Genres also tend to shift hand in hand with public opinion and reflect the widespread culture of a given period. From an NLP perspective, genres are useful in text classification and categorization, natural language generation, etc. At this stage, we present a statistical analysis of Lithuanian and English texts of different genres. For our explorations, we use the Corpus of the Contemporary Lithuanian Language (for the Lithuanian part) and the Freiburg-LOB Corpus of British English (F-LOB) (for the English part). The main points of interest are the number of words, the number of different words, and word frequencies. Structural type distribution and Zipf's law were applied to describe the frequency distribution of words in different textual genres. Zipf's law is one of the universal laws proposed to describe statistical regularities in language; thus, word frequencies and their derivative indicators could be used to characterize textual genres. Applying the word rank-frequency distribution, the type-token ratio, the percentage of hapax legomena (i.e., words that occur only once), and entropy to different genre groups (fiction, scientific articles, documents, news articles) supported this assumption. Differences between the two languages (Lithuanian and English) were observed as well. As genres are rather complex phenomena that depend on various linguistic, cultural, societal, and other factors, our future work includes studying additional frequency structure indicators as well as their combinations.
Internet: https://hdl.handle.net/20.500.12259/57509
Affiliation(s): Baltijos pažangių technologijų institutas, Vilnius
Baltijos pažangiųjų technologijų institutas
Informatikos fakultetas
Taikomosios informatikos katedra
Vilniaus universitetas
Vytauto Didžiojo universitetas
Appears in Collections: Universiteto mokslo publikacijos / University Research Publications
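
Illustrative note: the indicators named in the abstract (type-token ratio, percentage of hapax legomena, entropy, and the rank-frequency distribution behind Zipf's law) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation. The function names, the regex tokenizer, the choice to express hapax legomena as a percentage of word types (rather than tokens), the use of Shannon entropy in bits, and the least-squares fit on log-log axes are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, then keep runs of Unicode letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_indicators(tokens):
    """Return basic frequency-structure indicators for a non-empty token list."""
    counts = Counter(tokens)
    n_tokens = len(tokens)   # number of words
    n_types = len(counts)    # number of different words
    hapaxes = sum(1 for c in counts.values() if c == 1)
    # Shannon entropy (bits) of the empirical word distribution.
    entropy = -sum(
        (c / n_tokens) * math.log2(c / n_tokens) for c in counts.values()
    )
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens,
        # Assumption: percentage relative to the number of types.
        "hapax_percentage": 100.0 * hapaxes / n_types,
        "entropy_bits": entropy,
    }


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what the classic form of Zipf's law predicts.
    Requires at least two distinct word types.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two word types to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    sample = tokenize("The cat saw the dog and the bird")
    print(frequency_indicators(sample))
    print(zipf_slope(sample))
```

Comparing genres would amount to running these functions on each genre subcorpus separately and contrasting the resulting values. Because the type-token ratio depends on text length, samples of comparable size would be needed for a fair comparison.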
