Type of publication: conference paper
Type of publication (PDB): Conference theses in non-peer-reviewed publications (T2)
Field of Science: Informatics (N009)
Author(s): Mandravickaitė, Justina;Krilavičius, Tomas
Title: Statistical analysis of word frequency distribution in texts of different genres: comparison of Lithuanian and English
Is part of: Data analysis methods for software systems – DAMSS: 9th International Workshop, Druskininkai, Lithuania, November 30-December 2, 2017 / editor Jolita Bernatavičienė. Vilnius : Vilnius University Institute of Data Science and Digital Technologies, 2017
Extent: p. 31-31
Date: 2017
Keywords: Texts written in different genres;Corpus;Lithuanian Language
ISBN: 9789986680642
Abstract: We report on an ongoing study of the statistical characteristics of texts written in different genres. It has been suggested that genres resonate with people because they provide familiarity and a shorthand for communication. Genres also tend to shift hand in hand with public opinion and reflect the prevailing culture of a given period. From an NLP perspective, genres are useful in text classification and categorization, natural language generation, etc. At this stage, we present a statistical analysis of Lithuanian and English texts of different genres. For our explorations, we use the Corpus of the Contemporary Lithuanian Language (for the Lithuanian part) and the Freiburg-LOB Corpus of British English (F-LOB) (for the English part). The main points of interest are the number of words, the number of different words, and word frequencies. Structural type distribution and Zipf’s law were applied to describe the frequency distribution of words in different textual genres. Zipf’s law is one of the universal laws proposed to describe statistical regularities in language; thus, word frequencies and their derivative indicators could be used to characterize textual genres. Applying word rank-frequency distribution, type-token ratio, the percentage of hapax legomena (i.e., words that occur only once), and entropy to different genre groups (fiction, scientific articles, documents, news articles) supported this assumption. Differences between the two languages (Lithuanian and English) were observed as well. As genres are rather complex phenomena that depend on various linguistic, cultural, societal, and other factors, our future work includes research on additional frequency structure indicators as well as their combinations.
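Illustrative note: the indicators named in the abstract (type-token ratio, percentage of hapax legomena, entropy, and the rank-frequency distribution behind Zipf’s law) can be sketched for a single text as follows. This is a minimal sketch and not the authors' implementation. The function names `frequency_indicators` and `zipf_slope`, the naive regex tokenizer, the choice to express hapax legomena as a share of word types, and the use of Shannon entropy in bits are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def frequency_indicators(text):
    """Compute simple word-frequency indicators for one text (illustrative)."""
    # Assumption: a naive tokenizer that lowercases the text and keeps runs of
    # word characters. In Python 3, \w is Unicode-aware, so Lithuanian letters
    # such as "ž" or "ė" are kept inside tokens.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")

    counts = Counter(tokens)
    n_tokens = len(tokens)   # number of words (tokens)
    n_types = len(counts)    # number of different words (types)

    # Hapax legomena: words that occur only once.
    n_hapax = sum(1 for c in counts.values() if c == 1)

    # Shannon entropy (in bits) of the empirical word distribution.
    entropy = -sum(
        (c / n_tokens) * math.log2(c / n_tokens) for c in counts.values()
    )

    # Rank-frequency pairs: rank 1 is the most frequent word.
    rank_freq = [
        (rank, freq)
        for rank, (_, freq) in enumerate(counts.most_common(), start=1)
    ]

    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens,
        # Assumption: percentage is taken relative to the number of types.
        "hapax_percentage": 100.0 * n_hapax / n_types,
        "entropy_bits": entropy,
        "rank_freq": rank_freq,
    }


def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank).

    Under Zipf's law the slope is expected to be close to -1.
    """
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two ranks to fit a slope")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    stats = frequency_indicators(sample)
    print(stats["type_token_ratio"], stats["hapax_percentage"])
    print(zipf_slope(stats["rank_freq"]))
```

Comparing genres would then amount to computing these values per genre group and per language. Since type-token ratio and the share of hapax legomena depend strongly on text length, comparing samples of similar size is a reasonable precaution.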
Affiliation(s): Baltijos pažangių technologijų institutas, Vilnius
Baltijos pažangiųjų technologijų institutas
Taikomosios informatikos katedra
Vilniaus universitetas
Vytauto Didžiojo universitetas
Appears in Collections: University Research Publications
