The Effect of author set size in authorship attribution for Lithuanian

Kapočiūtė-Dzikienė, Jurgita; Šarkutė, Ligita; Utka, Andrius

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/35098

Type of publication

Article in peer-reviewed foreign international conference proceedings (P1d)

Author(s)

Author	Affiliation	Country
Kapočiūtė-Dzikienė, Jurgita	Department of Applied Informatics	LT
Šarkutė, Ligita	Kaunas University of Technology	LT
Utka, Andrius	Department of Lithuanian Studies	LT

Title

The Effect of author set size in authorship attribution for Lithuanian

[en]

Is part of

NODALIDA 2015 : proceedings of the 20th Nordic conference of computational linguistics, May 11–13, 2015, Institute of the Lithuanian language, Vilnius / editor Beata Megyesi. Linköping : Linköping University Electronic Press, 2015

Date Issued

2015

Publisher

Linköping : Linköping University Electronic Press, 2015

Extent

p. 87-96

URI

https://www.vdu.lt/cris/bitstream/20.500.12259/35098/1/ISBN9789175190983_P_87-96.pdf
https://hdl.handle.net/20.500.12259/35098

Abstract (en)

This paper reports the first results on how author set size affects authorship attribution for the Lithuanian language using automatic computational methods. The aim is to determine how quickly attribution results deteriorate as the number of candidate authors gradually increases: starting from 3 and going up to 5, 10, 20, 50, and 100. Using supervised machine learning techniques, we also investigated the influence of different feature types (lexical, character, morphological, etc.) and language types (normative parliamentary speeches and non-normative forum posts). The experiments revealed that the effectiveness of the method and feature types depends more on the language type than on the number of candidate authors. Content features based on word lemmas are the most useful type for normative texts, because Lithuanian is a highly inflective language with rich morphology and vocabulary. Character features are the most accurate type for forum posts, where the texts are too complicated to be processed effectively with external morphological tools.
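
The experimental design described in the abstract (comparing feature types while the candidate author set grows) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy English texts, and the nearest-profile cosine classifier, which stands in for the supervised machine learning methods that the abstract mentions but does not name. It does not reproduce the paper's data, methods, or results.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (a character-level feature type)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_unigrams(text):
    """Count whitespace-separated lowercased tokens (a lexical feature type)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(training, candidates, featurize):
    """Sum feature counts over each candidate author's training texts."""
    profiles = {author: Counter() for author in candidates}
    for author, text in training:
        if author in profiles:
            profiles[author].update(featurize(text))
    return profiles


def attribute(text, profiles, featurize):
    """Return the candidate author whose profile is most similar to the text."""
    features = featurize(text)
    return max(profiles, key=lambda author: cosine(features, profiles[author]))


def accuracy_by_set_size(training, held_out, authors, sizes, featurize):
    """Accuracy for nested candidate sets made of the first k authors."""
    results = {}
    for k in sizes:
        candidates = set(authors[:k])
        profiles = build_profiles(training, candidates, featurize)
        subset = [(a, t) for a, t in held_out if a in candidates]
        correct = sum(attribute(t, profiles, featurize) == a for a, t in subset)
        results[k] = correct / len(subset) if subset else 0.0
    return results


# Hypothetical toy data: (author, text) pairs invented for illustration only.
training = [
    ("A", "the budget debate continues in parliament today"),
    ("B", "lol that game was sooo good cant wait"),
    ("C", "we must consider the committee report carefully"),
]
held_out = [
    ("A", "parliament continues the budget debate"),
    ("B", "sooo good lol cant wait for the game"),
    ("C", "the committee report must be considered"),
]
authors = ["A", "B", "C"]

for name, featurize in [("word", word_unigrams), ("char", char_ngrams)]:
    print(name, accuracy_by_set_size(training, held_out, authors, [2, 3], featurize))
```

In a study of the kind the abstract describes, the list of sizes would be 3, 5, 10, 20, 50, and 100, and the toy profile classifier would be replaced by a trained supervised model. A lemma-based feature type would additionally require an external morphological tool, which this sketch does not attempt to model.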