Authorship attribution of internet comments with thousand candidate authors

Kapočiūtė-Dzikienė, Jurgita; Utka, Andrius; Šarkutė, Ligita

doi:10.1007/978-3-319-24770-0_37

Use this url to cite publication: https://hdl.handle.net/20.500.12259/47858

Details

Type of publication

Straipsnis konferencijos medžiagoje Web of Science ir Scopus duomenų bazėje / Article in conference proceedings in Web of Science and Scopus database (P1a)

Author(s)

Author	Affiliation	Country
Kapočiūtė-Dzikienė, Jurgita		LT
Utka, Andrius		LT
Šarkutė, Ligita	Kauno technologijos universitetas	LT

Title

Authorship attribution of internet comments with thousand candidate authors

[en]

Is part of

ICIST 2015 : Information and software technologies : 21st international conference, Druskininkai, Lithuania, October 15-16, 2015 : proceedings / editors Giedre Dregvaite, Robertas Damasevicius. Berlin : Springer International Publishing, 2015

Date Issued

2015

Publisher

Berlin : Springer International Publishing, 2015

Extent

p. 433-448

URI

https://hdl.handle.net/20.500.12259/47858

DOI

10.1007/978-3-319-24770-0_37

Research Area

Gamtos mokslai / Natural Sciences (N)

Field of Science

Informatika / Informatics (N009)

Abstract (en)

In this paper we report the first authorship attribution results for the Lithuanian language using Internet comments with a thousand candidate authors. The task is complicated by the large number of candidate authors, extremely short non-normative texts, and the problems associated with a morphologically rich language with a large vocabulary. The effectiveness of the proposed similarity-based method was investigated using lexical, morphological, and character features, as well as several dimensionality reduction techniques. The best results, by a small margin, were obtained with word-level character tetra-grams and the entire feature set. However, the technique based on randomized feature sets achieved very similar performance even with only a few thousand features; it also outperformed implementations of the method based on sophisticated feature ranking. The best obtained F-score and accuracy values exceeded the random and majority baselines by more than 10.9 percentage points.
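
The abstract names word-level character tetra-grams, a similarity-based method, and randomized feature sets, but gives no implementation details. The following is only a minimal sketch of how those three ideas could fit together. Cosine similarity over raw n-gram counts, the handling of words shorter than four characters, and all function names (`word_char_ngrams`, `build_profiles`, `random_feature_subset`, `attribute`) are assumptions made for illustration, not the authors' method.

```python
import math
import random
from collections import Counter


def word_char_ngrams(text, n=4):
    """Character n-grams taken inside each word, never across word boundaries."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            grams.append(word)  # assumption: words shorter than n are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(training, n=4):
    """Merge each candidate author's texts into one n-gram count profile."""
    profiles = {}
    for author, texts in training.items():
        profile = Counter()
        for text in texts:
            profile.update(word_char_ngrams(text, n))
        profiles[author] = profile
    return profiles


def random_feature_subset(profiles, k, seed=0):
    """Keep k randomly chosen features instead of ranking them."""
    vocab = sorted({gram for p in profiles.values() for gram in p})
    keep = set(random.Random(seed).sample(vocab, min(k, len(vocab))))
    return {
        author: Counter({g: c for g, c in p.items() if g in keep})
        for author, p in profiles.items()
    }


def attribute(text, profiles, n=4):
    """Return the candidate author whose profile is most similar to the text."""
    query = Counter(word_char_ngrams(text, n))
    return max(profiles, key=lambda author: cosine(query, profiles[author]))
```

With a thousand candidate authors, `profiles` would hold a thousand count vectors, and `random_feature_subset` would be applied once before any call to `attribute`.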

Series/Report no.

(Communications in Computer and Information Science, Vol. 538, ISSN 1865-0929)

Media Type (COAR)

Text > Journal > Journal article > Research article

Language

Anglų / English (en)

Coverage Spatial

Vokietija / Germany (DE)

Collections

Owning collection

Universiteto mokslo publikacijos / University Research Publications

Identifiers

ISBN (of the container)

9783319247694

ISSN (of the container)

1865-0929

WOS

WOS:000369179100037

Other Identifier(s)

VDU02-000018061

Access Rights

Apribota prieiga / Restricted Access

SCOPUS

Journal	Cite Score	SNIP	SJR	Year	Quartile
Communications in Computer and Information Science	0.6	0.306	0.169	2015	Q4