Authorship attribution of internet comments with thousand candidate authors
Author | Affiliation | |
---|---|---|
LT | ||
LT | ||
Kauno technologijos universitetas | LT |
Date |
---|
2015 |
In this paper we report the first authorship attribution results for the Lithuanian language using Internet comments with a thousand candidate authors. The task is complicated by the large number of candidate authors, extremely short non-normative texts, and the problems associated with a morphologically rich language with a large vocabulary. The effectiveness of the proposed similarity-based method was investigated using lexical, morphological, and character features, as well as several dimensionality reduction techniques. The best results, by a small margin, were obtained with word-level character tetra-grams and the entire feature set. However, the technique based on randomized feature sets achieved very similar performance even with only a few thousand features, and it outperformed implementations of the method based on sophisticated feature ranking. The best F-score and accuracy values exceeded the random and majority baselines by more than 10.9 percentage points.
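As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds word-level character tetra-gram profiles per author and attributes a comment to the most similar profile, optionally over a randomly sampled feature subset. It is not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names (`word_char_ngrams`, `profile`, `cosine`, `attribute`) and the handling of words shorter than four characters are hypothetical choices made only for this example.

```python
import math
import random
from collections import Counter


def word_char_ngrams(text, n=4):
    """Character n-grams taken inside each word, never across word boundaries."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            grams.append(word)  # assumption: short words are kept whole
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def profile(texts, n=4):
    """Aggregate n-gram counts over all known texts of one author."""
    counts = Counter()
    for t in texts:
        counts.update(word_char_ngrams(t, n))
    return counts


def cosine(a, b, features=None):
    """Cosine similarity of two count vectors, optionally restricted to `features`."""
    keys = features if features is not None else set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in keys))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(comment, author_profiles, n=4, n_features=None, seed=0):
    """Return the author whose profile is most similar to the comment.

    If n_features is given, similarity is computed on a random subset of the
    vocabulary (a stand-in for the randomized feature sets in the abstract).
    """
    features = None
    if n_features is not None:
        vocab = sorted(set().union(*author_profiles.values()))
        rng = random.Random(seed)
        features = rng.sample(vocab, min(n_features, len(vocab)))
    query = Counter(word_char_ngrams(comment, n))
    return max(author_profiles,
               key=lambda a: cosine(query, author_profiles[a], features))


# Toy usage with two invented authors (not data from the paper).
profiles = {
    "author_a": profile(["labas rytas visiems"]),
    "author_b": profile(["geras oras siandien"]),
}
predicted = attribute("labas rytas", profiles)
```

With a thousand candidate authors and very short comments, the same structure applies, but the profiles become sparse, which is where a feature subset of a few thousand n-grams would reduce the cost of each comparison.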
Journal | CiteScore | SNIP | SJR | Year | Quartile |
---|---|---|---|---|---|
Communications in Computer and Information Science | 0.6 | 0.306 | 0.169 | 2015 | Q4 |