Lithuanian hate speech classification using deep learning methods
| Date | Start Page | End Page |
| --- | --- | --- |
| 2022 | 40 | 40 |
The ever-increasing amount of online content and the opportunity for everyone to express their opinions online lead to frequent encounters with social problems: bullying, insults, and hate speech. Some online portals are taking steps to stop this, such as no longer allowing anonymous comments or removing the option to comment under articles altogether. Other portals employ moderators who identify and remove hate speech. However, given the large number of comments, a correspondingly large number of people is required for this work. The rapid development of artificial intelligence in the language technology area may offer a solution: automated hate speech detection would make it possible to manage the ever-increasing amount of online content. In this work, we report a comparison of hate speech detection models for the Lithuanian language. We used three deep learning models: Multilingual BERT, LitLat BERT, and Electra. We trained the latter from scratch ourselves on Lithuanian texts totaling ~70 million words. All three models were then fine-tuned to classify Lithuanian user-generated comments into three main classes: hate, offensive, and neutral speech. To adapt the models to the hate speech detection task, we prepared a pre-processed and annotated dataset of 25 219 user-generated comments (hate speech: 2 082; offensive: 7 821; neutral: 15 316). The collected text corpus was also analyzed using topic modeling to reveal the most frequent topics of hate speech comments. The trained models were evaluated with accuracy, precision, recall, and F1-score metrics. LitLat BERT performed best with a weighted average F1-score of 0.72, Multilingual BERT came second with a weighted average F1-score of 0.63, and Electra third with a weighted average F1-score of 0.55. As Electra models have the potential to outperform BERT models, our future plans include retraining our base Lithuanian Electra model with larger datasets.
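The abstract does not give implementation details, but the fine-tuning setup it describes maps naturally onto the Hugging Face `transformers` API. The sketch below is a minimal, hypothetical version: the checkpoint name is the public Multilingual BERT release, while the data files, column names, label encoding, and hyperparameters are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch: fine-tuning Multilingual BERT for the three-class task
# (hate / offensive / neutral). File names, columns, and hyperparameters
# are hypothetical placeholders, not the paper's reported setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # Multilingual BERT baseline

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3)  # assumed: 0 = neutral, 1 = offensive, 2 = hate

# Hypothetical CSV files with a "text" column and an integer "label" column.
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="hate-speech-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # tokenizer enables dynamic padding
trainer.train()
```

The same fine-tuning loop would apply to the LitLat BERT and Electra checkpoints by swapping `MODEL_NAME`.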
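The abstract mentions topic modeling but does not name the method used. As one common choice, the sketch below runs LDA over a bag-of-words representation with scikit-learn; the toy comments stand in for the real Lithuanian corpus.

```python
# Assumed approach: LDA topic modeling over a bag-of-words matrix.
# The tiny corpus below is a placeholder for the collected comments.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "comment about politics and the government",
    "comment about a sports match result",
    "another comment about government policy",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {idx}: {', '.join(top)}")
```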
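The reported metrics, accuracy, precision, recall, and F1-score with the weighted average F1 used to rank the models, correspond directly to standard scikit-learn classification metrics. A small illustrative sketch with placeholder labels:

```python
# Illustrative evaluation; y_true / y_pred are placeholders, not real results.
from sklearn.metrics import classification_report, f1_score

y_true = [0, 0, 1, 2, 1, 0]  # gold labels: 0 = neutral, 1 = offensive, 2 = hate
y_pred = [0, 1, 1, 2, 0, 0]  # model predictions

# Per-class precision/recall/F1 plus accuracy, macro and weighted averages.
print(classification_report(y_true, y_pred,
                            target_names=["neutral", "offensive", "hate"]))

# The single number used to compare models in the abstract.
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```

The weighted average is a sensible choice here because the class distribution is heavily skewed toward neutral comments.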