Lithuanian hate speech classification using deep learning methods
| Date | Start Page | End Page |
| --- | --- | --- |
| 2022 | 40 | 40 |
The ever-increasing amount of online content and the opportunity for everyone to express their opinions online lead to frequent encounters with social problems: bullying, insults, and hate speech. Some online portals are taking steps to stop this, such as no longer allowing anonymous comments or removing the option to comment under articles altogether. Other portals employ moderators who identify and remove hate speech. However, given the large number of comments, a correspondingly large number of people is required for this work. The rapid development of artificial intelligence in the language technology area may offer a solution: automated hate speech detection would make it possible to manage the ever-increasing amount of online content. In this work, we report a comparison of hate speech detection models for the Lithuanian language. We used three deep learning models: Multilingual BERT, LitLat BERT, and Electra. We trained the latter from scratch ourselves on Lithuanian texts totaling ~70 million words. All three models were then fine-tuned to classify Lithuanian user-generated comments into three main classes: hate, offensive, and neutral speech. To adapt the models to the hate speech detection task, we prepared a pre-processed and annotated dataset of 25 219 user-generated comments (hate speech: 2 082; offensive: 7 821; neutral: 15 316). The collected text corpus was also analyzed using topic modeling to reveal the most frequent topics of hate speech comments. The trained models were evaluated with accuracy, precision, recall, and F1-score metrics. LitLat BERT performed best with a weighted average F1-score of 0.72, Multilingual BERT came second with a weighted average F1-score of 0.63, and Electra third with a weighted average F1-score of 0.55. As Electra models have the potential to outperform BERT models, our future plans include retraining our base Lithuanian Electra model with larger datasets.
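The abstract does not give implementation details, but the fine-tuning setup it describes maps naturally onto the Hugging Face `transformers` API. The sketch below is a minimal, hypothetical version: the checkpoint name is the public Multilingual BERT release, while the data files, column names, label encoding, and hyperparameters are assumptions for illustration, not the paper's actual configuration.

```python
# Minimal sketch: fine-tuning Multilingual BERT for the three-class task
# (hate / offensive / neutral). File names, columns, and hyperparameters
# are hypothetical placeholders, not the paper's reported setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # Multilingual BERT baseline

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3)  # assumed: 0 = neutral, 1 = offensive, 2 = hate

# Hypothetical CSV files with a "text" column and an integer "label" column.
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="hate-speech-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # tokenizer enables dynamic padding
trainer.train()
```

The same fine-tuning loop would apply to the LitLat BERT and Electra checkpoints by swapping `MODEL_NAME`.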
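The abstract mentions topic modeling but does not name the method used. As one common choice, the sketch below runs LDA over a bag-of-words representation with scikit-learn; the toy comments stand in for the real Lithuanian corpus.

```python
# Assumed approach: LDA topic modeling over a bag-of-words matrix.
# The tiny corpus below is a placeholder for the collected comments.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "comment about politics and the government",
    "comment about a sports match result",
    "another comment about government policy",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the top words of each discovered topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {idx}: {', '.join(top)}")
```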
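The reported metrics, accuracy, precision, recall, and F1-score with the weighted average F1 used to rank the models, correspond directly to standard scikit-learn classification metrics. A small illustrative sketch with placeholder labels:

```python
# Illustrative evaluation; y_true / y_pred are placeholders, not real results.
from sklearn.metrics import classification_report, f1_score

y_true = [0, 0, 1, 2, 1, 0]  # gold labels: 0 = neutral, 1 = offensive, 2 = hate
y_pred = [0, 1, 1, 2, 0, 0]  # model predictions

# Per-class precision/recall/F1 plus accuracy, macro and weighted averages.
print(classification_report(y_true, y_pred,
                            target_names=["neutral", "offensive", "hate"]))

# The single number used to compare models in the abstract.
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))
```

The weighted average is a sensible choice here because the class distribution is heavily skewed toward neutral comments.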