Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12259/57571
Type of publication: Conference theses in non-peer-reviewed publications (T2)
Field of Science: Computer science (N009)
Author(s): Mandravickaitė, Justina;Krilavičius, Tomas
Title: Hybrid approach for an automatic identification of multi-word expressions for Latvian and Lithuanian
Is part of: Data analysis methods for software systems : 8th international workshop, Druskininkai, Lithuania, December 1-3, 2016 : [abstracts book]. Vilnius : Vilnius University Institute of Data Science and Digital Technologies, 2016
Extent: p. 35-36
Date: 2016
ISBN: 9789986680611
Abstract: A multi-word expression (MWE) is a sequence of two or more words that functions as a single unit at some level of linguistic analysis, e.g. syntactic or morphological. Identifying MWEs is one of the most challenging problems in NLP. Many techniques are used for this problem; however, not all of them can be transferred to Lithuanian and Latvian because of their rich morphology. At this stage, we use raw corpora (Lithuanian and Latvian, 9 million words for each language) and a combination of lexical association measures (LAMs) and supervised machine learning (ML) to look for bi-gram MWEs. Candidate MWE bi-grams were extracted from the raw text, and 5 LAMs (Maximum Likelihood Estimation, Dice, Pointwise Mutual Information, Student's t-score and Log-likelihood) were calculated for them. Reference lists based on EuroVoc, the Multilingual Thesaurus of the European Union, were used to evaluate the MWE candidates. Then Naïve Bayes, OneR (a rule-based classifier) and Random Forest were applied. SMOTE and Resample filters were used because of sparseness. LAMs and ML algorithms were combined in 3 ways: without any filter, with SMOTE, and with Resample. The results were evaluated with Precision, Recall and F-measure under 10-fold cross-validation. The best results for both Latvian and Lithuanian were achieved with Random Forest + Resample (LV: P = 92.4%, R = 52.2%, F = 66.7%; LT: P = 95.1%, R = 77.8%, F = 85.6%). Our future plans include experiments on extracting different types of MWEs and a greater diversity of MWEs.
Internet: https://hdl.handle.net/20.500.12259/57571
Affiliation(s): Baltijos pažangių technologijų institutas, Vilnius
Baltijos pažangiųjų technologijų institutas
Informatikos fakultetas
Taikomosios informatikos katedra
Vilniaus universitetas
Vytauto Didžiojo universitetas
Appears in Collections: Universiteto mokslo publikacijos / University Research Publications
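
The abstract names five lexical association measures and reports F-measures, but gives no formulas. The following is a minimal sketch of how such scores might be computed for candidate bi-grams, not the authors' implementation. The function names, the toy token list, and the exact variants chosen are assumptions: "Maximum Likelihood Estimation" is taken here to mean the relative frequency of the bi-gram, and the log-likelihood is the G² statistic over a 2x2 contingency table. The last function computes the usual harmonic-mean F-measure from precision and recall.

```python
import math
from collections import Counter


def _g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row totals
    c1, c2 = o11 + o21, o12 + o22   # column totals
    cells = (
        (o11, r1 * c1 / n), (o12, r1 * c2 / n),
        (o21, r2 * c1 / n), (o22, r2 * c2 / n),
    )
    # Cells with zero observations contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in cells if o > 0)


def bigram_association_scores(tokens):
    """Score every adjacent word pair with five association measures."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    bigrams = Counter(pairs)
    left = Counter(w1 for w1, _ in pairs)    # w1 as first element
    right = Counter(w2 for _, w2 in pairs)   # w2 as second element
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        f1, f2 = left[w1], right[w2]
        expected = f1 * f2 / n
        scores[(w1, w2)] = {
            "mle": o11 / n,  # assumed variant: relative frequency
            "dice": 2 * o11 / (f1 + f2),
            "pmi": math.log2(o11 / expected),
            "t_score": (o11 - expected) / math.sqrt(o11),
            "log_likelihood": _g2(o11, f1 - o11, f2 - o11,
                                  n - f1 - f2 + o11),
        }
    return scores


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy input; a real run would use the tokenized corpus.
tokens = "european union law european union budget new law".split()
scores = bigram_association_scores(tokens)
for pair, measures in sorted(scores.items()):
    print(pair, {name: round(value, 3) for name, value in measures.items()})
```

In a pipeline like the one described, each candidate bi-gram's five scores would form the feature vector passed to a classifier, with the label taken from the EuroVoc-based reference list. Applying `f_measure` to the precision and recall values reported in the abstract reproduces the stated F-measures to one decimal place.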