Hybrid approach for an automatic identification of multi-word expressions for Latvian and Lithuanian

Mandravickaitė, Justina; Krilavičius, Tomas

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/57571

Type of publication

Conference abstracts in a non-peer-reviewed publication (T2)

Author(s)

Author	Affiliation
Mandravickaitė, Justina	Baltijos pažangiųjų technologijų institutas (LT); Vilniaus universitetas (LT)

Title

Hybrid approach for an automatic identification of multi-word expressions for Latvian and Lithuanian

Is part of

Data analysis methods for software systems : 8th international workshop, Druskininkai, Lithuania, December 1-3, 2016 : [abstracts book]. Vilnius : Vilnius University Institute of Data Science and Digital Technologies, 2016

Date Issued

2016

Publisher

Vilnius : Vilnius University Institute of Data Science and Digital Technologies, 2016

Extent

p. 35-36

URI

https://hdl.handle.net/20.500.12259/57571

Abstract (en)

A multi-word expression (MWE) is a sequence of two or more words that functions as a single unit at some level of linguistic analysis, e.g. syntactic or morphological. Identification of MWEs is one of the most challenging problems in NLP. Many techniques have been applied to this problem; however, not all of them can be transferred to Lithuanian and Latvian because of their rich morphology. At this stage, we use raw corpora (LT and LV, 9 million words for each language) and a combination of lexical association measures (LAMs) and supervised machine learning (ML) to look for bi-gram MWEs. EuroVoc, the Multilingual Thesaurus of the European Union, is used to evaluate MWE candidates. The candidate MWE bi-grams were extracted from raw text, and 5 LAMs (Maximum Likelihood Estimation, Dice, Pointwise Mutual Information, Student's t-score and Log-likelihood) were calculated for them. Reference lists based on EuroVoc were used for evaluation. Then Naïve Bayes, OneR (a rule-based classifier) and Random Forest were applied. SMOTE and Resample filters were used because of the sparseness of the data. Precision, Recall and F-measure were used to evaluate the results. LAMs and ML algorithms were combined in 3 ways: without any filter, with SMOTE, and with Resample. 10-fold cross-validation was used. The best results for both Latvian and Lithuanian were achieved with Random Forest + Resample (LV: P = 92.4%, R = 52.2%, F = 66.7%; LT: P = 95.1%, R = 77.8%, F = 85.6%). Our future plans include experiments on extracting different types of MWEs and a greater diversity of MWEs.
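
The five association measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bigram_association_scores` is hypothetical, "Maximum Likelihood Estimation" is assumed here to mean the relative frequency of the bi-gram, and the other scores use common textbook formulations over a 2x2 contingency table of adjacent word pairs.

```python
import math
from collections import Counter


def bigram_association_scores(tokens):
    """Score every adjacent word pair in `tokens` with five association measures.

    Assumed formulations (textbook definitions, not taken from the abstract):
    mle = relative bi-gram frequency, dice, pmi (log base 2),
    t_score (Student's t) and log_likelihood (G^2 over a 2x2 table).
    """
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)                              # total number of bi-gram positions
    pair_counts = Counter(pairs)
    left_counts = Counter(w1 for w1, _ in pairs)    # w1 as first element
    right_counts = Counter(w2 for _, w2 in pairs)   # w2 as second element

    scores = {}
    for (w1, w2), o11 in pair_counts.items():
        c1, c2 = left_counts[w1], right_counts[w2]
        expected = c1 * c2 / n                  # expected count under independence

        # Observed 2x2 contingency table for (w1, w2).
        o12 = c1 - o11                          # w1 followed by another word
        o21 = c2 - o11                          # another word followed by w2
        o22 = n - o11 - o12 - o21               # neither w1 first nor w2 second
        observed = [(o11, c1, c2), (o12, c1, n - c2),
                    (o21, n - c1, c2), (o22, n - c1, n - c2)]
        # G^2 = 2 * sum O * ln(O / E); cells with O = 0 contribute nothing.
        g2 = 2.0 * sum(o * math.log(o / (row * col / n))
                       for o, row, col in observed if o > 0)

        scores[(w1, w2)] = {
            "mle": o11 / n,
            "dice": 2.0 * o11 / (c1 + c2),
            "pmi": math.log2(o11 / expected),
            "t_score": (o11 - expected) / math.sqrt(o11),
            "log_likelihood": g2,
        }
    return scores


if __name__ == "__main__":
    toy = "new york is big and new york is old".split()
    for pair, measures in bigram_association_scores(toy).items():
        print(pair, {name: round(value, 3) for name, value in measures.items()})
```

In a setup like the one described, each candidate bi-gram's five scores would serve as the feature vector given to a classifier, with the EuroVoc-based reference list supplying the positive labels; that pairing is an interpretation of the abstract and not a detail it states.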

Type of document

type::text::conference output::conference proceedings::conference paper

Language

English (en)

Coverage Spatial

Lithuania (LT)

ISBN (of the container)

9789986680611

Other Identifier(s)

VDU02-000022212

Department of Applied Informatics