Hybrid approach for an automatic identification of multi-word expressions for Latvian and Lithuanian
Affiliation | Country |
---|---|
Baltijos pažangiųjų technologijų institutas | LT |
Vilniaus universitetas | LT |
Date |
---|
2016 |
A Multi-Word Expression (MWE) is a sequence of two or more words that functions as a single unit at some level of linguistic analysis, e.g. syntactic or morphological. Identification of MWEs is one of the most challenging problems in NLP. Many techniques have been applied to this problem; however, not all of them can be transferred to Lithuanian and Latvian because of the rich morphology of these languages. At this stage, we use raw corpora (LT and LV, 9 million words for each language) and a combination of lexical association measures (LAMs) and supervised machine learning (ML) to look for bi-gram MWEs. EuroVoc, the multilingual thesaurus of the European Union, serves as the reference for evaluating MWE candidates. The candidate MWE bi-grams were extracted from raw text, 5 LAMs (Maximum Likelihood Estimation, Dice, Pointwise Mutual Information, Student's t-score and Log-likelihood) were calculated for each candidate, and the candidates were evaluated against reference lists based on EuroVoc. Then Naïve Bayes, OneR (a rule-based classifier) and Random Forest were applied. SMOTE and Resample filters were used because of the sparseness of the data. The results were evaluated with Precision, Recall, and F-measure, using 10-fold cross-validation. LAMs and ML algorithms were combined in 3 ways: without any filter, with SMOTE, and with Resample. The best results for both Latvian and Lithuanian were achieved with Random Forest+Resample (LV: P = 92.4%, R = 52.2% and F = 66.7%; LT: P = 95.1%, R = 77.8% and F = 85.6%). Our future plans include experiments on the extraction of different types of MWEs and on a greater diversity of MWEs.
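As an illustration of the LAM step, the following minimal Python sketch computes the five measures named above for every adjacent word pair in a token list, together with the F-measure used for evaluation. It is not the authors' implementation: the function names (`bigram_lams`, `log_likelihood`, `f_measure`), the reading of Maximum Likelihood Estimation as the conditional probability P(w2 | w1), the use of bigram-position marginals, and the toy sentence in the usage note are all assumptions made for this example.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of bigram counts."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    g2 = 0.0
    for obs, row, col in ((o11, row1, col1), (o12, row1, col2),
                          (o21, row2, col1), (o22, row2, col2)):
        if obs > 0:  # an empty cell contributes nothing to the sum
            g2 += obs * math.log(obs * n / (row * col))
    return 2.0 * g2


def bigram_lams(tokens):
    """Return {(w1, w2): {measure: score}} for all adjacent word pairs.

    Assumption: marginal frequencies are counted over bigram positions
    (w1 as first element, w2 as second element), not over raw tokens.
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_freq = Counter(bigrams)
    left_freq = Counter(w1 for w1, _ in bigrams)
    right_freq = Counter(w2 for _, w2 in bigrams)

    scores = {}
    for (w1, w2), f12 in pair_freq.items():
        f1, f2 = left_freq[w1], right_freq[w2]
        expected = f1 * f2 / n  # expected count if w1 and w2 were independent
        scores[(w1, w2)] = {
            "mle": f12 / f1,  # assumed reading: P(w2 | w1)
            "dice": 2 * f12 / (f1 + f2),
            "pmi": math.log2(f12 / expected),
            "t_score": (f12 - expected) / math.sqrt(f12),
            "log_likelihood": log_likelihood(
                f12, f1 - f12, f2 - f12, n - f1 - f2 + f12
            ),
        }
    return scores


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For a toy input such as `"european union law and european union policy and european court".split()`, the pair `("european", "union")` gets a Dice score of 0.8 under these assumptions. In the setup described above, such score vectors would serve as features for the classifiers. The reported F values are consistent with the reported Precision and Recall: `f_measure(0.924, 0.522)` is about 0.667 and `f_measure(0.951, 0.778)` is about 0.856.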