From Kyrgyz internet texts to an XML full-form annotated lexicon: a simple semi-automatic pipeline

Boizou, Loic; Mambetkazieva, Dinara

Use this url to cite publication: https://hdl.handle.net/20.500.12259/58368

Type of publication

Straipsnis recenzuojamoje užsienio tarptautinės konferencijos medžiagoje / Article in peer-reviewed foreign international conference proceedings (P1d)

Author(s)

Author	Affiliation	Country
Boizou, Loic	Užsienio kalbų, lit. ir vert. s. katedra / Department of Foreign Language, Literary and Translation Studies	LT
Mambetkazieva, Dinara	Užsienio kalbų institutas / Institute of Foreign Languages	LT

Title

From Kyrgyz internet texts to an XML full-form annotated lexicon: a simple semi-automatic pipeline

[en]

Other Title

От Кыргызских текстов из интернета до аннотированному XML-лексикону словоформ: описание несложного полуавтоматического конвейера

[ru]

Is part of

TurkLang 2017: Пятая международная конференция по компьютерной обработке тюркских языков: Труды конференции. Т 1. Казань: Издательство Академии наук Республики Татарстан, 2017

Date Issued

2017

Publisher

Казань : Издательство Академии наук Республики Татарстан

Extent

p. 242-254

URI

https://hdl.handle.net/20.500.12259/58368

Abstract (ru)

В целях содействия развитию новых свободных ресурсов для кыргызского языка в настоящей статье описывается простой полуавтоматический конвейер, который создает лексикон словоформ на основе корпуса, созданного из свободно распространяемых в Интернете текстов. Все компоненты были разработаны как короткие программы на Haskell. Корпус, который состоит приблизительно из 1,6 миллиона слов и 170 текстов, включает в себя различные жанры, такие как литературные произведения (романы, повести, пьесы), законы и другие нормативные тексты, новости, институциональные сайты компаний, университетов, правительственных структур, статьи в кыргызской Википедии. Его структура (жанровая пропорция, длина текстов) неоптимальна, но наша цель - получить значительный лексический охват стандартного письменного кыргызского языка, а не точное представление языка.[...]

Abstract (en)

With the intention of fostering the development of new free resources for Kyrgyz, the present paper describes a simple semi-automatic pipeline that generates a full-form lexicon from a corpus of texts freely available on the internet. All components were developed as short Haskell programs. The corpus, which comprises about 1.6 million words and 170 texts, includes various genres, such as literary works (novels, short stories, plays), laws and other regulatory texts, news, institutional websites of companies, universities and government structures, and Wikipedia articles. Its design (genre proportion, text length) is far from optimal, but our aim is to obtain significant lexical coverage of standard written Kyrgyz rather than a faithful representation of the language. A word list was automatically extracted from the corpus, and items with non-Kyrgyz characters were filtered out. The resulting list of about 130,000 distinct word forms was analysed as morphemic sequences with a set of grammatical values. This morphological analysis relies on a simple finite-state machine (FSM) which describes Kyrgyz word structure. The FSM is stored in a simple text file, where each line is a transition. Each transition represents a single morpheme surface form, except the morphemic sequence possession + case, which exhibits some irregularities and is represented as one transition for the sake of simplicity. The analyser works like a morphological guesser: it provides a list of all formally plausible morphemic segmentations for each word form. There is no information about existing stems and no disambiguation is performed at this stage.[...]
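
The word-list filtering and the stem-free guessing described in the abstract can be illustrated with a minimal Haskell sketch. It is not the authors' code. The transition format (`source surface tag target`, separated by whitespace), the state names, the tags, the six toy transitions and all function names are assumptions made for illustration only. The sketch also assumes that every state is accepting and that a stem is any non-empty prefix of the word, since no stem lexicon is used at this stage.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (inits, nub, sort, stripPrefix, tails)
import Data.Maybe (mapMaybe)

-- The 36 lowercase letters of the Kyrgyz Cyrillic alphabet.
kyrgyzLetters :: String
kyrgyzLetters = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"

-- Split raw text into maximal runs of alphabetic characters.
tokens :: String -> [String]
tokens s = case dropWhile (not . isAlpha) s of
  "" -> []
  s' -> let (w, rest) = span isAlpha s' in w : tokens rest

-- Sorted list of distinct lowercased word forms; tokens containing
-- any non-Kyrgyz character are dropped.
wordList :: String -> [String]
wordList = sort . nub . filter isKyrgyz . map (map toLower) . tokens
  where isKyrgyz = all (`elem` kyrgyzLetters)

type State = String

-- One FSM transition: it consumes one morpheme surface form.
data Transition = Transition
  { source  :: State
  , surface :: String
  , tag     :: String
  , target  :: State
  } deriving (Show, Eq)

-- Hypothetical file format: "source surface tag target", one per line.
-- Lines that do not have exactly four fields are ignored.
parseFsm :: String -> [Transition]
parseFsm = mapMaybe (parse . words) . lines
  where
    parse [s, m, t, d] = Just (Transition s m t d)
    parse _            = Nothing

-- Toy fragment (assumption): two plural and two locative allomorphs.
toyFsm :: [Transition]
toyFsm = parseFsm $ unlines
  [ "stem лар PL num", "stem лөр PL num"
  , "stem да LOC case", "stem дө LOC case"
  , "num да LOC case", "num дө LOC case" ]

-- One candidate segmentation: a guessed stem plus (surface, tag) pairs.
data Analysis = Analysis
  { stem      :: String
  , morphemes :: [(String, String)]
  } deriving (Show, Eq)

-- All transition paths from a state that consume the whole remainder.
suffixPaths :: [Transition] -> State -> String -> [[(String, String)]]
suffixPaths _ _ "" = [[]]
suffixPaths fsm st rest =
  [ (surface t, tag t) : more
  | t <- fsm
  , source t == st
  , Just rest' <- [stripPrefix (surface t) rest]
  , more <- suffixPaths fsm (target t) rest' ]

-- Guesser: try every non-empty prefix as a stem, keep every FSM path.
guess :: [Transition] -> String -> [Analysis]
guess fsm w =
  [ Analysis s path
  | (s, rest) <- drop 1 (zip (inits w) (tails w))
  , path <- suffixPaths fsm "stem" rest ]
```

With this toy fragment, `guess toyFsm "үйлөрдө"` yields three candidates: the stem `үй` followed by `лөр` (PL) and `дө` (LOC), the stem `үйлөр` followed by `дө` (LOC), and the whole form taken as a bare stem. Keeping all three reflects the guesser behaviour described in the abstract, where no disambiguation is performed.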