Use this URL to cite the department: https://hdl.handle.net/20.500.12259/148346

Kompiuterinės lingvistikos centras / Centre of Computational Linguistics

City

Kaunas

Datasets

Now showing 1 - 10 of 21

Assessment data of the Dictionary of Modern Lithuanian versus Joint Corpora
dataset[2020][H004,N009]
Vilniaus universitetas / Vilnius University, 2020-06-30
The resource is the assessment data of The Dictionary of Modern Lithuanian, 6th edition (DML6) [1], from the point of view of its coverage in the Joint Corpus of Lithuanian (JCL) [2]. The JCL is a merge of three corpora: 1) the Vilnius University corpus, compiled from Lithuanian internet content from 2014 and primarily used for machine translation; 2) a legal document corpus in the form of a wordlist (courtesy of the Office of the Seimas of the Republic of Lithuania, 2011); and 3) the balanced corpus of present-day Lithuanian of Vytautas Magnus University (VMU). The total size of the JCL is more than 1.3 billion tokens.

The resource consists of 5 files:
1. Frequency list of types (distinct tokens) in JCL versus DML6. Columns: type, count, occurrence_in_dml6 (0 – no, 1 – main entries, 2 – geographic names, 3 – abbreviations).
2. List of explicit lemmas in DML6 versus JCL. Columns: lemma, part_of_speech, occurrence_in_JCL. Possible part_of_speech values: N – noun, V – verb, A – adjective, P – pronoun, R – adverb, S – preposition, C – conjunction, M – numeral, Q – particle, I – interjection, O – onomatopoeia, Y – abbreviation. occurrence_in_JCL is the count of all tokens in JCL that can be interpreted as a word form of the particular lemma.
3. Hunspell affixes (inflection rules) for the Lithuanian language.
4. Hunspell dictionary, constructed from both explicit and implicit DML6 lemmas.
5. Filtered list (excluding misspellings, foreign words, proper names, etc.) of 254726 JCL word forms that are missing in DML6. Columns: type, count.

Literature
[1] Dadurkevičius, V., Petrauskaitė, R. 2020: Corpus based methods for assessment of the traditional dictionaries. Human language technologies - the Baltic perspective: the 9th international conference Baltic HLT, Kaunas, Lithuania, September 22–23, 2020.
[2] The Dictionary of Modern Lithuanian. Edited by Keinys S. 6th (3rd electronic) edition of the Dabartinės lietuvių kalbos žodynas. 2006, ISBN 978-9955-704-37-9.
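As a rough illustration of how the first file might be used, the sketch below sums token counts per occurrence_in_dml6 code and derives the share of JCL tokens whose type occurs in DML6. It assumes tab-separated rows without a header line; the delimiter, the function names, and the sample rows are assumptions made for illustration and are not taken from the resource.

```python
import csv
import io
from collections import Counter

# Meaning of the occurrence_in_dml6 codes, as given in the resource description.
DML6_CODES = {0: "not in DML6", 1: "main entry", 2: "geographic name", 3: "abbreviation"}

def coverage_by_code(lines):
    """Sum token counts per occurrence_in_dml6 code.

    Assumes tab-separated rows (type, count, occurrence_in_dml6), no header.
    """
    totals = Counter()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        _type, count, code = row
        totals[int(code)] += int(count)
    return totals

def covered_share(totals):
    """Share of tokens whose type occurs in DML6 in any role (codes 1-3)."""
    all_tokens = sum(totals.values())
    covered = all_tokens - totals.get(0, 0)
    return covered / all_tokens if all_tokens else 0.0

# Made-up sample rows, for illustration only.
sample = io.StringIO("ir\t100\t1\nKaunas\t30\t2\nxyzq\t20\t0\npvz\t50\t3\n")
totals = coverage_by_code(sample)
print({DML6_CODES[code]: n for code, n in sorted(totals.items())})
print(f"covered share: {covered_share(totals):.2f}")
```

Weighting by the count column measures coverage of running text; counting rows instead would measure coverage of distinct types.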

Colloc – a tool for automatic identification of multiword expressions
[Colloc - įrankis automatiniam pastoviųjų žodžių junginių nustatymui]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
Colloc, a tool for automatic identification of multiword expressions (MWEs), is freely available for online use at http://resursai.mwe.lt/atpazintuvas. The DELFI.lt corpus (http://tekstynas.mwe.lt/) was used as training material. Identification relies on a combination of two trained models (an RNN bi-LSTM and a CRF). Automatically identified MWEs can be retrieved in two formats: as a list of MWEs and/or as text with annotated MWEs.

DELFI.lt corpus
[DELFI.lt tekstynas]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
DELFI.lt is a corpus of articles published by the news portal DELFI.lt from March 2014 to November 2016. Metadata were collected along with the articles: author, title, date, source, link, category, and number of words. The corpus comprises 190 000 news articles from 12 thematic categories: DELFI Faces (DELFI Veidai), Projects (Projektai), DELFI Science (DELFI Mokslas), DELFI Auto, Unidentified category, Sport, DELFI Life (DELFI Gyvenimas), DELFI People (DELFI Žmonės), DELFI Citizen (DELFI Pilietis), Business (Verslas), DELFI FIT, DELFI News (DELFI Žinios). In total, the DELFI.lt corpus consists of 70 million words. The corpus is morphologically annotated with Universal Dependencies tags and is freely accessible for online search at http://tekstynas.mwe.lt/.

Language technology research bibliography for Lithuanian 2016–2020
dataset[2020][H004,N009]
Vytauto Didžiojo universitetas / Vytautas Magnus University, 2020-07-28
A language technology bibliography for the Lithuanian language covering the period 2016-2020. The resource is in BibTeX format and contains: 1) 91 references to research publications, 2) 15 references to documents and strategies, and 3) 26 references to language resources and tools. The resource was used for the paper: Utka, Andrius, Jurgita Vaičenonienė, Monika Briedienė and Tomas Krilavičius. 2020. Development and Research in Lithuanian Language Technologies (2016-2020). In Human language technologies - the Baltic perspective: proceedings of the 9th international conference, Baltic HLT 2020, Kaunas (Lithuania). Amsterdam: IOS Press.
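A BibTeX file like this can be inspected with a few lines of code. The sketch below counts entries per BibTeX entry type using only the standard library; the function name and the sample entries are hypothetical, and how the resource's three groups of references map onto entry types is not stated in the description.

```python
import re
from collections import Counter

# Matches the start of an entry such as "@article{key," at the beginning of a line.
ENTRY_START = re.compile(r"^\s*@(\w+)\s*\{", re.MULTILINE)

# BibTeX directives that look like entries but are not references.
NON_ENTRIES = {"comment", "string", "preamble"}

def count_entry_types(bibtex_text):
    """Count BibTeX entries per (lowercased) entry type."""
    types = (m.group(1).lower() for m in ENTRY_START.finditer(bibtex_text))
    return Counter(t for t in types if t not in NON_ENTRIES)

# Made-up sample entries, for illustration only.
sample = """@article{a2016, title={Sample A}}
@inproceedings{b2018, title={Sample B}}
@misc{c2020, title={Sample C}}
@Article{d2019, title={Sample D}}
@comment{not a reference}
"""
counts = count_entry_types(sample)
print(counts)
```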

Lithuanian 1-gram dataset
[Lietuvių kalbos 1-gramo rinkinys]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
Dataset of 1-grams with frequencies extracted from the Delfi.lt corpus (~70 million words, period: March 2014 - November 2016). First, the corpus was split into sentences; then symbol analysis and analysis of intended structures made of symbols were performed. A dictionary of abbreviations was also used in order to preserve various abbreviations. Finally, 1-grams were generated, making 72 million entries in all. Frequencies of all entries were added to the dataset as well.
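The steps described above (sentence splitting, abbreviation handling, n-gram generation with frequencies) can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used for the dataset: the abbreviation list, the lowercasing, the punctuation stripping, and the choice not to let n-grams cross sentence boundaries are all assumptions.

```python
from collections import Counter

# Hypothetical abbreviation list; the dataset's actual dictionary is not reproduced here.
ABBREVIATIONS = {"pvz.", "kt.", "pan."}

def split_sentences(text):
    """Split on sentence-final punctuation, but never after a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

def clean(token):
    """Strip surrounding punctuation, keeping abbreviations intact (assumed behaviour)."""
    if token.lower() in ABBREVIATIONS:
        return token.lower()
    return token.strip(".,!?;:\"()").lower()

def count_ngrams(text, n):
    """Count n-grams within each sentence; n-grams do not cross sentence boundaries."""
    counts = Counter()
    for sentence in split_sentences(text):
        words = [w for w in (clean(t) for t in sentence) if w]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Made-up sample text, for illustration only.
text = "Vaisiai, pvz. obuoliai, yra skanūs. Obuoliai yra sveiki."
unigrams = count_ngrams(text, 1)
bigrams = count_ngrams(text, 2)
print(unigrams.most_common(3))
print(bigrams.most_common(1))
```

The same function covers longer n-grams by changing `n`, which is why a single parameterised routine is used here rather than one function per order.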

Lithuanian 2-gram dataset
[Lietuvių kalbos 2-gramų rinkinys]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
Dataset of 2-grams with frequencies extracted from the Delfi.lt corpus (~70 million words, period: March 2014 - November 2016). First, the corpus was split into sentences; then symbol analysis and analysis of intended structures made of symbols were performed. A dictionary of abbreviations was also used in order to preserve various abbreviations. Finally, 2-grams were generated, making 67 million entries in all. Frequencies of all entries were added to the dataset as well.

Lithuanian 3-gram dataset
[Lietuvių kalbos 3-gramų rinkinys]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
Dataset of 3-grams with frequencies extracted from the Delfi.lt corpus (~70 million words, period: March 2014 - November 2016). First, the corpus was split into sentences; then symbol analysis and analysis of intended structures made of symbols were performed. A dictionary of abbreviations was also used in order to preserve various abbreviations. Finally, 3-grams were generated, making 62 million entries in all. Frequencies of all entries were added to the dataset as well.

Lithuanian 4-gram dataset
[Lietuvių kalbos 4-gramų rinkinys]
dataset[2019][H004,N009]
Vilkaitė-Lozdienė, Laura
Baltijos pažangių technologijų institutas / Baltic Institute of Advanced Technology, 2019
Dataset of 4-grams with frequencies extracted from the Delfi.lt corpus (~70 million words, period: March 2014 - November 2016). First, the corpus was split into sentences; then symbol analysis and analysis of intended structures made of symbols were performed. A dictionary of abbreviations was also used in order to preserve various abbreviations. Finally, 4-grams were generated, making 57 million entries in all. Frequencies of all entries were added to the dataset as well.

Lithuanian font family AISTIKA
dataset[2022][H004,N009]
Vaičiulis, Jonas; Ralys, Danielius Algirdas
UAB "Fotonija", 2022-03-25
An original OpenType font designed and hinted in Lithuania. It complies with the ISO/IEC 10646 (Unicode) standard and contains the full set of ordinary and accented Lithuanian characters (e.g., į̃, ū̃, r̃, ė́, etc.). All the specifically Lithuanian accented letters are provided in the Private Use Area and are also available through pre-built compositional sequences. The font also contains the main signs of Lithuanian heraldry as well as transcription signs and transliteration marks for Arabic, Indian, and other languages. The font family is available in roman, bold, italic, and bold italic variants. UAB "Fotonija" grants the permissive CLARIN-LT LICENCE (PUB) for its font family Aistika. This statement has been issued by the general director of UAB "Fotonija", Arūnas Samuilis, and is valid from March 25, 2022 onwards.
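To illustrate what a compositional sequence means here, the sketch below uses Python's standard unicodedata module to show that letters such as ė́ and r̃ have no single precomposed Unicode code point and are therefore encoded as a base letter followed by combining marks. The helper name and the chosen examples are illustrative; the font's Private Use Area assignments are not documented in this description and are not shown.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# e with dot above (U+0117) followed by a combining acute accent (U+0301).
e_dot_acute = "\u0117\u0301"
# r followed by a combining tilde (U+0303).
r_tilde = "r\u0303"

for label, seq in [("e-dot-acute", e_dot_acute), ("r-tilde", r_tilde)]:
    nfc = unicodedata.normalize("NFC", seq)
    nfd = unicodedata.normalize("NFD", seq)
    # NFC cannot collapse these into one code point: Unicode defines no precomposed form.
    print(label, "NFC length:", len(nfc))
    for code_point, name in describe(nfd):
        print("   ", code_point, name)
```

Because normalization leaves these letters as multi-code-point sequences, correct rendering depends on the font positioning the marks well or substituting a dedicated glyph.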

Lithuanian keyboard for macOS users
dataset[2021][H004,N009]
Vytauto Didžiojo universitetas / Vytautas Magnus University, 2021-07-14
This keyboard driver gives easy access to Lithuanian letters via the conventional keyboard layout known as "Lithuanian letters instead of numbers". The essential new feature of this layout is its extensive use of the "dead key" technique to type the following single letters:
• Lithuanian accented (ą̃, ū́, m̃, ė́, etc.);
• Latvian;
• Estonian;
• Polish;
• French;
• German;
• Scandinavian;
• old Greek;
• Russian.
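The dead key technique can be sketched as a small state machine: pressing a dead key produces no output on its own, and instead modifies the next letter typed. The example below is a simplified simulation, not the driver's implementation; the key assignments in DEAD_KEYS and the function name are hypothetical.

```python
import unicodedata

# Hypothetical dead keys mapped to combining marks; the driver's real key
# assignments are not documented here.
DEAD_KEYS = {
    "~": "\u0303",  # combining tilde
    "'": "\u0301",  # combining acute accent
    "`": "\u0300",  # combining grave accent
}

def type_keys(keys):
    """Simulate typing: a dead key emits nothing and modifies the next key."""
    output = []
    pending = None  # combining mark waiting for its base letter
    for key in keys:
        if pending is None and key in DEAD_KEYS:
            pending = DEAD_KEYS[key]
            continue
        if pending is not None:
            # Attach the mark to the base letter, composing where Unicode allows.
            output.append(unicodedata.normalize("NFC", key + pending))
            pending = None
        else:
            output.append(key)
    # A dead key with no following letter is silently dropped in this sketch.
    return "".join(output)

print(describe := [f"U+{ord(c):04X}" for c in type_keys(["'", "e"])])
print([f"U+{ord(c):04X}" for c in type_keys(["~", "m"])])
```

In this sketch an acute dead key followed by e yields the single precomposed letter é, while a tilde dead key followed by m yields m plus a combining tilde, since no precomposed form exists.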
