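Corpora and Their Derived Products for the Teaching, Learning, and Research of Lithuanian: A Teaching Aid (Tekstynai ir jų išvestiniai produktai lietuvių kalbos mokymui(si) bei tyrimams : mokomoji priemonė)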
| Date |
|---|
| 2022 |
This publication presents new electronic resources for learning Lithuanian, prepared in 2017–2020 as part of the project “Užsienio baltistikos centrų ir Lietuvos mokslo ir studijų institucijų bendradarbiavimo skatinimas” (Lithuanian Academic Scheme for International Cooperation in Baltic Studies). The Pedagogic Corpus of Lithuanian (Mokomasis lietuvių kalbos tekstynas), the Learner Corpus (Mokinių tekstynas), and the Lexical Database of Lithuanian (Mokomasis lietuvių kalbos vartosenos leksikonas) are publicly available on the portal https://kalbu.vdu.lt/. The publication describes in detail how these resources were prepared, how they could be used in teaching and learning by native and non-native speakers, and what research could be carried out with them.
This user manual presents three newly developed digital resources that can be used for research and for teaching Lithuanian as a foreign language: the Pedagogic Corpus of Lithuanian (hereinafter ‘the pedagogic corpus’), the Lithuanian Learner Corpus (LLC), and the corpus-driven Lexical Database of Lithuanian (hereinafter ‘the lexical database’). These resources were created in 2017–2019 within the framework of the EU-funded project “Lithuanian Academic Scheme for International Cooperation in Baltic Studies” and are publicly available at https://kalbu.vdu.lt/. They primarily target teachers, researchers, and learners of Lithuanian as a foreign language but may also be relevant to anyone else interested in Lithuanian. They were developed to represent authentic Lithuanian as used by native speakers (in the pedagogic corpus and the lexical database) and by non-native speakers (in the LLC); until now, such resources have not been publicly available. The new data in the two corpora, together with the integrated automated search, opens up new possibilities for (learner) language research and language teaching. In the corpus-driven lexical database, users will find usage information for 3,700 lexical items (words and multi-word units).

The pedagogic corpus (https://kalbu.vdu.lt/mokymosi-priemones/mokomasis-tekstynas/) contains authentic Lithuanian texts selected according to criteria relevant to language learners at different proficiency levels. All the texts are classified into levels A1, A2, B1, and B2 according to the Common European Framework of Reference for Languages (CEFR).
The corpus represents both written data and orthographically transcribed spoken data: 111,000 words for levels A1–A2 (96,000 words in the written component and 15,000 words in the spoken component) and 558,000 words for levels B1–B2 (523,000 words in the written component and 35,000 words in the spoken component). In total, the corpus contains 669,000 words. The spoken component consists of natural conversations recorded in different settings, covering different communicative situations and various social roles of the interlocutors. Some of the texts were taken from the Corpus of Spoken Lithuanian (see http://sakytinistekstynas.vdu.lt/). The written component consists of two types of texts: (1) texts collected from coursebooks for learners of Lithuanian as a foreign language (about 17% of the entire written subcorpus), and (2) texts collected from popular scientific and fiction books, news portals, public signs, instructions, announcements, documents, etc. (about 83% of the entire written subcorpus). Coursebook and non-coursebook texts are classified into 29 genres (dialogues, narratives, instructional texts, etc.) and fall into four groups according to communication goals (informative, popular scientific, appellative, and imaginative).

In the LLC, just as in the pedagogic corpus, the texts of Lithuanian language learners are classified into levels A1, A2, B1, and B2 (based on the CEFR). The texts are assigned to these four proficiency levels on the basis of diagnostic placement tests or the number of Lithuanian language contact hours received in formal education.
The LLC (https://kalbu.vdu.lt/mokymosi-priemones/mokiniu-tekstynas/) contains both written and spoken language: 103,148 words in A1-level texts (81,339 written and 21,809 spoken); 99,359 words in A2-level texts (85,158 written and 14,201 spoken); 64,400 words in B1-level texts (39,558 written and 24,842 spoken); and 51,734 words in B2-level texts (24,211 written and 27,523 spoken). In total, this corpus comprises 318,641 tokens. The LLC represents a large variety of text types (essays, narratives, argumentative texts, letters, emails, postcards, etc.). In addition, it provides information on the linguistic background of the learner, the learning task for which a text was produced, and the learning context. The corpus is normalised and annotated for errors in grammar, lexis, syntax, and pronunciation.

The lexical database (https://kalbu.vdu.lt/mokymosi-priemones/leksikonas/) was developed on the basis of the written subcorpus of the pedagogic corpus, which consists of approximately 620,000 words. This small, monolingual, morphologically annotated corpus was used to compile the list of headwords for the lexicon and to collect word usage information. The headword list consists of two categories: (1) approximately 700 of the most frequent words, which are used at least 100 times in all four levels represented in the pedagogic corpus (from level A1 to B2); and (2) derivatives, compounds, and multi-word units related to these 700 most common words, which make up 3,000 lexical items. In total, the lexicon thus contains 3,700 lexical items, including individual words and multi-word units (such as compound names, fixed expressions, and sayings). The headword list includes lexical items functioning as verbs, nouns, adjectives, and adverbs; the category of fixed expressions and sayings also includes interjections.
Two types of entries are used in the lexical database to represent information obtained from the corpus data. Words with a frequency of 100 or above receive a full-record entry, which includes pronunciation, inflections, corpus patterns, examples of use, and derivatives related to different word meanings. Derivatives and multi-word units related to the most frequent vocabulary receive a short-record entry, which contains pronunciation, inflections, examples of use, and derivatives. The large number of word patterns, examples, and lexical relations gives language teachers and learners valuable information for improving language production skills.

Corpora and corpus-based reference sources have already become indispensable sources of authentic language use, especially in the research and teaching of widely used languages such as English. Following this relatively recent tradition, we aim to show how the newly developed resources for Lithuanian can contribute to data-driven (or evidence-based) research as well as to teaching materials and curricula. In this manual, we suggest how the new databases can be used to find new answers to some well-known issues and to raise new questions prompted by trends in language use that escape intuition but become apparent through data-driven language analysis. We also offer several types of data-driven language teaching activities, which can be extended or supplemented with other focused activities. In this book, we present the structure, nature, and main features of the corpora and the lexicon, and we review the principles of their development, their relevance, and their practical application.
When using any empirical or lexicographic resource, it is important to know the principles behind it. This book therefore explains the methods used to collect and systematize the empirical material, discusses the search parameters and their rationale, and provides examples and suggestions as to how the data available in the resources can be researched and used in language teaching. We discuss each resource in detail in the following order: the pedagogic corpus is presented in Chapter 1, the lexical database in Chapter 2, and the learner corpus in Chapter 3.