Text document clustering using language-independent techniques

Ciganaitė, Greta; Mackutė-Varoneckienė, Aušra; Krilavičius, Tomas

Use this URL to cite this publication: https://hdl.handle.net/20.500.12259/51286

Type of publication

Conference theses in non-peer-reviewed publication (T2)

Author(s)

Author	Affiliation	Country
Ciganaitė, Greta	Faculty of Informatics	LT
Mackutė-Varoneckienė, Aušra	Department of Applied Informatics	LT

Title

Text document clustering using language-independent techniques [en]

Is part of

Data analysis methods for software systems : 7th international workshop, December 3-5, 2015, Druskininkai, Lithuania : [abstracts book]. Vilnius : Vilnius university, 2015

Date Issued

2015

Publisher

Vilnius : Vilnius university, 2015

Extent

p. 15-16

URI

http://www.mii.lt/datamss/files/liks_mii_drusk_2015_abstract_last_1.pdf
https://hdl.handle.net/20.500.12259/51286

Abstract (en)

Clustering is a technique for grouping objects by their similarity. Document clustering is used for topic extraction, filtering, and fast information retrieval. However, because of the high dimensionality of document representations, clustering documents is slow and computationally intensive. For highly inflective languages such as Lithuanian, the problem is even more pronounced. We investigate language-independent document clustering for the Lithuanian and Azeri languages. Documents are represented as a bag of words (BOW). We propose four feature selection models based on term frequencies in the corpora. The best results were achieved by the model in which features that occur fewer than a_min times or more than a_max times in the whole corpus are eliminated from the feature set as non-informative. The importance of the features in the resulting subset is evaluated by term frequency-inverse document frequency (TFIDF) weights. The results show that only 2% to 5.6% of the initial feature set is needed to obtain the best clustering results. Hierarchical and flat clustering algorithms based on document similarity were applied, and the precision of the results was evaluated. Cosine distance was selected as the best distance measure. Many experiments were also run with the well-known Euclidean distance, but this measure proved inappropriate because of the sparsity of the feature matrix. The best clustering results were achieved with the spherical k-means algorithm (F-score of approximately 0.8 for both languages).
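
The pipeline described in the abstract (frequency-based feature elimination, TFIDF weighting, spherical k-means with cosine similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the TFIDF variant, the threshold values, or the centroid initialisation, so the log(N/df) weighting, the thresholds a_min=2 and a_max=5, the seed documents, and all function names below are assumptions made for illustration. The toy English corpus is invented and is not data from the study.

```python
import math
from collections import Counter


def select_features(docs, a_min, a_max):
    """Keep terms whose total corpus frequency lies in [a_min, a_max]."""
    totals = Counter(tok for doc in docs for tok in doc)
    return sorted(t for t, n in totals.items() if a_min <= n <= a_max)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def tfidf_unit_vectors(docs, vocab):
    """TF-IDF weights (assumed variant: tf * log(N / df)), unit-normalised."""
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(normalize([tf[t] * math.log(n_docs / df[t]) for t in vocab]))
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(vectors, seeds, n_iter=20):
    """Spherical k-means; `seeds` are indices of the initial centroid documents."""
    centroids = [vectors[i] for i in seeds]
    labels = []
    for _ in range(n_iter):
        # Assign each document to the centroid with the highest cosine similarity.
        labels = [max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
                  for v in vectors]
        # Recompute each centroid as the normalised sum of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels


# Invented toy corpus: three finance-like and three sports-like documents.
docs = [d.split() for d in [
    "the bank loan interest rate",
    "the bank interest credit loan",
    "the loan credit rate bank",
    "the match goal team player",
    "the team player goal league",
    "the league match team goal zebra",
]]
vocab = select_features(docs, a_min=2, a_max=5)  # drops "the" (6x) and "zebra" (1x)
labels = spherical_kmeans(tfidf_unit_vectors(docs, vocab), seeds=[0, 3])
print(vocab)
print(labels)
```

On this toy corpus the two topic groups share no selected terms, so the first three documents end up in one cluster and the last three in the other. Because every vector is scaled to unit length, ranking by cosine similarity depends only on the direction of the sparse vectors, not on document length.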

Type of document

type::text::conference output::conference proceedings::conference paper

Language

English (en)

Coverage Spatial

Lithuania (LT)

ISBN (of the container)

9789986680581

Other Identifier(s)

VDU02-000019056