Text document clustering using language-independent techniques
Date |
---|
2015 |
Clustering is a technique for grouping objects by their similarity. Document clustering is used for topic extraction, filtering, and fast information retrieval. However, because of their high dimensionality, documents are slow and computationally expensive to cluster. For highly inflective languages such as Lithuanian, the problem is even more severe. We investigate language-independent document clustering for the Lithuanian and Azeri languages. Documents are represented with the bag-of-words (BOW) model. We propose four feature selection models based on term frequencies in the corpora. The best results were achieved by the model in which features occurring fewer than a_min times or more than a_max times in the whole corpus are eliminated from the feature set as non-informative. The importance of each feature in the resulting subset is evaluated by its term frequency-inverse document frequency (TFIDF) weight. The results show that only 2% to 5.6% of the initial feature set is enough to obtain the best clustering results. Hierarchical and flat clustering algorithms based on document similarity were applied, and the precision of the results was evaluated. Cosine distance was selected as the best distance measure. Many experiments were also run with the well-known Euclidean distance, but this measure proved inappropriate because of the sparsity of the feature matrix. The best clustering results were obtained with the spherical k-means algorithm (F-score of approximately 0.8 for both languages).
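The pipeline described in the abstract (frequency-based feature selection, TFIDF weighting, cosine distance, spherical k-means) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy English corpus, the thresholds, the TFIDF variant (tf × log(N/df)), and the naive seeding with the first k documents are all assumptions made for the example; the abstract specifies none of them.

```python
import math
from collections import Counter


def select_features(docs, a_min, a_max):
    """Keep terms whose total corpus frequency lies in [a_min, a_max]."""
    totals = Counter(term for doc in docs for term in doc)
    return sorted(t for t, n in totals.items() if a_min <= n <= a_max)


def tfidf_vectors(docs, vocab):
    """One weight vector per document; assumed variant: tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors


def normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm else list(v)


def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def spherical_kmeans(vectors, k, iterations=20):
    """k-means on unit vectors: assign by cosine similarity, then
    recompute each centroid as the normalized sum of its members."""
    points = [normalize(v) for v in vectors]
    # Assumption: naive deterministic seeding with the first k documents.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(p, centroids[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[c] = normalize([sum(col) for col in zip(*members)])
    return labels


# Toy corpus of pre-tokenized documents (invented for illustration).
docs = [
    ["bank", "loan", "money", "the"],
    ["football", "goal", "match", "the"],
    ["bank", "money", "credit", "the"],
    ["goal", "match", "team", "the"],
]
vocab = select_features(docs, a_min=2, a_max=3)  # drops rare terms and "the"
vectors = tfidf_vectors(docs, vocab)
labels = spherical_kmeans(vectors, k=2)
print(vocab)   # ['bank', 'goal', 'match', 'money']
print(labels)  # [0, 1, 0, 1]
```

Because spherical k-means compares only the directions of unit-length vectors, document length and the many zero entries of a sparse feature matrix affect it less than they affect Euclidean distance, which is consistent with the abstract's preference for cosine distance.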