C-W-FCM: Constrained Weighted Fuzzy Clustering Algorithm with a Semi-Supervised Approach for Text Classification
Abstract
The emergence of the digital information era and the rapid development of the Internet are gradually moving information from paper to electronic form. This allows users to search news and books electronically, which makes information retrieval systems essential. This paper proposes a text classification system based on semi-supervised fuzzy clustering with a weighted feature vector. In the proposed method, after a preprocessing phase, a Genetic Algorithm together with the TF-IDF method is used for dimensionality reduction. The features with the highest discriminating power are selected, and the documents are then classified with the clustering algorithm C-W-FCM. The proposed clustering algorithm applies the Euclidean distance with a different weight for each dimension. The approach is evaluated on the Reuters dataset using prominent clustering validity criteria such as the Fukuyama and Sugeno (FS) index. A small number of documents are assumed to be labeled; these form the seeded set. Simulation results show that, on the evaluation criteria, the proposed approach outperforms conventional clustering algorithms by 27 to 33% in determining clusters. In addition, the proposed clustering algorithm increases the effectiveness of the system, especially when documents are highly similar to each other.
Keywords
Text classification, fuzzy clustering, semi-supervised, genetic algorithm
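
To make the idea of a Euclidean distance with per-dimension weights concrete, the following is a minimal sketch in Python. It is not the C-W-FCM update itself, whose exact form is not given in the abstract. It combines the standard fuzzy c-means membership formula with a weighted distance and, as an assumption, fixes the membership of seeded documents to their known cluster. The function names, the fuzzifier m = 2, and the placement of the weights inside the square root are all illustrative choices.

```python
import math


def weighted_distance(x, center, weights):
    """Euclidean distance with one non-negative weight per dimension.

    Assumption: each weight multiplies the squared difference of its dimension.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, center)))


def memberships(x, centers, weights, m=2.0):
    """Standard fuzzy c-means membership of x in each cluster.

    Only the distance differs from plain FCM: it is the weighted one above.
    """
    dists = [weighted_distance(x, c, weights) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def seeded_memberships(x, centers, weights, label=None, m=2.0):
    """Hypothetical seeding rule: a labeled document keeps a crisp membership."""
    if label is not None:
        return [1.0 if k == label else 0.0 for k in range(len(centers))]
    return memberships(x, centers, weights, m)


if __name__ == "__main__":
    centers = [[0.0, 0.0], [1.0, 1.0]]
    doc = [0.0, 1.0]
    # Equal weights: the document is equidistant from both centers.
    print(memberships(doc, centers, [1.0, 1.0]))
    # A larger weight on the first dimension pulls the document toward cluster 0.
    print(memberships(doc, centers, [4.0, 1.0]))
    # A seeded document labeled as cluster 1 ignores the distances.
    print(seeded_memberships(doc, centers, [4.0, 1.0], label=1))
```

With equal weights the example document is split evenly between the two clusters. Raising the weight of the first dimension makes the difference along that dimension count more, so the membership shifts toward the center that matches the document there.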