Email Spam Detection Using Linear Discriminant Analysis Based on Clustering
Abstract
The high volume of unwanted spam email annoys Internet users and leads to spam-related activity and financial losses. Spam detection is therefore essential to providing a secure electronic environment. Email spam databases usually have multimodal distributions with high overlap, which makes it difficult to separate spam from normal email. Moreover, the number of available labeled emails may be limited. This paper proposes a supervised feature extraction method, called cluster space linear discriminant analysis (CSLDA), to deal with these difficulties. CSLDA uses unlabeled testing samples in addition to labeled training samples to estimate the within-class and between-class scatter matrices. Motivated by the multimodal distribution of email spam databases, CSLDA clusters the unlabeled testing data so that they can be used in the learning phase of feature extraction. It does so without determining the labels of the testing samples, relying only on the relationship between training and testing samples obtained through clustering. The use of the Fisher criterion increases class discrimination. Moreover, the use of clustered unlabeled samples solves the small sample size problem and provides good performance on multimodal data. Experimental results on the Spambase dataset indicate that CSLDA outperforms several popular and state-of-the-art feature extraction and spam detection methods, especially in small sample size situations.
Keywords
Classification, Clustering, Discriminant analysis, Email spam
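For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of the standard within-class and between-class scatter matrices and the Fisher projection built from them. It is textbook Fisher LDA on labeled samples only, not the CSLDA procedure: the clustering of unlabeled testing samples is not shown because the abstract does not specify it. The function names `fisher_scatter` and `fisher_directions` and the ridge term `reg` are assumptions introduced for illustration; the ridge term is only a common stand-in for handling a singular within-class matrix when few samples are available.

```python
import numpy as np

def fisher_scatter(X, y):
    """Return the within-class (Sw) and between-class (Sb) scatter matrices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        # Spread of class c around its own mean.
        Sw += centered.T @ centered
        # Spread of the class mean around the global mean, weighted by class size.
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    return Sw, Sb

def fisher_directions(Sw, Sb, n_components=1, reg=1e-6):
    """Leading eigenvectors of (Sw + reg * I)^{-1} Sb.

    reg is a hypothetical ridge term (an assumption, not from the paper)
    that keeps the linear solve well-posed when Sw is singular.
    """
    d = Sw.shape[0]
    M = np.linalg.solve(Sw + reg * np.eye(d), Sb)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    return vecs[:, order[:n_components]].real

if __name__ == "__main__":
    # Synthetic two-class data, used only to exercise the functions.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(3.0, 1.0, (20, 3))])
    y = np.array([0] * 20 + [1] * 20)
    Sw, Sb = fisher_scatter(X, y)
    W = fisher_directions(Sw, Sb)
    print("projected data shape:", (X @ W).shape)
```

With two classes, Sb has rank one, so a single projection direction carries all of the Fisher discrimination; more components only become meaningful with more classes.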