Authorship identification from unstructured texts: A stylometric approach
Abstract
With the growing use of the Internet, a considerable volume of text is exchanged in cyberspace, where individuals can hide their true identities. Abuses that occur in online communities because of anonymous identities reduce trust in cyberspace and create many challenges. Hence, the importance of securing this space by monitoring user-generated content and identifying the authors of documents increases day by day. Author identification is the task of finding the author of an anonymous document. Since no standard corpus was available for the Persian language, we created a standard Persian corpus for authorship analysis applications in this language. In this paper, we propose an approach that models authors' writing styles using stylometric features extracted from their documents. The performance of author identification is further improved by pre-processing the documents and by reducing the dimensionality of the feature space through selecting the features with higher discriminative capability. The proposed approach is evaluated with standard data mining performance measures through experiments on benchmark datasets of standard documents in Persian and English. The effect of different factors on the accuracy of author identification is also investigated experimentally. The results of these experiments show that the proposed method outperforms the related state-of-the-art methods.
Keywords
Authorship identification, feature selection, classification method, writing styles, stylometric
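The pipeline outlined in the abstract (extract stylometric features, then keep the most discriminative ones) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the feature set, the tiny `FUNCTION_WORDS` list, the median-split discretization, and the use of information gain as the selection criterion are not taken from the paper, which does not specify these details in the abstract.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; a real system would use a fuller,
# language-specific list (for example, one for Persian and one for English).
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def stylometric_features(text):
    """Map a document to a small dict of illustrative style features."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    counts = Counter(words)
    feats = {
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(counts) / n_words,
        "comma_rate": text.count(",") / n_words,
    }
    for fw in FUNCTION_WORDS:
        feats["fw_" + fw] = counts[fw] / n_words
    return feats


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of a numeric feature after a binary median split."""
    threshold = sorted(values)[len(values) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    gain = entropy(labels)
    for part in (low, high):
        if part:
            gain -= len(part) / len(labels) * entropy(part)
    return gain


def select_features(docs, authors, k):
    """Return the names of the k features with the highest information gain."""
    rows = [stylometric_features(d) for d in docs]
    scores = {name: information_gain([r[name] for r in rows], authors)
              for name in rows[0]}
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    docs = ["one, two, three, four.", "five, six, seven, eight.",
            "one two three four.", "five six seven eight."]
    authors = ["A", "A", "B", "B"]
    print(select_features(docs, authors, 1))
```

In this toy corpus only the comma rate separates the two authors, so it is the feature the selection step ranks first. The selected features would then be passed to any standard classifier; that step is omitted here.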
The CSI Journal on Computer Science and Engineering, Vol. 17, No. 2, 2020