• Nursyahirah Tarmizi, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia
• Suhaila Saee, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia
• Dayang Hanani Abang Ibrahim, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia



Author identification, authorship analysis, stylometry, social media, cyberbully


Online social networks (OSNs) are frequently used to carry out cybercriminal activities such as cyberbullying. As a developing country in Asia that keeps abreast of ICT advancement, Malaysia is no exception. The author identification (AI) task plays a vital role in social media forensic (SMF) investigation: it aims to unveil the genuine identity of an offender by analysing the text that candidate culprits have written on OSNs. AI on OSN text faces several challenges, including limited text length and informal language full of internet jargon and grammatical errors, which degrade AI performance in SMF. Traditional AI systems designed for long documents seem inadequate for analysing the writing style of short OSN texts. N-gram features have been shown to represent authors' writing style efficiently for short text. However, traditional representations of N-grams such as TF-IDF are sparse and make it difficult to capture the semantic information in the text. Moreover, most AI work has been done on English, and indigenous languages have received less attention. In East Malaysia, the dominant languages that transcend ethnic boundaries are Iban in Sarawak and KadazanDusun in Sabah, both of which are inherently under-resourced. This paper presents a proposed AI workflow for short OSN text in two under-resourced languages (U-RLs), using Iban and KadazanDusun tweets, to help curb cyberbullying in Malaysia. It compares TF-IDF (sparse) and state-of-the-art (SoA) embedding-based (dense) feature representations to determine which best represents the stylistic features of the authors' writing. Word, character, and part-of-speech (POS) N-grams were extracted as features. The representation models were learned by different machine learning classifiers (Naïve Bayes, Random Forest, and SVM). A convolutional neural network (CNN), a SoA deep learning model for sentence classification, was tested against these traditional classifiers.
Results were obtained by combining different representation models and classifiers on three datasets (English, Iban, and KadazanDusun). The best result was achieved when the CNN learned embedding-based models with a combination of all features. KadazanDusun achieved the highest accuracy at 95.76%, followed by English at 95.02% and Iban at 94%.
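The sparse side of the comparison described above can be illustrated with a minimal sketch of character and word N-gram extraction weighted by TF-IDF. It is not the paper's implementation: the function names, the toy English messages, the author labels, and the smoothed IDF formula are all assumptions made for illustration, and a simple cosine nearest-profile rule stands in for the Naïve Bayes, Random Forest, SVM, and CNN classifiers named in the abstract.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Overlapping word n-grams, splitting on whitespace."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def fit_idf(docs_ngrams):
    """Smoothed inverse document frequency (an assumed weighting variant)."""
    n_docs = len(docs_ngrams)
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}


def tfidf_vector(grams, idf):
    """Sparse, L2-normalised TF-IDF vector; unseen n-grams are ignored."""
    tf = Counter(g for g in grams if g in idf)
    vec = {g: count * idf[g] for g, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())


def build_profiles(samples, n=3):
    """Fit IDF on individual messages, then build one vector per author."""
    docs = [(author, char_ngrams(text, n)) for author, text in samples]
    idf = fit_idf([grams for _, grams in docs])
    merged = {}
    for author, grams in docs:
        merged.setdefault(author, []).extend(grams)
    profiles = {a: tfidf_vector(g, idf) for a, g in merged.items()}
    return profiles, idf


def predict_author(text, profiles, idf, n=3):
    """Stand-in classifier: pick the author whose profile is most similar."""
    vec = tfidf_vector(char_ngrams(text, n), idf)
    return max(profiles, key=lambda author: cosine(vec, profiles[author]))


# Hypothetical toy messages; the authors "a" and "b" are invented labels.
samples = [
    ("a", "lol that is sooo funny lol"),
    ("a", "lol i cant even lol"),
    ("b", "I believe that is incorrect."),
    ("b", "I believe we should reconsider."),
]
profiles, idf = build_profiles(samples)
print(predict_author("lol that was funny", profiles, idf))
print(predict_author("I believe this is wrong.", profiles, idf))
```

Each vector here is a dictionary holding only the N-grams that occur, which is the sparsity the abstract refers to: two messages sharing no N-gram have zero similarity regardless of meaning. The dense embedding-based representations compared in the paper are meant to address that limitation and are not sketched here.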





