TOWARDS CURBING CYBER-BULLYING IN MALAYSIA BY AUTHOR IDENTIFICATION OF IBAN AND KADAZANDUSUN OSN TEXT USING DEEP LEARNING
Keywords: Author identification, authorship analysis, stylometry, social media, cyberbullying
Abstract: Online Social Networks (OSNs) are frequently used to carry out cyber-criminal actions such as cyberbullying. As a developing country in Asia that keeps abreast of ICT advancement, Malaysia is no exception when it comes to cyberbullying. The Author Identification (AI) task plays a vital role in social media forensic (SMF) investigation, unveiling the genuine identity of an offender by analysing the text written in OSNs by the candidate culprits. AI on OSN text faces several challenges, including limited text length and informal language full of internet jargon and grammatical errors, all of which degrade AI performance in SMF. Traditional AI systems, designed to analyse long text documents, appear inadequate for analysing the writing style of short OSN text. N-gram features have been shown to represent authors' writing style efficiently for short text. However, representing N-grams with a traditional scheme such as TF-IDF yields sparse vectors that struggle to capture the semantic information in the text. Moreover, most AI work has been done on English, while indigenous languages have received far less attention. In East Malaysia, the dominant languages that transcend ethnic boundaries are Iban in Sarawak and KadazanDusun in Sabah, both of which are inherently under-resourced. This paper presents a proposed AI workflow for short OSN text in two Under-Resourced Languages (U-RLs), using Iban and KadazanDusun tweets, to help curb cyberbullying in Malaysia. It compares TF-IDF (sparse) and state-of-the-art (SoA) embedding-based (dense) feature representations to determine which best represents the stylistic features of the authors' writing. Word, character, and part-of-speech (POS) N-grams were extracted as features. The representation models were learned by different machine learning classifiers (Naïve Bayes, Random Forest, and SVM). A convolutional neural network (CNN), a SoA deep learning model for sentence classification, was tested against these traditional classifiers.
Results were observed for combinations of the different representation models and classifiers on three datasets (English, Iban, and KadazanDusun). The best result was achieved when the CNN learned embedding-based models with a combination of all features. KadazanDusun achieved the highest accuracy with 95.76%, followed by English with 95.02% and Iban with 94%.
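To make the sparse N-gram representation mentioned in the abstract concrete, the following is a minimal Python sketch of character and word N-gram extraction with TF-IDF weighting. It is an illustration under stated assumptions and not the paper's implementation: the function names, the toy posts, and the particular weighting variant (raw term frequency times log(N / df)) are all hypothetical choices made here for simplicity, and POS N-grams are omitted because they would require a tagger.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams of `text` (whitespace tokens)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs_as_ngrams):
    """Map each document (a list of n-grams) to a sparse {ngram: weight} dict.

    Assumed weighting: raw term frequency times idf = log(N / df).
    Other TF-IDF variants exist; this one is chosen only for simplicity.
    """
    n_docs = len(docs_as_ngrams)
    df = Counter()
    for grams in docs_as_ngrams:
        df.update(set(grams))  # count each n-gram once per document
    vectors = []
    for grams in docs_as_ngrams:
        tf = Counter(grams)
        vectors.append({g: tf[g] * math.log(n_docs / df[g]) for g in tf})
    return vectors


# Hypothetical toy posts, not taken from the paper's datasets.
posts = ["lol so funny", "so so tired", "lol ok"]
char_vecs = tfidf_vectors([char_ngrams(p, 2) for p in posts])
word_vecs = tfidf_vectors([word_ngrams(p, 1) for p in posts])

# Every distinct character bigram seen in any post forms the feature space.
vocabulary = {g for vec in char_vecs for g in vec}
```

Each short post touches only a small part of `vocabulary`, so most of its weights are implicitly zero. This is the sparsity that dense embedding-based representations are meant to avoid.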