TOWARDS CURBING CYBER-BULLYING IN MALAYSIA BY AUTHOR IDENTIFICATION OF IBAN AND KADAZANDUSUN OSN TEXT USING DEEP LEARNING

Authors

Nursyahirah Tarmizi Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia
Suhaila Saee Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia
Dayang Hanani Abang Ibrahim Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 14300, Kota Samarahan, Sarawak, Malaysia

DOI:

https://doi.org/10.11113/aej.v13.19171

Keywords:

Author identification, authorship analysis, stylometry, social media, cyberbully

Abstract

Online Social Networks (OSNs) are frequently used to carry out cybercriminal actions such as cyberbullying. As a developing country in Asia that keeps abreast of ICT advancement, Malaysia is no exception. Author Identification (AI) plays a vital role in social media forensic (SMF) investigation, where it helps unveil the genuine identity of an offender by analysing the text written on OSNs by the candidate culprits. AI on OSN text faces several challenges, including limited text length and informal language full of internet jargon and grammatical errors, which further affect AI performance in SMF. Traditional AI systems designed for long text documents appear inadequate for analysing the writing style of short OSN text. N-gram features have been shown to represent authors' writing style efficiently for short text. However, traditional representations of N-grams such as Tf-Idf are sparse and struggle to capture the semantic information in text. Moreover, most AI work has been done on English, while indigenous languages have received less attention. In East Malaysia, the dominant languages that transcend ethnic boundaries are Iban in Sarawak and KadazanDusun in Sabah, both of which are inherently under-resourced. This paper presents a proposed AI workflow for short OSN text in two Under-Resourced Languages (U-RLs), using Iban and KadazanDusun tweets, to help curb cyberbullying in Malaysia. It compares Tf-Idf (sparse) and state-of-the-art (SoA) embedding-based (dense) feature representations to determine which best captures the stylistic features of the authors' writing. Word, character, and part-of-speech (POS) N-grams were extracted as features. The representation models were learned by machine learning classifiers (Naïve Bayes, Random Forest, and SVM). A convolutional neural network (CNN), an SoA deep learning model for sentence classification, was tested against these traditional classifiers.
Results were observed for different combinations of representation models and classifiers on three datasets (English, Iban, and KadazanDusun). The best result was achieved when the CNN learned embedding-based models with a combination of all features: KadazanDusun achieved the highest accuracy at 95.76%, followed by English at 95.02% and Iban at 94%.
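
To make the sparse side of this comparison concrete, the following is a minimal sketch of extracting character and word N-grams from short posts and weighting them with Tf-Idf. It is not the authors' implementation: the function names, the N-gram sizes (3 for characters, 2 for words), the smoothed idf formula, and the toy English posts are all assumptions chosen for illustration, and POS N-grams are omitted because they would require a tagger.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n=2):
    """Overlapping word n-grams, using whitespace tokenisation."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(docs_features):
    """Turn per-document feature lists into Tf-Idf vectors over a shared vocabulary."""
    n_docs = len(docs_features)
    doc_freq = Counter()
    for feats in docs_features:
        doc_freq.update(set(feats))  # count each feature once per document
    vocab = sorted(doc_freq)
    vectors = []
    for feats in docs_features:
        tf = Counter(feats)
        total = len(feats) or 1
        # Smoothed idf (an assumed variant): log((1 + N) / (1 + df)) + 1
        vectors.append([
            (tf[t] / total) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in vocab
        ])
    return vocab, vectors

def sparsity(vectors):
    """Fraction of zero entries across all vectors."""
    cells = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0.0)
    return zeros / cells if cells else 0.0

# Invented placeholder posts, not taken from the paper's datasets.
toy_posts = [
    "see you at the market later",
    "lol cant wait 4 the weekend!!",
    "meeting moved to 3pm, pls confirm",
]

# Prefix features so character and word n-grams never collide in the vocabulary.
features = [
    ["c:" + g for g in char_ngrams(p, 3)] + ["w:" + g for g in word_ngrams(p, 2)]
    for p in toy_posts
]
vocab, vectors = tfidf_vectors(features)
print(len(vocab), "features; share of zero entries:", round(sparsity(vectors), 2))
```

Even on three short posts, most vocabulary entries occur in only one post, so most vector entries are zero. This is the sparsity that the dense, embedding-based representations in the abstract are meant to avoid.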


Asean Engineering Journal
