A CATEGORY CLASSIFICATION ALGORITHM FOR INDONESIAN AND MALAY NEWS DOCUMENTS
DOI:
https://doi.org/10.11113/jt.v78.9549
Keywords:
Text classification, Text Mining, Information Retrieval
Abstract
Text classification (TC) offers a better way to organize information because it supports understanding and interpretation of content. It assigns labels to groups of similar textual documents. However, TC research on Asian-language documents is limited compared with research on English documents, and it is scarcer still for news articles. In addition, little work addresses the classification of documents in morphologically similar languages such as Indonesian and Malay. The aim of this study is therefore to develop an integrated, generic TC algorithm that first identifies the language of a news document and then classifies its category. A top-n feature selection method is used to improve TC performance and to address two challenges of classifying online news corpora: the rapid growth of online news documents and high computational time. Experiments were conducted on 280 Indonesian and 280 Malay online news documents from 2014 to 2015. The method achieved accuracy of up to 95.63% for language identification and 97.5% for category classification. The category classifier performed best at n = 60%, with an average computational time of 35 seconds. These results indicate that the integrated generic TC algorithm has an advantage over manual classification and is suitable for Indonesian and Malay news classification.
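The two-stage idea described in the abstract (identify the language, then classify the category, each on a top-n% subset of features) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract names neither the ranking criterion nor the classifier, so document frequency and a minimal multinomial Naive Bayes are assumed here, and the function names and the four toy documents are hypothetical rather than drawn from the paper's corpus.

```python
import math
from collections import Counter


def top_n_percent_features(docs, n_percent):
    """Keep the top n% of terms, ranked by document frequency (assumed criterion)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Descending document frequency; ties broken alphabetically for determinism.
    ranked = sorted(df, key=lambda t: (-df[t], t))
    keep = max(1, math.ceil(len(ranked) * n_percent / 100))
    return set(ranked[:keep])


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing (assumed classifier)."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.doc_counts = Counter(labels)
        self.term_counts = {}
        for tokens, label in zip(docs, labels):
            counter = self.term_counts.setdefault(label, Counter())
            counter.update(t for t in tokens if t in vocab)
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.term_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for t in tokens:
                if t in self.vocab:  # terms outside the selected features are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_stage(docs, labels, n_percent):
    """Select features on the given documents, then fit one classifier on them."""
    vocab = top_n_percent_features(docs, n_percent)
    return NaiveBayes().fit(docs, labels, vocab)


def train_pipeline(samples, n_percent=60):
    """samples: list of (tokens, language, category) triples."""
    lang_clf = train_stage([s[0] for s in samples], [s[1] for s in samples], n_percent)
    cat_clfs = {}
    for lang in {s[1] for s in samples}:
        subset = [s for s in samples if s[1] == lang]
        cat_clfs[lang] = train_stage(
            [s[0] for s in subset], [s[2] for s in subset], n_percent
        )
    return lang_clf, cat_clfs


def classify(tokens, lang_clf, cat_clfs):
    """Stage 1: identify the language. Stage 2: classify with that language's model."""
    lang = lang_clf.predict(tokens)
    return lang, cat_clfs[lang].predict(tokens)


# Toy, hand-made documents (hypothetical; not the paper's data).
samples = [
    (["tim", "menang", "karena", "bola"], "id", "sport"),
    (["saham", "naik", "karena", "bisnis"], "id", "economy"),
    (["pasukan", "menang", "kerana", "bola"], "ms", "sport"),
    (["saham", "naik", "kerana", "perniagaan"], "ms", "economy"),
]
lang_clf, cat_clfs = train_pipeline(samples, n_percent=60)
print(classify(["bola", "menang", "kerana"], lang_clf, cat_clfs))  # ('ms', 'sport')
```

Training one category classifier per language keeps the second stage independent of the first, so either stage could be swapped for a different method. On a real corpus, the value of n would need tuning, as the abstract reports for n = 60%.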
License
Copyright of articles that appear in Jurnal Teknologi belongs exclusively to Penerbit Universiti Teknologi Malaysia (Penerbit UTM Press). This copyright covers the rights to reproduce the article, including reprints, electronic reproductions, or any other reproductions of similar nature.