The Hybrid Feature Selection k-means Method for Arabic Webpage Classification

Authors

Hanan Alghamdi, Faculty of Computing, Universiti Teknologi Malaysia (UTM), 81310 UTM Johor Bahru, Johor, Malaysia
Ali Selamat, Faculty of Computing, Universiti Teknologi Malaysia (UTM), 81310 UTM Johor Bahru, Johor, Malaysia

DOI:

https://doi.org/10.11113/jt.v70.3518

Keywords:

Feature selection, Arabic, webpage classification, k-means

Abstract

The high dimensionality of the features found in the enormous amount of Arabic text available on the Internet is an important research problem in Web information retrieval. It reduces the accuracy of clustering algorithms and increases processing time. Selecting only the relevant features is the best solution. Therefore, in this paper, we propose a feature selection model that incorporates three different feature selection methods (chi-squared, mutual information, and term frequency-inverse document frequency) to build a hybrid feature selection model (Hybrid-FS) for k-means clustering. This model represents text data in a high-order structure consisting of three types of objects: terms, documents, and categories. We evaluate the model on a set of common Arabic online newspapers and assess the effect of using Hybrid-FS with standard k-means clustering. The experimental results show that the proposed method increases purity by 28% and lowers the runtime by 80% compared to the standard k-means algorithm. We conclude that the proposed hybrid feature selection model enhances the accuracy of the k-means algorithm and successfully produces coherent, compact, and well-separated clusters when applied to high-dimensional datasets.
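
The purity figure reported above can be made concrete with a small example. The following is a minimal sketch of the standard purity definition (the fraction of documents that belong to the majority category of their assigned cluster). The function name `purity` and the toy cluster assignments and category labels are illustrative assumptions, not taken from the paper or its dataset.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, true_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Group the true category labels by the cluster each document was assigned to.
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1

    # Each cluster contributes the size of its most common category.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: six documents, two clusters, two news categories.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["sport", "sport", "politics", "politics", "politics", "sport"]
score = purity(clusters, labels)  # (2 + 2) / 6
```

A purity of 1.0 means every cluster contains documents from a single category. Because purity tends to rise as the number of clusters grows, values are best compared between runs that use the same k.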

Published

2014-09-18

Issue

Vol. 70 No. 5: Special Issue in Science and Technology

Section

Science and Engineering

License

Copyright of articles that appear in Jurnal Teknologi belongs exclusively to Penerbit Universiti Teknologi Malaysia (Penerbit UTM Press). This copyright covers the rights to reproduce the article, including reprints, electronic reproductions, or any other reproductions of similar nature.