FIREFLYCLUST: AN AUTOMATED HIERARCHICAL TEXT CLUSTERING APPROACH

Authors

  • Athraa Jasim Mohammed, Information Technology Center, University of Technology, Baghdad, Iraq
  • Yuhanis Yusof, School of Computing, College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, Malaysia
  • Husniza Husni, School of Computing, College of Arts and Sciences, Universiti Utara Malaysia, 06010 Sintok, Kedah, Malaysia

DOI:

https://doi.org/10.11113/jt.v79.5408

Keywords:

Firefly algorithm, clustering, data mining, swarm intelligence

Abstract

Text clustering is a text mining task employed in search engines. Discovering the optimal number of clusters for a dataset or repository is a challenging problem. Various clustering algorithms have been reported in the literature, but most of them rely on a pre-defined number of clusters, k. In this study, a variant of the Firefly algorithm, termed FireflyClust, is proposed to automatically cluster text documents in a hierarchical manner. The proposed method operates in five phases: data pre-processing, clustering, item re-location, cluster selection and cluster refinement. Experiments were conducted with different threshold values. Results on the TREC collections TR11, TR12, TR23 and TR45 showed that FireflyClust is a better approach than Bisect K-means, hybrid Bisect K-means and the Practical General Stochastic Clustering Method. These results may guide the development of better information retrieval engines for the dynamic, fast-growing big data era.
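
For readers unfamiliar with the underlying metaheuristic, the following is a minimal sketch of one update step of the standard Firefly algorithm, in which a firefly moves toward a brighter one with attractiveness that decays with distance. It is not the FireflyClust variant described in the abstract, whose details are not given here. The function names and the parameters `beta0`, `gamma` and `alpha` are illustrative assumptions following the usual textbook formulation.

```python
import math
import random


def attractiveness(beta0, gamma, distance):
    """Standard Firefly attractiveness: beta0 * exp(-gamma * r^2)."""
    return beta0 * math.exp(-gamma * distance ** 2)


def move_towards(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.0, rng=None):
    """Move firefly x_i one step toward a brighter firefly x_j.

    Assumed textbook update, not the paper's variant:
    x_i <- x_i + beta * (x_j - x_i) + alpha * (rand - 0.5)
    """
    rng = rng or random.Random(0)
    r = math.dist(x_i, x_j)  # Euclidean distance between the two fireflies
    beta = attractiveness(beta0, gamma, r)
    return [
        a + beta * (b - a) + alpha * (rng.random() - 0.5)
        for a, b in zip(x_i, x_j)
    ]
```

With `gamma=0.0` and `alpha=0.0`, attractiveness does not decay and there is no random term, so the firefly moves all the way onto the brighter one. Larger `gamma` values make distant fireflies nearly invisible to each other, which is what lets the swarm split into local groups.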

Published

2017-06-21

Section

Science and Engineering

How to Cite

FIREFLYCLUST: AN AUTOMATED HIERARCHICAL TEXT CLUSTERING APPROACH. (2017). Jurnal Teknologi (Sciences & Engineering), 79(5). https://doi.org/10.11113/jt.v79.5408