TACKLING IMBALANCED CLASS IN SOFTWARE DEFECT PREDICTION USING TWO-STEP CLUSTER BASED RANDOM UNDERSAMPLING AND STACKING TECHNIQUE

Adi Wijaya; Romi Satria Wahono

doi:10.11113/jt.v79.11874

Authors

Adi Wijaya Informatics Engineering Department, MH Thamrin University, Jakarta, Indonesia
Romi Satria Wahono Faculty of Computer Science, Dian Nuswantoro University, Semarang, Indonesia

DOI:

https://doi.org/10.11113/jt.v79.11874

Keywords:

Software defect prediction, two-step cluster, random undersampling, ensemble learning, stacking technique

Abstract

The cost of finding and correcting software defects is high and increases exponentially during software development. Software defect prediction (SDP) can be used in the early phases to reduce testing and maintenance time, cost and effort, thus improving the quality of the software. SDP performance is poor because of class imbalance in the datasets, where defective modules are a minority compared to defect-free ones. In this study, we propose combining random undersampling based on two-step cluster with a stacking technique to improve the accuracy of SDP. In the stacking technique, Decision Tree, Logistic Regression and k-Nearest Neighbor are used as base learners, while Naive Bayes is used as the stacking model learner. The proposed method is evaluated using nine datasets from the NASA Metrics Data Program repository, with area under the curve (AUC) as the main evaluation measure. Results indicate that the proposed method yields excellent performance (AUC > 0.9) for 5 of the 9 datasets. Compared with prior research, the proposed method ranks first for 3 datasets, second for 5 datasets and third for only 1 dataset in terms of AUC. Therefore, it can be concluded that the proposed method gives promising prediction performance for most datasets compared with prior research.
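The undersampling step described above can be illustrated with a minimal sketch. It assumes the defect-free (majority) modules have already been assigned cluster labels by some clustering step, and that rows are drawn from each cluster in proportion to its size. The abstract does not state either detail, so the function name `cluster_based_undersample`, its parameters, and the proportional-quota rule are hypothetical choices for illustration, not the authors' implementation.

```python
import random
from collections import defaultdict


def cluster_based_undersample(majority, cluster_ids, n_keep, seed=0):
    """Randomly keep about n_keep majority rows, spread across clusters.

    Assumption (not from the paper): each cluster contributes rows in
    proportion to its share of the majority class, with at least one row
    per cluster, so the result may differ slightly from n_keep.
    """
    if len(majority) != len(cluster_ids):
        raise ValueError("majority and cluster_ids must have equal length")

    total = len(majority)
    if n_keep >= total:
        # Nothing to remove: the majority class is already small enough.
        return list(majority)

    # Group majority rows by their (precomputed) cluster label.
    by_cluster = defaultdict(list)
    for row, cid in zip(majority, cluster_ids):
        by_cluster[cid].append(row)

    rng = random.Random(seed)
    kept = []
    for cid in sorted(by_cluster):
        rows = by_cluster[cid]
        # Proportional quota for this cluster, never more than it holds.
        quota = max(1, round(n_keep * len(rows) / total))
        kept.extend(rng.sample(rows, min(quota, len(rows))))
    return kept
```

For example, with 90 defect-free modules split 60/30 across two clusters and 30 defective modules, calling `cluster_based_undersample(majority, cluster_ids, 30)` keeps 20 rows from the first cluster and 10 from the second. The balanced set would then be passed to the stacking classifier.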
