FEATURE SELECTION AND MACHINE LEARNING CLASSIFICATION FOR MALWARE DETECTION

Authors

Ban Mohammed Khammas Network Engineering Department, College of Information Engineering, the University of Al-Nahrain, Baghdad, Iraq
Alireza Monemi Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Joseph Stephen Bassi Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Ismahani Ismail Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Sulaiman Mohd Nor Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia
Muhammad Nadzir Marsono Faculty of Electrical Engineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Malaysia

DOI:

https://doi.org/10.11113/jt.v77.3558

Keywords:

Malware detection, machine learning, feature selection, principal component analysis, support vector machine

Abstract

Malware is a computer security threat that can morph to evade traditional detection methods based on matching known signatures. Because new malware variants contain patterns similar to those in previously observed malware, machine learning techniques can be used to identify them. This work presents a comparative study of several feature selection methods combined with four different machine learning classifiers, in the context of static malware detection based on n-gram analysis. The results show that Principal Component Analysis (PCA) feature selection with Support Vector Machine (SVM) classification gives the best classification accuracy while using the smallest number of features.
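
As a rough illustration of the n-gram analysis mentioned above, the following minimal sketch counts overlapping byte n-grams and turns each sample into a relative-frequency vector of the kind a feature selection step and a classifier could consume. The abstract does not state the n-gram size, the vocabulary size, or the tooling used, so the function names (byte_ngrams, build_vocabulary, to_feature_vector), the choice n = 2, the document-frequency ranking, and the toy byte strings are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count overlapping byte n-grams in a binary blob."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def build_vocabulary(samples, n=2, top_k=4):
    """Pick the top_k n-grams by document frequency (assumed ranking rule)."""
    doc_freq = Counter()
    for sample in samples:
        # set() so each n-gram counts at most once per sample.
        doc_freq.update(set(byte_ngrams(sample, n)))
    # Sort by descending document frequency, then by bytes for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def to_feature_vector(data, vocabulary, n=2):
    """Relative frequency of each vocabulary n-gram within one sample."""
    counts = byte_ngrams(data, n)
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[gram] / total for gram in vocabulary]


# Toy byte strings standing in for executables (hypothetical data).
samples = [b"\x90\x90\xeb\xfe", b"\x90\x90\x90\xc3", b"\x55\x8b\xec\xc3"]
vocab = build_vocabulary(samples, n=2, top_k=4)
vectors = [to_feature_vector(s, vocab, n=2) for s in samples]
```

In a full pipeline, vectors like these would be reduced by a feature selection method such as PCA and then passed to a classifier such as an SVM; those stages are omitted here to keep the sketch dependency-free.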

JURNAL TEKNOLOGI