SYNTHETIC MULTIVARIATE DATA GENERATION PROCEDURE WITH VARIOUS OUTLIER SCENARIOS USING R PROGRAMMING LANGUAGE

Authors

  • Sharifah Sakinah Syed Abd Mutalib (a) Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300 Gambang, Kuantan, Pahang, Malaysia; (b) Faculty of Computer, Media and Technology Management, University College TATI, Jalan Panchur, Telok Kalong, 24000 Kemaman, Terengganu, Malaysia. https://orcid.org/0000-0003-1312-6158
  • Siti Zanariah Satari Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300 Gambang, Kuantan, Pahang, Malaysia
  • Wan Nur Syahidah Wan Yusoff Centre for Mathematical Sciences, College of Computing and Applied Sciences, Universiti Malaysia Pahang, 26300 Gambang, Kuantan, Pahang, Malaysia

DOI:

https://doi.org/10.11113/jurnalteknologi.v84.17900

Keywords:

Data generation procedure, multivariate data, outlier generating model, Chernoff faces, scatterplot 3D, R

Abstract

A synthetic data generation procedure generates data from either a statistical or a mathematical model. Such procedures are used in simulation studies to compare the performance of statistical methods or to propose a new statistical method for a specific distribution. This study formulates a synthetic multivariate data generation procedure with various outlier scenarios using R. An outlier generating model is used to generate multivariate data that contain outliers, and the data generation procedures for the various outlier scenarios in R are explained. Three outlier scenarios are produced, and graphical representations of these scenarios using 3D scatterplots and Chernoff faces are shown. The graphical representations show that, in Outlier Scenario 1, as the distance between outliers and inliers increases through shifting the mean, the outliers and inliers become completely separated. The same pattern appears in Outlier Scenario 2 when the distance between outliers and inliers increases through shifting the covariance. In Outlier Scenario 3, when both of these shift values increase, the separation of outliers and inliers becomes more apparent. The data generation procedure in this study will continue to be used in other applications, such as identifying outliers with a clustering method.
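
The three scenarios described in the abstract can be illustrated with a minimal R sketch. It assumes a simple contamination mixture in which inliers come from a standard multivariate normal distribution and outliers come from a normal distribution with a shifted mean, an inflated covariance, or both. The function name generate_outlier_data, its arguments (n, p, eps, mu_shift, cov_scale), and the mixture form itself are hypothetical and chosen for illustration; they are not taken from the paper and may differ from the published procedure.

```r
# Hypothetical helper: names, defaults, and the mixture form are assumptions,
# not the procedure published in the paper.
generate_outlier_data <- function(n = 100, p = 3, eps = 0.1,
                                  mu_shift = 0, cov_scale = 1) {
  n_out <- round(eps * n)  # number of outliers
  n_in  <- n - n_out       # number of inliers

  # Inliers: N_p(0, I), i.e. independent standard normal coordinates
  inliers <- matrix(rnorm(n_in * p), nrow = n_in, ncol = p)

  # Outliers: N_p(mu_shift * 1, cov_scale * I)
  outliers <- matrix(rnorm(n_out * p, mean = mu_shift, sd = sqrt(cov_scale)),
                     nrow = n_out, ncol = p)

  # Stack both groups and flag which rows are outliers
  data.frame(rbind(inliers, outliers),
             outlier = rep(c(FALSE, TRUE), times = c(n_in, n_out)))
}

set.seed(1)
s1 <- generate_outlier_data(mu_shift = 5)                 # mean shift only
s2 <- generate_outlier_data(cov_scale = 9)                # covariance shift only
s3 <- generate_outlier_data(mu_shift = 5, cov_scale = 9)  # both shifts
```

Under these assumptions, increasing mu_shift moves the outlier cluster away from the inliers, while increasing cov_scale spreads it out around the same centre.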

Published

2022-03-31

Section

Science and Engineering

How to Cite

SYNTHETIC MULTIVARIATE DATA GENERATION PROCEDURE WITH VARIOUS OUTLIER SCENARIOS USING R PROGRAMMING LANGUAGE. (2022). Jurnal Teknologi, 84(3), 89-101. https://doi.org/10.11113/jurnalteknologi.v84.17900