COMPARISON OF VSM, GVSM, AND LSI IN INFORMATION RETRIEVAL FOR INDONESIAN TEXT

Authors

  • Jasman Pardede, Informatics Department, Faculty of Industrial Technology, Institut Teknologi Nasional (Itenas), Bandung, Indonesia
  • Milda Gustiana Husada, Informatics Department, Faculty of Industrial Technology, Institut Teknologi Nasional (Itenas), Bandung, Indonesia

DOI:

https://doi.org/10.11113/jt.v78.8637

Keywords:

VSM, GVSM, LSI, performance, multithread

Abstract

The vector space model (VSM) is an information retrieval (IR) model that represents the query and the documents as n-dimensional vectors. The generalized vector space model (GVSM) is an extension of VSM that represents documents based on the similarity between the query and the minterm vector space of the document collection. The minterm vectors are defined by the terms in the query, so documents can be retrieved based on the meaning of the words in the query. Different documents, however, can contain semantically the same information. Latent semantic indexing (LSI) is a method used in IR systems to retrieve documents based on the overall meaning of the user's query rather than on the meaning of each individual word. LSI uses a matrix algebra technique called singular value decomposition (SVD). This study compares the performance of VSM, GVSM, and LSI implemented in an IR system that retrieves Indonesian-language documents stored as .pdf, .doc, and .docx files, using the Nazief and Adriani stemming algorithm. Each method was implemented both with and without threads. Threads are used in preprocessing, for reading each document in the collection and for stemming both the query and the documents. Retrieval performance was evaluated by response time, recall, precision, and F-measure. The results show that, for each method, .docx files are processed fastest, followed by .doc and then .pdf. For the same document collection, LSI has the shortest response time, followed by GVSM and then VSM. The average recall for VSM, GVSM, and LSI is 82.86%, 89.68%, and 84.93%, respectively. The average precision for VSM, GVSM, and LSI is 64.08%, 67.51%, and 62.08%, respectively. The average F-measure for VSM, GVSM, and LSI is 71.95%, 76.63%, and 71.02%, respectively. Multithreaded preprocessing improves the average response time of VSM, GVSM, and LSI by about 30.422%, 26.282%, and 31.821%, respectively.
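As a rough illustration of the VSM retrieval and the evaluation measures described in the abstract, the following Python sketch ranks a toy collection by cosine similarity and then computes precision, recall, and F-measure. It is a minimal sketch under stated assumptions, not the authors' implementation: the TF-IDF weighting, the `threshold` parameter, the function names, the sample tokens (assumed to be already stemmed), and the relevance judgments are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents.

    The weighting (raw tf times log(N/df) + 1) is an assumption made for
    illustration; the abstract does not state the weighting that was used.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, docs, threshold=0.0):
    """Return document indices with similarity above threshold, best first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q_vec = {t: tf * idf[t] for t, tf in Counter(query).items() if t in idf}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked if score > threshold]


def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall, and F-measure (harmonic mean of the two)."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical, already-stemmed Indonesian tokens (not data from the study).
docs = [
    ["sistem", "temu", "balik", "informasi"],
    ["model", "ruang", "vektor", "dokumen"],
    ["harga", "beras", "naik"],
    ["temu", "balik", "dokumen", "teks"],
]
query = ["temu", "balik", "dokumen"]
relevant = {0, 3}  # hypothetical relevance judgments for this query

retrieved = retrieve(query, docs)
precision, recall, f_measure = precision_recall_f(retrieved, relevant)
print("retrieved:", retrieved)
print("precision=%.2f recall=%.2f F=%.2f" % (precision, recall, f_measure))
```

In this toy run, three documents share at least one term with the query and two of them are judged relevant, so recall is 1.0 while precision is 2/3. When such measures are averaged over many queries, the average F-measure generally differs from the F-measure computed from the average precision and average recall, so the two should not be expected to coincide.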

Published

2016-05-16

How to Cite

COMPARISON OF VSM, GVSM, AND LSI IN INFORMATION RETRIEVAL FOR INDONESIAN TEXT. (2016). Jurnal Teknologi, 78(5-6). https://doi.org/10.11113/jt.v78.8637