A Survey Of Challenges And Resolutions Of Mining Question-Answer Pairs From Internet Forum

Authors

Adekunle Isiaka Obasa SCRG Lab, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor Malaysia
Naomie Salim SCRG Lab, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor Malaysia
Yazan A. Al-Khassawneh SCRG Lab, Faculty of Computing, Universiti Teknologi Malaysia, 81310 UTM Johor Bahru, Johor Malaysia

DOI:

https://doi.org/10.11113/jt.v71.3865

Keywords:

Internet forum, question-answer pairs, lexical chasm, casual language

Abstract

An Internet forum is a web community that brings together people in different geographical locations. Forum members exchange ideas and expertise and, as a result, generate a huge amount of content on different topics daily. A good percentage of the human-generated content of Internet forums has been found to be question-answer (QA) pairs. These QA pairs are useful for automating question answering systems, and mining them has become an active research topic. Effective mining of QA pairs is hindered by a number of factors: the lexical chasm, which renders some Information Retrieval (IR) techniques less effective; casual language, which creates noisy data; and multiple authors, which bring about unfocused topics. This paper presents an extensive overview of the strategies and findings relevant to these three challenges. The survey reveals that researchers are adopting non-lexical features rather than lexical ones to resolve the issue of data sparseness, and that noise is mostly controlled using conventional dictionaries rather than domain-specific dictionaries.

Published

2014-12-30

Issue

Vol. 71 No. 5: Special Issue on Fifth International Graduate Conference on Engineering, Science & Humanities

Section

Science and Engineering

License

Copyright of articles that appear in Jurnal Teknologi belongs exclusively to Penerbit Universiti Teknologi Malaysia (Penerbit UTM Press). This copyright covers the rights to reproduce the article, including reprints, electronic reproductions, or any other reproductions of similar nature.