A MALAY TEXT CORPUS ANALYSIS FOR SENTENCE COMPRESSION USING PATTERN-GROWTH METHOD

Authors

  • Suraya Alias, Faculty of Computing and Informatics, Universiti Malaysia Sabah, 88400 Kota Kinabalu, Sabah, Malaysia
  • Siti Khaotijah Mohammad, School of Computer Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia
  • Gan Keng Hoon, School of Computer Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia
  • Tan Tien Ping, School of Computer Sciences, Universiti Sains Malaysia, 11800 USM Pulau Pinang, Malaysia

DOI:

https://doi.org/10.11113/jt.v78.7413

Keywords:

Sentence Compression, Pattern-Growth, Text Summarization, Malay

Abstract

A text summary serves as a condensed representation of a written input source in which the important and salient information is kept. However, the condensed representation can suffer from a lack of semantics and coherence if the summary is produced verbatim from the input itself. Sentence Compression is a technique in which unimportant details are eliminated from a sentence while preserving the sentence's grammatical pattern. In this study, we analyzed our Malay Text Corpus to discover the rules and patterns by which human summarizers compress sentences and eliminate unimportant constituents when constructing a summary. A Pattern-Growth based model named Frequent Eliminated Pattern (FASPe) is introduced to represent the text as a set of sequences of adjacent words that are frequently eliminated across the document collection. From the rules obtained, some heuristic knowledge in Sentence Compression is presented, with confidence values as high as 85%, which can be used for further reference in the area of Text Summarization for the Malay language.
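
As a rough illustration of the idea behind mining frequently eliminated sequences of adjacent words, the following Python sketch may help. It is not the authors' FASPe implementation: the function names, the token alignment via `difflib`, the toy Malay sentence pairs, and the definition of confidence (documents in which a pattern is eliminated divided by documents whose source contains it) are all assumptions made for this example. Patterns are grown one adjacent word at a time, and only patterns that already meet the minimum support are extended.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def eliminated_spans(source, compressed):
    """Runs of adjacent source tokens with no match in the compressed sentence."""
    matcher = SequenceMatcher(a=source, b=compressed, autojunk=False)
    return [tuple(source[i1:i2])
            for tag, i1, i2, _, _ in matcher.get_opcodes()
            if tag in ("delete", "replace")]


def mine_eliminated_patterns(docs, min_support):
    """docs[d] lists the eliminated spans of document d; returns {pattern: support}.

    Support is the number of documents in which the pattern occurs as a
    contiguous part of some eliminated span (an assumption of this sketch).
    """
    # Each occurrence is (document id, span id, index just past the pattern).
    frontier = defaultdict(set)
    for d, spans in enumerate(docs):
        for s, span in enumerate(spans):
            for i, word in enumerate(span):
                frontier[(word,)].add((d, s, i + 1))

    frequent = {}
    while frontier:
        grown = defaultdict(set)
        for pattern, occurrences in frontier.items():
            support = len({d for d, _, _ in occurrences})
            if support < min_support:
                continue  # infrequent patterns are never extended
            frequent[pattern] = support
            # Grow the pattern by the next adjacent word at each occurrence.
            for d, s, end in occurrences:
                span = docs[d][s]
                if end < len(span):
                    grown[pattern + (span[end],)].add((d, s, end + 1))
        frontier = grown
    return frequent


def contains(tokens, pattern):
    """True if pattern occurs as a contiguous run inside tokens."""
    n = len(pattern)
    return any(tuple(tokens[i:i + n]) == pattern
               for i in range(len(tokens) - n + 1))


def confidence(pattern, sources, frequent):
    """Share of source documents containing the pattern in which it is eliminated."""
    seen = sum(contains(src, pattern) for src in sources)
    return frequent[pattern] / seen


# Invented toy data: (source sentence, human-compressed sentence), one per document.
pairs = [
    ("menurut beliau kerajaan akan membina sekolah baharu",
     "kerajaan akan membina sekolah baharu"),
    ("menurut beliau harga minyak naik semalam",
     "harga minyak naik"),
    ("menurut laporan itu banjir semakin pulih",
     "menurut laporan itu banjir semakin pulih"),
]
sources = [src.split() for src, _ in pairs]
docs = [eliminated_spans(src.split(), cmp.split()) for src, cmp in pairs]
frequent = mine_eliminated_patterns(docs, min_support=2)

for pattern, support in sorted(frequent.items()):
    print(" ".join(pattern), support, round(confidence(pattern, sources, frequent), 2))
```

On this toy data the two-word sequence "menurut beliau" is eliminated in both documents that contain it, whereas "menurut" alone is kept in the third document, so its confidence is lower. A real corpus would need proper tokenization and sentence alignment, which this sketch leaves out.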

Published

2016-07-25

Section

Science and Engineering

How to Cite

A MALAY TEXT CORPUS ANALYSIS FOR SENTENCE COMPRESSION USING PATTERN-GROWTH METHOD. (2016). Jurnal Teknologi, 78(8). https://doi.org/10.11113/jt.v78.7413