NEWS CLASSIFICATION WITH HUMAN ANNOTATORS: A CASE STUDY

Authors

Aini Fuddoly Department of Computer & Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
Jafreezal Jaafar Department of Computer & Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia
Norshuhani Zamin Department of Computer & Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia

DOI:

https://doi.org/10.11113/jt.v74.4829

Keywords:

Bracewell Algorithm, text classification, Indonesian news classification, category classification, topic identification, human annotator

Abstract

Classifying textual documents has become an increasingly active research field, driven by the growth of online news. Most news websites still categorise their articles manually, but the task becomes more strenuous as the volume of daily updates surges. This paper addresses the question of how text classification algorithms can take over this task from manual classification. A combined method using Bracewell's algorithm and a top-n method is demonstrated and tested on an Indonesian-language corpus. The experiment uses human evaluation as the benchmark. The results of the human evaluation are further investigated to understand how the annotators classify documents and which aspects of the method can be improved in the future. The results indicate that the method can outperform human annotators by 13% in terms of accuracy.
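
The comparison against a human benchmark described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `accuracy`, the category names, and every label below are hypothetical toy values chosen only to show how a classifier and human annotators might be scored against the same reference categories. The abstract does not specify how annotator scores were aggregated, so averaging per-annotator accuracy is an assumption here.

```python
def accuracy(predicted, gold):
    """Fraction of documents whose predicted category equals the reference category."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty document set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: reference categories for five news articles.
gold = ["politics", "sports", "economy", "sports", "technology"]

# Hypothetical output of an automatic classifier on the same articles.
system = ["politics", "sports", "economy", "sports", "economy"]

# Hypothetical labels assigned by two human annotators.
annotators = {
    "annotator_a": ["politics", "sports", "politics", "sports", "technology"],
    "annotator_b": ["economy", "sports", "economy", "politics", "technology"],
}

system_acc = accuracy(system, gold)
human_accs = {name: accuracy(labels, gold) for name, labels in annotators.items()}

# Assumption: the human benchmark is the mean of per-annotator accuracies.
human_mean = sum(human_accs.values()) / len(human_accs)

print(f"system accuracy:      {system_acc:.2f}")
for name, acc in human_accs.items():
    print(f"{name} accuracy: {acc:.2f}")
print(f"mean human accuracy:  {human_mean:.2f}")
print(f"difference:           {system_acc - human_mean:+.2f}")
```

The numbers produced by this sketch describe only the toy labels and say nothing about the paper's reported result.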

Published

2015-06-21

Issue

Vol. 74 No. 10: New Technologies in Mechanical Engineering

Section

Science and Engineering

License

Copyright of articles that appear in Jurnal Teknologi belongs exclusively to Penerbit Universiti Teknologi Malaysia (Penerbit UTM Press). This copyright covers the rights to reproduce the article, including reprints, electronic reproductions, or any other reproductions of similar nature.