The Spam Detection Model for Web Forums using Text Mining Techniques

Jiyoung Woo (우지영)

Journal of Knowledge Information Technology and Systems
Abbr : JKITS
2012, 7(1), pp.159~166
Publisher : Korea Knowledge Information Technology Society
Research Area : Interdisciplinary Studies > Interdisciplinary Research
Published : February 29, 2012

Jiyoung Woo ¹

¹Korea University

Accredited

ABSTRACT

Spam in web discussion forums inconveniences users and lowers the value of a forum as an open source of user opinion. Because the importance of a posting is evaluated by the number of authors involved, spam also distorts opinion analysis by adding unnecessary data. We propose a model that automatically detects spam postings in a web forum. Using text mining techniques, we extract linguistically motivated text features from posting contents and then apply supervised learning to distinguish spam from normal postings. Significant features are derived through the learning process, and the automatic detection model is built on those features. To provide labeled data for building the model, four evaluators were asked to identify spam postings in advance. We adopted naive Bayes, support vector machine (SVM), and decision tree classifiers, which are known to perform well in data and text mining tasks. The resulting model extracts the text features that identify spam and automatically detects newly posted spam. We apply the proposed model to the YahooFinance-Walmart forum, which is the world's largest Walmart-related web forum.
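The supervised step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the postings below are invented toy data, the features are plain word counts rather than the linguistic features the paper extracts, only a naive Bayes classifier is shown (no SVM or decision tree), and the function names `tokenize`, `train_naive_bayes`, and `predict` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a posting and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(postings, labels):
    """Collect per-class document and word counts for multinomial naive Bayes."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(postings, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy postings standing in for evaluator-labeled forum data (assumption).
TRAIN_POSTINGS = [
    "Walmart earnings beat estimates this quarter",
    "I think the stock will dip after the dividend date",
    "Store traffic looked weak in my town last month",
    "Click here to win free cash now",
    "Free pills cheap click now",
    "Make money fast visit my site now",
]
TRAIN_LABELS = ["normal", "normal", "normal", "spam", "spam", "spam"]

MODEL = train_naive_bayes(TRAIN_POSTINGS, TRAIN_LABELS)

if __name__ == "__main__":
    for posting in ["win free money now click", "the stock earnings this quarter"]:
        print(posting, "->", predict(MODEL, posting))
```

In practice the labels would come from the human evaluators mentioned above, and a held-out set of postings would be needed to compare the three classifier families.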

KEYWORDS

Web forum, Social media, Spam, Posting quality, Text mining

This paper was written with support from the National Research Foundation of Korea.
