A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

JeeHee Yuk (육지희); Min Song (송민)

doi:10.3743/KOSIM.2018.35.2.063

Journal of the Korean Society for Information Management
Abbr : JKOSIM
2018, 35(2), pp.63~88
Publisher : Korean Society for Information Management
Research Area : Interdisciplinary Studies > Library and Information Science
Received : May 20, 2018
Accepted : June 19, 2018
Published : June 30, 2018

JeeHee Yuk ¹, Min Song ²

¹Department of Library and Information Science, Graduate School, Yonsei University
²Yonsei University

Accredited

ABSTRACT

This study evaluated how classification performance varies with the feature selection method, the size of the feature corpus, and the classification algorithm. Two feature selection methods were compared: the LDA topic model and Doc2Vec, a deep learning method based on word embedding. To find the feature corpus that yields high classification performance, experiments were run on feature corpora composed from different locations within the document and adjusted to different sizes. The deep learning experiments additionally evaluated training frequency and the information specifically considered for context inference. For these experiments, the study constructed a biomedical document dataset, Disease-35083, which consists of biomedical scholarly documents provided by PMC and categorized by disease category. The study identifies which type and size of feature corpus produce the highest performance, and suggests feature corpora that can be extended to specific features, based on their efficiency in training time. It also compares deep learning with existing methods and suggests an appropriate method for each classification environment.
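
The idea of composing feature corpora from different document locations and adjusting their size can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the location names (`title`, `abstract`, `body`), the whitespace tokenizer, the sample text, and the function `build_feature_corpus` are all hypothetical, since the abstract does not specify which locations or sizes were used.

```python
from typing import Dict, List, Sequence


def build_feature_corpus(
    document: Dict[str, str],
    locations: Sequence[str],
    max_tokens: int,
) -> List[str]:
    """Collect tokens from the chosen document locations, capped at max_tokens.

    `document` maps a hypothetical location name (e.g. "title") to its text.
    Tokenization is a plain lowercase whitespace split, chosen for brevity.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens: List[str] = []
    for location in locations:
        # Missing locations contribute nothing rather than raising.
        tokens.extend(document.get(location, "").lower().split())
    return tokens[:max_tokens]


# Hypothetical document; the text is invented for illustration only.
doc = {
    "title": "Gene expression in asthma",
    "abstract": "We study airway inflammation in patients",
    "body": "Methods and results are described in detail here",
}

# Two feature corpus configurations: different locations, different size caps.
title_only = build_feature_corpus(doc, ["title"], max_tokens=10)
title_abstract = build_feature_corpus(doc, ["title", "abstract"], max_tokens=6)
```

Each resulting token list could then be passed to a feature extractor such as an LDA topic model or Doc2Vec, with classifier performance compared across the configurations.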

KEYWORDS

document classification, feature selection, text categorization, topic model, deep learning, LDA, Doc2Vec, text mining

Journal of the Korean Society for Information Management 2024 KCI Impact Factor : 1.35
