Self-Supervised Document Representation Method

Yun Yeo Il (윤여일); Namgyu Kim (김남규)

doi:10.9708/jksci.2020.25.05.187

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2020, 25(5), pp.187~197
Publisher : The Korean Society of Computer and Information
Research Area : Engineering > Computer Science
Received : April 7, 2020
Accepted : May 5, 2020
Published : May 29, 2020

Yun Yeo Il ¹, Namgyu Kim ¹

¹ Kookmin University

ABSTRACT

Recently, various text embedding methods based on deep learning algorithms have been proposed. In particular, pre-trained language models, which are trained on very large amounts of text data, are widely used to embed new text. However, traditional pre-trained language models have a limitation: they struggle to capture the unique context of a new text when it contains too many tokens. In this paper, we propose a self-supervised fine-tuning method that enables a pre-trained language model to infer vectors for long texts. We applied the proposed method to news articles, classified them into categories, and compared the classification accuracy with that of traditional models. The results confirm that the vectors generated by the proposed model express the inherent characteristics of a document more accurately than the vectors generated by the traditional models.

KEYWORDS

Deep Learning, Document Embedding, Pre-Trained Language Model, Self-Supervised Learning, Text Mining


Journal of The Korea Society of Computer and Information 2024 KCI Impact Factor : 0.81
