Reference String Recognition based on Word Sequence Tagging and Post-processing: Evaluation with English and German Datasets

In-Su Kang (강인수)

doi:10.9708/jksci.2018.23.05.001

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2018, 23(5), pp.1~7
Publisher : The Korean Society Of Computer And Information
Research Area : Engineering > Computer Science
Received : March 28, 2018
Accepted : May 16, 2018
Published : May 31, 2018

In-Su Kang ¹

¹Kyungsung University

ABSTRACT

Reference string recognition extracts individual reference strings from the reference section of an academic article, which consists of a sequence of reference lines. The task has been addressed by heuristic-based, clustering-based, and classification-based approaches that exploit lexical and layout characteristics of reference lines. Most classification-based methods use sequence labeling to assign labels either to a sequence of tokens within reference lines or to a sequence of reference lines. Unlike previous token-level sequence labeling approaches, this study assigns distinct labels to the beginning, intermediate, and terminating tokens of a reference string. Post-processing is then applied to identify reference strings from their predicted beginning and/or terminating tokens. Experimental evaluation on English and German reference string recognition datasets shows that the proposed method achieves a macro-averaged F1 above 94%.
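
The post-processing idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the label names or the exact recovery rules, so the `B`/`I`/`E` labels, the function name `segment_references`, and the rule "close a string at a predicted end, or when a new beginning is predicted" are all assumptions made for illustration, not the paper's actual procedure. The labels would come from a trained sequence tagger, which is not shown here.

```python
from typing import List


def segment_references(tokens: List[str], labels: List[str]) -> List[str]:
    """Group tokens into reference strings from per-token labels.

    Assumed (hypothetical) label set: "B" = beginning token,
    "I" = intermediate token, "E" = terminating token.
    """
    references: List[List[str]] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        # A predicted beginning closes any string that is still open,
        # which recovers from a missed terminating label.
        if label == "B" and current:
            references.append(current)
            current = []
        current.append(token)
        # A predicted terminating token closes the string right away,
        # which recovers from a missed beginning label on the next string.
        if label == "E":
            references.append(current)
            current = []
    # Keep any trailing tokens that were never closed.
    if current:
        references.append(current)
    return [" ".join(ref) for ref in references]


tokens = ["Kang", "2018", "JKSCI", "Lafferty", "2001", "ICML"]
clean = segment_references(tokens, ["B", "I", "E", "B", "I", "E"])
missed_end = segment_references(tokens, ["B", "I", "I", "B", "I", "E"])
missed_begin = segment_references(tokens, ["B", "I", "E", "I", "I", "E"])
```

In this toy input all three label sequences yield the same two strings, `Kang 2018 JKSCI` and `Lafferty 2001 ICML`, because either boundary label alone is enough to place the split. That is one plausible reading of identifying strings by "beginning and/or terminating tokens".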

KEYWORDS

Reference String Recognition, Sequence Labeling, Citation

This paper was written with support from the National Research Foundation of Korea.

Journal of The Korea Society of Computer and Information 2025 KCI Impact Factor : 1.01
