Big Data Smoothing and Outlier Removal for Patent Big Data Analysis (특허 빅데이터 분석을 위한 데이터 평활 및 이상치 제거)

Choi.Jun-Hyeog (최준혁); Sung-Hae Jun (전성해)

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2016, 21(8), pp.77~84
Publisher : The Korean Society Of Computer And Information
Research Area : Engineering > Computer Science

Choi.Jun-Hyeog ¹, Sung-Hae Jun ²

¹Kimpo University
²Cheongju University

Accredited

ABSTRACT

In general statistical analysis, we typically assume that the data are normally distributed. If this assumption is not satisfied, we cannot expect good results from statistical data analysis. Most statistical methods for handling outliers and noise also rely on this assumption. However, the assumption often does not hold for big data because of its large volume and heterogeneity. We therefore propose a methodology based on box-plots and data smoothing for controlling outliers and noise in big data analysis. The proposed methodology does not depend on the normality assumption. We select patent documents as the target domain because patent big data analysis is an important issue in the management of technology. We analyze patent documents using big data learning methods for technology analysis. Patent data collected from patent databases around the world are preprocessed and analyzed using text mining and statistics. However, most research on patent big data analysis has not considered the outlier and noise problem, which decreases prediction accuracy and increases the variance of parameter estimates. In this paper, we check for the existence of outliers and noise in patent big data. To determine whether outliers are present, we use box-plots and smoothing visualization. We use patent documents related to three-dimensional printing technology to illustrate how the proposed methodology can be used to detect noise in retrieved patent big data.
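
The two ingredients named in the abstract, a box-plot rule and data smoothing, can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword counts are made-up numbers, the function names `boxplot_outliers` and `moving_average` are hypothetical, the fence multiplier 1.5 is the conventional box-plot default rather than a value taken from the paper, and a simple centered moving average stands in for whatever smoother the paper actually uses. Neither step relies on a normality assumption.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Flag points outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical yearly frequencies of one keyword in a patent collection
# (illustrative numbers only, not data from the paper).
counts = [2, 3, 3, 4, 5, 4, 40, 5, 6, 6]

lower, upper, outliers = boxplot_outliers(counts)
smoothed = moving_average(counts, window=3)

print("fences:", lower, upper)
print("outliers:", outliers)
print("smoothed:", [round(s, 2) for s in smoothed])
```

In this toy series the single large count falls above the upper fence and is flagged, while the smoothed series shows how one extreme value distorts its neighbours when it is left in place.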

KEYWORDS

Patent big data, Smoothing, Box-plot, Noise, Outlier, Statistical analysis

This paper was written with support from the National Research Foundation of Korea.

Journal of The Korea Society of Computer and Information 2025 KCI Impact Factor : 1.01