Two-steps Data Quality Assessment Methodology for Handling Drift of Machine Learning (머신러닝 드리프트에 대한 2-단계 데이터 품질 평가 방법론)

Okjoo Choi (최옥주); Yukyong Kim (김유경)

doi:10.29056/jsav.2024.03.07

Journal of Software Assessment and Valuation
Abbr : JSAV
2024, 20(1), pp.75~85
DOI : 10.29056/jsav.2024.03.07
Publisher : Korea Software Assessment and Valuation Society
Research Area : Engineering > Computer Science
Received : March 5, 2024
Accepted : March 20, 2024
Published : March 31, 2024

Okjoo Choi ¹, Yukyong Kim ²

¹ Division of AI and Software Engineering, Pai Chai University
² Division of Basic Engineering, Sookmyung Women's University

Accredited

ABSTRACT

In data-driven information technologies such as big data analysis and machine learning, data quality directly affects the quality of the entire system. In particular, the properties of the data used to train a machine learning model change over time, so the model becomes less accurate or behaves differently from how it was designed. This phenomenon is called drift. Drift can occur for a variety of reasons, including data collection issues and market volatility. Data drift is difficult to detect immediately and can lead to inaccurate predictions, compromising the business decisions based on them. The actions required to manage drift depend on its type, extent, and nature. To take appropriate action, it is important to establish repeatable procedures for identifying drift, controlling and assessing data quality, setting thresholds for drift rates, and configuring proactive warnings. In this paper, we propose a two-step data quality assessment framework that manages the drift problems arising in machine learning projects through data quality assessment indicators. We also define evaluation indices and evaluation procedures for drift detection according to drift type.
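To make the idea of "setting thresholds for drift rates and configuring proactive warnings" concrete, the following is a minimal sketch and not the framework proposed in the paper. It assumes a single numeric feature and uses the Population Stability Index (PSI) as a stand-in drift score; the function names (`bin_fractions`, `psi`, `drift_status`) and the cut-offs 0.1 and 0.25 are illustrative assumptions (commonly quoted rules of thumb), not values taken from the paper.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Return the fraction of values in each bin defined by sorted edges.

    Values outside the edge range are clamped into the first or last bin,
    so the fractions always sum to 1.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        # Advance while v is at or beyond the right edge of the current bin.
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range (an assumption of this
    sketch); eps avoids log(0) for empty bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant data
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref_frac = bin_fractions(reference, edges)
    cur_frac = bin_fractions(current, edges)
    score = 0.0
    for r, c in zip(ref_frac, cur_frac):
        r, c = max(r, eps), max(c, eps)
        score += (c - r) * math.log(c / r)
    return score


def drift_status(score: float, warn_at: float = 0.1, alert_at: float = 0.25) -> str:
    """Map a drift score to a status; both thresholds are assumed values."""
    if score >= alert_at:
        return "alert"
    if score >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    training = [i / 100 for i in range(1000)]   # hypothetical training feature
    incoming = [v + 5 for v in training]        # hypothetical shifted batch
    print(drift_status(psi(training, incoming)))
```

In practice the reference window, the binning scheme, and the thresholds would be chosen per feature and per drift type; the sketch only shows how a repeatable score-then-threshold check could be wired to a warning.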

KEYWORDS

Data quality assessment, Data quality metric, Data Drift, Concept Drift

Journal of Software Assessment and Valuation 2024 KCI Impact Factor : 0.32
