Latency Hiding based Warp Scheduling Policy for High Performance GPUs (Korean title: Latency Hiding based Warp Scheduling for Improving GPU Performance)

Gwang Bok Kim (김광복); Jong Myon Kim (김종면); Cheol Hong Kim (김철홍)

doi:10.9708/jksci.2019.24.04.001

Journal of The Korea Society of Computer and Information
Abbr : JKSCI
2019, 24(4), pp.1~9
Publisher : The Korea Society of Computer and Information
Research Area : Engineering > Computer Science
Received : January 22, 2019
Accepted : April 11, 2019
Published : April 30, 2019

Gwang Bok Kim ¹, Jong Myon Kim ², Cheol Hong Kim ¹

¹ Chonnam National University
² University of Ulsan

ABSTRACT

The Loose Round Robin (LRR) warp scheduling policy for GPU architectures provides high warp-level parallelism and balances the load across warps. However, the traditional LRR policy tends to make multiple warps execute long-latency operations at the same time. When no more warps can be issued while these long-latency operations are outstanding, GPU throughput may degrade significantly. In this paper, we propose a new warp scheduling policy that exploits latency hiding, leading to better utilization of memory resources in high-performance GPUs. The proposed warp scheduler prioritizes memory instructions according to the Greedy Then Oldest (GTO) policy in order to reduce memory stalls. When no warp can execute a memory instruction, the warp scheduler selects a warp with a computation instruction in a round-robin manner. Furthermore, the proposed technique achieves high performance by using additional information about recently committed warps. According to our experimental results, the proposed technique improves GPU performance by 12.7% and 5.6% on average over LRR and GTO, respectively.
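
The two-level selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Warp` fields, the `LatencyHidingScheduler` class, and the convention that a lower warp id means an older warp are all assumptions made for illustration, and the use of recently committed warp information is left out because the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Warp:
    wid: int            # position in the warp list; lower id is assumed older
    next_is_mem: bool   # True if the next instruction is a memory instruction
    ready: bool = True  # False while the warp is stalled


class LatencyHidingScheduler:
    """Hypothetical sketch: memory instructions first (GTO order),
    then computation instructions in round-robin order."""

    def __init__(self) -> None:
        self.greedy: Optional[int] = None  # warp currently favored by GTO
        self.rr_next = 0                   # where the round-robin scan starts

    def select(self, warps: List[Warp]) -> Optional[Warp]:
        # Level 1: ready warps whose next instruction is a memory instruction.
        mem = [w for w in warps if w.ready and w.next_is_mem]
        if mem:
            # Greedy: keep issuing from the same warp while it still qualifies.
            for w in mem:
                if w.wid == self.greedy:
                    return w
            # Then oldest: otherwise fall back to the oldest candidate.
            chosen = min(mem, key=lambda w: w.wid)
            self.greedy = chosen.wid
            return chosen

        # Level 2: no memory instruction can issue, so rotate over the
        # ready warps that have a computation instruction.
        n = len(warps)
        for offset in range(n):
            idx = (self.rr_next + offset) % n
            w = warps[idx]
            if w.ready and not w.next_is_mem:
                self.rr_next = (idx + 1) % n
                return w
        return None  # nothing can issue this cycle
```

Calling `select` once per cycle on the current warp list returns the warp to issue, or `None` when every warp is stalled. Keeping the greedy pointer separate from the round-robin pointer means that a burst of memory instructions does not disturb the rotation used for computation instructions.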

KEYWORDS

GPUs, Warp Scheduler, Latency Hiding, Thread Level Parallelism, Data Locality

This paper was written with support from the National Research Foundation of Korea.
