Modeling GPU Dynamic Parallelism for self similar density workloads

Felipe A. Quezada; Cristóbal A. Navarro; Miguel Romero; Cristhian Aguilera

doi:10.1016/j.future.2023.03.046

Felipe A. Quezada, Cristóbal A. Navarro^*, Miguel Romero, Cristhian Aguilera

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Finding an optimal subdivision is not trivial, as the combination of several parameters plays a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which incurs performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g∼32,r∼2,B∼32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ∼60% faster than DP on the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP is up to 7.5× faster than that same exhaustive implementation. In terms of energy efficiency, ASK is up to ∼2× and ∼20× more energy efficient than DP and the exhaustive approach, respectively. These results position the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision-based approaches, for developing more efficient GPU-based libraries, and for fine-tuning specific codes in research teams.
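
To make the roles of g, r and B concrete, the following is a minimal CPU sketch in Python of a level-by-level subdivision over the Mandelbrot set. It is not the authors' ASK or DP implementation. The abstract does not state how a region is judged to need more work, so the border-uniformity test used here (fill a region when all of its border pixels share one dwell value, in the style of Mariani-Silver) is an assumption, as are the function names, the complex-plane window and the small default parameters chosen so the example runs quickly. The loop processes one subdivision level per pass instead of recursing, which only loosely mirrors the idea of avoiding kernel recursion. The border fill is a heuristic and may differ from an exhaustive evaluation on some pixels.

```python
# Hypothetical sketch: iterative (non-recursive) subdivision of an n x n
# Mandelbrot image. g = initial subdivision, r = recurrent subdivision,
# B = stopping size, following the parameter names in the abstract.

X0, Y0, WIDTH = -1.5, -1.0, 2.0  # assumed window: [-1.5, 0.5] x [-1.0, 1.0]


def dwell(cx, cy, max_iter):
    """Escape-time iteration count for the point cx + i*cy."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def pixel_dwell(row, col, n, max_iter):
    """Dwell of the pixel at (row, col) in an n x n grid over the window."""
    step = WIDTH / n
    return dwell(X0 + col * step, Y0 + row * step, max_iter)


def border_dwells(r0, c0, side, n, max_iter):
    """Yield the dwell of every border pixel of a square region."""
    last = side - 1
    for k in range(side):
        yield pixel_dwell(r0, c0 + k, n, max_iter)         # top edge
        yield pixel_dwell(r0 + last, c0 + k, n, max_iter)  # bottom edge
        yield pixel_dwell(r0 + k, c0, n, max_iter)         # left edge
        yield pixel_dwell(r0 + k, c0 + last, n, max_iter)  # right edge


def subdivide_mandelbrot(n, g=4, r=2, B=4, max_iter=64):
    """Fill an n x n dwell image, one subdivision level per loop pass."""
    assert n % g == 0, "n must be divisible by g"
    img = [[-1] * n for _ in range(n)]
    side = n // g
    regions = [(i * side, j * side) for i in range(g) for j in range(g)]
    while regions:
        children = []
        for r0, c0 in regions:
            if side <= B:
                # Stopping size reached: evaluate every pixel exhaustively.
                for row in range(r0, r0 + side):
                    for col in range(c0, c0 + side):
                        img[row][col] = pixel_dwell(row, col, n, max_iter)
                continue
            values = set(border_dwells(r0, c0, side, n, max_iter))
            if len(values) == 1:
                # Uniform border: assume a uniform interior and fill it.
                fill = values.pop()
                for row in range(r0, r0 + side):
                    img[row][c0:c0 + side] = [fill] * side
            else:
                # Heterogeneous border: split into r x r child regions.
                assert side % r == 0, "region side must be divisible by r"
                child = side // r
                children.extend((r0 + i * child, c0 + j * child)
                                for i in range(r) for j in range(r))
        regions = children  # next level is handled by the next loop pass
        side //= r
    return img
```

Calling `subdivide_mandelbrot(64)` returns a 64 × 64 list of dwell values. On a GPU, each pass of the `while` loop would correspond to work launched for one subdivision level, and the cost of each level depends on how g, r and B are chosen, which is the trade-off the cost model in the paper addresses.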

Original language	English
Pages (from-to)	239-253
Number of pages	15
Journal	Future Generation Computer Systems
Volume	145
DOI	https://doi.org/10.1016/j.future.2023.03.046
Status	Published - 2023

Bibliographical note

Funding Information:
This work was supported by the ANID FONDECYT grant #1221357, FONDEF, Chile grant ID20I10262, the Temporal research group and the Patagón supercomputer of Universidad Austral de Chile (FONDEQUIP EQM180042). Romero is funded by Fondecyt grant 11200956, the National Center for Artificial Intelligence CENIA FB210017, Basal ANID, and the Data Observatory Foundation.

Publisher Copyright:
© 2023 Elsevier B.V.

ASJC Scopus subject areas

Software
Hardware and Architecture
Computer Networks and Communications

Cite this

@article{d5be2780c4124f569978d41348b54b07,

title = "Modeling GPU Dynamic Parallelism for self similar density workloads",

abstract = "Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Doing an optimal subdivision process is not trivial, as the combination of different parameters play a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which has performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g∼32,r∼2,B∼32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ∼60% faster than DP in the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP is up to 7.5× faster. In terms of energy efficiency, ASK is up to ∼2× and ∼20× more energy efficient than DP and the exhaustive approach, respectively. These results put the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision based approaches and for developing more efficient GPU-based libraries or fine-tune specific codes in research teams.",

keywords = "Dynamic Parallelism, GPU, Heterogeneous workload, Kernel recursion overhead, Self similar density, Subdivision",

author = "Quezada, {Felipe A.} and Navarro, {Crist{\'o}bal A.} and Romero, Miguel and Aguilera, Cristhian",

note = "Publisher Copyright: {\textcopyright} 2023 Elsevier B.V.",

year = "2023",

month = aug,

doi = "10.1016/j.future.2023.03.046",

language = "English",

volume = "145",

pages = "239--253",

journal = "Future Generation Computer Systems",

issn = "0167-739X",

publisher = "Elsevier",

}

TY - JOUR

T1 - Modeling GPU Dynamic Parallelism for self similar density workloads

AU - Quezada, Felipe A.

AU - Navarro, Cristóbal A.

AU - Romero, Miguel

AU - Aguilera, Cristhian

PY - 2023/8

Y1 - 2023/8

N2 - Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Doing an optimal subdivision process is not trivial, as the combination of different parameters play a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which has performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g∼32,r∼2,B∼32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ∼60% faster than DP in the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP is up to 7.5× faster. In terms of energy efficiency, ASK is up to ∼2× and ∼20× more energy efficient than DP and the exhaustive approach, respectively. These results put the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision based approaches and for developing more efficient GPU-based libraries or fine-tune specific codes in research teams.

AB - Dynamic Parallelism (DP) is a GPU programming abstraction that can make parallel computation more efficient for problems that exhibit heterogeneous workloads. With DP, GPU threads can launch kernels with more threads, recursively, producing a subdivision effect where resources are focused on the regions that exhibit more parallel work. Doing an optimal subdivision process is not trivial, as the combination of different parameters play a relevant role in the final performance of DP. Also, the current programming abstraction of DP relies on kernel recursion, which has performance overhead. This work presents a new subdivision cost model for problems that exhibit self similar density (SSD) workloads, useful for finding efficient subdivision schemes. Also, a new subdivision implementation free of recursion overhead is presented, named Adaptive Serial Kernels (ASK). Using the Mandelbrot set as a case study, the cost model shows that optimal performance is achieved when using {g∼32,r∼2,B∼32} for the initial subdivision, recurrent subdivision and stopping size, respectively. Experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the ASK approach runs up to ∼60% faster than DP in the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP is up to 7.5× faster. In terms of energy efficiency, ASK is up to ∼2× and ∼20× more energy efficient than DP and the exhaustive approach, respectively. These results put the subdivision cost model and the ASK approach as useful tools for analyzing the potential improvement of subdivision based approaches and for developing more efficient GPU-based libraries or fine-tune specific codes in research teams.

KW - Dynamic Parallelism

KW - GPU

KW - Heterogeneous workload

KW - Kernel recursion overhead

KW - Self similar density

KW - Subdivision

UR - http://www.scopus.com/inward/record.url?scp=85151513205&partnerID=8YFLogxK

U2 - 10.1016/j.future.2023.03.046

DO - 10.1016/j.future.2023.03.046

M3 - Article

AN - SCOPUS:85151513205

SN - 0167-739X

VL - 145

SP - 239

EP - 253

JO - Future Generation Computer Systems

JF - Future Generation Computer Systems

ER -
