KIISE Transactions on Computing Practices
Korean Title |
Analysis of Low GPU Utilization of GEMM Operations in BERT Training |
English Title |
Performance Analysis of GPU Under-utilization when Operating GEMM in BERT Training |
Author |
Sunjung Lee
Jung Ho Ahn
|
Citation |
Vol. 49, No. 4, pp. 232-238 (Apr. 2022) |
Korean Abstract |
GPUs are mainly used for deep neural network (DNN) training because of their efficient parallel computation. However, due to the computational characteristics of the GEMM operations that arise during BERT training, GPUs cannot deliver their peak performance. In this paper, using V100 and A100 GPUs, we analyze why the GPU fails to utilize its compute units efficiently when executing GEMM, the most important operation in BERT training. We identify that, due to limited DRAM capacity and the structural characteristics of BERT, work is not distributed evenly across the GPU. In addition, we analyze the trade-off between raising GPU parallelism by dividing the work into smaller units and the bandwidth demands this places on the memory hierarchy, and we confirm that even with higher parallelism, actual GPU performance drops because of the memory bandwidth bottleneck. Based on these analyses, we confirm the importance of DRAM capacity and of bandwidth throughout the GPU memory hierarchy. |
English Abstract |
Graphics processing units (GPUs) are mainly used for deep neural network training because of their efficient parallel computation. However, due to the computational characteristics of GEMM during BERT training, GPUs do not provide maximum performance. In this paper, we analyze the reasons why GPUs cannot be utilized efficiently when performing the GEMM operation, which is the most important task in BERT training. We identify the problem that work is not allocated evenly to the GPU's parallel computing units due to the limitation of DRAM capacity and the structural characteristics of BERT. In addition, we analyze the trade-off between increasing GPU parallelism by dividing the work into smaller units and the resulting demand on memory bandwidth. We confirm that even if parallelism increases, the actual GPU performance is reduced due to the memory bandwidth bottleneck. Based on our results, we explain the importance of DRAM capacity and of bandwidth in the GPU's memory hierarchy. |
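The trade-off described in the abstract can be made concrete with a back-of-the-envelope model. The sketch below (not from the paper; the SM count, tile sizes, and the BERT-large-like GEMM shape are illustrative assumptions) estimates, for a few GEMM tile sizes, how many thread-block tiles are launched, what fraction of the GPU's SMs stay busy given wave quantization, and how many bytes of operand traffic are incurred per floating-point operation:

```python
# Illustrative sketch: how GEMM tiling interacts with GPU occupancy
# ("wave quantization") and per-tile memory traffic.
# Hardware and workload numbers are assumptions, loosely A100-like.

NUM_SMS = 108  # streaming multiprocessors (assumption: one tile per SM per wave)

def gemm_tile_stats(m, n, k, tile_m, tile_n):
    """Estimate tile count, wave utilization, and operand bytes per FLOP
    for an (m x k) x (k x n) GEMM with fp16 (2-byte) inputs."""
    tiles = -(-m // tile_m) * -(-n // tile_n)   # ceil-div in both output dims
    waves = -(-tiles // NUM_SMS)                # ceil(tiles / SMs)
    # A partial final wave leaves some SMs idle, dragging down utilization.
    utilization = tiles / (waves * NUM_SMS)
    # Each tile reads tile_m*k + k*tile_n operands; smaller tiles re-read
    # the same A/B data more often, raising bytes moved per FLOP.
    bytes_per_flop = 2 * (tile_m * k + k * tile_n) / (2 * tile_m * tile_n * k)
    return tiles, utilization, bytes_per_flop

# BERT-large-like GEMM: batch*seq_len = 4096 rows, hidden size = 1024.
for tm, tn in [(256, 128), (128, 128), (64, 64)]:
    tiles, util, bpf = gemm_tile_stats(4096, 1024, 1024, tm, tn)
    print(f"tile {tm}x{tn}: {tiles} tiles, util {util:.2f}, bytes/flop {bpf:.3f}")
```

Under these assumptions, smaller tiles yield more tiles and hence better wave utilization, but each FLOP then requires more bytes from the memory hierarchy, which is exactly why higher parallelism can still lose to the bandwidth bottleneck.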
Keywords |
BERT training
GPU under-utilization
memory hierarchy
DRAM capacity
|