
KIISE Transactions on Computing Practices


Korean Title: BERT 학습에서 GEMM 연산의 낮은 GPU 활용도 분석
English Title: Performance Analysis of GPU Under-utilization when Operating GEMM in BERT Training
Authors: Sunjung Lee (이선정), Jung Ho Ahn (안정호)
Citation: Vol. 49, No. 4, pp. 232-238 (April 2022)
Abstract:
Graphics processing units (GPUs) are used mainly for deep neural network training because of their efficient parallel computation. However, due to the computational characteristics of GEMM during BERT training, GPUs do not deliver their maximum performance. In this paper, we analyze, on V100 and A100 GPUs, why the GPU cannot utilize its compute units efficiently when performing GEMM, the most important operation in BERT training. We identify that, because of the limited DRAM capacity and the structural characteristics of BERT, work is not allocated evenly to the GPU's parallel compute units. In addition, we analyze the trade-off between increasing GPU parallelism by dividing the work into smaller units and the resulting memory-bandwidth cost, and we confirm that even when parallelism increases, actual GPU performance drops because of the memory-bandwidth bottleneck. Based on these results, we explain the importance of DRAM capacity and of bandwidth across the GPU's memory hierarchy.
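The abstract attributes the under-utilization to the shapes of BERT's GEMM operations and to a trade-off between splitting the work into smaller units (more parallelism) and memory bandwidth. The sketch below is an illustration only, not the authors' experimental setup: it times a BERT-like GEMM in PyTorch for a few assumed split factors of the M dimension, so that achieved throughput can be compared against the GPU's peak. The shapes (hidden size 1024, sequence length 512, batch 32) and split factors are assumptions chosen for illustration.

```python
# Minimal, illustrative sketch: time a BERT-like GEMM for several assumed
# split factors and report achieved throughput. Not the paper's methodology.
import torch

def time_gemm(a: torch.Tensor, b: torch.Tensor, iters: int = 50) -> float:
    """Return average milliseconds per matmul, measured with CUDA events."""
    torch.matmul(a, b)  # warm-up to exclude one-time launch/caching costs
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def main() -> None:
    assert torch.cuda.is_available(), "requires a CUDA GPU"
    hidden, seq, batch = 1024, 512, 32      # assumed BERT-like GEMM shape
    m, k, n = batch * seq, hidden, hidden   # (M x K) @ (K x N)
    flops = 2.0 * m * k * n                 # multiply-add counted as 2 FLOPs

    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    for split in (1, 2, 4, 8):              # divide the M dimension into smaller chunks
        a = torch.randn(m // split, k, device="cuda", dtype=torch.float16)
        ms = time_gemm(a, b)
        tflops = (flops / split) / (ms * 1e-3) / 1e12
        print(f"split={split}: {ms:.3f} ms per GEMM, {tflops:.1f} TFLOP/s achieved")

if __name__ == "__main__":
    main()
```

Whether the smaller per-call GEMMs help or hurt achieved throughput depends on how well the resulting thread blocks cover the GPU's streaming multiprocessors and on how often the weight matrix must be re-read from DRAM, which is the bandwidth side of the trade-off described in the abstract.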
Keywords: BERT training, GPU under-utilization, memory hierarchy, DRAM capacity