Journal of KIISE
Korean Title |
Improving the Performance of ScaLAPACK's Parallel Matrix Multiplication Routine PDGEMM Using an AVX-512-Based Blocked GEMM Algorithm on Intel KNL Clusters |
English Title |
Improvement on Parallel Matrix Multiplication Routines in ScaLAPACK using Blocked Matrix Multiplication Algorithm on Intel KNL Clusters with AVX-512 |
Author(s) |
Sunghwan Ahn
Yujin Jang
Seungjun Ha
Beomseok Nam
Thi My Tuyen Nguyen
Yoosang Park
Jaeyoung Choi
|
Citation |
Vol. 48, No. 1, pp. 7-12 (Jan. 2021) |
Korean Abstract |
General matrix multiplication (DGEMM) is a core computational routine used in linear algebra, machine learning, statistics, and other fields. Processor vendors have released hand-optimized routines written in assembly for single multi-core nodes, and many studies have sought to optimize the computation through various auto-tuning techniques. To reduce the processing time of matrix multiplication effectively, a method is needed that optimizes the per-node multiplication step so that it can be processed in a form suited to parallel computing environments. This paper introduces a parallel double-precision general matrix multiplication (PDGEMM) for the Intel Knights Landing (KNL) environment and its application process. The details of the proposed process include a blocked sub-matrix multiplication step that optimizes single-node matrix multiplication for parallel execution, and a compilation step that applies the Intel AVX-512 instructions available in the KNL environment. Experiments confirmed that the proposed PDGEMM outperforms the parallel matrix multiplication routine of the Intel Math Kernel Library (MKL) by 6% and 68% on KNL clusters of 4 and 16 nodes, respectively.
|
English Abstract |
General matrix multiplication (GEMM) is a core computation algorithm in linear algebra, machine learning, statistics, and many other domains. Optimizations of such routines, including GEMM, have been conducted by vendors and researchers using auto-tuning techniques. To achieve high performance for parallel matrix multiplication, a processing scheme based on the optimization of local matrix multiplication at each node must be applied. In this paper, the application of parallel double-precision general matrix multiplication (PDGEMM) on Intel KNL was examined, where the local DGEMM computes sub-matrix multiplications at each node. Details of the proposed DGEMM are introduced, including a blocked matrix multiplication algorithm with AVX-512 instruction sets and several optimization techniques, such as data prefetching, loop unrolling, and cache blocking. This study found that the proposed PDGEMM outperformed the PDGEMM from the Intel Math Kernel Library (MKL) on both 4-node and 16-node KNL clusters, with flop-rate improvements of 6% and 68%, respectively.
|
Keywords |
parallel matrix-matrix multiplication
parallel BLAS
Intel Knights Landing (KNL)
AVX-512
|