Çѱ¹ÀÎÅͳÝÁ¤º¸ÇÐȸ ³í¹®Áö

Korean Title: 오프 폴리시 강화학습에서 몬테 칼로와 시간차 학습의 균형을 사용한 적은 샘플 복잡도
English Title: Random Balance between Monte Carlo and Temporal Difference in off-policy Reinforcement Learning for Less Sample-Complexity
Author: 김차영 (Chayoung Kim), 박서희 (Seohee Park), 이우식 (Woosik Lee)
Citation: Vol. 21, No. 5, pp. 1-7 (Oct. 2020)
Korean Abstract (translated): Deep neural networks used as function approximators in reinforcement learning produce results close to practice, even in theory. In a variety of practical success cases, temporal difference learning (TD) shows better results than Monte Carlo learning (MC). However, some prior studies show that MC outperforms TD in environments where rewards occur very sparsely or are delayed. MC also proves superior to TD when the information the agent receives from the environment is only partial. Most such environments can be viewed as 5-step or 20-step Q-learning, in which experiments can keep running without the long roll-outs that help reduce performance degradation. A representative case is a noisy network unaffected by long roll-outs, where MC, which is robust to temporal errors, or learning nearly identical to MC, gives better results than TD. These prior studies contradict the conventional belief that TD is better than MC; in other words, they show through empirical examples rather than theory that a combined use of MC and TD is better than using TD alone. Therefore, building on the results of those studies, and without the complicated reward-specific functions they used, this study combines MC and TD in a simpler way by randomly balancing between them. Comparing a DQN based on the proposed random mixture of MC and TD with the well-known DQN that uses only TD learning, we show through simulations in OpenAI Gym that the proposed random mixture of MC and TD is the superior learning method.
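The abstract states only that the balance between MC and TD is chosen at random, without giving the exact mixing rule. As a minimal sketch, assuming a coefficient lambda drawn uniformly at random for each update (an assumption, not notation taken from the paper), the blended target for a DQN-style learner could be written as:

    y_t = \lambda\, G_t + (1 - \lambda)\left( r_{t+1} + \gamma \max_{a} Q_{\theta^-}(s_{t+1}, a) \right),
    \qquad G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k+1},
    \qquad \lambda \sim \mathrm{Uniform}(0, 1)

Here G_t is the Monte Carlo return of the finished episode, the bracketed term is the usual one-step Q-learning (TD) target computed with a target network Q_{\theta^-}, and lambda = 0 or lambda = 1 recovers pure TD or pure MC respectively.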
English Abstract: Deep neural networks (DNN), used as approximation functions in reinforcement learning (RL), can in theory produce results close to those observed in practice. In empirical benchmark work, temporal difference learning (TD) shows better results than Monte Carlo learning (MC). However, some previous works show that MC is better than TD when rewards are very sparse or delayed. Other recent research shows that MC prediction is superior to TD-based methods when the information the agent observes from the environment is only partial, as in complex control tasks. Most of these environments can be regarded as 5-step or 20-step Q-learning, in which experiments can continue without the long roll-outs that help alleviate performance degradation. In other words, for a noisy network whose behavior does not depend on controlled roll-outs, it is better to learn with MC, which is robust to noisy rewards, or with something almost identical to MC, than with TD. These studies break with the conventional view that TD is better than MC, and they show empirically, rather than theoretically, that combining MC and TD is the better approach. Therefore, in this study, building on those results, we exploit a random balance between TD and MC in off-policy reinforcement learning, without the complicated reward-based formulas used in the previous studies. Comparing a DQN that uses the random mixture of MC and TD with the well-known DQN that uses only TD-based learning, we demonstrate through experiments in OpenAI Gym that the proposed random mixture of TD and MC is the better-performing learning method.
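The abstracts describe the setup (DQN on OpenAI Gym) only in prose. The following Python fragment is a hedged sketch of how such a randomly balanced target could be computed for the transitions of one finished episode; the function names (discounted_returns, mc_td_mixed_targets), the Uniform(0, 1) draw, and the toy numbers are illustrative assumptions, not taken from the paper.

    # Hedged sketch, not the authors' implementation: blend the Monte Carlo return
    # with the one-step TD (Q-learning) target using a random per-update coefficient.
    import numpy as np

    def discounted_returns(rewards, gamma):
        """Monte Carlo return G_t for every step of a finished episode."""
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    def mc_td_mixed_targets(rewards, q_next_max, dones, gamma, rng):
        """Return lam * G_t + (1 - lam) * TD target, with lam ~ Uniform(0, 1)."""
        g = discounted_returns(rewards, gamma)              # MC targets
        td = rewards + gamma * q_next_max * (1.0 - dones)   # one-step TD targets
        lam = rng.uniform(0.0, 1.0)                         # random MC/TD balance (assumed)
        return lam * g + (1.0 - lam) * td

    # Toy usage with made-up numbers (no environment or network needed):
    rng = np.random.default_rng(0)
    rewards = np.array([0.0, 0.0, 1.0])       # sparse reward at the end of the episode
    q_next_max = np.array([0.5, 0.9, 0.0])    # max_a Q(s', a) from a target network
    dones = np.array([0.0, 0.0, 1.0])         # terminal flag for the last transition
    print(mc_td_mixed_targets(rewards, q_next_max, dones, gamma=0.99, rng=rng))

In a full DQN these blended values would simply replace the TD-only target in the regression loss against Q(s_t, a_t).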
Keyword: On- and Off-Policy, Deep Q-Network, Temporal Difference, Monte Carlo, Reinforcement Learning, Variance and Bias Balance