## Sangchul Woo and Yunsick Sung

| Turn | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Number of state spaces | [TeX:] $$\frac{25 !}{1 ! \times 0 ! \times 24 !}=25$$ | [TeX:] $$\frac{25 !}{1 ! \times 1 ! \times 23 !}=600$$ | [TeX:] $$\frac{25 !}{2 ! \times 1 ! \times 22 !}=6,900$$ | [TeX:] $$\frac{25 !}{2 ! \times 2 ! \times 21 !}=75,900$$ |

If we define state information for all 83,425 state spaces and calculate the state and action values for each state, we can create an agent that can win in any situation. We trained the first-strike model by saving all state information and repeatedly updating the policy.
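The per-turn counts in the table above are multinomial coefficients over the 25 cells, and their sum gives the 83,425 states. A short script can verify this (a sketch; the turn-by-turn mark counts are read off the table):

```python
from math import factorial

def state_count(x_marks: int, o_marks: int, cells: int = 25) -> int:
    """Number of board configurations with the given mark counts on a
    board of `cells` cells (a multinomial coefficient)."""
    empty = cells - x_marks - o_marks
    return factorial(cells) // (
        factorial(x_marks) * factorial(o_marks) * factorial(empty)
    )

# (X, O) mark counts after turns 1-4 on the 5x5 board, as in the table.
turns = [(1, 0), (1, 1), (2, 1), (2, 2)]
counts = [state_count(x, o) for x, o in turns]
print(counts)       # per-turn state-space sizes
print(sum(counts))  # total over the first four turns: 83,425
```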

If the first placement location among the 83,425 states occurring in the first strike is fixed, the state spaces for the remaining three moves can be reduced to 6,650.

An analysis of the action space is required to reduce it. The action space can be divided into the space used for learning and the space that is not. For example, an action that is never performed requires no space because it is never learned. This type of space should be removed from the action space to solve the state-space problem. Additional space can also be reduced if it is unrelated to the optimal behavior. Thus, a method is required for applying reinforcement learning and configuring rules that considers only the actions related to the optimal behavior.

To derive the action to be performed, the following procedure is applied: within the entire action set A, the set of actions that can be performed at time t is defined as [TeX:] $$A_{t},$$ and the action set [TeX:] $$A_{t}$$ is defined at every time step: [TeX:] $$A_{t} \subset A$$ . Function f(A) takes the set A and returns the set of actions [TeX:] $$A_{t}$$ that can be performed. Each action in the action set [TeX:] $$A_{t}$$ satisfies Eq. (1). The position of the agent at time t is defined as [TeX:] $$s_{t}, \text { and } a_{t}$$ is the action the agent performs at time t. In the Tic-Tac-Toe example, each placement position shown in Fig. 1 denotes a state

[TeX:] $$A_{t}=f(A)=\left\{a_{t} \in A \mid g\left(a_{t}\right) \in h\left(s_{t}\right)\right\}$$ (1)

where [TeX:] $$A_{t} \subset A, s_{t} \in S, \text { and } g\left(a_{t}\right) \in h\left(s_{t}\right).$$ Function [TeX:] $$g\left(a_{t}\right)$$ returns the spatial data of action [TeX:] $$a_{t}.$$ Function [TeX:] $$h\left(s_{t}\right)$$ returns the set of possible spatial data for state [TeX:] $$s_{t}.$$ Action [TeX:] $$a_{t}$$ is an element of both set A and set [TeX:] $$A_{t},$$ and it can be used for learning when its spatial data is an element of the possible spatial data set of state [TeX:] $$s_{t}.$$ The agent selects the action [TeX:] $$a_{t}$$ to perform from the dynamically defined action set [TeX:] $$A_{t}$$ during reinforcement learning. For example, Q-learning is updated as expressed in Eq. (2)

[TeX:] $$Q\left(s_{t}, a_{t}\right) \leftarrow Q\left(s_{t}, a_{t}\right)+\alpha\left[r_{t}+\gamma \max _{a_{t+1} \in A_{t+1}} Q\left(s_{t+1}, a_{t+1}\right)-Q\left(s_{t}, a_{t}\right)\right]$$ (2)

where [TeX:] $$a_{t} \in A_{t}.$$ [TeX:] $$Q\left(s_{t}, a_{t}\right)$$ is the action value when the action is [TeX:] $$a_{t}$$ in state [TeX:] $$s_{t}$$ at time t, and [TeX:] $$\max Q\left(s_{t+1}, a_{t+1}\right)$$ is the largest action value among all possible actions in the next state. [TeX:] $$r_{t}$$ is the reward for action [TeX:] $$a_{t}.$$ [TeX:] $$\alpha \text { and } \gamma$$ are constants: [TeX:] $$\alpha$$ is the step size in the incremental mean, and [TeX:] $$\gamma$$ is the discount rate. The proposed method has the advantage that it can be applied to various reinforcement learning algorithms without modification, as it reduces the state space by decreasing the number of selectable actions.
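The dynamic action set of Eq. (1) and the restricted Q-learning update of Eq. (2) can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the (row, col) encoding of spatial data, the α and γ values, the dictionary-based Q-table, and the example `allowed` set are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Iterable, Set, Tuple

Pos = Tuple[int, int]          # (row, col) spatial data; an assumed encoding

def f(actions: Iterable[Pos],
      g: Callable[[Pos], Pos],
      allowed: Set[Pos]) -> Set[Pos]:
    """Eq. (1): A_t = { a in A | g(a) in h(s_t) }."""
    return {a for a in actions if g(a) in allowed}

ALPHA, GAMMA = 0.1, 0.9        # step size and discount rate (assumed values)
Q = defaultdict(float)         # Q[(state, action)] -> action value

def q_update(s_t, a_t, r_t, s_next, A_next):
    """Eq. (2), with the max taken only over the dynamic set A_{t+1}."""
    best_next = max((Q[(s_next, a)] for a in A_next), default=0.0)
    Q[(s_t, a_t)] += ALPHA * (r_t + GAMMA * best_next - Q[(s_t, a_t)])

# In Tic-Tac-Toe an action *is* a board position, so g is the identity.
g = lambda a: a
all_actions = {(r, c) for r in range(5) for c in range(5)}
allowed = {(1, 1), (1, 2)}     # a hypothetical h(s_t) output
A_t = f(all_actions, g, allowed)

a = next(iter(A_t))
q_update("s0", a, 1.0, "s1", A_t)
print(Q[("s0", a)])            # 0.1 after the first update
```

Because the max in `q_update` ranges only over `A_next`, value estimates are never created for excluded actions, which is what keeps the stored state space small.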

This section introduces a series of processes to verify the proposed method. The proposed method is applied to Tic-Tac-Toe games for verification. This section presents the Tic-Tac-Toe game and details the application process and experimental results of the proposed method. Furthermore, the experimental results are analyzed.

Tic-Tac-Toe is a game for two players, X and O, who take turns marking spaces, traditionally in a 3×3 square-grid environment. The player who first places three of their marks in a horizontal, vertical, or diagonal row wins. Despite the simple rules of the game, the number of possible state spaces is 5,920. The proposed method was applied to 5×5 Tic-Tac-Toe to create a situation where the state-space problem exists. The size of the state space that can occur in a 5×5 Tic-Tac-Toe environment is approximately 160 billion.
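The win condition can be checked with a sketch like the following. The in-a-row length required to win the 5×5 variant is not stated here, so it is left as a parameter `k`, defaulting to the traditional three:

```python
def has_k_in_row(marks: set, k: int = 3) -> bool:
    """True if `marks` (the set of (row, col) cells held by one player)
    contains k consecutive cells horizontally, vertically, or diagonally."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    for (r, c) in marks:
        for dr, dc in directions:
            if all((r + i * dr, c + i * dc) in marks for i in range(k)):
                return True
    return False

print(has_k_in_row({(0, 0), (1, 1), (2, 2)}))  # True: a diagonal of three
print(has_k_in_row({(0, 0), (0, 1), (1, 2)}))  # False: no line of three
```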

The proposed method was applied to solve the state-space problem. The action space of the Tic-Tac-Toe game was reduced as follows to remove the state space associated with actions that were not performed or were unlikely to be performed. In the 5×5 environment, each marked space was processed as spatial data. Thus, marking a space at a specific position was defined as an action, and the function g(a) was used to process that position as spatial data. As shown in Fig. 3, the function h(s) returns the set of positions of empty cells within a distance of [TeX:] $$\sqrt{2}$$ from the existing marks as the set of possible spaces.
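A minimal sketch of this h(s) rule, assuming (row, col) coordinates on the 5×5 grid: the cells within Euclidean distance √2 of a mark are exactly its eight surrounding neighbors, so a single mark in the centre shrinks the 24 remaining cells to 8, as in Fig. 3.

```python
from math import dist  # Euclidean distance between two points

SIZE = 5

def h(marked: set) -> set:
    """Possible-space set: empty cells within distance sqrt(2)
    of at least one existing mark."""
    cells = {(r, c) for r in range(SIZE) for c in range(SIZE)}
    empty = cells - marked
    return {p for p in empty
            if any(dist(p, m) <= 2 ** 0.5 for m in marked)}

print(len(h({(2, 2)})))  # 8: the eight neighbours of the centre cell
print(sorted(h({(0, 0)})))  # corner mark: only three neighbours remain
```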

In a 5×5 Tic-Tac-Toe game, the strategies of the first and second players are different. The first player has a relatively high chance of winning. Therefore, the proposed method was applied to the second player. The second player was trained with 100,000 episodes using the proposed method, and the learning time was 10 minutes. Learning was performed by exploring approximately 20,000 state spaces.

With the proposed method, the number of wins was 44,683, similar to that of traditional Q-learning. The proposed method was far more efficient, as it used only about 20,000 state spaces, approximately 0.33% of the 6 million state spaces used in traditional Q-learning.

Figs. 3 and 4 show the process of limiting the action space to within [TeX:] $$\sqrt{2}$$ of the placed marks. There are coordinates corresponding to each position, and each marked position is mapped to these coordinates. These coordinates allow for a distance calculation. Fig. 3 shows the process of reducing the 24 action spaces available to the second player to 8, and Fig. 4 shows the process of reducing the 23 action spaces available to the first player to 12.

In this experiment, the proposed method and traditional Q-learning were compared according to the number of state spaces retrieved and the count of accumulated wins. As shown in Fig. 5, the proposed method retrieved fewer than 20,000 state spaces in learning approximately 90,000 episodes, whereas traditional Q-learning searched approximately 6 million state spaces for the same number of episodes. Thus, the number of state spaces searched by the proposed method was approximately 0.33% of that searched by traditional Q-learning, indicating that the time and state space can be significantly reduced by decreasing the search space.

Fig. 6 shows a comparison of the two algorithms according to their accumulated wins. The proposed method resulted in approximately 45,000 wins in training with 100,000 episodes, which was similar to the count of wins for traditional Q-learning. Although the win count differed according to learning type, it did not significantly affect the performance analysis.

Ultimately, the proposed method significantly reduced the learning time and state space in terms of the number of action spaces searched, while showing performance similar to that of Q-learning in the cumulative win count. Although the proposed method has the disadvantage of requiring spatial data to be extracted from the action and state, it is applicable in various fields because actions typically depend on location, and it is expected to improve performance considerably.

In particular, although the impact of the state-space problem has recently been reduced through deep learning, the state-space and training-time limitations still exist, so the proposed method can be applied to other models to address them.

In this paper, a reinforcement learning method that dynamically adjusts the action space was proposed. Because actions that were not performed or were unlikely to be optimal were not learned, and no state space was allocated to them, the learning time was shortened and the state space was reduced. The proposed method was verified experimentally by applying it to a game of Tic-Tac-Toe. It showed results similar to those of traditional Q-learning even when the state space was reduced to approximately 0.33%, indicating that it can reduce the cost and time required for the same amount of learning.

A virtual dance-tutorial system is used to learn how to dance with the help of a virtual instructor. With the proposed method, reinforcement learning can be applied to such a system. In the future, the proposed method will be combined with deep learning to study how the dance motions of virtual instructors can be improved.

He received his B.S. degree from the Department of Computer Engineering at Shinhan University, Seoul, Korea, in 2020. He has been interested in artificial intelligence since college and carried out an AI-based waste-bin project using TensorFlow. Although he completed the project successfully, its poor recognition performance convinced him of the need for further study. His current research interests include artificial intelligence and immersive software. Since 2020, he has been studying in the Department of Multimedia Engineering at Dongguk University-Seoul as an M.E. student.

He received his B.S. degree in the Division of Electrical and Computer Engineering from Pusan National University, Busan, Republic of Korea, in 2004, his M.S. degree in Computer Engineering from Dongguk University-Seoul, Seoul, Republic of Korea, in 2006, and his Ph.D. degree in Game Engineering from Dongguk University, Seoul, Republic of Korea, in 2012. He was a researcher at Samsung Electronics in the Republic of Korea between 2006 and 2009, and a postdoctoral fellow at the University of Florida, Florida, USA, between 2012 and 2013. His research interests focus on superintelligence in the fields of immersive software, UAVs, security, and music, using demonstration-based learning and deep learning. He is currently an associate professor in the Department of Multimedia Engineering at Dongguk University-Seoul, Seoul, Republic of Korea.
