From beating top professional players in Dota 2 to defeating the world champion in Go, Reinforcement Learning has gained a lot of popularity in recent years. Reinforcement Learning can not only be applied to arcade or board games, but also to many other tasks, such as managing an investment portfolio or teaching a robot to pick a device out of one box and place it in a container. With this project I wanted to explore some of what Reinforcement Learning, and especially Q-Learning, has to offer (a Deep Q-Learning project might follow in the near future). My idea was to implement a simple Snake game in Python without the snake's tail: basically just one cube looking for another cube in a small random environment. The tail will probably be added in a later post.
Step 1: Importing Libraries and Defining Important Variables
```python
import numpy as np
from PIL import Image
import cv2

SIZE = 20                   ### Width and height of the square grid
SNAKE_COLOUR = (0, 0, 255)
FOOD_COLOUR = (255, 0, 0)

HM_EPISODES = 25000         ### How many episodes in total
MOVE_PENALTY = 1            ### Movement penalty for each step
FOOD_REWARD = 50            ### Reward for the food
BOUNDARY_PENALTY = 100      ### Penalty if the boundaries are hit

epsilon = 0.9               ### Variable for random actions, so the agent is able to explore the environment
EPS_DECAY = 0.9998          ### Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 2000           ### How often to play through the env visually
show_every = 1000           ### How often the current episode should be printed to the console

LEARNING_RATE = 0.1
DISCOUNT = 0.95
```
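LEARNING_RATE and DISCOUNT plug into the standard tabular Q-learning update used later in Step 3. As a minimal sketch (the helper function `q_update` is only for illustration and is not used in the training script, which writes the formula out inline):

```python
def q_update(current_q, max_future_q, reward):
    ### Standard tabular Q-learning update:
    ### Q(s, a) <- (1 - lr) * Q(s, a) + lr * (reward + discount * max_a' Q(s', a'))
    return (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
```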
Step 2: Defining the Cube (OOP)
```python
class Cube:
    def __init__(self):
        ### Randomly initialize the Cube somewhere in the environment
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

    def __str__(self):
        ### Override the string representation
        return f"{self.x}, {self.y}"

    def __sub__(self, other):
        ### Needed for the distance calculation between Cube and Food
        return (self.x - other.x, self.y - other.y)

    def move(self, choice):
        if choice == 0:
            self.x += 1
        elif choice == 1:
            self.x -= 1
        elif choice == 2:
            self.y += 1
        elif choice == 3:
            self.y -= 1

        ### Check if out of bounds: clamp the position back onto the grid and
        ### report the collision, so the main loop can apply BOUNDARY_PENALTY
        ### (after clamping, the position alone never reveals a boundary hit)
        hit_boundary = False
        if self.x > SIZE - 1:
            self.x = SIZE - 1
            hit_boundary = True
        elif self.x < 0:
            self.x = 0
            hit_boundary = True
        if self.y > SIZE - 1:
            self.y = SIZE - 1
            hit_boundary = True
        elif self.y < 0:
            self.y = 0
            hit_boundary = True
        return hit_boundary
```
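A quick illustrative use of the class (not part of the training script): the overridden subtraction is what produces the observation tuple that later serves as the key into the Q-table.

```python
snake = Cube()
food = Cube()
print(snake)        ### e.g. "12, 3"
obs = snake - food  ### relative offset (dx, dy), e.g. (7, -4) -- this tuple indexes the Q-table
snake.move(0)       ### step one cell in the +x direction (clamped at the grid edge)
```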
Step 3: Initializing the Q-Table and Starting the Training Loop
```python
q_table = {}
for x in range(-SIZE + 1, SIZE):
    for y in range(-SIZE + 1, SIZE):
        ### One entry per possible (dx, dy) observation, with four random starting Q-values (one per action)
        q_table[(x, y)] = [np.random.uniform(-5, 0) for _ in range(4)]

episode_rewards = []
for episode in range(HM_EPISODES):
    ### Initialize Snake and Food
    snake = Cube()
    food = Cube()

    ### Simple code to check what the current epsilon and mean reward are
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False
    if episode % show_every == 0:
        print(f"on #{episode}")

    episode_reward = 0
    for i in range(500):
        obs = snake - food  ### Get the observation --> distance between Snake and Food

        ### Decide whether to take a random action or the argmax of the
        ### q_table entry for the current observation
        if np.random.random() > epsilon:
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)

        hit_boundary = snake.move(action)  ### Make a move

        ### If the Snake hits the Food, initialize the Food again
        ### Also assign the reward for each step
        if snake.x == food.x and snake.y == food.y:
            food = Cube()
            reward = FOOD_REWARD
        elif hit_boundary:
            reward = -BOUNDARY_PENALTY
        else:
            reward = -MOVE_PENALTY

        ### Get the new observation and calculate the new q-value
        new_obs = snake - food
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]
        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        ### Visualize the environment
        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)
            env[snake.x][snake.y] = SNAKE_COLOUR
            env[food.x][food.y] = FOOD_COLOUR
            img = Image.fromarray(env, 'RGB')
            img = img.resize((300, 300))
            cv2.imshow("image", np.array(img))
            cv2.waitKey(10)

        episode_reward += reward

    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

### Moving average of the episode rewards (useful for plotting the learning curve)
moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,)) / SHOW_EVERY, mode='valid')
```
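The `moving_avg` array at the end isn't used any further in the snippet above. If you want to look at the learning curve, a minimal way to plot it (assuming matplotlib is installed; this is not part of the original script) would be:

```python
import matplotlib.pyplot as plt

### Plot the smoothed reward per episode
plt.plot(range(len(moving_avg)), moving_avg)
plt.ylabel(f"reward ({SHOW_EVERY} episode moving average)")
plt.xlabel("episode")
plt.show()
```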
Snake at Start
epsilon = 0.9
At the very beginning, the Snake rarely manages to reach the food even once. That is mainly because epsilon is at 0.9, so almost every action the Snake takes is random.
Snake after 5,000 Episodes
epsilon = 0.331
mean reward = 183
After training for 5,000 episodes, the Snake has improved significantly, even though roughly every third move is still random (epsilon = 0.331). The mean reward over the last 1,000 episodes was 183.
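These epsilon values follow directly from the decay schedule defined in Step 1: epsilon starts at 0.9 and is multiplied by EPS_DECAY = 0.9998 after every episode, so after N episodes it equals 0.9 * 0.9998^N. A quick check (not part of the training script), which also matches the value after 20,000 episodes below:

```python
EPS_DECAY = 0.9998
for n in (0, 5000, 20000):
    print(n, round(0.9 * EPS_DECAY ** n, 3))
### 0     -> 0.9
### 5000  -> 0.331
### 20000 -> 0.016
```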
Snake after 20,000 Episodes
epsilon = 0.016
mean reward = 1276
After training for another 15,000 episodes, the Snake is barely improving any more and seems to have reached an optimum. The mean reward over the last 1,000 episodes has increased to about 1,276.
Further Steps:
- Change some of the parameters, for example the size of the environment (SIZE) from 20×20 to 50×50 or even 100×100. This would significantly increase complexity (see the quick calculation after this list).
- Add the tail
- Change epsilon and epsilon decay as well as other parameters
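Regarding the first point: the Q-table has one entry for every possible relative (dx, dy) observation, i.e. (2 * SIZE - 1)² states with four Q-values each, so the table grows quickly with the grid size. A rough calculation:

```python
### State-space size for different grid sizes
for size in (20, 50, 100):
    n_states = (2 * size - 1) ** 2
    print(f"{size}x{size} grid: {n_states} observations, {n_states * 4} Q-values")
### 20x20   ->  1521 observations
### 50x50   ->  9801 observations
### 100x100 -> 39601 observations
```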