Snake (without Tail) – Reinforcement Learning in Python

From beating top professional players in Dota 2 to defeating the world champion in Go, Reinforcement Learning has gained a lot of popularity in recent years. Reinforcement Learning can not only be applied to arcade or board games, but also to many other tasks, such as managing an investment portfolio or teaching a robot to pick a device out of one box and place it in a container. With this project I wanted to explore some of what Reinforcement Learning, and especially Q-Learning, has to offer (a Deep Q-Learning project might follow in the near future). My idea was to implement a simple Snake game in Python without the tail of the snake: essentially just one cube looking for another cube in a small, random environment. The tail will probably be implemented in a later post.

Step 1: Importing Libraries and Defining Important Variables
import numpy as np
from PIL import Image
import cv2

SIZE = 20 ### Size of the (SIZE x SIZE) environment
SNAKE_COLOUR = (0, 0, 255) ### Colour of the snake cube
FOOD_COLOUR = (255, 0, 0) ### Colour of the food cube

HM_EPISODES = 25000 ### How many episodes in total
MOVE_PENALTY = 1 ### Movement penalty for each step
FOOD_REWARD = 50 ### Reward for the food
BOUNDARY_PENALTY = 100 ### Penalty if the boundaries are hit
epsilon = 0.9 ### Variable for random actions, so the agent is able to explore the environment
EPS_DECAY = 0.9998  ### Every episode will be epsilon*EPS_DECAY
SHOW_EVERY = 2000 ### How often to play through env visually.
show_every = 1000 ### How often the current episode should be printed to the console
LEARNING_RATE = 0.1 ### Learning rate for the Q-value update
DISCOUNT = 0.95 ### Discount factor for future rewards
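
Because epsilon is multiplied by EPS_DECAY once per episode, its value after a given number of episodes can be estimated up front. A small side calculation (not part of the training script) that matches the epsilon values reported in the results further down:

### Epsilon after n episodes is roughly epsilon * EPS_DECAY ** n
for n in (0, 5000, 20000):
    print(f"epsilon after {n} episodes: {0.9 * 0.9998 ** n:.3f}")
### Prints approximately 0.900, 0.331 and 0.016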
Step 2: Defining the Cube (OOP)
class Cube:

    def __init__(self): ### Randomly initialize the Cube somewhere in the environment
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)
        self.hit_wall = False ### Tracks whether the last move ran into a boundary

    def __str__(self): ### Override the string representation
        return f"{self.x}, {self.y}"

    def __sub__(self, other): ### Needed for the distance calculation between Cube and Food
        return (self.x - other.x, self.y - other.y)

    def move(self, choice):
        if choice == 0:
            self.x += 1
        elif choice == 1:
            self.x -= 1
        elif choice == 2:
            self.y += 1
        elif choice == 3:
            self.y -= 1

        ### Check if out of bounds and clamp back into the environment
        self.hit_wall = False
        if self.x > SIZE - 1:
            self.x = SIZE - 1
            self.hit_wall = True
        if self.x < 0:
            self.x = 0
            self.hit_wall = True
        if self.y > SIZE - 1:
            self.y = SIZE - 1
            self.hit_wall = True
        if self.y < 0:
            self.y = 0
            self.hit_wall = True
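
A quick usage check (not part of the training script) illustrates how the subtraction and the clamping in move() behave; the printed coordinates are random examples:

snake, food = Cube(), Cube()
print(snake)                  ### e.g. "3, 17"
print(snake - food)           ### Relative offset, e.g. (-5, 15) -- later used as the observation
snake.x, snake.y = SIZE - 1, 0
snake.move(0)                 ### Moving right at the edge: x is clamped to SIZE - 1
print(snake, snake.hit_wall)  ### "19, 0" and True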
Step 3: Initialize the Q-Table and Start the Training Loop
q_table = {}
for i in range(-SIZE + 1, SIZE):
    for ii in range(-SIZE + 1, SIZE):
        q_table[(i, ii)] = [np.random.uniform(-5, 0) for _ in range(4)]

episode_rewards = []

for episode in range(HM_EPISODES):
    ### Initialize the Snake and the Food
    snake = Cube()
    food = Cube()

    ### Simple code to check what the epsilon and mean reward is
    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    if episode % show_every == 0:
        print(f"on #{episode}")

    episode_reward = 0

    for i in range(500): ### Each episode lasts at most 500 steps

        obs = (snake - food) ### Current observation --> relative distance between Snake and Food

        ### Decide whether to take a random action or whether to take the argmax value of the
        ### q_table based on the observation
        if np.random.random() > epsilon:
            # GET THE ACTION
            action = np.argmax(q_table[obs])
        else:
            action = np.random.randint(0, 4)

        snake.move(action) ### Make a move

        ### If the Snake hits the Food, initialize the Food again
        ### Also assign the reward for each step
        if snake.x == food.x and snake.y == food.y:
            food = Cube()
            reward = FOOD_REWARD
        elif snake.hit_wall: ### The move ran into a boundary; the position is already clamped inside move()
            reward = -BOUNDARY_PENALTY
        else:
            reward = -MOVE_PENALTY

        ### Get the new observation and calculate the q-values 
        new_obs = (snake - food)
        max_future_q = np.max(q_table[new_obs])
        current_q = q_table[obs][action]
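        ### Standard Q-learning update:
        ### Q(s, a) <- (1 - LEARNING_RATE) * Q(s, a) + LEARNING_RATE * (reward + DISCOUNT * max_a' Q(s', a'))
        ### If the food was found, the Q-value is simply set to the full food reward instead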
        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
        q_table[obs][action] = new_q

        ### Visualize the environment
        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)
            env[snake.x][snake.y] = SNAKE_COLOUR
            env[food.x][food.y] = FOOD_COLOUR
            img = Image.fromarray(env, 'RGB')
            img = img.resize((300, 300))
            cv2.imshow("image", np.array(img))
            cv2.waitKey(10)

        episode_reward += reward

    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

moving_avg = np.convolve(episode_rewards, np.ones((SHOW_EVERY,))/SHOW_EVERY, mode='valid')
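
The moving average above is only computed, not displayed. A minimal plotting sketch (assuming matplotlib is available) could look like this:

import matplotlib.pyplot as plt

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"Reward ({SHOW_EVERY} episode moving average)")
plt.xlabel("Episode")
plt.show()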
Snake at Start
epsilon = 0.9

At the very beginning, the Snake hardly ever manages to reach the food even once. That is mainly because epsilon starts at 0.9, so almost every action the Snake takes is random.

Snake after 5,000 Episodes
epsilon = 0.331
mean reward = 183

After training for 5,000 episodes, the Snake has improved significantly, even though roughly every third move is still random (epsilon = 0.331). The mean reward over the last 1,000 episodes was 183.

Snake after 20,000 Episodes
epsilon = 0.016
mean reward = 1276

After training for another 15,000 episodes, the Snake is not getting much better any more and seems to have reached an optimum. The mean reward over the last 1,000 episodes has increased to about 1,276.

Further Steps:
  • Change some of the parameters, for example the size of the environment from 20×20 to 50×50 or even 100×100, which would significantly increase the complexity.
  • Add the tail of the snake.
  • Tune epsilon and the epsilon decay as well as the other parameters.
