---
# <div align="center"><font color='green'> COSC 2673/2793 | Machine Learning  </font></div>
## <div align="center"> <font color='green'> **Example: Week07 Lecture QandA**</font></div>
---

## Mountain Car with RL
Mountain Car is a classic control Reinforcement Learning problem that was first introduced by A. Moore in 1991 [1]. 
A car is on a one-dimensional track, positioned between two “mountains”. The goal is to drive up the mountain on the right; however, the car’s engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
It can be tricky to find this optimal solution due to the sparsity of the reward. Complex exploration strategies can be used to incentivise exploration of the mountain. 
    

Mountain Car Problem definition:
>Objective: Get the car to the top of the right hand side mountain.

>State: Car's horizontal position and velocity (can be negative).

>Action: Direction of push (left, nothing or right).

>Reward: -1 for every environment step until success, which incentivises quick solutions.
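
The need to build momentum can be checked without Gym. The following is a minimal sketch, assuming the update rule and constants used in Gym's `MountainCar-v0` implementation (engine force 0.001, gravity 0.0025, goal at position 0.5). The start position is fixed at -0.5 here for reproducibility, whereas the real environment randomises it, and the function and policy names are hypothetical.

```python
import math

# Constants assumed from Gym's MountainCar-v0 implementation.
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (nothing) or 2 (right)."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    return position, velocity


def rollout(policy, max_steps=200, start=-0.5):
    """Return the number of steps needed to reach the goal, or None."""
    position, velocity = start, 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POS:
            return t
    return None


def always_right(position, velocity):
    # Naive policy: push towards the goal at every step.
    return 2


def follow_velocity(position, velocity):
    # Push in the direction the car is already moving to pump up momentum.
    return 2 if velocity >= 0 else 0


print("always right:", rollout(always_right))
print("follow velocity:", rollout(follow_velocity, max_steps=1000))
```

Under these assumptions, pushing right at every step never reaches the goal, while pushing in the direction of travel does. This back-and-forth behaviour is what the agent has to discover from the reward signal alone.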

## OpenAI Gym 
OpenAI Gym is a Python package comprising a selection of RL environments, ranging from simple “toy” environments to more challenging environments, including simulated robotics environments and Atari video game environments.
It was developed with the aim of becoming a standardized environment and benchmark for RL research.
In this Lab, we will use the OpenAI Gym Mountain Car environment to demonstrate how to get started in using this exciting tool and show how Q-learning can be used to solve this problem.

## Setting up the environment
Let's first import the libraries required for the implementation.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from IPython import display
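# Note: the code below uses the classic Gym API (reset() returns only the state,
# step() returns four values), which changed in gym 0.26.
!pip install gym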
import numpy as np
import gym

**Only run the following block if the visualization of the environment gives an error**. On macOS you need to install pyglet version 1.5.11 to get the gym environment to render. The installation will give an error, but it will work.

In [None]:
!pip install pyglet==1.5.11

To begin with this environment, import and initialize it as follows:

In [None]:
env = gym.make('MountainCar-v0')
state = env.reset()
print(state)

The `env.reset()` command resets the environment and returns the initial state.

Let's explore the state space and the action space of the MountainCar environment.

In [None]:
print('State space: ', env.observation_space)
print('Action space: ', env.action_space)

This tells us that the state space is 2-dimensional, so each state observation is a vector of two (float) values, and that the action space comprises three discrete actions (left, nothing or right). By default, the three actions are represented by the integers 0, 1 and 2. How about the state space? What are its limits?

In [None]:
print('State space Low: ', env.observation_space.low)
print('State space High: ', env.observation_space.high)

This shows that the first state variable (horizontal position) has the range [-1.2, 0.6] and the second state variable (velocity) has the range [-0.07, 0.07]. The state space is continuous, which means that there are infinitely many state-action pairs, making it impossible to build a Q table directly. As a solution, we can discretize the state space. One simple discretization is to convert the state space to a grid with a spacing of 0.1 along the first element and 0.01 along the second. Each state can then be given integer indices by subtracting the lower bound of the state space, multiplying the first element by 10 and the second by 100, and rounding to the nearest integer. Let's see the size of the discretized state space.

In [None]:
num_states = (env.observation_space.high - env.observation_space.low)*np.array([10, 100])
num_states = np.round(num_states, 0).astype(int) + 1
print(num_states)

We can also write a function that converts a continuous state vector to a discrete one.

In [None]:
# Discretize state
def discretize_state(state, env_low):
    state_adj = (state - env_low)*np.array([10, 100])
    state_adj = np.round(state_adj, 0).astype(int)
    return state_adj
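
As a quick check of the mapping, the sketch below applies the same arithmetic to a few hand-picked states. It assumes the bounds reported above, hard-coded here as `LOW` and `HIGH` (hypothetical names, as is `SCALE`) so that it runs without Gym.

```python
import numpy as np

# Bounds assumed from the ranges reported above: [position, velocity].
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
SCALE = np.array([10, 100])  # 1 / grid spacing for each state variable


def discretize_state(state, env_low):
    """Map a continuous state to integer grid indices."""
    state_adj = (np.asarray(state) - env_low) * SCALE
    return np.round(state_adj, 0).astype(int)


# Number of grid cells along each dimension.
grid_shape = np.round((HIGH - LOW) * SCALE, 0).astype(int) + 1
print("grid shape:", grid_shape)

# Lower corner, a typical start state, and upper corner of the state space.
for state in ([-1.2, -0.07], [-0.5, 0.0], [0.6, 0.07]):
    print(state, "->", discretize_state(state, LOW))
```

Every valid state maps to an index inside a 19 × 15 grid, so with three actions the Q table has 19 × 15 × 3 = 855 entries.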

Let's now take some random actions in the environment and see what the output will be. For this we need a function to plot the output of the environment.

In [None]:
def show_state(env, step=0, info=""):
    plt.figure(1)
    plt.clf()
    plt.imshow(env.render(mode='rgb_array'))
    plt.title("Step: %d %s" % (step, info))
    plt.axis('off')

    display.clear_output(wait=True)
    display.display(plt.gcf())

In [None]:
env.reset()
done = False
step_index = 0
while not done:
    action = env.action_space.sample()    # get a random action from the set of actions
    state, reward, done, info = env.step(action) # perform the action and receive new state and reward
    d_state = discretize_state(state, env.observation_space.low)
    show_state(env, step=step_index, info='State ({},{}) Reward: {}'.format(d_state[0], d_state[1], reward))
    step_index = step_index + 1


Did the car reach the goal state?

In [None]:
# Define Q-learning function
def QLearning(env, Q, learning, discount, epsilon, episodes):
    # env: The OpenAI Gym environment
    # Q: Initial Q table
    # learning: Learning rate of Q-learning (alpha)
    # discount: Discount factor (gamma)
    # epsilon: Initial epsilon for exploration vs exploitation
    # episodes: Number of episodes to run when learning the Q table
    
    # Initialize variables to hold rewards
    reward_list = []
    
    # Calculate reduction in epsilon per episode
    epsilon_d = (epsilon)/episodes
    
    for i in range(episodes):
        done = False
        tot_reward, reward = 0,0
        state = env.reset()
        
        state_adj = discretize_state(state, env.observation_space.low)
    
        while not done:
                
            # Determine next action - epsilon greedy strategy for explore vs exploitation
            if np.random.random() < 1 - epsilon:
                action = np.argmax(Q[state_adj[0], state_adj[1]]) 
            else:
                action = env.action_space.sample()
                
            # Get next state and reward
            state2, reward, done, info = env.step(action) 
            
            state2_adj = discretize_state(state2, env.observation_space.low)
            
            #Allow for terminal states
            if done and state2[0] >= 0.5:
                Q[state_adj[0], state_adj[1], action] = reward
                
            # Adjust Q value for current state
            else:
                Q[state_adj[0], state_adj[1],action] = (1-learning)*Q[state_adj[0], state_adj[1],action] + learning*(reward + 
                                 discount*np.max(Q[state2_adj[0], state2_adj[1]]))
                                     
            # Update variables
            tot_reward += reward
            state_adj = state2_adj
        
        # Update epsilon
        if epsilon > 0:
            epsilon -= epsilon_d
        
        # Track rewards
        reward_list.append(tot_reward)
        
        # Report the average reward over the last 100 episodes
        if (i+1) % 100 == 0:
            ave_reward = np.mean(reward_list)
            reward_list = []
            print('Episode {} Average Reward: {}'.format(i+1, ave_reward))
            
    env.close()
    
    return Q
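
The update in the inner loop is the standard tabular Q-learning rule, $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\big(r + \gamma \max_{a'} Q(s',a')\big)$, where $\alpha$ is `learning` and $\gamma$ is `discount`. The following is a minimal sketch of one such update on a tiny Q table. The table shape, state indices and values are made up for illustration.

```python
import numpy as np

learning, discount = 0.2, 0.9

# Hypothetical 2 x 2 grid of states with 3 actions, all values start at 0.
Q = np.zeros((2, 2, 3))
Q[1, 0] = [0.1, 0.5, -0.3]  # action values of the next state

state_adj, action, state2_adj = (0, 0), 2, (1, 0)
reward = -1.0

# Bootstrapped target: immediate reward plus discounted best next-state value.
target = reward + discount * np.max(Q[state2_adj[0], state2_adj[1]])

# Blend the old estimate with the target, weighted by the learning rate.
Q[state_adj[0], state_adj[1], action] = (
    (1 - learning) * Q[state_adj[0], state_adj[1], action] + learning * target
)

print(Q[state_adj[0], state_adj[1], action])
```

Here the target is -1 + 0.9 × 0.5 = -0.55, so the entry moves 20% of the way from 0 towards it, to about -0.11.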

In [None]:
# Initialize Q table randomly
Q = np.random.uniform(low = -1, high = 1, size = (num_states[0], num_states[1], env.action_space.n))
# Run Q-learning algorithm
Q = QLearning(env, Q, learning=0.2, discount=0.9, epsilon=0.8, episodes=5000)
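
With these arguments, exploration starts at `epsilon = 0.8` and shrinks by `0.8 / 5000` after every episode, so the agent acts almost purely greedily by the end of training. The sketch below replays just that schedule. `epsilon_schedule` is a hypothetical helper, not part of the notebook.

```python
def epsilon_schedule(epsilon, episodes):
    """Return the epsilon used in each episode under the linear decay."""
    epsilon_d = epsilon / episodes  # reduction applied after every episode
    values = []
    for _ in range(episodes):
        values.append(epsilon)
        if epsilon > 0:
            epsilon -= epsilon_d
    return values


schedule = epsilon_schedule(0.8, 5000)
for episode in (0, 1250, 2500, 3750, 4999):
    print(episode, round(schedule[episode], 3))
```

A linear decay like this spends the early episodes mostly exploring and the late episodes mostly exploiting the learned Q table.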

Now let's see how well we can perform the task with the learned model.

In [None]:
state = env.reset()
state_adj = discretize_state(state, env.observation_space.low)
done = False
step_index = 0
while not done:
    action = np.argmax(Q[state_adj[0], state_adj[1]])  # always take the greedy action
    state, reward, done, info = env.step(action)
    state_adj = discretize_state(state, env.observation_space.low)
    show_state(env, step=step_index, info='State ({},{}) Reward: {}'.format(state_adj[0], state_adj[1], reward))
    step_index = step_index + 1
