Reinforcement Learning
Policy Iteration in RL: A Step-by-Step Illustration
A guide to the Policy Iteration algorithm

Policy Iteration¹ is an algorithm in Reinforcement Learning that learns the optimal policy, i.e., the policy that maximizes the long-term discounted reward. Such techniques are useful when there are multiple options to choose from and each option carries its own rewards and risks.
In this article, we apply the Policy Iteration algorithm to a simple game in which a pirate ship has to reach its destination while navigating both risky and rewarding situations.
Overview of the game:
Let us consider a pirate ship that is currently anchored at an island and has to reach its homeland safely. There are two routes it could choose from.
If it takes the route to the north, it reaches an island full of gold, which it can collect before moving south towards the homeland. However, there is an area of very high gravity (like the Bermuda Triangle) to the north of the gold island. If the ship strays there by mistake, it is sucked in and lost forever.
If it takes the route to the south, it reaches an island full of silver, which it can collect before moving north towards the homeland. There is a prison island to the south of the silver island. If the ship lands there by mistake, it is captured and the crew is imprisoned.
All looks good so far; however, there is a twist. A fully deterministic game is not much fun, so let us introduce some stochasticity.
Remember Jack Sparrow from the ‘Pirates of the Caribbean’ movies? Let us say our ship’s captain possesses a broken compass, similar to the one Jack carries. Every time the captain makes a move towards the north, he actually goes north with a probability of 0.8, but he might miss the mark and end up going south with a probability of 0.2. Similarly, if he moves south, there is a 0.8 probability of going south and a 0.2 probability of going north.
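To make the broken compass concrete, here is a minimal Python sketch of a single stochastic move. The function name noisy_move and the ‘N’/‘S’ direction labels are just illustrative choices for this article, not anything standard.

```python
import random

def noisy_move(intended):
    """Direction the ship actually travels, given the broken compass.

    With probability 0.8 the ship goes where the captain intended;
    with probability 0.2 it ends up going the opposite way.
    """
    opposite = {"N": "S", "S": "N"}
    return intended if random.random() < 0.8 else opposite[intended]

# Quick sanity check: roughly 80% of intended-north moves should actually head north.
samples = [noisy_move("N") for _ in range(10_000)]
print(samples.count("N") / len(samples))  # prints a value close to 0.8
```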
Rewards (Positive and Negative):
Let us assign a reward to each of the ship’s possible landings.
- Reaching the homeland earns the pirate ship +1 point.
- Landing on the gold island earns +2 points.
- Landing on the silver island earns +1 point.
- If the ship is sucked into the Bermuda Triangle, it gets -2 points.
- If the ship is captured at the prison island, it gets -0.5 points.
Given this context, our goal is to find the optimal policy that gets the ship to its homeland safely with the maximum reward.
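Before formalizing anything, these rewards can be jotted down as a small lookup table. This is only a sketch: the descriptive state labels are placeholders (the formal labels S1 to S6 come later), and the reward for the starting island is assumed to be 0, which is consistent with V[S1] = 0 in the first iteration below.

```python
# Hypothetical labels for the six landing spots; rewards taken from the list above.
REWARDS = {
    "start": 0.0,     # starting island (assumed 0)
    "gold": 2.0,      # gold island
    "silver": 1.0,    # silver island
    "home": 1.0,      # homeland (destination)
    "bermuda": -2.0,  # high-gravity area, ship lost forever
    "prison": -0.5,   # prison island, crew captured
}
```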
A quick review of the ‘Policy Iteration’ algorithm:
There are three steps in Policy Iteration¹:
- Initialize random policy¹
- Policy Evaluation¹
- Policy Improvement¹
What is a Policy?
A policy is a mapping from every possible state in the system to an action². An optimal policy is the policy that maximizes the long-term reward². Thus, for this example, there are multiple possible policies, i.e., multiple possible sets of actions across the states (islands), but typically only one of them yields the maximum long-term reward. Our goal is to find that optimal policy.
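To make this concrete, a policy can be represented as a plain Python dictionary mapping each decision state to an action, reusing the hypothetical labels from the rewards sketch above:

```python
# Two of the many possible policies for the pirate game.
policy_a = {"start": "N", "gold": "S", "silver": "N"}
policy_b = {"start": "S", "gold": "S", "silver": "N"}
```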
Step 1:
Randomly initialize the policy, i.e., assign a random action to every state of the system.
Step 2:
Step 2 is based on Bellman’s equation, provided below¹:

V(s) = R(s) + γ · Σ_s’ T(s, π(s), s’) · V(s’)

For every state, take the action given by the current policy π and evaluate the value function V using the equation above. Here T is the transition probability (also denoted by p), R(s) is the reward for entering state s, and γ is the discount factor.
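As a rough illustration of this step, here is a sketch of iterative policy evaluation in Python. It assumes the transition model is stored as nested dictionaries T[s][a] = {next_state: probability} and that states with no outgoing actions are terminal; the function name and data layout are choices made for this example, not a standard API.

```python
def policy_evaluation(policy, T, R, gamma=1.0, tol=1e-6, max_sweeps=1000):
    """Repeatedly apply V(s) = R(s) + gamma * sum_s' T(s, a, s') * V(s'),
    with a = policy[s], until the values stop changing.

    T[s][a] is a dict {next_state: probability}; states missing from T are
    treated as terminal, so their value is simply their reward R(s).
    """
    V = {s: 0.0 for s in R}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in R:
            if s in T:  # non-terminal state: back up over the policy's action
                a = policy[s]
                new_v = R[s] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
            else:       # terminal state
                new_v = R[s]
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    return V
```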
Step 3:
For every state, get the best action from the value function using:

π’(s) = argmax_a Σ_s’ T(s, a, s’) · V(s’)

If the best action is better than the action in the present policy, replace the current action with the best action.
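A matching sketch of the greedy improvement step, under the same assumed data layout:

```python
def policy_improvement(V, T, actions=("N", "S")):
    """Greedy step: for each non-terminal state, pick the action that
    maximizes sum_s' T(s, a, s') * V(s')."""
    new_policy = {}
    for s in T:
        new_policy[s] = max(
            actions,
            key=lambda a: sum(p * V[s2] for s2, p in T[s][a].items()),
        )
    return new_policy
```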
Policy Iterations:
Iterate steps 2 and 3 until convergence. If the policy does not change during an iteration, we can consider that the algorithm has converged.
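Putting the three steps together, a minimal policy iteration loop might look like the sketch below, reusing the policy_evaluation and policy_improvement helpers sketched above. The stopping test is exactly the ‘policy did not change’ check described here.

```python
import random

def policy_iteration(T, R, actions=("N", "S"), gamma=1.0):
    """Alternate evaluation and improvement until the policy stops changing.
    Relies on the policy_evaluation and policy_improvement sketches above."""
    policy = {s: random.choice(actions) for s in T}      # step 1: random policy
    while True:
        V = policy_evaluation(policy, T, R, gamma)       # step 2: evaluate
        improved = policy_improvement(V, T, actions)     # step 3: improve
        if improved == policy:                           # unchanged -> converged
            return policy, V
        policy = improved
```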
State transition diagram:
The first thing to do is to understand the states and actions and build a state transition diagram. In our example, each island is a state and there are two actions, ‘north’ and ‘south’. The rewards for each state are as outlined above. Based on these facts, we can build a state transition diagram as follows:
There are six states in total: the start and destination states, plus four intermediate islands on which the ship can land. Let us label the states S1 to S6 as follows:
- S1: Start state
- S2: State of landing on the gold island
- S3: State of landing on the silver island
- S4: State of landing in the Bermuda Triangle
- S5: State of landing on the destination (homeland) island
- S6: State of landing on the prison island

Transition probability matrix:
Since the game is stochastic, we need to compute the transition probability for each state/action pair. Based on the probabilities provided above and the state transition diagram, we can write these down directly. Note that there are two actions, so we need a transition probability matrix for each of them, and that in Bellman’s equation T(S, a, S’) refers to the transition probability of moving from state S to state S’ after taking action ‘a’.
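One way to make these matrices concrete is to encode them in Python as nested dictionaries. The adjacencies below are read off the story (the Bermuda Triangle lies to the north of the gold island and the homeland to its south, while the homeland lies to the north of the silver island and the prison to its south); treating S4, S5 and S6 as terminal states with no outgoing actions is an assumption of this sketch.

```python
# T[s][action] = {next_state: probability}: from S1, north aims at the gold
# island (S2) and south at the silver island (S3); from S2, north leads to the
# Bermuda Triangle (S4) and south to the homeland (S5); from S3, north leads to
# the homeland (S5) and south to the prison (S6). The broken compass succeeds
# with probability 0.8 and flips the move with probability 0.2.
T = {
    "S1": {"N": {"S2": 0.8, "S3": 0.2}, "S": {"S3": 0.8, "S2": 0.2}},
    "S2": {"N": {"S4": 0.8, "S5": 0.2}, "S": {"S5": 0.8, "S4": 0.2}},
    "S3": {"N": {"S5": 0.8, "S6": 0.2}, "S": {"S6": 0.8, "S5": 0.2}},
    # S4, S5 and S6 are treated as terminal states, so they have no entries here.
}
R = {"S1": 0.0, "S2": 2.0, "S3": 1.0, "S4": -2.0, "S5": 1.0, "S6": -0.5}
```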


Policy Iteration algorithm:
With this information at hand, let us apply the algorithm described above step by step. We assume the discount factor (gamma) to be 1.
Initial random policy:
Let us randomly initialize the policy (state to action mapping) as moving north for all states.
P = {N, N, N, N, N, N}
If we observe the state transition diagram, the states S4, S5 and S6 do not support any actions, as they are end (terminal) states. So let us restrict the policy to the first three states, where an action can actually be taken.
P = {N, N, N}
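In the dictionary representation used in the earlier sketches, this initial policy is simply:

```python
# Initial policy, restricted to the three states where an action can be taken.
policy = {"S1": "N", "S2": "N", "S3": "N"}
```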
First iteration:
Let us initialize the value V(s) of every state to 0. The Bellman equation then reduces to V(s) = R(s), where R(s) is the reward for entering that state.
Policy Evaluation for first iteration:
V[S1] = 0; V[S4] = -2
V[S2] = 2; V[S5] = 1
V[S3] = 1; V[S6] = -0.5
Policy Improvement for first iteration:
Let us apply the equation provided above for Policy Improvement.

The policy obtained based on the above table is as follows:
P = {N,S,N}
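As a rough check using the transition model sketched earlier and the improvement rule π’(s) = argmax_a Σ_s’ T(s, a, s’) · V(s’): for S1, going north gives 0.8 × 2 + 0.2 × 1 = 1.8 versus 0.8 × 1 + 0.2 × 2 = 1.2 for south, so north wins; for S2, south gives 0.8 × 1 + 0.2 × (-2) = 0.4 versus 0.8 × (-2) + 0.2 × 1 = -1.4 for north, so south wins; for S3, north gives 0.8 × 1 + 0.2 × (-0.5) = 0.7 versus 0.8 × (-0.5) + 0.2 × 1 = -0.2 for south, so north wins. This agrees with the policy {N, S, N} above.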
Second iteration:
Policy Evaluation for the second iteration:

The values for each state can be summarized as below:
V[S1] = 3; V[S4] = -2
V[S2] = 1; V[S5] = 1
V[S3] = 1.5; V[S6] = -0.5
Policy Improvement for second iteration:

The policy obtained based on the above table is as follows:
P = {S, S, N}
Third iteration:
Policy Evaluation for third iteration:

The values for each state can be summarized as below:
V[S1] = 2.5; V[S4] = -2
V[S2] = 1; V[S5] = 1
V[S3] = 1.5; V[S6] = -0.5
Policy Improvement for third iteration:

The policy obtained based on the above table is as follows:
P = {S, S, N}
If we compare this policy to the one obtained in the second iteration, we can observe that the policy did not change, which implies that the algorithm has converged and this is the optimal policy.
Conclusion:
By applying the algorithm, we have obtained the optimal policy {South, South, North}. Looking at the example, it might be tempting to go north first, since the gold island offers a bigger reward, but that route is fraught with risk: we might lose 2 points and end the game in the Bermuda Triangle. So it is better to sacrifice the short-term reward and take the southern route, which maximizes our long-term reward.
I hope this illustration helps in better understanding the algorithm.
References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Weng, Lilian. (2018, February 19). A (Long) Peek into Reinforcement Learning. Retrieved from https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html