## basics

- references
 - https://github.com/pytorch/examples/tree/main/reinforcement_learning
 - https://towardsdatascience.com/understanding-actor-critic-methods-931b97b6df3f
 - https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

### Policy gradient

- REINFORCE: noisy gradients & high variance (of gradients)
 - updates the policy parameters ($\theta$) through Monte Carlo updates (i.e. from randomly sampled trajectories)
 - this introduces inherently high variability in
 - the log probabilities (log of the policy distribution): $\log\pi_\theta(a_t|s_t)$
 - the cumulative reward values: $G_t$
 - because the trajectories sampled during training can deviate greatly from one another.
- cumulative reward == 0
 - The essence of policy gradient is increasing the probabilities of “good” actions and decreasing those of “bad” actions in the policy distribution;
 - neither “good” nor “bad” actions will be learned if the cumulative reward is 0, because the gradient term is then multiplied by zero.

$$
\nabla_\theta J(\theta)=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t)G_t]
$$
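
As a minimal sketch of the $G_t$ term, the following computes the return at every timestep of one sampled episode by scanning the rewards backwards. It assumes $G_t$ is the discounted reward-to-go; the discounting, the helper name `discounted_returns`, and the example rewards are illustrative assumptions, not taken from the notes.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, ..., G_{T-1}] for one episode (hypothetical helper).

    rewards[t] is the reward received after acting at timestep t.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Scan backwards so each G_t reuses G_{t+1}: G_t = r + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative rewards only
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

If every reward in the episode is zero, every $G_t$ is zero as well, so each term of the gradient estimate vanishes and the episode contributes no learning signal.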

### introduce a baseline $b(s)$

$$
\nabla_\theta J(\theta)=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t)(G_t-b(s_t))]
$$
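
A short standard derivation, added here for completeness, of why a baseline that depends only on the state leaves the gradient unbiased. It assumes a discrete action space; for any fixed $s_t$ the baseline term has zero expectation under the policy:

$$
\begin{split}
\mathbb E_{a_t\sim\pi_\theta(\cdot|s_t)}[\nabla_\theta\log\pi_\theta(a_t|s_t)b(s_t)]&=b(s_t)\sum_{a}\pi_\theta(a|s_t)\nabla_\theta\log\pi_\theta(a|s_t)\\
&=b(s_t)\sum_{a}\nabla_\theta\pi_\theta(a|s_t)\\
&=b(s_t)\nabla_\theta\sum_{a}\pi_\theta(a|s_t)=b(s_t)\nabla_\theta 1=0
\end{split}
$$

So subtracting $b(s_t)$ changes only the variance of the estimator, not its expectation.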

### Actor Critic

- AC
 - Actor: $\pi(a|s)$
 - Critic: $Q(s, a)$
- Critic
 - estimates the value function, which can be either
 - the action value: $Q$ value
 - the state value: $V$ value
 - i.e. the average value over actions at the given state
 - $Q_w(s_t,a_t)$ => Critic neural network that regresses a scalar value;
 - with a $Q$-value critic the method is called Q Actor Critic
- Actor
 - The policy gradient method is the “actor” part of Actor-Critic methods

- both the Critic and Actor functions are parameterized with neural networks. 


$$
\begin{split}
\nabla_\theta J(\theta)&=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t)G_t]\\
&=\sum_{t=0}^{T-1}\mathbb E_{s_0,a_0,\cdots,s_t,a_t}[\nabla_\theta\log\pi_\theta(a_t|s_t)\,\mathbb E_{r_{t+1},s_{t+1},\cdots,r_T,s_T}[G_t|s_t,a_t]]\\
&=\sum_{t=0}^{T-1}\mathbb E_{s_0,a_0,\cdots,s_t,a_t}[\nabla_\theta\log\pi_\theta(a_t|s_t)Q(s_t,a_t)]\\
&=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t) Q_w(s_t,a_t)]
\end{split}
$$

### subtract baseline

$$
A(s_t,a_t) = Q_w(s_t,a_t)-V_v(s_t)
$$

- using the $V$ function as the baseline function,
- we subtract the $V$ value from the $Q$ value.
- the result measures how much better it is to take a specific action compared to the average, general action at the given state:
 - the **advantage value**

$$
\begin{split}
&Q(s_t,a_t)=\mathbb E[r_{t+1}+\gamma V(s_{t+1})]\\
&A(s_t,a_t)=r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)
\end{split}
$$
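
The one-step advantage can be sketched in plain Python as follows. The helper name `td_advantages`, the `bootstrap_value` argument for the state after the last step, and the example numbers are assumptions made for illustration.

```python
def td_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """One-step advantage estimates A_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t).

    rewards[t]      -- reward received after acting at timestep t
    values[t]       -- critic estimate V_v(s_t)
    bootstrap_value -- V_v of the state after the last step (0.0 if terminal)
    """
    next_values = list(values[1:]) + [bootstrap_value]
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]


# Illustrative numbers only
print(td_advantages([1.0, 1.0], [0.5, 0.5], bootstrap_value=0.0, gamma=0.5))
# [0.75, 0.5]
```

Only the state-value critic $V_v$ is needed here, which is why this form avoids training a separate $Q_w$ network.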

### Advantage Actor Critic (A2C)

$$
\begin{split}
\nabla_\theta J(\theta)&=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t) (Q_w(s_t,a_t)-V_v(s_t))]\\
&=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t) A(s_t,a_t)]\\
&=\mathbb E_\tau[\sum_{t=0}^{T-1}\nabla_\theta\log\pi_\theta(a_t|s_t) \left(r_{t+1}+\gamma V_v(s_{t+1})-V_v(s_t)\right)]
\end{split}
$$
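
In practice this gradient is usually obtained by differentiating a scalar surrogate loss. The sketch below only evaluates such losses on plain Python floats to show their structure; in an autograd framework the advantage would be treated as a constant inside the actor term. The helper name `a2c_losses`, the mean reduction, and the squared-error critic loss are assumptions, not something the notes specify.

```python
def a2c_losses(log_probs, advantages):
    """Scalar surrogate losses for one episode (hypothetical helper).

    log_probs[t]  -- log pi_theta(a_t | s_t) of the action actually taken
    advantages[t] -- advantage estimate A(s_t, a_t)
    """
    n = len(log_probs)
    # Minimizing -log_prob * advantage raises the probability of actions
    # with positive advantage and lowers it for negative advantage.
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Squared advantage pushes V_v(s_t) toward r_{t+1} + gamma * V_v(s_{t+1}).
    critic_loss = sum(adv ** 2 for adv in advantages) / n
    return actor_loss, critic_loss


# Illustrative numbers only
actor_loss, critic_loss = a2c_losses([-0.5, -1.0], [2.0, 1.0])
print(actor_loss, critic_loss)  # 1.0 2.5
```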

## implementation

In [4]:
#!pip install -U gym==0.15.3

In [5]:
import sys
import torch 
import gym
import numpy as np 
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
import matplotlib.pyplot as plt
import pandas as pd

In [6]:
# hyperparameters
hidden_size = 256
learning_rate = 3e-4

# Constants
GAMMA = 0.99
num_steps = 300
max_episodes = 3000

In [33]:
env = gym.make("CartPole-v0")

# observation: 4-dimensional, continuous
num_inputs = env.observation_space.shape[0]
# actions: discrete (push left / push right)
num_actions = env.action_space.n
env.reset()

array([-0.01507673, 0.00588999, 0.00869466, 0.00153444])

In [37]:
class ActorCritic(nn.Module):
    def __init__(self, num_inputs, num_actions, hidden_size):
        super(ActorCritic, self).__init__()
        self.num_actions = num_actions

        # num_inputs: dimension of the state vector
        # critic nn: state => value
        self.critic_ln1 = nn.Linear(num_inputs, hidden_size)
        self.critic_ln2 = nn.Linear(hidden_size, 1)

        # actor nn: state => action probabilities (policy)
        self.actor_ln1 = nn.Linear(num_inputs, hidden_size)
        self.actor_ln2 = nn.Linear(hidden_size, num_actions)

    def forward(self, state):
        # ndarray (4,) => float32 tensor (1, 4)
        # the input needs no gradient; gradients flow through the layer parameters
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)

        # forward of critic network
        # (1, 4) => (1, 256)
        value = F.relu(self.critic_ln1(state))
        # (1, 256) => (1, 1)
        value = self.critic_ln2(value)

        # forward of actor network
        # (1, 4) => (1, 256)
        policy_dist = F.relu(self.actor_ln1(state))
        # (1, 256) => (1, 2), normalized into probabilities
        policy_dist = F.softmax(self.actor_ln2(policy_dist), dim=1)
        return value, policy_dist

In [38]:
ac = ActorCritic(num_inputs, num_actions, hidden_size)
ac_opt = optim.Adam(ac.parameters(), lr=learning_rate)

In [41]:
for episode in range(max_episodes):

    # per-step buffers, all of the same length
    # the list index is the timestep t
    log_probs = []
    values = []
    rewards = []

    state = env.reset()

    for step in range(num_steps):
        value, policy_dist = ac(state)

        # debug: tensors as returned by the network
        print(value.shape, value)
        print(policy_dist.shape, policy_dist)

        value = value.detach().numpy()[0, 0]
        dist = policy_dist.detach().numpy()

        # debug: detached NumPy copies
        print(value.shape, value)
        print(dist.shape, dist)

        # sample an action from the policy and keep its log-probability
        action = np.random.choice(num_actions, p=np.squeeze(dist))
        log_prob = torch.log(policy_dist.squeeze(0)[action])

        new_state, reward, done, _ = env.step(action)

        rewards.append(reward)
        values.append(value)
        log_probs.append(log_prob)

        state = new_state

        if done or step == num_steps - 1:
            break
    # debug: stop after the first episode
    break

torch.Size([1, 1]) tensor([[0.0335]], grad_fn=)
torch.Size([1, 2]) tensor([[0.5095, 0.4905]], grad_fn=)
() 0.03351228
(1, 2) [[0.5095114 0.49048853]]
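
The sampling step in the loop above (softmax output, then `np.random.choice`, then `torch.log`) can be sketched without PyTorch or NumPy to make the arithmetic explicit. This is a standard-library illustration only; the helper names `softmax` and `sample_action` and the example logits are assumptions and not part of the notebook.

```python
import math
import random


def softmax(logits):
    """Turn raw actor outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Sample an action index from probs and return (action, log_prob)."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


probs = softmax([0.0, 0.0])  # equal logits => uniform policy
action, log_prob = sample_action(probs, random.Random(0))
print(probs)  # [0.5, 0.5]
```

The printed policy in the cell output, roughly `[0.51, 0.49]`, is close to this uniform case, which is what one would expect from a freshly initialized actor.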
