# 7) Learning to fly (aka policy learning)
The objective of this tutorial is to study how to directly solve an optimal control problem, either by computing a trajectory from the current state to the goal, or by computing a policy. To keep computation times reasonable with simple Python code, we only work with a simple inverted pendulum with limited torque, which must swing up to reach the standing configuration. The presented algorithms successively compute an optimal trajectory, a discretized policy with a Q-table and with a linear network, and a continuous policy with a deep neural network.
We need a pendulum model and a neural network.
### Inverted pendulum model
Two models are provided. The first one is continuous and implemented with Pinocchio; it is available in [pendulum.py](pendulum_8py_source.html) as the class `Pendulum`. The code is generic for an N-pendulum; we will use the 1-DOF model. The state is `[q, v]`, the angle and angular velocity of the pendulum, and the control is the joint torque. The pendulum weighs 1 kg and measures 1 m, with its COM at 0.5 m from the joint. Do not forget to start `gepetto-gui` before rendering the model. Each time the simulator is integrated, it returns the new state and the reward for the previous action (implemented as a weighted sum of squared position, velocity and control). The state is stored as a member of the class and can be accessed through `env.x`; do not forget to copy it before modifying it.
```python
from pendulum import Pendulum
from pinocchio.utils import *

env = Pendulum(1)        # Continuous 1-dof pendulum
NX = env.nobs            # Observation dimension (training also converges with [q, qdot], using twice as many neurons)
NU = env.nu              # Control is dim-1: joint torque

x = env.reset()          # Sample an initial state
u = rand(NU)             # Sample a control
x, reward = env.step(u)  # Integrate the simulator for control u and get the reward
env.render()             # Display the model at state env.x
```
A second version of the same model, with discrete dynamics, is provided in [dpendulum.py](dpendulum_8py_source.html). The state is again `[q, v]`, but discretized into NQ positions and NV velocities. The state is then an integer equal to `iq*NV+iv`, ranging from 0 to NQ*NV=NX. Controls are similarly discretized from 0 to NU.
```python
from dpendulum import DPendulum

env = DPendulum()   # Discrete pendulum
NX = env.nx         # Number of (discrete) states
NU = env.nu         # Number of (discrete) controls
```
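As a quick illustration of the state indexing described above, here is a small sketch; the sizes `NQ`, `NV` and the helper functions are hypothetical and purely for illustration, since `DPendulum` handles this conversion internally.

```python
NQ, NV = 21, 11      # assumed discretization sizes

def encode(iq, iv):
    '''Map a (position index, velocity index) pair to a single state index.'''
    return iq * NV + iv

def decode(ix):
    '''Recover (iq, iv) from the state index.'''
    return ix // NV, ix % NV

assert decode(encode(3, 4)) == (3, 4)
```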
Other models could be used. In particular, the API is similar to the one of the Gym library from OpenAI, which you might be interested in browsing and possibly trying with the following algorithms.
### A neural network with optimizers
We will use TensorFlow from Google, which can be installed with pip.

```
pip install --user tensorflow tflearn
```
## 7.1) Optimizing an optimal trajectory
For the first tutorial, we implement a nonlinear optimization program optimizing the cost of a single trajectory for the continuous pendulum. The trajectory is represented by its initial state `x0` and the vector of piecewise-constant controls `U = [u0 ... uT-1]`, with `T` the number of timesteps. The cost is simply the integral of the cost function l(x,u) returned by the pendulum environment.
The integral cost is then optimized, starting from a zero control trajectory, until the pendulum finally reaches a standing state. Observe that the number of swings may be sub-optimal.
The code to optimize the trajectory is available in [ocp.py](ocp_8py_source.html).
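For reference, here is a minimal single-shooting sketch of this idea. The horizon `T`, the use of scipy's BFGS solver and the `rollout_cost` helper are assumptions for illustration; the actual implementation in ocp.py may differ.

```python
import numpy as np
from scipy.optimize import fmin_bfgs
from pendulum import Pendulum

env = Pendulum(1)
T   = 50                         # horizon length (assumed value)
x0  = env.reset().copy()         # fixed initial state

def rollout_cost(U):
    '''Integrate the pendulum from x0 under the piecewise-constant controls U
    and return the accumulated cost (negative of the accumulated reward).'''
    env.x = x0.copy()
    cost = 0.0
    for t in range(T):
        _, reward = env.step(U[t * env.nu:(t + 1) * env.nu])
        cost -= reward
    return cost

U0   = np.zeros(T * env.nu)      # start from a zero control trajectory
Uopt = fmin_bfgs(rollout_cost, U0)
```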
## 7.2) Q-table resolution for discrete pendulum
We now consider the discrete model, with NX states and NU controls. Imagine this model as a chess board with a maze (possibly nonplanar) drawn on it, on which the system must discover a path from an initial state to the final state at the center of the board. When performing a trial, the system is informed of success or failure by receiving reward 1 when reaching the goal, or otherwise 0 after 100 moves. To record the already-explored paths, we store a table of NX-by-NU values, each giving how likely we are to be rewarded when taking action U at state X. This table is named the Q-table, and corresponds to the Hamiltonian (Q-value) of the discrete system. The Q-values can be back-propagated along the table using the Dijkstra algorithm. Since we do not know the goal state(s), the back-propagation is done along random roll-outs inside the maze, which likely converges to an approximation of the exact Hamiltonian. Once the Q-table is computed, the optimal policy is simply chosen by maximizing the vector of Q-values corresponding to the row of the current state.
This algorithm is available in the file [qtable.py](qtable_8py_source.html).
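A minimal sketch of the idea is given below. The hyperparameters (discount `GAMMA`, learning rate `ALPHA`, exploration noise) and the assumption that the discrete environment returns an integer state index and a scalar reward are mine; qtable.py implements the same scheme with its own settings.

```python
import numpy as np
from dpendulum import DPendulum

env = DPendulum()
Q = np.zeros([env.nx, env.nu])   # Q-table, initialized to zero
GAMMA, ALPHA = 0.9, 0.1          # discount factor and learning rate (assumed values)

for episode in range(500):
    x = env.reset()
    for step in range(100):
        # Noisy greedy exploration: take the best known action, plus decaying noise.
        u = np.argmax(Q[x, :] + np.random.randn(env.nu) / (episode + 1))
        x2, reward = env.step(u)
        # Back-propagate the value of the best next action into Q[x, u].
        Q[x, u] += ALPHA * (reward + GAMMA * np.max(Q[x2, :]) - Q[x, u])
        x = x2
```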
## 7.3) Q-table using a linear net
The idea is to similarly approximate the Q-value for the continuous model. Since the continuous model has both infinitely many states and controls, a table cannot represent it. We will rather use a function basis to approximate the Q-value. For the tutorial, we have chosen to use a deep neural net. First, let's use a simple net for storing the Q-table.
Basically, the vector of Q-values for all possible controls u is obtained by multiplying the Q-table by a one-hot vector (0 everywhere except a single 1) corresponding to the state. The optimal policy is then the maximum of this vector: `iu^* = argmax(Q*h(ix))`, with `h(ix) = [ 0 0 ... 0 1 0 ... 0]`, ix and iu being the indexes of the state and control. We use TensorFlow to store the array Q. The Q-value net is simply the multiplication of Q by the one-hot x, and the policy is the argmax of the resulting vector.
Now, the coefficients of Q are the parameters defining the Q-value (and hence the policy). They must be optimized to fit the cost function. From the Hamilton-Jacobi-Bellman equation, we know that `Q(x,u) = l(x,u) + max_u2 Q(f(x,u), u2)`. We optimize the Q-value so that this residual is minimized along the samples collected from successive roll-outs inside the maze.
The implementation of this algorithm is available in [qnet.py](qnet_8py_source.html). Observe that the convergence is not as fast as with the Q-table algorithm.
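Below is a rough sketch of such a linear Q-network, assuming the TensorFlow 1.x API used at the time of the tutorial; the sizes `NX` and `NU`, the learning rate and the `onehot` helper are hypothetical, and qnet.py contains the full training loop.

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x (or tf.compat.v1 with eager execution disabled)

NX, NU = 121, 11          # assumed numbers of discrete states and controls

x      = tf.placeholder(tf.float32, [1, NX])                  # one-hot encoding of the state
Q      = tf.Variable(tf.random_uniform([NX, NU], 0, 0.01))    # the Q-table, stored as trainable weights
qvalue = tf.matmul(x, Q)                                      # row of Q-values for this state
u      = tf.argmax(qvalue, 1)                                 # greedy policy: index of the best control

qref   = tf.placeholder(tf.float32, [1, NU])                  # target Q-values from the HJB residual
loss   = tf.reduce_sum(tf.square(qref - qvalue))
optim  = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

def onehot(ix, n=NX):
    '''Return a 1 x n one-hot row vector with a 1 at index ix.'''
    h = np.zeros([1, n])
    h[0, ix] = 1.0
    return h
```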
## 7.4) Actor-critic network
We will now optimize a continuous policy and the corresponding Q-function, using an "Actor-Critic" method proposed in ["Continuous control with deep reinforcement learning", by Lillicrap et al, arXiv:1509.02971](https://arxiv.org/abs/1509.02971).
Two networks are used to represent the Q-function and the policy. The first network has two inputs: the state x and the control u. Its output is a scalar. It is optimized to minimize the residual of the HJB equation along a batch of sample points collected along previous roll-outs.
The policy function has a single input: the state x. Its output is a control vector u (of dimension 1 for the pendulum). It is optimized to maximize the Q-function, i.e. at each state, Pi(x) corresponds to the maximum over all possible controls u of Q(x,u).
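As a rough sketch of the two network shapes (the layer sizes, torque bound and use of `tf.layers` are assumptions; continuous.py defines the actual architectures):

```python
import tensorflow as tf   # TensorFlow 1.x

NX, NU, NH = 2, 1, 250                        # state/control dimensions, hidden layer size (assumed)

# Critic: Q(x, u) -> scalar value.
x = tf.placeholder(tf.float32, [None, NX])
u = tf.placeholder(tf.float32, [None, NU])
h = tf.layers.dense(tf.concat([x, u], axis=1), NH, activation=tf.nn.relu)
q = tf.layers.dense(h, 1)                     # one scalar Q-value per sample

# Actor: pi(x) -> control, squashed to the torque limits by a tanh.
hp = tf.layers.dense(x, NH, activation=tf.nn.relu)
pi = tf.layers.dense(hp, NU, activation=tf.nn.tanh) * 2.0   # assumed torque bound
```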
Two critical aspects are reported in the paper and implemented in the tutorial. First, we learn over a batch of random samples collected from many previous roll-outs, in order to break the temporal dependency within the batch. Second, we regularize the optimization of both the Q-value (critic) and policy (actor) networks by storing a copy of each network, and only slightly modifying these copies at each step.
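A minimal sketch of those two tricks, with hypothetical names and a flattened weight vector standing in for the TensorFlow variables of continuous.py:

```python
import random
import numpy as np
from collections import deque

# Replay memory: store transitions from many roll-outs, then train on random minibatches.
replay = deque(maxlen=10000)
# ... append (x, u, reward, x2) tuples during the roll-outs ...
# batch = random.sample(replay, 64)

# Soft update of the target copies: move them only slightly toward the learned weights.
theta        = np.random.randn(100)   # stand-in for the actor (or critic) weights
theta_target = theta.copy()           # slowly-tracking copy used to compute the training targets
UPDATE_RATE  = 0.001                  # assumed soft-update rate
theta_target += UPDATE_RATE * (theta - theta_target)
```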
The corresponding algorithm is implemented in the file [continuous.py](continuous_8py_source.html). The training phase requires 100 roll-outs and a few minutes (maybe more on a virtual machine).
## 7.5) Training the network with the OCP solver
Using the OCP solver, you might compute a few optimal trajectories (say 10) starting from various initial conditions. Initialize the replay memory with the 10x50 points composing the 10 optimal trajectories and optimize the network from this replay memory only (without additional roll-outs, but using the same small-size batches). Play with the learning parameters until the network converges.
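A hedged sketch of how the replay memory could be seeded from the OCP solutions; the `solve_ocp` helper and the tuple layout are hypothetical, so reuse whatever structures ocp.py and continuous.py actually provide.

```python
from collections import deque
from pendulum import Pendulum

env = Pendulum(1)
replay = deque()

def solve_ocp(x0):
    '''Placeholder for the trajectory optimizer of Section 7.1, assumed to
    return the optimal control sequence from x0 (hypothetical interface).'''
    raise NotImplementedError

for i in range(10):                       # 10 optimal trajectories
    x0 = env.reset().copy()
    U  = solve_ocp(x0)
    env.x = x0.copy()
    x = x0
    for u in U:                           # replay the trajectory to collect (x, u, r, x2) tuples
        x2, reward = env.step(u)
        replay.append([x, u, reward, x2.copy()])
        x = x2.copy()
```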
When properly implemented, the OCP produces better accuracy than the policy. However, at run-time, the policy is much cheaper to evaluate than solving a new OCP. I am currently considering how to use the network to warm-start or guide the OCP solver at run-time.
The provided solvers (trajectory and policy) run reasonably well for the 1-pendulum. They are more difficult to tune for more complex dynamics, such as a 2-pendulum. You may want to try them on a quadcopter robot (hence the title of the tutorial), but I expect it to be a serious challenge.