Yasunori Toshimitsu | 利光泰徳
my paper pile
The contents of this page are meant to be memos for myself, but I am posting them publicly in the hope that someone else may find them useful, and more importantly, to use peer pressure to motivate myself to keep learning.
DRL-based manipulation in MuJoCo
A. Rajeswaran et al., “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” Sep. 2017.
read: 08.04.2022 - 11.04.2022

code
quick points
- by using human demonstrations, captured through a glove, DRL can learn more efficiently
- (one of?) the first to apply model-free learning to anthropomorphic hands
- IO
- sensed data: hand joint angles, object and target pose
- actions: desired hand joint angles
- for in-hand manipulation, collecting human demo data is difficult, so used “computational expert trained using RL on a well shaped reward for many iterations”
- small noise added to demo data, to combat distribution drift
- want to get stochastic policy \(\pi_\theta: S \times A \rightarrow \mathbb R_+\)
- the “Natural Policy Gradient” algorithm is used to optimize the parameters \(\theta\) of the policy (the what???)
- behavior cloning tries to maximize the likelihood of the policy replicating the demonstrated behavior;
\(\sum_{(s, a) \in \rho_D} \ln \pi_\theta (a|s)\)
- but the cloned policies alone are not successful in completing a task.
- The following gradient, augmented with the demo data, is used, where \(w(s,a)\) is a weighting function that adjusts how much effect the demo data has (the “heuristic” weighting function is described in the paper). DAPG: Demo Augmented Policy Gradient (see the sketch after this list)
\[g_{aug} = \sum_{(s, a) \in \rho_\pi} \nabla_\theta \ln \pi_\theta (a|s) A^\pi(s, a) + \sum_{(s, a) \in \rho_D} \nabla_\theta \ln \pi_\theta (a|s) w(s,a)\]
- what is the reward? sparse or shaped? -> DAPG works with sparse task completion reward
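A minimal sketch (PyTorch assumed, not the authors' implementation) of how the two terms of the demo-augmented gradient could be assembled as a surrogate “loss”; `log_probs_policy`, `advantages`, `log_probs_demo`, and `demo_weights` are hypothetical precomputed tensors, and the paper actually applies a natural policy gradient rather than plain gradient descent, so this only illustrates how the policy and demo terms combine:

```python
import torch

def dapg_surrogate_loss(log_probs_policy: torch.Tensor,
                        advantages: torch.Tensor,
                        log_probs_demo: torch.Tensor,
                        demo_weights: torch.Tensor) -> torch.Tensor:
    """Scalar whose gradient matches g_aug (up to sign).

    log_probs_policy: log pi_theta(a|s) on samples from the current policy
    advantages:       advantage estimates A^pi(s, a) (treated as constants)
    log_probs_demo:   log pi_theta(a|s) evaluated on the demonstration pairs
    demo_weights:     heuristic weights w(s, a) for the demo pairs (constants)
    """
    policy_term = (log_probs_policy * advantages.detach()).sum()
    demo_term = (log_probs_demo * demo_weights.detach()).sum()
    # negate so that gradient *descent* on this "loss" follows +g_aug
    return -(policy_term + demo_term)
```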
bootstrapping?
one of the resampling methods (here, resampling roughly means validating models by using random subsets of the data), which uses random sampling with replacement. Bootstrapping estimates the properties of an estimator (e.g. its variance) by measuring them on samples drawn from an approximating distribution (the height measurement example on Wikipedia is easy to understand).
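A tiny illustration of the statistical bootstrap, along the lines of the Wikipedia height example (the numbers here are made up): estimate the standard error of a sample mean by resampling the data with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)
heights = np.array([172.0, 168.5, 181.2, 175.3, 169.9, 177.4, 171.1, 183.0])  # made-up sample

# draw B bootstrap resamples (same size, with replacement) and record their means
B = 10_000
boot_means = np.array([rng.choice(heights, size=len(heights), replace=True).mean()
                       for _ in range(B)])

print("sample mean:", heights.mean())
print("bootstrap estimate of the std. error of the mean:", boot_means.std(ddof=1))
```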
TD learning, TD error?
temporal difference learning bootstraps from the current estimate of the value function.
TD(0) is the simplest special case of TD methods, estimating the state value function (the expected return when starting from state \(s\) and following policy \(\pi\)) of a finite-state MDP.
The value function satisfies the Bellman equation (the discrete-time analogue of the HJB equation, which is the necessary and sufficient condition for optimal control):
\[V^{\pi}(s) = \mathbb E_{\pi}\{r_{0}+\gamma V^{\pi}(s_{1}) \mid s_{0}=s\}\]
The update rule is:
\[V(s)\leftarrow V(s)+\alpha \overbrace{(\underbrace{r+\gamma V(s')}_{\text{TD target}}-V(s))}^{\text{TD error}}\]
The TD error is the difference between the current estimate of \(V\) and the actual reward gained plus the discounted estimate of the next state's value. But is it more effective to estimate the action-value function instead?
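A minimal tabular TD(0) sketch of the update rule above; the environment interface (`env.reset`, `env.step`, `policy`) is a placeholder assumption, not from any particular library:

```python
import numpy as np

def td0_value_estimation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate V^pi by bootstrapping from the current estimate.

    Assumes a gym-like env with integer states: env.reset() -> s,
    env.step(a) -> (s_next, r, done), and policy(s) -> a.  (Placeholder interface.)
    """
    V = np.zeros(env.num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_target = r + gamma * V[s_next] * (not done)
            td_error = td_target - V[s]
            V[s] += alpha * td_error
            s = s_next
    return V
```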
SARSA
on-policy TD control method, where the estimate of \(q_\pi (s_t, a_t)\) depends on the current policy \(\pi\), and it is assumed that this policy will be followed for the rest of the agent's episode. A problem is that it is sensitive to the current policy that the agent is following.
off-policy vs on-policy?
- on-policy: assumes the current policy is used by the agent for the whole learning process. SARSA is on-policy. “more stable, and scale well to high dimensional spaces”
- off-policy: uses some other policy (like greedy search) during learning. Q-learning is off-policy. “more sample efficient when successful, but more unstable”
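A hedged sketch of the contrast, assuming a tabular Q array and a transition \((s, a, r, s')\): SARSA bootstraps from the action the behavior policy actually takes next (on-policy), while Q-learning bootstraps from the greedy action (off-policy).

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # on-policy: use the action the current (e.g. epsilon-greedy) policy actually selects next
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    # off-policy: use the greedy action, regardless of what the behavior policy will do
    return r + gamma * np.max(Q[s_next])

# either target is then plugged into the same update:
# Q[s, a] += alpha * (target - Q[s, a])
```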
distribution drift?
- when a policy cloned from demonstrations makes small mistakes, it ends up in states the demos never covered, so the state distribution it encounters drifts away from the demo distribution and the errors compound (this is why small noise is added to the demo data above)
Deep Deterministic Policy Gradient (DDPG) & Normalized Advantage Function (NAF)
T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Sep. 2015.
read 20220423-20220427
https://keras.io/examples/rl/ddpg_pendulum/
2017 presentation
watched 20220427
- model-free off-policy algorithm for learning continuous actions, combining ideas from DPG and DQN.
- in Rajeswaran2017, mentioned “sample efficient but very sensitive to hyperparameters and random seeds”
- cited >8000 !!
from paper…
- model-free, off-policy, actor-critic
- key improvements by DQN which enabled the learning of large non-linear value functions
- network training off-policy with samples from a replay buffer
- network is trained with a “target Q network”
- about DPG
- actor function \(\mu(s|\theta^\mu)\) deterministically maps state to action
- critic \(Q(s,a)\) is learned using Bellman equation
- actor function updated by gradient of expected return from start \(J\) wrt to \(\theta^\mu\) (actor parameters)
\[\nabla_{\theta^\mu}J \approx \mathbb E_{s_t \sim \rho^\beta}\left[\nabla_{\theta^\mu}Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t|\theta^\mu)}\right] = \mathbb E_{s_t \sim \rho^\beta}\left[\nabla_a Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_t}\right]\]
- this has been proven to be the gradient of the policy’s performance, i.e. the policy gradient.
- what is \(\rho^\beta\)? -> the state visitation distribution under the behavior policy \(\beta\), i.e. the states sampled from the replay buffer
- this computes the expected value (over states \(s_t\) from the replay buffer) of the gradient of the current Q function wrt the actor parameters, with the action given by the actor function.
- what is new with DDPG (wrt DPG)?
- modifications inspired by DQN, allowing it to use NNs to learn in large state and action spaces online (so… it keeps the same core concepts as DPG?)
- add a replay buffer like in DQN, which stores the tuples \((s_t, a_t, r_t, s_{t+1})\). Seems to be a fixed-size FIFO queue (oldest samples are discarded); see the sketch after this list
- add “batch normalization”
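A minimal sketch of such a replay buffer; this is the usual fixed-size FIFO implementation with uniform random sampling, not code from the paper:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity: int = 1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        # uniform random minibatch (indices drawn without replacement)
        idx = random.sample(range(len(self.buffer)), batch_size)
        return [self.buffer[i] for i in idx]
```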

from video…
- uses off-policy critic \(Q_w\) which learns from replay buffer
- for on-policy learning, a single objective: relentlessly improve the policy
- for off-policy, two objectives: improve the off-policy critic which looks at the replay buffer, and to maximize the value of the critic
- DQN is suitable when action space is discrete / small (difficult to run the “max” step)
- actor-critic structure
- actor sends actions to the environment
- critic outputs a single scalar value indicating expected cumulative reward

DeepMind Lecture: Reinforcement Learning 6: Policy Gradients and Actor Critics
watched 20220429-20220510
Never solve a more general problem as an intermediate step - Vladimir Vapnik, 1998
-> why not learn a policy directly (rather than learning the Q value)

- approximating parametric value functions vs parametric policy directly

- sometimes policies are simple, while models and values are complex
- policy-based is subject to local optima, and obtained knowledge is specific and does not generalize well
- being able to learn stochastic policies is beneficial because…
- policy could be easily guessed, think rock-paper-scissors
- deterministic policies could get stuck (20:00)
- compute the policy gradient, i.e. gradient of an objective function for the policy \(J(\theta)\)
- use Monte Carlo sampling for the gradient, but a problem is that the gradient of an expectation cannot be sampled directly. So it is converted (via the score function trick / log-likelihood trick) into an expectation of a gradient, which can be estimated by sampling.
- Policy gradient theorem: the instantaneous reward (in the score function trick) can be replaced with the long-term value (45:53)
- “copies / variants” of \(\pi^m\) enable simultaneous training on multiple simulations
- the “loss” is in quotes because it is only there to “trick” the DL framework into computing the right gradient, which is the update rule that we actually want (see the sketch after this list)
- A2C: advantage actor-critic
- Trust region policy optimization
- careful with updates: bad policy leads to bad data. On the other hand in supervised learning, learning and data are independent.
- one solution is to regularize policy to not change too much, to prevent instability
- limit difference between subsequent policy, e.g. by using KL divergence.
- maximize \(J(\theta) - \nu \, KL(\pi_{old} \| \pi_\theta)\) for some small \(\nu\) (instead of just \(J(\theta)\))
- Often used in practice: trust region policy optimization (TRPO) and proximal policy optimization (PPO) both try to find solutions that are close to the current solution, because that’s safer and more stable.
- Gaussian policy is common in continuous action spaces.
- gradient of the log of the policy is
- \(\nabla \log \pi_\theta(s,a) = \frac{a - \mu(s)}{\sigma^2} \nabla \mu(s)\) (if just the mean is parametrized and the variance is fixed)
- something perhaps even simpler: Cacla (continuous actor-critic learning automaton)
- ![](/assets/2022/cacla.png)
- update the actor whenever the TD error is positive
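A small sketch (PyTorch assumed, `mean_net` is a hypothetical network) tying together the “loss”-in-quotes trick and the Gaussian-policy case above: build \(\log\pi_\theta(a|s)\) for a Gaussian policy with a parametrized mean and fixed variance, weight it by the advantage, and let autograd produce the policy-gradient update.

```python
import torch

def gaussian_pg_surrogate(mean_net, states, actions, advantages, sigma=0.1):
    """Surrogate 'loss' for a Gaussian policy N(mu_theta(s), sigma^2) with fixed sigma.

    Autograd of this scalar reproduces the score-function gradient
    (a - mu(s)) / sigma^2 * grad mu(s), weighted by the advantage.
    """
    mu = mean_net(states)                               # parametrized mean mu_theta(s)
    dist = torch.distributions.Normal(mu, sigma)
    log_probs = dist.log_prob(actions).sum(dim=-1)      # log pi_theta(a|s)
    return -(log_probs * advantages.detach()).mean()    # minimizing this ascends the policy gradient

# usage sketch: loss = gaussian_pg_surrogate(mean_net, s, a, adv); loss.backward()
```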
Much more intuitive explanation of the same topic
<div class="youtube-player" data-id="zR11FLZ"></div>
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-based Acceleration,” Mar. 2016.
Model-based dexterous manipulation
V. Kumar, Y. Tassa, T. Erez, and E. Todorov, “Real-time behaviour synthesis for dynamic hand-manipulation,” May 2014.
I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant optimization for hand manipulation,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Goslar, DEU, Jul. 2012, pp. 137–144.
- referenced as examples of model-based manipulation by Rajeswaran2017
- Emo Todorov is involved in both of these (as with Rajeswaran2017)
MPC based on stochastic MDP modelled with DNN
A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” Sep. 2019.
read 06.06.2022, project website, github
- not sure what state / action space is used… it’s not written in the paper as far as I saw… secret sauce?
- being model-based is more sample efficient
- PDDM (planning with deep dynamics models)
- learn a deep dynamics model as stochastic MDP model (using a fixed covariance did not impact performance that much)
- learn model parameters to maximize log likelihood of the observations
- use “bootstrap ensembles” where \(i = 1, 2 \dots E\) models are prepared, which is a simple and inexpensive way to “capture epistemic uncertainty in network weights”.
- start each model with different initial random weights, and use different batch of data \(D_i\) at each training step.
- use short-horizon MPC (trajectory optimization) for control.
- use gradient-free method (therefore a lot of sampling is required?)
- Random Shooting and Iterative Random-Shooting with Refinement are simple enough (pure random, or random but refined based on best samples)
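A minimal sketch of the random-shooting MPC idea on top of a learned dynamics model; `dynamics_model` and `reward_fn` are placeholders (batched functions assumed), not the paper's code:

```python
import numpy as np

def random_shooting_mpc(state, dynamics_model, reward_fn,
                        horizon=10, num_samples=1000, action_dim=24, action_scale=1.0):
    """Sample random action sequences, roll them out through the learned model,
    and execute only the first action of the best-scoring sequence (then replan)."""
    # candidate action sequences, sampled uniformly at random
    actions = np.random.uniform(-action_scale, action_scale,
                                size=(num_samples, horizon, action_dim))
    returns = np.zeros(num_samples)
    states = np.repeat(state[None, :], num_samples, axis=0)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])        # accumulate predicted reward
        states = dynamics_model(states, actions[:, t])     # predicted next states
    best = np.argmax(returns)
    return actions[best, 0]                                # MPC: apply only the first action
```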
Model-free dexterous manipulation
H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features,” Nov. 2015.
read 19.05.2022
- first to apply RL to in-hand manipulation
- state representation for each finger (6DoF):
- tactile sensor
- proximal joint angle
- distal joint angle
- action: velocity for 2 timesteps (2DoF, finger is underactuated)
- reward function: weighted sum of [sumsq(action)], [pressing force is close to target], [object position is close to target]
- math is way beyond what I can currently understand (“relative entropy policy search”)
- roll a bottle-like object across the table: fairly simple task
- learn in the real world
S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates,” Oct. 2016.
read 21.05.2022
- referenced in Rajeswaran2017
- the (probably famous) experiment by Google with multiple robots learning in parallel
- but I also remember the one where the robot picks up various objects, whereas this one is just a door-opening task (and no vision is used)
- why did this method fall out of fashion? improvement of simulators? shift towards data-hungry methods that are just intractable for physical robots?
- learn policy to open door, from scratch with no priors, in 2.5 hours with 2 robots
- the reward function is a bit “engineered”, as it includes the distance between hand and handle, and the deviation of the door handle & hinge angles from their ideal values (doing this with a sparse reward given only on success is much more difficult)
- the learning curve reflects this, as there is a plateau when the robot can reach the handle (but not yet open the door)
- state and action representation varies slightly in each task but for the door task…
- joint angle, end-effector position and their time derivatives
- door handle angle and hinge angle
- door frame position
- actions are joint velocities
- DDPG and NAF (normalized advantage function) were compared in sim but only latter used for real robot
- NAF “restricts the class of Q-function to the expression below to enable closed-form updates, as in the discrete action case”
\[Q(s,a|\theta^Q) = V(s|\theta^V) + A(s,a|\theta^A), \quad A(s,a|\theta^A) = -\tfrac{1}{2}\left(a - \mu(s|\theta^\mu)\right)^T P(s|\theta^P)\left(a - \mu(s|\theta^\mu)\right)\]
<div class="youtube-player" data-id="ZhsEKTo7V04"></div>
potential field-based obstacle avoidance
O. Khatib, “Real-Time Obstacle Avoidance for Manipulators and Mobile Robots,” Int. J. Rob. Res., vol. 5, no. 1, pp. 90–98, Mar. 1986.
read: 10.04.2022
- somehow there are two versions of this paper, in 1985 and 1986
- using potential fields for obstacle avoidance is basically common sense by now
- much of the paper is about manipulators rather than mobile robot navigation, even though the latter feels like the more obvious application of this method
- could possibly replace planning methods like A*
- acknowledges that local minima are an issue
- describes potential field functions for various primitives
- Define potential energy towards goal and against obstacles (FIRAS, “Force Inducing an Artificial Repulsion from the Surface”, but in the French word order?)
\[F_{x_d} = -\operatorname{grad}[U_{x_d}(x)] \\
F_{O} = -\operatorname{grad}[U_{O}(x)]\]
\[U_{O}(x) =
\begin{cases}
\frac{1}{2}\mu(\frac{1}{\rho} - \frac{1}{\rho_0})^2 & \text{if} \; \rho \leq \rho_0 \\
0 & \text{otherwise}
\end{cases}\]
where \(\rho\) is the shortest distance to the obstacle, and the \(\rho_0\) is the limit of influence of the potential field.
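A small numerical sketch of the resulting repulsive force \(F_O = -\operatorname{grad}[U_O(x)]\); it assumes the distance to the obstacle and its gradient are available, with `grad_rho` as a placeholder for \(\partial\rho/\partial x\) and `eta` standing in for the gain \(\mu\) in the formula above:

```python
import numpy as np

def firas_repulsive_force(rho: float, grad_rho: np.ndarray, eta: float = 1.0, rho_0: float = 0.5):
    """Repulsive force of the FIRAS-style artificial potential field.

    rho:      shortest distance from the point to the obstacle
    grad_rho: gradient of that distance wrt the point's position (points away from the obstacle)
    eta:      gain (mu in the note), rho_0: limit of influence of the field
    """
    if rho > rho_0:
        return np.zeros_like(grad_rho)          # outside the region of influence: no force
    # d/dx [ 1/2 * eta * (1/rho - 1/rho_0)^2 ] = -eta * (1/rho - 1/rho_0) * (1/rho^2) * grad_rho
    # and F = -dU/dx, so:
    return eta * (1.0 / rho - 1.0 / rho_0) * (1.0 / rho**2) * grad_rho
```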
RL-based walking with ANYmal
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,” Science Robotics, vol. 7, no. 62, p. eabk2822, 2022.
proximal policy optimization
Visual servoing of soft robots
H. Wang, B. Yang, Y. Liu, W. Chen, X. Liang, and R. Pfeifer, “Visual Servoing of Soft Robot Manipulator in Constrained Environments With an Adaptive Controller,” IEEE/ASME Trans. Mechatron., vol. 22, no. 1, pp. 41–50, Feb. 2017.
- collect the camera intrinsic and extrinsic parameters and the base frame of the robot into the matrix \(A \in \mathbb R^{2 \times 6}\) and vector \(b\), as
\[\dot y = \frac{1}{z(q)} A(y, q) \begin{bmatrix} v \\ w \end{bmatrix} \\
\dot z = b(q) \begin{bmatrix} v \\ w \end{bmatrix}\]
where \(y\) is image coords, \(v\) and \(w\) are end effector velocity (pos & rot).
- \(A\) and \(b\) are somehow parametrized by the vector \(a\) (not really clear how?), and an adaptive law is defined, and the convergence is proven through Lyapunov analysis
- camera is at robot tip, and tries to control the position of a feature point in the image (a dot)
- seems like a pretty straightforward application of adaptive control + visual servoing to soft robots. wish there was a video…
Apple Machine Learning Research
read: 20.04.2022
not an academic paper, but an interesting look into how Apple implements ML in their products.
- to make the NN compact enough to run on a mobile device, a “teacher-student” training method was used (cites this paper); see the sketch after this list.
This approach provided us a mechanism to train a second thin-and-deep network (the “student”), in such a way that it matched very closely the outputs of the big complex network (the “teacher”) that we had trained as described previously.
- Panoptic segmentation unifies scene-level and subject-level understanding by predicting two attributes for each pixel: a categorical label and a subject label.
- distinguish between sky, person, hair, skin, teeth, and glasses
- “Detection Transformer” (DETR) architecture is used as a baseline (but not efficient when extended to panoptic segmentation); thus, “HyperDETR” is proposed, which integrates panoptic segmentation into the architecture.
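A generic sketch of the teacher-student (distillation) loss mentioned above, not Apple's actual training code; the temperature `T` and the weighting `alpha` are standard choices I am assuming here:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Match the student's softened outputs to the teacher's, plus the usual supervised loss."""
    # soft targets: KL between temperature-softened teacher and student distributions
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)   # standard label loss
    return alpha * soft + (1 - alpha) * hard
```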

Panoptic segmentation
A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic Segmentation,” Jan. 2018.
- what Apple uses for image processing
Semantic segmentation
my Jupyter notebook
tbh so much simpler than I imagined: the learning process is exactly the same as the standard ML tutorial, and the very same network (U-Net in this case) could in principle also be used for classification (as long as the output dimensions are correct). It's the network structure that makes it better suited for segmentation, but in theory, segmentation could also be done with a conventional MLP. It sounds like dataset collection & annotation is the hard part…
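A sketch of that point (PyTorch assumed, not taken from my notebook): the segmentation training step is just the usual classification loop, only with a per-pixel cross-entropy over a (N, C, H, W) prediction map.

```python
import torch
import torch.nn.functional as F

def segmentation_train_step(model, optimizer, images, masks):
    """images: (N, 3, H, W) float tensor; masks: (N, H, W) long tensor of class ids.

    Identical to a classification step, except the logits keep their spatial dimensions
    and the cross-entropy is averaged over every pixel.
    """
    logits = model(images)                    # (N, num_classes, H, W), e.g. from a U-Net
    loss = F.cross_entropy(logits, masks)     # per-pixel cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```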
Foundation Models
R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” Aug. 2021.
read: 20220422
- this is a super long (>200p) review so I mostly only read the first part and the robotics part
- BERT, DALL-E, GPT-3 are all examples of foundation models
- transformer architecture
demonstrate the importance of capturing long-range dependencies and pairwise or higher-order interactions between elements.
- in robotics, “task specification” and “task learning” are important opportunities for foundation models
- how to specify the goal with enough clarity that robots can “understand” it, i.e. how to convert it into a “reward function”
- how robots can achieve that goal
- not sure what type of data is needed
- different problems have different input-output data
Soft Robotics for Delicate and Dexterous Manipulation (ICRA2021 talk)
Prof. Robert J. Wood’s talk at ICRA
- an application where soft grippers make a lot of sense; exploration of marine life
- apply soft material to not just simple gripping, but dexterous manipulation as well
- add more DoFs in a serial vs parallel manner
- computational design with “SoMo” (based on PyBullet) to find most effective config (31:00)
- use “palm” to assist manipulation (IHM = in-hand manipulation) (33:39)
- tune friction of fingertip using bubble-wrap like material
C. B. Teeple, B. Aktas, M. C. Yuen, G. R. Kim, R. D. Howe, and R. J. Wood, “Controlling palm-object interactions via friction for enhanced in-hand manipulation,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 2258–2265, Apr. 2022.
- I feel this would be a good paper to see how to approach research of soft gripper-based manipulation
M. A. Graule, T. P. McCarthy, C. B. Teeple, J. Werfel, and R. J. Wood, “SoMoGym: A toolkit for developing and evaluating controllers and reinforcement learning algorithms for soft robots,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 4071–4078, Apr. 2022.
https://github.com/GrauleM/somogym
- create custom simulation environments for soft robots to be used for learning?
fiddle around with ML & statistics libraries, sample code, etc.
Dense Visual Representations
P. O. Pinheiro, A. Almahairi, R. Y. Benmalek, F. Golemo, and A. Courville, “Unsupervised Learning of Dense Visual Representations,” Nov. 2020.
presentation