Yasunori Toshimitsu  利光泰徳
my paper pile
The contents of this page are meant to be memos for myself, but I am posting them publicly in the hope that someone else may find them useful and, more importantly, to motivate myself through peer pressure to learn more.
DRL-based manipulation in MuJoCo
A. Rajeswaran et al., “Learning Complex Dexterous Manipulation with Deep
Reinforcement Learning and Demonstrations,” Sep. 2017.
read: 08.04.2022 - 11.04.2022
code
quick points
 by using human demonstrations, captured through a glove, DRL can learn more efficiently
 (one of?) the first to apply model-free learning to anthropomorphic hands
 IO
 sensed data: hand joint angles, object and target pose
 actions: desired hand joint angles
 for in-hand manipulation, collecting human demo data is difficult, so used “computational expert trained using RL on a well shaped reward for many iterations”

small noise added to demo data, to combat distribution drift
 want to get stochastic policy \(\pi_\theta: S \times A \rightarrow \mathbb R_+\)

The “Natural Policy Gradient” algorithm used to optimize the parameters \(\theta\) of the policy (the what???)
 behavior cloning tries to maximize the likelihood for the policy to replicate the demonstrated behavior;
\(\Sigma_{(s, a) \in \rho_D} ln(\pi_\theta (a|s))\)
 but the cloned policies alone are not successful in completing a task.
 The following gradient augmented with the demo data is used, where \(w(s,a)\) is the weighting function to adjust how much effect demo data has (the “heuristic” weighting function is described in paper). DAPG: Demo Augmented Policy Gradient
\[g_{aug} = \Sigma_{(s, a) \in \rho_\pi} \nabla_\theta ln(\pi_\theta (a|s)) A^\pi(s, a) + \\
\Sigma_{(s, a) \in \rho_D} \nabla_\theta ln(\pi_\theta (a|s)) w(s,a)\]
 what is the reward? sparse or shaped? -> DAPG works with sparse task completion reward
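To make the augmented gradient concrete for myself, here is a minimal sketch. Everything in it is an assumption for illustration: a tabular softmax policy with 2 states and 2 actions, made-up advantage estimates, and a constant weight \(w(s,a) = 0.1\) instead of the heuristic weighting from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(theta, s, a):
    """Gradient of ln pi_theta(a|s) for a tabular softmax policy.

    theta[s] holds one logit per action. The gradient is non-zero only in
    row s, where it equals one_hot(a) - pi(.|s).
    """
    probs = softmax(theta[s])
    grad = [[0.0] * len(row) for row in theta]
    for b in range(len(probs)):
        grad[s][b] = (1.0 if b == a else 0.0) - probs[b]
    return grad

def dapg_gradient(theta, on_policy, demos, w):
    """g_aug = sum over rho_pi of grad ln pi * A, plus sum over rho_D of grad ln pi * w."""
    g = [[0.0] * len(row) for row in theta]

    def accumulate(grad, scale):
        for i, row in enumerate(grad):
            for j, value in enumerate(row):
                g[i][j] += scale * value

    for s, a, advantage in on_policy:      # first term: on-policy samples
        accumulate(grad_log_pi(theta, s, a), advantage)
    for s, a in demos:                     # second term: demonstration samples
        accumulate(grad_log_pi(theta, s, a), w(s, a))
    return g

theta = [[0.0, 0.0], [0.0, 0.0]]           # uniform policy (hypothetical)
on_policy = [(0, 1, 1.0), (1, 0, -0.5)]    # (state, action, advantage), made up
demos = [(0, 1), (1, 1)]                   # demonstrated (state, action), made up
g_aug = dapg_gradient(theta, on_policy, demos, w=lambda s, a: 0.1)
```

Setting `w` to zero recovers the plain policy gradient term, and dropping the on-policy term leaves the gradient of the behavior cloning objective.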
Bootstrapping is one of the resampling methods (here probably meaning to validate models by using random subsets), which use random sampling with replacement. It estimates the properties of an estimator (e.g. its variance) by measuring them when sampling from an approximating distribution. (the height measurement example in Wikipedia is easy to understand)
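A small sketch of bootstrap resampling in the spirit of the height example. The height values are made up, and the helper name `bootstrap_std_error` is my own; it estimates the standard error of the sample mean by resampling with replacement.

```python
import random
import statistics

def bootstrap_std_error(sample, estimator, n_resamples=2000, seed=0):
    """Estimate the standard error of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)   # random sampling with replacement
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

# Hypothetical height measurements in cm
heights_cm = [162.0, 171.5, 158.3, 180.2, 175.0, 168.4, 169.9, 177.1, 165.2, 172.8]

se_boot = bootstrap_std_error(heights_cm, statistics.mean)
se_formula = statistics.stdev(heights_cm) / len(heights_cm) ** 0.5
```

For the mean, the bootstrap estimate should land close to the textbook formula \(s / \sqrt{n}\). The point of bootstrapping is that the same procedure also works for estimators that have no such closed form.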
TD learning, TD error?
temporal difference learning bootstraps from the current estimate of the value function.
TD(0) method is a special simple case of TD methods, estimating state value function (the expected return of starting from state \(s\) and following policy \(\pi\) ) of a finite state MDP.
\(V^\pi\) satisfies the Bellman equation (the discrete-time counterpart of the HJB equation, which is the necessary and sufficient condition for optimal control):
\[V^{\pi }(s)=E_{\pi }\{r_{0}+\gamma V^{\pi }(s_{1}) \mid s_{0}=s\}\]
The update rule is:
\[V(s)\leftarrow V(s)+\alpha \overbrace{(\underbrace {r+\gamma V(s')} _{\text{TD target}} - V(s))}^{\text{TD error}}\]
TD error is the difference between the TD target (the actual reward gained plus the discounted estimate for the next state) and the current estimate for \(V\). But more effective to estimate the action-value pair?
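The update rule can be tried on a toy problem. This is only a sketch under my own assumptions: a deterministic two-step chain (state 0 -> state 1 -> terminal) with reward 1 on the last step, so the true values are \(V(1) = 1\) and \(V(0) = \gamma\).

```python
def td0_chain(n_episodes=200, alpha=0.5, gamma=0.9):
    """Tabular TD(0) on a deterministic chain: 0 -> 1 -> terminal (state 2)."""
    V = [0.0, 0.0, 0.0]                 # V[2] is the terminal state and stays 0
    for _ in range(n_episodes):
        s = 0
        while s != 2:
            s_next = s + 1
            r = 1.0 if s_next == 2 else 0.0
            td_target = r + gamma * V[s_next]
            td_error = td_target - V[s]
            V[s] += alpha * td_error    # V(s) <- V(s) + alpha * TD error
            s = s_next
    return V

V = td0_chain()
```

The value of state 0 is learned only through the estimate for state 1, which is the "bootstrapping from the current estimate" part of TD learning.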
SARSA
on-policy TD control method, where the estimate of \(q_\pi (s_t, a_t)\) depends on the current policy \(\pi\) and assumes that this policy will be followed for the rest of the agent’s episode. A drawback is that it is sensitive to the policy that the agent is currently following.
off-policy vs on-policy?
 on-policy: learns the value of the policy that the agent is actually following (assumes the current policy is used throughout learning). SARSA is on-policy. “more stable, and scale well to high dimensional spaces”
 off-policy: learns about a different policy (e.g. the greedy one) than the one used to collect the data. Q-learning is off-policy. “more sample efficient when successful, but more unstable”
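The difference shows up in a single line of the tabular update rules. A minimal sketch, with a made-up Q table and step sizes:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a_next that the agent actually takes next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the greedy action in s_next, whatever is taken next."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# Same (hypothetical) starting table for both: Q[state][action]
Q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
Q_qlearn = [[0.0, 0.0], [1.0, 5.0]]

# The agent explores and takes the non-greedy action 0 in state 1.
sarsa_update(Q_sarsa, s=0, a=0, r=0.0, s_next=1, a_next=0)
q_learning_update(Q_qlearn, s=0, a=0, r=0.0, s_next=1)
```

SARSA's target uses the value of the exploratory action, so its estimate reflects the behavior policy. Q-learning's target ignores what the agent does next and uses the greedy value.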
distribution drift?
DDPG: Deep Deterministic Policy Gradient (DDPG) & Normalized Advantage Function (NAF)
T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Sep. 2015.
read 2022-04-23 - 2022-04-27
https://keras.io/examples/rl/ddpg_pendulum/
2017 presentation
watched 2022-04-27
 model-free off-policy algorithm for learning continuous actions, combining ideas from DPG and DQN.
 in Rajeswaran2017, mentioned “sample efficient but very sensitive to hyperparameters and random seeds”
 cited >8000 !!
from paper…
 model-free, off-policy, actor-critic
 key improvements by DQN which enabled the learning of large nonlinear value functions
 network training off-policy with samples from a replay buffer
 network is trained with a “target Q network”
 about DPG

 actor function \(\mu(s|\theta^\mu)\) deterministically maps state to action
 critic \(Q(s,a)\) is learned using Bellman equation
 actor function updated by gradient of expected return from start \(J\) wrt \(\theta^\mu\) (actor parameters)
\(\nabla_{\theta^\mu}J \approx \mathbb E_{s_t \sim \rho^\beta}[\nabla_{\theta^\mu}Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t|\theta^\mu)}]\\
= \mathbb E_{s_t \sim \rho^\beta}[\nabla_a Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_t}]\)
 this has been proven to be the gradient of the policy’s performance, i.e. the policy gradient.
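 what is \(\rho^\beta\)? the replay buffer? (in the paper it is the state visitation distribution under a behavior policy \(\beta\); sampling states from the replay buffer plays that role in practice)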
 i.e. it computes the expected gradient of the current Q function wrt the actor parameters, evaluated at \(s_t\) and at the action chosen by the actor function.
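The chain rule in the actor update can be checked on a scalar toy problem. All functions below are my own assumptions for illustration: a critic known in closed form, \(Q(s,a) = -(a - 2s)^2\), and a linear actor \(\mu(s|\theta) = \theta s\), so the best actor parameter is \(\theta = 2\).

```python
def q_value(s, a):
    # Hypothetical critic, known in closed form: the best action is a = 2 * s.
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    # Analytic derivative of q_value with respect to the action.
    return -2.0 * (a - 2.0 * s)

def actor(theta, s):
    # Deterministic linear actor mu(s | theta) = theta * s.
    return theta * s

def dmu_dtheta(s):
    return s

def dpg_gradient(theta, states):
    """Average of grad_a Q(s, a) at a = mu(s), times grad_theta mu(s)."""
    total = 0.0
    for s in states:
        a = actor(theta, s)
        total += dq_da(s, a) * dmu_dtheta(s)
    return total / len(states)

states = [0.5, 1.0, 1.5, 2.0]   # stands in for states sampled from the replay buffer
theta = 0.0
for _ in range(200):
    theta += 0.05 * dpg_gradient(theta, states)   # gradient ascent on J

final_q = [q_value(s, actor(theta, s)) for s in states]
```

In DDPG both the critic and the actor are neural networks and the derivatives come from backpropagation, but the structure of the update is the same product of two gradients.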
 what is new with DDPG (wrt DPG)
 modifications inspired by DQN, allowing it to use NN to learn in large state and action spaces online (so… it has same concepts as DPG?)
 add replay buffer like in DQN, which stores the tuple \((s_t, a_t, r_t, s_{t+1})\). It is a fixed-size FIFO buffer: when it is full, the oldest samples are discarded.
 add “batch normalization”
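A minimal replay buffer sketch. The class name and interface are my own, not from the paper or any library. The paper states that the oldest samples are discarded when the buffer is full, which `collections.deque` with `maxlen` does automatically.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) transitions with uniform sampling."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)   # drops the oldest item when full
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform minibatch without replacement, breaking temporal correlation
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(s=t, a=0.0, r=1.0, s_next=t + 1)
batch = buffer.sample(batch_size=2)
```

After five insertions into a buffer of capacity three, only the transitions starting at states 2, 3 and 4 remain.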
from video…
 uses off-policy critic \(Q_w\) which learns from replay buffer
 for on-policy learning, single objective: relentlessly improve policy
 for off-policy, two objectives: improve the off-policy critic which looks at the replay buffer, and maximize the value of the critic
 DQN is suitable when action space is discrete / small (otherwise it is difficult to run the “max” step over actions)
 actor-critic structure
 actor sends actions to environment
 critic outputs a single scalar value indicating expected cumulative reward
DeepMind Lecture: Reinforcement Learning 6: Policy Gradients and Actor Critics
watched 2022-04-29 - 2022-05-10
Never solve a more general problem as an intermediate step - Vladimir Vapnik, 1998
-> why not learn a policy directly (rather than learning the Q value)
 approximating parametric value functions vs parametric policy directly
 sometimes policies are simple, while models and values are complex
 policy-based is subject to local optima, and obtained knowledge is specific and does not generalize well
 being able to learn stochastic policies is beneficial because…
 a deterministic policy could be easily guessed, think rock-paper-scissors
 deterministic policies could get stuck (20:00)
 compute the policy gradient, i.e. gradient of an objective function for the policy \(J(\theta)\)
 use Monte Carlo sampling for the gradient, but a problem is that the gradient of an expectation cannot be sampled directly. So it is converted (the score function trick / log likelihood trick) into an expectation of a gradient, which can be sampled.
 Policy gradient theorem: instantaneous reward (in the score function trick) can be replaced with long-term value (45:53)
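The score function trick, \(\nabla_\theta \mathbb E[f(x)] = \mathbb E[f(x) \nabla_\theta \log p_\theta(x)]\), can be checked numerically. This sketch uses my own toy setup rather than anything from the lecture: \(x \sim \mathcal N(\theta, 1)\) and \(f(x) = x^2\), for which the exact gradient is \(2\theta\).

```python
import random

def score_function_gradient(theta, f, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/dtheta E[f(x)] for x ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)
        score = x - theta            # d/dtheta of ln N(x; theta, 1)
        total += f(x) * score        # sample of f(x) * grad log p(x)
    return total / n_samples

theta = 1.0
estimate = score_function_gradient(theta, f=lambda x: x * x)
exact = 2.0 * theta                  # since E[x^2] = theta^2 + 1
```

Note that `f` is only evaluated, never differentiated, which is why the same trick works when the reward comes from an environment. The estimate is noisy, which is one motivation for baselines and critics.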

 “copies / variants” of \(\pi^m\) enable simultaneous training on multiple simulations
 the “loss” is in quotes because it is intended to “trick” the DL program to compute the gradient, which then is the update rule that we want
 A2C: advantage actorcritic
 Trust region policy optimization
 careful with updates: bad policy leads to bad data. On the other hand in supervised learning, learning and data are independent.
 one solution is to regularize policy to not change too much, to prevent instability
 limit difference between subsequent policies, e.g. by using KL divergence.
 maximize \(J(\theta) - \nu KL(\pi_{old} \| \pi_\theta)\)
for some small \(\nu\) (instead of just \(J(\theta)\))
 Often used in practice: trust region policy optimization (TRPO) and proximal policy optimization (PPO) both try to find solutions that are close to the current solution, because that’s safer and more stable.
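A sketch of the KL-penalized objective for a one-step problem with two actions. The rewards, the candidate policies and \(\nu = 1\) are all made up for illustration, and this is not how TRPO or PPO are implemented.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists (q must be positive where p is)."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def expected_reward(policy, rewards):
    """J for a one-step problem: sum over actions of pi(a) * r(a)."""
    return sum(p * r for p, r in zip(policy, rewards))

def penalized_objective(pi_old, pi_new, rewards, nu):
    """J(pi_new) - nu * KL(pi_old || pi_new)."""
    return expected_reward(pi_new, rewards) - nu * kl_divergence(pi_old, pi_new)

rewards = [1.0, 0.0]          # hypothetical reward per action
pi_old = [0.5, 0.5]
small_step = [0.55, 0.45]     # candidate close to the old policy
large_step = [0.95, 0.05]     # candidate far from the old policy

score_small = penalized_objective(pi_old, small_step, rewards, nu=1.0)
score_large = penalized_objective(pi_old, large_step, rewards, nu=1.0)
```

With these numbers the large step has the higher raw \(J\), but the KL penalty outweighs the gain, so the small step scores higher on the penalized objective.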
 Gaussian policy is common in continuous action spaces.
 gradient of the log of the policy is
 \(\nabla log \pi_\theta(s,a) = \frac{a - \mu(s)}{\sigma^2} \nabla \mu(s)\) (if just the mean is parametrized and the variance is fixed)
 something perhaps even simpler: Cacla (continuous actor-critic learning automaton)
 ![/assets/2022/cacla.png]
 update the actor whenever the TD error is positive
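As I understand Cacla, the critic is updated with ordinary TD learning, and the actor output is moved toward the explored action only when the TD error is positive. A minimal sketch under my own simplifying assumptions: tabular critic `V`, tabular actor `theta` (one scalar action per state), and made-up step sizes.

```python
def cacla_step(V, theta, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.9):
    """One Cacla-style update. V[s] is the critic, theta[s] is the actor's action for s."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error                 # critic: ordinary TD(0) update
    if td_error > 0:                         # action turned out better than expected
        theta[s] += beta * (a - theta[s])    # move the actor toward the explored action
    return td_error

V = [0.0, 0.0]
theta = [0.0, 0.0]

# Explored action 1.0 gives a positive surprise: actor moves toward it.
d1 = cacla_step(V, theta, s=0, a=1.0, r=1.0, s_next=1)
# Explored action -1.0 gives a negative surprise: actor is left unchanged.
d2 = cacla_step(V, theta, s=0, a=-1.0, r=-1.0, s_next=1)
```

The size of the TD error is not used in the actor update, only its sign, which is what distinguishes this from a policy-gradient step.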
Much more intuitive explanation of the same topic
<div class=“youtubeplayer” dataid=“zR11FLZ”></div>
S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-based Acceleration,” Mar. 2016.
Model-based dexterous manipulation
V. Kumar, Y. Tassa, T. Erez, and E. Todorov, “Real-time behaviour synthesis for dynamic
hand-manipulation,” May 2014.
I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant optimization for hand manipulation,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on
Computer Animation, Goslar, DEU, Jul. 2012, pp. 137–144.
 referenced as examples of model-based manipulation by Rajeswaran2017
 Emo Todorov is involved in both of these (as with Rajeswaran2017)
MPC based on stochastic MDP modelled with DNN
A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” Sep. 2019.
Model-free dexterous manipulation
H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features,” Nov. 2015.
read 19.05.2022
 first to apply RL to in-hand manipulation
 state representation for each finger (6DoF):
 tactile sensor
 proximal joint angle
 distal joint angle
 action: velocity for 2 timesteps (2DoF, finger is underactuated)
 reward function: weighted sum of [sumsq(action)], [pressing force is close to target], [object position is close to target]
 math is way beyond what I can currently understand (“relative entropy policy search”)
 roll a bottle-like object across the table: fairly simple task
 learn in the real world
S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep Reinforcement Learning for Robotic Manipulation with
Asynchronous Off-Policy Updates,” Oct. 2016.
read 21.05.2022
 referenced in Rajeswaran2017
 the (probably famous) experiment by Google with multiple robots learning in parallel
 but I also remember the one where the robot picks up multiple objects, whereas this one is just a door opening task (and no vision is used)
 why did this method fall out of fashion? improvement of simulators? shift towards data-hungry methods that are just intractable for physical robots?
 learn policy to open door, from scratch with no priors, in 2.5 hours with 2 robots
 reward function is a bit “engineered” as it includes distance bet. hand and handle, distance bet. ideal door handle & hinge angle (doing this with sparse rewards only when it succeeds, is much more difficult)
 the learning curve reflects this, as there is a plateau when robot can reach the handle (but not open door)
 state and action representation varies slightly in each task but for the door task…
 joint angle, end-effector position and their time derivatives
 door angle and handle angle
 door frame position
 actions are joint velocities
 DDPG and NAF (normalized advantage function) were compared in sim but only latter used for real robot
 NAF “restricts the class of Q-function to the expression below to enable closed-form updates, as in the discrete action case”
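The expression referred to in that quote did not make it into these notes. As far as I recall from the NAF paper (Gu et al., 2016, listed above), the Q-function is split into a state value and a quadratic advantage term. I write it here in the \((s, a)\) notation of these notes rather than the paper's \((x, u)\), so treat the exact symbols as my assumption:

\[Q(s, a) = V(s) + A(s, a) \\
A(s, a) = -\frac{1}{2} (a - \mu(s))^T P(s) (a - \mu(s))\]

where \(P(s)\) is a state-dependent positive-definite matrix. Since \(A \leq 0\) with equality at \(a = \mu(s)\), the maximizing action is simply \(\mu(s)\), which is what makes the “max” step tractable for continuous actions.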
<div class=“youtubeplayer” dataid=“ZhsEKTo7V04”></div>
potential field-based obstacle avoidance
O. Khatib, “Real-Time Obstacle Avoidance for Manipulators and Mobile
Robots,” Int. J. Rob. Res., vol. 5, no. 1, pp. 90–98, Mar. 1986.
read: 10.04.2022
 somehow there are two versions of this paper, in 1985 and 1986
 basically common sense by now, to use potential fields for obstacle avoidance
 much of paper is about manipulators rather than mobile robot navigation, even though the latter feels like a more obvious application to this method
 possibly replace planning methods like A*
 acknowledges that local minima are an issue
 describes potential field functions for various primitives
 Define potential energy towards goal and against obstacles (FIRAS, “Force Inducing an Artificial Repulsion from the Surface”, but in the French word order?)
\[F_{x_d} = -grad[U_{x_d}(x)] \\
F_{O} = -grad[U_{O}(x)]\]
\[U_{O}(x) =
\begin{cases}
\frac{1}{2}\mu(\frac{1}{\rho} - \frac{1}{\rho_0})^2 & \text{if} \; \rho \leq \rho_0 \\
0 & \text{otherwise}
\end{cases}\]
where \(\rho\) is the shortest distance to the obstacle, and the \(\rho_0\) is the limit of influence of the potential field.
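A 2D sketch of the resulting forces, taking \(F = -grad[U]\). The quadratic attractive potential \(U_{x_d} = \frac{1}{2} k |x - x_d|^2\), the point obstacle, and all gain values are my own assumptions for illustration; the paper covers more general obstacle primitives.

```python
import math

def attractive_force(x, x_goal, k=1.0):
    """F = -grad U for U = 0.5 * k * |x - x_goal|^2 (assumed goal potential)."""
    return [-k * (xi - gi) for xi, gi in zip(x, x_goal)]

def repulsive_force(x, obstacle, rho_0=1.0, mu=1.0):
    """F = -grad U_O for a point obstacle, using the FIRAS potential above."""
    dx = x[0] - obstacle[0]
    dy = x[1] - obstacle[1]
    rho = math.hypot(dx, dy)             # shortest distance to the obstacle
    if rho > rho_0 or rho == 0.0:        # outside the influence limit (or undefined at rho = 0)
        return [0.0, 0.0]
    # -dU/drho = mu * (1/rho - 1/rho_0) / rho^2, directed along grad(rho)
    magnitude = mu * (1.0 / rho - 1.0 / rho_0) / rho ** 2
    return [magnitude * dx / rho, magnitude * dy / rho]

x = (0.5, 0.0)
f_rep = repulsive_force(x, obstacle=(0.0, 0.0))
f_att = attractive_force(x, x_goal=(3.0, 0.0))
f_far = repulsive_force((2.0, 0.0), obstacle=(0.0, 0.0))
f_total = [a + r for a, r in zip(f_att, f_rep)]
```

The repulsive force points away from the obstacle and grows quickly as \(\rho \rightarrow 0\), while it vanishes beyond \(\rho_0\), so far-away obstacles do not disturb the motion toward the goal.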
RL-based walking with ANYmal
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in
the wild,” Science Robotics, vol. 7, no. 62, p. eabk2822, 2022.
proximal policy optimization
Visual servoing of soft robots
H. Wang, B. Yang, Y. Liu, W. Chen, X. Liang, and R. Pfeifer, “Visual Servoing of Soft Robot Manipulator in Constrained
Environments With an Adaptive Controller,” IEEE/ASME Trans. Mechatron., vol. 22, no. 1, pp. 41–50, Feb. 2017.
 collect the camera intrinsic and extrinsic parameters and the base frame of the robot into the matrix \(A \in \mathbb R^{2 \times 6}\) and vector \(b\), as
\[\dot y = \frac{1}{z(q)} A(y, q) \begin{bmatrix} v \\ w \end{bmatrix} \\
\dot z = b(q) \begin{bmatrix} v \\ w \end{bmatrix}\]
where \(y\) is image coords, \(v\) and \(w\) are end effector velocity (pos & rot).
 \(A\) and \(b\) are somehow parametrized by the vector \(a\) (not really clear how?), and an adaptive law is defined, and the convergence is proven through Lyapunov analysis
 camera is at robot tip, and tries to control the position of a feature point in the image (a dot)
 seems like a pretty straightforward application of adaptive control + visual servoing to soft robots. wish there was a video…
Apple Machine Learning Research
read: 20.04.2022
not an academic paper but interesting look into how Apple implements ML into their products.
 to make the NN compact enough to run on a mobile device, a “teacher-student” training method was used (cites this paper).
This approach provided us a mechanism to train a second thin-and-deep network (the “student”), in such a way that it matched very closely the outputs of the big complex network (the “teacher”) that we had trained as described previously.

Panoptic segmentation unifies scene-level and subject-level understanding by predicting two attributes for each pixel: a categorical label and a subject label.
 distinguish between sky, person, hair, skin, teeth, and glasses
 “Detection Transformer” (DETR) architecture used as baseline, but it is not efficient when extended to panoptic segmentation. Thus “HyperDETR” is proposed, which integrates panoptic segmentation into the architecture.
Panoptic segmentation
A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic Segmentation,” Jan. 2018.
 what Apple uses for image processing
Semantic segmentation
my Jupyter notebook
tbh so much simpler than I imagined: the learning process is exactly the same as the standard tutorial for ML, and the very same network (U-Net in this case) could also possibly be used for classification (as long as the output dimensions are correct). It’s the network structure that makes it better suited for segmentation, but in theory, segmentation could also be done with a conventional MLP. It sounds like dataset collection & annotation is the hard part…
Foundation Models
R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” Aug. 2021.
read: 2022-04-22
 this is a super long (>200p) review so I mostly only read the first part and the robotics part
 BERT, DALL-E, GPT-3 are all examples of foundation models
 transformer architecture
demonstrate the importance of capturing long-range dependencies and pairwise or higher-order interactions between elements.
 in robotics, “task specification” and “task learning” are important opportunities for foundation models
 how to specify the goal clearly enough that robots can “understand” it, i.e. convert it into a “reward function”
 how robots can achieve that goal
 not sure what type of data is needed
 different problems have different input-output data
Soft Robotics for Delicate and Dexterous Manipulation (ICRA2021 talk)
Prof. Robert J. Wood’s talk at ICRA
 an application where soft grippers make a lot of sense; exploration of marine life
 apply soft material to not just simple gripping, but dexterous manipulation as well
 add more DoFs in a serial vs parallel manner
 computational design with “SoMo” (based on PyBullet) to find most effective config (31:00)
 use “palm” to assist manipulation (IHM = in-hand manipulation) (33:39)
 tune friction of fingertip using bubble-wrap-like material
C. B. Teeple, B. Aktas, M. C. Yuen, G. R. Kim, R. D. Howe, and R. J. Wood, “Controlling palm-object interactions via friction for enhanced
in-hand manipulation,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 2258–2265, Apr. 2022.
 I feel this would be a good paper to see how to approach research of soft gripper-based manipulation
M. A. Graule, T. P. McCarthy, C. B. Teeple, J. Werfel, and R. J. Wood, “SoMoGym: A toolkit for developing and evaluating controllers
and reinforcement learning algorithms for soft robots,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 4071–4078, Apr. 2022.
https://github.com/GrauleM/somogym
 create custom simulation environments for soft robots to be used for learning?
fiddle around with ML & statistics libraries, sample code, etc.
Dense Visual Representations
P. O. Pinheiro, A. Almahairi, R. Y. Benmalek, F. Golemo, and A. Courville, “Unsupervised Learning of Dense Visual Representations,” Nov. 2020.
presentation