Yasunori Toshimitsu | 利光泰徳

my paper pile

The contents of this page are meant as memos for myself, but I'm posting them publicly in the hope that someone else may find them useful, and more importantly, to use peer pressure to motivate myself to learn more.

DRL-based manipulation in MuJoCo

A. Rajeswaran et al., “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” Sep. 2017.

read: 08.04.2022 - 11.04.2022


quick points

\[g_{aug} = \sum_{(s, a) \in \rho_\pi} \nabla_\theta \ln \pi_\theta (a|s) A^\pi(s, a) + \\ \sum_{(s, a) \in \rho_D} \nabla_\theta \ln \pi_\theta (a|s) w(s,a)\]
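A quick tabular-softmax sketch of this augmented gradient (the sizes, batches, and weights below are all made up for illustration): the demonstration term is just an extra weighted log-likelihood gradient over demo state-action pairs.

```python
import numpy as np

n_states, n_actions = 4, 3   # hypothetical tabular problem
theta = np.zeros((n_states, n_actions))

def grad_log_softmax(s, a):
    # grad_theta ln pi_theta(a|s) for a tabular softmax policy
    p = np.exp(theta[s]) / np.exp(theta[s]).sum()
    g = np.zeros_like(theta)
    g[s] = -p
    g[s, a] += 1.0
    return g

# on-policy samples weighted by the advantage, demo samples by w(s, a)
policy_batch = [(0, 1, 0.5), (1, 2, -0.2)]   # (s, a, A^pi(s, a)), made-up numbers
demo_batch = [(2, 0, 1.0), (3, 1, 1.0)]      # (s, a, w(s, a)), e.g. a decaying weight

g_aug = sum(w * grad_log_softmax(s, a) for s, a, w in policy_batch) \
      + sum(w * grad_log_softmax(s, a) for s, a, w in demo_batch)
```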


one of the resampling methods (resampling here probably means validating models by using random subsets), which uses random sampling with replacement. Bootstrapping estimates the properties of an estimator (e.g. its variance) by measuring them on samples drawn from an approximating distribution. (The height measurement example on Wikipedia is easy to understand.)
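A minimal numpy sketch of the idea, with made-up "height" data: estimate the standard error of the sample mean by resampling with replacement, and compare it to the analytic value.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy "height" measurements (cf. the Wikipedia example); numbers are made up
heights = rng.normal(loc=170.0, scale=7.0, size=50)

# bootstrap: resample with replacement, recompute the estimator each time
n_resamples = 2000
boot_means = np.array([
    rng.choice(heights, size=heights.size, replace=True).mean()
    for _ in range(n_resamples)
])

# the spread of the bootstrap means approximates the standard error of the mean
boot_se = boot_means.std(ddof=1)
analytic_se = heights.std(ddof=1) / np.sqrt(heights.size)
print(f"bootstrap SE = {boot_se:.3f}, analytic SE = {analytic_se:.3f}")
```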

TD learning, TD error?

temporal difference learning bootstraps from the current estimate of the value function.

The TD(0) method is the simplest special case of TD methods: it estimates the state value function (the expected return when starting from state \(s\) and following policy \(\pi\)) of a finite-state MDP.

The HJB (Hamilton–Jacobi–Bellman) equation, which is the necessary and sufficient condition for optimality in optimal control

\[V^{\pi }(s)=E_{\pi }\{r_{0}+\gamma V^{\pi }(s_{1})|s_{0}=s\}\]

The update rule is:

\[V(s)\leftarrow V(s)+\alpha \overbrace{(\underbrace{r+\gamma V(s')}_{\text{TD target}}-V(s))}^{\text{TD error}}\]

The TD error is the difference between the current estimate of \(V\) and the actual reward gained (plus the discounted value estimate of the next state). But is it more effective to estimate the action-value function instead?
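A minimal sketch of the TD(0) update rule above, on a made-up two-state chain (state 0 → state 1 with reward 0, state 1 → terminal with reward 1):

```python
import numpy as np

gamma = 0.9   # discount factor
alpha = 0.1   # learning rate
V = np.zeros(2)

for _ in range(1000):
    # one episode: s0 -> s1 -> terminal
    # TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[0] += alpha * (0.0 + gamma * V[1] - V[0])   # transition s0 -> s1, r = 0
    V[1] += alpha * (1.0 + gamma * 0.0 - V[1])    # s1 -> terminal (V = 0), r = 1

print(V)  # converges toward V(1) = 1 and V(0) = gamma * 1 = 0.9
```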


an on-policy TD control method, where the estimate of \(q_\pi (s_t, a_t)\) depends on the current policy \(\pi\), and it is assumed that this policy will be used for the rest of the agent's episode. A problem is that it is sensitive to the current policy that the agent is following.
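A minimal sketch of this kind of on-policy TD control (the update rule here is SARSA); the toy chain environment and all the constants are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, eps = 0.9, 0.1, 0.1
n_states, n_actions = 3, 2   # toy chain: action 1 moves right, action 0 stays
Q = np.zeros((n_states, n_actions))

def step(s, a):
    # hypothetical toy dynamics: reaching the last state gives reward 1 and ends
    s2 = min(s + a, n_states - 1)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(s):
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

for _ in range(500):
    s, a, done = 0, eps_greedy(0), False
    while not done:
        s2, r, done = step(s, a)
        a2 = eps_greedy(s2)
        # on-policy: the target uses the action a2 the current policy actually takes
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2, a2]) - Q[s, a])
        s, a = s2, a2
```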

off-policy vs on-policy?

distribution drift?

Deep Deterministic Policy Gradient (DDPG) & Normalized Advantage Function (NAF)

T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Sep. 2015.

read 20220423-20220427


2017 presentation

watched 20220427

from paper…

\(\nabla_{\theta^\mu}J \approx \mathbb E_{s_t \sim \rho^\beta}[\nabla_{\theta^\mu}Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t|\theta^\mu)}]\\ = \mathbb E_{s_t \sim \rho^\beta}[\nabla_a Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_t}]\)
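A quick numeric sanity check of the chain rule used in the second line, with a made-up 1-D critic and a made-up linear policy \(\mu_\theta(s) = \theta s\):

```python
# chain rule behind the deterministic policy gradient:
# grad_theta Q(s, mu_theta(s)) = grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu_theta(s)
s, theta = 0.7, 0.3

Q = lambda s, a: -(a - s) ** 2   # hypothetical critic
mu = lambda th, s: th * s        # hypothetical deterministic policy

# analytic: dQ/da = -2*(a - s), dmu/dtheta = s
a = mu(theta, s)
analytic = -2.0 * (a - s) * s

# central finite difference on theta as a cross-check
eps = 1e-6
numeric = (Q(s, mu(theta + eps, s)) - Q(s, mu(theta - eps, s))) / (2 * eps)
print(analytic, numeric)
```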

from video…

DeepMind Lecture: Reinforcement Learning 6: Policy Gradients and Actor Critics

watched 20220429-20220510

Never solve a more general problem as an intermediate step - Vladimir Vapnik, 1998

-> so why not learn the policy directly (rather than learning the Q value)?

Much more intuitive explanation of the same topic <div class="youtube-player" data-id="zR11FLZ"></div>

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-based Acceleration,” Mar. 2016.

Model-based dexterous manipulation

V. Kumar, Y. Tassa, T. Erez, and E. Todorov, “Real-time behaviour synthesis for dynamic hand-manipulation,” May 2014.

I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant optimization for hand manipulation,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Goslar, DEU, Jul. 2012, pp. 137–144.

MPC based on stochastic MDP modelled with DNN

A. Nagabandi, K. Konoglie, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” Sep. 2019.

read 06.06.2022, project website, github

Model-free dexterous manipulation

H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features,” Nov. 2015. read 19.05.2022

S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates,” Oct. 2016. read 21.05.2022

<div class="youtube-player" data-id="ZhsEKTo7V04"></div>

potential field-based obstacle avoidance

O. Khatib, “Real-Time Obstacle Avoidance for Manipulators and Mobile Robots,” Int. J. Rob. Res., vol. 5, no. 1, pp. 90–98, Mar. 1986.

read: 10.04.2022

\[F_{x_d} = -\mathrm{grad}[U_{x_d}(x)] \\ F_{O} = -\mathrm{grad}[U_{O}(x)]\] \[U_{O}(x) = \begin{cases} \frac{1}{2}\mu\left(\frac{1}{\rho} - \frac{1}{\rho_0}\right)^2 & \text{if} \; \rho \leq \rho_0 \\ 0 & \text{otherwise} \end{cases}\]

where \(\rho\) is the shortest distance to the obstacle, and \(\rho_0\) is the limit of influence of the potential field.
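A minimal numpy sketch of the repulsive force \(F_O = -\mathrm{grad}[U_O(x)]\) for a point obstacle; the gain \(\mu\) and influence distance \(\rho_0\) are made-up values.

```python
import numpy as np

mu, rho_0 = 1.0, 0.5   # gain and influence distance (hypothetical values)

def repulsive_force(x, obstacle):
    """F = -grad U_O(x) for a point obstacle; zero outside the influence zone."""
    d = x - obstacle
    rho = np.linalg.norm(d)   # shortest distance to the obstacle
    if rho > rho_0:
        return np.zeros_like(x)
    # -grad of 0.5*mu*(1/rho - 1/rho_0)^2  =>  mu*(1/rho - 1/rho_0)/rho^2 * (d/rho)
    return mu * (1.0 / rho - 1.0 / rho_0) / rho**2 * (d / rho)

f_near = repulsive_force(np.array([0.1, 0.0]), np.array([0.0, 0.0]))
f_far = repulsive_force(np.array([1.0, 0.0]), np.array([0.0, 0.0]))
print(f_near, f_far)  # strong push away from the obstacle when close, zero outside rho_0
```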

RL-based walking with ANYmal

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,” Science Robotics, vol. 7, no. 62, p. eabk2822, 2022.

proximal policy optimization

Visual servoing of soft robots

H. Wang, B. Yang, Y. Liu, W. Chen, X. Liang, and R. Pfeifer, “Visual Servoing of Soft Robot Manipulator in Constrained Environments With an Adaptive Controller,” IEEE/ASME Trans. Mechatron., vol. 22, no. 1, pp. 41–50, Feb. 2017.

\[\dot y = \frac{1}{z(q)} A(y, q) \begin{bmatrix} v \\ w \end{bmatrix} \\ \dot z = b(q) \begin{bmatrix} v \\ w \end{bmatrix}\]

where \(y\) is the image coordinates, and \(v\) and \(w\) are the linear and angular velocities of the end effector.

Apple Machine Learning Research

read: 20.04.2022

not an academic paper but interesting look into how Apple implements ML into their products.

An On-device Deep Neural Network for Face Detection

On-device Panoptic Segmentation for Camera Using Transformers

Panoptic segmentation

A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic Segmentation,” Jan. 2018.

Semantic segmentation

my Jupyter notebook

tbh so much simpler than I imagined: the learning process is exactly the same as the standard ML tutorial, and the very same network (U-Net in this case) could also possibly be used for classification (as long as the output dimensions are correct). It’s the network structure that makes it better suited for segmentation, but in theory, segmentation could also be done with a conventional MLP. It sounds like dataset collection & annotation is the hard part…
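A tiny numpy sketch of that point: the segmentation head just outputs a class-logit map, and the prediction is an argmax over the class axis; pool the same logits globally and you have a plain classifier. (Shapes and values here are made up.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, H, W = 3, 4, 4

# pretend network output: one logit per class per pixel, shape (n_classes, H, W)
logits = rng.normal(size=(n_classes, H, W))

# segmentation = per-pixel classification: argmax over the class axis
seg_mask = logits.argmax(axis=0)          # shape (H, W), values in {0, 1, 2}

# the same logits, globally pooled, could feed a plain image classifier instead
class_scores = logits.mean(axis=(1, 2))   # shape (n_classes,)
print(seg_mask.shape, class_scores.shape)
```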

Foundation Models

R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” Aug. 2021.

read: 20220422

Soft Robotics for Delicate and Dexterous Manipulation (ICRA2021 talk)

Prof. Robert J. Wood’s talk at ICRA

C. B. Teeple, B. Aktas, M. C. Yuen, G. R. Kim, R. D. Howe, and R. J. Wood, “Controlling palm-object interactions via friction for enhanced in-hand manipulation,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 2258–2265, Apr. 2022.

M. A. Graule, T. P. McCarthy, C. B. Teeple, J. Werfel, and R. J. Wood, “SoMoGym: A toolkit for developing and evaluating controllers and reinforcement learning algorithms for soft robots,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 4071–4078, Apr. 2022.


ML playground

fiddle around with ML & statistics libraries, sample code, etc.

Dense Visual Representations

P. O. Pinheiro, A. Almahairi, R. Y. Benmalek, F. Golemo, and A. Courville, “Unsupervised Learning of Dense Visual Representations,” Nov. 2020.