Yasunori Toshimitsu | 利光泰徳

my paper pile

The contents of this page are meant to be memos for myself, but I post them publicly in the hope that someone else may find them useful and, more importantly, to use peer pressure to motivate myself to learn more.

MAML: standard method for meta-learning

C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” Mar. 2017.

RT-1: Robotic Transformer

A. Brohan et al., “RT-1: Robotics Transformer for Real-World Control at Scale,” 2022.

read: 19.12.2022


WTF is Attention?

Karl Sims - Evolving Virtual Creatures

“OOPArt of RL and genetic algorithms”

K. Sims, “Evolving virtual creatures,” in Proceedings of the 21st annual conference on Computer graphics and interactive techniques, New York, NY, USA, Jul. 1994, pp. 15–22.

read: 22.11.2022


A. Zhao et al., “RoboGrammar: Graph Grammar for Terrain-Optimized Robot Design,” ACM Trans. Graph., vol. 39, no. 6, pp. 1–16, Dec. 2020.

read: 21-22.11.2022

RoboGrammar code

Gripper optimization based on diff’able sim

J. Xu et al., “An End-to-End Differentiable Framework for Contact-Aware Robot Design,” Jul. 2021.

read: 21.11.2022

Graph Neural Networks

F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.

Hand and Object pose reconstruction from image

Z. Cao, I. Radosavovic, A. Kanazawa, and J. Malik, “Reconstructing Hand-Object Interactions in the Wild,” Dec. 2020.

read: 11.2022

Hand evolution

Completely model-based optimization of gripper structure

C. Hazard, N. Pollard, and S. Coros, “Automated design of manipulators for in-hand tasks,” Nov. 2018.

read: 24.11.2022

multi-step model-based optimization procedure

  1. Floating contact optimization: cost functions ensure the object is moved toward the desired orientation; the outputs are the floating contact points and forces.
  2. Mechanism synthesis, continuous optimization: optimize the continuous morphological parameters of the hand so that the contact points are achieved and there is no collision. I'm not sure what method is used for the optimization (probably numerical optimization).
  3. Mechanism synthesis, discrete optimization: change the joint structure.

Joint optimization of design and controller using DRL

C. Schaff, D. Yunis, A. Chakrabarti, and M. R. Walter, “Jointly Learning to Construct and Control Agents using Deep Reinforcement Learning,” 2018.

read: 29.11.2022

DRL-based manipulation in MuJoCo

A. Rajeswaran et al., “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” Sep. 2017.

read: 08.04.2022 - 11.04.2022


quick points

\[g_{aug} = \sum_{(s, a) \in \rho_\pi} \nabla_\theta \ln \pi_\theta (a|s) A^\pi(s, a) + \sum_{(s, a) \in \rho_D} \nabla_\theta \ln \pi_\theta (a|s) w(s,a)\]
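A toy numpy sketch of this augmented gradient for a tabular softmax policy (the on-policy advantages \(A^\pi\) and demo weights \(w(s,a)\) are stubbed with made-up numbers, and all names here are mine, not from the paper):

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # logits of a tabular softmax policy

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    # gradient of ln pi(a|s) w.r.t. the logits: onehot(a) - pi(s) in row s
    g = np.zeros_like(theta)
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g

# on-policy batch (s, a, advantage) and demonstration batch (s, a, weight)
rollout = [(0, 1, 0.5), (1, 0, -0.2)]
demos = [(2, 1, 1.0)]

g_aug = sum(A * grad_log_pi(s, a) for s, a, A in rollout) \
      + sum(w * grad_log_pi(s, a) for s, a, w in demos)
```

Each demo sample simply enters the gradient like an extra policy-gradient sample, reweighted by \(w(s,a)\) instead of an advantage.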


Bootstrapping (in statistics) is one of the resampling methods (here, "resampling" roughly means validating models using random subsets of the data); it uses random sampling with replacement. It estimates the properties of an estimator (e.g. its variance) by measuring them on samples drawn from an approximating distribution. (The height-measurement example on Wikipedia is easy to understand.)
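A minimal sketch of that height-measurement style example, using only the standard library (function name and numbers are mine): resample with replacement, recompute the mean each time, and take the spread of those means as an estimate of the standard error.

```python
import random
import statistics

def bootstrap_std_of_mean(sample, n_resamples=2000, seed=0):
    """Estimate the standard error of the sample mean by bootstrapping:
    draw samples with replacement and measure the spread of their means."""
    rng = random.Random(seed)
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n))
             for _ in range(n_resamples)]
    return statistics.stdev(means)

# toy height measurements (cm); true SE of the mean is roughly stdev/sqrt(n)
heights = [165, 170, 172, 168, 180, 175, 169, 174, 171, 178]
se = bootstrap_std_of_mean(heights)
```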

TD learning, TD error?

temporal difference learning bootstraps from the current estimate of the value function.

The TD(0) method is the simplest special case of TD methods: it estimates the state-value function (the expected return when starting from state \(s\) and following policy \(\pi\)) of a finite-state MDP.

The HJB equation, which gives a necessary and sufficient condition for optimality in optimal control

\[V^{\pi }(s)=E_{\pi }\{r_{0}+\gamma V^{\pi }(s_{1})|s_{0}=s\}\]

The update rule is:

\[V(s)\leftarrow V(s)+\alpha \overbrace{(\underbrace{r+\gamma V(s')}_{\text{TD target}}-V(s))}^{\text{TD error}}\]

The TD error is the difference between the current estimate of \(V(s)\) and the TD target (the actual reward gained plus the discounted estimate of the next state's value). But it may be more effective to estimate the action-value function instead?
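The update rule above can be sketched as a tabular TD(0) loop (toy two-state chain; all names mine):

```python
def td0(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): after each transition (s, r, s'), move V(s)
    toward the TD target r + gamma * V(s')."""
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_target = r + gamma * v_next
            td_error = td_target - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# two-state chain: A -> B (reward 0), B -> terminal (reward 1)
episodes = [[("A", 0.0, "B"), ("B", 1.0, None)]] * 200
V = td0(episodes)
```

With \(\gamma = 0.9\), \(V(B)\) converges toward 1 and \(V(A)\) toward \(\gamma V(B) = 0.9\).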


SARSA is an on-policy TD control method, where the estimate of \(q_\pi (s_t, a_t)\) depends on the current policy \(\pi\), which is assumed to be followed for the rest of the agent's episode. A drawback is that the estimate is sensitive to the policy the agent is currently following.
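The on-policy control update described above (SARSA) can be sketched as a single dictionary update; note the target uses the action the policy actually takes next, not the max (toy code, names mine):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: on-policy, so the target uses the action a_next
    that the current policy actually selects in s_next (not the max)."""
    q_next = 0.0 if s_next is None else Q.get((s_next, a_next), 0.0)
    td_target = r + gamma * q_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q

Q = {}
sarsa_update(Q, "s0", "a0", 1.0, None, None)  # terminal transition
```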

off-policy vs on-policy?

distribution drift?

DDPG: Deep Deterministic Policy Gradient (DDPG) & Normalized Advantage Function (NAF)

T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Sep. 2015.

read 20220423-20220427


2017 presentation

watched 20220427

from paper…

\(\nabla_{\theta^\mu}J \approx \mathbb E_{s_t \sim \rho^\beta}[\nabla_{\theta^\mu}Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t|\theta^\mu)}]\\ = \mathbb E_{s_t \sim \rho^\beta}[\nabla_a Q(s,a|\theta^Q)|_{s=s_t, a=\mu(s_t)} \nabla_{\theta^\mu}\mu(s|\theta^\mu)|_{s=s_t}]\)
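The second line is just the chain rule through the deterministic actor. A scalar numpy check with a toy quadratic critic and linear actor (all names mine, not DDPG's networks):

```python
import numpy as np

theta = 0.3  # single actor parameter

def mu(s):
    # toy deterministic policy mu_theta(s) = theta * s
    return theta * s

def Q(s, a):
    # toy critic, maximized when the action equals the state
    return -(a - s) ** 2

s = 2.0
a = mu(s)

# chain rule from the DDPG gradient: dQ/da * dmu/dtheta
dQ_da = -2.0 * (a - s)
dmu_dtheta = s
grad_chain = dQ_da * dmu_dtheta

# finite-difference check of d/dtheta Q(s, mu_theta(s))
eps = 1e-6
grad_fd = (Q(s, (theta + eps) * s) - Q(s, (theta - eps) * s)) / (2 * eps)
```

Both gradients agree, which is exactly what the equality above states: backpropagate the critic's action-gradient through the actor.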

from video…

DeepMind Lecture: Reinforcement Learning 6: Policy Gradients and Actor Critics

watched 20220429-20220510

Never solve a more general problem as an intermediate step - Vladimir Vapnik, 1998

-> why not learn a policy directly (rather than learning the Q value)

A much more intuitive explanation of the same topic: <div class="youtube-player" data-id="zR11FLZ"></div>

S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous Deep Q-Learning with Model-based Acceleration,” Mar. 2016.

Model-based dexterous manipulation

V. Kumar, Y. Tassa, T. Erez, and E. Todorov, “Real-time behaviour synthesis for dynamic hand-manipulation,” May 2014.

I. Mordatch, Z. Popović, and E. Todorov, “Contact-invariant optimization for hand manipulation,” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Goslar, DEU, Jul. 2012, pp. 137–144.

MPC based on stochastic MDP modelled with DNN

A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” Sep. 2019.

Model-free dexterous manipulation

H. van Hoof, T. Hermans, G. Neumann, and J. Peters, “Learning robot in-hand manipulation with tactile features,” Nov. 2015. (read: 19.05.2022)

S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates,” Oct. 2016. (read: 21.05.2022)

<div class="youtube-player" data-id="ZhsEKTo7V04"></div>

potential field-based obstacle avoidance

O. Khatib, “Real-Time Obstacle Avoidance for Manipulators and Mobile Robots,” Int. J. Rob. Res., vol. 5, no. 1, pp. 90–98, Mar. 1986.

read: 10.04.2022

\[F_{x_d} = -\nabla U_{x_d}(x) \qquad F_{O} = -\nabla U_{O}(x)\] \[U_{O}(x) = \begin{cases} \frac{1}{2}\mu\left(\frac{1}{\rho} - \frac{1}{\rho_0}\right)^2 & \text{if} \; \rho \leq \rho_0 \\ 0 & \text{otherwise} \end{cases}\]

where \(\rho\) is the shortest distance to the obstacle, and the \(\rho_0\) is the limit of influence of the potential field.
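A numpy sketch of the repulsive force obtained from this potential (for a point obstacle, so \(\rho\) is the Euclidean distance; function and argument names are mine):

```python
import numpy as np

def repulsive_force(x, obstacle, mu=1.0, rho_0=1.0):
    """Force from the artificial potential above: zero beyond the
    influence distance rho_0, and pushing away from the obstacle inside it."""
    d = x - obstacle
    rho = np.linalg.norm(d)
    if rho > rho_0:
        return np.zeros_like(x)
    # F = -grad U_O = mu * (1/rho - 1/rho_0) * (1/rho**2) * (d/rho)
    return mu * (1.0 / rho - 1.0 / rho_0) * (d / rho**3)

F_near = repulsive_force(np.array([0.5, 0.0]), np.zeros(2))  # inside rho_0
F_far = repulsive_force(np.array([2.0, 0.0]), np.zeros(2))   # outside rho_0
```

The force blows up as \(\rho \to 0\) and vanishes smoothly at \(\rho = \rho_0\), which is what keeps the field local to the obstacle.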

RL-based walking with ANYmal

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,” Science Robotics, vol. 7, no. 62, p. eabk2822, 2022.

proximal policy optimization

Visual servoing of soft robots

H. Wang, B. Yang, Y. Liu, W. Chen, X. Liang, and R. Pfeifer, “Visual Servoing of Soft Robot Manipulator in Constrained Environments With an Adaptive Controller,” IEEE/ASME Trans. Mechatron., vol. 22, no. 1, pp. 41–50, Feb. 2017.

\[\dot y = \frac{1}{z(q)} A(y, q) \begin{bmatrix} v \\ w \end{bmatrix} \\ \dot z = b(q) \begin{bmatrix} v \\ w \end{bmatrix}\]

where \(y\) is the image coordinates, and \(v\) and \(w\) are the linear and angular velocities of the end effector.
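For intuition, the classical point-feature interaction matrix from rigid-camera visual servoing plays the role of \(\frac{1}{z}A\) here (this is the standard textbook matrix, not the paper's soft-robot Jacobian; names mine):

```python
import numpy as np

def interaction_matrix(x, y, z):
    """Classical point-feature image Jacobian L such that ydot = L @ [v; w],
    for normalized image coordinates (x, y) and point depth z."""
    return np.array([
        [-1 / z, 0, x / z, x * y, -(1 + x**2), y],
        [0, -1 / z, y / z, 1 + y**2, -x * y, -x],
    ])

L_mat = interaction_matrix(0.1, 0.2, 2.0)
v_w = np.array([0.0, 0.0, 0.1, 0.0, 0.0, 0.0])  # pure forward (z) motion
ydot = L_mat @ v_w
```

Moving the camera forward makes features flow outward from the image center, which is what the positive \(x/z\), \(y/z\) terms encode.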

Apple Machine Learning Research

read: 20.04.2022

not an academic paper but interesting look into how Apple implements ML into their products.

An On-device Deep Neural Network for Face Detection

On-device Panoptic Segmentation for Camera Using Transformers

Panoptic segmentation

A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic Segmentation,” Jan. 2018.

Semantic segmentation

my Jupyter notebook

tbh so much simpler than I imagined: the learning process is exactly the same as in a standard ML tutorial, and the very same network (U-Net in this case) could also possibly be used for classification (as long as the output dimensions are correct). It's the network structure that makes it better suited for segmentation, but in theory, segmentation could also be done with a conventional MLP. It sounds like dataset collection & annotation is the hard part…
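The "segmentation is just per-pixel classification" point can be made concrete with numpy: even a single linear layer applied per pixel produces a segmentation-shaped output (toy random weights, names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# a segmentation "network" reduced to per-pixel classification:
# one linear layer mapping 3 input channels to 4 class logits per pixel
H, W, C_in, n_classes = 8, 8, 3, 4
image = rng.random((H, W, C_in))
weight = rng.normal(size=(C_in, n_classes))
bias = np.zeros(n_classes)

logits = image @ weight + bias  # shape (H, W, n_classes)
mask = logits.argmax(axis=-1)   # one class label per pixel
```

A real network like U-Net just replaces the linear map with convolutions that share context between pixels; the output shape and the cross-entropy training loop stay the same.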

Foundation Models

R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” Aug. 2021.

read: 20220422

Soft Robotics for Delicate and Dexterous Manipulation (ICRA2021 talk)

Prof. Robert J. Wood’s talk at ICRA

C. B. Teeple, B. Aktas, M. C. Yuen, G. R. Kim, R. D. Howe, and R. J. Wood, “Controlling Palm-Object Interactions Via Friction for Enhanced In-Hand Manipulation,” IEEE Robotics and Automation Letters, vol. 7, pp. 2258–2265.

(missing reference).

M. A. Graule, T. P. McCarthy, C. B. Teeple, J. Werfel, and R. J. Wood, “SoMoGym: A toolkit for developing and evaluating controllers and reinforcement learning algorithms for soft robots,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 4071–4078, Apr. 2022.


ML playground

fiddle around with ML & statistics libraries, sample code, etc.

Dense Visual Representations

(missing reference).