Humanoid DreamTouch
Learning Versatile Humanoid Manipulation with Touch Dreaming
Humanoid DreamTouch provides a data collection and training framework for learning contact-aware policies for versatile humanoid manipulation.
Abstract.
Humanoid robots promise general-purpose assistance, yet real-world humanoid loco-manipulation remains challenging because it requires whole-body stability, dexterous hands, and contact-aware perception under frequent contact changes.
In this work, we study dexterous, contact-rich humanoid loco-manipulation. We first develop an RL-based whole-body controller that provides stable lower-body and torso execution during complex manipulation. On top of this controller, we build a whole-body humanoid data collection system that combines VR-based teleoperation with a unified reference frame for human-to-humanoid motion mapping, enabling efficient collection of real-world demonstrations. We then propose the Humanoid Transformer with Touch Dreaming (HTD), a multimodal encoder-decoder Transformer that treats touch as a first-class modality alongside multi-view vision and proprioception. HTD is trained with a single-stage behavioral cloning objective augmented with touch dreaming: in addition to predicting action chunks, the policy predicts future finger forces and future tactile latents, encouraging the shared Transformer trunk to learn contact-aware representations for dexterous interaction. Across five contact-rich tasks (Insert-T, Book Organization, Towel Folding, Cat Litter Scooping, and Tea Serving), our system achieves versatile, high-dexterity humanoid manipulation in the real world.
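The single-stage objective with touch dreaming can be sketched as a combined loss. This is a minimal NumPy illustration, not the paper's implementation: the feature and head dimensions, the L1/MSE choices, and the loss weights are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: trunk feature, action-chunk length, action dim,
# finger-force dim, tactile-latent dim.
D, CHUNK, ACT, FORCE, TAC = 512, 16, 32, 10, 64

# Linear prediction heads on top of a shared trunk feature (stand-ins for
# the Transformer's decoder heads).
W_act = rng.standard_normal((D, CHUNK * ACT)) * 0.01
W_force = rng.standard_normal((D, FORCE)) * 0.01
W_tac = rng.standard_normal((D, TAC)) * 0.01

def touch_dreaming_loss(feat, gt_actions, gt_forces, gt_tac,
                        w_force=0.1, w_tac=0.1):
    """Behavioral cloning on action chunks plus auxiliary 'dreaming' losses
    that predict future contact signals from the same trunk feature."""
    pred_a = (feat @ W_act).reshape(CHUNK, ACT)
    l_bc = np.abs(pred_a - gt_actions).mean()              # L1 on action chunk
    l_force = ((feat @ W_force - gt_forces) ** 2).mean()   # future finger forces
    l_tac = ((feat @ W_tac - gt_tac) ** 2).mean()          # future tactile latents
    return l_bc + w_force * l_force + w_tac * l_tac
```

Because all three heads share one trunk feature, gradients from the force and tactile terms shape the same representation the action head reads from, which is the intended effect of treating touch as a first-class modality.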
Autonomous Policies
Policy rollouts for Insert-T with a clearance of 3.5 mm. All videos are played at original speed (1X) unless otherwise noted.
Policy rollouts for tea serving. All videos are played at original speed (1X).
Touch Dreaming Visualization
Explore touch dreaming predictions interactively. The left panel shows the robot's head camera view. The right panel visualizes predicted vs. ground-truth touch signals; switch between Force, Latent Tactile, and Raw Tactile modes.

For the latent tactile heatmaps, each latent dimension is independently normalized over the episode, with a minimum-range threshold derived from the most active dimension across all fingers to distinguish active from inactive latent contact regions. Note that this per-dimension normalization amplifies subtle changes and prediction errors in the latent space for better visibility.
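The per-dimension normalization described above can be sketched as follows. This is a minimal illustration, not the visualizer's actual code; the array layout (timesteps, fingers, latent dimensions) and the 0.1 threshold fraction are assumptions.

```python
import numpy as np

def normalize_latents(latents, min_range_frac=0.1):
    """Normalize each latent dimension independently over an episode.

    latents: array of shape (timesteps, fingers, latent_dim).
    Dimensions whose episode range falls below a fraction of the most
    active dimension's range (across all fingers) are marked inactive
    and zeroed out.
    """
    lo = latents.min(axis=0, keepdims=True)   # per-dimension min over episode
    hi = latents.max(axis=0, keepdims=True)   # per-dimension max over episode
    rng = hi - lo
    # Minimum-range threshold derived from the most active dimension.
    thresh = min_range_frac * rng.max()
    active = rng >= thresh
    norm = np.where(active, (latents - lo) / np.maximum(rng, 1e-8), 0.0)
    return norm, active
```

Scaling every dimension to its own episode range is what makes small fluctuations visible, and it is also why subtle prediction errors appear amplified in the heatmaps.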
Head Camera (Right Eye)
Whole-Body Controller
Whole-body Controller under Teleoperation. All videos are played at original speed (1X).