Visuo-Tactile Latent World Models
ICRA 2026 (submission), under review, 2025
Abstract
Contact-rich manipulation needs a world model that predicts not only what the scene will look like after an action but also what it will feel like at the fingertips. This work trains a visuo-tactile latent world model that jointly encodes vision and tactile signals into a shared latent space and rolls out future states in that space. The model supports planning and policy learning for tasks where contact events (slip, stick, sudden force changes) carry information that vision alone cannot capture.
We evaluate on manipulation tasks that are ambiguous under vision alone and show that adding the tactile channel to the latent dynamics improves both predictive accuracy and downstream task success.
Motivation
Vision-only world models struggle on contact-rich manipulation because the information that decides success (is the object about to slip, is the grasp stable, did the tool just engage) lives in the force and tactile signals, not in pixels. We want a world model whose latent state captures both modalities so that planning and policy learning can reason about contact.
Approach
- Shared visuo-tactile latent. A joint encoder fuses vision and tactile readings into one latent state, aligned so that either modality alone still maps into the same space (see the encoder and dynamics sketch after this list).
- Latent-space rollout. Dynamics are learned and unrolled entirely in latent space, enabling cheap multi-step prediction for planning.
- Downstream use. The learned latent representation and dynamics model are used for model-predictive control and as a representation for policy learning on contact-rich manipulation tasks (see the planning sketch after this list).
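To make the first two points concrete, here is a minimal sketch of a joint visuo-tactile encoder and a latent dynamics model that is unrolled over an action sequence. The layer sizes, input dimensions, and concatenation-based fusion are illustrative assumptions for this page, not the architecture reported in the paper.

```python
# Minimal sketch of a joint visuo-tactile encoder and latent dynamics model.
# All layer sizes, input shapes, and the fusion scheme are illustrative
# assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class VisuoTactileEncoder(nn.Module):
    """Fuses an image and a tactile reading into one shared latent state."""

    def __init__(self, tactile_dim=32, latent_dim=64):
        super().__init__()
        # Small CNN over 64x64 RGB images (assumed input size).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # MLP over a flat tactile/force vector (assumed dimensionality).
        self.tactile = nn.Sequential(
            nn.Linear(tactile_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        # Fusion head maps the concatenated features into the shared latent.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, image, tactile):
        feats = torch.cat([self.vision(image), self.tactile(tactile)], dim=-1)
        return self.fuse(feats)


class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and an action."""

    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))


def rollout(dynamics, z0, actions):
    """Unrolls the dynamics purely in latent space for a sequence of actions."""
    z, traj = z0, []
    for a in actions:          # actions: sequence of (batch, action_dim) tensors
        z = dynamics(z, a)
        traj.append(z)
    return torch.stack(traj)   # (horizon, batch, latent_dim)
```

Training would fit the encoder and dynamics jointly (for example with reconstruction or contrastive alignment of the two modalities) so that a vision-only or tactile-only input still lands in the same latent region; the exact objective is not specified here.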
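For the downstream-use point, the sketch below shows how such a latent dynamics model could drive a simple random-shooting model-predictive controller. The learned `reward_model`, the Gaussian action sampling, and the horizon and sample counts are likewise assumptions, not the planner described in the paper.

```python
# Minimal sketch of random-shooting MPC on top of the learned latent dynamics.
# reward_model is a hypothetical learned module mapping a latent to a scalar reward.
import torch


def plan(encoder, dynamics, reward_model, image, tactile,
         horizon=10, num_samples=256, action_dim=7):
    """Returns the first action of the best sampled action sequence."""
    with torch.no_grad():
        z0 = encoder(image, tactile)                  # (1, latent_dim)
        z = z0.repeat(num_samples, 1)                 # one latent copy per candidate
        actions = torch.randn(horizon, num_samples, action_dim)
        returns = torch.zeros(num_samples)
        for t in range(horizon):
            z = dynamics(z, actions[t])               # cheap latent-space step
            returns += reward_model(z).squeeze(-1)    # predicted reward per candidate
        best = returns.argmax()
    return actions[0, best]                           # execute, then replan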