
A Joint Visuo-Tactile Latent-Space World Model

Yongji Fu, et al.

ICRA 2026 (submitted, under review), 2025

Abstract

Contact-rich manipulation tasks require a world model that predicts not only how the scene will look after an action, but also what the fingertips will feel. We train a joint visuo-tactile latent-space world model that encodes visual and tactile signals into a shared latent space and rolls out future states within that space. The model supports planning and policy learning, and is well suited to events such as slipping, sticking, and sudden force changes that are invisible to vision but captured by touch.

We evaluate on manipulation tasks that are ambiguous from vision alone, and find that adding a tactile channel to the latent dynamics improves both state-prediction accuracy and downstream task success rates.

Motivation

Vision-only world models struggle on contact-rich manipulation because the information that decides success — is the object about to slip, is the grasp stable, did the tool just engage — lives in the force/tactile signal, not in pixels. We want a world model whose latent captures both modalities so that planning and policy learning can reason about contact.

Approach

  • Shared visuo-tactile latent. A joint encoder fuses vision and tactile readings into a single latent state, aligned so that either modality alone still maps into the same space.
  • Latent-space rollout. Dynamics are learned and unrolled entirely in the latent space, enabling cheap multi-step prediction for planning (a minimal sketch of both components follows this list).
  • Downstream use. The learned latent and dynamics serve model-predictive control and act as a representation for policy learning on contact-rich manipulation tasks (see the planning sketch below).
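
To make the first two points concrete, here is a minimal PyTorch sketch of a joint visuo-tactile encoder and a latent dynamics model that is unrolled purely in latent space. All module names, dimensions, and the fusion scheme (concatenate-then-project) are illustrative assumptions, not the paper's actual architecture.

# Sketch: joint visuo-tactile encoder + latent dynamics (assumed architecture).
import torch
import torch.nn as nn

class VisuoTactileEncoder(nn.Module):
    """Encodes an RGB image and a tactile reading into one shared latent state."""
    def __init__(self, latent_dim=64, tactile_dim=32):
        super().__init__()
        self.vision = nn.Sequential(            # small CNN over 64x64 RGB frames (assumed size)
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        self.tactile = nn.Sequential(           # MLP over a flattened tactile signal
            nn.Linear(tactile_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)  # project into the shared latent

    def forward(self, image, tactile):
        z_v = self.vision(image)
        z_t = self.tactile(tactile)
        return self.fuse(torch.cat([z_v, z_t], dim=-1))

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and the action."""
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

def rollout(dynamics, z0, actions):
    """Unrolls the dynamics entirely in latent space over an action sequence."""
    z, traj = z0, []
    for a in actions:                 # actions: sequence of (batch, action_dim) tensors
        z = dynamics(z, a)
        traj.append(z)
    return torch.stack(traj)          # (horizon, batch, latent_dim)

Because the rollout never decodes back to pixels or tactile readings, multi-step prediction stays cheap enough to be called many times inside a planner.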

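For the downstream planning use, the sketch below shows one way to run model-predictive control in the learned latent space with the cross-entropy method (CEM). The reward head, horizon, and hyperparameters are illustrative assumptions; the paper's planner may differ.

# Sketch: CEM-based MPC over the latent dynamics (assumed planner).
import torch

@torch.no_grad()
def cem_plan(encoder, dynamics, reward_fn, image, tactile,
             horizon=10, action_dim=7, n_samples=256, n_elite=32, n_iters=5):
    """Returns the first action of the best action sequence found by CEM."""
    z0 = encoder(image.unsqueeze(0), tactile.unsqueeze(0))           # (1, latent_dim)
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences: (n_samples, horizon, action_dim)
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        z = z0.expand(n_samples, -1)
        returns = torch.zeros(n_samples)
        for t in range(horizon):                                      # latent-space rollout
            z = dynamics(z, actions[:, t])
            returns += reward_fn(z)                                   # assumed learned reward head
        elite = actions[returns.topk(n_elite).indices]                # refit distribution to elites
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6
    return mean[0]                                                    # execute the first action

The same latent and dynamics could instead feed a learned policy; the planner above is only one of the two downstream uses mentioned in the list.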
Video