
A Joint Visuo-Tactile Latent-Space World Model

Yongji Fu, et al.

ICRA 2026 (submitted, under review), 2025

Abstract

Contact-rich manipulation tasks require a world model that predicts not only how the scene will look after an action, but also what the fingertips will feel. We train a joint visuo-tactile latent-space world model that encodes visual and tactile signals into a shared latent space and rolls out future states within that space. The model supports planning and policy learning, and is well suited to events such as slipping, sticking, and sudden force changes that are invisible to vision but captured by touch.

We evaluate on manipulation tasks that are ambiguous from vision alone, and find that adding a tactile channel to the latent dynamics improves both state-prediction accuracy and downstream task success rates.

Motivation

Vision-only world models struggle on contact-rich manipulation because the information that decides success — is the object about to slip, is the grasp stable, did the tool just engage — lives in the force/tactile signal, not in pixels. We want a world model whose latent captures both modalities so that planning and policy learning can reason about contact.

Approach

  • Shared visuo-tactile latent. A joint encoder fuses vision and tactile readings into a single latent state, aligned so that either modality alone still maps into the same space.
  • Latent-space rollout. Dynamics are learned and unrolled entirely in the latent space, enabling cheap multi-step prediction for planning (a minimal sketch of both components follows this list).
  • Downstream use. The learned latent and dynamics serve model-predictive control and act as a representation for policy learning on contact-rich manipulation tasks (see the planning sketch below).
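
To make the first two points concrete, here is a minimal PyTorch sketch of a joint visuo-tactile encoder and a latent dynamics model that is unrolled purely in latent space. All module names, dimensions, and the fusion scheme (concatenate-then-project) are illustrative assumptions, not the paper's actual architecture.

# Sketch: joint visuo-tactile encoder + latent dynamics (assumed architecture).
import torch
import torch.nn as nn

class VisuoTactileEncoder(nn.Module):
    """Encodes an RGB image and a tactile reading into one shared latent state."""
    def __init__(self, latent_dim=64, tactile_dim=32):
        super().__init__()
        self.vision = nn.Sequential(            # small CNN over 64x64 RGB frames (assumed size)
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        self.tactile = nn.Sequential(           # MLP over a flattened tactile signal
            nn.Linear(tactile_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)  # project into the shared latent

    def forward(self, image, tactile):
        z_v = self.vision(image)
        z_t = self.tactile(tactile)
        return self.fuse(torch.cat([z_v, z_t], dim=-1))

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and the action."""
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

def rollout(dynamics, z0, actions):
    """Unrolls the dynamics entirely in latent space over an action sequence."""
    z, traj = z0, []
    for a in actions:                 # actions: sequence of (batch, action_dim) tensors
        z = dynamics(z, a)
        traj.append(z)
    return torch.stack(traj)          # (horizon, batch, latent_dim)

Because the rollout never decodes back to pixels or tactile readings, multi-step prediction stays cheap enough to be called many times inside a planner.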

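For the downstream planning use, the sketch below shows one way to run model-predictive control in the learned latent space with the cross-entropy method (CEM). The reward head, horizon, and hyperparameters are illustrative assumptions; the paper's planner may differ.

# Sketch: CEM-based MPC over the latent dynamics (assumed planner).
import torch

@torch.no_grad()
def cem_plan(encoder, dynamics, reward_fn, image, tactile,
             horizon=10, action_dim=7, n_samples=256, n_elite=32, n_iters=5):
    """Returns the first action of the best action sequence found by CEM."""
    z0 = encoder(image.unsqueeze(0), tactile.unsqueeze(0))           # (1, latent_dim)
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences: (n_samples, horizon, action_dim)
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        z = z0.expand(n_samples, -1)
        returns = torch.zeros(n_samples)
        for t in range(horizon):                                      # latent-space rollout
            z = dynamics(z, actions[:, t])
            returns += reward_fn(z)                                   # assumed learned reward head
        elite = actions[returns.topk(n_elite).indices]                # refit distribution to elites
        mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-6
    return mean[0]                                                    # execute the first action

The same latent and dynamics could instead feed a learned policy; the planner above is only one of the two downstream uses mentioned in the list.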
Video