Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Italian Institute of Technology
ICCV 2025
An agent equipped with an instance segmentation and captioning model navigates the environment. Off-the-shelf captioners (CoCa, BLIP-2, Florence-2, and ChatGPT-4) predict partially wrong and inconsistent captions across different views of the same object.

Recognizing and describing objects in a scene are fundamental skills for visual understanding, enabling robots to navigate and interact with the environment. Despite advances in image captioning, a captioner deployed on an autonomous agent often generates wrong or inconsistent descriptions across different views of the same object, especially in the case of occlusions or challenging viewing directions.

Abstract

We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions under varying camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.

Video Explanation

Pipeline Overview

Our modular framework separates exploration, pseudo-caption generation, and model training into three flexible phases. This design allows independent improvements at each stage: collecting diverse object views, generating consistent pseudo-captions via prompting, and fine-tuning with both supervision and a contrastive loss to enhance caption consistency across views.
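As a rough illustration of how the three phases decouple, the Python sketch below wires exploration, consensus distillation, and fine-tuning together behind three interchangeable callables. The function name `run_pipeline` and the exact interfaces are assumptions made for illustration, not the released code.

```python
from typing import Callable, Dict, List

def run_pipeline(
    explore: Callable[[], Dict[str, List[str]]],   # phase 1: object_id -> noisy captions from many views
    distill: Callable[[List[str]], str],           # phase 2: noisy captions -> one consensus pseudo-caption
    finetune: Callable[[Dict[str, str]], None],    # phase 3: object_id -> pseudo-caption used as supervision
) -> Dict[str, str]:
    """Glue code for the three phases; each one can be improved or swapped independently."""
    captions_per_object = explore()                      # agent explores and captions segmented objects
    pseudo_captions = {obj: distill(caps)                # LLM-based consensus per object instance
                       for obj, caps in captions_per_object.items()}
    finetune(pseudo_captions)                            # supervised + contrastive fine-tuning of the captioner
    return pseudo_captions
```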

Overview of the pipeline

The agent navigates to goal locations predicted by a learned policy (a). At each step, it performs instance segmentation and captioning; the detections are projected into a 3D voxel map (b). A semantic voxel map is built by accumulating segmented instances and associated captions over time (c). Captions collected for the same object are grouped along with their frequencies (d). The Language-Driven Consistent Pseudo-captioner (LD-CPS) module generates a single consistent pseudo-caption for each object using prompting and in-context learning (e). The captioner model is then fine-tuned with the pseudo-captions as supervision (f), while a triplet loss is introduced to encourage consistent visual representations across views of the same object (g).
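To make steps (d), (e), and (g) more concrete, here is a hedged sketch of one way to group an object's captions with their frequencies, ask an LLM for a consensus pseudo-caption, and compute a triplet loss over visual features of different views. The prompt wording, the `query_llm` callable, the cosine-distance formulation, and the margin value are illustrative assumptions, not the exact LD-CPS prompt or training hyperparameters.

```python
from collections import Counter
from typing import Callable, List

import torch
import torch.nn.functional as F


def build_consensus_prompt(captions: List[str]) -> str:
    """Group the captions observed for one object with their frequencies (step d)
    and phrase a consensus request for the LLM (step e). Wording is illustrative."""
    counts = Counter(c.strip().lower() for c in captions)
    listing = "\n".join(f"- ({n}x) {c}" for c, n in counts.most_common())
    return (
        "The following captions were generated for the same object seen from "
        "different viewpoints, with their frequencies:\n"
        f"{listing}\n"
        "Write a single short caption that best describes the object, "
        "keeping only details that are consistent across the captions."
    )


def distill_pseudo_caption(captions: List[str], query_llm: Callable[[str], str]) -> str:
    """Produce one consistent pseudo-caption for an object via the LLM (step e)."""
    return query_llm(build_consensus_prompt(captions)).strip()


def view_triplet_loss(anchor: torch.Tensor,
                      positive: torch.Tensor,
                      negative: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """Triplet loss on visual features (step g): two views of the same object
    (anchor, positive) are pulled together, a view of a different object is pushed away."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = 1.0 - (anchor * positive).sum(-1)   # cosine distance to the positive view
    d_neg = 1.0 - (anchor * negative).sum(-1)   # cosine distance to the negative view
    return F.relu(d_pos - d_neg + margin).mean()
```

During fine-tuning (f)-(g), such a triplet term would be added to the standard captioning loss computed against the pseudo-captions, encouraging representations that stay consistent across viewpoints.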

Qualitative Results

Qualitative examples of predicted captions before and after fine-tuning for different object views. We highlight mistakes and correct details in the generated captions.

BibTeX

@inproceedings{galliena2025embodied,
  title={Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions},
  author={Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}