Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Italian Institute of Technology
ICCV 2025
An agent equipped with an instance segmentation and captioning model navigates the environment. Off-the-shelf captioners (CoCa, BLIP-2, Florence-2, and ChatGPT-4) predict partially wrong and inconsistent captions across different views of the same object.

Recognizing and describing objects in a scene are fundamental skills for visual understanding, enabling robots to navigate and interact with the environment. Despite advances in image captioning, a captioner deployed on an autonomous agent often generates wrong or inconsistent descriptions across different views of the same object, especially in the case of occlusions or challenging viewing directions.

Abstract

We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions under varying camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.

Video Explanation

Pipeline Overview

Our modular framework separates exploration, pseudo-caption generation, and model training into three flexible phases. This design allows independent improvements at each stage: collecting diverse object views, generating consistent pseudo-captions via prompting, and fine-tuning with both supervision and a contrastive loss to enhance caption consistency across views.
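As a rough illustration of how the three phases decouple, the Python sketch below wires exploration, consensus distillation, and fine-tuning together behind three interchangeable callables. The function name `run_pipeline` and the exact interfaces are assumptions made for illustration, not the released code.

```python
from typing import Callable, Dict, List

def run_pipeline(
    explore: Callable[[], Dict[str, List[str]]],   # phase 1: object_id -> noisy captions from many views
    distill: Callable[[List[str]], str],           # phase 2: noisy captions -> one consensus pseudo-caption
    finetune: Callable[[Dict[str, str]], None],    # phase 3: object_id -> pseudo-caption used as supervision
) -> Dict[str, str]:
    """Glue code for the three phases; each one can be improved or swapped independently."""
    captions_per_object = explore()                      # agent explores and captions segmented objects
    pseudo_captions = {obj: distill(caps)                # LLM-based consensus per object instance
                       for obj, caps in captions_per_object.items()}
    finetune(pseudo_captions)                            # supervised + contrastive fine-tuning of the captioner
    return pseudo_captions
```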

Overview of the pipeline

The agent navigates to goal locations predicted by a learned policy (a). At each step, it performs instance segmentation and captioning; the detections are projected into a 3D voxel map (b). A semantic voxel map is built by accumulating segmented instances and associated captions over time (c). Captions collected for the same object are grouped along with their frequencies (d). The Language-Driven Consistent Pseudo-captioner (LD-CPS) module generates a single consistent pseudo-caption for each object using prompting and in-context learning (e). The captioner model is then fine-tuned with the pseudo-captions as supervision (f), while a triplet loss is introduced to encourage consistent visual representations across views of the same object (g).
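To make steps (d), (e), and (g) more concrete, here is a hedged sketch of one way to group an object's captions with their frequencies, ask an LLM for a consensus pseudo-caption, and compute a triplet loss over visual features of different views. The prompt wording, the `query_llm` callable, the cosine-distance formulation, and the margin value are illustrative assumptions, not the exact LD-CPS prompt or training hyperparameters.

```python
from collections import Counter
from typing import Callable, List

import torch
import torch.nn.functional as F


def build_consensus_prompt(captions: List[str]) -> str:
    """Group the captions observed for one object with their frequencies (step d)
    and phrase a consensus request for the LLM (step e). Wording is illustrative."""
    counts = Counter(c.strip().lower() for c in captions)
    listing = "\n".join(f"- ({n}x) {c}" for c, n in counts.most_common())
    return (
        "The following captions were generated for the same object seen from "
        "different viewpoints, with their frequencies:\n"
        f"{listing}\n"
        "Write a single short caption that best describes the object, "
        "keeping only details that are consistent across the captions."
    )


def distill_pseudo_caption(captions: List[str], query_llm: Callable[[str], str]) -> str:
    """Produce one consistent pseudo-caption for an object via the LLM (step e)."""
    return query_llm(build_consensus_prompt(captions)).strip()


def view_triplet_loss(anchor: torch.Tensor,
                      positive: torch.Tensor,
                      negative: torch.Tensor,
                      margin: float = 0.2) -> torch.Tensor:
    """Triplet loss on visual features (step g): two views of the same object
    (anchor, positive) are pulled together, a view of a different object is pushed away."""
    anchor, positive, negative = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_pos = 1.0 - (anchor * positive).sum(-1)   # cosine distance to the positive view
    d_neg = 1.0 - (anchor * negative).sum(-1)   # cosine distance to the negative view
    return F.relu(d_pos - d_neg + margin).mean()
```

During fine-tuning (f)-(g), such a triplet term would be added to the standard captioning loss computed against the pseudo-captions, encouraging representations that stay consistent across viewpoints.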

Qualitative Results

Qualitative examples of predicted captions before and after fine-tuning for different object views. We highlight mistakes and correct details in the generated captions.

BibTeX

@inproceedings{galliena2025embodied,
  title={Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions},
  author={Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}