Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

T. Galliena, T. Apicella, S. Rosa, P. Morerio, A. Del Bue, L. Natale

Istituto Italiano di Tecnologia, Genoa, Italy



We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions under varying camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, combined with any of the exploration policies, achieves higher semantic similarity than existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.


Approach

Exploration

The agent navigates the environment using any exploration policy chosen by the user, allowing full flexibility in how observations are collected. In addition to this, we train a reinforcement learning (RL) policy that explicitly encourages the agent to collect object samples where the captioning model exhibits high disagreement across different viewpoints. This targeted exploration improves the quality and diversity of the collected dataset, focusing on challenging and informative samples.
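As a concrete illustration, the viewpoint disagreement that drives this targeted exploration can be estimated by comparing the captions generated for the same object from different views. The sketch below assumes sentence embeddings are used for the comparison; the embedding model, function names, and reward definition are illustrative and not the paper's exact implementation.

import itertools

import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed off-the-shelf sentence encoder; any text-embedding model would do.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_disagreement(captions):
    """Mean pairwise cosine distance between captions of the same object.

    Higher values mean the captioner describes the object inconsistently
    across viewpoints, so the sample is more informative to collect.
    """
    if len(captions) < 2:
        return 0.0
    emb = encoder.encode(captions, normalize_embeddings=True)
    dists = [1.0 - float(np.dot(emb[i], emb[j]))
             for i, j in itertools.combinations(range(len(emb)), 2)]
    return float(np.mean(dists))

# Example: use the disagreement of the captions produced for the object
# currently in view as the reward signal for the RL exploration policy.
reward = caption_disagreement([
    "a red mug on a table",
    "a white bowl",
    "a red cup near the window",
])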

Pseudo-captioning

From the collected image-caption pairs, we apply a consensus mechanism using a large language model to distill a consistent pseudo-caption for each object instance, reducing noise from different viewpoints and descriptions.
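A minimal sketch of this consensus step is shown below, assuming a generic text-completion interface; query_llm is a hypothetical helper standing in for whatever LLM backend is used, and the prompt wording is illustrative.

def query_llm(prompt):
    """Hypothetical wrapper around an LLM completion call; plug in your client."""
    raise NotImplementedError

def distill_pseudo_caption(captions):
    """Merge noisy per-view captions of one object into a single pseudo-caption."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    prompt = (
        "The following captions describe the same object seen from different "
        "viewpoints. Write one short caption that agrees with the majority of "
        "them and drops contradictory details.\n"
        f"{numbered}\n"
        "Consensus caption:"
    )
    return query_llm(prompt).strip()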

Fine-tuning

The distilled pseudo-captions are used to fine-tune an existing captioning model. We additionally apply contrastive learning techniques to further improve consistency and semantic accuracy across multiple views of the same object.
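A minimal sketch of such a contrastive term is given below, assuming features for two views of the same object are available in each training batch; the temperature and the way the term is combined with the captioning loss are illustrative choices, not the paper's exact settings.

import torch
import torch.nn.functional as F

def multiview_contrastive_loss(feat_a, feat_b, temperature=0.07):
    """InfoNCE over a batch where feat_a[i] and feat_b[i] come from two views
    of the same object instance."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: each view should retrieve its paired view and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# The fine-tuning objective could then combine the usual cross-entropy on the
# pseudo-captions with this term, e.g.
#   loss = caption_ce + lambda_c * multiview_contrastive_loss(feat_a, feat_b)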



Method Overview Video





Reference

If you use the information in the paper, please cite the following reference.

Plain text format
 T. Galliena, T. Apicella, S. Rosa, P. Morerio, A. Del Bue, L. Natale, Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions,
arXiv preprint arXiv:2504.08531, 2025
        

Bibtex format
@article{galliena2025embodied,
        title={Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions},
        author={Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
        journal={arXiv preprint arXiv:2504.08531},
        year={2025}
}
        

Contact

If you have any further enquiries, questions, or comments, please open an issue on the GitHub repository page.