Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

T. Galliena, T. Apicella, S. Rosa, P. Morerio, A. Del Bue, L. Natale

Istituto Italiano di Tecnologia, Genoa, Italy



We present a self-supervised method to improve an agent's ability to describe arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to produce coherent image captions because of varying camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of combinations of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement than classical baselines. Our pseudo-captioning method, in combination with all policies, achieves higher semantic similarity than other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin.
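As an illustration of the second phase only, the following is a minimal sketch of per-object consensus over noisy captions. It is not the method from the paper, which distils the pseudo-caption with a large language model: here a simple word-overlap (Jaccard) medoid is swapped in as a stand-in, and the names `jaccard`, `consensus_caption`, `build_pseudo_labels` and the `(object_id, caption)` input layout are hypothetical.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two captions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_caption(captions: list[str]) -> str:
    """Return the medoid caption: the one most similar to all the others.

    Stand-in for the LLM-based consensus described in the abstract.
    Expects a non-empty list.
    """
    def total_similarity(i: int) -> float:
        return sum(
            jaccard(captions[i], other)
            for j, other in enumerate(captions)
            if j != i
        )

    best = max(range(len(captions)), key=total_similarity)
    return captions[best]


def build_pseudo_labels(observations: list[tuple[str, str]]) -> dict[str, str]:
    """Group (object_id, caption) pairs by object and pick one caption each."""
    by_object: dict[str, list[str]] = defaultdict(list)
    for object_id, caption in observations:
        by_object[object_id].append(caption)
    return {oid: consensus_caption(caps) for oid, caps in by_object.items()}


if __name__ == "__main__":
    # Hypothetical captions gathered from different viewpoints.
    observations = [
        ("chair_1", "a red chair"),
        ("chair_1", "a red wooden chair"),
        ("chair_1", "a blue sofa"),
        ("lamp_2", "a tall lamp"),
    ]
    print(build_pseudo_labels(observations))
```

In this sketch the outlier caption from a poor viewpoint ("a blue sofa") is outvoted by the two captions that agree with each other. The resulting object-to-caption mapping plays the role of the pseudo-labels that the third phase would use for fine-tuning.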


Reference

If you use the information in the paper, please cite the following reference.

Plain text format
T. Galliena, T. Apicella, S. Rosa, P. Morerio, A. Del Bue, L. Natale, Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions, arXiv preprint arXiv:2504.08531, 2025

Bibtex format
@article{galliena2025embodied,
        title={Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions},
        author={Galliena, Tommaso and Apicella, Tommaso and Rosa, Stefano and Morerio, Pietro and Del Bue, Alessio and Natale, Lorenzo},
        journal={arXiv preprint arXiv:2504.08531},
        year={2025}
}
        

Contact

If you have any further enquiries, questions, or comments, please open an issue on the GitHub repository page.