Memory-Augmented Vision–Language Agents for Persistent and Semantically Consistent Object Captioning

Italian Institute of Technology · University of Genoa
Memory-driven multi-view exploration progressively resolves ambiguous object captions into a consistent object-level description. (a) the agent predicts a caption for the observed object; (b) a different caption is predicted from a different viewpoint; (c) a consistent caption is predicted based on the episodic object memory; (d) the predicted caption for the object remains consistent.


Abstract

Vision–Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Prior methods resolve these inconsistencies with offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, and have limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision–Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over the baseline model, while enabling scalable performance through a compact scene representation.

Motivation

Humans build stable object understanding through embodied exploration: by moving, revisiting objects, and integrating observations over time, they form consistent semantic representations. In contrast, vision–language models trained on static image–text pairs treat each frame independently, often producing inconsistent descriptions of the same object as the viewpoint changes. This semantic drift breaks object identity over time and limits their use in embodied settings.

To address this, we propose a unified, memory-augmented approach where perception, memory, and action are jointly modeled, enabling agents to actively select informative viewpoints and maintain consistent object representations across time.

EPOS-VLM

We introduce EPOS-VLM (Embodied Persistent Object Semantics): a single memory-conditioned policy that explores 3D scenes and builds consistent language representations per object. The model conditions on the current RGB frame (with detected objects), a top-down explored map, and tokenized episodic memory (per-object caption histories and 3D positions). In one autoregressive pass it predicts data association (linking detections to persistent IDs or new objects), object-level captions, and navigation actions. Training uses self-supervised pseudo-captions from a 3D-aware aggregator (3D-CPS), with data collected under a disagreement-based exploration policy in Habitat (HM3D / Gibson).
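The disagreement-based collection policy is only summarized above; one plausible reading is to score each candidate object by how much its caption history disagrees with itself, and to steer the agent toward the most ambiguous one. The snippet below is a minimal sketch of that idea — the scoring function and the `pick_goal` helper are our illustrative assumptions, not the paper's exact policy:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two caption embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disagreement(embeddings):
    """1 - mean pairwise cosine similarity over an object's caption embeddings.
    High values mean the views disagree about what the object is."""
    n = len(embeddings)
    if n < 2:
        return 1.0  # a single view is treated as maximally uncertain
    sims = [cosine(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]
    return 1.0 - float(np.mean(sims))

def pick_goal(candidate_objects):
    """candidate_objects: {object_id: list of caption embeddings seen so far}.
    Returns the object whose caption history is least self-consistent,
    i.e. the most informative one to revisit."""
    return max(candidate_objects,
               key=lambda oid: disagreement(candidate_objects[oid]))
```

Under this sketch, an object captioned "sofa" from one view and "armchair" from another scores high and becomes the next exploration target.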

Contributions

  1. A unified, memory-conditioned vision–language–action model for embodied captioning that jointly learns data association, object-level captioning, and action prediction.
  2. Structured tokenization of episodic object memory so a pretrained VLM can reason over long-horizon object histories end-to-end.
  3. An embodied captioning dataset with navigation trajectories and object-level pseudo-captions, plus a manually annotated object-level caption benchmark.

Architecture

EPOS-VLM is built on a pretrained Qwen3-VL-2B backbone. Visual inputs are the RGB observation (with instance boxes and transient IDs) and the explored map, resized and stacked into a single image; episodic memory is serialized with special tokens ([SCENE-START], [OBJ-ID], [CAP-HISTORY], etc.) and prepended to the language prompt. The model autoregressively decodes [MATCH] tokens (linking frame IDs to memory or NEW_ID), object-level captions, and a discrete navigation action. The diagram below summarizes the full pipeline.

EPOS-VLM overview: episodic object memory, RGB frame with detections, explored map, and unified VLM producing data association, captions, and navigation action.

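The special tokens named above can be made concrete with a small serializer. The sketch below uses the token names from the paper ([SCENE-START], [OBJ-ID], [CAP-HISTORY]), but the field order, the position format, and the history cap are illustrative assumptions:

```python
def serialize_memory(objects, max_history=3):
    """Serialize episodic object memory into a token string for the VLM prompt.
    `objects`: list of dicts with 'id', 'pos' (x, y, z), 'captions' (newest last).
    Only the special-token names follow the paper; everything else here is an
    assumed layout for illustration."""
    parts = ["[SCENE-START]"]
    for obj in objects:
        x, y, z = obj["pos"]
        # Keep only the most recent captions so the prompt stays fixed-capacity.
        caps = " ; ".join(obj["captions"][-max_history:])
        parts.append(
            f"[OBJ-ID] {obj['id']} <{x:.1f},{y:.1f},{z:.1f}> [CAP-HISTORY] {caps}"
        )
    return "\n".join(parts)
```

This string is prepended to the language prompt, after which the model decodes [MATCH] tokens linking each detected frame ID to a memory entry (or to a NEW_ID), followed by captions and a navigation action.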

Results (highlights)

On HM3D, EPOS-VLM improves over strong VLMs and prior embodied captioning pipelines on standard caption metrics and on cross-view semantic consistency (SBERT cosine similarity between captions of the same object). An ablation without episodic memory drops performance sharply, confirming that memory is central. Compared to a dense point-cloud association baseline, EPOS-VLM achieves similar association quality at much lower inference time and memory cost, thanks to its fixed-capacity tokenized memory. An exploration ablation (random goals vs. frontier vs. the EPOS-VLM policy) shows that action selection, not only aggregation, matters for consistency.

Object-level captioning and consistency on the HM3D manually annotated test set. Higher is better for all captioning metrics and for Mean/Median CS; lower is better for IQR.

| Model | B4 | M | RL | CI | SP | CS | Mean CS | Median CS | IQR |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL | 16.00 | 15.97 | 41.15 | 0.84 | 29.88 | 60.01 | 59.65 | 60.01 | 29.12 |
| BLIP-2 | 10.99 | 14.51 | 39.99 | 0.57 | 26.16 | 56.88 | 60.15 | 64.12 | 25.29 |
| Intern-VL | 17.21 | 16.87 | 46.02 | 0.98 | 30.02 | 61.18 | 55.19 | 58.76 | 27.34 |
| CoCa | 15.23 | 16.13 | 43.33 | 0.83 | 29.95 | 57.47 | 57.87 | 57.26 | 29.01 |
| Florence | 14.19 | 17.26 | 44.03 | 0.78 | 31.45 | 59.12 | 52.56 | 55.00 | 26.98 |
| Galliena et al. (ICCV’25) | 19.02 | 23.04 | 48.02 | 1.12 | 39.78 | 70.00 | 81.98 | 83.07 | 8.87 |
| EPOS-VLM | 25.86 | 27.89 | 59.88 | 1.87 | 41.82 | 77.12 | 89.37 | 90.32 | 2.79 |
| EPOS-VLM w/o memory | 16.65 | 18.01 | 45.34 | 0.85 | 27.67 | 64.62 | 52.12 | 52.84 | 32.01 |

B4: BLEU-4; M: METEOR; RL: ROUGE-L; CI: CIDEr; SP: SPICE; CS: cosine similarity between SBERT embeddings of prediction and reference caption. Mean CS / Median CS: cosine similarity between SBERT embeddings of captions for the same object across views. IQR: interquartile range of pairwise cosine similarities (lower = more consistent).
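The consistency columns (Mean CS, Median CS, IQR) are statistics over pairwise similarities of same-object caption embeddings. A minimal sketch of that computation — the actual evaluation uses SBERT embeddings, but any embedding vectors work here:

```python
import numpy as np

def consistency_stats(embeddings):
    """Mean CS, Median CS and IQR over pairwise cosine similarities of the
    caption embeddings produced for the same object across views."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarity matrix
    iu = np.triu_indices(len(E), k=1)                 # unique unordered pairs
    pair = sims[iu]
    q1, q3 = np.percentile(pair, [25, 75])
    return {"mean_cs": float(pair.mean()),
            "median_cs": float(np.median(pair)),
            "iqr": float(q3 - q1)}
```

Perfectly consistent captions give Mean CS = Median CS = 1 and IQR = 0; inconsistent ones spread the pairwise similarities, raising IQR.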

Inference time and memory. Compared to a dense point-cloud association baseline, EPOS-VLM maintains near-constant per-step wall-clock time and a small, fixed-capacity memory representation, while the baseline’s cost grows over long episodes. The plots below show mean and standard deviation over HM3D test (400 steps).

Two plots over 400 steps on HM3D test. Left: wall-clock time in seconds; EPOS-VLM stays near flat, point-cloud baseline increases. Right: memory in GB on log scale; EPOS-VLM stays orders of magnitude below the baseline plateau near 1 GB.

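The fixed-capacity claim can be illustrated with a tiny data structure: per object, the stored state is one position plus a bounded caption history, so memory grows with the number of distinct objects rather than with episode length — unlike a point cloud that accumulates every observed point. A sketch (class name and the cap of 3 captions are our assumptions):

```python
from collections import deque

class ObjectMemory:
    """Fixed-capacity episodic object memory: at most `max_captions` captions
    and a single 3D position per object, so the serialized prompt length is
    bounded regardless of how long the episode runs."""
    def __init__(self, max_captions=3):
        self.max_captions = max_captions
        self.objects = {}  # object id -> {"pos": tuple, "captions": deque}

    def update(self, obj_id, pos, caption):
        if obj_id not in self.objects:
            self.objects[obj_id] = {
                "pos": pos,
                "captions": deque(maxlen=self.max_captions),
            }
        entry = self.objects[obj_id]
        entry["pos"] = pos                 # keep only the latest position estimate
        entry["captions"].append(caption)  # deque evicts the oldest caption

    def size(self):
        # Total stored captions, bounded by max_captions * number of objects.
        return sum(len(e["captions"]) for e in self.objects.values())
```

Revisiting the same object a hundred times still costs three stored captions, which is why per-step inference time and memory stay near-constant.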

Qualitative comparison. Over time along the same trajectory, Qwen3-VL produces view-dependent captions for the boxed object (sofa vs. armchair, bed vs. bench, shifting table/chair details); mistakes are highlighted in red in the figure. EPOS-VLM keeps a stable object-level description by conditioning on episodic memory and association.

Qualitative comparison over time: three rows (living room sofa, bedroom bed, dining table). Each row shows three frames with Qwen3-VL vs EPOS-VLM captions; Qwen drifts and red marks errors; EPOS-VLM stays consistent.


Test-set trajectories

Combined trajectory visualizations from EPOS-VLM rollouts on HM3D test scenes (one panel per episode).

Conclusion

EPOS-VLM shows that persistent object memory and end-to-end training of captioning, association, and actions can substantially improve object-level caption accuracy and viewpoint consistency for embodied VL agents, with efficient inference compared to dense 3D representations. Limitations include reliance on an external segmenter and evaluation in simulated, static environments; future work includes tighter integration of perception, real-world deployment, and dynamic scenes.

BibTeX

@misc{galliena2026memoryaugmentedvisionlanguageagentspersistent,
  title={Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning},
  author={Tommaso Galliena and Stefano Rosa and Tommaso Apicella and Pietro Morerio and Alessio Del Bue and Lorenzo Natale},
  year={2026},
  eprint={2603.24257},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.24257},
}