Pipeline Overview
Our modular framework separates exploration, pseudo-caption generation, and model training into three flexible phases. This design allows independent improvement at each stage: collecting diverse object views, generating consistent pseudo-captions via prompting, and fine-tuning with both a supervised loss and a contrastive loss to enhance caption consistency across views.
The agent navigates to goal locations predicted by a learned policy (a). At each step, it performs instance segmentation and captioning; the detections are projected into a 3D voxel map (b). A semantic voxel map is built by accumulating segmented instances and associated captions over time (c). Captions collected for the same object are grouped along with their frequencies (d). The Language-Driven Consistent Pseudo-captioner (LD-CPS) module generates a single consistent pseudo-caption for each object using prompting and in-context learning (e). The captioner model is then fine-tuned with the pseudo-captions as supervision (f), while a triplet loss is introduced to encourage consistent visual representations across views of the same object (g).
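To make stages (b)–(c) concrete, the sketch below shows one way segmented detections could be back-projected into a semantic voxel map, assuming a pinhole camera with known intrinsics, a depth image, and a camera-to-world pose. The class and parameter names (`SemanticVoxelMap`, `voxel_size`) are illustrative and not taken from our implementation.

```python
import numpy as np
from collections import defaultdict

class SemanticVoxelMap:
    """Accumulates segmented instances and their captions in a voxel grid."""

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        # voxel index -> list of (instance_id, caption) observations
        self.observations = defaultdict(list)

    def integrate(self, depth, mask, caption, instance_id, K, cam_pose):
        """Back-project masked pixels to world coordinates and record the caption.

        depth:    (H, W) depth image in meters
        mask:     (H, W) boolean instance mask
        K:        (3, 3) camera intrinsics
        cam_pose: (4, 4) camera-to-world transform
        """
        v, u = np.nonzero(mask)                  # pixel coordinates inside the mask
        z = depth[v, u]
        valid = z > 0
        u, v, z = u[valid], v[valid], z[valid]

        # Pinhole back-projection into the camera frame.
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # (4, N)
        pts_world = (cam_pose @ pts_cam)[:3].T                   # (N, 3)

        # Quantize to voxel indices and store the observation.
        voxels = np.unique(np.floor(pts_world / self.voxel_size).astype(int), axis=0)
        for idx in voxels:
            self.observations[tuple(idx)].append((instance_id, caption))
```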
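Stages (d)–(e) can be illustrated with a minimal prompt-construction sketch: the captions observed for one object are grouped by frequency, and an in-context prompt is assembled for the language model behind LD-CPS. The function name and prompt wording are hypothetical, and the call to the language model itself is left as a placeholder.

```python
from collections import Counter

def build_pseudo_caption_prompt(captions, in_context_examples):
    """Group an object's captions by frequency (d) and format an LD-CPS-style
    prompt with in-context examples (e).

    captions:            list of raw captions observed for one object
    in_context_examples: list of (grouped_captions_str, consistent_caption) pairs
    """
    counts = Counter(captions)  # caption -> frequency
    grouped = "; ".join(f'"{c}" (x{n})' for c, n in counts.most_common())

    prompt = (
        "Given several noisy captions of the same object with their frequencies, "
        "write a single consistent caption.\n\n"
    )
    for example_input, example_output in in_context_examples:
        prompt += f"Captions: {example_input}\nConsistent caption: {example_output}\n\n"
    prompt += f"Captions: {grouped}\nConsistent caption:"
    return prompt

# Example usage (hypothetical data):
prompt = build_pseudo_caption_prompt(
    ["a red mug on a desk", "a red cup", "a red mug on a desk"],
    [('"a wooden chair" (x3); "a brown chair" (x1)', "a brown wooden chair")],
)
# `prompt` is sent to the language model; its completion is kept as the
# object's pseudo-caption.
```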
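Finally, stages (f)–(g) combine pseudo-caption supervision with the cross-view triplet loss. The sketch below assumes a captioner that exposes a visual encoder (`captioner.encode`) and a teacher-forced forward pass returning token logits; these interfaces, the margin, and the weighting term `lam` are assumptions for illustration rather than our exact training code.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def training_step(captioner, batch, lam=0.5):
    """One fine-tuning step: pseudo-caption supervision (f) plus a
    cross-view triplet loss (g).

    batch: "anchor" / "positive" are two views of the same object,
           "negative" is a view of a different object,
           "pseudo_tokens" is the LD-CPS pseudo-caption for the anchor's object.
    """
    # Supervised captioning loss against the pseudo-caption (teacher forcing).
    logits = captioner(batch["anchor"], batch["pseudo_tokens"][:, :-1])
    ce = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["pseudo_tokens"][:, 1:].reshape(-1),
    )

    # Triplet loss on visual embeddings: two views of the same object
    # should lie closer together than a view of a different object.
    emb_a = captioner.encode(batch["anchor"])
    emb_p = captioner.encode(batch["positive"])
    emb_n = captioner.encode(batch["negative"])
    tri = triplet(emb_a, emb_p, emb_n)

    return ce + lam * tri
```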