Learning from Models and Data for Visual Grounding

Ruozhen (Catherine) He1     Paola Cascante-Bonilla1     Ziyan Yang1     Alexander C. Berg2     Vicente Ordóñez1    
1Rice University,     2UC Irvine    


We introduce SynGround, a novel framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models to enhance the visual grounding capabilities of a pretrained vision-and-language model. Knowledge transfer begins with an image description generator, which produces image descriptions. These descriptions serve two purposes: they act as prompts for synthesizing images with a text-to-image generator, and as queries for synthesizing text, from which phrases are extracted using a large language model. Finally, we use an open-vocabulary object detector to generate synthetic bounding boxes for the synthetic images and texts. We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention consistency objective that aligns region annotations with gradient-based model explanations. The resulting model improves on the grounding capabilities of the off-the-shelf vision-and-language model. In particular, SynGround improves the pointing game accuracy of ALBEF from 79.38% to 87.26% on the Flickr30k dataset, from 69.35% to 79.06% on RefCOCO+ Test A, and from 53.77% to 63.67% on RefCOCO+ Test B.


Data is effective for learning visual grounding, but expensive to curate at scale. In contrast, learning from related models is more flexible yet less effective. Our proposed paradigm leverages the benefits of training using both data and models, improving performance for visual grounding.

Overview of our image-text-box synthesis pipeline. An image description generator Ψc outputs a description that serves as a prompt for an image generator Ψg, which produces a synthetic image I. The same description is used to prompt an LLM Ψt for text phrases T. Finally, the synthetic text and image are fed to an object detector Ψd to obtain synthetic boxes B.
In this work, we introduce a pragmatic framework for image-text-box synthesis tailored for visual grounding. To the best of our knowledge, this paper is the first to study the extent to which learning from models impacts the capability of a pre-trained vision-and-language model to localize objects in an image given a visual explanation. We navigate from lower to higher synthetic purity levels, and break down our investigation of synthetic image-text-box generation into image-text pair synthesis and image-text-box synthesis. Our method, SynGround, leverages a captioning model to generate dense textual descriptions, which are used for image synthesis. The generated image descriptions are also fed into an LLM for text synthesis. Finally, the image-text pairs are complemented with synthetic bounding boxes from an open-vocabulary object detector. Remarkably, finetuning a base pretrained vision-and-language model on such a synthetic set leads to a substantial performance gain, showcasing the potential of learning from models. More importantly, performance improves further when learning from both models and data, that is, when finetuning on real and synthetic data together.
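
The data flow of the synthesis pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the four stages (Ψc, Ψg, Ψt, Ψd) are passed in as plain callables, and the names `synthesize_sample`, `SyntheticSample`, `source`, and the box coordinate convention are all hypothetical choices made for illustration. The toy stand-ins at the bottom only show how the pieces connect; in practice each would wrap a real pretrained model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# (x_min, y_min, x_max, y_max); this coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class SyntheticSample:
    description: str        # output of the description generator (Psi_c)
    image: Any              # synthetic image I from the image generator (Psi_g)
    phrases: List[str]      # text phrases T obtained from the LLM (Psi_t)
    boxes: List[List[Box]]  # synthetic boxes B from the detector (Psi_d), one list per phrase


def synthesize_sample(
    source: Any,
    describe: Callable[[Any], str],
    generate_image: Callable[[str], Any],
    extract_phrases: Callable[[str], List[str]],
    detect: Callable[[Any, str], List[Box]],
) -> SyntheticSample:
    """Chain the four stages: description -> image and phrases -> boxes."""
    description = describe(source)            # Psi_c
    image = generate_image(description)       # Psi_g, prompted with the description
    phrases = extract_phrases(description)    # Psi_t, queried with the same description
    boxes = [detect(image, phrase) for phrase in phrases]  # Psi_d, one call per phrase
    return SyntheticSample(description, image, phrases, boxes)


# Toy stand-ins (hypothetical) that only demonstrate the wiring.
demo = synthesize_sample(
    source="seed input",
    describe=lambda s: "a dog sleeping on a red couch",
    generate_image=lambda prompt: {"prompt": prompt},
    extract_phrases=lambda text: ["a dog", "a red couch"],
    detect=lambda image, phrase: [(0.0, 0.0, 1.0, 1.0)],
)
```

Passing the stages as callables keeps the sketch independent of any particular captioner, text-to-image model, LLM, or detector, which mirrors how the description treats each stage as a swappable pretrained model.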

Experimental Results

We evaluate a vision-and-language model's visual grounding performance through gradient-based explanations with pointing-game accuracy on RefCOCO+ and Flickr30k. We compare the visual grounding improvements for the off-the-shelf base model (row 1) by learning exclusively from data (row 2), from models (row 3), and a combination of both (row 4).
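
For readers unfamiliar with the metric, the following sketch shows how pointing-game accuracy is typically computed: a prediction counts as a hit when the peak of the explanation heatmap falls inside the ground-truth box. The function names, the nested-list heatmap representation, the inclusive pixel box convention, and the single-box-per-phrase simplification are assumptions made for illustration, not details taken from the evaluation code behind the reported numbers.

```python
from typing import Sequence, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumed convention).
Box = Tuple[int, int, int, int]
Heatmap = Sequence[Sequence[float]]  # rows of per-pixel explanation scores


def peak_location(heatmap: Heatmap) -> Tuple[int, int]:
    """Return (x, y) of the maximum heatmap value; ties go to the first one found."""
    best_value = float("-inf")
    best_xy = (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def is_hit(heatmap: Heatmap, box: Box) -> bool:
    """A hit means the heatmap peak lies inside the ground-truth box."""
    x, y = peak_location(heatmap)
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def pointing_game_accuracy(heatmaps: Sequence[Heatmap], boxes: Sequence[Box]) -> float:
    """Percentage of (heatmap, box) pairs that are hits."""
    if len(heatmaps) != len(boxes):
        raise ValueError("heatmaps and boxes must have the same length")
    if not heatmaps:
        raise ValueError("at least one example is required")
    hits = sum(is_hit(h, b) for h, b in zip(heatmaps, boxes))
    return 100.0 * hits / len(heatmaps)
```

Here each heatmap would be the gradient-based explanation produced by the model for one text phrase, and each box the annotated region for that phrase.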

Samples of our synthetic image-text-boxes.

Samples of our synthetic image-text-boxes generated at a higher synthetic purity level.

    title={Learning from Models and Data for Visual Grounding},
    author={He, Ruozhen and Cascante-Bonilla, Paola and Yang, Ziyan and Berg, Alexander C and Ordonez, Vicente},
    journal={arXiv preprint arXiv:2403.13804},