Beyond Referring Expressions: Scenario Comprehension Visual Grounding

1Rice University  ·  2Johns Hopkins University  ·  3Northeastern University

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. Its queries are paragraph-length texts that describe object roles, user goals, and contextual cues, and they deliberately reference distractor objects so that resolving the target requires genuine scene understanding. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.


Overview

From Naming to Scenario Grounding

RSC Teaser figure
Figure 1: Referring Scenario Comprehension (RSC) vs. traditional referring expression comprehension (REC). Each row shows the same target object under both paradigms. Traditional REC queries often name the target category directly, allowing success via lexical matching. RSC instead pairs each image with a lengthy scenario-based query specifying a user role, goal, and multiple disambiguating cues — including explicit contrasts against competing objects. The RSC difficulty tags (U/C/S/O/P) characterize each instance, enabling fine-grained training and evaluation.

RSC Benchmark

Referring Scenario Comprehension Dataset

RSC replaces referring phrases with scenario-based queries that describe a user role, a goal, and at least three disambiguating cues, and that deliberately mention competing objects so the target cannot be resolved by lexical matching alone. Each instance is annotated with reasoning traces and five interpretable difficulty tags (Uniqueness, Clutter, Size, Overlap, and Position), which expose distinct failure modes and support fine-grained curriculum design and evaluation.

~31k
Training examples
4k
In-domain test
3k
OOD test (unseen)
52.7
Avg. query length (words)
9,086
Unique vocabulary tokens
RSC Pipeline
Figure 2: Phase 1 filters and balances source instances, computing five interpretable difficulty tags to form a tag-balanced candidate pool. Phase 2 generates annotations via a two-stage process. Phase 3 applies automatic and human quality control. The final RSC dataset provides, per instance, a scenario query, reasoning traces, acceptable names, a ground-truth box, and difficulty tags.
Query length distribution
Query length distribution. RSC queries peak around 50–60 words vs. under 10 for RefCOCO+/g.
Instance size distribution
Instance size distribution. RSC covers a broader range of target scales, with higher density at smaller instances.
Difficulty tag distributions
Difficulty Tag Distributions (RSC-ID vs. RSC-OOD). Per-tag marginals across all five axes: Uniqueness (U), Clutter (C), Size (S), Overlap (O), and Position (P). The ID split maintains near-balanced marginals by design; the OOD split skews toward non-unique (U2) and smaller instances, reflecting LVIS's fine-grained vocabulary.

ScenGround

Two-Stage Curriculum Reasoning

ScenGround is a two-stage curriculum reasoning method for scenario-based visual grounding. In Stage 1, Thought-Primed SFT (TP-SFT) aligns the model to the output schema and elicits faithful reasoning traces before a structured answer, using the easier RSC slices to stabilize interface learning. In Stage 2, Incentive-Curriculum GRPO (IC-GRPO) refines localization and disambiguation via shaped rewards that couple geometry (smooth IoU, center consistency, out-of-bounds penalties) with alias-aware category rewards. Training follows a tag-aware curriculum, feeding in progressively harder non-unique, cluttered, overlapping, and off-center targets in the later stage. A prompt-template ensemble (PTE-8) further improves robustness across query surface forms.
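The geometric component of such a shaped reward can be sketched as follows. This is a minimal illustration of the three terms named above (IoU, center consistency, out-of-bounds penalty); the weighting coefficients, the exact smoothing, and the combination with the alias-aware category reward are assumptions, not the paper's implementation:

```python
def iou(box_a, box_b):
    """IoU between two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def geometry_reward(pred, gt, img_w, img_h, alpha=0.5, beta=0.25):
    """Illustrative shaped geometric reward:
    IoU + alpha * center consistency - beta * out-of-bounds penalty.
    alpha and beta are placeholder weights, not the paper's values."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    # Dense overlap term in [0, 1].
    r_iou = iou(pred, gt)
    # Center consistency: 1 at perfect alignment, decaying with
    # center distance normalized by the image diagonal.
    pcx, pcy = px + pw / 2, py + ph / 2
    gcx, gcy = gx + gw / 2, gy + gh / 2
    dist = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) ** 0.5
    diag = (img_w ** 2 + img_h ** 2) ** 0.5
    r_center = max(0.0, 1.0 - dist / diag)
    # Out-of-bounds penalty: fraction of the predicted box outside the image.
    in_w = max(0.0, min(px + pw, img_w) - max(px, 0.0))
    in_h = max(0.0, min(py + ph, img_h) - max(py, 0.0))
    oob = 1.0 - (in_w * in_h) / (pw * ph) if pw > 0 and ph > 0 else 1.0
    return r_iou + alpha * r_center - beta * oob
```

A dense reward of this shape gives the policy gradient signal even when the predicted box misses the IoU thresholds entirely, which is what makes it suitable for the harder curriculum slices.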

ScenGround prompt schema
Figure 3: ScenGround prompt and output schema. The model reasons inside <think> and emits structured JSON inside <answer> with target_object and bbox [x,y,w,h]. Scenarios avoid category names and force disambiguation. IC-GRPO uses 8 prompt templates (PTE-8) for robustness.
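A response following this schema can be checked and parsed with a few lines of code. The tag names and JSON keys below come from Figure 3; the helper itself and its error handling are our illustrative sketch, not the released evaluation code:

```python
import json
import re

def parse_answer(response: str):
    """Extract the structured answer from a ScenGround-style response.

    Expects reasoning inside <think>...</think> followed by JSON inside
    <answer>...</answer> with keys "target_object" and "bbox" ([x, y, w, h]).
    Returns (target_object, bbox) or None if the schema is violated.
    """
    m = re.search(r"<answer>\s*(\{.*?\})\s*</answer>", response, re.DOTALL)
    if m is None:
        return None
    try:
        payload = json.loads(m.group(1))
        bbox = [float(v) for v in payload["bbox"]]
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    if len(bbox) != 4:
        return None
    return payload["target_object"], bbox
```

Treating schema violations as a parse failure (rather than guessing a box) also gives RL training a natural place to attach a format reward.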

Experiments

State-of-the-Art on RSC and Standard Benchmarks

Results reveal a consistent pattern: models with strong category accuracy tend to lag on localization, while strong detectors lack semantic reasoning. ScenGround substantially outperforms all baselines on ID mIoU and consistently reduces the localization–semantics trade-off across ID and OOD splits.

| Model | mIoU (ID) | Acc@0.5 (ID) | Acc@0.75 (ID) | Cat Acc (ID) | mIoU (OOD) | Acc@0.5 (OOD) | Acc@0.75 (OOD) | Cat Acc (OOD) |
|---|---|---|---|---|---|---|---|---|
| *Closed-source LLMs* | | | | | | | | |
| GPT-4o | 19.41 | 13.23 | 5.37 | 79.45 | 16.57 | 9.55 | 3.08 | 62.00 |
| Claude 3.7 | 16.64 | 8.32 | 3.71 | 89.67 | 12.04 | 5.54 | 1.87 | 58.98 |
| *Specialist grounding models (oracle settings ‡)* | | | | | | | | |
| Grounding DINO (cat token) ‡ | 44.60 | 47.55 | 42.03 | — | 32.18 | 31.99 | 27.89 | — |
| Grounding DINO (ref. cue) ‡ | 48.99 | 51.84 | 46.02 | — | 38.12 | 38.26 | 34.07 | — |
| *Open-source VLMs* | | | | | | | | |
| InternVL2.5 8B | 16.76 | 11.88 | 6.74 | 81.70 | 8.08 | 3.64 | 1.61 | 36.50 |
| Qwen3-VL 8B | 15.46 | 11.17 | 6.05 | 75.04 | 7.38 | 3.70 | 1.48 | 46.97 |
| Qwen2.5-VL 7B | 30.31 | 27.42 | 15.66 | 30.86 | 21.54 | 15.88 | 9.19 | 20.82 |
| **ScenGround (Ours)** | 55.68 | 60.90 | 42.32 | 94.23 | 38.37 | 38.11 | 22.64 | 21.13 |

‡ Oracle settings: Grounding DINO receives privileged inputs (gold category name or short ref. cue) unavailable at inference.
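The localization metrics in the table follow the standard definitions: mIoU is the mean IoU between predicted and ground-truth boxes, and Acc@τ is the fraction of predictions with IoU at least τ. A minimal sketch (box format [x, y, w, h] as in the output schema; function names are ours; Cat Acc, which compares the predicted name against the acceptable-names list, is omitted):

```python
def box_iou(a, b):
    """IoU between two [x, y, w, h] boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts):
    """mIoU and Acc@{0.5, 0.75} over paired predicted/ground-truth boxes,
    reported as percentages as in the results table."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    return {
        "mIoU": 100.0 * sum(ious) / n,
        "Acc@0.5": 100.0 * sum(i >= 0.5 for i in ious) / n,
        "Acc@0.75": 100.0 * sum(i >= 0.75 for i in ious) / n,
    }
```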

Qualitative results
Figure 4: Qualitative results on RSC. ScenGround correctly localizes scenario-described targets — including distinguishing an illustrated animal on a book cover from a real dog in the foreground. Green = ground truth, blue/red = ScenGround prediction.
Top-20 categories
Top 20 most frequent categories in RSC-ID (79 COCO categories) and RSC-OOD (395 LVIS categories). OOD categories are disjoint at both string and synset level.

Citation

BibTeX

@article{he2026rsc,
  title = {Beyond Referring Expressions: Scenario Comprehension Visual Grounding},
  author = {He, Ruozhen and Shah, Nisarg A. and Dong, Qihua and Xiao, Zilin and Koo, Jaywon and Ordonez, Vicente},
  journal = {arXiv preprint},
  year = {2026},
  url = {https://arxiv.org/abs/XXXX.XXXXX}
}