When Social Media Images Need Words: Measuring Context Gap and Fusion Tax in Crisis Image Captioning
DOI:
https://doi.org/10.59297/4bar0c32
Keywords:
Crisis Informatics, Situational Awareness, Image Captioning, Multimodal Large Language Models, Social Media
Abstract
Crisis images on social media can be difficult to interpret at scale, and many carry meaning embedded in text or symbols (e.g., radar screenshots, evacuation notices). This limits vision-only captioning for situational awareness. We quantify a central trade-off in multimodal captioning: adding post text can reduce omission-driven ambiguity (the Context Gap), but it can also introduce text-driven errors (the Fusion Tax). Using 204 high-priority image–post pairs from CrisisFACTS, we compare Vision-only and Vision + Text captioning across Gemini 2.0 Flash, Qwen2.5-VL, and BLIP. We find that post text improves accuracy for Gemini and Qwen largely by reducing misidentification and scene-type errors, while sometimes amplifying hallucinated (unsupported) details. BLIP, however, does not reliably fuse modalities in our setup. When post text is provided, it often collapses into simple text echoing rather than producing image-grounded captions. We discuss implications for multimodal fusion in crisis informatics and outline next steps for image-type evaluation and routing.
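The trade-off described above can be made concrete with a small tally over per-caption error annotations. The sketch below is illustrative only: the record layout, the label names, the grouping of labels into "omission-type" and "text-driven" sets, and the rate-difference definitions are all assumptions for the example, not the paper's annotation scheme or metrics.

```python
# Hypothetical annotation records: (model, condition, error_label).
# Label names and groupings are assumptions for illustration, not the study's scheme.
OMISSION_TYPE = {"misidentification", "scene_type"}  # assumed proxy for Context Gap errors
TEXT_DRIVEN = {"hallucination", "text_echo"}         # assumed proxy for Fusion Tax errors


def error_rate(records, model, condition, labels):
    """Fraction of captions for (model, condition) whose error label is in `labels`."""
    rows = [r for r in records if r[0] == model and r[1] == condition]
    if not rows:
        return 0.0
    return sum(1 for r in rows if r[2] in labels) / len(rows)


def gap_and_tax(records, model):
    """Return (gap_closed, fusion_tax) as simple rate differences between conditions.

    gap_closed: drop in omission-type errors when post text is added.
    fusion_tax: rise in text-driven errors when post text is added.
    """
    gap_closed = (error_rate(records, model, "vision", OMISSION_TYPE)
                  - error_rate(records, model, "vision+text", OMISSION_TYPE))
    fusion_tax = (error_rate(records, model, "vision+text", TEXT_DRIVEN)
                  - error_rate(records, model, "vision", TEXT_DRIVEN))
    return gap_closed, fusion_tax


if __name__ == "__main__":
    # Made-up toy records, not results from the paper.
    toy = [
        ("model_a", "vision", "misidentification"),
        ("model_a", "vision", "misidentification"),
        ("model_a", "vision", "none"),
        ("model_a", "vision", "hallucination"),
        ("model_a", "vision+text", "scene_type"),
        ("model_a", "vision+text", "hallucination"),
        ("model_a", "vision+text", "hallucination"),
        ("model_a", "vision+text", "none"),
    ]
    print(gap_and_tax(toy, "model_a"))
```

Under these assumed definitions, a positive first value means post text reduced omission-type errors, and a positive second value means it introduced additional text-driven errors.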