
Abstract
Large foundation models can exhibit unique capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., from spreadsheets to SAT questions). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this model diversity is symbiotic, and can be leveraged to build AI systems with structured Socratic dialogue -- in which new multimodal tasks are formulated as a guided language-based exchange between different pre-existing foundation models, without additional finetuning. In the context of egocentric perception, we present a case study of Socratic Models (SMs) that can provide meaningful results for complex tasks such as generating free-form answers to contextual questions about egocentric video, by formulating video Q&A as short story Q&A, i.e., summarizing the video into a short story and then answering questions about it. Additionally, SMs can generate captions for Internet images, and are competitive with the state of the art on zero-shot video-to-text retrieval with 42.8 R@1 on MSR-VTT 1k-A. SMs demonstrate how to compose foundation models zero-shot to capture new multimodal functionalities, without domain-specific data collection.
Overview: Zero-Shot Multimodal Applications with Socratic Models

Results Highlight: Zero-shot Open-ended Reasoning (i.e., Q&A, Forecasting) on Egocentric Video
Our example Socratic-Model-based system for egocentric perception can respond to a variety of open-ended text prompts -- examples below:
This works by forming a language-based world-state history and treating Video Q&A as a reading comprehension problem:
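As a rough sketch of this formulation (not the released code), the snippet below assumes key-frame summaries with timestamps have already been produced by the VLM/LM exchange, and that a placeholder `lm_complete` function wraps a text-completion LM (e.g., GPT-3); the prompt wording is illustrative only:

```python
# Sketch: egocentric video Q&A as reading comprehension over a language-based
# world-state history (placeholder helpers and prompts, not the released code).

def build_world_state_history(frame_summaries):
    # frame_summaries: list of (timestamp_in_seconds, one-sentence summary) pairs
    # produced by the VLM/LM dialogue over key frames.
    return "\n".join(
        f"{t // 60:02d}:{t % 60:02d}  {summary}" for t, summary in frame_summaries
    )

def answer_video_question(lm_complete, frame_summaries, question):
    # lm_complete: callable that sends a prompt to a large language model
    # (e.g., a text-completion API) and returns its generated text.
    history = build_world_state_history(frame_summaries)
    prompt = (
        "The following is a first-person log of what happened in my day:\n"
        f"{history}\n\n"
        f"Q: {question}\nA:"
    )
    return lm_complete(prompt).strip()
```

Because the world-state history is just text, new key-frame summaries can simply be appended to it, and the same prompt format supports follow-up questions and forecasting-style prompts.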
We thank Cody Wanner for kindly letting us use his video content in our research demonstrations. The examples above were all generated using this video vlog as input:
Results Highlight: Zero-shot Image Captioning
We can also compose foundation models zero-shot to perform image captioning, through closed-loop Socratic dialogue:
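A rough sketch of one such closed-loop exchange is below (illustrative, not the released implementation). It assumes two placeholder helpers: `vlm_rank_texts(image, texts)`, which returns the given texts sorted by VLM image-text similarity (e.g., CLIP, possibly wrapping each term in a template like "a photo of a {}"), and `lm_complete(prompt)`, which wraps a text-completion LM:

```python
# Sketch of closed-loop Socratic image captioning (illustrative, not the released code):
#   1) the VLM zero-shot detects visual context (likely place, visible objects),
#   2) the LM turns that context into several candidate captions,
#   3) the VLM closes the loop by ranking the candidates against the image.

def socratic_caption(image, vlm_rank_texts, lm_complete,
                     place_vocab, object_vocab, n_candidates=5):
    places = vlm_rank_texts(image, place_vocab)[:3]    # zero-shot place recognition
    objects = vlm_rank_texts(image, object_vocab)[:5]  # zero-shot object recognition

    prompt = (
        "I am an intelligent image captioning bot.\n"
        f"This photo may have been taken at: {', '.join(places)}.\n"
        f"Things I might see in it: {', '.join(objects)}.\n"
        "A short, creative caption for this photo:"
    )
    candidates = [lm_complete(prompt).strip() for _ in range(n_candidates)]

    # Closed loop: let the VLM pick the candidate caption that best fits the image.
    return vlm_rank_texts(image, candidates)[0]
```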
Example captions below:

People gather under a blossoming cherry tree, enjoying the beauty of nature together.

At the outdoor market, you can find everything from plantains to Japanese bananas.

This image shows an inviting dining space with plenty of natural light.

A family celebrates a special occasion with ice cream and cake.

A wooden spoon and other kitchen utensils sit on a table in a restaurant kitchen.

A motorcycle lies abandoned in a sandy desert.

This photo captures a person enjoying a meal at a restaurant. The spinach and nasturtium garnish on the plate makes for a beautiful and healthy meal.

This cartoon shows one person enjoying a relaxing bath with their scrub bird.

This photo was taken at a restaurant or pier. You can see the person enjoying their meal with a beautiful view of the water.

The three people in this photo appear to be enjoying a close encounter with an elephant. This majestic creature looks like a gentle giant, and the handlers seem to have a great rapport with her. What a fun and unique experience for these tourists!
Results Highlight: Zero-shot Video-to-Text Retrieval
We can also compose multiple models to perform zero-shot video-to-text retrieval -- this achieves state-of-the-art results among zero-shot methods, narrowing the gap with methods finetuned on the dataset:
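For intuition, here is a simplified sketch of the VLM portion of such a pipeline: mean-pooled CLIP frame embeddings ranked against candidate captions by cosine similarity (the full system also incorporates audio and language models). It assumes the openai/CLIP package and pre-extracted frame images:

```python
import clip
import torch
from PIL import Image

# Simplified sketch: rank candidate captions against a video by mean-pooling CLIP
# frame embeddings (the full Socratic pipeline also uses audio/language models).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def rank_captions(frame_paths, captions):
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    v = model.encode_image(frames).float()
    v = v / v.norm(dim=-1, keepdim=True)
    video_emb = v.mean(dim=0, keepdim=True)                # mean-pool over frames
    tokens = clip.tokenize(captions).to(device)
    t = model.encode_text(tokens).float()
    t = t / t.norm(dim=-1, keepdim=True)
    scores = (video_emb @ t.T).squeeze(0)                  # one similarity per caption
    return [captions[i] for i in scores.argsort(descending=True).tolist()]
```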
Team
Robotics and Augmented Reality at Google
Method

In this work we propose Socratic Models (SMs), a framework that uses structured dialogue between pre-existing foundation models, each of which can exhibit unique (but complementary) capabilities depending on the distribution of data on which it is trained. On various perceptual tasks, this work presents a case study of SMs with visual language models (VLMs, e.g., CLIP), large language models (LMs, e.g., GPT-3, RoBERTa), and audio language models (ALMs, e.g., Wav2CLIP, Speech2Text). From video search to image captioning, and from generating free-form answers to contextual reasoning questions to forecasting future activities -- SMs can provide meaningful results for complex tasks across classically challenging computer vision domains, without any model finetuning.
Examples of guided multi-model exchanges (Socratic Models) for an egocentric perception system: (i, left) parsing a natural language question into search entities (with LM) to be used to find the most relevant key moments in the video (with VLM); (ii, middle) describing each key frame by detecting places and objects (VLM), suggesting commonsense activities (LM), pruning the most likely activity (VLM), then generating a natural language summary (LM) of the SM interaction; (iii, right) concatenating key frame summaries into a language-based world-state history that an LM can use as context to answer the original question:
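A condensed sketch of stages (i) and (ii) is below, using the same kind of placeholder helpers as above (`vlm_rank_texts`/`vlm_rank_images` for VLM similarity ranking, `lm_complete` for LM completion); the prompts and vocabularies are illustrative, and the resulting key-frame summaries feed the reading-comprehension step (iii) sketched earlier:

```python
# Sketch of stages (i) and (ii) of the egocentric Socratic exchange above
# (placeholder helpers; prompts and vocabularies are illustrative).
#   vlm_rank_texts(image, texts)  -> texts sorted by VLM image-text similarity
#   vlm_rank_images(images, text) -> images sorted by VLM image-text similarity
#   lm_complete(prompt)           -> text completion from an LM

def find_key_moments(question, frames, lm_complete, vlm_rank_images, top_k=8):
    # (i) The LM parses the question into visual search entities; the VLM then
    #     retrieves the video frames that best match those entities.
    entities = lm_complete(
        f"Question: {question}\n"
        "Objects or places to visually search for (comma-separated):")
    return vlm_rank_images(frames, entities)[:top_k]

def summarize_key_frame(frame, place_vocab, object_vocab,
                        lm_complete, vlm_rank_texts):
    # (ii) The VLM detects the place and objects, the LM proposes commonsense
    #      activities, the VLM prunes to the most likely one, and the LM
    #      summarizes the exchange in natural language.
    place = vlm_rank_texts(frame, place_vocab)[0]
    objects = vlm_rank_texts(frame, object_vocab)[:3]
    candidates = lm_complete(
        f"I am in a {place}. I see {', '.join(objects)}.\n"
        "Activities I might be doing (comma-separated):").split(",")
    activity = vlm_rank_texts(frame, [a.strip() for a in candidates])[0]
    return lm_complete(
        f"I am in a {place}. I see {', '.join(objects)}. I may be {activity}.\n"
        "Summarize this as one first-person sentence:").strip()
```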

SMs can interface with the user through dialogue and perform a variety of tasks (formulated as Q&A) with egocentric video, sorting reasoning questions by their output modalities, e.g., text-based responses, images from visual search, or video snippets from audio search. Depending on the modality, each question can pass through a different sequence of Socratic interactions between the LM, VLM, and ALM:
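A minimal sketch of this routing step, again with a placeholder `lm_complete`; the modality labels and few-shot prompt are illustrative, not the released implementation:

```python
# Sketch: use the LM to pick an output modality for each user question, then
# dispatch to the corresponding Socratic chain (illustrative prompt/labels).

OUTPUT_MODALITIES = ("text response", "image from visual search",
                     "video snippet from audio search")

def route_question(question, lm_complete):
    prompt = (
        "Classify the best output modality for answering each question.\n"
        "Q: What did I eat for breakfast? -> text response\n"
        "Q: Where did I leave my keys? -> image from visual search\n"
        "Q: Did I lock the front door? -> image from visual search\n"
        "Q: When did the dog bark? -> video snippet from audio search\n"
        f"Q: {question} ->")
    choice = lm_complete(prompt).strip().lower()
    # Fall back to a text response if the LM output matches no known modality.
    for modality in OUTPUT_MODALITIES:
        if modality in choice:
            return modality
    return "text response"
```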

Code
We plan to release prototypes in the form of self-contained colabs. They will be added to this repository and linked here:
• Image Captioning - Open in Colab
• Egocentric Video Q&A - Coming soon
• Video Understanding (MSR-VTT) - Coming soon
Citation
@article{zeng2022socraticmodels,
  title={Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language},
  author={Andy Zeng and Adrian Wong and Stefan Welker and Krzysztof Choromanski and Federico Tombari and Aveek Purohit and Michael Ryoo and Vikas Sindhwani and Johnny Lee and Vincent Vanhoucke and Pete Florence},
  journal={arXiv},
  year={2022}
}
Acknowledgements
We thank Debidatta Dwibedi and Matthew O’Kelly for excellent feedback on improving this manuscript; Anelia Angelova, Jean-Jacques Slotine, Jonathan Tompson, Maria Attarian, and Shuran Song for fruitful technical discussions; Kan Huang for applications support; Ahmed Omran, Aren Jansen, Malcolm Slaney, and Karolis Misiunas for advice on audio models; and Cody Wanner for YouTube videos.