// Glossary · technical

Multimodal AI

Also: multimodal models · vision-language models · MLLMs

AI models that handle multiple input and output types: text, image, audio, and video, with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro as the current frontier examples.

Multimodal AI describes models that accept and produce more than one kind of data. A multimodal model can read a screenshot and answer questions about it, transcribe a meeting recording, generate a diagram from a verbal description, or watch a 20-minute video and summarize the key moments. The dominant frontier models all ship multimodal capabilities by default. GPT-4o handles text, image, audio in and out. Claude 3.5 Sonnet reads images and PDFs alongside text. Gemini 1.5 Pro accepts hour-long video inputs. The shift from single-modal to multimodal happened fast and reshaped what AI agents can do in production.

In production the multimodal capability unlocks workflows that text-only models could not touch. A support agent reads a screenshot of an error and explains the fix without asking the user to describe what they see. A real-estate copilot looks at listing photos and generates compliant marketing descriptions. An internal compliance agent reads contract PDFs with embedded charts and extracts structured terms. A field-ops assistant watches a 90-second video of a broken machine and surfaces the relevant maintenance procedure from the knowledge base. None of these workflows survive in a text-only world because half the source material is not text.

The cost and latency profile of multimodal calls runs higher than text-only. An image input through GPT-4o consumes the equivalent of a few hundred to a few thousand tokens depending on resolution. A video input through Gemini 1.5 Pro can run into the tens of thousands of tokens for a single minute of footage. Production systems plan inference cost accordingly, often downscaling images before processing or sampling video frames at low rates. The AI Support Department ships multimodal screenshot triage as standard, and the AI Ops Department handles image-and-PDF document workflows where pure text models would miss half the content.

// Examples
  • A SaaS support agent receives a customer screenshot showing a 500 error, identifies the dashboard page, and surfaces the relevant fix from the knowledge base in 12 seconds.
  • A real-estate listing copilot processes 80 photos per property and generates fair-housing-compliant marketing copy that mentions visible features without inventing what the camera did not capture.
  • An internal compliance reviewer reads quarterly board decks as PDFs with charts and tables, then extracts structured risk indicators into a tracking spreadsheet.
// Common questions
Which multimodal model should I pick?
GPT-4o leads on real-time audio and balanced multimodal performance. Claude 3.5 Sonnet leads on document and image understanding for long reasoning tasks. Gemini 1.5 Pro leads on long video and audio inputs with its million-token context window. Most production systems start with one and add a second for specific tasks where the alternative scores better.
How much does a multimodal call cost compared to text?
An image at standard resolution through GPT-4o consumes roughly 1,000 to 2,000 tokens depending on detail level. A one-minute audio clip transcribes to a few hundred tokens. Video is the expensive case, with one minute of footage through Gemini 1.5 Pro consuming tens of thousands of tokens. Plan token budgets accordingly.
Can multimodal models generate images and audio?
Some can, partially. GPT-4o produces audio natively. Image generation usually comes from a separate model like DALL-E 3 or Imagen invoked by the text model. The multimodal output story is less mature than the multimodal input story, and most production workflows still chain a dedicated generation model for the output side.
What does multimodal mean for prompt injection risk?
The attack surface expands. An attacker can hide instructions inside image metadata, audio frequencies humans cannot hear, or PDF text rendered invisibly. Indirect [prompt injection](/glossary/prompt-injection) through multimodal inputs is an active research area. Production systems running multimodal in customer-facing roles need image and document sanitization on top of standard injection defenses.
// Related terms
// Ready to ship?

EOI runs fractional AI departments for funded teams under 50. Sales, Content, Ops, Support. Live in 14 days on a monthly retainer.