Image Captioning with AI: Implementation Guide

Generating descriptive captions for images with VLMs

进阶约 9 分钟

Image Captioning with AI: Implementation Guide

Generating descriptive captions for images with VLMs

AI 图像描述实现指南（2026）：同一张图四种用途四种 caption（无障碍 alt-text/电商文案/检索索引/审核元数据）——风格必须显式指定。mini 档足够、降分辨率控成本、批量走 Batch API、DECORATIVE 出口防瞎编。含以图搜图索引架构。

multimodal vision llm openai captioning

Image Captioning with AI: Implementation Guide

Image captioning in 2026 is a solved-by-VLM problem — the work is no longer "can a model describe an image" but "can you get the right kind of description, at the right cost, validated for your use case." Alt-text, e-commerce descriptions, media asset search, and content moderation each need different captions from the same pixels. This guide implements the patterns.

One image, four different "captions"

The biggest practical insight: caption style must be specified, not assumed. The same product photo needs:

text
Alt-text (accessibility):
"Red ceramic mug with white speckled glaze on a wooden desk beside a laptop"
→ factual, concise, no marketing, describes what a sighted user sees
E-commerce description:
"Hand-glazed ceramic mug in vivid red with artisanal speckle finish..."
→ persuasive, attribute-rich, brand voice
Search/index caption (for retrieval):
"red ceramic coffee mug, speckled glaze, wooden desk, laptop, office setting,
warm lighting, product photography"
→ dense keywords/attributes, optimized for matching queries
Moderation/metadata:
{"contains_people": false, "contains_text": false, "setting": "indoor office", ...}
→ structured, schema-bound

Write a distinct prompt per use case and treat them as versioned assets (prompt discipline).

Implementation

python
import base64
from openai import OpenAI
client = OpenAI()
ALT_TEXT_PROMPT = '''Write alt text for this image for screen-reader users.
Rules: ≤125 characters; describe the essential visual content factually;
no "image of"/"picture of"; include any text visible in the image verbatim;
if the image is decorative with no informational content, reply exactly: DECORATIVE.'''def alt_text(image_bytes: bytes) -> str:
    resp = client.chat.completions.create(
        model='gpt-4o-mini',                 # captioning is mini/flash-tier work
        messages=[{'role': 'user', 'content': [
            {'type': 'image_url', 'image_url': {
                'url': f'data:image/jpeg;base64,{base64.b64encode(image_bytes).decode()}'}},
            {'type': 'text', 'text': ALT_TEXT_PROMPT},
        ]}],
        max_tokens=100,
    )
    return resp.choices[0].message.content.strip()

Production notes:

Mini-tier models are the default for captioning — frontier models add little for descriptive tasks; save them for complex visual *reasoning* (vision analysis guide).

Downscale before sending — captioning rarely needs more than ~1024px; resolution is your cost dial.

Batch the backlog: alt-texting 50K product images is the canonical batch-API job at 50% off.

The DECORATIVE escape hatch (and equivalents like "uncertain": true in structured variants) — models forced to always caption will confabulate on ambiguous images.

Captions as search infrastructure

The highest-leverage captioning use is making image libraries searchable: dense caption → embed the caption → vector store → text queries find images (pgvector setup). Two architecture notes:

Caption-embedding vs direct image-embedding (CLIP-style): CLIP-family embeddings search images without captioning, cheaper at scale; caption-then-embed gives you *human-readable, auditable, editable* index entries and richer attribute matching. Many production systems do both and fuse scores.

Store the caption with provenance (model+prompt version) so the index can be re-generated when prompts improve — same enrichment discipline as any derived data.

Quality control

Golden set: 100 images across your real distribution (including bad lighting, busy scenes, text-heavy images); review captions on every prompt/model change.

The hallucination check that matters: captions inventing brand names, counts, or text that isn't there. For text-in-image accuracy specifically, spot-verify against OCR extraction on a sample.

Accessibility review: alt-text quality is a UX/compliance matter — have an accessibility-literate reviewer audit a sample; "technically descriptive" and "useful to a screen-reader user" differ.

FAQ

Open models for captioning? Qwen-VL/LLaVA-class models handle straightforward captioning well self-hosted — the right call at large steady volume or strict privacy; verify on your golden set.

Video? Frame-sample + caption + temporal merge works for indexing; true video understanding (actions over time) wants video-native models — different cost class.

Multilingual alt-text? Generate in the page's language directly (don't caption-then-translate) — one step, better fluency.

*Last updated: June 2026.*

Getting Started

Learn how to get started with this application.

Learn more

Installation Guide

Image Captioning with AI: Implementation Guide

Image Captioning with AI: Implementation Guide

One image, four different "captions"

Alt-text (accessibility):

E-commerce description:

Search/index caption (for retrieval):

Moderation/metadata:

Implementation

Captions as search infrastructure

Quality control

FAQ

Documentation

Getting Started

Learn more