Image Captioning with AI: Implementation Guide

Generating descriptive captions for images with VLMs

返回教程列表
进阶9 分钟

Image Captioning with AI: Implementation Guide

Generating descriptive captions for images with VLMs

AI 图像描述实现指南(2026):同一张图四种用途四种 caption(无障碍 alt-text/电商文案/检索索引/审核元数据)——风格必须显式指定。mini 档足够、降分辨率控成本、批量走 Batch API、DECORATIVE 出口防瞎编。含以图搜图索引架构。

Image Captioning with AI: Implementation Guide

Image captioning in 2026 is a solved-by-VLM problem — the work is no longer "can a model describe an image" but "can you get the right kind of description, at the right cost, validated for your use case." Alt-text, e-commerce descriptions, media asset search, and content moderation each need different captions from the same pixels. This guide implements the patterns.

One image, four different "captions"

The biggest practical insight: caption style must be specified, not assumed. The same product photo needs:

text

Alt-text (accessibility):

"Red ceramic mug with white speckled glaze on a wooden desk beside a laptop" → factual, concise, no marketing, describes what a sighted user sees

E-commerce description:

"Hand-glazed ceramic mug in vivid red with artisanal speckle finish..." → persuasive, attribute-rich, brand voice

Search/index caption (for retrieval):

"red ceramic coffee mug, speckled glaze, wooden desk, laptop, office setting, warm lighting, product photography" → dense keywords/attributes, optimized for matching queries

Moderation/metadata:

{"contains_people": false, "contains_text": false, "setting": "indoor office", ...} → structured, schema-bound

Write a distinct prompt per use case and treat them as versioned assets (prompt discipline).

Implementation

python
import base64
from openai import OpenAI

client = OpenAI()

ALT_TEXT_PROMPT = '''Write alt text for this image for screen-reader users. Rules: ≤125 characters; describe the essential visual content factually; no "image of"/"picture of"; include any text visible in the image verbatim; if the image is decorative with no informational content, reply exactly: DECORATIVE.'''

def alt_text(image_bytes: bytes) -> str: resp = client.chat.completions.create( model='gpt-4o-mini', # captioning is mini/flash-tier work messages=[{'role': 'user', 'content': [ {'type': 'image_url', 'image_url': { 'url': f'data:image/jpeg;base64,{base64.b64encode(image_bytes).decode()}'}}, {'type': 'text', 'text': ALT_TEXT_PROMPT}, ]}], max_tokens=100, ) return resp.choices[0].message.content.strip()

Production notes:

  • Mini-tier models are the default for captioning — frontier models add little for descriptive tasks; save them for complex visual *reasoning* (vision analysis guide).
  • Downscale before sending — captioning rarely needs more than ~1024px; resolution is your cost dial.
  • Batch the backlog: alt-texting 50K product images is the canonical batch-API job at 50% off.
  • The DECORATIVE escape hatch (and equivalents like "uncertain": true in structured variants) — models forced to always caption will confabulate on ambiguous images.
  • Captions as search infrastructure

    The highest-leverage captioning use is making image libraries searchable: dense caption → embed the caption → vector store → text queries find images (pgvector setup). Two architecture notes:

  • Caption-embedding vs direct image-embedding (CLIP-style): CLIP-family embeddings search images without captioning, cheaper at scale; caption-then-embed gives you *human-readable, auditable, editable* index entries and richer attribute matching. Many production systems do both and fuse scores.
  • Store the caption with provenance (model+prompt version) so the index can be re-generated when prompts improve — same enrichment discipline as any derived data.
  • Quality control

  • Golden set: 100 images across your real distribution (including bad lighting, busy scenes, text-heavy images); review captions on every prompt/model change.
  • The hallucination check that matters: captions inventing brand names, counts, or text that isn't there. For text-in-image accuracy specifically, spot-verify against OCR extraction on a sample.
  • Accessibility review: alt-text quality is a UX/compliance matter — have an accessibility-literate reviewer audit a sample; "technically descriptive" and "useful to a screen-reader user" differ.
  • FAQ

    Open models for captioning? Qwen-VL/LLaVA-class models handle straightforward captioning well self-hosted — the right call at large steady volume or strict privacy; verify on your golden set.

    Video? Frame-sample + caption + temporal merge works for indexing; true video understanding (actions over time) wants video-native models — different cost class.

    Multilingual alt-text? Generate in the page's language directly (don't caption-then-translate) — one step, better fluency.


    *Last updated: June 2026.*

    相关工具

    openaipython
    所属主题:OpenAI 开发实战