Image Captioning with AI: Implementation Guide
Generating descriptive captions for images with VLMs
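AI image captioning implementation guide (2026): one image, four uses, four captions (accessibility alt-text, e-commerce copy, retrieval index, moderation metadata), so the style must be specified explicitly. Mini-tier models are sufficient, lower the resolution to control cost, send bulk jobs through the Batch API, and provide a DECORATIVE exit to prevent confabulation. Includes an architecture for an image search index.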
Image captioning in 2026 is a solved-by-VLM problem — the work is no longer "can a model describe an image" but "can you get the right kind of description, at the right cost, validated for your use case." Alt-text, e-commerce descriptions, media asset search, and content moderation each need different captions from the same pixels. This guide implements the patterns.
One image, four different "captions"
The biggest practical insight: caption style must be specified, not assumed. The same product photo needs:
text
Alt-text (accessibility):
"Red ceramic mug with white speckled glaze on a wooden desk beside a laptop"
→ factual, concise, no marketing, describes what a sighted user sees

E-commerce description:
"Hand-glazed ceramic mug in vivid red with artisanal speckle finish..."
→ persuasive, attribute-rich, brand voice

Search/index caption (for retrieval):
"red ceramic coffee mug, speckled glaze, wooden desk, laptop, office setting,
warm lighting, product photography"
→ dense keywords/attributes, optimized for matching queries

Moderation/metadata:
{"contains_people": false, "contains_text": false, "setting": "indoor office", ...}
→ structured, schema-bound
Write a distinct prompt per use case and treat them as versioned assets (prompt discipline).
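One way to treat prompts as versioned assets is a small registry keyed by use case. The sketch below is illustrative only: `CaptionPrompt`, `PROMPTS`, `get_prompt`, the version strings, and the shortened prompt texts are all hypothetical names and placeholders, not part of any library or of the original guide.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionPrompt:
    use_case: str  # e.g. "alt_text", "ecommerce", "search_index", "moderation"
    version: str   # bumped by hand whenever the wording changes
    text: str      # the prompt sent to the model

    @property
    def fingerprint(self) -> str:
        # Short content hash, useful for logging which exact prompt produced a caption.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:12]


# Placeholder prompt texts; real ones would be longer and reviewed like code.
PROMPTS = {
    p.use_case: p
    for p in [
        CaptionPrompt("alt_text", "v1", "Write factual alt text for screen-reader users."),
        CaptionPrompt("ecommerce", "v1", "Write a persuasive product description in brand voice."),
        CaptionPrompt("search_index", "v1", "List dense keywords and attributes for retrieval."),
        CaptionPrompt("moderation", "v1", "Return JSON metadata that follows the agreed schema."),
    ]
}


def get_prompt(use_case: str) -> CaptionPrompt:
    """Look up the prompt for a use case, failing loudly on unknown ones."""
    try:
        return PROMPTS[use_case]
    except KeyError:
        raise ValueError(f"no caption prompt registered for {use_case!r}") from None
```

Storing the `version` and `fingerprint` alongside each generated caption makes it possible to tell later which captions came from an outdated prompt and need regenerating.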
Implementation
python
import base64

from openai import OpenAI

client = OpenAI()

ALT_TEXT_PROMPT = '''Write alt text for this image for screen-reader users.
Rules: ≤125 characters; describe the essential visual content factually;
no "image of"/"picture of"; include any text visible in the image verbatim;
if the image is decorative with no informational content, reply exactly: DECORATIVE.'''


def alt_text(image_bytes: bytes) -> str:
    # Inline the image as a base64 data URL (the MIME type assumes JPEG bytes).
    data_url = f'data:image/jpeg;base64,{base64.b64encode(image_bytes).decode()}'
    resp = client.chat.completions.create(
        model='gpt-4o-mini',  # captioning is mini/flash-tier work
        messages=[{'role': 'user', 'content': [
            {'type': 'image_url', 'image_url': {'url': data_url}},
            {'type': 'text', 'text': ALT_TEXT_PROMPT},
        ]}],
        max_tokens=100,
    )
    return resp.choices[0].message.content.strip()
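The prompt's rules (the 125-character limit, no "image of" prefix, the DECORATIVE sentinel) are only requests to the model, so it may help to check the reply before publishing it. The following is a minimal sketch of such a post-processing step; `normalize_alt_text` and the problem labels are hypothetical names, and treating "photo of" as a filler prefix is an added assumption beyond the two phrases the prompt lists.

```python
import re

MAX_ALT_CHARS = 125                 # mirrors the limit stated in the prompt
DECORATIVE_SENTINEL = "DECORATIVE"  # mirrors the sentinel stated in the prompt

# Matches prefixes such as "Image of ", "A picture of ", "An image of ".
_FILLER_PREFIX = re.compile(r"^(an?\s+)?(image|picture|photo)\s+of\s+", re.IGNORECASE)


def normalize_alt_text(raw: str) -> tuple[str, list[str]]:
    """Return (alt_text, problems).

    An empty alt_text with no problems means the image was judged decorative,
    which maps to alt="" in HTML.
    """
    text = raw.strip().strip('"').strip()
    if text == DECORATIVE_SENTINEL:
        return "", []

    problems = []
    if not text:
        problems.append("empty")
    if _FILLER_PREFIX.match(text):
        # Drop the filler prefix and re-capitalize the first letter.
        text = _FILLER_PREFIX.sub("", text, count=1)
        text = text[:1].upper() + text[1:]
        problems.append("stripped_filler_prefix")
    if len(text) > MAX_ALT_CHARS:
        problems.append("too_long")
    return text, problems
```

A caller could publish results with an empty problem list directly and route anything flagged `too_long` or `empty` to a retry or a human review queue.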
Production notes:
Give the model an explicit way out (DECORATIVE above, or "uncertain": true in structured variants): models forced to always caption will confabulate on ambiguous images.
Captions as search infrastructure
The highest-leverage captioning use is making image libraries searchable: dense caption → embed the caption → vector store → text queries find images (pgvector setup). Two architecture notes:
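The caption → embed → store → query flow described above can be sketched end to end in a few lines. This is a toy under loud assumptions: `toy_embed` is a normalized bag-of-words stand-in for a real text-embedding model, and the in-memory `CaptionIndex` stands in for a vector store such as pgvector; both names and the sample captions are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Stand-in for a real embedding model: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


class CaptionIndex:
    """In-memory stand-in for a vector store keyed by image ID."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, dict[str, float]]] = []

    def add(self, image_id: str, caption: str) -> None:
        # Embed the caption once at ingest time and keep it next to the ID.
        self._rows.append((image_id, caption, toy_embed(caption)))

    def search(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        # Embed the text query the same way, then rank images by similarity.
        q = toy_embed(query)
        scored = [(cosine(q, vec), image_id) for image_id, _, vec in self._rows]
        scored.sort(reverse=True)
        return [(image_id, score) for score, image_id in scored[:k]]


index = CaptionIndex()
index.add("img_mug", "red ceramic coffee mug, speckled glaze, wooden desk, laptop, "
                     "office setting, warm lighting, product photography")
index.add("img_dog", "golden retriever running on a beach, sunset, outdoor, motion blur")
top_hit = index.search("red mug", k=1)[0][0]
```

In a real deployment the dense search/index caption style from earlier would feed an actual embedding model, and the query would be embedded with that same model so both live in one vector space.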
Quality control
FAQ
Open models for captioning? Qwen-VL/LLaVA-class models handle straightforward captioning well self-hosted — the right call at large steady volume or strict privacy; verify on your golden set.
Video? Frame-sample + caption + temporal merge works for indexing; true video understanding (actions over time) wants video-native models — different cost class.
Multilingual alt-text? Generate in the page's language directly (don't caption-then-translate) — one step, better fluency.
*Last updated: June 2026.*