OpenAI GPT-4o API Tutorial 2026: Vision, Audio, and Real-Time Capabilities
Master GPT-4o's multimodal features including image analysis, audio transcription, and the new real-time streaming API for interactive applications
Complete guide to OpenAI's GPT-4o API covering multimodal inputs, real-time audio streaming, function calling, and building production apps. Includes code examples for vision analysis, speech-to-text integration, and cost optimization strategies.
What Makes GPT-4o Different in 2026
GPT-4o ("omni") represents a fundamental shift in how we interact with AI models. Unlike earlier models, which handled text, images, and audio through separate pipelines, GPT-4o processes all modalities natively, which yields lower latency and more natural multimodal understanding.
In 2026, GPT-4o runs behind a large number of production applications. This tutorial walks through the main API features: chat completions, vision, audio transcription, text-to-speech, function calling, structured output, and the Batch API.
Quick Start
```python
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Basic text completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's new in GPT-4o?"},
    ],
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```
Vision: Analyzing Images
GPT-4o can analyze images from URLs or base64-encoded data:
```python
# Analyzing a URL-based image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/chart.png",
                        "detail": "high",  # "low", "high", or "auto"
                    },
                },
                {
                    "type": "text",
                    "text": "Analyze this sales chart and identify the key trends",
                },
            ],
        }
    ],
    max_tokens=1000,
)
```
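Because the same `image_url` content part carries either a remote URL or a base64 data URL, a small helper can build it from either source. The following is a minimal sketch, not part of the OpenAI SDK: `build_image_part` and `SUPPORTED_IMAGE_TYPES` are hypothetical names, and the list of accepted formats is an assumption to verify against the current documentation.

```python
import base64
import mimetypes
from pathlib import Path

# Assumed set of accepted image formats; confirm against the current API docs.
SUPPORTED_IMAGE_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


def build_image_part(source: str, detail: str = "auto") -> dict:
    """Return an image_url content part for a remote URL or a local file path."""
    if source.startswith(("http://", "https://")):
        url = source  # Remote images are passed through unchanged
    else:
        media_type, _ = mimetypes.guess_type(source)
        if media_type not in SUPPORTED_IMAGE_TYPES:
            raise ValueError(f"Unsupported image type for {source!r}: {media_type}")
        # Local files are embedded as a base64 data URL
        encoded = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        url = f"data:{media_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


print(build_image_part("https://example.com/chart.png", detail="high"))
```

The returned dictionary can be placed directly in the `content` list of a user message, next to a `{"type": "text", ...}` part.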
Practical Vision Use Case: Invoice Processing
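```python
import base64
import json
from pathlib import Path

# Image formats accepted for vision input. PDFs cannot be sent as image_url
# parts, so convert PDF pages to images before calling this function.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def extract_invoice_data(image_path: str) -> dict:
    ext = Path(image_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image format: {ext}")
    media_type = MEDIA_TYPES[ext]

    # Encode the file as base64 so it can be embedded in a data URL
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{media_type};base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": """Extract invoice data as JSON:
- invoice_number
- date
- vendor_name
- total_amount
- line_items (array)
Return ONLY valid JSON.""",
                },
            ],
        }],
        # JSON mode guarantees syntactically valid JSON, not a particular schema
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```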
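JSON mode ensures the reply parses, but it does not guarantee that every requested field is present. A light validation pass on the returned dictionary can catch gaps before the data reaches downstream systems. This is a minimal sketch; `find_invoice_problems` is a hypothetical helper, and the field list simply mirrors the prompt above.

```python
REQUIRED_FIELDS = ("invoice_number", "date", "vendor_name", "total_amount", "line_items")


def find_invoice_problems(data: dict) -> list:
    """Return human-readable problems found in an extracted invoice dict."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if "line_items" in data and not isinstance(data["line_items"], list):
        problems.append("line_items is not an array")
    return problems


# Illustrative data only; a real call would pass extract_invoice_data(...) output
sample = {
    "invoice_number": "INV-001",
    "date": "2026-01-15",
    "vendor_name": "Acme",
    "line_items": [],
}
print(find_invoice_problems(sample))  # ['missing field: total_amount']
```

An empty list means the structure looks complete. The values themselves still deserve a spot check against the source image.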
Audio Transcription with Whisper
```python
# Assumes `client = OpenAI()` from the Quick Start

# Transcribe audio file
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # Includes timestamps
        timestamp_granularities=["segment"],
    )

print(transcript.text)

# Process segments with timestamps
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
```
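Segment timestamps are also enough to produce subtitle files. The sketch below converts `(start, end, text)` tuples into SubRip (SRT) text. `format_srt_time` and `segments_to_srt` are hypothetical helpers, and the tuples are assumed to come from something like `[(s.start, s.end, s.text) for s in transcript.segments]`.

```python
def format_srt_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp layout SRT expects."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from an iterable of (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


# Illustrative segments; real values come from the transcription response
print(segments_to_srt([(0.0, 2.5, " Hello everyone."), (2.5, 4.0, "Let's begin.")]))
```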
Text-to-Speech
```python
from pathlib import Path

response = client.audio.speech.create(
    model="tts-1-hd",  # or "tts-1" for faster/cheaper
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to our AI-powered service. How can I help you today?",
    speed=1.0,
)

# Save the returned audio bytes to disk
Path("welcome.mp3").write_bytes(response.content)
```
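Longer scripts may need to be split across several speech requests. The sketch below breaks text at sentence boundaries so each chunk stays under a size limit. `split_for_tts` is a hypothetical helper, and the 4096-character default is an assumption about the per-request input limit that should be checked against the current API reference.

```python
import re

MAX_TTS_CHARS = 4096  # Assumed per-request input limit; verify before relying on it


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring sentence breaks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent through `client.audio.speech.create` in turn and the resulting audio files concatenated.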
Function Calling (Tool Use)
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "food"]},
                    "max_price": {"type": "number"},
                },
                "required": ["query"],
            },
        },
    }
]


def run_conversation(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    if response.choices[0].message.tool_calls:
        # Execute the tool call (only the first one is handled here)
        tool_call = response.choices[0].message.tool_calls[0]
        args = json.loads(tool_call.function.arguments)

        # Simulate tool execution; a real implementation would use `args`
        tool_result = {"products": [{"name": "Product A", "price": 29.99}]}

        # Continue conversation with tool result
        messages.extend([
            response.choices[0].message,
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(tool_result),
            },
        ])
        final_response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        return final_response.choices[0].message.content

    return response.choices[0].message.content
```
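The example above simulates the tool result and handles only the first tool call, although a single response can contain several. One way to handle them all is a registry that maps tool names to Python functions and builds one `tool` message per call. This is a minimal sketch: `TOOL_REGISTRY`, `execute_tool_calls`, and the in-memory catalog are hypothetical, and `SimpleNamespace` stands in for the SDK's tool call object so the snippet runs offline.

```python
import json
from types import SimpleNamespace
from typing import Optional


def search_products(query: str, category: Optional[str] = None,
                    max_price: Optional[float] = None) -> dict:
    """Hypothetical stand-in for a real catalog lookup."""
    catalog = [
        {"name": "Product A", "price": 29.99, "category": "electronics"},
        {"name": "Product B", "price": 89.00, "category": "electronics"},
    ]
    matches = [
        item for item in catalog
        if (category is None or item["category"] == category)
        and (max_price is None or item["price"] <= max_price)
    ]
    return {"query": query, "products": matches}


TOOL_REGISTRY = {"search_products": search_products}


def execute_tool_calls(tool_calls) -> list:
    """Run every requested tool and return one "tool" message per call."""
    results = []
    for call in tool_calls:
        handler = TOOL_REGISTRY.get(call.function.name)
        if handler is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            output = handler(**json.loads(call.function.arguments))
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output),
        })
    return results


# Fake tool call with the same attribute layout the SDK objects expose
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(
        name="search_products",
        arguments='{"query": "headphones", "max_price": 50}',
    ),
)
print(execute_tool_calls([fake_call]))
```

In `run_conversation`, the returned messages would be appended after the assistant message in place of the single hand-built `tool` entry.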
Structured Output with Pydantic
```python
from typing import List

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ProductReview(BaseModel):
    sentiment: str
    score: int
    pros: List[str]
    cons: List[str]
    summary: str


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze product reviews."},
        {"role": "user", "content": "The battery life is amazing, lasts 3 days! Camera is decent but software is buggy."},
    ],
    response_format=ProductReview,
)

# `parsed` is a ProductReview instance built from the model's JSON reply
review = completion.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}/10")
print(f"Pros: {', '.join(review.pros)}")
```
Batch API for Cost Savings
For workloads that do not need an immediate response, the Batch API offers a 50% cost reduction in exchange for asynchronous completion within a 24-hour window:
```python
import json

# Placeholder input; replace with your own documents
texts_to_summarize = ["First document text...", "Second document text..."]

# Create batch requests, one per document
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 500,
        },
    }
    for i, text in enumerate(texts_to_summarize)
]

# Write to JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
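Once the batch finishes, the results arrive as a JSONL file in which each line carries the `custom_id` from the request, and the lines are not guaranteed to be in input order. The sketch below maps each `custom_id` to its summary. `parse_batch_output` is a hypothetical helper, and the record layout shown is an assumption based on the documented batch output format. The text would typically come from something like `client.files.content(batch.output_file_id).text` after `client.batches.retrieve(batch.id)` reports a completed status.

```python
import json


def parse_batch_output(jsonl_text: str):
    """Split batch output lines into ({custom_id: content}, {custom_id: error})."""
    completed, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed[custom_id] = record.get("error") or response.get("body")
            continue
        completed[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return completed, failed


# Illustrative output lines, not real API results
sample_output = "\n".join([
    json.dumps({"custom_id": "request-1", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "Summary B"}}]}}}),
    json.dumps({"custom_id": "request-0", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "Summary A"}}]}}}),
    json.dumps({"custom_id": "request-2", "response": None,
                "error": {"code": "example_error", "message": "illustrative failure"}}),
])

done, errors = parse_batch_output(sample_output)
print(done)
print(errors)
```

Keying results by `custom_id` rather than by line position keeps each summary matched to its source text.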
Cost Optimization Tips
Conclusion
GPT-4o's multimodal API makes it possible to build applications that were science fiction just a few years ago. From real-time invoice processing to voice-enabled assistants, the primitives are all here. Start with the basic chat API, layer in vision as needed, and graduate to function calling for agentic workflows.