OpenAI GPT-4o API Tutorial 2026: Vision, Audio, and Real-Time Capabilities
Master GPT-4o's multimodal features including image analysis, audio transcription, and the new real-time streaming API for interactive applications
What Makes GPT-4o Different in 2026
GPT-4o ("omni") represents a fundamental shift in how we interact with AI models. Unlike earlier setups that routed text, images, and audio through separate pipelines, GPT-4o processes all modalities natively, which lowers latency and yields more natural multimodal understanding.
In 2026, GPT-4o has become the backbone of thousands of production applications. This tutorial covers chat completions, vision, audio transcription, text-to-speech, function calling, structured output, and the Batch API.
Quick Start
python
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Basic text completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's new in GPT-4o?"}
    ]
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
Vision: Analyzing Images
GPT-4o can analyze images from URLs or base64-encoded data:
python
# Analyzing a URL-based image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/chart.png",
                        "detail": "high"  # "low", "high", or "auto"
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this sales chart and identify the key trends"
                }
            ]
        }
    ],
    max_tokens=1000
)
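The `detail` setting also affects cost. As a rough sketch, assuming the tile-based accounting OpenAI has documented for GPT-4o (a flat 85 tokens at low detail; at high detail the image is scaled to fit within 2048 x 2048, then shrunk so its shortest side is at most 768 px, and each 512 px tile costs 170 tokens on top of a base of 85), the hypothetical helper `estimate_image_tokens` below approximates the input tokens for one image. The constants are assumptions that may have changed, so check the current pricing documentation before relying on them.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image input tokens under the assumed tile-based accounting."""
    if detail == "low":
        return 85  # assumed flat cost, regardless of image size

    # Step 1: fit within a 2048 x 2048 square, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512 px tiles and add the assumed base cost.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, detail="high"))
print(estimate_image_tokens(1024, 1024, detail="low"))
```

Under these assumptions, switching a chart from `"high"` to `"low"` trades fine detail for a much smaller token bill, which is worth testing when the image contains only large text or simple shapes.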
Practical Vision Use Case: Invoice Processing
python
import base64
import json
from pathlib import Path

def extract_invoice_data(image_path: str) -> dict:
    # Read the image and encode it as base64 for a data URL
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    ext = Path(image_path).suffix.lower()
    # image_url accepts image formats only; convert PDFs to PNG or JPEG first
    media_type = {'.jpg': 'image/jpeg', '.jpeg': 'image/jpeg', '.png': 'image/png', '.webp': 'image/webp'}.get(ext, 'image/jpeg')
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{media_type};base64,{image_data}"}
                },
                {
                    "type": "text",
                    "text": """Extract invoice data as JSON:
- invoice_number
- date
- vendor_name
- total_amount
- line_items (array)
Return ONLY valid JSON."""
                }
            ]
        }],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
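JSON mode ensures the reply parses as JSON, but it does not guarantee that every requested key is present. A minimal sketch of a post-check follows; `validate_invoice` and `REQUIRED_FIELDS` are hypothetical names, and the field list simply mirrors the prompt above.

```python
REQUIRED_FIELDS = ("invoice_number", "date", "vendor_name", "total_amount", "line_items")


def validate_invoice(data: dict) -> list:
    """Return a list of problems; an empty list means the payload looks usable."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    # line_items was requested as an array, so reject any other type
    if "line_items" in data and not isinstance(data["line_items"], list):
        problems.append("line_items is not an array")
    return problems


sample = {"invoice_number": "INV-001", "date": "2026-01-15", "line_items": "n/a"}
for problem in validate_invoice(sample):
    print(problem)
```

Running a check like this before writing to a database makes it easy to route incomplete extractions to a retry or a human reviewer.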
Audio Transcription with Whisper
python
# Transcribe an audio file (reuses the client from the Quick Start)
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # Includes timestamps
        timestamp_granularities=["segment"]
    )

print(transcript.text)

# Process segments with timestamps
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
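The segment timestamps can feed a subtitle file. The sketch below converts segments into SubRip (SRT) text; `srt_timestamp` and `segments_to_srt` are hypothetical helpers, and the only assumption about each segment is that it exposes `start`, `end`, and `text` attributes as in the loop above. The demo uses stand-in objects rather than a real API response.

```python
from types import SimpleNamespace


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects that have start, end, and text attributes."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


# Stand-in segments for demonstration; in practice pass transcript.segments
demo = [
    SimpleNamespace(start=0.0, end=2.5, text=" Welcome, everyone."),
    SimpleNamespace(start=2.5, end=6.0, text=" Let's review the agenda."),
]
print(segments_to_srt(demo))
```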
Text-to-Speech
python
from pathlib import Path

response = client.audio.speech.create(
    model="tts-1-hd",  # or "tts-1" for faster/cheaper
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to our AI-powered service. How can I help you today?",
    speed=1.0
)
Path("welcome.mp3").write_bytes(response.content)
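Longer scripts may need to be split before synthesis. The sketch below assumes a per-request input limit of 4096 characters (verify this against the current speech API reference) and breaks text at sentence boundaries; `chunk_text` is a hypothetical helper, not part of the SDK.

```python
import re


def chunk_text(text: str, max_chars: int = 4096) -> list:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


# Small limit used here only to make the behavior visible
print(chunk_text("One. Two. Three.", max_chars=10))
```

Each chunk can then be sent through `client.audio.speech.create` in turn and the resulting audio files concatenated.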
Function Calling (Tool Use)
python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "food"]},
                    "max_price": {"type": "number"}
                },
                "required": ["query"]
            }
        }
    }
]

def run_conversation(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    if response.choices[0].message.tool_calls:
        # Execute the tool call
        tool_call = response.choices[0].message.tool_calls[0]
        args = json.loads(tool_call.function.arguments)
        # Simulate tool execution
        tool_result = {"products": [{"name": "Product A", "price": 29.99}]}
        # Continue conversation with tool result
        messages.extend([
            response.choices[0].message,
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(tool_result)
            }
        ])
        final_response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        return final_response.choices[0].message.content
    return response.choices[0].message.content
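The example above handles only the first tool call and simulates its result. The model can request several tool calls in one response, and each one needs its own `tool` message with a matching `tool_call_id`. A minimal sketch of a dispatcher follows; `TOOL_REGISTRY`, `execute_tool_calls`, and the in-memory catalog inside `search_products` are hypothetical stand-ins for real application code.

```python
import json
from typing import Optional


def search_products(query: str, category: Optional[str] = None,
                    max_price: Optional[float] = None) -> dict:
    """Hypothetical stand-in for a real catalog lookup."""
    catalog = [
        {"name": "Product A", "price": 29.99, "category": "electronics"},
        {"name": "Product B", "price": 99.00, "category": "electronics"},
    ]
    hits = [
        p for p in catalog
        if (category is None or p["category"] == category)
        and (max_price is None or p["price"] <= max_price)
    ]
    return {"query": query, "products": hits}


# Maps the tool names declared in `tools` to local Python callables
TOOL_REGISTRY = {"search_products": search_products}


def execute_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list:
    """Run every requested tool call and build the matching tool messages."""
    messages = []
    for call in tool_calls:
        handler = registry.get(call.function.name)
        if handler is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            result = handler(**json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages
```

Inside `run_conversation`, the returned list could replace the single hand-built `tool` message: append the assistant message first, then extend `messages` with `execute_tool_calls(response.choices[0].message.tool_calls)`.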
Structured Output with Pydantic
python
from pydantic import BaseModel
from typing import List
from openai import OpenAI

client = OpenAI()

class ProductReview(BaseModel):
    sentiment: str
    score: int
    pros: List[str]
    cons: List[str]
    summary: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze product reviews."},
        {"role": "user", "content": "The battery life is amazing, lasts 3 days! Camera is decent but software is buggy."}
    ],
    response_format=ProductReview
)

review = completion.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}/10")
print(f"Pros: {', '.join(review.pros)}")
Batch API for Cost Savings
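For workloads that do not need an immediate response, the Batch API offers a 50% cost reduction in exchange for asynchronous processing within a 24-hour completion window: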
python
import json

# Placeholder input; replace with your own documents
texts_to_summarize = ["First document text", "Second document text"]

# Create batch file
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 500
        }
    }
    for i, text in enumerate(texts_to_summarize)
]

# Write to JSONL file
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch ID: {batch.id}")
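Once the batch finishes, the results arrive as another JSONL file, and the lines are not guaranteed to follow input order, so `custom_id` is the key for matching them up. The sketch below parses such a file; `parse_batch_output` is a hypothetical helper, and the record layout (`custom_id`, `response.status_code`, `response.body`, `error`) is an assumption based on the documented batch output format that should be checked against a real output file.

```python
import json


def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to its completion text, or None if the request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


# Typical retrieval, shown as comments because it needs a completed batch:
#   batch = client.batches.retrieve(batch.id)
#   if batch.status == "completed":
#       output_text = client.files.content(batch.output_file_id).text
#       summaries = parse_batch_output(output_text)
```

Failed requests map to `None` here so the caller can resubmit only those `custom_id` values in a follow-up batch.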
Cost Optimization Tips
Conclusion
GPT-4o's multimodal API makes it possible to build applications that were science fiction just a few years ago. From real-time invoice processing to voice-enabled assistants, the primitives are all here. Start with the basic chat API, layer in vision as needed, and graduate to function calling for agentic workflows.