OpenAI GPT-4o API Tutorial 2026: Vision, Audio, and Real-Time Capabilities

Master GPT-4o's multimodal features including image analysis, audio transcription, and the new real-time streaming API for interactive applications

Advanced · 30 minutes

Complete guide to OpenAI's GPT-4o API covering multimodal inputs, real-time audio streaming, function calling, and building production apps. Includes code examples for vision analysis, speech-to-text integration, and cost optimization strategies.

What Makes GPT-4o Different in 2026

GPT-4o ("omni") handles text, images, and audio in a single model. Earlier versions routed these modalities through separate pipelines; processing them natively gives lower latency and more natural multimodal understanding.

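In 2026, GPT-4o runs behind a large number of production applications. This tutorial covers the parts of the API those applications rely on most: chat completions, vision, audio transcription and speech synthesis, function calling, structured outputs, and the Batch API.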

Quick Start

python
from openai import OpenAI

client = OpenAI() # Uses OPENAI_API_KEY env var

# Basic text completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's new in GPT-4o?"},
    ],
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Vision: Analyzing Images

GPT-4o can analyze images from URLs or base64-encoded data:

python

# Analyzing a URL-based image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/chart.png",
                        "detail": "high",  # "low", "high", or "auto"
                    },
                },
                {
                    "type": "text",
                    "text": "Analyze this sales chart and identify the key trends",
                },
            ],
        }
    ],
    max_tokens=1000,
)
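The `detail` setting mainly affects how many input tokens an image consumes. The sketch below approximates that cost, assuming the tile-based accounting OpenAI has published for GPT-4o: a flat 85 tokens at low detail, and at high detail 85 base tokens plus 170 tokens per 512-pixel tile after the image is scaled down. The helper name `estimate_image_tokens`, the constants, and the choice never to upscale small images are assumptions to check against the current pricing documentation; `"auto"` is not modeled.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate input tokens for one image (assumed tile-based accounting)."""
    if detail == "low":
        return 85  # assumed flat cost, independent of image size
    if detail != "high":
        raise ValueError("only 'low' and 'high' are modeled in this sketch")

    # Step 1: shrink the image to fit inside a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink again so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512 px tiles needed to cover the scaled image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(1024, 1024, "low"))   # 85
```

Under these assumptions, a chart that only needs a rough description costs far fewer tokens at `"low"`, while small text or dense tables usually justify `"high"`.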

Practical Vision Use Case: Invoice Processing

python
import base64
import json
from pathlib import Path

# Image types accepted as image input. PDFs are not images here:
# render each page to PNG or JPEG before calling this function.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}

def extract_invoice_data(image_path: str) -> dict:
    ext = Path(image_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image type: {ext}")

    # Encode the file as base64 so it can be sent inline as a data URL
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{MEDIA_TYPES[ext]};base64,{image_data}"},
                },
                {
                    "type": "text",
                    "text": (
                        "Extract invoice data as JSON:\n"
                        "- invoice_number\n"
                        "- date\n"
                        "- vendor_name\n"
                        "- total_amount\n"
                        "- line_items (array)\n"
                        "Return ONLY valid JSON."
                    ),
                },
            ],
        }],
        response_format={"type": "json_object"},  # JSON mode: output parses as JSON
    )
    return json.loads(response.choices[0].message.content)
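JSON mode makes the reply parse as JSON, but it does not guarantee that every requested key is present or has a usable type. A minimal sketch of a post-extraction check follows; the helper name `find_invoice_problems` and the expected types are assumptions for illustration, not part of the OpenAI API.

```python
# Expected fields from the extraction prompt, with the types this sketch accepts.
# The type choices are assumptions; adjust them to your own invoice schema.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "date": str,
    "vendor_name": str,
    "total_amount": (int, float, str),
    "line_items": list,
}


def find_invoice_problems(data: dict) -> list:
    """Return human-readable problems; an empty list means the data looks usable."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data or data[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            problems.append(
                f"unexpected type for {field}: {type(data[field]).__name__}"
            )
    return problems


sample = {"invoice_number": "INV-1", "date": "2026-01-05", "vendor_name": "Acme"}
print(find_invoice_problems(sample))
```

Running the check on the result of `extract_invoice_data` before writing to a database gives a natural point to retry the request or route the invoice to manual review.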

Audio Transcription with Whisper

python
# Reuses the `client` created in Quick Start

# Transcribe audio file
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # Includes timestamps
        timestamp_granularities=["segment"],
    )

print(transcript.text)

# Process segments with timestamps
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
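Segment timestamps are enough to build subtitles from the same response. The sketch below converts segments into SRT text; `Segment` is a hypothetical stand-in for the objects in `transcript.segments`, and only their `start`, `end`, and `text` attributes are assumed. The transcription endpoint can also return SRT directly through `response_format="srt"`; a helper like this is only useful when you want the verbose JSON and subtitles from a single call.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a transcript segment (start/end in seconds)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp layout used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text: a numbered cue, a time range, then the cue text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{time_range}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


print(segments_to_srt([Segment(0.0, 2.5, " Hello "), Segment(2.5, 4.0, "World")]))
```

With a real response, `segments_to_srt(transcript.segments)` should work unchanged as long as the segment objects expose the same three attributes.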

Text-to-Speech

python
from pathlib import Path

response = client.audio.speech.create(
    model="tts-1-hd",  # or "tts-1" for faster/cheaper
    voice="alloy",     # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to our AI-powered service. How can I help you today?",
    speed=1.0,
)

Path("welcome.mp3").write_bytes(response.content)
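Longer scripts need to be split before synthesis, because the speech endpoint caps the length of `input`. The sketch below breaks text at sentence boundaries; the helper name `split_for_tts` is hypothetical, and the 4096-character default is an assumption based on the documented limit at the time of writing, so confirm it before relying on it.

```python
import re


def split_for_tts(text: str, max_chars: int = 4096) -> list:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate  # sentence still fits in the open chunk
        else:
            chunks.append(current)  # close the chunk and start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("One. Two. Three.", max_chars=10))  # ['One. Two.', 'Three.']
```

Each chunk can then be sent through `client.audio.speech.create` in turn and the resulting audio files concatenated or played back in order.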

Function Calling (Tool Use)

python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "food"]},
                    "max_price": {"type": "number"},
                },
                "required": ["query"],
            },
        },
    }
]

def run_conversation(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content

    # The assistant message that requested the tools must precede the results
    messages.append(message)

    # The model may request several calls; each tool_call_id needs a reply
    for tool_call in message.tool_calls:
        args = json.loads(tool_call.function.arguments)
        # Simulate tool execution; a real app would query the catalog with `args`
        tool_result = {"products": [{"name": "Product A", "price": 29.99}]}
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(tool_result),
        })

    # Continue the conversation with the tool results
    final_response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return final_response.choices[0].message.content
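The simulated tool result above can be replaced with a small dispatcher that maps tool names to Python functions. This is a sketch under stated assumptions: `search_products` here is a hypothetical in-memory stand-in for a real catalog, and `TOOL_REGISTRY` and `execute_tool` are illustrative names, not part of the OpenAI SDK.

```python
import json
from typing import Optional


def search_products(query: str, category: Optional[str] = None,
                    max_price: Optional[float] = None) -> dict:
    """Hypothetical catalog lookup; filters a tiny in-memory list.

    `query` is echoed back but not used for matching in this stand-in.
    """
    catalog = [
        {"name": "Product A", "price": 29.99, "category": "electronics"},
        {"name": "Product B", "price": 89.00, "category": "electronics"},
    ]
    matches = [
        item for item in catalog
        if (category is None or item["category"] == category)
        and (max_price is None or item["price"] <= max_price)
    ]
    return {"query": query, "products": matches}


# Maps the tool names declared in `tools` to the functions that implement them.
TOOL_REGISTRY = {"search_products": search_products}


def execute_tool(name: str, arguments_json: str) -> str:
    """Run one tool call and return a JSON string for the `tool` message."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or unexpected arguments: report back instead of crashing.
        return json.dumps({"error": str(exc)})


print(execute_tool("search_products", '{"query": "phone", "max_price": 50}'))
```

Inside a tool-handling loop, the `content` of each tool message would then be `execute_tool(tool_call.function.name, tool_call.function.arguments)`. Returning errors as JSON rather than raising lets the model see the failure and try again with corrected arguments.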

Structured Output with Pydantic

python
from pydantic import BaseModel
from typing import List
from openai import OpenAI

client = OpenAI()

class ProductReview(BaseModel):
    sentiment: str
    score: int
    pros: List[str]
    cons: List[str]
    summary: str

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze product reviews."},
        {"role": "user", "content": "The battery life is amazing, lasts 3 days! Camera is decent but software is buggy."},
    ],
    response_format=ProductReview,
)

review = completion.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}/10")
print(f"Pros: {', '.join(review.pros)}")

Batch API for Cost Savings

For workloads that do not need an immediate response, the Batch API offers a 50% cost reduction in exchange for asynchronous completion within the chosen window:

python
import json

# Placeholder inputs; replace with your own documents
texts_to_summarize = ["First document text...", "Second document text..."]

# Create batch requests, one per document
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 500,
        },
    }
    for i, text in enumerate(texts_to_summarize)
]

# Write to JSONL file (one request per line)
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
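Once the batch finishes, the results arrive as another JSONL file whose lines are not guaranteed to follow the input order, which is why each request carries a `custom_id`. The sketch below maps each `custom_id` to its summary text. The helper name `parse_batch_output` is hypothetical, and the record layout (`custom_id`, `response.status_code`, `response.body`, `error`) is assumed from the documented Batch output format.

```python
import json
from typing import Dict, Optional


def parse_batch_output(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map custom_id to the completion text, or None for failed requests."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        # The body is assumed to be a regular chat completion payload.
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


sample_line = json.dumps({
    "custom_id": "request-0",
    "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "A short summary."}}]},
    },
    "error": None,
})
print(parse_batch_output(sample_line))
```

In practice the input text would come from the batch's output file: poll `client.batches.retrieve(batch.id)` until its `status` is `"completed"`, then read `client.files.content(batch.output_file_id).text`.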

Cost Optimization Tips

| Strategy | Savings |
| --- | --- |
| Use gpt-4o-mini for simple tasks | Roughly 95% cheaper than gpt-4o |
| Batch API for async processing | 50% cheaper |
| Prompt caching (prompts of 1,024+ tokens) | 50% on cached input tokens |
| Reduce max_tokens | Direct cost reduction (caps billed output tokens) |
| Use structured outputs | Fewer retries |
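The arithmetic behind the caching and batch discounts can be sketched as follows. Prices are passed in as parameters because they change over time; the figures in the usage example are placeholders, not current list prices, and `estimate_cost` is a hypothetical helper. The source does not say whether the two discounts combine, so this sketch refuses to apply both at once.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    cached_input_tokens: int = 0,
    batch: bool = False,
) -> float:
    """Estimate request cost in dollars from per-million-token prices."""
    if batch and cached_input_tokens:
        raise ValueError("combining batch and caching discounts is not modeled")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed input tokens")

    uncached = input_tokens - cached_input_tokens
    cost = (
        uncached * input_price_per_m
        + cached_input_tokens * input_price_per_m * 0.5  # 50% off cached input
        + output_tokens * output_price_per_m
    ) / 1_000_000
    return cost * 0.5 if batch else cost  # Batch API: 50% off the whole request


# Placeholder prices: $2.00 per million input tokens, $8.00 per million output tokens
print(estimate_cost(1_000_000, 500_000, 2.0, 8.0))              # 6.0
print(estimate_cost(1_000_000, 500_000, 2.0, 8.0, batch=True))  # 3.0
```

Plugging in real token counts from `response.usage` and the current price sheet shows quickly which strategy in the table matters most for a given workload.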

Conclusion

GPT-4o's multimodal API makes it possible to build applications that were science fiction just a few years ago. From real-time invoice processing to voice-enabled assistants, the primitives are all here. Start with the basic chat API, layer in vision as needed, and graduate to function calling for agentic workflows.
