OpenAI GPT-4o API Tutorial 2026: Vision, Audio, and Real-Time Capabilities

Master GPT-4o's multimodal features including image analysis, audio transcription, and the new real-time streaming API for interactive applications

By AI Skill Navigation Editorial Team. Published May 28, 2026

What Makes GPT-4o Different in 2026

GPT-4o ("omni") represents a fundamental shift in how we interact with AI models. Earlier setups handled text, images, and audio through separate pipelines. GPT-4o processes all modalities natively, which lowers latency and produces more natural multimodal understanding.

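In 2026, GPT-4o has become the backbone of thousands of production applications. This tutorial walks through the parts of the API you are most likely to use: chat completions, vision, audio transcription, text-to-speech, function calling, structured outputs, and the Batch API.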

Quick Start

python
from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Basic text completion
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's new in GPT-4o?"}
    ]
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Vision: Analyzing Images

GPT-4o can analyze images from URLs or base64-encoded data:

python
# Analyze a URL-based image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/chart.png",
                        "detail": "high"  # "low", "high", or "auto"
                    }
                },
                {
                    "type": "text",
                    "text": "Analyze this sales chart and identify the key trends"
                }
            ]
        }
    ],
    max_tokens=1000
)
print(response.choices[0].message.content)
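
Since images can come from either a URL or base64-encoded data, a small helper can hide that difference from the calling code. The sketch below is one possible way to do it. The names `image_part` and `vision_message` are hypothetical, and guessing the media type from the file extension is an assumption that will not suit every pipeline.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(source: str, detail: str = "auto") -> dict:
    """Build an image_url content part from an http(s) URL or a local file path."""
    if detail not in ("low", "high", "auto"):
        raise ValueError(f"detail must be 'low', 'high', or 'auto', got {detail!r}")
    if source.startswith(("http://", "https://")):
        url = source
    else:
        # Local file: inline it as a base64 data URL
        media_type, _ = mimetypes.guess_type(source)
        if media_type is None or not media_type.startswith("image/"):
            raise ValueError(f"Not a recognized image file: {source}")
        encoded = base64.b64encode(Path(source).read_bytes()).decode()
        url = f"data:{media_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url, "detail": detail}}


def vision_message(prompt: str, *sources: str, detail: str = "auto") -> dict:
    """Build one user message holding the images followed by the text prompt."""
    parts = [image_part(source, detail) for source in sources]
    parts.append({"type": "text", "text": prompt})
    return {"role": "user", "content": parts}
```

The returned dictionary can be passed directly in the `messages` list, for example `messages=[vision_message("Compare these charts", "q1.png", "https://example.com/q2.png")]`.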

Practical Vision Use Case: Invoice Processing

python
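import base64
import json
from pathlib import Path

# Image formats accepted as image_url input. PDFs are not images, so convert
# each page to an image (for example, PNG) before calling this function.
MEDIA_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}


def extract_invoice_data(image_path: str) -> dict:
    ext = Path(image_path).suffix.lower()
    if ext not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image type: {ext}")
    media_type = MEDIA_TYPES[ext]

    # Encode the file as base64 so it can be sent inline as a data URL
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{media_type};base64,{image_data}"}
                },
                {
                    "type": "text",
                    "text": """Extract invoice data as JSON:
                    - invoice_number
                    - date
                    - vendor_name
                    - total_amount
                    - line_items (array)
                    Return ONLY valid JSON."""
                }
            ]
        }],
        # JSON mode: the prompt itself must mention JSON for this to be accepted
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)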
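
JSON mode guarantees syntactically valid JSON, but not that every requested field is present or correctly typed. A minimal validation sketch follows, assuming the five field names from the prompt above. The function name `validate_invoice` is hypothetical, and the amount cleanup assumes US-style number formatting such as "$1,234.50".

```python
REQUIRED_FIELDS = ("invoice_number", "date", "vendor_name", "total_amount", "line_items")


def validate_invoice(data: dict) -> dict:
    """Check the fields requested in the prompt and normalize total_amount to float."""
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"Missing invoice fields: {', '.join(missing)}")
    if not isinstance(data["line_items"], list):
        raise ValueError("line_items must be a list")

    cleaned = dict(data)
    amount = cleaned["total_amount"]
    if isinstance(amount, str):
        # Strip currency symbols and thousands separators (assumes US-style formatting)
        amount = "".join(ch for ch in amount if ch.isdigit() or ch in ".-")
    cleaned["total_amount"] = float(amount)  # raises ValueError if nothing numeric remains
    return cleaned
```

Calling `validate_invoice(extract_invoice_data("invoice.png"))` then fails loudly on an incomplete extraction instead of passing a partial record downstream.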

Audio Transcription with Whisper

python
from openai import OpenAI

client = OpenAI()

# Transcribe an audio file
with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",  # Includes timestamps
        timestamp_granularities=["segment"]
    )

print(transcript.text)

# Process segments with timestamps
for segment in transcript.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s]: {segment.text}")
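
The segment timestamps are enough to build subtitles. The sketch below converts segments into SRT text. It assumes only that each segment exposes `start`, `end`, and `text`, as in the loop above. The helper names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)
```

Writing the result of `segments_to_srt(transcript.segments)` to a `.srt` file gives a subtitle track that most video players can load.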

Text-to-Speech

python
from pathlib import Path

response = client.audio.speech.create(
    model="tts-1-hd",  # or "tts-1" for faster/cheaper
    voice="alloy",  # alloy, echo, fable, onyx, nova, shimmer
    input="Welcome to our AI-powered service. How can I help you today?",
    speed=1.0
)
Path("welcome.mp3").write_bytes(response.content)
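
Longer scripts have to be split across several requests, because the speech endpoint caps the length of `input`. The sketch below breaks text at sentence boundaries. The 4096-character default is an assumption to verify against the current API reference, and `split_for_tts` is a hypothetical helper name.

```python
import re

TTS_CHAR_LIMIT = 4096  # assumed per-request input limit; confirm in the API reference


def split_for_tts(text: str, limit: int = TTS_CHAR_LIMIT) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through `client.audio.speech.create` in turn and written to its own numbered file.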

Function Calling (Tool Use)

python
import json
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search the product catalog",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "food"]},
                    "max_price": {"type": "number"}
                },
                "required": ["query"]
            }
        }
    }
]


def run_conversation(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto"
    )
    
    if response.choices[0].message.tool_calls:
        # Execute the tool call
        tool_call = response.choices[0].message.tool_calls[0]
        args = json.loads(tool_call.function.arguments)
        
        # Simulate tool execution
        tool_result = {"products": [{"name": "Product A", "price": 29.99}]}
        
        # Continue conversation with tool result
        messages.extend([
            response.choices[0].message,
            {
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(tool_result)
            }
        ])
        
        final_response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages
        )
        return final_response.choices[0].message.content
    
    return response.choices[0].message.content
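
The function above handles only the first tool call, but a single response can contain several, and each one needs its own "tool" message in the follow-up request. One way to handle that is a name-to-function registry, sketched below. `TOOL_REGISTRY`, `run_tool_calls`, and the stubbed `search_products` are hypothetical names for illustration, and the stub returns placeholder data.

```python
import json
from typing import Callable, Optional


def search_products(query: str, category: Optional[str] = None,
                    max_price: Optional[float] = None) -> dict:
    """Hypothetical stand-in for a real catalog lookup."""
    return {"products": [{"name": "Product A", "price": 29.99}], "query": query}


# Maps the tool names declared in `tools` to the Python functions that implement them
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {"search_products": search_products}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list[dict]:
    """Execute every tool call and return one "tool" message per call."""
    results = []
    for call in tool_calls:
        name = call.function.name
        if name not in registry:
            output = {"error": f"Unknown tool: {name}"}
        else:
            try:
                output = registry[name](**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON, or arguments that do not match the function signature
                output = {"error": f"Bad arguments for {name}: {exc}"}
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output),
        })
    return results
```

In `run_conversation`, this would replace the simulated result: append `response.choices[0].message` to `messages`, then extend `messages` with `run_tool_calls(response.choices[0].message.tool_calls)`. Returning errors as tool output, instead of raising, lets the model see the failure and recover.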

Structured Output with Pydantic

python
from pydantic import BaseModel
from typing import List
from openai import OpenAI

client = OpenAI()


class ProductReview(BaseModel):
    sentiment: str
    score: int
    pros: List[str]
    cons: List[str]
    summary: str


completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Analyze product reviews."},
        {"role": "user", "content": "The battery life is amazing, lasts 3 days! Camera is decent but software is buggy."}
    ],
    response_format=ProductReview
)

review = completion.choices[0].message.parsed
print(f"Sentiment: {review.sentiment}")
print(f"Score: {review.score}/10")
print(f"Pros: {', '.join(review.pros)}")
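
The `parsed` attribute can be `None`, for instance when the model declines the request and populates `refusal` instead. A small guard, sketched below, makes that case explicit before any field is read. The name `unwrap_parsed` is hypothetical.

```python
def unwrap_parsed(completion):
    """Return the parsed object, or raise with the reason it is missing."""
    message = completion.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused the request: {refusal}")
    if message.parsed is None:
        raise RuntimeError("No parsed output was returned")
    return message.parsed
```

With this in place, `review = unwrap_parsed(completion)` either yields a `ProductReview` or fails with a readable message, not an `AttributeError` on `None`.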

Batch API for Cost Savings

For workloads that do not need an immediate response, the Batch API offers a 50% cost reduction in exchange for asynchronous completion within a 24-hour window:

python
import json

# Placeholder input; replace with your own documents
texts_to_summarize = ["First document text", "Second document text"]

# Create batch requests, one per document
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            "max_tokens": 500
        }
    }
    for i, text in enumerate(texts_to_summarize)
]

# Write to JSONL file (one request per line)
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload and create batch
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch ID: {batch.id}")
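
Creating the batch only queues the work. The results arrive later as a JSONL output file that is matched back to the inputs by `custom_id`. The sketch below shows one way to poll for completion and read that file. The helper names are hypothetical, and the one-minute polling interval is an arbitrary choice.

```python
import json
import time


def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0):
    """Poll until the batch reaches a terminal status, then return the batch object."""
    terminal = {"completed", "failed", "expired", "cancelled"}
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in terminal:
            return batch
        time.sleep(poll_seconds)


def parse_batch_output(jsonl_text: str) -> dict[str, str]:
    """Map each custom_id to its completion text, skipping failed requests."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            continue
        results[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results
```

Once `wait_for_batch(client, batch.id)` returns with status `"completed"`, the output can be fetched with `client.files.content(finished.output_file_id).text` and passed to `parse_batch_output`. Output order is not guaranteed to match input order, which is why the lookup is keyed on `custom_id`.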

Cost Optimization Tips

| Strategy | Savings |
| --- | --- |
| Use gpt-4o-mini for simple tasks | 95% cheaper than gpt-4o |
| Batch API for async processing | 50% cheaper |
| Prompt caching (>1024 tokens) | 50% on cached tokens |
| Reduce max_tokens | Direct cost reduction |
| Use structured outputs | Fewer retries |
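
To see what these discounts mean for a given workload, the arithmetic can be sketched as a small estimator. Prices are passed in as parameters, not hard-coded, because they change over time. The function name is hypothetical. The table does not say whether the batch and caching discounts combine, so this sketch applies only one at a time.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_m: float, output_price_per_m: float,
                  cached_tokens: int = 0, batch: bool = False) -> float:
    """Estimate request cost in the same currency as the per-million-token prices."""
    if batch and cached_tokens:
        raise ValueError("This sketch applies either the batch or the caching discount, not both")
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")

    # Cached prompt tokens are billed at 50% of the input price
    billable_input = (prompt_tokens - cached_tokens) + cached_tokens * 0.5
    total = (billable_input * input_price_per_m
             + completion_tokens * output_price_per_m) / 1_000_000

    # Batch requests are billed at 50% of the standard price
    return total * 0.5 if batch else total
```

Feeding in `response.usage.prompt_tokens` and `response.usage.completion_tokens` from a real call, along with current list prices, gives a per-request figure to compare across the strategies in the table.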

Conclusion

GPT-4o's multimodal API makes it possible to build applications that were science fiction just a few years ago. From real-time invoice processing to voice-enabled assistants, the primitives are all here. Start with the basic chat API, layer in vision as needed, and graduate to function calling for agentic workflows.

