AI Text-to-Speech 2026: OpenAI TTS, ElevenLabs, and Voice Cloning

Build voice AI applications with natural-sounding TTS and custom voice cloning

Advanced, 30 minutes

A TTS API comparison and tutorial covering OpenAI TTS for production, ElevenLabs for voice cloning, streaming TTS for chatbots, and building a full voice AI assistant.

Modern AI TTS produces natural-sounding voices that, in short clips, can be hard to distinguish from human speech.

API Comparison 2026

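| Service | Quality | Languages | Voice Cloning | Cost |
|---|---|---|---|---|
| OpenAI TTS | High | 57 | No | $15/1M chars |
| ElevenLabs | Highest | 29 | Yes | $5/10K chars |
| Cartesia | High | 15 | Yes | $5/1M chars |
| Kokoro (local) | Good | EN/JP | No | Free |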
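To make the cost column easier to compare, the sketch below converts each rate to a per-character price and estimates the bill for a given script length. The dictionary keys and the `estimate_cost` helper are illustrative names, and the rates are simply the figures from the comparison above, not live pricing.

```python
# Per-character prices derived from the comparison above (USD).
# These are the article's figures, not live pricing; check each
# provider's pricing page before budgeting.
PRICE_PER_CHAR = {
    "openai_tts": 15 / 1_000_000,  # $15 per 1M characters
    "elevenlabs": 5 / 10_000,      # $5 per 10K characters
    "cartesia": 5 / 1_000_000,     # $5 per 1M characters
    "kokoro_local": 0.0,           # free (runs locally)
}


def estimate_cost(service: str, num_chars: int) -> float:
    """Return the estimated USD cost of synthesizing num_chars characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_CHAR[service] * num_chars


if __name__ == "__main__":
    script = "Welcome to AI Skill Navigator. " * 1000
    for service in PRICE_PER_CHAR:
        print(f"{service:>12}: ${estimate_cost(service, len(script)):.2f}")
```

Because the rates differ by orders of magnitude, running this against your real monthly character volume is usually more informative than comparing the headline prices.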

OpenAI TTS API

python
from openai import OpenAI
from pathlib import Path

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate speech and write it to a file
with client.audio.speech.with_streaming_response.create(
    model='tts-1-hd',  # tts-1 (fast) or tts-1-hd (higher quality)
    voice='alloy',     # alloy, echo, fable, onyx, nova, shimmer
    input='Welcome to AI Skill Navigator. This tutorial covers text-to-speech APIs.',
) as response:
    response.stream_to_file(Path('output.mp3'))

# Streaming to speakers in real-time
import pyaudio

p = pyaudio.PyAudio()
# Raw PCM from the API is 24 kHz, 16-bit signed, mono
stream = p.open(format=pyaudio.paInt16, channels=1, rate=24000, output=True)

with client.audio.speech.with_streaming_response.create(
    model='tts-1',
    voice='nova',
    input='This text streams to speakers in real-time as it generates.',
    response_format='pcm',  # the default is mp3, which PyAudio cannot play directly
) as response:
    for chunk in response.iter_bytes(1024):
        stream.write(chunk)

stream.stop_stream()
stream.close()
p.terminate()
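The speech endpoint caps the length of `input` per request (4096 characters at the time of writing; treat the constant below as an assumption and confirm it in the API reference). A minimal sketch of splitting longer text at sentence boundaries before sending it, where `split_for_tts` is a hypothetical helper name:

```python
import re

MAX_CHARS = 4096  # assumed per-request input limit; verify against the current API docs


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed as `input` in its own request, with the resulting audio played or saved in order.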

ElevenLabs API (Best Quality)

python
import os

from elevenlabs import VoiceSettings, save
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ['ELEVENLABS_API_KEY'])  # avoid hard-coding keys

# Text to speech
audio = client.text_to_speech.convert(
    voice_id='21m00Tcm4TlvDq8ikWAM',  # Voice ID (Rachel)
    text='Hello! This is an AI-generated voice with ElevenLabs.',
    model_id='eleven_multilingual_v2',
    voice_settings=VoiceSettings(
        stability=0.5,
        similarity_boost=0.8,
        style=0.3,
        use_speaker_boost=True,
    ),
)
save(audio, 'elevenlabs_output.mp3')

# Clone a voice
# Note: the cloning entry point has changed between SDK releases (older
# versions expose client.clone, newer ones client.voices.ivc.create), so
# check the docs for your installed version if this call is missing.
voice = client.voices.clone(
    name='My Custom Voice',
    description='Professional narration voice',
    files=['sample1.mp3', 'sample2.mp3', 'sample3.mp3'],  # 1-30 samples
)
print(f'New voice ID: {voice.voice_id}')

# Use cloned voice
audio = client.text_to_speech.convert(
    voice_id=voice.voice_id,
    text='This is generated using my cloned voice!',
)
save(audio, 'cloned_output.mp3')
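A failed cloning upload is easier to debug if the sample list is checked locally first. The sketch below validates the sample count and files before any API call; `check_samples` is a hypothetical helper, the 30-sample ceiling is taken from the comment in the snippet above, and the accepted extensions are an assumption to adjust for your account.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".mp3", ".wav", ".m4a"}  # assumption: adjust to what your plan accepts
MAX_SAMPLES = 30  # upper bound quoted in the snippet above


def check_samples(paths: list[str]) -> list[Path]:
    """Return validated sample paths, or raise ValueError on the first problem."""
    if not 1 <= len(paths) <= MAX_SAMPLES:
        raise ValueError(f"expected 1-{MAX_SAMPLES} samples, got {len(paths)}")
    checked = []
    for raw in paths:
        path = Path(raw)
        if path.suffix.lower() not in ALLOWED_SUFFIXES:
            raise ValueError(f"unsupported file type: {path.name}")
        if not path.is_file() or path.stat().st_size == 0:
            raise ValueError(f"missing or empty file: {path}")
        checked.append(path)
    return checked
```

Calling `check_samples([...])` on the sample list before the clone request surfaces typos in file names immediately instead of as an API error.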

Streaming TTS for Chatbots

python
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

client = AsyncOpenAI()  # the async client is required for `async with` / `async for`

async def speak_stream(text: str) -> AsyncIterator[bytes]:
    """Stream TTS audio as it generates for low-latency chatbot responses."""
    async with client.audio.speech.with_streaming_response.create(
        model='tts-1',
        voice='alloy',
        input=text,
    ) as response:
        async for chunk in response.iter_bytes(1024):
            yield chunk  # Yield chunks to your audio player

# In a FastAPI endpoint:
app = FastAPI()

@app.post('/speak')
async def speak_endpoint(text: str):  # `text` is read from the query string
    return StreamingResponse(speak_stream(text), media_type='audio/mpeg')
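In a chatbot the text itself usually arrives as a token stream from the language model, so latency also depends on how soon the first complete sentence can be handed to TTS. A minimal sketch of that buffering step, independent of any API; `sentences_from_tokens` is a hypothetical helper and the punctuation rule is deliberately naive (abbreviations such as "Dr." will split early):

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed text fragments into complete sentences."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be sent to the TTS stream as soon as it is available, so playback of the first sentence can start while the model is still generating the rest.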

Building a Voice AI Assistant

python
import subprocess

from openai import OpenAI


class VoiceAssistant:
    def __init__(self):
        self.client = OpenAI()
        self.history = []  # alternating user/assistant messages

    def listen(self) -> str:
        # Record audio using pyaudio
        # ...recording code...
        # Transcribe the recorded file with Whisper
        with open('/tmp/recording.wav', 'rb') as f:
            return self.client.audio.transcriptions.create(model='whisper-1', file=f).text

    def think(self, user_input: str) -> str:
        self.history.append({'role': 'user', 'content': user_input})
        r = self.client.chat.completions.create(
            model='gpt-4o',
            messages=[{'role': 'system', 'content': 'You are a helpful voice assistant.'}] + self.history
        )
        response = r.choices[0].message.content
        self.history.append({'role': 'assistant', 'content': response})
        return response

    def speak(self, text: str):
        # Stream the synthesized audio to disk, then play it
        with self.client.audio.speech.with_streaming_response.create(
            model='tts-1', voice='nova', input=text
        ) as response:
            response.stream_to_file('/tmp/response.mp3')
        subprocess.run(['afplay', '/tmp/response.mp3'], check=True)  # macOS

    def run(self):
        try:
            while True:
                print('Listening...')
                text = self.listen()
                print(f'You: {text}')
                response = self.think(text)
                print(f'Assistant: {response}')
                self.speak(response)
        except KeyboardInterrupt:
            print('Goodbye.')  # Ctrl+C ends the loop cleanly
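The assistant above appends every exchange to `self.history`, so a long session sends an ever-growing message list with each request. One common remedy is a sliding window over recent messages. The sketch below is an illustration, with `trim_history` and `max_messages` as hypothetical names:

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Return the most recent messages, starting at a user turn."""
    trimmed = history[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant messages so the window opens with a user message.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Inside `think`, passing `trim_history(self.history)` instead of `self.history` when building `messages` would keep request size bounded while leaving the full transcript available on the object.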

Conclusion

AI TTS in 2026 makes natural-sounding voice practical for applications. Use OpenAI TTS for production APIs, ElevenLabs for voice cloning and the highest quality, and Kokoro locally for cost-free inference.
