
Together AI Platform: Production Guide

Running open-source models with Together AI

Together AI's position in the open-model API market is breadth + full lifecycle: one of the largest hosted catalogs of open-weights models (chat, code, vision, image, embeddings, rerankers), plus fine-tuning and dedicated GPU clusters — so teams can prototype on serverless per-token pricing and graduate to dedicated capacity without changing vendors. This guide covers the integration, the platform features that matter in production, and the honest comparison with its neighbors.

Integration: OpenAI-compatible, one URL

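python
import os

from openai import OpenAI

# Together exposes an OpenAI-compatible API, so the stock OpenAI client works
# once base_url points at Together.
client = OpenAI(
    base_url='https://api.together.xyz/v1',
    api_key=os.environ['TOGETHER_API_KEY'],
)

resp = client.chat.completions.create(
    model='meta-llama/Llama-3.3-70B-Instruct-Turbo',
    messages=[{'role': 'user', 'content': 'Summarize this incident report...'}],
    stream=True,
)

# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in resp:
    print(chunk.choices[0].delta.content or '', end='', flush=True)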

Notes from production use:

  • "Turbo"/"Lite" variants are quantized servings of the same weights at different price/quality points. Benchmark the variant you'll actually buy on your own eval set; FP8 deltas are usually negligible but task-dependent.
  • JSON mode and function calling are supported on the mainstream chat models; tool-calling reliability varies by model family — verify per model, not per platform.
  • Embeddings and rerankers on the same account are convenient for RAG stacks: one vendor for both the retrieval models and the generator.
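The "benchmark the variant you'll buy" advice can be sketched as a tiny harness that scores several model IDs on the same eval set. This is a minimal sketch, not a full eval framework: the `generate(model, prompt)` callable, the model names, and the substring-match scoring are all assumptions made for illustration. In practice `generate` would wrap a chat-completions call, and the scoring rule would be whatever your task needs.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical signature: generate(model_id, prompt) -> completion text.
Generate = Callable[[str, str], str]


def pass_rate(generate: Generate, model: str,
              cases: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases whose output contains `expected`."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = 0
    for prompt, expected in cases:
        output = generate(model, prompt)
        if expected.lower() in output.lower():
            hits += 1
    return hits / len(cases)


def compare_variants(generate: Generate, models: Sequence[str],
                     cases: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Score every model ID on the same cases so the results are comparable."""
    return {model: pass_rate(generate, model, cases) for model in models}


def fake_generate(model: str, prompt: str) -> str:
    """Offline stand-in for a real API call; model IDs here are made up."""
    answers = {"2+2?": "4", "Capital of France?": "Paris"}
    if model == "example/quantized" and prompt == "2+2?":
        return "five"  # simulate a regression in the cheaper variant
    return answers[prompt]


if __name__ == "__main__":
    cases = [("2+2?", "4"), ("Capital of France?", "Paris")]
    models = ["example/reference", "example/quantized"]
    print(compare_variants(fake_generate, models, cases))
```

Keeping the model call behind a plain callable means the same harness can be pointed at a second provider later without changes.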
The lifecycle features

  • Fine-tuning service: upload JSONL, get a LoRA or full fine-tune hosted at near-base prices — the managed version of the LoRA recipe. The practical win is *hosting*: your adapter serves on their serverless infra, so a fine-tune doesn't commit you to GPU ops.
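  • Dedicated endpoints / GPU clusters: reserved capacity for steady or latency-strict workloads. The serverless→dedicated switch is the same utilization arithmetic as everywhere: reserved capacity wins once your sustained traffic would cost more at per-token prices than the reservation does.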
  • Batch inference at a discount for offline workloads — same pattern as the closed-provider batch APIs.
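The utilization arithmetic behind the serverless-versus-dedicated decision can be written down in a few lines. This is a rough sketch under simplifying assumptions: a single blended per-token price, a fixed hourly rate for the dedicated endpoint, and a known peak throughput. The function name and all numbers are hypothetical, not Together's actual prices.

```python
def breakeven_utilization(dedicated_usd_per_hour: float,
                          serverless_usd_per_mtok: float,
                          peak_tokens_per_sec: float) -> float:
    """Fraction of peak throughput at which dedicated and serverless cost the same.

    Above this utilization the dedicated endpoint is cheaper; below it,
    per-token pricing is. A result above 1.0 means dedicated never pays off
    at these prices.
    """
    if min(dedicated_usd_per_hour, serverless_usd_per_mtok, peak_tokens_per_sec) <= 0:
        raise ValueError("all inputs must be positive")
    tokens_per_hour_at_peak = peak_tokens_per_sec * 3600
    # What the same hour of fully utilized traffic would cost on serverless.
    serverless_usd_per_hour_at_peak = (
        tokens_per_hour_at_peak / 1_000_000 * serverless_usd_per_mtok
    )
    return dedicated_usd_per_hour / serverless_usd_per_hour_at_peak


if __name__ == "__main__":
    # Hypothetical numbers: $10/hour endpoint, $1 per million tokens,
    # 5,000 tokens/sec sustained at full load.
    u = breakeven_utilization(10.0, 1.0, 5000)
    print(f"break-even utilization: {u:.1%}")
```

With those made-up inputs, a fully loaded hour is 18 million tokens, or $18 on serverless, so the endpoint breaks even at 10/18 of peak. Real workloads should also account for idle nights and weekends, which pull average utilization down.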
Where Together fits vs neighbors

    Platform | Edge
    Together | Catalog breadth + fine-tune-and-host lifecycle
    Fireworks | Latency + function-calling polish
    Groq | Raw speed on a narrow catalog
    HF Endpoints | The Hub long tail + private models
    Self-host vLLM | Control/cost at sustained scale

    They compete on price per model and leapfrog monthly — run a two-provider bake-off on *your* top models, and keep the loser configured as a fallback target; open-model APIs being OpenAI-compatible makes multi-homing nearly free.
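Keeping the bake-off loser as a fallback can be as small as an ordered list of providers tried in turn. The sketch below is one possible shape, not a prescribed pattern: `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and each provider is represented by a plain callable. In a real setup each callable would wrap an OpenAI-compatible client configured with that provider's base URL, API key, and model ID.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider raised an exception."""


def complete_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, completion) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(prompt: str) -> str:
        raise TimeoutError("simulated outage")

    def secondary(prompt: str) -> str:
        return f"echo: {prompt}"

    print(complete_with_fallback([("primary", primary), ("secondary", secondary)], "hello"))
```

Returning the provider name alongside the text makes it easy to log how often the fallback actually fires. A production version would likely narrow the caught exceptions and add timeouts.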

Production checklist

  • [ ] Pin exact model IDs (provider IDs differ from HF names) in config, not code
  • [ ] Eval the quantized variant you'll buy, not the reference weights
  • [ ] Rate-limit tier confirmed against launch traffic; backpressure/queue in front
  • [ ] Cost per feature instrumented from day one (per-token prices × real prompts surprise people)
  • [ ] DPA/retention reviewed (privacy diligence)
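For the cost-per-feature item, one lightweight approach is a ledger that multiplies token counts by per-token prices and accumulates the result under a feature name. This is a sketch under assumptions: `Price` and `CostLedger` are hypothetical helpers, the prices are placeholders, and the token counts are expected to come from the `usage.prompt_tokens` and `usage.completion_tokens` fields of a non-streaming OpenAI-compatible response.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Price:
    """USD per million tokens; fill in from the provider's current price list."""
    input_usd_per_mtok: float
    output_usd_per_mtok: float


class CostLedger:
    """Accumulates estimated spend per product feature."""

    def __init__(self, prices: Dict[str, Price]) -> None:
        self._prices = prices
        self._totals: Dict[str, float] = defaultdict(float)

    def record(self, feature: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> float:
        """Add one request's estimated cost to `feature` and return that cost."""
        price = self._prices[model]  # KeyError here means an unpriced model ID
        cost = (
            prompt_tokens * price.input_usd_per_mtok
            + completion_tokens * price.output_usd_per_mtok
        ) / 1_000_000
        self._totals[feature] += cost
        return cost

    def totals(self) -> Dict[str, float]:
        return dict(self._totals)


if __name__ == "__main__":
    # Placeholder model ID and prices, for illustration only.
    ledger = CostLedger({"example/model": Price(1.0, 2.0)})
    ledger.record("summarize", "example/model", prompt_tokens=1200, completion_tokens=300)
    print(ledger.totals())
```

Failing loudly on an unpriced model ID is deliberate here: a silent zero would hide exactly the surprise the checklist item warns about.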
FAQ

    Why pay anyone instead of self-hosting open models? Below sustained high utilization, per-token beats owning GPUs + ops; above it, self-hosting wins. Most teams cross that line later than they think.

    Model deprecations? Open-model catalogs rotate as new families ship — pin versions, subscribe to deprecation notices, and keep your eval set ready to qualify replacements quickly. (Model-landscape tracking: model library.)

    Is quality identical to the reference model? Serving stack and quantization introduce small deltas — usually noise, occasionally not. Trust your eval, not the model name.


    *Last updated: June 2026. Catalog and pricing move monthly — verify at together.ai.*
