← Back to tutorials

Llama 3.3 70B (2025-12): What's New and How to Use It

Complete guide to the latest Llama 3.3 70B capabilities: best open-source performance, multilingual

Llama 3.3 70B: What It Changed and How to Run It

Llama 3.3 70B (Meta, December 2024) mattered for one headline reason: it delivered roughly the instruction-following quality of the giant Llama 3.1 405B in a 70B package — per Meta's own announcement — making "frontier-ish quality on hardware you can actually rent" real. It became the default serious open-weights choice of its generation, and understanding it still frames how to evaluate today's open models.

What it actually changed

  • 405B-class quality at 70B: Meta's stated goal and the community's verdict — post-training improvements (not architecture changes) closed most of the gap to 3.1-405B on instruction tasks at ~1/6 the parameters. Deployment cost difference: a 405B needs a multi-node GPU cluster; a 70B runs on a single 8×80GB node, or quantized on far less.
  • 128K context, multilingual officially across 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai).
  • Same license family (Llama Community License): free for most commercial use, with the famous >700M-MAU clause and naming/attribution requirements — read it once before shipping.
  • Instruction-tuned text model only — no vision variant (that was 3.2's job).
  • Running it: the three realistic paths

    1. Local via Ollama (evaluation, light use):

    bash
    ollama run llama3.3
    

    Quantized (Q4) needs ~40-43GB — a 64GB Mac or dual-24GB-GPU box runs it; 32GB machines should look at smaller models instead (local model comparison).

    2. Hosted open-model APIs (production without GPUs): Together, Fireworks, Groq and peers serve Llama models at per-token prices far below closed flagships — OpenAI-compatible endpoints, so it's a base-URL swap:

    python
    from openai import OpenAI
    client = OpenAI(base_url='https://api.fireworks.ai/inference/v1', api_key='...')
    resp = client.chat.completions.create(
        model='accounts/fireworks/models/llama-v3p3-70b-instruct',  # provider-specific ID
        messages=[{'role': 'user', 'content': 'Classify this support ticket...'}],
    )
    

    (Provider trade-offs: Fireworks production guide.)

    3. Self-hosted vLLM (volume + data control): an 8-GPU node with tensor parallelism, or FP8/AWQ quantization to shrink the footprint — sizing math and throughput levers in our inference optimization and KV cache guides.

    What it was (and wasn't) good for

    Strong: instruction following, summarization, extraction, RAG answer synthesis, multilingual chat, fine-tuning base for domain models (the 70B + LoRA recipe — guide — became the standard enterprise pattern).

    Weak vs closed flagships of its time: hardest math/reasoning chains, agentic tool-use reliability over long horizons, and anything needing vision.

    Where it stands now

    Meta has since moved to the Llama 4 family (MoE architecture, multimodal), and open-weights competition intensified — Qwen, DeepSeek, and Kimi K2 all ship open models that beat Llama 3.3 on various axes. But 3.3-70B remains widely deployed in 2026 because it's a known quantity: stable behavior, huge tooling/fine-tune ecosystem, and runs everywhere. The evaluation playbook it taught — judge open models by deployment-cost-per-quality, license terms, and ecosystem, not leaderboard position alone — is the durable lesson; current side-by-sides live in the model library.

    FAQ

    Llama 3.3 vs 3.1-405B today? Unless you're chasing the last points on hard reasoning, 3.3-70B (or a newer mid-size open model) is the rational choice — the 405B's serving cost rarely justifies the delta.

    Can I use outputs to train other models? The license has specific terms on this (and on attribution) — check the current Llama license text rather than folklore.

    Fine-tune 3.3 or prompt a frontier API? Decision framework in fine-tuning vs RAG: stable narrow task + volume economics → fine-tune the 70B; everything else → start with prompting.


    *Last updated: June 2026. Specs per Meta's release notes; verify license and current model lineup at llama.com.*

    Also available in 中文.