
LLM API Cost Control in Practice: 12 Ways to Cut Your AI Bill from $500 to $80

A Complete Guide to Production LLM Cost Optimization, Each Tip Backed by Real Data


The Numbers First

Before and after comparison for a SaaS product:

| Metric | Before | After | Reduction |
| --- | --- | --- | --- |
| Monthly API Cost | $520 | $83 | 84% |
| Average Response Time | 4.2s | 1.8s | 57% |
| Cost per Request | $0.026 | $0.004 | 85% |
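The reduction column can be reproduced from the before and after figures. A minimal sketch; the helper name `reduction_pct` is illustrative and not taken from the product's codebase:

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage drop from the 'before' value to the 'after' value."""
    return (before - after) / before * 100

# Figures copied from the comparison table above.
metrics = {
    "Monthly API Cost ($)": (520, 83),
    "Average Response Time (s)": (4.2, 1.8),
    "Cost per Request ($)": (0.026, 0.004),
}

for name, (before, after) in metrics.items():
    print(f"{name}: {reduction_pct(before, after):.0f}% reduction")
```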


Category 1: Model Selection (Reduce 50-70%)

Method 1: Use the Right Model for the Right Task

The most common source of waste is using GPT-4o for everything. Route each task type to the cheapest model that handles it well:

python
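def get_model(task_type: str) -> str:
    """Return the cheapest model that handles the given task type well."""
    routing = {
        "classification": "gpt-4o-mini",     # $0.15/1M input tokens, about 1/17 of GPT-4o's $2.50/1M
        "summarization": "gpt-4o-mini",
        "simple_qa": "gpt-4o-mini",
        "code_review": "claude-3-5-haiku-20241022",
        "complex_reasoning": "gpt-4o",
        "math": "o3-mini",
    }
    # Unknown task types fall back to the cheapest general-purpose model
    return routing.get(task_type, "gpt-4o-mini")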
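How much routing saves depends on how traffic splits between cheap and expensive models. The sketch below estimates a blended price from the two input-token prices quoted in this article. The 60/40 split is a made-up example, and output-token prices and the other models in the routing table are left out for simplicity:

```python
# Input-token prices in USD per 1M tokens, as quoted in this article.
PRICE_PER_1M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def blended_price(mix: dict[str, float]) -> float:
    """Average price per 1M tokens for a traffic mix (shares must sum to 1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(PRICE_PER_1M[model] * share for model, share in mix.items())

# Hypothetical mix: 60% of tokens go to simple tasks, 40% still need GPT-4o.
routed = blended_price({"gpt-4o-mini": 0.6, "gpt-4o": 0.4})
baseline = blended_price({"gpt-4o": 1.0})
saving = 1 - routed / baseline
print(f"blended: ${routed:.2f}/1M, saving {saving:.0%}")
```

With a larger share of simple tasks the saving grows, so it is worth measuring the real task mix before estimating.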


Method 2: DeepSeek API Alternative (Chinese Scenarios)

The DeepSeek V3 API costs about ¥1 per million tokens (roughly $0.14/1M), while GPT-4o costs $2.50/1M. For Chinese-language tasks, switching to DeepSeek cuts the per-token cost by about 94%:

python
from openai import OpenAI

# DeepSeek's API is compatible with the OpenAI SDK, so this client is a drop-in replacement
client = OpenAI(api_key="deepseek-key", base_url="https://api.deepseek.com")
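One way to apply this only to Chinese-language traffic is to check how much of the input is Chinese before choosing a provider. This is a rough sketch: the 0.3 threshold, the provider labels, and the use of the basic CJK ideograph range are all assumptions to tune for real traffic.

```python
import re

# Basic CJK Unified Ideographs block; assumed to cover common Chinese text.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

def chinese_ratio(text: str) -> float:
    """Share of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK_PATTERN.match(c)) / len(chars)

def pick_provider(text: str, threshold: float = 0.3) -> str:
    """Return a hypothetical provider label based on how Chinese the text is."""
    return "deepseek" if chinese_ratio(text) >= threshold else "openai"

print(pick_provider("请帮我总结这篇文章"))
print(pick_provider("Summarize this article"))
```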


Category 2: Prompt Optimization (Reduce 20-40%)

Method 3: Compress System Prompt

python
# Verbose version (850 tokens)
# "You are a professional customer service assistant. Your task is to help users solve problems.
# You should remain friendly, professional, and patient..." (500 characters)

# Concise version (120 tokens, same effect)
system = "Chinese customer service assistant. Friendly and professional. Do not disclose internal info. Admit when unsure."
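Because the system prompt is sent with every request, the saving scales with request volume. A back-of-the-envelope sketch: the request volume is an estimate implied by the "before" figures ($520 / $0.026 is about 20,000 requests per month), and the price is the GPT-4o figure quoted earlier.

```python
def monthly_prompt_saving(
    tokens_before: int,
    tokens_after: int,
    requests_per_month: int,
    price_per_1m_usd: float,
) -> float:
    """USD saved per month by sending a shorter system prompt on every request."""
    saved_tokens = (tokens_before - tokens_after) * requests_per_month
    return saved_tokens / 1_000_000 * price_per_1m_usd

# 850 -> 120 tokens as above; 20,000 requests/month and $2.50/1M are assumptions.
saving = monthly_prompt_saving(850, 120, 20_000, 2.50)
print(f"${saving:.2f} saved per month")
```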

Method 4: Limit Output Length

python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=300,  # Without this, output could exceed 2000 tokens
)
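A hard cap can cut answers off mid-sentence. The Chat Completions API reports this with a `finish_reason` of `"length"`, so it is worth checking. A minimal sketch; the stand-in object below only mimics the response shape so the check runs without an API call:

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the max_tokens cap."""
    return response.choices[0].finish_reason == "length"

# Stand-in with the same shape as a chat completion response (illustration only).
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
if was_truncated(fake):
    print("Output hit the cap: raise max_tokens or ask for a shorter answer")
```

If truncation happens often, asking for a shorter answer in the prompt usually works better than raising the cap.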

Method 5: Task Merging

python
# One request instead of three
result = call_llm("""Perform three tasks on the following article:
1. Create an attention-grabbing title (within 20 characters)
2. Write a 100-word summary
3. Provide 3-5 tags
Output JSON: {"title": "...", "summary": "...", "tags": [...]}
""")
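A merged request is only useful if all three parts can be read back reliably, so the JSON reply should be validated before use. A small sketch; the function name `parse_merged_result` and the sample reply are illustrative:

```python
import json

REQUIRED_KEYS = {"title", "summary", "tags"}

def parse_merged_result(raw: str) -> dict:
    """Parse the merged JSON reply and check that all three parts are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    return data

# Hypothetical model reply, for illustration only.
reply = '{"title": "Cut LLM costs", "summary": "A short summary.", "tags": ["llm", "cost"]}'
parsed = parse_merged_result(reply)
print(parsed["title"], parsed["tags"])
```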


Category 3: Caching Strategies (Reduce 30-60%)

Method 6: Semantic Caching

python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = {}  # Use Redis in production

def cached_llm_call(query: str):
    # Normalized embeddings make the dot product equal to cosine similarity
    query_emb = model.encode(query, normalize_embeddings=True)
    for cached_emb, response in cache.values():
        if np.dot(query_emb, cached_emb) > 0.95:
            return response  # Cache hit, zero API cost
    response = call_llm(query)
    cache[query] = (query_emb, response)
    return response
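The same idea can be packaged as a small class that returns the closest match above the threshold rather than the first one found. This is a dependency-free sketch: the class name, the injectable `embed` function, and the toy letter-count embedding are assumptions made so the example runs on its own. A real deployment would plug in a sentence-embedding model.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Returns the closest cached response above a similarity threshold."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def get(self, query: str) -> Optional[str]:
        query_emb = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(query_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

# Toy embedding (letter counts), used only to make the sketch runnable.
def toy_embed(text: str) -> list[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

cache = SemanticCache(toy_embed)
cache.put("how do I reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password?"))  # same letters, so a hit
print(cache.get("what is your refund policy"))   # unrelated, so a miss
```

The threshold is the main tuning knob: too low and users get answers to a different question, too high and the cache rarely hits.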

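Method 7: Claude Prompt Caching

python
import anthropic

anthropic_client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # Required by the Messages API
    system=[{"type": "text", "text": long_system_prompt,
             "cache_control": {"type": "ephemeral"}}],  # Mark as cacheable
    messages=user_messages
)

# A second call with the same system prompt pays only about 10% of the normal input price for the cached part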
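The 10% figure applies to cache reads; the first call that writes the cache is billed at a premium. The sketch below works through the arithmetic, assuming a 25% write surcharge, a 10% read price, and that every call arrives before the cache entry expires:

```python
# Assumed multipliers relative to the normal input price: the first call
# writes the cache at a 25% surcharge, later calls read it at 10%.
CACHE_WRITE_MULTIPLIER = 1.25
CACHE_READ_MULTIPLIER = 0.10

def cached_prompt_cost_ratio(n_calls: int) -> float:
    """Cost of the cached prefix over n calls, relative to no caching."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    cached = CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (n_calls - 1)
    return cached / n_calls

for n in (1, 2, 10, 100):
    print(f"{n:>3} calls: {cached_prompt_cost_ratio(n):.0%} of the uncached cost")
```

Under these assumptions caching pays off from the second call onward, and a prompt that is used only once costs slightly more with caching than without.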


Category 4: Batch Processing (Save 50%)

Method 8: OpenAI Batch API

python
import json

# `client` is an OpenAI client and `tasks` is a list of prompt strings
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": task}]},
    }
    for i, task in enumerate(tasks)
]

with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

with open("batch.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # Returns within 24 hours, 50% off
)
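Batch results come back as a JSONL file, one record per request, and the `custom_id` is what ties each result to its original task. A sketch of reading that file; the helper name and the sample record are illustrative, and failed requests are simply skipped here:

```python
import json

def parse_batch_output(jsonl_text: str) -> dict[str, str]:
    """Map each custom_id to the assistant text from a batch output file."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            continue  # Failed requests are skipped; retry them separately
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results

# Sample line in the shape of a batch output record, for illustration only.
sample = json.dumps({
    "custom_id": "task-0",
    "response": {"status_code": 200,
                 "body": {"choices": [{"message": {"content": "positive"}}]}},
    "error": None,
})
print(parse_batch_output(sample))
```

In practice the text would come from `client.files.content(batch.output_file_id).text` once `client.batches.retrieve(batch.id).status` reports `"completed"`.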


Category 5: Monitoring and Governance

Methods 9-12: Tracking, Quotas, Auditing

python
# Track cost per feature/user (`db` stands for the application's database wrapper)
def track_cost(user_id, feature, tokens, model):
    # USD per 1K input tokens, matching the prices quoted earlier
    costs = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}
    cost = tokens / 1000 * costs.get(model, 0.001)
    db.insert("api_costs", {"user_id": user_id, "feature": feature, "cost": cost})

# Set daily usage limits
MAX_TOKENS = 50000

def check_quota(user_id, requested):
    used = db.query("SELECT SUM(tokens) FROM usage WHERE user_id=? AND date=today()", user_id)
    # SUM returns NULL (None) when the user has no usage rows yet
    return ((used or 0) + requested) <= MAX_TOKENS
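For a version of the quota check that can be run as-is, the sketch below uses an in-memory SQLite database. The table layout, the column names, and the use of SQLite's `date('now')` (which is UTC) are assumptions made for illustration:

```python
import sqlite3

MAX_TOKENS = 50_000

def init_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE usage (user_id TEXT, day TEXT, tokens INTEGER)")
    return conn

def record_usage(conn: sqlite3.Connection, user_id: str, tokens: int) -> None:
    conn.execute("INSERT INTO usage VALUES (?, date('now'), ?)", (user_id, tokens))

def check_quota(conn: sqlite3.Connection, user_id: str, requested: int) -> bool:
    """True if the request fits in what is left of today's allowance."""
    row = conn.execute(
        "SELECT COALESCE(SUM(tokens), 0) FROM usage "
        "WHERE user_id = ? AND day = date('now')",
        (user_id,),
    ).fetchone()
    return row[0] + requested <= MAX_TOKENS

conn = init_db()
record_usage(conn, "u1", 49_000)
print(check_quota(conn, "u1", 500))    # fits in the remaining allowance
print(check_quota(conn, "u1", 2_000))  # would exceed the daily cap
```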

Monthly Audit Checklist:

  • Which system prompts can be trimmed?
  • Which tasks can be downgraded to cheaper models?
  • Which high-frequency queries benefit from caching?
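An audit is easier to start from a ranking of where the money goes. A minimal sketch that totals logged costs per feature; the row shape follows what `track_cost` writes, and the sample rows are made up:

```python
from collections import defaultdict

def cost_by_feature(rows: list[dict]) -> list[tuple[str, float]]:
    """Total cost per feature, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["feature"]] += row["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up rows in the shape written by track_cost, for illustration only.
rows = [
    {"user_id": "u1", "feature": "chat", "cost": 0.50},
    {"user_id": "u2", "feature": "summary", "cost": 0.10},
    {"user_id": "u1", "feature": "chat", "cost": 0.25},
]
for feature, total in cost_by_feature(rows):
    print(f"{feature}: ${total:.2f}")
```

The features at the top of this list are the first candidates for prompt trimming, model downgrades, or caching.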

Results Summary

| Method | Difficulty | Expected Reduction |
| --- | --- | --- |
| Model Routing | Low | 40-60% |
| DeepSeek Alternative (Chinese) | Low | 80-94% |
| System Prompt Trimming | Low | 10-30% |
| Semantic Caching | Medium | 30-60% |
| Batch API | Medium | 50% |
| Prompt Caching | Medium | 20-50% |
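The reductions do not simply add up when methods are stacked, because each one applies to the cost left over after the previous one. A sketch of the compounding arithmetic, using made-up percentages and assuming the savings are independent of each other:

```python
from math import prod

def combined_reduction(reductions: list[float]) -> float:
    """Overall reduction when each saving applies to the remaining cost."""
    return 1 - prod(1 - r for r in reductions)

# Hypothetical stack: routing 50%, then caching 40%, then prompt trimming 20%.
total = combined_reduction([0.50, 0.40, 0.20])
print(f"combined reduction: {total:.0%}")
```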

