LLM API Cost Control in Practice: 12 Ways to Cut Your AI Bill from $500 to $80
A Complete Guide to Production LLM Cost Optimization, Each Tip Backed by Real Data
The Numbers First
Before and after comparison for a SaaS product:
Category 1: Model Selection (Reduce 50-70%)
Method 1: Use the Right Model for the Right Task
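The most common source of waste is using GPT-4o for everything. Route simple tasks to a cheaper model and reserve the larger one for work that needs it.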
python
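def get_model(task_type: str) -> str:
    """Return the cheapest model that handles the given task type well."""
    routing = {
        "classification": "gpt-4o-mini",  # $0.15/1M input tokens, about 1/17 of GPT-4o's $2.50/1M
        "summarization": "gpt-4o-mini",
        "simple_qa": "gpt-4o-mini",
        "code_review": "claude-3-5-haiku-20241022",
        "complex_reasoning": "gpt-4o",
        "math": "o3-mini",
    }
    # Unknown task types fall back to the cheapest model
    return routing.get(task_type, "gpt-4o-mini")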
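To see what routing is worth, per-request cost can be estimated from token counts. The sketch below is a rough illustration only: the price table holds assumed list prices in USD per 1M tokens (check the providers' pricing pages before relying on them), and `estimate_cost` and the token counts are hypothetical.

```python
# Assumed list prices in USD per 1M tokens (input, output); verify before relying on them.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical classification request: 500 tokens in, 20 tokens out
big = estimate_cost("gpt-4o", 500, 20)
small = estimate_cost("gpt-4o-mini", 500, 20)
print(f"gpt-4o: ${big:.6f}, gpt-4o-mini: ${small:.6f}, ratio: {big / small:.1f}x")
```

Under these assumed prices, the same short request costs roughly 17 times more on the larger model, which is why sending bulk classification traffic to the smaller model accounts for most of the savings in this category.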
Method 2: DeepSeek API Alternative (Chinese Scenarios)
The DeepSeek V3 API costs about ¥1 per million tokens ($0.14/1M), versus $2.50/1M for GPT-4o. For Chinese-language tasks, switching to DeepSeek cuts cost by about 94%:
python
from openai import OpenAI

# DeepSeek's API is compatible with the OpenAI SDK, so only the key and base URL change
client = OpenAI(api_key="deepseek-key", base_url="https://api.deepseek.com")
Category 2: Prompt Optimization (Reduce 20-40%)
Method 3: Compress System Prompt
python
# Verbose version (850 tokens)
verbose_system = (
    "You are a professional customer service assistant. Your task is to help users solve problems. "
    "You should remain friendly, professional, and patient..."
)  # about 500 characters in the original prompt

# Concise version (120 tokens, same effect)
system = "Chinese customer service assistant. Friendly and professional. Do not disclose internal info. Admit when unsure."
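Because the system prompt is sent with every request, the saving scales with traffic. A back-of-the-envelope sketch, where the request volume and the input price are hypothetical and `monthly_prompt_savings` is an illustrative helper:

```python
def monthly_prompt_savings(tokens_before: int, tokens_after: int,
                           requests_per_day: int, price_per_mtok: float,
                           days: int = 30) -> float:
    """USD saved per month by shortening a system prompt sent with every request."""
    saved_tokens = (tokens_before - tokens_after) * requests_per_day * days
    return saved_tokens * price_per_mtok / 1_000_000

# 850 -> 120 tokens as above; 1,000 requests/day and $2.50 per 1M input tokens are assumptions
savings = monthly_prompt_savings(850, 120, 1_000, 2.50)
print(f"${savings:.2f} saved per month")
```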
Method 4: Limit Output Length
python
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=300,  # Without this cap, output can exceed 2000 tokens
)
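A hard cap can cut a reply off mid-sentence. The Chat Completions response reports this through `finish_reason`, which is `"length"` when the cap was hit. A minimal sketch of checking for it, where `was_truncated` is a hypothetical helper and the demo uses stand-in objects shaped like a response instead of a real API call:

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the max_tokens cap."""
    return response.choices[0].finish_reason == "length"

# Stand-in objects shaped like a Chat Completions response (no API call is made)
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])

print(was_truncated(cut_off))
print(was_truncated(complete))
```

Logging how often this returns true is one way to judge whether a cap is too tight for a given feature.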
Method 5: Task Merging
python
# One request instead of three
result = call_llm("""
Perform three tasks on the following article:
1. Create an attention-grabbing title (within 20 characters)
2. Write a 100-word summary
3. Provide 3-5 tags
Output JSON: {"title": "...", "summary": "...", "tags": [...]}
""")
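Merging tasks only pays off if the combined reply can be parsed reliably, since a malformed reply means paying for a retry. The sketch below validates the JSON shape requested in the prompt above. `parse_merged_result` is a hypothetical helper, and the sample reply is hand-written for illustration.

```python
import json

REQUIRED_KEYS = {"title", "summary", "tags"}

def parse_merged_result(raw: str) -> dict:
    """Parse the model's JSON reply and check that all three task outputs are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    return data

# Hypothetical reply, for illustration only
sample = '{"title": "Cut LLM Costs", "summary": "A short summary.", "tags": ["llm", "cost", "api"]}'
parsed = parse_merged_result(sample)
print(parsed["tags"])
```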
Category 3: Caching Strategies (Reduce 30-60%)
Method 6: Semantic Caching
python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = {}  # Use Redis in production

def cached_llm_call(query: str):
    # Normalize so that the dot product below is cosine similarity
    query_emb = model.encode(query, normalize_embeddings=True)
    for cached_emb, response in cache.values():
        if np.dot(query_emb, cached_emb) > 0.95:
            return response  # Cache hit, zero API cost
    response = call_llm(query)
    cache[query] = (query_emb, response)
    return response
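The loop above returns the first cached entry that clears the threshold. One possible variant returns the closest match instead, and takes the embedding function as a parameter so the lookup logic can be exercised without loading a model. This is a sketch under those assumptions: `SemanticCache`, `cosine`, and `toy_embed` are hypothetical names, and the keyword-count embedding exists only for the demo.

```python
import math
from typing import Callable, Optional, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str) -> Optional[str]:
        """Return the response of the most similar cached query, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

# Toy embedding for the demo only: counts of a few keywords
def toy_embed(text: str):
    words = text.lower().split()
    return [words.count(w) for w in ("refund", "password", "shipping")]

demo_cache = SemanticCache(toy_embed)
demo_cache.put("how do I reset my password", "Use the reset link on the login page.")
print(demo_cache.get("reset password please"))
print(demo_cache.get("where is my shipping update"))
```

The linear scan is fine for a few thousand entries; beyond that, a vector index would be the usual replacement.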
Method 7: Claude Prompt Caching
python
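import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    system=[{"type": "text", "text": long_system_prompt,
             "cache_control": {"type": "ephemeral"}}],  # Mark as cacheable
    messages=user_messages,
)
# On later calls with the same system prompt, the cached portion is billed at about 10% of the normal input price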
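Caching is not free on the first call: writing to the cache is billed at a premium, and only later reads get the discount. The sketch below compares the two cases for the system prompt alone. The multipliers (1.25x for a cache write, 0.10x for a cache read) and the $3.00 per 1M input tokens are assumptions taken from published pricing at the time of writing, and the workload is hypothetical.

```python
def system_prompt_cost(calls: int, prompt_tokens: int, price_per_mtok: float,
                       cached: bool, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """USD spent on the system prompt alone across `calls` requests.

    Assumes every call after the first hits the cache, i.e. calls arrive
    before the cache entry expires.
    """
    base = prompt_tokens * price_per_mtok / 1_000_000
    if not cached or calls == 0:
        return calls * base
    return base * write_mult + (calls - 1) * base * read_mult

# Hypothetical workload: 4,000-token system prompt, 100 calls
plain = system_prompt_cost(100, 4000, 3.00, cached=False)
with_cache = system_prompt_cost(100, 4000, 3.00, cached=True)
print(f"without caching: ${plain:.4f}, with caching: ${with_cache:.4f}")
```

With these numbers a single call costs more with caching than without, and the discount starts paying off from the second call onward, so caching suits prompts that are reused within the cache lifetime.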
Category 4: Batch Processing (Save 50%)
Method 8: OpenAI Batch API
python
import json

requests = [
    {"custom_id": f"task-{i}", "method": "POST",
     "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": task}]}}
    for i, task in enumerate(tasks)
]
with open("batch.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file, then submit it as a batch job
with open("batch.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # Results return within 24 hours at 50% off
)
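Once the batch completes, the results arrive as a JSONL file with one record per request, and the records are not guaranteed to be in input order, so they should be matched by `custom_id`. A minimal sketch of reading that file, assuming the documented record shape (`custom_id`, `response.status_code`, `response.body`, `error`). `parse_batch_output` is a hypothetical helper and the sample lines are hand-written, not real API output.

```python
import json

def parse_batch_output(jsonl_text: str) -> dict:
    """Map each custom_id to its reply text, or to None if that request failed."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        results[record["custom_id"]] = response["body"]["choices"][0]["message"]["content"]
    return results

# Hand-written sample lines for illustration (not real API output)
sample = "\n".join([
    json.dumps({"custom_id": "task-1", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"role": "assistant", "content": "positive"}}]}}}),
    json.dumps({"custom_id": "task-0", "error": {"code": "server_error"}, "response": None}),
])
print(parse_batch_output(sample))
```

In the OpenAI Python SDK, the file text is typically fetched with `client.files.content(batch.output_file_id)` after `client.batches.retrieve(batch.id)` reports a `completed` status.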
Category 5: Monitoring and Governance
Methods 9-12: Tracking, Quotas, Auditing
python
# Track cost per feature/user
def track_cost(user_id, feature, tokens, model):
    # USD per 1K input tokens, matching the prices quoted earlier
    costs = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}
    cost = tokens / 1000 * costs.get(model, 0.001)
    db.insert("api_costs", {"user_id": user_id, "feature": feature, "cost": cost})

# Set daily usage limits
MAX_TOKENS = 50000

def check_quota(user_id, requested):
    used = db.query("SELECT SUM(tokens) FROM usage WHERE user_id=? AND date=today()", user_id)
    return ((used or 0) + requested) <= MAX_TOKENS  # SUM is NULL when the user has no rows yet
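The `db` object above is a placeholder. For a self-contained picture of the quota check, here is a sketch using Python's built-in `sqlite3` with an in-memory database. The table layout and the `record_usage` helper are assumptions made for illustration, not the schema of any particular product.

```python
import sqlite3
from datetime import date

MAX_TOKENS = 50_000  # daily per-user limit, as in the snippet above

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (user_id TEXT, day TEXT, tokens INTEGER)")

def record_usage(user_id: str, tokens: int) -> None:
    """Store the tokens consumed by one request under today's date."""
    conn.execute("INSERT INTO usage VALUES (?, ?, ?)",
                 (user_id, date.today().isoformat(), tokens))

def check_quota(user_id: str, requested: int) -> bool:
    """True if the request fits within today's remaining token budget."""
    row = conn.execute(
        "SELECT COALESCE(SUM(tokens), 0) FROM usage WHERE user_id = ? AND day = ?",
        (user_id, date.today().isoformat()),
    ).fetchone()
    return row[0] + requested <= MAX_TOKENS

record_usage("u1", 45_000)
print(check_quota("u1", 4_000))   # fits within the remaining 5,000 tokens
print(check_quota("u1", 6_000))   # would exceed the daily limit
print(check_quota("u2", 50_000))  # new user with no usage yet
```

`COALESCE` makes the sum zero for users with no rows, which avoids the `None` arithmetic error a bare `SUM` would cause.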
Monthly Audit Checklist:
Results Summary