Lilian Weng's Blog: Scaling Laws Are Not Infallible, Industry Consensus Has Methodological Flaws

Lilian Weng, a former OpenAI VP and Peking University alumna, returned after a three-year hiatus with a blog post titled "Scaling Laws, Carefully", published on June 24, 2026. The post systematically reviews the origins, controversies, and limitations of scaling laws. It argues that the disagreement between OpenAI and DeepMind over compute-optimal allocation comes down to how parameters were counted and to the small scale of the earlier experiments. It also notes that DeepMind's Chinchilla fit has bugs of its own: the loss was averaged rather than summed, which made the optimizer stop prematurely, and key parameters were reported to only two decimal places. Finally, extrapolating scaling laws fitted on small models to the trillion-parameter regime can sharply amplify errors, and the implicit assumption of an "infinite data supply" runs into the depletion of high-quality text.

Core Controversy: Opposite Conclusions from OpenAI and DeepMind

2020, OpenAI (Kaplan team): Optimal model size N_opt ∝ C^0.73, meaning that a 10x increase in compute should go mostly to the model: about 5.5x more parameters and only 1.8x more data. This conclusion guided GPT-3 training (175B parameters, 300B tokens).
2022, DeepMind (Chinchilla team): N_opt ∝ C^0.50, so model size and data should scale in equal proportion, with an optimal token-to-parameter ratio of about 20:1. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) across the board under the same compute budget, overturning the industry consensus.
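
The arithmetic behind these two allocations can be checked with a short sketch. It assumes the common rule of thumb C ≈ 6·N·D for training FLOPs, which is not stated in the article; under that assumption, N ∝ C^a implies D ∝ C^(1-a). The function names are illustrative.

```python
def growth_factors(exponent, compute_multiplier=10.0):
    """Return (model_growth, data_growth) for a given compute multiplier.

    Assumes C ≈ 6 * N * D (a rule of thumb, not taken from the article),
    so N ∝ C**a implies D ∝ C**(1 - a).
    """
    model_growth = compute_multiplier ** exponent
    data_growth = compute_multiplier ** (1.0 - exponent)
    return model_growth, data_growth


def tokens_per_parameter(params, tokens):
    """Training tokens divided by parameter count."""
    return tokens / params


def approx_train_flops(params, tokens):
    """Training compute under the assumed C ≈ 6 * N * D approximation."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    for name, a in [("Kaplan", 0.73), ("Chinchilla", 0.50)]:
        n_x, d_x = growth_factors(a)
        print(f"{name}: 10x compute -> {n_x:.2f}x parameters, {d_x:.2f}x tokens")

    # (parameters, tokens) as quoted in the text above
    models = {
        "GPT-3": (175e9, 300e9),
        "Gopher": (280e9, 300e9),
        "Chinchilla": (70e9, 1.4e12),
    }
    for name, (n, d) in models.items():
        print(f"{name}: {tokens_per_parameter(n, d):.1f} tokens/param, "
              f"~{approx_train_flops(n, d):.2e} FLOPs")
```

With the exponent 0.73 the split comes out near 5.4x and 1.9x, in line with the rounded 5.5x and 1.8x quoted above; with 0.50 both factors are about 3.16x. Under the same approximation, Gopher and Chinchilla land at a comparable compute budget even though their token-to-parameter ratios differ by roughly a factor of 20.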

Root of Disagreement: Bookkeeping Issues and Experimental Scale

A 2024 TMLR paper reconciled the two results by pointing to two factors:

Parameter counting differences: Kaplan excluded embedding-layer parameters, while Chinchilla included them. In small models the embeddings make up a large share of all parameters, which biases the fitted exponent. The correction formula is N = N_E + ω·N_E^(1/3), where N is the total parameter count and N_E denotes the count without embeddings (the ω term approximates the embedding parameters).
Insufficient experimental scale: Kaplan's largest model had only 1.5B parameters, while Chinchilla swept models up to 16B+. At small scale the fitted exponent is close to 0.73; as scale increases, it converges to 0.50.
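
To see why the counting convention matters most for small models, the sketch below evaluates the correction formula over a range of sizes. The value of ω is hypothetical: it is derived here from an assumed vocabulary size and width-to-depth ratio, neither of which comes from the article, and the function names are illustrative.

```python
def omega_from_shape(vocab_size=50_000, aspect_ratio=40.0):
    """Hypothetical ω, assuming N_E ≈ 12 * layers * d_model**2,
    embeddings ≈ vocab_size * d_model, and d_model / layers = aspect_ratio.
    These shape assumptions are illustrative, not taken from the article.
    """
    return vocab_size * (aspect_ratio / 12.0) ** (1.0 / 3.0)


def total_params(n_nonemb, omega):
    """Correction formula from the text: N = N_E + ω * N_E**(1/3)."""
    return n_nonemb + omega * n_nonemb ** (1.0 / 3.0)


def embedding_share(n_nonemb, omega):
    """Fraction of all parameters that sit in the embeddings."""
    return 1.0 - n_nonemb / total_params(n_nonemb, omega)


def local_slope(n_nonemb, omega):
    """d log N / d log N_E: how fast the total count grows relative to the
    non-embedding count on a log-log plot (1.0 means the two agree)."""
    emb = omega * n_nonemb ** (1.0 / 3.0)
    return (n_nonemb + emb / 3.0) / (n_nonemb + emb)


if __name__ == "__main__":
    omega = omega_from_shape()
    for n_nonemb in (1e6, 1e7, 1e8, 1e9, 1e10):
        print(f"N_E={n_nonemb:.0e}: "
              f"embedding share {embedding_share(n_nonemb, omega):.1%}, "
              f"log-log slope {local_slope(n_nonemb, omega):.3f}")
```

Under these assumptions the embedding share falls and the slope approaches 1 as models grow, which is consistent with the two counting conventions disagreeing most at small scale.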

Methodological Flaws in Chinchilla Itself

When the Epoch AI team reproduced Chinchilla's Method 3 (directly fitting the loss function) in 2024, they discovered two bugs:

Loss averaged instead of summed: Averaging the Huber loss yields very small values, so the L-BFGS-B optimizer wrongly detects convergence and stops early, returning non-optimal parameters.
Key parameters rounded to two decimal places: The rounding errors are amplified exponentially. In addition, the reported confidence intervals are so narrow that about 600,000 experiments would be needed to obtain them, while fewer than 500 were actually run.
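
The first bug can be illustrated with a toy problem. The sketch below is not the Chinchilla fitting code: a plain gradient descent loop stands in for L-BFGS-B, and the data, tolerance, learning rate, and Huber delta are all hypothetical. It only shows the mechanism: dividing the objective by the number of points shrinks its gradient by the same factor, so a fixed stopping tolerance fires earlier.

```python
import math


def huber_grad(residual, delta=1.0):
    """Derivative of the Huber loss with respect to the residual."""
    return max(-delta, min(delta, residual))


def objective_grad(theta, ys, reduction):
    """Gradient in theta of the summed or averaged Huber loss of (y - theta)."""
    grad_sum = -sum(huber_grad(y - theta) for y in ys)
    return grad_sum / len(ys) if reduction == "mean" else grad_sum


def fit_location(ys, reduction, tol=1e-5, lr=0.5, max_iter=10_000):
    """Plain gradient descent (a stand-in for L-BFGS-B) that stops when the
    gradient of the chosen objective drops below a fixed tolerance.
    Returns (theta, steps taken)."""
    theta = 0.0
    for steps in range(max_iter):
        if abs(objective_grad(theta, ys, reduction)) < tol:
            return theta, steps
        # Step along the averaged gradient so both variants follow the same
        # path and differ only in where the stopping rule fires.
        theta -= lr * objective_grad(theta, ys, "mean")
    return theta, max_iter


if __name__ == "__main__":
    ys = [2.0 + 0.1 * math.sin(i) for i in range(400)]  # synthetic data
    target = sum(ys) / len(ys)  # minimizer once all residuals are below delta
    for reduction in ("mean", "sum"):
        theta, steps = fit_location(ys, reduction)
        print(f"{reduction}: stopped after {steps} steps, "
              f"error {abs(theta - target):.2e}")
```

In this toy setting the averaged objective stops sooner and farther from the minimizer than the summed one, even though both follow the same path.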

Extrapolation Risks and Data Bottleneck

Unreliable extrapolation: When scaling laws fitted on small models are extrapolated to trillion-parameter models, tiny differences in the fitted parameters (e.g., from rounding) can change the conclusions drastically. The blog includes an interactive simulator that shows how sensitive the fitted results are to these parameters.
Data finiteness: The formula assumes an infinite data supply, but high-quality text is nearing depletion. The industry is shifting toward reinforcement learning, test-time compute, and synthetic data.
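
Both risks can be made concrete with a rough sketch. It assumes the parametric loss form L(N, D) = E + A/N^α + B/D^β with C ≈ 6·N·D, for which the compute-optimal exponent is a = β/(α+β); this form and the sample values of α and β below are assumptions for illustration, not figures from the article. It also assumes the competing fits agree at the scale where they were fitted and differ only in the exponent.

```python
import itertools


def n_opt_exponent(alpha, beta):
    """Exponent a in N_opt ∝ C**a for the assumed loss form
    L(N, D) = E + A / N**alpha + B / D**beta with C ≈ 6 * N * D."""
    return beta / (alpha + beta)


def exponent_range(alpha_2dp, beta_2dp, half_width=0.005):
    """Range of exponents compatible with values reported to two decimals,
    i.e. true values anywhere within ±0.005 of the printed ones."""
    offsets = (-half_width, 0.0, half_width)
    candidates = [
        n_opt_exponent(alpha_2dp + da, beta_2dp + db)
        for da, db in itertools.product(offsets, repeat=2)
    ]
    return min(candidates), max(candidates)


def extrapolation_spread(a_lo, a_hi, compute_ratio):
    """Ratio between the largest and smallest predicted N_opt after scaling
    compute by compute_ratio from the point where the fits agree."""
    return compute_ratio ** (a_hi - a_lo)


def tokens_needed(params, tokens_per_param=20.0):
    """Training tokens implied by the roughly 20:1 ratio quoted earlier."""
    return params * tokens_per_param


if __name__ == "__main__":
    a_lo, a_hi = exponent_range(0.34, 0.28)  # hypothetical rounded values
    print(f"exponent between {a_lo:.4f} and {a_hi:.4f}")
    for ratio in (1e2, 1e4, 1e6):
        spread = extrapolation_spread(a_lo, a_hi, ratio)
        print(f"compute x{ratio:.0e}: N_opt predictions differ by x{spread:.2f}")
    print(f"1T parameters at 20:1 -> {tokens_needed(1e12):.1e} tokens")
```

The spread grows with every additional order of magnitude of extrapolation, since it is the compute ratio raised to the width of the exponent range. On the data side, a 1-trillion-parameter model at 20 tokens per parameter would call for about 20 trillion training tokens.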

Industry Impact

Lilian Weng's blog post sparked widespread discussion, with readers lining up to welcome her back. She lamented that "many people will let AI summarize rather than actually read" and said she plans to set up a model to update the blog automatically. The post is seen as a sober examination of the faith in scaling laws, a reminder to watch for methodological flaws and local biases when billions of dollars are being bet.
