Bug Found in OpenAI's Original Scaling Law Paper, Potentially Wasting Global Compute Resources

Recently, DeepMind researcher Sander Dieleman shared a blog post pointing out a critical bug in OpenAI's 2020 Scaling Law paper, a flaw that may have led the global AI industry to waste significant compute over the past few years. The post, titled "Scaling Laws, Honestly," was written by former OpenAI researcher Diogo Almeida, who argues that the original scaling law was distorted by this bug.

Background: Proposal and Impact of Scaling Laws

In its 2020 paper, OpenAI proposed that, under a fixed compute budget, model parameters should be scaled up preferentially over data, with the optimal parameter count growing as the 0.73 power of compute. This conclusion directly influenced the design of models like GPT-3 (175 billion parameters) and drove the industry trend of "bigger is better." In 2022, however, DeepMind's Chinchilla paper overturned this conclusion, showing that model size and training tokens should be scaled in equal proportion, at roughly 20 tokens per parameter. Chinchilla, with 70 billion parameters and 1.4 trillion tokens, outperformed Gopher (280 billion parameters, 300 billion tokens) at roughly the same compute budget, exposing the problem of "over-parameterized but undertrained" models.
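The Gopher and Chinchilla figures above can be checked with a few lines of arithmetic. The sketch below assumes the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation and the function names are introduced here for illustration and are not taken from the blog post.

```python
import math

# Assumption: training compute C is approximately 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
def training_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

def tokens_per_param(params: float, tokens: float) -> float:
    return tokens / params

def compute_optimal_split(flops: float, ratio: float = 20.0) -> tuple[float, float]:
    """Split a compute budget using a fixed tokens-per-parameter ratio.

    With D = ratio * N and C = 6 * N * D, we get N = sqrt(C / (6 * ratio)).
    The default ratio of 20 is the rule of thumb quoted in the article.
    """
    n = math.sqrt(flops / (6.0 * ratio))
    return n, ratio * n

# Parameter and token counts as quoted in the article.
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    print(f"{name}: {training_flops(n, d):.2e} FLOPs, "
          f"{tokens_per_param(n, d):.1f} tokens per parameter")

n_opt, d_opt = compute_optimal_split(training_flops(70e9, 1.4e12))
print(f"20 tokens/param split: {n_opt:.2e} params, {d_opt:.2e} tokens")
```

Under this approximation both models land in the same range, about 5.0e23 FLOPs for Gopher and 5.9e23 for Chinchilla, while their tokens-per-parameter ratios differ by a factor of nearly twenty.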

Key Details: Specific Manifestation of the Bug

Diogo Almeida identified three critical issues in the original paper:

Fixed training token count: All models, regardless of size, were trained on approximately 130B tokens, causing small models to overfit and large models to underfit.
Cosine learning rate decay: The learning rate was forced to zero at the end of training, artificially creating a performance saturation effect that misled researchers into believing that adding more data was ineffective.
Misleading conclusion: The paper claimed results were "largely insensitive to the learning rate schedule," but this only held under a limited token budget and not in the infinite data regime described by scaling laws.
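To make the learning-rate issue concrete, here is a minimal sketch of a cosine decay schedule. It illustrates only the shape of the schedule, not the original paper's training code; the peak learning rate and step counts are hypothetical values chosen for the example.

```python
import math

def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical horizons: one run planned for 50k steps, one for 100k steps.
short_horizon = 50_000
long_horizon = 100_000

for step in (0, 25_000, 50_000):
    print(f"step {step:>6}: "
          f"short-run lr = {cosine_lr(step, short_horizon):.2e}, "
          f"long-run lr = {cosine_lr(step, long_horizon):.2e}")
```

At step 50,000 the short run has already decayed to zero while the long run is still at half its peak rate. Whatever the token budget, the schedule shrinks updates toward zero as the run ends, so a flattening loss curve near the end of training reflects the schedule as much as it reflects any limit on what more data could teach the model.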

Additionally, in 2024, Besiroglu et al. found a problem in the Chinchilla paper itself: a poorly chosen loss scale in the optimizer caused the curve fitting to terminate prematurely. This underscores that scaling laws are empirical fits, not ironclad rules.
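What "empirical fit" means in practice can be shown with a toy example. The sketch below fits a power law y = a·x^b by least squares in log-log space. The data points are synthetic, generated from a known exponent purely for illustration, and do not come from either paper.

```python
import math

def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log space = log(a)
    return a, b

# Synthetic data only: "optimal parameter count" generated as 0.09 * C**0.5.
# Both the coefficient and the exponent are arbitrary illustrative choices.
compute = [1e18, 1e19, 1e20, 1e21, 1e22]
params = [0.09 * c ** 0.5 for c in compute]

a, b = fit_power_law(compute, params)
print(f"fitted coefficient a = {a:.3f}, fitted exponent b = {b:.3f}")
```

The fit recovers the exponent here because the data is noise-free by construction. With real training runs, the fitted exponent depends on which runs are included and how they were trained, which is why a change in experimental setup can shift the conclusion.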

Reactions and Data

Diogo Almeida: Admitted he did not catch the bug while at OpenAI because the learning rate schedule appeared "carefully set."
Sander Dieleman: Tweeted that this bug may have caused the industry to waste compute on many "overly large, undertrained" models.
Adam Zachary Wasserman: Further noted that current scaling laws are essentially "English scaling laws," as English is morphologically poor and requires more data, while languages like French and Chinese are more efficient, indicating a language bias in compute allocation.

Impact and Reflection

The exposure of this bug suggests that the global AI industry may have mistakenly prioritized parameter scaling over training data for years, leading to misallocated compute. The researchers cited above argue that smaller models trained on more high-quality data could have achieved better performance, saving significant H100 runtime costs. This also prompts reflection on the nature of scaling laws: they are not physical laws but empirical fits obtained under specific experimental conditions, and their validity is limited by data, language, and training settings.

As of now, OpenAI has not officially responded. However, this discovery may push the industry to re-evaluate the balance between model size and data volume, fostering more efficient AI development paths.
