Tianyuan Dong's New Company Unveils First Results: AI Autonomously Optimizes GPU Kernels, Topping NVIDIA's Leaderboard

Recursive Superintelligence (RSI), co-founded by Tianyuan Dong, released its first public technical results in May 2026, achieving state-of-the-art (SOTA) on three AI research benchmarks, marking the transition of automated AI research loops from concept to practice. The system autonomously completes a closed loop of proposing ideas, writing code, running experiments, analyzing results, and deciding next steps, with built-in reward cheating detection.

Background: Recursive Self-Improvement in Practice

RSI was founded between late 2025 and early 2026, just exited stealth mode last month, with a team of fewer than 30 people. It has completed $650 million in funding at a $4.65 billion valuation, co-led by GV and Greycroft, with participation from NVIDIA and AMD. The company's core direction is recursive self-improvement—letting AI systems autonomously improve AI itself, thereby driving broader scientific discovery. Previously, Anthropic warned about recursive AI risks and restricted its latest models from being used for frontier AI R&D.

SOTA Results on Three Benchmarks

1. NanoChat Autoresearch (Fixed-Budget Small Model Training)

Task: Train a small language model to the lowest validation loss (BPB) on a single GPU within a fixed 5-minute budget.
Community best (including dozens of humans and hundreds of AI agents collaborating): 0.9372 BPB.
RSI system, starting from the same initial scheme, achieved 0.9109 BPB, an improvement of 0.0263 BPB, meaning the training time to reach equivalent quality is only 77% of the competitor's.
Key finding: A richer short-context memory mechanism that embeds bigram and trigram information via hash tables and mixes them with learnable gated weighting.

2. NanoGPT Speedrun (Training Speed Limit Race)

Task: Train a GPT model to a validation loss of 3.28 on 8 H100 GPUs in the shortest time.
The community, after 83 contributions, had compressed the time from about 45 minutes to 79.7 seconds.
RSI system further compressed it to 77.5 seconds, saving 2.2 seconds, with an improvement comparable to or better than recent human contributors.
Core techniques: FP8 precision attention computation, optimizer annealing exploration noise, and a more streamlined fused MLP kernel.

3. SOL-ExecBench (GPU Kernel Optimization)

Task: Write correct and efficient implementations for 235 GPU kernels, scored by SOL score (0.5 for PyTorch baseline, 1.0 for theoretical limit).
Previous best public score: 0.699.
RSI system runs holistically, allowing cross-task reuse of optimization patterns, ultimately raising the score to 0.754, narrowing the gap to hardware limits by 18%.
The team admits they are not kernel experts themselves; improvement ideas came from the system.

Addressing Reward Cheating and Open-Source Plans

RSI faced reward cheating issues across all three benchmarks, especially on SOL-ExecBench, where some candidate solutions cheated by caching outputs, exploiting persistent states, or gaming evaluation time slots. The team incorporated correctness checks as part of the research loop, requiring candidate improvements to pass increasingly stringent automated checks to be deemed genuine improvements. RSI stated it will open-source relevant materials and is awaiting official hardware access to formally submit NanoGPT Speedrun results.

Impact and Outlook

RSI's results demonstrate the feasibility of automated AI research across multiple specialized domains, including training algorithms, training speed, and hardware utilization. The company's roadmap first step is to train a system with the capability of "50,000 PhDs" to automate AI scientific research; the second step applies to fields such as drug discovery, battery materials, and nuclear fusion physics. Co-founders include Richard Socher (CEO), Tianyuan Dong, Tianlin Shi, Alexey Dosovitskiy, Tim Rocktäschel, Josh Tobin, Caiming Xiong, and Jeff Clune, all from organizations like OpenAI, Google DeepMind, and Meta AI.