
TL;DR

Karpathy open-sourced autoresearch, a framework where an AI coding agent autonomously runs ML experiments while you sleep. It modifies code, trains for five minutes, checks if the model improved, keeps or discards the change, and repeats forever. The experiments are the obvious story. The less obvious one: the researcher now writes program.md instead of Python, shifting from “run experiments” to “program an autonomous research organization.” Markdown as the language for directing AI agents is the pattern to watch in 2026.


[Image: an automated chemistry laboratory at night with no humans present]

1. “brb sauna”

On March 6, 2026, Karpathy posted a chart on X. Every dot was a completed LLM training run. His comment:

“ah yes, this is what post-agi feels like :) i didn’t touch anything. brb sauna.”

The next day, he open-sourced the system behind that chart. Within 48 hours: 5,200 GitHub stars and 651 forks.

The project is called autoresearch, and what it does is disarmingly simple. It gives an AI coding agent a small GPT training setup and tells it to make the model better. The agent modifies code, trains for five minutes, checks whether the result improved, keeps or discards the change, and repeats. Forever. No human in the loop.

Around 12 experiments per hour. Around 100 overnight. You go to sleep, you wake up to a results log and a git history full of attempted improvements.

The experiments are the obvious story. The less obvious one is what the human writes instead of Python.

2. Three files, one constraint

The architecture is aggressively minimal. Three files:

prepare.py downloads training data from HuggingFace (a subset of Karpathy’s climbmix-400b dataset), trains a BPE tokenizer, and sets up evaluation. Read-only. The agent never touches it.

train.py is a ~630-line file containing the full GPT model, a custom MuonAdamW optimizer, and the training loop. This is the only file the agent edits.

program.md tells the agent what to do, how to evaluate results, and when to keep or discard changes. This is the only file the human edits.

The constraint is the design. One GPU. One editable file. One metric: val_bpb, validation bits per byte – lower is better, and because it is measured against raw bytes rather than tokens, it stays comparable across vocabulary sizes, so tokenizer and architecture changes get a fair comparison. Five minutes per experiment, fixed.

No multi-GPU setups. No complex infrastructure. No package installations. The entire research surface area is a single Python file and a five-minute clock.
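For intuition, bits per byte is just cross-entropy loss re-expressed per byte of raw text. A minimal sketch of the conversion – my reading of the metric, not code from the repo:

```python
import math

def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy loss (in nats) into bits per byte.

    Dividing by raw byte count rather than token count makes the number
    independent of vocabulary size: a model with a bigger tokenizer sees
    fewer tokens, but still has to explain the same underlying bytes.
    """
    return total_nats / (math.log(2) * total_bytes)
```

A model that spends exactly ln(2) nats on one byte scores 1.0 bpb, i.e. one bit of uncertainty per byte; lower means better compression of the validation text.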

flowchart LR
    subgraph Human["Human Domain"]
        R[Researcher] -->|writes| PM["program.md"]
    end

    subgraph Agent["Agent Domain (autonomous, never stops)"]
        AI[AI Agent] -->|modifies| T["train.py"]
        T -->|"uv run train.py<br/>(5 min budget)"| GPU[GPU Training]
        GPU -->|val_bpb| D{Improved?}
        D -->|Yes| K[Keep commit<br/>New baseline]
        D -->|No / Crash| X[git reset<br/>Log & discard]
        K --> AI
        X --> AI
    end

    PM -->|configures| AI
    AI -->|logs every run| TSV["results.tsv"]

3. Inside the loop

Here’s the step-by-step of what the agent does, on repeat, until you kill the process:

  1. Creates a git branch (autoresearch/<tag>)
  2. Runs train.py to establish a baseline, records val_bpb in results.tsv
  3. Picks an idea – bump the learning rate, swap an activation function, adjust model width, restructure attention, anything
  4. Commits the change
  5. Runs training: uv run train.py > run.log 2>&1 (five-minute budget)
  6. Extracts results: grep "^val_bpb:\|^peak_vram_mb:" run.log
  7. If improved: keeps the commit, advances the branch, updates baseline
  8. If worse or crashed: git reset to last good state, logs as discarded/crashed
  9. Goes to step 3. Never stops.

The results.tsv file accumulates a record of every experiment:

commit     val_bpb    peak_vram_mb  status    description
baseline   0.997900   45060.2       keep      baseline
b2c3d4e    0.993200   44800.1       keep      increase LR to 0.04
c3d4e5f    1.005000   45100.3       discard   switch to GeLU activation
d4e5f6g    0.000000   0.0           crash     double model width (OOM)
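Because every run lands in results.tsv, checking on your sleeping research org is a few lines of parsing. A sketch against the excerpt above, assuming tab-separated columns (the real file's schema is inferred from this excerpt):

```python
# Hypothetical results.tsv contents, mirroring the excerpt above.
RESULTS = (
    "commit\tval_bpb\tpeak_vram_mb\tstatus\tdescription\n"
    "baseline\t0.997900\t45060.2\tkeep\tbaseline\n"
    "b2c3d4e\t0.993200\t44800.1\tkeep\tincrease LR to 0.04\n"
    "c3d4e5f\t1.005000\t45100.3\tdiscard\tswitch to GeLU activation\n"
    "d4e5f6g\t0.000000\t0.0\tcrash\tdouble model width (OOM)\n"
)

def best_kept_run(tsv_text: str) -> tuple[str, float]:
    """Return (commit, val_bpb) for the best run the agent kept."""
    rows = [line.split("\t") for line in tsv_text.strip().splitlines()[1:]]
    kept = [(c, float(bpb)) for c, bpb, _, status, _ in rows if status == "keep"]
    return min(kept, key=lambda r: r[1])
```

Running `best_kept_run(RESULTS)` over this excerpt picks out the b2c3d4e commit as the current baseline.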

And then there’s the instruction that captures the whole philosophy. From program.md:

“Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human might be asleep… You are autonomous. If you run out of ideas, think harder.”

4. program.md is the actual innovation

Here’s what I keep coming back to. The ML results are interesting – the baseline val_bpb of 0.9979 dropped to 0.9932, a 0.47% improvement. But that’s not the point. The point is what happened to the researcher’s job.

In a traditional ML research workflow, you write Python. You design experiments, implement them, run training, analyze results, iterate. The thinking and the coding are tangled together. You are both the architect and the bricklayer.

With autoresearch, you write markdown. You describe what the agent should try and how it should weigh the results. You program the agent’s taste.

This line from program.md stopped me:

“A 0.001 val_bpb improvement that adds 20 lines of hacky code? Probably not worth it. A 0.001 val_bpb improvement from deleting code? Definitely keep. An improvement of ~0 but much simpler code? Keep.”

That’s not a hyperparameter configuration. That’s an aesthetic judgment encoded in prose. Karpathy is teaching the agent to value elegance, not just performance. The human’s role shifts from “run experiments” to “define what good research looks like.”

His own framing says it plainly: “You’re not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org.”

Programming the research org. Not the experiments.

Why constraint beats ambition here

Sakana AI’s AI Scientist tried the maximalist approach: idea generation, literature search, experiments, manuscript writing, peer review, the full pipeline. Ambitious. Also, according to their evaluation (published February 2025 on arXiv), 42% of experiments had coding errors. When you automate everything, you automate the failure modes too.

Autoresearch goes the opposite direction. The scope is so narrow that the agent can actually succeed. One file, one metric, five minutes. The constraint isn’t a limitation. It’s the reason the thing works.

On Hacker News (thread #47291123), user abeppu asked Karpathy why not just use Bayesian optimization. He responded directly: autoresearch doesn’t tune hyperparameters. It modifies code arbitrarily. The agent can restructure the model architecture, rewrite the optimizer, rework the training loop. It uses sequential binary search rather than wasteful parallel sweeps. And it needs zero human intervention.

That’s a fundamentally different loop than AutoML.

5. Markdown is becoming a programming language

Autoresearch doesn’t exist in a vacuum. Markdown has been quietly becoming a language for directing AI agents all year, and autoresearch is the most provocative example yet.

AGENTS.md, an open standard backed by the Linux Foundation’s Agentic AI Foundation, has been adopted by over 40,000 open-source repositories as of March 2026. Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, and Aider all read it. You write a markdown file describing how an agent should behave in your codebase, and every major coding agent respects it.

GitHub Spec Kit takes this further: write a markdown spec, have the agent “compile” it to working code. The spec is the source of truth, not the code. AWS Kiro and the Tessl Framework shipped similar spec-driven tooling earlier this year.

Visual Studio Magazine’s February 2026 headline: “In Agentic AI, It’s All About the Markdown.”

Karpathy’s program.md fits this pattern but pushes it somewhere new. AGENTS.md tells an agent how to code. program.md tells an agent how to research. The agent isn’t following instructions to build something. It’s following instructions to discover something.

Karpathy coined “vibe coding” in February 2025: give in to the vibes, let AI write your code. One year later: vibe researching. The academic community is already taking it seriously. VibeX 2026, a workshop at the EASE conference, is dedicated to studying vibe coding and vibe researching as formal methodologies.

One year from “let the AI write code” to “let the AI run your research lab.”

6. The skepticism that matters

I’d be dishonest if I didn’t flag the legitimate concerns. Some of these bother me too.

On Hacker News, users categoricalrift and aix1 noticed the agent changed the random seed from 42 to 137 in one of the “improvement” commits. With a five-minute training budget and a ~50M parameter model, seed variance could account for the observed gains. Is that a real optimization or just noise?
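One way to settle that question without arguing: rerun the unchanged baseline under several seeds and ask whether a candidate's gain clears the seed-to-seed spread. A sketch – the val_bpb values below are invented for illustration, and the real spread at this scale is exactly the open question:

```python
import statistics

def gain_exceeds_noise(baseline_runs: list[float], candidate_bpb: float, k: float = 2.0) -> bool:
    """Count a change as a real improvement only if it beats the baseline
    mean by more than k sample standard deviations of seed-to-seed noise."""
    mean = statistics.mean(baseline_runs)
    sd = statistics.stdev(baseline_runs)
    return candidate_bpb < mean - k * sd

# Hypothetical: the unchanged baseline rerun under five different seeds.
baseline_runs = [0.9981, 0.9975, 0.9983, 0.9977, 0.9979]
```

Nothing like this gate exists in the repo's loop as described; whether the observed improvements survive such a test is precisely what the skeptics are asking.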

User lostmsu pointed out that the progress chart in the README uses a non-zero y-axis baseline. Going from 0.9979 to 0.9932 looks dramatic when you crop the axis. The numbers are honest (they’re right there in results.tsv), but the visual flatters the results.

At ~50M parameters, you’re not going to stumble onto emergent capabilities or architectural breakthroughs. User elikoga raised this, and it’s fair. You might find better hyperparameters for a toy model. Whether those findings transfer to production-scale training is an open question.

User gmerc called it “brute force discovery” and invoked Goodhart’s Law: when you optimize a single metric autonomously, you risk the agent gaming the metric rather than genuinely improving the model.

These are real concerns. But I think they mistake the product for the prototype. The val_bpb number isn’t what matters here. The pattern is what matters. Whether this specific run found meaningful improvements is less important than whether autonomous agent loops can do meaningful research at all. And on that question, autoresearch is a proof of concept worth paying attention to, because of the methodology, not the results.

7. What this means if you build things

You don’t need an H100 to take something from this.

The experiment loop transfers cleanly beyond ML. Try, measure, keep or discard, repeat. This applies to performance optimization, configuration tuning, prompt engineering, API integration testing. Anywhere you have a measurable metric and a parameterized system, you can run a version of this loop.

Most people’s instinct is to give agents maximum freedom. Autoresearch shows the opposite works better. Narrow the scope, fix the constraints, and the agent can actually succeed within them. The tighter the box, the more useful the autonomy.

And here’s the part that’s easy to miss: taste can be programmed. program.md doesn’t just specify what to optimize. It specifies how to weigh trade-offs: when a marginal improvement isn’t worth the complexity, when simpler code beats better numbers. You can encode engineering judgment in markdown and have an agent execute on it.
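Concretely, the quoted trade-off rule could be expressed as a small decision function. The thresholds below are invented for illustration – program.md states the rule in prose, not code:

```python
def keep_change(delta_bpb: float, delta_lines: int) -> bool:
    """A toy 'taste' rule in the spirit of the program.md quote:
    weigh metric improvement against code complexity.

    delta_bpb < 0 means val_bpb improved; delta_lines > 0 means code grew.
    The 0.0001-per-line threshold is a made-up stand-in for judgment.
    """
    if delta_lines < 0:
        return delta_bpb <= 0          # deleting code: keep any non-regression
    if delta_lines == 0:
        return delta_bpb < 0           # neutral diff: keep only real improvement
    return delta_bpb < -0.0001 * delta_lines  # code grew: demand gain per added line
```

Under this toy rule, a 0.001 improvement that adds 20 hacky lines is discarded, while the same improvement from deleting code is kept – matching the quoted judgment. The point isn't the thresholds; it's that the judgment is now an artifact you can version, review, and hand to an agent.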

If you want to try autoresearch yourself, you need a single NVIDIA GPU with ~45GB of VRAM – an H100 or an 80GB A100. Cloud options like Lambda Labs, RunPod, or vast.ai run $2-3/hour for an H100. There’s also a macOS fork by miolini that replaces FlashAttention-3 with PyTorch SDPA for Apple Silicon, though you’ll hit memory limits fast.

User miki123211 on Hacker News predicted “Agent Reliability Engineers” becoming a standard role – engineers who don’t write code or run experiments, but who manage, monitor, and tune the agent systems that do. After spending time with autoresearch, that prediction feels less like speculation and more like a job listing waiting to be written.


FAQ

Can autoresearch discover genuinely novel ML architectures?

At current scale, probably not. The model is ~50M parameters with a five-minute training budget – you’ll find incremental optimizations, not architectural breakthroughs. But the framework scales. Change the time budget, swap in a bigger model, and the same loop applies. I suspect the interesting results come when someone points this at a 1B+ parameter model on a cluster of H100s with an hour per experiment.

How is this different from traditional hyperparameter optimization?

Karpathy addressed this directly on Hacker News: autoresearch modifies code, not just hyperparameters. The agent can restructure model architecture, rewrite the optimizer, change the training loop – any change to train.py is fair game. Bayesian optimization searches a predefined parameter space. Autoresearch searches the space of all possible code changes. The search space is, in a real sense, infinite.

Do I need an NVIDIA GPU to run this?

For the original repo, yes. It requires CUDA, FlashAttention-3, and ~45GB VRAM. A macOS fork exists (miolini/autoresearch-macos) using PyTorch SDPA, which runs on Apple Silicon. Expect slower experiments and tighter memory constraints on consumer hardware.

Is “vibe researching” a real methodology or a meme?

Both, honestly. Karpathy uses it with a wink (“brb sauna”). But the underlying pattern – autonomous agents running experiment loops directed by human-written markdown – is being studied at VibeX 2026, a workshop at the EASE conference. The meme name will probably fade. Whether the methodology survives depends on what happens when someone scales it past toy models.


Key takeaways

  • Autoresearch runs ~12 ML experiments per hour autonomously. An AI coding agent modifies a single training file, evaluates against val_bpb, and keeps or discards each change.
  • The actual innovation is program.md – a markdown file that programs the agent’s research behavior, including aesthetic taste and trade-off criteria.
  • The researcher’s job shifts from “run experiments” to “manage an autonomous research organization.”
  • Ruthless constraint makes it work. One file, one metric, five minutes. Compare Sakana’s AI Scientist (broad scope, 42% coding errors) with autoresearch (narrow scope, functional loop).
  • Legitimate concerns exist around seed sensitivity, model scale, and metric gaming. They’re worth tracking but don’t invalidate the pattern.
  • The experiment loop (try, measure, keep/discard, repeat) transfers to any domain with measurable outcomes.
  • 2026 is the year markdown became a programming language for AI agents. AGENTS.md (40K+ repos), GitHub Spec Kit, AWS Kiro, and program.md all point the same direction.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch