The dark factory era: what engineering looks like when 95% of code is not typed
In manufacturing, a dark factory is a facility that runs with the lights off. No workers on the floor. Machines build, test, and package without anyone watching.
Simon Willison thinks software engineering just entered the same phase.
TL;DR
We crossed a threshold sometime around November 2025. AI coding agents went from “mostly works” to “actually works,” and the consequences are compounding fast. Willison now reports that 95% of the code he produces, he did not type himself. Karpathy’s autoresearch runs hundreds of ML experiments with zero human intervention. StrongDM built an entire engineering team around the principle that humans must not write or review code. Gartner predicts 40% of enterprise applications will include task-specific AI agents by end of 2026, up from less than 5% in 2025. The dark factory is not a thought experiment. It is a deployment pattern.

The more interesting question: what does the job look like on the other side?
The numbers that forced the conversation
The data points arrived in clusters through early 2026:
- Simon Willison on Lenny’s Podcast: “Honestly, six months ago, I thought that was crazy, and today, probably 95% of the code that I produce, I didn’t type it myself.” He writes most of his code from his phone now.
- GitHub reported that over 51% of all code committed to its platform in early 2026 was AI-generated or AI-assisted.
- Karpathy’s autoresearch runs an autonomous loop: modify code, train for five minutes, evaluate, keep or discard, repeat. Roughly 12 experiments per hour. Hundreds overnight. Zero human keystrokes.
- Karpathy’s LLM Wiki – 100 articles, 400,000 words, autonomously maintained through a three-round research loop that identifies gaps, fills them, and cross-references everything with provenance tracking.
- StrongDM published a manifesto in February 2026: “Code must not be written by humans. Code must not be reviewed by humans.” They built behavioral clones of third-party services and run thousands of scenario tests hourly.
- Gartner predicted 40% of enterprise apps will feature task-specific AI agents by end of 2026.
- Stack Overflow’s 2025 survey: 92% of developers use AI tools somewhere in their workflow. 51% use them daily.
You can argue with any individual number. The direction is harder to dismiss.
What the dark factory workflow actually looks like
The cultural commentary writes itself. “AI is replacing developers!” or “AI will never replace developers!” – both miss the structural change happening underneath the headlines.
For engineers operating at Willison’s 95% threshold, the workflow looks roughly like this:
```mermaid
flowchart LR
    A["Human: Requirements<br/>and Constraints"] --> B["Human: Specification<br/>Document"]
    B --> C["Agent: Implementation"]
    C --> D["Agent: Test Generation"]
    D --> E["Automated: CI/CD<br/>Validation"]
    E --> F{"Human: Review<br/>and Judgment"}
    F -->|Accepted| G[Ship]
    F -->|Rejected| H["Human: Refine<br/>Specification"]
    H --> C
```
The human touches three points: the beginning (what to build and why), the specification (how to constrain the agent), and the review gate (does this actually do what I meant). Everything between specification and review is agent territory.
This is not autocomplete. It is delegation. And the distinction matters because delegation requires a different skill set than typing.
Willison’s five levels
Willison described AI-assisted coding as a five-level progression, from least to most autonomous:
- Spicy autocomplete – tab-completion on steroids. The model finishes your line. You stay in control.
- Chat-driven coding – you describe what you want, the model writes a block, you paste it in. Copy-paste engineering.
- Agent-driven development – the agent reads your codebase, makes changes across files, runs tests, iterates. You review diffs.
- Agentic pipelines – multiple agents coordinated through specifications, each handling a piece: one writes code, one writes tests, one reviews, one deploys.
- Dark factory – no human writes or reviews code. The machines handle their own QA. The lights are off.
Most production teams in April 2026 operate somewhere between levels 2 and 3. A small number of teams – StrongDM being the most public – are pushing into levels 4 and 5.
Willison’s own prediction: 50% of engineers will be writing 95% AI code by end of 2026.
What humans still do
The dark factory framing makes it sound like humans vanish from the process. They do not. But they move to different parts of it.
Specification writing. This is the skill shift nobody talks about. When an agent implements your specification literally, the quality of that specification becomes the quality of the code. Ambiguous requirements that a human developer would clarify through conversation become bugs when an agent interprets them silently. Martin Fowler’s team at ThoughtWorks calls this “harness engineering” – designing the system of guides and sensors that surround the agent. The specification is the primary guide.
Architectural judgment. Agents produce locally correct code. They solve the immediate problem. But they have no sense of the system. They do not know that the team decided to migrate from REST to gRPC last quarter, or that the caching layer was designed to avoid database pressure during peak hours. Unarticulated architectural intent – decisions that live in institutional memory rather than documentation – is where agent-generated code drifts hardest. The human role is maintaining coherence across a codebase that no single person typed.
Failure-mode reasoning. CodeRabbit’s 2026 report found that AI-generated code creates 1.7x more issues than human-written code. Not because agents are stupid, but because they optimize for the success path. They handle the obvious cases well. The edge cases, the failure modes, the “what happens when the database is down and the retry queue is full and the user hits refresh” – that reasoning remains a human responsibility.
Taste. Karpathy’s autoresearch works because someone wrote program.md – a markdown file that defines what the agent should try, what counts as an improvement, and what to avoid. That file encodes taste. It says: try architectural changes, not just hyperparameter sweeps. Prefer simplicity. Do not increase parameter count without a proportional quality gain. The agent has no taste of its own. It executes the taste you specify.
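Karpathy's actual program.md is not reproduced here. As an illustration only, a hypothetical taste file for a loop like this might look something like:

```markdown
# program.md (hypothetical sketch, not Karpathy's actual file)

## Goal
Reduce validation loss on the five-minute training run.

## Try
- Architectural changes (attention variants, normalization placement),
  not just hyperparameter sweeps.

## Prefer
- Simplicity. Fewer moving parts beats a marginal metric win.

## Avoid
- Increasing parameter count without a proportional quality gain.
```

The point is not the specific rules. It is that every line is a judgment call a human made once, which the agent then applies hundreds of times overnight.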
Review under volume. When you type code, reviewing it is redundant – you already know what it does. When an agent types code, review is the entire job. And the volume is enormous. A single engineer directing multiple agents might generate dozens of pull requests per day. The review bottleneck is not speed. It is the cognitive cost of understanding code you did not write and verifying that it does what you intended rather than what you described.
Where failure modes appear
The dark factory has failure modes that are different in kind from traditional development, not just in degree.
Specification drift
The agent implements what it interpreted your specification to mean, not what you meant. This is subtle because the code often looks correct on first review. It compiles, passes tests, handles the happy path. But the edge behavior diverges from intent in ways that only surface in production. The fix is not better prompts. It is better specifications – executable specifications, scenario-based testing, property-based testing that verifies invariants rather than examples.
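One way to verify invariants rather than examples is property-based testing. A stdlib-only sketch (libraries like Hypothesis automate the input generation), using a hypothetical `normalize_username` function standing in for agent-written code:

```python
import random
import string

# Hypothetical agent-written function under test (stand-in example).
def normalize_username(raw: str) -> str:
    """Lowercase, strip outer whitespace, join inner words with dots."""
    return ".".join(raw.strip().lower().split())

def check_invariants(trials: int = 1000) -> None:
    """Property-based check: assert invariants over random inputs,
    not hand-picked examples the agent may have optimized for."""
    alphabet = string.ascii_letters + "  \t"
    for _ in range(trials):
        raw = "".join(random.choices(alphabet, k=random.randint(0, 20)))
        out = normalize_username(raw)
        assert out == out.lower()               # invariant: lowercase
        assert out == normalize_username(out)   # invariant: idempotent
        assert not out.startswith(".") and not out.endswith(".")

check_invariants()
```

A happy-path example test would pass even if the function mangled unusual whitespace; the invariants catch the divergence between what you said and what you meant.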
Test suite gaming
When the same agent writes both implementation and tests, you get a closed feedback loop. The tests verify that the code does what the code does, not that the code does what you wanted. StrongDM’s solution is treating test scenarios as holdout sets – external to the codebase, maintained separately, validated through probabilistic measures of user satisfaction. The test suite becomes a constraint from outside the system rather than a verification from inside it.
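A minimal sketch of the holdout idea, with hypothetical scenario data and a stand-in implementation (the specifics are mine, not StrongDM's): scenarios live outside the codebase the agent can edit, and the release gates on a pass rate rather than a binary suite.

```python
import json

# Hypothetical holdout scenarios: maintained outside the repository the
# agent modifies, so the agent cannot "pass" them by rewriting them.
HOLDOUT_SCENARIOS = json.loads("""
[
  {"input": {"amount": 100, "currency": "USD"}, "expect": "100.00 USD"},
  {"input": {"amount": 0.5, "currency": "EUR"}, "expect": "0.50 EUR"},
  {"input": {"amount": -3,  "currency": "USD"}, "expect": "error"}
]
""")

# Stand-in for the agent-written implementation under test.
def format_amount(amount: float, currency: str) -> str:
    if amount < 0:
        return "error"
    return f"{amount:.2f} {currency}"

def holdout_pass_rate(scenarios) -> float:
    passed = sum(
        format_amount(**s["input"]) == s["expect"] for s in scenarios
    )
    return passed / len(scenarios)

# Gate the release on a threshold, not on tests the agent wrote itself.
assert holdout_pass_rate(HOLDOUT_SCENARIOS) >= 0.95
```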
Dependency hallucination
AI models generate code that references methods, parameters, or library versions from their training data that do not exist in your project’s actual dependency tree. The code compiles. Static analysis passes. The failure appears at runtime, in a code path that your test suite might not exercise. I have hit this myself. It is probably the most commonly reported production issue with AI-generated code in 2026.
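Part of this failure class is cheap to catch mechanically. A stdlib-only sketch that flags top-level imports in generated code that do not resolve in the current environment (it catches hallucinated packages, not hallucinated methods or version mismatches):

```python
import ast
import importlib.util

def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot
    be resolved in the current environment (likely hallucinations)."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return sorted(m for m in modules if importlib.util.find_spec(m) is None)

# Agent output referencing a package that does not exist locally.
generated = "import json\nimport totally_made_up_pkg\n"
print(unresolved_imports(generated))  # ['totally_made_up_pkg']
```

Running a check like this in CI moves the failure from runtime back to review time, which is where the dark factory needs it.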
Architectural erosion
Each individual agent-generated change is locally reasonable. But locally reasonable decisions compound into poor system architecture. The agent does not know about the half-completed migration, the deprecated module that should not receive new dependencies, the performance constraint that shaped the original design. Over weeks, the codebase drifts toward a state where no human made any single bad decision, but the system as a whole reflects no coherent design intent.
Martin Fowler’s team calls this the “behavior harness gap” – the space between what automated tools can verify (structure, syntax, coverage) and what requires human judgment (intent, coherence, fitness for purpose). Closing that gap is what dark factory engineering is really about.
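Some of that gap can be narrowed by making institutional decisions executable. A sketch of one architectural fitness function, with a hypothetical deprecated module name: fail the build if any new code takes a dependency on it.

```python
import ast

# Hypothetical constraint: "legacy_billing" is deprecated and must not
# receive new dependents. The fitness function makes the decision that
# used to live in institutional memory machine-checkable.
FORBIDDEN = {"legacy_billing"}

def violates_deprecation(source: str) -> bool:
    """Return True if `source` imports any forbidden module."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in FORBIDDEN for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in FORBIDDEN:
                return True
    return False

assert violates_deprecation("from legacy_billing import invoice")
assert not violates_deprecation("import billing_v2")
```

One rule does not stop erosion, but a growing suite of them gives agents the architectural context they cannot infer on their own.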
The economics driving adoption
StrongDM suggests spending at least $1,000 per day per engineer on tokens. That sounds expensive until you calculate the alternative.
A senior software engineer in the US costs roughly $400-600 per day in salary and benefits. If that engineer directing AI agents produces 5-10x the output of a traditional developer (a conservative estimate based on current reports), the token cost is noise against the productivity gain. The economics are not close.
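The arithmetic, using the article's own figures (the numbers are the source's claims, not independent data):

```python
# All inputs are the article's figures: $1,000/day in tokens,
# $400-600/day fully loaded engineer cost, 5-10x output multiplier.
token_cost = 1_000
engineer_cost = 500        # midpoint of the $400-600/day range
multiplier = 5             # conservative end of the 5-10x claim

agent_team_cost = token_cost + engineer_cost   # one engineer + agents
traditional_cost = multiplier * engineer_cost  # engineers for same output

print(agent_team_cost, traditional_cost)  # 1500 2500
```

Even at the conservative end, the agent-directed setup wins, and the gap widens with the multiplier.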
This is why the transition is not optional. Teams that adopt dark factory patterns will outproduce teams that do not by a margin that makes competitive parity impossible. The same dynamic played out with version control, CI/CD, and cloud infrastructure. The debate is never whether to adopt. It is how fast.
Dark factory era engineering skills
The job is changing. The skill set changes with it. Ranked by how underinvested most engineers currently are:
1. Specification as code. Writing specifications that are precise enough for agents to implement correctly, testable enough to verify, and flexible enough to accommodate iteration. This is harder than writing code. It requires thinking about what you want before you think about how to build it – a discipline most engineers have let atrophy.
2. Adversarial review. Reading code you did not write and asking: does this do what I meant, or does it do what I said? Where could this break that the tests would not catch? What assumption is the agent making that I did not state? This is closer to security auditing than traditional code review.
3. System design under delegation. Designing architectures that remain coherent when the implementation is produced by agents that have no awareness of architectural intent. This means more explicit documentation of design decisions, more architectural fitness functions, more automated enforcement of constraints that used to live in developers’ heads.
4. Harness engineering. Building the guides and sensors – the specifications, linters, test frameworks, architectural checks, and review workflows – that constrain agent output. I have started thinking of the harness as the actual product. The code it produces is downstream.
5. Failure-mode fluency. Understanding where AI-generated code breaks differently from human-written code, and designing verification strategies that target those failure modes specifically. This is a new literacy that did not exist two years ago.
The career question
Nobody wants to hear this, but an engineer whose primary value is typing code faces a compressed market. Not because the job disappears overnight, but because the ratio of engineers needed to produce a given amount of software is shifting.
The flip side: engineers whose primary value is judgment, specification, architecture, and taste face an expanding market. The demand for software is not fixed. When the cost of producing software drops by an order of magnitude, the amount of software the world wants goes up by more than an order of magnitude. The total number of engineers may not shrink. But the job description is already different.
Willison frames it well: you do not get replaced by AI. You get replaced by a person who uses AI better than you do. The relevant skill is not “can you use Copilot” – every developer can. The relevant skill is: can you specify, review, architect, and operate at the pace that AI-driven development demands?
If you are reading this and feeling uncertain about where you stand, here is a concrete test: take your last three pull requests. For each one, could you have written a specification precise enough that an agent would have produced the same code? If yes, you are already thinking at the right level. If no, that is the gap to close.
What to do Monday morning
- Audit your delegation ratio. What percentage of your code are you typing versus directing agents to produce? If it is below 50%, you are leaving productivity on the table. If it is above 90%, make sure your review process is strong enough to match.
- Write one specification as if an agent will implement it. Pick your next feature. Before touching code, write the specification: inputs, outputs, constraints, edge cases, acceptance criteria. Then let an agent implement it. Observe where the specification was too vague.
- Build a personal test harness. Not just unit tests – architectural fitness functions, property-based tests, integration scenarios. The harness is what separates dark factory engineering from vibe coding.
- Practice adversarial review. When reviewing agent-generated code, adopt the mindset of a security auditor. Assume the code is wrong until proven right. Ask: what assumption is the agent making that is not in the specification?
- Read Willison’s agentic engineering patterns guide. It is the most practical resource available on operating at levels 3-5 of the AI coding progression.
The dark factory era arrived while we were debating whether AI could write a for loop. Machines already write most of our code. What remains is whether you are the engineer directing the factory or the one standing outside, watching the lights go off.
Related posts:
- Karpathy’s autoresearch: when your AI agent runs 100 experiments while you sleep – the autonomous experiment loop that makes the dark factory concrete
- The judgment shift: what engineering looks like when agents write 80% of the code – a narrower look at judgment as the core skill
Sources:
- Simon Willison on Lenny’s Podcast: AI State of the Union
- Simon Willison: StrongDM’s Software Factory
- Karpathy’s autoresearch on GitHub
- Karpathy’s LLM Wiki gist
- Gartner: 40% of Enterprise Apps Will Feature AI Agents by 2026
- Martin Fowler: Harness Engineering
- CodeRabbit: AI vs Human Code Generation Report
- Stack Overflow Developer Survey 2025