OpenAI recently published a groundbreaking article detailing their five-month experiment in “harness engineering” - building a complete software product with zero human-written code. This comprehensive analysis explores their journey from an empty repository to a million-line codebase, revealing key insights about agent-first development, context management, and the future of software engineering in an AI-driven world.
AI-Generated Content Notice: The content of this article was researched, written, and reviewed by generative AI. Due to the potential for “hallucinations” in LLMs, the information may contain some inaccuracies. Readers are advised to exercise their own judgment regarding the technical accuracy and relevance of the content.
- The Groundbreaking Experiment
- Key Statistics and Results
- The Paradigm Shift: Human as Architect, Agent as Builder
- Context Management: The Biggest Challenge
- Architecture for Agents: Constraints Enable Speed
- Quality Assurance in an Agent-First World
- The Development Workflow Revolution
- Technical Debt and AI “Residue” Management
- Key Lessons and Best Practices
- The Future of Agent-First Development
- Conclusion
The Groundbreaking Experiment
OpenAI conducted a five-month experiment that fundamentally challenges our understanding of software engineering. Their team built and delivered an internal beta software product where not a single line of code was written by humans. Every aspect - from application logic, tests, CI configuration, documentation, observability, to internal tools - was generated by Codex.
The Core Philosophy
- “No manual coding”: This became the team’s core principle
- Human as architect, agent as builder: Engineers focused on designing environments, clarifying intent, and building feedback loops
- 10x acceleration: Estimated to complete the work in approximately 1/10th the time of manual coding
Key Statistics and Results
The Numbers Tell the Story
| Metric | Value | Significance |
|---|---|---|
| Timeframe | 5 months (Aug 2025 - Jan 2026) | Complete product lifecycle |
| Codebase Size | ~1 million lines | From empty repository to production system |
| Pull Requests | ~1,500 opened and merged | High-velocity development |
| Team Size | Started with 3, grew to 7 engineers | Scalable approach |
| PR Throughput | 3.5 PRs per engineer per day | Exceptional productivity |
| Active Users | Hundreds of internal beta users | Real-world validation |
Development Velocity Insights
- Initial architecture (repository structure, CI config, formatting rules, package manager setup, application framework) was generated by Codex CLI using GPT-5
- The initial AGENTS.md file itself was written by Codex
- Zero pre-existing human-written code: The repository was shaped by agents from the beginning
- Throughput increased as the team grew from 3 to 7 engineers, bucking the usual pattern in which per-engineer output drops as a team scales
The Paradigm Shift: Human as Architect, Agent as Builder
Redefining Engineering Roles
The experiment revealed a fundamental shift in software engineering roles:
Traditional Engineering → Agent-First Engineering
- Writing code → Designing environments
- Debugging issues → Clarifying intent and constraints
- Code reviews → Building feedback loops
- Implementation → System and architecture design
The Depth-First Approach
When progress was slow, the solution was never “try harder.” Instead, engineers asked:
“What capability is missing, and how do we make it both legible and enforceable to the agent?”
This led to a depth-first work style:
- Break larger goals into smaller building blocks
- Prompt the agent to build these blocks
- Use completed blocks to unlock more complex tasks
- Continuously identify and address missing capabilities
Human-Agent Interaction Model
- Humans interact almost exclusively through prompts: Engineers describe tasks, run agents, and allow them to open Pull Requests
- Agent-driven PR completion: Codex reviews its own changes locally, requests additional agent reviews, responds to feedback, and iterates until all reviewers are satisfied
- Agent-to-agent reviews: Over time, almost all review work shifted to agent-to-agent processing
- Direct tool integration: Codex uses standard development tools (gh, local scripts, repository-embedded skills) without human copy-paste
Context Management: The Biggest Challenge
The Failed “Big AGENTS.md” Approach
OpenAI’s team initially tried creating a comprehensive AGENTS.md file, which failed for several reasons:
| Problem | Description | Impact |
|---|---|---|
| Context Scarcity | Large instruction files crowd out tasks, code, and documentation | Agents miss key constraints or optimize for wrong ones |
| Ineffective Guidance | When everything is “important,” nothing is important | Agents pattern-match locally instead of navigating consciously |
| Immediate Rot | Comprehensive manuals become graveyards of stale rules | Agents can’t determine what information remains valid |
| Verification Difficulty | Single blobs resist mechanical checks (coverage, freshness, ownership) | Drift becomes inevitable |
The Successful Approach: Map, Not Manual
Instead of an encyclopedia, they created a directory of small, focused documents:
Knowledge Repository Structure
docs/
├── architecture/ # Domain and package layer maps
├── design-docs/ # Validated design decisions
├── execution-plans/ # Version-controlled plans with progress logs
├── technical-debt/ # Known issues and debt tracking
└── principles/ # Core agent-first operating principles
AGENTS.md as Navigation
- Short (≈100 lines): Injected into context as a map
- Progressive disclosure: Agents start with a small, stable entry point and are guided where to look next
- Structured validation: Dedicated linters and CI jobs verify knowledge base freshness, cross-linking, and structure
- Auto-maintenance: “Doc-gardening” agents scan for stale documentation and create fix PRs
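A "doc-gardening" pass can be sketched as a simple freshness scan. The convention below (a `Last-reviewed: YYYY-MM-DD` line in each doc) is a hypothetical stand-in; the article does not describe the team's actual mechanism, only that agents scan for stale documentation and open fix PRs.

```python
"""Sketch of a doc-gardening freshness check over a docs/ tree.

Assumes a hypothetical `Last-reviewed: YYYY-MM-DD` stamp in each
markdown file; docs without a stamp are treated as stale too.
"""
from __future__ import annotations

import re
from datetime import date, timedelta
from pathlib import Path

REVIEWED_RE = re.compile(r"^Last-reviewed:\s*(\d{4})-(\d{2})-(\d{2})", re.MULTILINE)


def stale_docs(docs_root: Path, max_age_days: int = 90,
               today: date | None = None) -> list[Path]:
    """Return docs whose last review is older than max_age_days, or unmarked."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    stale: list[Path] = []
    for doc in sorted(docs_root.rglob("*.md")):
        m = REVIEWED_RE.search(doc.read_text(encoding="utf-8"))
        if m is None:
            stale.append(doc)  # no review stamp at all: flag for gardening
            continue
        reviewed = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        if reviewed < cutoff:
            stale.append(doc)
    return stale
```

A CI job running this (and similar link/structure checks) is what turns "keep the docs fresh" from a norm into something an agent can be mechanically held to.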
Critical Insight
“From the agent’s perspective, anything it cannot access in context at runtime does not exist.”
This led to the principle of pushing more context into the repository:
- Slack discussions about architecture patterns
- Product principles and engineering norms
- Team culture (including emoji preferences)
- All must be versioned and accessible in the repository
Architecture for Agents: Constraints Enable Speed
The Strict Layered Architecture
OpenAI discovered that agents are most effective in environments with strict boundaries and predictable structures:
Business Domain Layers
Types → Config → Repo → Service → Runtime → UI
Key Architectural Rules
- Fixed dependency direction: Code can only depend “forward” through layers
- Cross-cutting concerns: Authentication, connectors, telemetry, feature flags enter through a single explicit interface (Providers)
- Mechanical enforcement: Custom linters and structural tests (generated by Codex) enforce all constraints
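A structural test for the fixed dependency direction might look like the sketch below. It assumes layers map to top-level package names and that a module may import only from layers earlier in the chain (one plausible reading of "depend forward"); neither assumption comes from the article, which only says such constraints are enforced by Codex-generated linters and tests.

```python
"""Sketch of a structural test for Types -> Config -> Repo -> Service ->
Runtime -> UI. Layer is inferred from the first package-path segment;
the package names here are illustrative assumptions."""
from __future__ import annotations

import ast

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_of(module: str) -> int | None:
    """Rank of the layer a dotted module path belongs to, if any."""
    return RANK.get(module.split(".")[0])


def violations(source: str, module: str) -> list[str]:
    """Imports that reach into a *later* layer than the importing module."""
    mine = layer_of(module)
    bad: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for t in targets:
            other = layer_of(t)
            if mine is not None and other is not None and other > mine:
                bad.append(f"{module} ({LAYERS[mine]}) imports {t} "
                           f"({LAYERS[other]}): dependency points backward")
    return bad
```

Run over every file in CI, a check like this makes the layer diagram an enforced invariant rather than a convention an agent can drift away from.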
The Architecture Paradox
“This is the kind of architecture you typically defer until you have hundreds of engineers. For coding agents, it’s an early prerequisite: constraints enable speed without architectural drift.”
Practical Implementation
- Custom linters: Statically enforce structured logging, naming conventions, file size limits, platform-specific reliability requirements
- “Taste invariants”: Error messages are written to inject repair instructions into agent context
- Local autonomy within boundaries: Similar to leading a large engineering platform organization - enforce boundaries centrally, allow autonomy locally
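The "taste invariant" idea, i.e. error messages that double as repair instructions for the agent, can be illustrated with a toy rule. The rule itself (no print-style logging) is an invented stand-in; the point is the shape of the message, which tells the agent exactly what to do instead.

```python
"""Illustrative taste-invariant lint rule: ban print() in favor of
structured logging, with a repair instruction embedded in the error.
The rule and message are hypothetical examples, not the team's own."""

REPAIR = ("Replace print() with the structured logger, e.g. "
          "log.info(\"event_name\", key=value). Log lines must be "
          "key-value structured so dashboards and agents can query them.")


def check_structured_logging(source: str, path: str) -> list[str]:
    """Flag print() calls; each finding carries its own fix instruction."""
    errors = []
    for i, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("print("):
            errors.append(f"{path}:{i}: print() found. {REPAIR}")
    return errors
```

When such a message lands in the agent's context after a failed check, the lint output itself becomes the prompt for the fix.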
Quality Assurance in an Agent-First World
Evolving QA Practices
Traditional engineering norms evolved to match agent throughput:
| Traditional Practice | Agent-First Adaptation |
|---|---|
| Blocking merge gates | Minimized blocking; PRs have short lifetimes |
| Investigating flaky tests | Flaky failures are cleared by automatic reruns |
| High cost of errors | Low cost of correction, high cost of waiting |
| Human-intensive reviews | Agent-to-agent reviews with human oversight only when needed |
The QA Bottleneck Shift
As code throughput increased, the bottleneck shifted from code generation to human QA capacity. The solution: make more system aspects directly readable by Codex:
Observability Integration
- Application UI: Chrome DevTools protocol integrated into agent runtime
- Logs and metrics: Local observability stack exposed to Codex
- Temporary environments: Codex runs against fully separate, disposable application instances that are torn down after each task
- Query capabilities: LogQL for logs, PromQL for metrics
Practical Examples
- “Ensure service startup completes within 800ms”
- “No span in these four key user journeys should exceed two seconds”
- “Reproduce this bug and verify the fix”
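The "no span over two seconds" check, for example, reduces to a simple assertion once span data is readable by the agent. In the article the real query runs against the observability stack (LogQL/PromQL); the sketch below uses plain dicts so the shape of the check is visible.

```python
"""Toy version of "no span in these four key user journeys should exceed
two seconds". Real data would come from the tracing backend; the span
records here are illustrative."""


def slow_spans(spans: list[dict], journeys: set[str],
               limit_s: float = 2.0) -> list[dict]:
    """Spans in the watched user journeys whose duration exceeds the limit."""
    return [s for s in spans
            if s["journey"] in journeys and s["duration_s"] > limit_s]
```

An agent that can run this kind of query itself no longer needs a human to translate "the app feels slow" into evidence.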
The Development Workflow Revolution
End-to-End Agent Capabilities
The repository crossed a significant threshold where Codex can now drive a new feature end-to-end:
Complete Feature Development Flow
- Verify current state of the codebase
- Reproduce reported bugs with video recordings
- Implement fixes
- Verify fixes by running the application
- Record demonstration videos of the solution
- Open Pull Requests
- Respond to agent and human feedback
- Detect and fix build failures
- Defer to humans only for judgment calls
- Merge changes
Agent-Generated Artifacts
The codebase generated by Codex agents includes:
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Evaluation frameworks
- Review comments and replies
- Scripts managing the repository itself
- Production dashboard definitions
Human Role Evolution
Humans remain involved but at different abstraction levels:
- Prioritizing work
- Translating user feedback into acceptance criteria
- Validating outcomes
- Identifying missing capabilities (tools, guidance, constraints, documentation)
Technical Debt and AI “Residue” Management
The “AI Residue” Problem
Codex replicates patterns existing in the repository - including uneven or suboptimal ones. This inevitably leads to drift over time.
Initial (Failed) Approach
- Manual cleanup: Team spent 20% of their time (every Friday) cleaning “AI residue”
- Not scalable: Human-intensive approach couldn’t keep pace with agent throughput
Successful Strategy: “Golden Principles”
Instead of manual cleanup, they encoded subjective but mechanical rules into the repository:
Example Golden Principles
- Prefer shared utility packages over hand-written helpers to centralize invariants
- No “YOLO-style” data probing - validate boundaries or rely on typed SDKs
- Regular background Codex tasks scan for drift, update quality ratings, and create targeted refactoring PRs
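One way a golden principle becomes mechanical is a background scan for the banned pattern. The detector below approximates "YOLO-style data probing" as indexing straight into an unvalidated `.json()` payload; the actual rules and their detection logic are not published, so this only illustrates how a subjective preference turns into a check plus a fix hint.

```python
"""Illustrative golden-principle checker: flag raw .json()[...] probing
and point at the typed SDK instead. Pattern and message are assumed
examples, not the team's actual rules."""
import re

PROBE = re.compile(r"\.json\(\)\s*\[")


def find_data_probing(source: str, path: str) -> list[str]:
    """Flag lines that index directly into an unvalidated JSON payload."""
    hits = []
    for i, line in enumerate(source.splitlines(), start=1):
        if PROBE.search(line):
            hits.append(f"{path}:{i}: raw .json()[...] probing; parse the "
                        "payload through the typed SDK model instead.")
    return hits
```

A scheduled Codex task running scans like this can open small refactoring PRs as soon as a pattern appears, rather than letting it propagate.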
The Technical Debt Analogy
“Technical debt is like a high-interest loan: it’s better to pay it off in small, continuous installments than let it accumulate and painfully pay it off in one go.”
Continuous Quality Improvement
- Daily pattern discovery: Bad patterns are discovered and addressed daily, not allowed to propagate for days or weeks
- Automated refactoring: Most refactoring PRs can be reviewed and merged automatically within minutes
- Functioning like garbage collection: Regular cleanup prevents accumulation
Key Lessons and Best Practices
Top Lessons from the Experiment
| Lesson | Description | Implication |
|---|---|---|
| Context is King | Give agents a map, not a manual | Progressive disclosure beats information overload |
| Constraints Enable Speed | Strict boundaries prevent architectural drift | Early architectural discipline pays dividends |
| Human Judgment Scales | Encode human taste into mechanical rules | Subjective preferences become objective constraints |
| Tool Integration is Critical | Make systems directly readable by agents | Eliminate human translation layers |
| Error Correction vs. Prevention | In high-throughput systems, correction is cheaper than prevention | Optimize for velocity, not perfection |
Best Practices for Agent-First Development
- Start with Architecture
- Define strict layers and dependency rules early
- Enforce constraints mechanically from day one
- Build for agent navigability, not just human readability
- Structure Your Knowledge
- Create a docs/ directory as the system of record
- Keep AGENTS.md short (≈100 lines) as a navigation map
- Implement progressive disclosure of information
- Encode Human Judgment
- Transform subjective preferences into objective rules
- Create “golden principles” that can be mechanically enforced
- Use custom linters with repair instructions in error messages
- Integrate Tools Deeply
- Make observability tools directly accessible to agents
- Enable agents to interact with UI, logs, and metrics
- Allow agents to use standard development tools directly
- Manage Technical Debt Continuously
- Implement regular “garbage collection” cycles
- Use background agents to scan for drift and create fix PRs
- Pay down technical debt in small, continuous installments
The Future of Agent-First Development
Unanswered Questions
The experiment raised several important questions for the future:
- Architectural Coherence Over Time
- How will architecture evolve in a fully agent-generated system?
- What patterns will emerge as dominant or problematic?
- Human Judgment Optimization
- Where does human judgment provide the most value?
- How can we better encode this judgment for greater leverage?
- Model Evolution Impact
- How will increasingly capable models change the system?
- What new capabilities will become possible or necessary?
The Discipline Shift
“Building software still requires discipline, but the discipline is more in the supporting structure than in the code.”
The most challenging problems now focus on:
- Designing environments that help agents achieve goals
- Building feedback loops that maintain quality at scale
- Creating control systems for complex, reliable software
Conclusion
OpenAI’s harness engineering experiment represents a watershed moment in software development. By demonstrating that a complete software product can be built with zero human-written code, they’ve validated a new paradigm for software engineering.
Key Takeaways
- Agent-First is Viable: The experiment proves that agent-first development is not just possible but can be highly productive (10x acceleration).
- Context Management is Critical: Successful agent collaboration requires careful management of context through structured knowledge repositories and progressive disclosure.
- Architecture Enables Scale: Strict architectural constraints, typically deferred until large team sizes, become prerequisites for effective agent collaboration.
- Human Role Evolution: Engineers shift from writing code to designing environments, clarifying intent, and building feedback loops.
- Continuous Quality Management: Technical debt and “AI residue” require continuous, automated management strategies rather than periodic manual cleanup.
The Path Forward
As AI agents like Codex take on increasingly significant roles in the software lifecycle, the questions raised by this experiment will become increasingly important. The future of software engineering lies not in writing code, but in designing the systems, environments, and feedback loops that enable agents to build software effectively at scale.
The discipline of software engineering is evolving from craftsmanship to orchestration - from writing lines of code to designing the systems that write code. This transition represents both a profound challenge and an unprecedented opportunity for the software industry.