OpenAI recently published a groundbreaking article detailing their five-month experiment in “harness engineering” - building a complete software product with zero human-written code. This comprehensive analysis explores their journey from an empty repository to a million-line codebase, revealing key insights about agent-first development, context management, and the future of software engineering in an AI-driven world.
AI-Generated Content Notice: The content of this article was researched, written, and reviewed by generative AI. Due to the potential for “hallucinations” in LLMs, the information may contain some inaccuracies. Readers are advised to exercise their own judgment regarding the technical accuracy and relevance of the content.
- The Groundbreaking Experiment
- Key Statistics and Results
- The Paradigm Shift: Human as Architect, Agent as Builder
- Context Management: The Biggest Challenge
- Architecture for Agents: Constraints Enable Speed
- Quality Assurance in an Agent-First World
- The Development Workflow Revolution
- Technical Debt and AI “Residue” Management
- Key Lessons and Best Practices
- The Future of Agent-First Development
- Conclusion
The Groundbreaking Experiment
OpenAI conducted a five-month experiment that fundamentally challenges our understanding of software engineering. Their team built and delivered an internal beta software product where not a single line of code was written by humans. Every aspect - from application logic, tests, CI configuration, documentation, observability, to internal tools - was generated by Codex.
The Core Philosophy
- “No manual coding”: This became the team’s core principle
- Human as architect, agent as builder: Engineers focused on designing environments, clarifying intent, and building feedback loops
- 10x acceleration: Estimated to complete the work in approximately 1/10th the time of manual coding
Key Statistics and Results
The Numbers Tell the Story
| Metric | Value | Significance |
|---|---|---|
| Timeframe | 5 months (Aug 2025 - Jan 2026) | Complete product lifecycle |
| Codebase Size | ~1 million lines | From empty repository to production system |
| Pull Requests | ~1,500 opened and merged | High-velocity development |
| Team Size | Started with 3, grew to 7 engineers | Scalable approach |
| PR Throughput | 3.5 PRs per engineer per day | Exceptional productivity |
| Active Users | Hundreds of internal beta users | Real-world validation |
Development Velocity Insights
- Initial architecture (repository structure, CI config, formatting rules, package manager setup, application framework) was generated by Codex CLI using GPT-5
- The initial AGENTS.md file itself was written by Codex
- Zero pre-existing human-written code: The repository was shaped by agents from the beginning
- Throughput increased as the team grew from 3 to 7 engineers, bucking the usual pattern in which per-engineer output drops as a team scales
The Paradigm Shift: Human as Architect, Agent as Builder
Redefining Engineering Roles
The experiment revealed a fundamental shift in software engineering roles:
Traditional Engineering → Agent-First Engineering
- Writing code → Designing environments
- Debugging issues → Clarifying intent and constraints
- Code reviews → Building feedback loops
- Implementation → System and architecture design
The Depth-First Approach
When progress was slow, the solution was never “try harder.” Instead, engineers asked:
“What capability is missing, and how do we make it both legible and enforceable to the agent?”
This led to a depth-first work style:
- Break larger goals into smaller building blocks
- Prompt the agent to build these blocks
- Use completed blocks to unlock more complex tasks
- Continuously identify and address missing capabilities
Human-Agent Interaction Model
- Humans interact almost exclusively through prompts: Engineers describe tasks, run agents, and allow them to open Pull Requests
- Agent-driven PR completion: Codex reviews its own changes locally, requests additional agent reviews, responds to feedback, and iterates until all reviewers are satisfied
- Agent-to-agent reviews: Over time, almost all review work shifted to agent-to-agent processing
- Direct tool integration: Codex uses standard development tools (gh, local scripts, repository-embedded skills) without human copy-paste
Context Management: The Biggest Challenge
The Failed “Big AGENTS.md” Approach
OpenAI’s team initially tried creating a comprehensive AGENTS.md file, which failed for several reasons:
| Problem | Description | Impact |
|---|---|---|
| Context Scarcity | Large instruction files crowd out tasks, code, and documentation | Agents miss key constraints or optimize for wrong ones |
| Ineffective Guidance | When everything is “important,” nothing is important | Agents pattern-match locally instead of navigating consciously |
| Immediate Rot | Comprehensive manuals become graveyards of stale rules | Agents can’t determine what information remains valid |
| Verification Difficulty | Single blobs resist mechanical checks (coverage, freshness, ownership) | Drift becomes inevitable |
The Successful Approach: Map, Not Manual
Instead of an encyclopedia, they created a directory of small, focused documents:
Knowledge Repository Structure
docs/
├── architecture/ # Domain and package layer maps
├── design-docs/ # Validated design decisions
├── execution-plans/ # Version-controlled plans with progress logs
├── technical-debt/ # Known issues and debt tracking
└── principles/ # Core agent-first operating principles
AGENTS.md as Navigation
- Short (≈100 lines): Injected into context as a map
- Progressive disclosure: Agents start with a small, stable entry point and are guided where to look next
- Structured validation: Dedicated linters and CI jobs verify knowledge base freshness, cross-linking, and structure
- Auto-maintenance: “Doc-gardening” agents scan for stale documentation and create fix PRs
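A "doc-gardening" pass can be sketched as a simple freshness scan. The convention below (a `Last-reviewed: YYYY-MM-DD` line in each doc) is a hypothetical stand-in; the article does not describe the team's actual mechanism, only that agents scan for stale documentation and open fix PRs.

```python
"""Sketch of a doc-gardening freshness check over a docs/ tree.

Assumes a hypothetical `Last-reviewed: YYYY-MM-DD` stamp in each
markdown file; docs without a stamp are treated as stale too.
"""
from __future__ import annotations

import re
from datetime import date, timedelta
from pathlib import Path

REVIEWED_RE = re.compile(r"^Last-reviewed:\s*(\d{4})-(\d{2})-(\d{2})", re.MULTILINE)


def stale_docs(docs_root: Path, max_age_days: int = 90,
               today: date | None = None) -> list[Path]:
    """Return docs whose last review is older than max_age_days, or unmarked."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    stale: list[Path] = []
    for doc in sorted(docs_root.rglob("*.md")):
        m = REVIEWED_RE.search(doc.read_text(encoding="utf-8"))
        if m is None:
            stale.append(doc)  # no review stamp at all: flag for gardening
            continue
        reviewed = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        if reviewed < cutoff:
            stale.append(doc)
    return stale
```

A CI job running this (and similar link/structure checks) is what turns "keep the docs fresh" from a norm into something an agent can be mechanically held to.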
Critical Insight
“From the agent’s perspective, anything it cannot access in context at runtime does not exist.”
This led to the principle of pushing more context into the repository:
- Slack discussions about architecture patterns
- Product principles and engineering norms
- Team culture (including emoji preferences)
- All must be versioned and accessible in the repository
Architecture for Agents: Constraints Enable Speed
The Strict Layered Architecture
OpenAI discovered that agents are most effective in environments with strict boundaries and predictable structures:
Business Domain Layers
Types → Config → Repo → Service → Runtime → UI
Key Architectural Rules
- Fixed dependency direction: Code can only depend “forward” through layers
- Cross-cutting concerns: Authentication, connectors, telemetry, feature flags enter through a single explicit interface (Providers)
- Mechanical enforcement: Custom linters and structural tests (generated by Codex) enforce all constraints
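A structural test for the fixed dependency direction might look like the sketch below. It assumes layers map to top-level package names and that a module may import only from layers earlier in the chain (one plausible reading of "depend forward"); neither assumption comes from the article, which only says such constraints are enforced by Codex-generated linters and tests.

```python
"""Sketch of a structural test for Types -> Config -> Repo -> Service ->
Runtime -> UI. Layer is inferred from the first package-path segment;
the package names here are illustrative assumptions."""
from __future__ import annotations

import ast

LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}


def layer_of(module: str) -> int | None:
    """Rank of the layer a dotted module path belongs to, if any."""
    return RANK.get(module.split(".")[0])


def violations(source: str, module: str) -> list[str]:
    """Imports that reach into a *later* layer than the importing module."""
    mine = layer_of(module)
    bad: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for t in targets:
            other = layer_of(t)
            if mine is not None and other is not None and other > mine:
                bad.append(f"{module} ({LAYERS[mine]}) imports {t} "
                           f"({LAYERS[other]}): dependency points backward")
    return bad
```

Run over every file in CI, a check like this makes the layer diagram an enforced invariant rather than a convention an agent can drift away from.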
The Architecture Paradox
“This is the kind of architecture you typically defer until you have hundreds of engineers. For coding agents, it’s an early prerequisite: constraints enable speed without architectural drift.”
Practical Implementation
- Custom linters: Statically enforce structured logging, naming conventions, file size limits, platform-specific reliability requirements
- “Taste invariants”: Error messages are written to inject repair instructions into agent context
- Local autonomy within boundaries: Similar to leading a large engineering platform organization - enforce boundaries centrally, allow autonomy locally
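The "taste invariant" idea, i.e. error messages that double as repair instructions for the agent, can be illustrated with a toy rule. The rule itself (no print-style logging) is an invented stand-in; the point is the shape of the message, which tells the agent exactly what to do instead.

```python
"""Illustrative taste-invariant lint rule: ban print() in favor of
structured logging, with a repair instruction embedded in the error.
The rule and message are hypothetical examples, not the team's own."""

REPAIR = ("Replace print() with the structured logger, e.g. "
          "log.info(\"event_name\", key=value). Log lines must be "
          "key-value structured so dashboards and agents can query them.")


def check_structured_logging(source: str, path: str) -> list[str]:
    """Flag print() calls; each finding carries its own fix instruction."""
    errors = []
    for i, line in enumerate(source.splitlines(), start=1):
        if line.lstrip().startswith("print("):
            errors.append(f"{path}:{i}: print() found. {REPAIR}")
    return errors
```

When such a message lands in the agent's context after a failed check, the lint output itself becomes the prompt for the fix.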
Quality Assurance in an Agent-First World
Evolving QA Practices
Traditional engineering norms evolved to match agent throughput:
| Traditional Practice | Agent-First Adaptation |
|---|---|
| Blocking merge gates | Minimized blocking; PRs have short lifetimes |
| Investigating flaky tests | Flaky failures are cleared by automatic reruns |
| High cost of errors | Low cost of correction, high cost of waiting |
| Human-intensive reviews | Agent-to-agent reviews with human oversight only when needed |
The QA Bottleneck Shift
As code throughput increased, the bottleneck shifted from code generation to human QA capacity. The solution: make more system aspects directly readable by Codex:
Observability Integration
- Application UI: Chrome DevTools protocol integrated into agent runtime
- Logs and metrics: Local observability stack exposed to Codex
- Temporary environments: Codex runs against fully separate, disposable application instances that are torn down after each task
- Query capabilities: LogQL for logs, PromQL for metrics
Practical Examples
- “Ensure service startup completes within 800ms”
- “No span in these four key user journeys should exceed two seconds”
- “Reproduce this bug and verify the fix”
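The "no span over two seconds" check, for example, reduces to a simple assertion once span data is readable by the agent. In the article the real query runs against the observability stack (LogQL/PromQL); the sketch below uses plain dicts so the shape of the check is visible.

```python
"""Toy version of "no span in these four key user journeys should exceed
two seconds". Real data would come from the tracing backend; the span
records here are illustrative."""


def slow_spans(spans: list[dict], journeys: set[str],
               limit_s: float = 2.0) -> list[dict]:
    """Spans in the watched user journeys whose duration exceeds the limit."""
    return [s for s in spans
            if s["journey"] in journeys and s["duration_s"] > limit_s]
```

An agent that can run this kind of query itself no longer needs a human to translate "the app feels slow" into evidence.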
The Development Workflow Revolution
End-to-End Agent Capabilities
The repository crossed a significant threshold where Codex can now drive a new feature end-to-end:
Complete Feature Development Flow
- Verify current state of the codebase
- Reproduce reported bugs with video recordings
- Implement fixes
- Verify fixes by running the application
- Record demonstration videos of the solution
- Open Pull Requests
- Respond to agent and human feedback
- Detect and fix build failures
- Defer to humans only for judgment calls
- Merge changes
Agent-Generated Artifacts
The codebase generated by Codex agents includes:
- Product code and tests
- CI configuration and release tooling
- Internal developer tools
- Documentation and design history
- Evaluation frameworks
- Review comments and replies
- Scripts managing the repository itself
- Production dashboard definitions
Human Role Evolution
Humans remain involved but at different abstraction levels:
- Prioritizing work
- Translating user feedback into acceptance criteria
- Validating outcomes
- Identifying missing capabilities (tools, guidance, constraints, documentation)
Technical Debt and AI “Residue” Management
The “AI Residue” Problem
Codex replicates patterns existing in the repository - including uneven or suboptimal ones. This inevitably leads to drift over time.
Initial (Failed) Approach
- Manual cleanup: Team spent 20% of their time (every Friday) cleaning “AI residue”
- Not scalable: Human-intensive approach couldn’t keep pace with agent throughput
Successful Strategy: “Golden Principles”
Instead of manual cleanup, they encoded subjective but mechanical rules into the repository:
Example Golden Principles
- Prefer shared utility packages over hand-written helpers to centralize invariants
- No “YOLO-style” data probing - validate boundaries or rely on typed SDKs
- Regular background Codex tasks scan for drift, update quality ratings, and create targeted refactoring PRs
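One way a golden principle becomes mechanical is a background scan for the banned pattern. The detector below approximates "YOLO-style data probing" as indexing straight into an unvalidated `.json()` payload; the actual rules and their detection logic are not published, so this only illustrates how a subjective preference turns into a check plus a fix hint.

```python
"""Illustrative golden-principle checker: flag raw .json()[...] probing
and point at the typed SDK instead. Pattern and message are assumed
examples, not the team's actual rules."""
import re

PROBE = re.compile(r"\.json\(\)\s*\[")


def find_data_probing(source: str, path: str) -> list[str]:
    """Flag lines that index directly into an unvalidated JSON payload."""
    hits = []
    for i, line in enumerate(source.splitlines(), start=1):
        if PROBE.search(line):
            hits.append(f"{path}:{i}: raw .json()[...] probing; parse the "
                        "payload through the typed SDK model instead.")
    return hits
```

A scheduled Codex task running scans like this can open small refactoring PRs as soon as a pattern appears, rather than letting it propagate.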
The Technical Debt Analogy
“Technical debt is like a high-interest loan: it’s better to pay it off in small, continuous installments than let it accumulate and painfully pay it off in one go.”
Continuous Quality Improvement
- Daily pattern discovery: Bad patterns are discovered and addressed daily, not allowed to propagate for days or weeks
- Automated refactoring: Most refactoring PRs can be reviewed and merged automatically within minutes
- Functioning like garbage collection: Regular cleanup prevents accumulation
Key Lessons and Best Practices
Top Lessons from the Experiment
| Lesson | Description | Implication |
|---|---|---|
| Context is King | Give agents a map, not a manual | Progressive disclosure beats information overload |
| Constraints Enable Speed | Strict boundaries prevent architectural drift | Early architectural discipline pays dividends |
| Human Judgment Scales | Encode human taste into mechanical rules | Subjective preferences become objective constraints |
| Tool Integration is Critical | Make systems directly readable by agents | Eliminate human translation layers |
| Error Correction vs. Prevention | In high-throughput systems, correction is cheaper than prevention | Optimize for velocity, not perfection |
Best Practices for Agent-First Development
- Start with Architecture
- Define strict layers and dependency rules early
- Enforce constraints mechanically from day one
- Build for agent navigability, not just human readability
- Structure Your Knowledge
- Create a docs/ directory as the system of record
- Keep AGENTS.md short (≈100 lines) as a navigation map
- Implement progressive disclosure of information
- Encode Human Judgment
- Transform subjective preferences into objective rules
- Create “golden principles” that can be mechanically enforced
- Use custom linters with repair instructions in error messages
- Integrate Tools Deeply
- Make observability tools directly accessible to agents
- Enable agents to interact with UI, logs, and metrics
- Allow agents to use standard development tools directly
- Manage Technical Debt Continuously
- Implement regular “garbage collection” cycles
- Use background agents to scan for drift and create fix PRs
- Pay down technical debt in small, continuous installments
The Future of Agent-First Development
Unanswered Questions
The experiment raised several important questions for the future:
- Architectural Coherence Over Time
- How will architecture evolve in a fully agent-generated system?
- What patterns will emerge as dominant or problematic?
- Human Judgment Optimization
- Where does human judgment provide the most value?
- How can we better encode this judgment for greater leverage?
- Model Evolution Impact
- How will increasingly capable models change the system?
- What new capabilities will become possible or necessary?
The Discipline Shift
“Building software still requires discipline, but the discipline is more in the supporting structure than in the code.”
The most challenging problems now focus on:
- Designing environments that help agents achieve goals
- Building feedback loops that maintain quality at scale
- Creating control systems for complex, reliable software
Conclusion
OpenAI’s harness engineering experiment represents a watershed moment in software development. By demonstrating that a complete software product can be built with zero human-written code, they’ve validated a new paradigm for software engineering.
Key Takeaways
- Agent-First is Viable: The experiment proves that agent-first development is not just possible but can be highly productive (10x acceleration).
- Context Management is Critical: Successful agent collaboration requires careful management of context through structured knowledge repositories and progressive disclosure.
- Architecture Enables Scale: Strict architectural constraints, typically deferred until large team sizes, become prerequisites for effective agent collaboration.
- Human Role Evolution: Engineers shift from writing code to designing environments, clarifying intent, and building feedback loops.
- Continuous Quality Management: Technical debt and “AI residue” require continuous, automated management strategies rather than periodic manual cleanup.
The Path Forward
As AI agents like Codex take on increasingly significant roles in the software lifecycle, the questions raised by this experiment will become increasingly important. The future of software engineering lies not in writing code, but in designing the systems, environments, and feedback loops that enable agents to build software effectively at scale.
The discipline of software engineering is evolving from craftsmanship to orchestration - from writing lines of code to designing the systems that write code. This transition represents both a profound challenge and an unprecedented opportunity for the software industry.