Self-Improving Coding Agents

Author: Addy Osmani

Summary
The most productive use of AI coding agents isn't watching them work—it's setting them up to work autonomously while you do something else. This article explores the patterns, strategies, and safeguards for building continuous coding loops that improve themselves over time.

The Vision: Agents That Work While You Sleep

The ultimate promise of AI coding agents isn't faster pair programming—it's autonomous productivity. Imagine describing a feature, going to bed, and waking up to a completed, tested, and reviewed pull request. We're not fully there yet, but the patterns for getting close already exist.

Self-improving coding agents work in continuous loops: they write code, test it, analyze failures, fix issues, and iterate—without human intervention. The key is setting up the right structure so these loops converge on correct solutions rather than spinning endlessly.

The "Ralph Wiggum" Technique

One of the most effective strategies for autonomous coding agents has an unlikely name. The "Ralph Wiggum" technique (a reference to the Simpsons character who succeeds through persistent, incremental effort) is about breaking complex tasks into tiny, independently verifiable steps.

Why Tiny Steps Work

Large tasks have compounding error rates. If each step in a 10-step task has a 90% success rate, the overall success rate is only about 35% (0.9^10 ≈ 0.35), assuming the steps fail independently. But if you break that same task into 20 smaller steps, each with a 98% success rate, overall success climbs to about 67% (0.98^20 ≈ 0.67). And critically, failures are caught early and corrected before they cascade.
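The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes steps succeed or fail independently, which real tasks only approximate; `overall_success` is a name chosen here for illustration.

```python
def overall_success(per_step_rate: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_rate ** steps


# Hypothetical figures taken from the paragraph above.
ten_coarse = overall_success(0.90, 10)
twenty_fine = overall_success(0.98, 20)

print(f"10 steps at 90%: {ten_coarse:.0%}")   # prints "10 steps at 90%: 35%"
print(f"20 steps at 98%: {twenty_fine:.0%}")  # prints "20 steps at 98%: 67%"
```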

The Pattern in Practice

Instead of asking an agent to "implement user authentication," break it down:

1. Create the user model with email and hashed password fields
2. Write a test for user creation
3. Implement the user creation function
4. Run tests and verify they pass
5. Create the login endpoint
6. Write a test for successful login
7. Write a test for failed login
8. Implement the login logic
9. Run all tests
10. Add session management

...and so on

Each step is small enough that the agent can complete it reliably, and each step is independently verifiable. If step 6 fails, you know exactly where the problem is, and the agent can focus its attention on a small, well-defined fix.

Verification at Every Step

The critical element is that every step includes its own verification:

After writing code: run linters and type checkers
After implementing logic: run the relevant tests
After making changes: verify the full test suite still passes
After each micro-task: confirm the system is in a known-good state

This continuous verification is what transforms a sequence of AI operations into a reliable engineering process.
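One way to picture step-level verification is a runner that pairs each step with its own check and stops at the first check that fails. The sketch below is illustrative only: `Step`, `run_steps`, and the lambda stand-ins for agent actions are hypothetical names, not part of any agent framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    action: Callable[[], None]   # performs the change (stand-in for an agent call)
    verify: Callable[[], bool]   # True if the system is in a known-good state


def run_steps(steps: List[Step]) -> Optional[int]:
    """Run steps in order. Return the 1-based number of the first step whose
    verification fails, or None if every step verified."""
    for number, step in enumerate(steps, start=1):
        step.action()
        if not step.verify():
            return number
    return None


# Usage with stand-in actions: the second step fails verification,
# so the third step is never attempted.
log: List[str] = []
steps = [
    Step("Create the user model", lambda: log.append("model"), lambda: True),
    Step("Write a test for user creation", lambda: log.append("test"), lambda: False),
    Step("Implement the user creation function", lambda: log.append("impl"), lambda: True),
]
failed_at = run_steps(steps)
```

Because the runner reports the step number, a failure points at one small change rather than at the whole task.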

AGENTS.md as External Memory

One of the most powerful but underappreciated tools for autonomous coding agents is the external memory file—commonly named AGENTS.md or CLAUDE.md in practice.

What Goes in External Memory

An effective external memory file contains:

Project context: Technology stack, architecture decisions, coding conventions

markdown

## Stack
- Next.js 15 with App Router
- TypeScript strict mode
- Prisma ORM with PostgreSQL
- Tailwind CSS with shadcn/ui components

Coding conventions: Style rules and patterns the agent should follow

markdown

## Conventions
- Use named exports, not default exports
- Prefer server components; use 'use client' only when needed
- All API routes must validate input with Zod schemas
- Error handling: use Result types, not try/catch for business logic

Known issues and workarounds: Things the agent shouldn't waste time rediscovering

markdown

## Known Issues
- The Prisma client must be regenerated after schema changes: `npx prisma generate`
- Hot reload doesn't work for changes to middleware.ts—restart the dev server
- The `auth()` function is async in Next.js 15; always await it

Task-specific instructions: Guidance for common operations

markdown

## Adding a New API Endpoint
1. Create the route file in `src/app/api/`
2. Add input validation with Zod
3. Add the corresponding test in `__tests__/api/`
4. Update the API documentation in `docs/api/`

Why External Memory Matters for Autonomous Loops

Without external memory, every new agent session starts from zero. The agent needs to rediscover the project structure, conventions, and gotchas—wasting tokens and time. With a well-maintained external memory file:

New sessions start with full project awareness
Conventions are followed consistently across sessions
Known pitfalls are avoided without re-learning
Common workflows are executed correctly on the first try

Memory That Improves Over Time

The "self-improving" aspect partly comes from updating external memory based on discoveries. When an agent encounters a new issue or figures out a better approach, that knowledge can be captured:

markdown

## Learned: 2024-Q3
- Database migrations must be run before tests in CI
- The payment webhook handler needs idempotency checks
- Use `unstable_cache` for data that changes hourly, not `revalidate`

Over time, the external memory becomes a comprehensive handbook that makes each subsequent autonomous session more efficient and reliable.
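Capturing a discovery can be as simple as appending a bullet under a heading in the memory file. The following is a rough sketch under stated assumptions: `record_learning` is a hypothetical helper, the file is plain Markdown with `## ` section headings, and exact duplicate bullets are skipped so repeated sessions do not bloat the file.

```python
from pathlib import Path


def record_learning(memory_file: Path, heading: str, note: str) -> None:
    """Append `- note` under `heading`, creating the heading if it is missing."""
    text = memory_file.read_text(encoding="utf-8") if memory_file.exists() else ""
    lines = text.splitlines()
    bullet = f"- {note}"
    if bullet in lines:
        return  # already recorded

    if heading in lines:
        # Find the end of the section: the next "## " heading or end of file.
        idx = lines.index(heading) + 1
        while idx < len(lines) and not lines[idx].startswith("## "):
            idx += 1
        # Step back over trailing blank lines so the bullet joins the list.
        while lines[idx - 1].strip() == "":
            idx -= 1
        lines.insert(idx, bullet)
    else:
        if lines and lines[-1].strip():
            lines.append("")
        lines.extend([heading, bullet])

    memory_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

In practice a human would still review these additions from time to time, since an agent can record a wrong conclusion as easily as a right one.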

Persistent Memory Across Sessions

Beyond file-based external memory, several patterns enable agents to maintain continuity across sessions:

Git History as Memory

The git log itself serves as persistent memory. An agent can:

Review recent commits to understand what's been done
Read commit messages to understand the reasoning behind changes
Check the diff of recent changes to maintain context continuity
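As a sketch of how a session might load recent history, the snippet below shells out to `git log` and parses one line per commit. The helper names `parse_log` and `recent_commits` are hypothetical; the only git features assumed are the standard `-n` limit and the `--pretty=format:` placeholders `%h` (short hash), `%x09` (tab), and `%s` (subject).

```python
import subprocess
from typing import List, Tuple

LOG_FORMAT = "%h%x09%s"  # short hash, tab, subject line


def parse_log(output: str) -> List[Tuple[str, str]]:
    """Turn tab-separated `git log` output into (hash, subject) pairs."""
    commits = []
    for line in output.splitlines():
        if "\t" in line:
            short_hash, subject = line.split("\t", 1)
            commits.append((short_hash, subject))
    return commits


def recent_commits(limit: int = 10, repo_dir: str = ".") -> List[Tuple[str, str]]:
    """Read the last `limit` commits from the repository at `repo_dir`."""
    result = subprocess.run(
        ["git", "log", "-n", str(limit), f"--pretty=format:{LOG_FORMAT}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return parse_log(result.stdout)
```

The parsed pairs can then be placed at the top of the agent's prompt so a new session starts from what was actually committed.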

Task Tracking Integration

Connecting the agent to a task tracker (issues, project boards) provides another memory layer:

The agent knows which tasks are complete, in progress, or pending
Failed attempts are recorded with context for future sessions
Dependencies between tasks are visible

Session Handoff Files

For long-running projects, explicit handoff notes between sessions keep context from being lost when one session ends and the next begins:

markdown

## Session Summary - 2024-03-15
### Completed
- Implemented user profile API endpoints
- Added avatar upload with S3 integration

### In Progress
- Profile edit form (frontend) - component created but validation not working

### Blocked
- Email notification system waiting on SendGrid API key

### Notes for Next Session
- The avatar crop library has a bug with PNG transparency—use JPEG for now
- Profile validation schema is in `src/schemas/profile.ts`
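A handoff note in this shape can be generated rather than written by hand. The sketch below is one possible approach, not a prescribed format: `Handoff` and `render_handoff` are hypothetical names, and the section titles simply mirror the example above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Handoff:
    day: date
    completed: List[str] = field(default_factory=list)
    in_progress: List[str] = field(default_factory=list)
    blocked: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def render_handoff(h: Handoff) -> str:
    """Render the handoff as Markdown, skipping sections with no items."""
    sections = [
        ("Completed", h.completed),
        ("In Progress", h.in_progress),
        ("Blocked", h.blocked),
        ("Notes for Next Session", h.notes),
    ]
    lines = [f"## Session Summary - {h.day.isoformat()}"]
    for title, items in sections:
        if not items:
            continue
        lines.append(f"### {title}")
        lines.extend(f"- {item}" for item in items)
        lines.append("")  # blank line between sections
    return "\n".join(lines).rstrip() + "\n"
```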

Task Isolation Strategies

Autonomous agents work best when tasks are properly isolated—when completing one task can't accidentally break another.

Branch-Per-Task

The simplest isolation strategy: each task runs on its own git branch.

The agent creates a feature branch from main
All work happens on that branch
If the task fails catastrophically, the branch can be discarded without affecting anything
Successful branches are merged through the normal review process

Environment Isolation

For tasks that involve infrastructure or dependencies:

Use separate development databases for different agents
Run tests in containers to prevent interference
Use separate environment variable files for different task contexts

Scope Limiting

Define clear boundaries for what the agent can modify:

Specify which files and directories the task should touch
Explicitly list files that should NOT be modified
Use read-only access for reference files
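Scope limits are easier to enforce when they are checked mechanically after each change. The sketch below assumes the list of changed paths is already available (for example from `git diff --name-only`) and uses glob patterns for the allowed and forbidden sets; `out_of_scope` and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence


def out_of_scope(
    changed: Sequence[str],
    allowed: Sequence[str],
    forbidden: Sequence[str],
) -> List[str]:
    """Return changed paths that are forbidden or not covered by any allowed pattern.

    Note: in fnmatch patterns, `*` also matches `/`, so "src/app/api/*"
    covers nested files as well.
    """
    violations = []
    for path in changed:
        is_forbidden = any(fnmatchcase(path, pattern) for pattern in forbidden)
        is_allowed = any(fnmatchcase(path, pattern) for pattern in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


# Usage: the task may touch API routes and their tests, but never the schema.
violations = out_of_scope(
    changed=["src/app/api/users/route.ts", "prisma/schema.prisma", "README.md"],
    allowed=["src/app/api/*", "__tests__/api/*"],
    forbidden=["prisma/*"],
)
```

A non-empty result is a reasonable signal to stop the loop and ask for review rather than let the agent continue.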

QA Validation Loops

The "self-improving" in self-improving agents comes largely from automated QA loops—structured verification processes that catch and correct issues.

The Basic QA Loop

1. Agent makes a change
2. Run automated checks:
   - Linting (ESLint, Prettier)
   - Type checking (TypeScript compiler)
   - Unit tests
   - Integration tests
3. If all checks pass → proceed to next task
4. If any check fails → analyze failure, fix, return to step 2
5. If the maximum retry limit is reached → stop and hand the task to a human, to prevent infinite loops
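The loop above can be sketched as a small function with a hard cap on attempts. This is an illustration under assumptions, not a definitive implementation: `qa_loop` is a hypothetical name, each check is a callable returning True on success, and `fix` stands in for whatever the agent does to repair the failures.

```python
from typing import Callable, Dict, List

Check = Callable[[], bool]


def qa_loop(
    checks: Dict[str, Check],
    fix: Callable[[List[str]], None],
    max_attempts: int = 3,
) -> bool:
    """Run every check; on failure, call `fix` with the failing names and retry.

    Returns True once all checks pass, or False when attempts run out,
    at which point the task should be escalated rather than retried.
    """
    for attempt in range(1, max_attempts + 1):
        failures = [name for name, check in checks.items() if not check()]
        if not failures:
            return True
        if attempt < max_attempts:
            fix(failures)  # no point fixing after the final attempt
    return False
```

Returning False instead of looping forever is the important part: it turns a stuck agent into a visible, reviewable event.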

Progressive Validation

Not all checks need to run at every step. A progressive validation strategy runs faster checks first:

1. Syntax check: Does the code parse? (milliseconds)
2. Type check: Does it type-check? (seconds)
3. Unit tests: Do related tests pass? (seconds to minutes)
4. Integration tests: Does the system work end-to-end? (minutes)
5. Build check: Does the full build succeed? (minutes)

Failing early on a syntax error saves the time of running the full test suite.
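A minimal sketch of the fail-fast ordering: stages are sorted by a rough cost estimate and execution stops at the first failure. `Stage`, `first_failure`, and the cost figures are hypothetical; real stages would wrap commands such as a linter or test runner.

```python
from typing import Callable, List, NamedTuple, Optional


class Stage(NamedTuple):
    name: str
    estimated_seconds: float   # rough cost, used only for ordering
    run: Callable[[], bool]    # True if the stage passes


def first_failure(stages: List[Stage]) -> Optional[str]:
    """Run stages cheapest-first; return the first failing stage's name, or None."""
    for stage in sorted(stages, key=lambda s: s.estimated_seconds):
        if not stage.run():
            return stage.name  # later, more expensive stages are skipped
    return None
```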

Quality Gates

Define clear quality thresholds that must be met before a task is considered complete:

All existing tests pass
New code has test coverage above a minimum threshold
No new linting warnings introduced
No type errors
Build succeeds
Performance benchmarks don't regress (for critical paths)
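Quality gates like these can be expressed as data so that "done" is decided the same way every time. In the sketch below, `Report`, `failed_gates`, and the 80% default coverage threshold are assumptions made for illustration; the right threshold depends on the project.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    tests_failed: int
    new_code_coverage: float   # fraction between 0 and 1
    new_lint_warnings: int
    type_errors: int
    build_ok: bool
    perf_regressed: bool


def failed_gates(report: Report, min_coverage: float = 0.8) -> List[str]:
    """Return the names of gates the report does not satisfy (empty means done)."""
    gates = [
        ("all existing tests pass", report.tests_failed == 0),
        ("new code coverage above threshold", report.new_code_coverage >= min_coverage),
        ("no new linting warnings", report.new_lint_warnings == 0),
        ("no type errors", report.type_errors == 0),
        ("build succeeds", report.build_ok),
        ("no performance regression", not report.perf_regressed),
    ]
    return [name for name, ok in gates if not ok]
```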

When to Let Agents Run vs. When to Intervene

Autonomous doesn't mean unsupervised. Knowing when to let agents work and when to step in is crucial.

Let Agents Run When

Tasks are well-defined: Clear inputs, clear success criteria
Verification is automated: Tests, linters, and type checkers cover the change
Scope is limited: The task touches a small, well-understood part of the codebase
Rollback is easy: Git branches make it simple to discard failed attempts
The cost of failure is low: Internal tools, non-critical features

Intervene When

Architecture decisions are needed: Agents should implement decisions, not make them
Security-sensitive code: Authentication, authorization, data access patterns
External system integration: APIs, databases, third-party services
User-facing design: Layout, copy, interaction patterns
The agent is stuck in a loop: Repeated failures on the same step indicate a problem the agent can't solve alone
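The last condition, an agent stuck in a loop, is one that can be detected automatically. A rough sketch, assuming each failed attempt is logged as a (step, error signature) pair; `is_stuck` and the default threshold of three are hypothetical choices.

```python
from typing import List, Tuple


def is_stuck(history: List[Tuple[str, str]], threshold: int = 3) -> bool:
    """Return True when the last `threshold` failures are identical.

    `history` holds (step, error_signature) pairs for failed attempts,
    oldest first.
    """
    if len(history) < threshold:
        return False
    tail = history[-threshold:]
    return all(entry == tail[0] for entry in tail)
```

When this returns True, pausing the loop and notifying a human is usually cheaper than spending more tokens on the same failure.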

The Review Checkpoint

Even for autonomous work, human review remains essential. The ideal flow is:

1. Agent works autonomously, producing a branch with commits
2. Agent opens a pull request with a description of what was done
3. Human reviews the changes, focusing on:
   - Architectural fit
   - Security implications
   - Business logic correctness
   - Edge cases the tests might not cover
4. Human approves or requests changes
5. If changes are requested, agent addresses them in another autonomous loop

Practical Setup Guide

Step 1: Prepare External Memory

Create an AGENTS.md file with project context, conventions, and common workflows. Start minimal and expand as you discover what the agent needs to know.

Step 2: Design Task Decomposition

Break your project into independently completable tasks. Each task should:

Have clear success criteria
Be verifiable with automated tests
Take no more than about an hour of agent time (30 to 60 minutes is a good target)
Have minimal dependencies on other incomplete tasks

Step 3: Set Up Automated Validation

Ensure your CI pipeline covers:

Linting and formatting
Type checking
Unit and integration tests
Build verification

Step 4: Establish the Loop

Configure the agent to work in cycles:

1. Pick a task from the queue
2. Create a branch
3. Implement the task in small steps with verification
4. Run the full test suite
5. Create a pull request
6. Move to the next task
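The first step of the cycle, picking a task, is where dependencies between tasks matter. The sketch below shows one way to select the next task whose prerequisites are all complete; `Task` and `next_ready_task` are hypothetical names, and the queue is assumed to be an in-memory list rather than a real issue tracker.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Task:
    task_id: str
    depends_on: Set[str] = field(default_factory=set)
    done: bool = False


def next_ready_task(queue: List[Task]) -> Optional[Task]:
    """Return the first unfinished task whose dependencies are all done."""
    completed = {task.task_id for task in queue if task.done}
    for task in queue:
        if not task.done and task.depends_on <= completed:
            return task
    return None  # everything is finished or blocked
```

A None result with unfinished tasks remaining means the queue is blocked, which is a natural point to ask a human for input.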

Step 5: Monitor and Improve

Track metrics over time:

Task completion rate
Average time per task
Number of human interventions needed
Types of failures encountered

Use these metrics to improve your external memory, task decomposition, and validation coverage.
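These metrics need very little tooling to compute. A minimal sketch, assuming one record per attempted task; `TaskRun`, `summarize`, and the field names are illustrative rather than a standard schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskRun:
    completed: bool
    minutes: float
    interventions: int
    failure_type: Optional[str] = None   # None when the task succeeded


def summarize(runs: List[TaskRun]) -> Dict[str, object]:
    """Aggregate completion rate, average time, interventions, and failure types."""
    if not runs:
        raise ValueError("no task runs to summarize")
    total = len(runs)
    return {
        "completion_rate": sum(run.completed for run in runs) / total,
        "avg_minutes": sum(run.minutes for run in runs) / total,
        "interventions": sum(run.interventions for run in runs),
        "failure_types": Counter(run.failure_type for run in runs if run.failure_type),
    }
```

A failure type that keeps recurring in the counts is a good candidate for a new entry in the external memory file or a new automated check.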

The Future Is Autonomous (With Guardrails)

Self-improving coding agents aren't science fiction; with the right setup they're a practical reality today. The patterns described here (tiny steps, external memory, task isolation, and QA loops) turn AI coding from an unreliable novelty into a productive engineering workflow.

The key insight: autonomy is earned through structure. The more structure you provide (clear tasks, comprehensive tests, well-documented conventions), the more autonomy the agent can handle. Start with tight supervision and gradually expand as confidence builds.
