# EvanFlow Framework for TDD-driven LLM Development
The AI coding hype cycle is exhausting, relentless, and increasingly detached from the day-to-day reality of building maintainable software. Every single week, someone on Twitter or ProductHunt drops a new wrapper around an OpenAI or Anthropic API, slaps an "agentic" or "autonomous" label on it, and loudly claims it will replace your entire engineering department by Q3. It won’t.
If you’ve actually tried to build production-grade, enterprise-ready software with Large Language Models (LLMs), you already know the sobering reality. They get confused easily. They hallucinate application state that doesn't exist. They invent libraries that sound plausible but return 404s on NPM. And most frustratingly, they will aggressively rewrite perfectly working, battle-tested code into an unreadable bowl of spaghetti the exact moment your prompt gets slightly too vague or ambitious.
We need constraints. We need guardrails, strict boundaries, and verifiable proof of work. And we desperately need to stop pretending that dropping a 5,000-word prompt into a chat window and praying for a clean output constitutes actual software engineering.
Enter EvanFlow.
Yes, the name is mildly obnoxious. As Hacker News immediately and gleefully pointed out when the repository first made the rounds, naming an AI framework after yourself feels a bit self-absorbed. We get it, you want your OpenClaw or AutoGPT moment in the sun. But if you can get past the branding and the vanity URL, EvanFlow actually solves a persistent, irritating, and expensive problem in modern software development: it forces a highly capable but inherently chaotic LLM into a strict, unforgiving Test-Driven Development (TDD) feedback loop.
It stops the AI from guessing. It prevents the model from moving forward until it can prove its work. It turns an eager, chaotic junior developer into a methodical, reliable systems engineer.
## The Illusion of Naive TDD in AI Engineering
Traditional TDD is simple, elegant, and mathematically satisfying. You write a test. You watch it fail (Red). You write the minimum amount of code to make it pass (Green). You refactor the code to make it clean (Refactor).
This workflow works beautifully for human engineers because traditional code is deterministic. For a given set of inputs, there is a single, knowable, mathematically correct output. Humans understand the goal and write logic to bridge the gap between the failing test and the passing state.
LLMs do not care about your deterministic dreams. They are prediction engines, guessing the next most statistically likely token based on a vast, generalized training corpus.
If you ask an LLM to "make this test pass," it might do the right thing and fix your flawed business logic. But it is equally likely to do something disastrous. It might rewrite your carefully crafted test to expect the wrong output, effectively silencing the alarm without fighting the fire. It might import a massive, unnecessary, and potentially vulnerable third-party dependency that happens to resolve the type error. Or it might hardcode the exact specific string the test is looking for, completely ignoring the underlying dynamic requirements.
Applying naive TDD to AI development often falls incredibly short because you are dealing with a non-deterministic black box. You aren't just writing code anymore; you are managing probability distributions.
The fix is to stop, back up, and split the problem into much smaller, aggressively constrained pieces. Google Chrome engineer Addy Osmani recently noted that the only way to survive LLM coding workflows going into 2026 is "incremental context carrying." You cannot give an AI the whole system. You must generate tests for a tiny, isolated piece of logic, execute the test in a sandboxed environment, and anchor that state before moving a single inch forward.
EvanFlow systematizes this exact philosophy into Claude Code and other agentic coding harnesses.
## Anatomy of the EvanFlow Loop
EvanFlow isn't a standalone binary or a heavy Electron app that eats your RAM. It’s a collection of 16 cohesive, highly tuned Claude Code skills (or custom CLI instructions) that constrain how the agent is allowed to work. It strips away the AI's freedom to "just code" and forces the model to walk every single idea through four fixed phases: Brainstorm, Plan, Execute, Iterate.
Most importantly, it enforces hard checkpoints at every transition.
### Phase 1: Brainstorming (With Chains)
When you ask a standard LLM to build a feature, its default behavior is to immediately start spitting out code blocks. It wants to please you quickly. Unfortunately, this is usually garbage code, written before the model has even considered the edge cases, the database schema, or the security implications.
The EvanFlow `brainstorm` skill explicitly and forcefully disables code generation. It uses system prompts that trigger immediate failure if a code block is detected. Instead, it forces Claude (or your model of choice) to ask clarifying questions, define the exact boundaries of the feature, and meticulously identify edge cases *before* any files are created on your disk.
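The "fail if a code block appears" rule can also be checked outside the prompt. The sketch below is not EvanFlow's implementation; `findCodeFences` and `assertNoCode` are hypothetical helpers that scan a model response for Markdown code fences and reject it if any are found.

```typescript
// Hypothetical guard, not EvanFlow's actual code: flags Markdown code
// fences in a model response so a brainstorm step can reject it.

/** Returns the 1-based numbers of lines that open or close a code fence. */
function findCodeFences(markdown: string): number[] {
  // CommonMark fences: up to 3 spaces of indent, then 3+ backticks or tildes.
  const fencePattern = /^ {0,3}(`{3,}|~{3,})/;
  const hits: number[] = [];
  markdown.split(/\r?\n/).forEach((line, index) => {
    if (fencePattern.test(line)) hits.push(index + 1);
  });
  return hits;
}

/** Throws if the response contains any fenced code. */
function assertNoCode(markdown: string): void {
  const hits = findCodeFences(markdown);
  if (hits.length > 0) {
    throw new Error("Code fence found on line(s): " + hits.join(", "));
  }
}
```

Inline code spans such as a single backticked identifier are deliberately allowed here, since a brainstorm document may need to name a function or a config key without actually implementing anything.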
```bash
# How you start a feature in EvanFlow
claude --skill evanflow-brainstorm "Add rate limiting to the Express auth routes"
```
The output isn't TypeScript. It's a structured markdown document outlining the proposed Redis schema, the specific failure states, the fallback mechanisms if Redis goes down, and the exact HTTP status codes to return to the client. It essentially acts as a strict contract between the human product owner and the AI developer.
### Phase 2: The Planning and Test Generation
Once the contract exists and the human approves it, the framework shifts to the planning phase. But it doesn't plan the implementation logic. It exclusively plans the tests.
This is the beating heart of the TDD feedback loop. The LLM is strictly instructed to write the assertions first, based entirely and only on the approved brainstorming document. It is not allowed to look at the existing implementation files yet.
```typescript
// evanflow_generated_tests/auth.rate.spec.ts
import { test, expect, vi } from 'vitest';
import { rateLimiter } from '../src/middleware/rateLimiter';
import { mockRequest, mockResponse } from '../tests/utils/httpMocks';

test('Blocks request after 5 attempts and returns 429', async () => {
  const req = mockRequest({ ip: '192.168.1.1' });

  // The first 5 rapid-fire requests are within the limit and should pass through
  const allowedNext = vi.fn();
  for (let i = 0; i < 5; i++) {
    await rateLimiter(req, mockResponse(), allowedNext);
  }
  expect(allowedNext).toHaveBeenCalledTimes(5);

  // The 6th request should be rejected; fresh mocks keep its assertions isolated
  const finalRes = mockResponse();
  const finalNext = vi.fn();
  await rateLimiter(req, finalRes, finalNext);

  expect(finalRes.status).toHaveBeenCalledWith(429);
  expect(finalRes.json).toHaveBeenCalledWith(
    expect.objectContaining({ error: 'Too Many Requests' })
  );
  expect(finalNext).not.toHaveBeenCalled();
});
```
Because the LLM wrote this test without seeing the actual implementation code, it is much less likely to write tautological tests, meaning tests that just blindly echo and confirm bad logic that already exists in the system.
### Phase 3: Execution and Checkpoints
Here is where standard, highly-marketed AI workflows completely break down. In a normal tool, you run the test. It fails. You copy the stack trace and paste the error back to the LLM. The LLM panics, apologizes profusely, and changes 40 lines of code across three entirely unrelated files. Now, instead of one failing test, your build is broken and you have six new errors.
EvanFlow fundamentally restricts the execution phase using immutable checkpoints.
A checkpoint is a snapshot of the working state, usually leveraging Git under the hood. If a test fails, the LLM is only allowed to modify the single file under test. It cannot touch routing, it cannot touch the database config, it cannot touch the UI. If it fails three times in a row (the default limit), the skill forces a hard stop. The LLM is locked out of making further edits. It cannot keep guessing. It must output a summary of the failure, explain its hypothesis on why its attempts failed, and return control to the human developer.
```bash
# The execution loop in practice
claude --skill evanflow-execute --target evanflow_generated_tests/auth.rate.spec.ts
[EvanFlow] Creating Git snapshot... OK
[EvanFlow] Running Vitest... FAIL (Cannot read properties of undefined)
[EvanFlow] Attempt 1/3: Modifying src/middleware/rateLimiter.ts
[EvanFlow] Running Vitest... FAIL (Expected 429 but received 200)
[EvanFlow] Attempt 2/3: Modifying src/middleware/rateLimiter.ts
[EvanFlow] Running Vitest... PASS
[EvanFlow] Checkpoint created. State anchored.
```
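The retry cap is simple to reason about in code. The following is a minimal sketch, not EvanFlow's source: `runFixLoop`, `runTests`, and `proposeFix` are hypothetical names, and the callbacks are kept synchronous for brevity (a real harness would await the model and the test runner).

```typescript
// Hypothetical retry-capped fix loop; all names are illustrative.

interface TestResult {
  passed: boolean;
  message: string; // failure summary from the test runner, empty on success
}

interface LoopOutcome {
  status: "passed" | "halted";
  attempts: number;   // number of fix attempts actually made
  failures: string[]; // every failure message seen, in order
}

function runFixLoop(
  runTests: () => TestResult,
  proposeFix: (failure: string, attempt: number) => void,
  maxAttempts: number = 3
): LoopOutcome {
  const failures: string[] = [];
  let attempts = 0;
  let result = runTests();

  // Keep fixing only while the test fails AND the budget is not exhausted.
  while (!result.passed && attempts < maxAttempts) {
    failures.push(result.message);
    attempts += 1;
    proposeFix(result.message, attempts);
    result = runTests();
  }

  // Record the last failure so the halt report shows why the final attempt failed.
  if (!result.passed) failures.push(result.message);
  return { status: result.passed ? "passed" : "halted", attempts, failures };
}
```

On a `halted` outcome the caller has everything needed for the hand-off described above: the number of attempts and the ordered list of failure messages to summarize for the human.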
### Phase 4: Iterate and Anchor
Once a test passes and the green checkmark appears, the context must be actively anchored. The LLM writes a brief, one-paragraph summary of what was achieved and then completely clears its internal scratchpad and conversation history.
This is a critical innovation that prevents context bloat. One of the biggest silent killers in AI software development is an LLM trying to hold the entire history of a 45-minute debugging session in its context window. It gets confused by its own previous mistakes, starts blending failed approaches with the current approach, and eventually collapses under its own cognitive weight. Long contexts also make models less reliable at using information buried in the middle of the conversation, an effect often referred to as the "lost in the middle" phenomenon.
By anchoring the state (committing the code) and clearing the scratchpad, EvanFlow keeps the model hyper-focused only on the *current* delta. It is always fresh, always alert, and never bogged down by the ghosts of its previous failures.
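Anchoring can be pictured as replacing the conversation with a single summary. This sketch assumes a generic chat-message shape and a hypothetical `anchorContext` helper; keeping the system messages is an assumption made here so the standing instructions survive the reset, and is not something the framework is documented to do.

```typescript
// Hypothetical context-anchoring helper; the message shape is a generic
// chat format, not a specific vendor's API.

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

/**
 * Drops the debugging back-and-forth and keeps only the standing system
 * instructions plus a one-paragraph summary of the checkpoint.
 */
function anchorContext(history: Message[], summary: string): Message[] {
  const systemMessages = history.filter((m) => m.role === "system");
  const anchor: Message = {
    role: "assistant",
    content: "Checkpoint summary: " + summary,
  };
  return [...systemMessages, anchor];
}
```

The next test then starts from this short context plus the committed code on disk, rather than from the full transcript of the previous attempts.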
## Setting Up EvanFlow in Your Local Environment
Transitioning from a chaotic "YOLO prompting" workflow to a structured EvanFlow loop requires a bit of discipline and setup. Fortunately, the framework is designed to wrap around your existing tools rather than replace them.
**Step 1: Install the Framework Requirements**
You need a reliable test runner (Jest, Vitest, or pytest) and a version control system (Git is mandatory, as EvanFlow relies on it for checkpointing). Make sure Claude Code, OpenClaw, or your preferred agentic harness is installed and available from your command line.
**Step 2: Initialize the EvanFlow Config**
Run the initialization command in your project root. This generates an `evanflow.json` file which dictates the strictness of your loop. Here, you define your maximum retry count (default is 3), your test execution commands, and your protected directories (files the AI is never allowed to touch, like infrastructure configurations).
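To make those three settings concrete, here is a sketch of how such a config might be modeled and enforced. The field names (`maxRetries`, `testCommand`, `protectedPaths`) and the `isProtected` helper are assumptions for illustration; the actual `evanflow.json` schema may differ.

```typescript
// Hypothetical shape for the settings described above; the real
// evanflow.json keys are not confirmed by this article.

interface EvanFlowConfig {
  maxRetries: number;       // failed attempts allowed before the loop halts
  testCommand: string;      // command used to run the test suite
  protectedPaths: string[]; // entries ending in "/" are directories, others exact files
}

const exampleConfig: EvanFlowConfig = {
  maxRetries: 3,
  testCommand: "npx vitest run",
  protectedPaths: ["infra/", ".github/", "evanflow.json"],
};

/** Returns true if the AI must not edit the given repo-relative path. */
function isProtected(filePath: string, config: EvanFlowConfig): boolean {
  // Normalize Windows separators and a leading "./" before comparing.
  const normalized = filePath.replace(/\\/g, "/").replace(/^\.\//, "");
  return config.protectedPaths.some((entry) =>
    entry.endsWith("/") ? normalized.startsWith(entry) : normalized === entry
  );
}
```

This check is intentionally simple: it does not resolve `..` segments or symlinks, so a production guard would need to canonicalize paths first.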
**Step 3: Define the Feature Manifest**
Instead of typing a loose prompt into a chat window, you create a `.feature.md` file. You describe the business requirement, the user persona, and the expected outcome. You then feed this document into the `evanflow-brainstorm` skill.
**Step 4: Execute the Loop**
Once the tests are generated and reviewed by you, run the execution loop. Step back and watch as the AI systematically attempts to pass the tests, creates Git snapshots on success, and halts if it gets confused. Your job shifts from writing boilerplate to reviewing architectural contracts and managing the AI's execution pipeline.
## The Psychology of AI-Assisted Development
Why do we let AI write such bad code when we would instantly reject the same code from a junior human developer? It comes down to cognitive offloading and the psychological trap of perceived authority.
When an AI generates 200 lines of code in four seconds, formatted perfectly with syntax highlighting, our brains are tricked into assuming it must be correct. Reading and verifying code is mentally taxing—often harder than writing it from scratch. As a result, developers skim the output, assume the AI "knows what it's doing," and blindly hit merge.
EvanFlow acts as a psychological circuit breaker. By forcing the AI to stop after every single test, it breaks the hypnotic flow of endless code generation. It forces the human to stay engaged in the process as a supervisor and a reviewer. You aren't just letting the AI drive; you are forcing it to stop at every single traffic light and verify its route before proceeding.
## Scaling EvanFlow to Enterprise Teams
While EvanFlow is a massive productivity booster for solo developers, its true value unlocks when integrated into enterprise team environments.
In a team setting, EvanFlow standardizes the way AI is utilized across the engineering department. Instead of one developer using Cursor to generate massive PRs that take days to review, and another manually typing out boilerplate, EvanFlow ensures that all AI-generated code is inherently test-backed and documented.
Furthermore, EvanFlow's architectural contracts (the output of the Brainstorm phase) can be automatically attached to Jira tickets or GitHub Pull Requests. This provides human reviewers with exact context on *what* the AI was instructed to do, the tests it generated to prove it did it, and the incremental steps it took to achieve the final state. It transforms AI code from a mysterious black box into an auditable, transparent pipeline.
## EvanFlow vs. The Alternatives
How does this rigorous, constrained approach stack up against the way you are probably writing code today?
| Feature | YOLO Prompting (Cursor/Copilot) | Heavy Agentic (AutoGPT/Devin) | EvanFlow (TDD Loop) |
| :--- | :--- | :--- | :--- |
| **Pacing** | Instant, chaotic, often wrong | Slow, incredibly expensive, opaque | Stepped, deliberate, human-in-the-loop |
| **Testing Philosophy** | Total afterthought, usually skipped | Usually ignores them to force a result | Foundational, non-negotiable |
| **Context Management** | Bloats until the model hallucinates | Infinite loops are notoriously common | Anchored and completely cleared per test |
| **Failure Mode** | Spaghetti code across 14 files | Burns $40 of API credits while you sleep | Halts gracefully after 3 failed test runs |
| **Setup Cost** | Zero, works out of the box | High, requires massive API permissions | Medium (Requires CLI setup and config) |
| **Maintainability** | Very Low over time | Medium, if it finishes the job | Exceptionally High |
YOLO prompting is fine for simple boilerplate, generating regex strings, or centering a div. Heavy autonomous agents are fine if you are building a toy project and don't care about your API bill. But if you want maintainable, testable code that will survive the next major refactoring cycle, forcing a rigid, test-first loop is the only sustainable path forward.
## Why This Matters Now
We are rapidly hitting the theoretical and practical limits of "just ask the AI to do it." The underlying foundation models are incredibly smart, possessing vast knowledge of syntax, patterns, and computer science theory. But they critically lack executive function. They do not have common sense. They will happily dig a software architecture hole straight to the center of the earth if you don't explicitly tell them to stop digging.
By adopting a strict TDD framework for LLMs, you aren't just incrementally improving your code quality. You are actively saving your own sanity. You no longer have to review 500-line, undocumented diffs generated by a hallucinating model at 2:00 AM. You only review the tests, ensuring the business logic is covered, and you let the model fight with the compiler and the type checker until the tests pass.
## Actionable Takeaways
If you want to stop wrestling with AI and start actually engineering with it, implement these behavioral patterns immediately, whether you use EvanFlow or not:
1. **Stop letting LLMs write tests and code in the same prompt.** Always, without exception, separate the generation phases. Force the model to output the test file, review it, commit it, and *then* ask it to write the implementation logic.
2. **Implement hard stops and circuit breakers.** Give your AI workflows a maximum retry count. If it can't pass the test in 3 attempts, the approach is fundamentally flawed. Stop the loop, analyze the error yourself, and rethink the architecture.
3. **Anchor your context aggressively.** Do not carry a 20-turn debugging conversation into your next feature request. Once a checkpoint passes, wipe the AI's short-term memory completely. Feed it only the current, working state and the next test it needs to pass.
4. **Define your edge cases in plain text.** Force the AI to write a markdown document of failure modes before it touches your source code. You will catch many architectural and logic errors in plain English before a single line of buggy code is written.
## Frequently Asked Questions (FAQ)
**Q: Does EvanFlow work with local, open-source models like Llama 3 or Mistral?**
A: Yes, provided the model has sufficient reasoning capabilities and a large enough context window to handle standard coding tasks. However, EvanFlow heavily relies on strict instruction following to prevent code generation during the brainstorming phase. Premium models like Claude 3.5 Sonnet or GPT-4o tend to follow these negative constraints much more reliably than smaller local models, which often "leak" code prematurely.
**Q: Will using this strict framework slow down my development process?**
A: In the micro sense, yes. You will not get the instant gratification of seeing 300 lines of code appear in two seconds. But in the macro sense, absolutely not. The time you "lose" in planning and test generation is time you save many times over by not having to untangle AI-generated spaghetti code, hunt down silent bugs, or revert broken commits. It trades speed of typing for speed of shipping.
**Q: Why was this built around Claude Code specifically?**
A: While the methodology can be adapted to any harness, Claude Code's native ability to chain tools, execute local shell commands, and read file diffs makes it well suited to the EvanFlow loop. Claude 3.5 Sonnet was also, at the time of writing, one of the strongest models for code generation and refactoring, making it a logical engine for a strict TDD framework.
**Q: How do you handle frontend UI or visual testing with a TDD AI loop?**
A: UI testing is notoriously difficult for LLMs because they lack visual context. EvanFlow handles frontend development by isolating state and logic from the DOM. It forces the AI to write unit tests for React hooks, Redux reducers, or Vue composables independently of the render cycle. For visual regressions, it defers to human review at the checkpoint stage.
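As an illustration of logic isolated from the DOM, consider a reducer that tracks failed logins. The names below (`authReducer`, `AuthState`, `MAX_ATTEMPTS`) are hypothetical and are not taken from EvanFlow; the point is that the function is pure, so a test can assert on its return value without rendering anything.

```typescript
// Hypothetical pure reducer: no React, no DOM, just state in and state out.

interface AuthState {
  attempts: number;
  locked: boolean;
}

type AuthAction = { type: "loginFailed" } | { type: "loginSucceeded" };

const MAX_ATTEMPTS = 5;
const initialState: AuthState = { attempts: 0, locked: false };

function authReducer(state: AuthState, action: AuthAction): AuthState {
  if (action.type === "loginFailed") {
    const attempts = state.attempts + 1;
    // Lock the form once the failure budget is used up.
    return { attempts, locked: attempts >= MAX_ATTEMPTS };
  }
  // A successful login resets the counter, unless the account is already locked.
  return state.locked ? state : initialState;
}
```

A generated test for this reducer only needs to dispatch a sequence of actions and compare the resulting object, which is exactly the kind of assertion an LLM can write and verify without visual context.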
**Q: Can I override the 3-strike failure rule if the AI is really close to a solution?**
A: Technically yes, by altering the `evanflow.json` configuration file. However, it is highly discouraged. Experience shows that if an LLM fails the same test three times in an isolated environment, it has usually fallen into a localized logic trap. Giving it more attempts rarely results in a clean fix; it usually results in the AI creating increasingly bizarre and complex workarounds. It is almost always faster to reset the context and guide it manually.
## Conclusion
The era of typing loose, hopeful prompts into a chat window and expecting enterprise-grade software to emerge is coming to an end. As we transition from treating AI as a novelty to relying on it as core infrastructure, our methodologies must mature.
EvanFlow is not magic. It is just the rigorous application of computer science fundamentals—Test-Driven Development, state isolation, and deterministic checkpoints—applied to non-deterministic systems. By aggressively constraining the LLM, forcing it to prove its work through tests, and halting its execution when it gets confused, we strip away the chaos of AI coding. What remains is a reliable, tireless, and highly effective development partner that actually helps you build software, rather than just generating technical debt at the speed of light.