# EvanFlow Framework for TDD-driven LLM Development

The AI coding hype cycle is exhausting. Every week, someone drops a new wrapper around an API, slaps an "agentic" label on it, and claims it will replace your entire engineering department. It won’t. If you’ve actually tried to build production software with LLMs, you know the reality: they get confused, they hallucinate state, and they aggressively rewrite working code into spaghetti the moment your prompt gets too vague. We need constraints. We need guardrails. And we need to stop pretending that dropping a 5,000-word prompt into a chat window constitutes software engineering.

Enter EvanFlow. Yes, the name is mildly obnoxious. As Hacker News immediately pointed out, naming an AI framework after yourself feels a bit self-absorbed. We get it, you want your OpenClaw moment. But if you can get past the branding, EvanFlow actually solves a persistent, irritating problem in modern development: it forces an LLM into a strict, Test-Driven Development (TDD) feedback loop. It stops the AI from guessing. It forces the AI to prove its work.

## The Illusion of Naive TDD in AI Engineering

Traditional TDD is simple. You write a test, it fails. You write code, it passes. You refactor. This works because traditional code is deterministic. For a given input, there is a single, knowable, correct output.

LLMs do not care about your deterministic dreams. If you ask an LLM to make a test pass, it might fix the logic. Or it might rewrite the test to expect the wrong output. Or it might import a massive, unnecessary dependency that happens to resolve the error. Applying naive TDD to AI development often falls short because you are dealing with a non-deterministic black box. You aren't just writing code anymore; you are managing probability.

The fix is to stop, back up, and split the problem into smaller, aggressively constrained pieces. Addy Osmani recently noted that the only way to survive LLM coding workflows going into 2026 is incremental context carrying. You generate tests for a tiny piece, execute, and anchor that state. EvanFlow systematizes this exact philosophy into Claude Code.

## Anatomy of the EvanFlow Loop

EvanFlow isn't a standalone binary. It’s a collection of 16 cohesive Claude Code skills that hijack the AI's internal loop. It forces the model to walk an idea through four rigid phases: Brainstorm, Plan, Execute, Iterate. Most importantly, it enforces checkpoints.

### Phase 1: Brainstorming (With Chains)

When you ask an LLM to build a feature, it immediately starts spitting out code. This is usually garbage code. The EvanFlow `brainstorm` skill explicitly disables code generation. It forces Claude to ask questions, define the boundaries of the feature, and identify edge cases *before* any files are created.

```bash
# How you start a feature in EvanFlow
claude --skill evanflow-brainstorm "Add rate limiting to the Express auth routes"
```

The output isn't code. It's a structured markdown document outlining the Redis schema, the failure states, and the exact HTTP status codes to return. It creates a contract.

### Phase 2: Planning and Test Generation

Once the contract exists, the framework shifts to planning. But it doesn't plan the implementation. It plans the tests. This is the core of the TDD feedback loop. The LLM is instructed to write the assertions first, based entirely on the brainstorming document.
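The generated spec leans on a few small mock helpers (`mockRequest`, `mockRes`, `mockNext`). Those names are not part of EvanFlow; they are assumptions for this example, and a minimal sketch of them might look like this:

```typescript
// evanflow_generated_tests/helpers.ts -- illustrative mocks, assumed for this example.
import { vi } from 'vitest';

// Minimal stand-in for an Express request: only the fields the limiter reads.
export const mockRequest = (overrides: { ip: string }) => ({
  ip: overrides.ip,
  headers: {},
});

// A mutable response stub the middleware can write a status and body onto.
export const mockRes = {
  status: 200,
  body: {} as Record<string, string>,
};

// Spy for the next() callback, so a test can also assert whether it was called.
export const mockNext = vi.fn();
```

With those helpers importable from the spec, the test file EvanFlow generates reads as follows.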
```typescript
// evanflow_generated_tests/auth.rate.spec.ts
import { test, expect } from 'vitest';
import { rateLimiter } from '../src/middleware/rateLimiter';
// Assumed local test helpers (sketched above); not generated by EvanFlow itself.
import { mockRequest, mockRes, mockNext } from './helpers';

test('Blocks request after 5 attempts', async () => {
  // Test generation prioritized over implementation
  const req = mockRequest({ ip: '192.168.1.1' });
  for (let i = 0; i < 5; i++) {
    await rateLimiter(req, mockRes, mockNext);
  }
  const finalRes = await rateLimiter(req, mockRes, mockNext);
  expect(finalRes.status).toBe(429);
  expect(finalRes.body.error).toBe('Too Many Requests');
});
```

Because the LLM wrote the test without seeing the implementation, it is less likely to write tautological tests that just echo bad logic.

### Phase 3: Execution and Checkpoints

Here is where standard AI workflows break down. You run the test. It fails. You paste the error to the LLM. The LLM changes 40 lines of code across three files. Now you have six errors.

EvanFlow restricts the execution phase using checkpoints. A checkpoint is a snapshot of the working state. If a test fails, the LLM is only allowed to modify the exact file under test. If it fails three times in a row, the skill forces a hard stop. The LLM cannot keep guessing. It must output a summary of the failure and return control to the human.

```bash
# The execution loop
claude --skill evanflow-execute --target tests/auth.rate.spec.ts

[EvanFlow] Running Vitest... FAIL
[EvanFlow] Attempt 1/3: Modifying src/middleware/rateLimiter.ts
[EvanFlow] Running Vitest... FAIL
[EvanFlow] Attempt 2/3: Modifying src/middleware/rateLimiter.ts
[EvanFlow] Running Vitest... PASS
[EvanFlow] Checkpoint created. State anchored.
```

### Phase 4: Iterate and Anchor

Once a test passes, the context is anchored. The LLM writes a brief summary of what was achieved and clears its scratchpad. This prevents context bloat.

One of the biggest silent killers in AI development is an LLM trying to hold the entire history of a debugging session in its context window. It gets confused by its own previous mistakes. By anchoring the state and clearing the scratchpad, EvanFlow keeps the model focused only on the *current* delta.

## EvanFlow vs. The Alternatives

How does this stack up against the way you are probably writing code today?

| Feature | YOLO Prompting (Cursor/Copilot) | Heavy Agentic (AutoGPT/Devin) | EvanFlow (TDD Loop) |
| :--- | :--- | :--- | :--- |
| **Pacing** | Instant, often wrong | Slow, expensive, opaque | Stepped, human-in-the-loop |
| **Testing** | Afterthought | Usually ignored | Foundational |
| **Context Management** | Bloats until failure | Infinite loops common | Anchored and cleared per test |
| **Failure Mode** | Spaghetti code | Burns $40 of API credits | Halts after 3 failed test runs |
| **Setup Cost** | Zero | High | Medium (Requires Claude Code) |

YOLO prompting is fine for boilerplate. Heavy agents are fine if you don't care about your AWS bill. But if you want maintainable, testable code, forcing a rigid loop is the only sustainable path.

## Why This Matters Now

We are hitting the limits of "just ask the AI." The models are incredibly smart, but they lack executive function. They will happily dig a hole to the center of the earth if you don't tell them to stop. By adopting a strict TDD framework for LLMs, you aren't just improving the code quality. You are saving your own sanity. You no longer have to review 500-line diffs generated by a hallucinating model. You only review the tests, and you let the model fight with the compiler until the tests pass.
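For a sense of what the model typically converges on in that fight, here is a minimal sketch of an implementation that would satisfy the Phase 2 spec. Treat it as an assumption for illustration: it keeps counts in memory, while the brainstorm contract described a Redis-backed schema.

```typescript
// src/middleware/rateLimiter.ts -- in-memory sketch only; the contract from the
// brainstorm phase specifies a Redis-backed counter instead.
const MAX_ATTEMPTS = 5;
const attemptsByIp = new Map<string, number>();

type Req = { ip: string };
type Res = { status: number; body: Record<string, string> };

export async function rateLimiter(req: Req, res: Res, next: () => void): Promise<Res> {
  // Count this attempt against the caller's IP.
  const count = (attemptsByIp.get(req.ip) ?? 0) + 1;
  attemptsByIp.set(req.ip, count);

  if (count > MAX_ATTEMPTS) {
    // Sixth attempt and beyond: reject with 429, matching the generated spec.
    res.status = 429;
    res.body = { error: 'Too Many Requests' };
    return res;
  }

  // Under the limit: hand off to the next middleware and leave the response alone.
  next();
  return res;
}
```

The point is not this particular code; it is that you never had to review the intermediate diffs that produced it. You reviewed the spec, and the loop did the rest.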
## Actionable Takeaways

If you want to stop wrestling with AI and start engineering with it, implement these patterns immediately:

1. **Stop letting LLMs write tests and code in the same prompt.** Always separate the generation. Force the model to output the test file, commit it, and *then* ask it to write the implementation.
2. **Implement hard stops.** Give your AI workflows a maximum retry count. If it can't pass the test in 3 attempts, the approach is fundamentally flawed. Stop the loop and rethink the architecture. (A minimal sketch of this pattern follows this list.)
3. **Anchor your context.** Do not carry a 20-turn debugging conversation into your next feature. Once a checkpoint passes, wipe the AI's short-term memory. Feed it only the current, working state and the next test.
4. **Define your edge cases in plain text.** Force the AI to write a markdown document of failure modes before it touches your source code. You will catch 90% of logic errors before a single line is written.
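To make the hard-stop idea concrete, here is a rough sketch of what such a retry loop could look like outside of EvanFlow. `askModelToFix` is a placeholder for whatever LLM call you already use, and running the spec through Vitest is one reasonable choice among many; none of this is EvanFlow's actual API.

```typescript
// hardStop.ts -- illustrative retry harness with a hard stop; not EvanFlow code.
import { execSync } from 'node:child_process';

const MAX_ATTEMPTS = 3;

// `askModelToFix` is assumed: it sends the failure output to your model and lets it
// edit only the file under test before the next attempt.
export async function runWithHardStop(
  specPath: string,
  askModelToFix: (failureOutput: string) => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      // Run just the target spec; execSync throws when the tests fail.
      execSync(`npx vitest run ${specPath}`, { stdio: 'pipe' });
      return true; // Green: checkpoint here and move on to the next test.
    } catch (err) {
      const output = (err as { stdout?: Buffer }).stdout?.toString() ?? String(err);
      if (attempt === MAX_ATTEMPTS) {
        // Hard stop: the approach is flawed, so return control to the human.
        console.error(`Hard stop after ${MAX_ATTEMPTS} failed runs:\n${output}`);
        return false;
      }
      await askModelToFix(output);
    }
  }
  return false;
}
```

Whether this runs in a local script or a CI job matters less than the unconditional exit after the third failure: the loop never gets to keep guessing.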