
# EvanFlow: TDD-Driven Framework for Claude Code

We need to talk about the elephant in the room regarding AI-assisted software development. If you let a Large Language Model (LLM) loose in your repository without strict guardrails, you aren't getting a 10x developer. You are getting a tireless, caffeine-addled junior engineer who will happily rewrite your authentication middleware at 3 AM, completely ignore edge cases, and break production because they didn't write a single test.

The honeymoon phase of asking a chatbot to "build a React app" and watching magic happen in the terminal is over. Serious engineering teams are hitting the wall. The context windows are massive, the models are undeniably smart, and the initial speed is intoxicating. But the methodology the industry has adopted by default is broken. We are prioritizing generation speed over structural integrity, and it is costing us dearly in technical debt.

Enter EvanFlow. Yes, the creator named it after himself. Yes, Hacker News had a field day roasting the ego involved, drawing immediate parallels to other eponymously named tech stacks. But if you can get past the self-indulgent branding, EvanFlow solves the single biggest problem with agentic coding today: the complete and utter lack of an iterative, test-driven feedback loop.

## The Problem with Naked Prompts

When you fire up Claude Code, Cursor, Aider, or any other agentic CLI, the default behavior is raw, unconstrained execution. You state a problem. The AI generates the implementation. If you are lucky, it works on the first try. If you aren't (which is the reality for any non-trivial codebase), you spend the next two hours untangling a hallucinated API integration that relies on npm packages that haven't been updated since 2018.

The core issue is that LLMs are fundamentally people-pleasers. They are eager to give you the finished code immediately because their reinforcement learning has taught them that delivering the final answer yields the highest reward. They skip the planning phase. They skip the edge cases. They absolutely skip writing tests, unless you hold a proverbial gun to their head.

A staff engineer on Reddit recently summarized this after months of painful iteration: the only workflow that actually produces maintainable, production-ready code with AI is strict Test-Driven Development (TDD). You have to force the agent into a Red-Green-Refactor loop. If you don't, you are just accumulating technical debt at the speed of compute. You end up with "spaghetti code generated at lightspeed," which requires a senior engineer to spend days deciphering and fixing.

## The EvanFlow Architecture

EvanFlow isn't a new foundation model or a standalone commercial application. It is a highly opinionated framework of 16 cohesive Claude Code skills, orchestrated through careful system prompts and terminal intercepts. These skills form a rigid pipeline. They prevent the agent from rushing to the implementation phase.

The pipeline walks an idea through four distinct, non-negotiable phases: Brainstorm, Plan, Execute, and Iterate. Most importantly, it inserts hard human-in-the-loop checkpoints that the AI cannot bypass.

### The 16-Skill Pipeline

Instead of a single "do the work" prompt, EvanFlow breaks the cognitive load into discrete, atomic operations. By forcing the AI to use specific skills for specific phases, it prevents context contamination.
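Conceptually, you can picture the pipeline as a small state machine with hard gates between phases. The sketch below is a hypothetical illustration in TypeScript; the type names, skill names, and fields are invented for this post and are not EvanFlow's actual skill schema.

```typescript
// Hypothetical sketch of phase gating in an EvanFlow-style pipeline.
type Phase = "brainstorm" | "plan" | "execute" | "iterate";

interface PhaseGate {
  phase: Phase;
  // Skills the agent is allowed to invoke while inside this phase.
  allowedSkills: string[];
  // Hard stop: the pipeline cannot advance until a human signs off.
  requiresHumanApproval: boolean;
}

const pipeline: PhaseGate[] = [
  { phase: "brainstorm", allowedSkills: ["analyze-architecture", "draft-test-specs"], requiresHumanApproval: true },
  { phase: "plan",       allowedSkills: ["break-into-atomic-tasks"],                  requiresHumanApproval: true },
  { phase: "execute",    allowedSkills: ["tdd-integration", "checkpoint-commit"],     requiresHumanApproval: false },
  { phase: "iterate",    allowedSkills: ["refactor-with-safety-net", "checkpoint-commit"], requiresHumanApproval: false },
];

// The agent may only move past a gate once any required human sign-off exists.
function canAdvance(gateIndex: number, humanApproved: boolean): boolean {
  const gate = pipeline[gateIndex];
  return !gate.requiresHumanApproval || humanApproved;
}
```

The important property is the approval flag on the early phases: the agent cannot talk its way past it.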
Here is what the execution phase looks like under the hood when initiating a new feature:

```bash
# A typical EvanFlow initialization in a new feature branch
claude run evanflow:init --feature "user-auth-oauth2"

# The framework intercepts and forces the planning phase
# Output:
> Running skill: evanflow-brainstorm
> Analyzing current architecture...
> Scanning existing authentication paradigms...
> Generating test cases...
> STOP: Awaiting human approval on test specifications.
```

The framework forces Claude to stop and ask for permission before a single line of implementation code is written. It demands that the test specifications be explicitly approved. If you skip this, the AI will build whatever is easiest for the AI, not what is correct for your system.

## Forcing the Red Phase

The absolute magic of EvanFlow lies in its strict enforcement of the "Red" phase of TDD. As software engineering instructor Steve Kinney pointed out in his AI development courses, you have to be explicitly clear with LLMs that you are doing TDD. Otherwise, they will write the test and the implementation at the exact same time, entirely defeating the purpose of the methodology. Worse, they will write weak mock implementations or tautological tests just to make the suite pass on the first try without doing any real work.

EvanFlow uses custom skill prompts to enforce failure. It expects the test to fail, and it validates that failure. Here is a simplified version of what a strict TDD enforcement skill looks like in the EvanFlow ecosystem:

```yaml
---
name: tdd-integration
description: Enforce strict Red-Green-Refactor cycle using integration tests.
trigger: "When implementing new features or functionality"
instructions: |
  1. Write the failing integration test FIRST. Do not write any implementation code.
  2. Run the test suite using the project's testing command.
  3. Ensure it FAILS. Do not proceed until you see the failure output.
  4. Output the exact error message and stack trace to the console.
  5. ONLY THEN write the minimal implementation code required to make the test pass.
  6. Run the test suite again.
  7. If green, proceed to refactor for performance and readability. If red, fix the implementation, not the test.
---
```

If the agent tries to write the implementation alongside the test, the framework rejects the output. It forces the agent to experience the failure.

## The Psychology of the LLM: Why Context Anchoring Works

Why go through all this trouble just to make an AI fail? It comes down to how attention mechanisms work in transformer models.

When an LLM generates code in a vacuum, it relies entirely on its training weights and the abstract prompt you provided. This is a highly probabilistic state where hallucinations thrive. However, when an LLM runs a test and receives a stack trace (a literal roadmap of exactly what is broken), that error output anchors the model's context window in reality.

The stack trace acts as a localized constraint. It forces the LLM's attention heads to focus strictly on resolving the specific error (e.g., `TypeError: Cannot read properties of undefined (reading 'id')`) rather than dreaming up a generalized, abstract solution. By forcing the "Red" phase, EvanFlow weaponizes the AI's natural ability to parse logs and fix bugs. You are effectively tricking the AI into debugging mode, which is historically where LLMs perform substantially better than in greenfield generation mode.
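To make that anchoring concrete, here is a hedged sketch of what a Red-phase test might look like in a Jest/TypeScript project. The module, function, and values are invented for illustration; they are not taken from EvanFlow or the examples above.

```typescript
// Hypothetical Red-phase test. Assume ./userService currently exports only a
// loosely typed stub of getUser that resolves to undefined, so the suite runs
// but the behavior does not exist yet.
import { getUser } from "./userService";

test("returns the user's id for a known user", async () => {
  const user = await getUser("u_123");
  // With only the stub in place, `user` is undefined, so this line throws:
  //   TypeError: Cannot read properties of undefined (reading 'id')
  expect(user.id).toBe("u_123");
});
```

The runner reports that failure with a file path and line number, and that concrete output is what gets fed back into the agent's context.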
When it sees the actual stack trace from the failing test, its subsequent implementation code is dramatically more accurate.

## The Feedback Loop

We have all watched an AI agent get stuck in a death spiral. It writes a bad function, the linter yells at it, it writes a worse function to fix the linter, the compiler crashes, it attempts to downgrade a core dependency to fix the compiler, and suddenly your `package.json` is missing and your project won't boot. EvanFlow mitigates this by restricting the agent's scope during the "Iterate" phase and enforcing a strict state machine.

### Atomic Commits and Checkpoints

The framework relies on highly granular version control. Every time the agent successfully reaches the "Green" phase of the TDD loop, EvanFlow triggers a checkpoint.

```javascript
// Example of a generated checkpoint commit via EvanFlow
import { execSync } from 'child_process';

function triggerCheckpoint(featureName, step) {
  try {
    // Stage everything and record the green state as an atomic commit
    execSync(`git add .`);
    execSync(`git commit -m "chore(evanflow): checkpoint ${step} for ${featureName} - tests passing"`);
    console.log(`Checkpoint saved. Safe to proceed to refactor phase.`);
  } catch (error) {
    // Halt rather than continue on top of an unknown repository state
    console.error(`Failed to create checkpoint. Halting agent to prevent state corruption.`);
    process.exit(1);
  }
}
```

If the agent goes off the rails during the refactor phase (which happens frequently when AI tries to get "clever" with performance optimizations), you don't lose the implementation. You just roll back to the last green checkpoint. This turns AI coding from a high-stakes gamble into a predictable, manageable, and auditable process.

## Overcoming the "Token Burn" Objection

The most common criticism of EvanFlow and similar TDD-driven AI workflows is the cost. Running a model through Brainstorm, Plan, Test (Fail), Implement, Test (Pass), and Refactor burns significantly more tokens than a single zero-shot generation prompt. Some developers look at their API bill and balk: "Why am I spending $0.40 on a single function when I could get it for $0.05?"

This is a fundamental misunderstanding of developer economics. If a zero-shot prompt gives you code that takes a human engineer 45 minutes to review, debug, and fix, that code didn't cost $0.05. It cost $0.05 in compute, plus $75 in engineering time. EvanFlow trades compute for reliability. Burning an extra $0.35 in API tokens to guarantee that the code is covered by tests, adheres to existing architectural patterns, and actually works is one of the highest-ROI investments you can make in your engineering workflow. Tokens are cheap; human attention and debugging time are incredibly expensive.

## Framework Comparison

How does this stack up against the alternatives? Let's look at the data based on community benchmarks and qualitative team feedback.

| Feature | Naked Claude Code | Custom System Prompts | EvanFlow Framework |
| :--- | :--- | :--- | :--- |
| **Test Coverage** | ~15% (often hallucinated) | ~40% (inconsistent) | 95%+ (enforced) |
| **Implementation Speed** | Very fast | Moderate | Slow (but accurate) |
| **Human Intervention** | High (debugging garbage) | Medium | Low (only at checkpoints) |
| **Code Architecture** | Spaghetti | Variable | Highly structured |
| **State Revertability** | None | Manual Git commands | Automated checkpoints |
| **Context Degradation** | High (forgets early instructions) | Medium | Low (resets at checkpoints) |

Standard usage gives you the illusion of speed. You get 500 lines of code in ten seconds, but you spend three days debugging it. EvanFlow is intentionally slower.
It forces the agent to read, think, write tests, run tests, fail, write code, run tests, pass, and commit. It treats the AI like a junior developer who needs constant, rigid supervision, which is exactly how we should be treating these models.

## Practical Implementation: Step-by-Step

You don't need to adopt the entirety of EvanFlow's 16 skills to see the benefits. You can cherry-pick the core philosophy and integrate it into your existing workflow today. Here is a practical, step-by-step guide to manually enforcing this workflow in your AI CLI of choice.

**Step 1: The Contract Prompt**

Never ask for a feature. Ask for a test.

*Instead of:* "Build a Stripe webhook handler for failed payments."

*Use:* "Write a Jest test suite for a Stripe webhook handler that processes 'invoice.payment_failed'. Mock the Stripe event object. Do not write the handler yet."

**Step 2: The Red Verification**

Force the AI to run the test and prove it fails.

*Prompt:* "Run the Jest suite. Output the failure. Confirm that it is failing because the handler does not exist."

**Step 3: The Minimal Implementation**

Once the failure is anchored, ask for the code.

*Prompt:* "Now, write ONLY the minimal code required in `stripeHandler.ts` to make this specific test pass. Do not add extra features."

**Step 4: The Green Verification**

Make the AI run the test again to prove its code worked.

*Prompt:* "Run the test suite again. If it passes, commit the code with the message 'test: stripe webhook handler passing'."

**Step 5: The Refactor**

Now that you have a safety net, you can ask the AI to clean up the code, extract constants, or improve typings. Because the test exists, the AI can safely modify the code and re-run the test to ensure it didn't break anything.

## Scaling EvanFlow for Enterprise Teams

Adopting this methodology at an individual level is straightforward, but what happens when a team of twenty engineers starts using TDD-driven AI?

For enterprise teams, EvanFlow's concepts shine when integrated directly into the CI/CD pipeline. By enforcing AI checkpoints as discrete Git commits, code reviewers can easily trace the AI's "thought process." Instead of reviewing a single massive pull request containing 2,000 lines of AI-generated code, a senior engineer can step through the atomic commits:

1. `test: add specs for user login`
2. `feat: implement minimal user login`
3. `refactor: extract token generation to utility`

This dramatically lowers the cognitive burden on human reviewers. Furthermore, because EvanFlow guarantees high test coverage by default, CI/CD pipelines fail less frequently due to unhandled edge cases. It bridges the gap between AI generation and enterprise-grade compliance.

## The Developer Experience

Using a TDD-driven agent feels incredibly weird at first. You are spending tokens and time watching an AI fail on purpose. You will see red text in your terminal. You will watch it struggle to parse its own error messages. You will feel an overwhelming urge to jump in and just write the function yourself.

But then, it clicks. The agent stops guessing. It stops writing massive monolithic functions. Because it is forced to write tests first, it naturally writes decoupled, testable code. The functions get smaller. The dependencies get clearer. The architecture improves automatically, simply because writing tests for bad architecture is inherently difficult, and the AI will naturally seek the path of least resistance (clean, modular code) to make the test pass.
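As a small illustration of the shape that test-first pressure tends to produce, consider a unit whose dependency is injected so the test can run without any real infrastructure. This is an invented sketch with hypothetical names (a Jest-style test is assumed); it is not output from EvanFlow.

```typescript
// Hypothetical example: dependency injection falls out naturally because the
// test has to exist before the implementation does.
interface PaymentGateway {
  charge(customerId: string, amountCents: number): Promise<{ ok: boolean }>;
}

// Small, single-purpose function with its dependency passed in.
export async function chargeCustomer(
  gateway: PaymentGateway,
  customerId: string,
  amountCents: number
): Promise<boolean> {
  if (amountCents <= 0) return false;
  const result = await gateway.charge(customerId, amountCents);
  return result.ok;
}

// The test that forced this shape: no real payment client, just a stub.
test("declines non-positive amounts without charging the gateway", async () => {
  const stub: PaymentGateway = { charge: async () => ({ ok: true }) };
  expect(await chargeCustomer(stub, "cus_123", 0)).toBe(false);
});
```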
You stop feeling like a babysitter and start feeling like an orchestra conductor.

## Actionable Takeaways

If you want to stop babysitting your AI agents and start shipping reliable code, you need to change your methodology immediately.

* **Ban direct implementation prompts:** Stop asking agents to write features. Ask them to write failing tests for features that don't exist yet. This is the single most important change you can make.
* **Enforce the Red Phase:** Do not allow the agent to write the implementation until it has output the stack trace of the failing test. This anchors the context window in reality.
* **Implement Checkpoints:** Use a framework like EvanFlow or write a simple bash script that commits the code automatically every time the test suite goes green. Never let the agent refactor without a safety net.
* **Restrict Scope:** Break your 16-step plans down into atomic units. An agent should never be working on more than one failing test at a time.
* **Accept the Overhead:** TDD with AI is slower in the short term. It burns more tokens. Accept this cost. The time saved in debugging and architectural rework more than pays for it.

## Frequently Asked Questions (FAQ)

**Q: Does EvanFlow work with languages or frameworks that don't have great testing ecosystems?**

A: It can, but it is significantly harder. EvanFlow thrives in ecosystems like TypeScript/JavaScript (Jest/Vitest), Python (pytest), and Ruby (RSpec), where testing frameworks are mature and output clear, parsable stack traces. If your testing framework outputs ambiguous errors, the AI will struggle to anchor its context.

**Q: Can I use EvanFlow concepts with UI/frontend development?**

A: Yes, but you must shift your focus from visual testing to behavioral testing. You cannot easily TDD a CSS layout with an AI. However, you can absolutely TDD the state management, the API fetching, and the component rendering logic using tools like React Testing Library. Force the AI to test the *behavior* (e.g., "clicking the button shows a loading state"), not the pixel perfection.

**Q: What if the AI writes a bad test?**

A: This is why EvanFlow includes a human-approval checkpoint after the test generation phase. The human developer's primary job shifts from writing code to reviewing test specifications. If the test is flawed, the resulting implementation will be flawed. You must intervene here.

**Q: Isn't TDD dead? Why force AI to use an outdated methodology?**

A: TDD has fallen out of favor at some fast-paced startups because it is time-consuming for humans. However, AI operates at a different speed and suffers from different flaws (hallucinations, lack of long-term memory). TDD is the perfect counterweight to an LLM's chaotic generative nature. It provides the rigid boundaries that LLMs desperately need.

**Q: Do I actually need to install the EvanFlow framework?**

A: No. The "framework" is essentially a set of system prompts and behavioral guidelines. While the specific tooling provides a nice wrapper, the core philosophy (prompting for tests, forcing failure, and committing on green) can be followed in standard Claude Code, Cursor, or even a web-based ChatGPT session if you are willing to copy and paste.

## Conclusion

The tooling surrounding AI-assisted software development will continue to evolve. The models will get larger and faster, and context windows will stretch into the millions of tokens. But the fundamental rules of software engineering remain unchanged. If it isn't tested, it's broken.
Relying on an AI to generate thousands of lines of untested implementation code is a recipe for legacy debt. By adopting the principles of EvanFlow and forcing your AI agents into a strict Test-Driven Development loop, you transform them from erratic junior developers into reliable, systematic engineering tools.

Stop letting your AI agents pretend that tests don't matter, and start building software that lasts.