Thursday, December 11, 2025

From One-Off Prompts to an Autonomous AI Team: Inside ClubHub’s LLM Worker Architecture

Imagine kicking off a new feature and having a team of AI colleagues rally to design, code, test, and review it—while you watch the progress over a cup of coffee. Just a year ago, that idea felt like sci-fi. Early experiments with a single AI assistant often produced “almost there” code that still needed lots of hand-holding. Fast forward to today: at ClubHub, we’ve evolved those one-off attempts into a structured AI Worker architecture – essentially an internal team of LLM-based agents with specialized roles, a test-driven workflow, and cross-checks that mimic a well-oiled engineering team. The result? Our AI teammates now contribute reliable, production-ready code with minimal human intervention.

Why share this now? Because we see this as more than a cool internal hack – it’s a glimpse into the future of engineering collaboration. Cross-reviewed, test-driven AI development has transformed our workflow, and we suspect it could transform others. This post outlines our journey, design, and the big implications. If you’re curious about integrating AI into your dev team (or already doing so), read on – and let’s compare notes!

From Unreliable Beginnings to Test-Driven Autonomy

Like many developers, we started by asking a single large language model (LLM) to write code for us. We’d prompt it with a feature idea or bug description, and it would generate code in one go. Sometimes the results were impressive, but just as often they were brittle or off-base. There was no guarantee the code actually worked in our codebase context, and we had to scrutinize every line. In short, early one-off prompt attempts felt like junior dev outputs – requiring a senior engineer’s time to review and fix. The potential was there, but trust wasn’t.

The turning point came when we treated AI not as a magic code generator, but as a team of collaborators. We asked: what if, instead of one AI trying to do everything in a single step, we gave it a structured process like an engineering team uses? Real software teams have roles (architects, developers, testers, reviewers) and a process (design → code → test → review) that catches errors and ensures quality. We decided to have our AI follow a similar multi-step pipeline (emergentmind.com). This shift—from single-shot to assembly line—was the genesis of our AI Worker architecture.

Crucially, we made it test-driven and review-driven. In human teams, nobody merges code that fails tests or lacks review approval; we imposed the same discipline on our AI. Early prototypes of this pipeline were immediately more reliable: even if the AI wrote imperfect code on the first pass, the “Test Engineer” agent would catch the issues, and the AI could iterate to fix them before a human ever got involved. Our confidence grew with each success. Over time, what started as a hacky script evolved into a robust system where multiple AI agents cooperate under guardrails, each agent checking the others’ work. We were no longer crossing our fingers hoping the AI’s output was good – we knew it was good, because it had passed all the same gates we require of human code.

Assembling an AI Team: Roles and Responsibilities

To mirror a real dev team, we structured our AI into distinct roles, each with clear responsibilities (emergentmind.com). Together, they operate as a cohesive unit, handing off work just like specialists on a project. Here are the key players in our AI team:

  • Product Explainer – This agent acts as the “product manager” for the AI team. It takes high-level input (a feature request, user story, or bug report) and clarifies the requirements in detail. The Product Explainer ensures the problem is well-defined, disambiguates any unclear points, and might even restate the goal in simpler terms to set the team up for success.

  • Architect – Once the requirements are clear, the Architect agent steps in. It proposes a solution approach or design: which modules or components to touch, high-level logic, and any relevant patterns. Essentially, it drafts a mini technical design doc for the change. This keeps the development focused and aligned with our overall architecture.

  • Feature Developer – The Feature Developer agent is the coder of the bunch. With the requirements and an outline in hand, it writes the actual code to implement the feature or fix. It generates new functions or modifies existing code, aiming to fulfill the spec provided by the Product Explainer and Architect. This agent is akin to a software engineer writing a PR.

  • Test Engineer – As the code is written (or sometimes even in parallel), the Test Engineer agent creates and runs tests. It might write new unit tests or scenarios to validate the feature. Importantly, this agent doesn’t take the code’s word for it – it checks that the code does what the Product Explainer outlined. If the new code breaks existing tests or fails new ones, the Test Engineer flags it.

  • Reviewer – Even after code passes tests, we enlist a Reviewer agent to double-check the quality. This agent reviews the code diff like a human reviewer: looking for correctness, edge cases, coding style, compliance with best practices, and whether the changes truly solve the described problem. The Reviewer provides an assessment and can request changes if something seems off.

  • Orchestrator – Overseeing the whole process is the Orchestrator (think of it as a tech lead or project manager). The Orchestrator agent coordinates all the other agents, making sure each handoff happens in the right order and that everyone’s output meets criteria. It feeds the Product Explainer’s clarified spec to the Architect, the Architect’s plan to the Developer, the Developer’s code to the Tester and Reviewer, and so on. The Orchestrator decides when to loop back – for example, if tests fail or the Reviewer is unhappy, the Orchestrator might invoke the Developer agent again to address the issues. It’s the glue that holds the pipeline together and ensures the AI team’s work reaches “done done” status.
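To make the division of labor concrete, here is a minimal sketch of how such role definitions might look in code. Everything here (AgentRole, run_agent, the call_llm parameter) is illustrative; our real definitions live in ai-project-hub and carry far more context than a one-line system prompt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    system_prompt: str  # the role’s standing instructions

PRODUCT_EXPLAINER = AgentRole(
    name="Product Explainer",
    system_prompt=(
        "You are the product manager. Restate the request as a precise, "
        "unambiguous spec with explicit acceptance criteria."
    ),
)

ARCHITECT = AgentRole(
    name="Architect",
    system_prompt=(
        "You are the architect. Given a spec, produce a short design: "
        "modules to touch, high-level logic, and relevant patterns."
    ),
)

# ...Feature Developer, Test Engineer, and Reviewer are defined the same way.

def run_agent(role: AgentRole, task_input: str, call_llm) -> str:
    """Run one role against its input and return the artifact it produces.

    `call_llm` is a placeholder for whatever LLM client you use: it takes
    (system_prompt, user_input) and returns the model's text.
    """
    return call_llm(role.system_prompt, task_input)
```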

This division of labor meant each agent could focus on its specialty. By not asking one LLM to juggle everything at once, we reduced complexity and cognitive load. Just as importantly, it introduced accountability at each step. Each agent essentially checks or expands on the work of the previous: the Architect validates understanding of requirements, the Developer’s code is guided by the design, the Test Engineer validates the code, and the Reviewer scrutinizes everything. It’s a system of checks and balances entirely within the AI realm, supervised by the Orchestrator.

Cross-Review and Test Coverage: Enabling Trust and Autonomy

Establishing these roles was step one. Step two was ensuring they actually uphold quality. We treated failing tests and critical review comments as a hard gate: the AI must address them before moving forward. In practice, this means the Orchestrator will not consider the task complete (and certainly won’t merge any code) until all tests pass and the Reviewer signs off. If a test fails, the Orchestrator sends the issue back to the Developer agent to fix the code (or sometimes to the Test Engineer to refine the test, if the test was flawed). Similarly, if the Reviewer finds a bug or a poor implementation, the work goes back to the Developer for another pass. This loop can repeat for several cycles of refine-and-check, just like an iterative code review process between humans.
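In rough code terms, the gate is a bounded loop: nothing leaves the pipeline until tests and review both pass. The shape below is a sketch; run_tests, run_review, and revise_code stand in for the real agent invocations.

```python
from dataclasses import dataclass, field

class EscalateToHuman(Exception):
    """Raised when the AI team cannot clear the gates on its own."""

@dataclass
class TestReport:
    passed: bool
    failures: list = field(default_factory=list)

@dataclass
class Review:
    approved: bool
    comments: list = field(default_factory=list)

# Placeholder agent invocations -- in the real system these call the
# Test Engineer, Reviewer, and Feature Developer agents respectively.
def run_tests(code: str) -> TestReport: ...
def run_review(spec: str, code: str) -> Review: ...
def revise_code(code: str, feedback: list) -> str: ...

MAX_ITERATIONS = 5  # bounded retries before a human gets flagged

def develop_with_gates(spec: str, code: str) -> str:
    """Iterate until tests pass AND the Reviewer approves, or escalate."""
    for _ in range(MAX_ITERATIONS):
        report = run_tests(code)                       # Test Engineer gate
        if not report.passed:
            code = revise_code(code, report.failures)  # back to the Developer
            continue
        review = run_review(spec, code)                # Reviewer gate
        if review.approved:
            return code                                # merge-ready: “done done”
        code = revise_code(code, review.comments)
    raise EscalateToHuman("gates not cleared within the iteration budget")
```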

This cross-review dynamic—AI agents reviewing and testing each other’s work—turned out to be the key to trusting the system. It’s one thing to have an AI churn out code; it’s another to have multiple AI agents in agreement that the code is solid. When the Test Engineer and Reviewer agents both give a thumbs-up, it starts to feel equivalent to “all tests green and peer-approved” in a normal dev process. We could then allow the code to be automatically merged or deployed with much greater confidence.

Research into AI agents has noted the value of such self-reflection and peer review loops. By having agents critique or verify each other’s outputs, errors and hallucinations drop significantly (emergentmind.com). We observed the same: the code quality from the AI team with cross-checks was dramatically higher than from lone-hero AI attempts. Simple bugs (think off-by-one errors, missing null checks) that a single-pass AI might miss were caught by the test cases. Logical design slips (like an incomplete understanding of a requirement) that the Developer agent might make were often caught by the Reviewer agent’s analysis. Each agent brought a different perspective, and together they covered each other’s blind spots.

Test coverage played a pivotal role in autonomy. Because our AI team knows it cannot proceed without passing all tests, it actually started writing tests proactively to prove its work. In some cases, the Product Explainer or Architect would suggest acceptance criteria that the Test Engineer turned into tests, guarding against requirement drift. This test-driven mindset meant the AI wasn’t just generating code, it was verifying code. The combination of high test coverage and cross-agent review gave us the courage to let the AI system work on bigger tasks with less oversight. It’s akin to a new developer earning trust by consistently passing CI checks and code reviews – you gradually give them more leeway. By the same token, as our AI agents consistently produced solid results, we treated them more like autonomous colleagues than tools.

Perhaps the ultimate sign of trust: we’ve had instances where the AI team’s code changes were merged into a feature branch and deployed to staging – without a human manually editing the code at all. Of course, engineers still did a quick final look (old habits die hard!), but the fact that nothing had to be changed is a testament to how far the internal quality controls had come. Cross-review and test gating turned the AI from an unpredictable junior dev into a reliable autonomous contributor.

The AI Worker Model: Design and Orchestration Patterns

Designing the orchestration of these AI agents was a project unto itself. We didn’t just throw a bunch of prompts together; we built what we now call the AI Worker model with careful consideration for how information flows and decisions are made. At its heart is a pattern of staged, linear orchestration – essentially a pipeline that mirrors our development workflow, from requirements to design to code to test to review (emergentmind.com). This was intentional. A sequential pipeline is easy to reason about and aligns with our existing CI/CD stages. Each stage produces artifacts: the Product Explainer produces a clarified spec, the Architect produces a design outline, the Developer produces code diffs, the Test Engineer produces test results, and the Reviewer produces a code critique. The Orchestrator moves these artifacts along like work items on a kanban board, ensuring each is completed before the next begins.
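Seen from the Orchestrator’s side, the pipeline is little more than an ordered list of stages, each consuming the previous artifact and producing the next. A sketch with illustrative names, not the real ai-project-hub interfaces:

```python
# Illustrative stage list -- each stage consumes the previous artifact
# and produces the next; roles match those described above.
PIPELINE = [
    ("Product Explainer", "clarified spec"),
    ("Architect",         "design outline"),
    ("Feature Developer", "code diff"),
    ("Test Engineer",     "test results"),
    ("Reviewer",          "code critique"),
]

def run_pipeline(request: str, run_stage) -> dict:
    """Move a work item through every stage, keeping each artifact.

    `run_stage(role, input)` is a placeholder for invoking one agent.
    The returned dict is the traceability record: for any final diff,
    you can look up the spec and design that led to it.
    """
    artifacts = {"request": request}
    current = request
    for role, artifact_name in PIPELINE:
        current = run_stage(role, current)  # Orchestrator hand-off
        artifacts[artifact_name] = current
    return artifacts
```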

We chose a mostly deterministic hand-off style (sometimes called waterfall or assembly-line in agent literature) because it offered clarity and reproducibility (emergentmind.com). Alternative orchestration patterns exist – for example, having multiple agents debate and vote on solutions, or running roles in parallel – but those introduce complexity we didn’t need initially. By keeping it linear, we could easily trace why a certain piece of code was written (just check the Architect’s plan and Product Explainer’s notes) and why it was accepted (tests X, Y, Z passed and the Reviewer approved). This traceability is gold for auditability, which I’ll touch on shortly.

Within this pipeline, we built in a feedback loop for revisions. The Orchestrator isn’t blindly funneling outputs downstream; it’s evaluating at key checkpoints. If an agent’s output is unsatisfactory (e.g. the design is missing a key use case, or the code diff is too large and unwieldy), the Orchestrator can prompt that agent (or a different one) to refine the output before progressing. In essence, the pipeline can pause and backtrack if needed, which is important for complex tasks. We found that a little upfront correction (say, revising the architectural approach early on) can save a lot of rework later in the pipeline.

Another design decision was to abstract this entire AI team into a reusable module. We realized early that once we got the kinks worked out, we’d want to use AI workers on multiple projects. Instead of copy-pasting pipeline code in every repo, we created a separate repository called ai-project-hub that encapsulates the orchestration logic and role definitions. Think of it as our internal AI-as-a-service toolkit. Any codebase at ClubHub can include ai-project-hub and with a bit of configuration (providing domain-specific context or libraries) spin up its own squad of AI agents. This abstraction has been extremely useful: it standardizes how the AI team interacts with our code (through well-defined interfaces), and updates to the core logic (like improving the Reviewer’s prompt or upgrading the LLM model) can be rolled out across all uses. In practice, ai-project-hub provides the templates, state management, and integration hooks (for version control, CI, etc.), while each individual project provides the specific goals and repository context. We can imagine open-sourcing parts of this hub in the future, because it’s not tied to our proprietary code – it’s an orchestration framework that could benefit others building AI teams.
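To give a feel for the abstraction, a consuming repository might instantiate its squad roughly like this. The import path, class name, and parameters are all hypothetical; the actual ai-project-hub interface is internal.

```python
# Hypothetical usage -- the real ai-project-hub interface is internal
# and differs in detail.
from ai_project_hub import AIWorkerTeam

team = AIWorkerTeam(
    repo_root=".",
    context_files=["docs/architecture.md", "CONTRIBUTING.md"],
    model="<your LLM of choice>",  # upgradable in one place for all repos
)

task = team.submit("Add rate limiting to the invite endpoint")
print(task.status)  # current stage, artifacts, and gate results
```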

Designing the AI Worker architecture also meant considering failure modes and safeties. What if the agents get stuck in a loop (e.g., two agents passing a task back and forth)? What if the LLM outputs start drifting off-topic? We implemented timeouts and sanity-checks at the Orchestrator level. For example, if a certain number of iterations fail to produce passing tests, the Orchestrator will flag a human for help. Similarly, the Orchestrator can detect if the “conversation” starts repeating or going in circles. These guardrails ensure that while the AI team is autonomous, it doesn’t spin out of control or waste too many cycles on a hopeless task.
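The loop detection can be surprisingly simple: hash each agent output and flag an exact repeat, alongside a hard iteration budget. A minimal sketch, with all names assumed:

```python
import hashlib

class LoopGuard:
    """Flags runs that start repeating themselves or exceed their budget."""

    def __init__(self, max_iterations: int = 8):
        self.max_iterations = max_iterations
        self.iterations = 0
        self.seen = set()  # digests of every agent output so far

    def check(self, agent_output: str) -> None:
        self.iterations += 1
        digest = hashlib.sha256(agent_output.encode()).hexdigest()
        if digest in self.seen:
            raise RuntimeError("repeated output: agents may be circling")
        if self.iterations > self.max_iterations:
            raise RuntimeError("iteration budget exhausted: flag a human")
        self.seen.add(digest)
```

Exact-match hashing only catches verbatim repeats; near-duplicates and topical drift take fuzzier checks, but even a crude guard like this stops the most common failure mode.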

Lastly, our architecture heavily logs everything for transparency. Every prompt, response, code diff, and test result is recorded. This not only helps us audit and debug the AI’s actions, but it also serves as a training dataset of sorts for future improvements. By reading the logs, we’ve learned a ton about where the AI still struggles (e.g., misunderstanding a nuanced requirement) and we feed those insights into better prompts or occasional model fine-tuning. The logs also reassure team members and stakeholders: we can always explain what decision the AI made and why, because we have the full trace.
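As a sketch, one structured record per event in an append-only JSON-lines trace is enough to reconstruct a full run (field names illustrative; our real sink is richer):

```python
import json
import time

def log_event(role: str, event: str, payload: str,
              trace_path: str = "trace.jsonl") -> None:
    """Append one structured record per prompt, response, diff, or test run."""
    record = {
        "ts": time.time(),   # when it happened
        "role": role,        # which agent acted
        "event": event,      # e.g. "prompt", "response", "diff", "test_result"
        "payload": payload,  # the artifact itself
    }
    with open(trace_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```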

Developer Experience and Workflow Integration

All these fancy AI agents would be moot if they didn’t play nicely with our developers’ day-to-day workflow. We treated developer experience (DevEx) as a first-class concern in this project. Our engineers shouldn’t feel like they have to wrangle a complex AI pipeline; it should feel like a helpful teammate integrated into familiar tools.

One way we achieved this was through CI/CD integration. We hooked the AI Worker pipeline into our continuous integration system so that triggering the AI is as simple as, say, labeling a pull request or commenting on an issue. For example, a developer can write up a feature request in our issue tracker and add a tag like “/AI-Assisted”. The Orchestrator (listening via a bot) will pick it up, spin up the necessary agent team, and start working on a solution branch. Throughout the process, it posts updates in the issue or PR thread – much like a human would. You might see a comment, “Architect: Proposed design uploaded,” followed by “Feature Developer: Code committed to feature/ai-123,” and then “Test Engineer: All tests passed.” By the time it comments “Reviewer: Code looks good. Ready for human review/merge,” the developer is presented with a fully fleshed out solution to approve or give feedback on. This tight integration means using the AI doesn’t feel like using a separate tool – it’s part of our development lifecycle, accessible through the same Git and CI interfaces we use every day.
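Under the hood, the trigger can be as plain as a webhook listener that matches the label and hands the issue to the Orchestrator. A simplified, stdlib-only sketch; the payload shape, port, and start_ai_pipeline helper are all illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

TRIGGER_LABEL = "/AI-Assisted"  # the tag our bot watches for

def start_ai_pipeline(issue_id, body):
    """Placeholder: hands the issue off to the Orchestrator."""
    print(f"spinning up the AI team for issue #{issue_id}")

class IssueWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        labels = [lbl.get("name") for lbl in event.get("labels", [])]
        if TRIGGER_LABEL in labels:
            start_ai_pipeline(event.get("id", 0), event.get("body", ""))
        self.send_response(204)  # acknowledge; the work continues async
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), IssueWebhook).serve_forever()
```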

We also focused on making the AI team configurable and safe by default. Developers can constrain the scope of what the AI works on (e.g., limit it to certain paths in the repository or a diff size limit) by configuration in the ai-project-hub settings for that repo. This prevents an overeager AI from refactoring the whole codebase when you just wanted a small change! We learned that giving human developers some control knobs (like “please only write tests” or “only suggest changes, don’t commit”) eased adoption and trust. When engineers feel in control, they’re more likely to embrace the AI help rather than see it as a rogue element.
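Those control knobs might look something like this per-repo snippet (key names hypothetical):

```python
# Hypothetical per-repo control knobs (key names illustrative).
AI_WORKER_SETTINGS = {
    "allowed_paths": ["src/billing/", "tests/billing/"],  # constrain scope
    "max_diff_lines": 300,         # reject oversized changes
    "mode": "suggest_only",        # or "tests_only", "full_pipeline"
    "auto_commit": False,          # never commit without a human
}
```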

Auditability, as mentioned, was a huge part of developer acceptance. Anyone can inspect the AI’s work logs if something seems off. In fact, we expose an AI report for each task: a summary of what each agent did, any issues encountered and resolved, and pointers to the diff and tests. It’s like a mini documentation of how the feature was developed. This turned out to be useful not just for trust, but for onboarding new team members as well – they can see how the AI would approach a problem, which often is a distilled best-practices version of our coding standards (after all, we trained/tuned it on our style and used our standards in the prompts).

From a developer’s perspective, working with the AI team now feels less like “using a tool” and more like collaborating with a teammate. We’ve heard developers joke that it feels like there’s an extra engineer who writes code in the background and politely asks for reviews. That’s exactly the vibe we were going for. When the process is smooth, the AI fades into the background and you just focus on higher-level decisions: picking which tasks the AI should handle and refining the requirements or edge cases upfront so it can do a good job. Everything else – the grunt work of churning out code and tests – happens automatically under the hood.

Why This Shift Matters for the Future of Engineering Collaboration

Stepping back, the reason we’re excited about this AI Worker architecture isn’t just the immediate productivity boost (though that has been significant). It’s the potential paradigm shift in how software is built. We’ve essentially onboarded non-human team members that can handle substantial portions of development work. This changes the equation for collaboration: teams can be smaller, or tackle more ambitious projects with the same headcount, and human engineers can focus more on creative, complex problem-solving while delegating routine or boilerplate tasks to AI agents.

It also opens the door to a more continuous development process. With AI agents able to respond quickly to new tasks (even at 3 AM on a Sunday), the idea of 24/7 development becomes feasible. Imagine a future where a product manager’s request on Friday night is designed, coded, and tested by an AI over the weekend, ready for the human team’s review on Monday morning. We’re not fully there yet, but we’ve seen enough glimmers of this to believe it’s coming. This kind of human-AI collaboration could accelerate innovation dramatically.

From an organizational standpoint, embracing AI teammates encourages us to rethink skill sets and roles. The traditional silos (dev, QA, ops) might blur when an AI can span multiple roles in a blink. We might place more emphasis on roles like “AI Orchestrator” or “AI Ethicist” or simply engineers who are adept at guiding AI. Even hiring could shift: recruiters and hiring managers reading this might consider how familiarity with AI tools or prompt engineering could become a sought-after skill. Conversely, having AI agents could alleviate talent gaps in certain areas – for example, if you lack a dedicated QA team, an AI Test Engineer might fill that gap and ensure quality isn’t compromised.

There’s also a cultural aspect. Introducing AI agents into a team can seem daunting or even threatening to some, but our experience has been largely positive because we positioned the AI as augmenting the team, not replacing it. We encouraged the team to see the AI as their helper (almost like an apprentice or an automated intern) rather than an outsider. Over time, as the AI proved its usefulness, skepticism turned into curiosity and even pride. Engineers are proud that we have such advanced tooling and that they can get more done. And when a tough bug is finally fixed by the AI after multiple iterations, we celebrate it just like we would a human contribution – it’s a win for the team.

Illustration: A pair of friendly robots share a cake at a celebratory product ship party, symbolizing AI teammates joining in on the success. In the near future, we might figuratively “invite our AI colleagues” to celebrate project milestones – because they genuinely contribute to our victories. The image of robots eating cake together with humans is a fun metaphor, but it reflects a real shift: AI agents are becoming part of the fabric of our teams. Embracing that with a bit of humor and humanity goes a long way in making this transition enjoyable. We’ve found that treating AI contributions with the same appreciation (and retrospective analysis) as human work fosters an environment where everyone – human or AI – is focused on the same goals and continuous improvement.

The broader engineering community is just beginning to grapple with what software development looks like when AI is deeply integrated. Our take is that it’s a huge opportunity for collaboration. Open-source projects could have AI-powered maintainers to triage issues or even propose fixes. Cross-company initiatives might share AI agent blueprints for common tasks. And individual developers might become more like AI conductors, orchestrating resources to build solutions faster and with fewer errors. The next decade could mirror the DevOps revolution, but for AI-driven development – breaking down barriers between what humans do and what machines do in the software creation process.

Join the Conversation

Our journey with an AI worker team is ongoing, and there’s so much more to learn and improve. We shared this story to spark dialogue and collaboration. If you’re exploring similar ideas or want to dip your toes into AI-driven development, reach out to us! Whether you’re a curious engineer, a researcher, or a tech leader rethinking team composition, we’d love to exchange notes. Let’s connect, compare experiences, and together shape how AI and humans build the future of software. Feel free to contact me via LinkedIn or email – the AI team (and the human one!) at ClubHub is always eager to chat about pushing the boundaries of our industry.

