Sunday, December 14, 2025

🐌 From Codex CLI to OpenAI API: Building a Smarter AI Worker in 24 Hours

How throttling led to a complete rewrite, cost optimization, and a more capable autonomous development agent


The Problem: Codex CLI Throttling

It started with a simple frustration: the Codex CLI was throttling my requests. I had built a shell-based workflow for autonomous AI workers to complete development tasks, but when I needed to scale up or run multiple workers, I kept hitting rate limits.

# The old approach - simple but limited
codex --prompt "Implement user authentication" --model gpt-5.1-codex
# ❌ Rate limit exceeded. Please try again later.

The throttling was killing productivity. I needed a solution that:

  • Bypassed CLI limitations
  • Gave me direct control over API calls
  • Allowed for cost optimization
  • Provided better observability

So I made the decision: migrate from Codex CLI to direct OpenAI API calls.


The Migration: Shell Scripts → Python Application

Phase 1: Direct API Replacement

The initial migration was straightforward - replace CLI calls with HTTP requests:

Before (Shell Script):

#!/bin/bash
# Simple Codex CLI wrapper
RESULT=$(codex --prompt "$PROMPT" --model "$MODEL")
echo "$RESULT"

After (Python with OpenAI Responses API):

import requests

def create_response(prompt: str, model: str, api_key: str):
    """Direct API call to OpenAI Responses API."""
    response = requests.post(
        "https://api.openai.com/v1/responses",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "input": prompt,
            "tools": define_tools()  # Function calling support
        }
    )
    return response.json()
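
The define_tools() call above isn't expanded in this post; here is a rough sketch of what one tool definition might look like (the tool list comes from the worker's own toolset, and the exact schema nesting is best checked against the Responses API reference):

def define_tools():
    """Sketch of the worker's tool schemas (one shown; shape follows OpenAI's
    function-tool format; verify the exact nesting against the Responses API docs)."""
    return [
        {
            "type": "function",
            "name": "read_file",
            "description": "Read a file from the working tree and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path relative to the repo root"}
                },
                "required": ["path"],
            },
        },
        # ... write_file, run_command, read_directory, git_status, git_commit, task_complete
    ]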

This gave us direct control, but we quickly realized we needed more sophistication.

Phase 2: Two-Phase Optimization Strategy

The breakthrough came when we noticed a pattern: workers were spending most of their time reading files, not writing code. Why use expensive models for exploration?

We implemented a two-phase approach:

  1. Context Gathering Phase (gpt-5-mini - $0.25/$2.00 per 1M tokens)

    • Read files, explore project structure
    • Understand requirements
    • Gather context
  2. Implementation Phase (Dynamic model selection)

    • Simple tasks: gpt-5.1-codex ($1.25/$10.00 per 1M tokens)
    • Complex tasks: gpt-5.2 ($1.75/$14.00 per 1M tokens)
In code, the phase switch looks like this:

def get_current_model(self) -> str:
    """Get model for current phase."""
    if self.current_phase == "context":
        return MODELS["context"]  # gpt-5-mini - cheap exploration
    else:
        # Implementation phase - use assessed complexity
        return self.implementation_model  # Dynamically selected
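
The complexity assessment behind self.implementation_model isn't shown here; the idea, with illustrative thresholds and MODELS keys (and assuming files_read holds the paths visited during the context phase), is roughly:

import os

def assess_complexity(self) -> str:
    """Pick the implementation model from what the context phase read.
    Thresholds and MODELS keys here are illustrative, not the exact rules."""
    large_files = sum(1 for path in self.files_read if os.path.getsize(path) > 10_000)
    if len(self.files_read) >= 8 or large_files >= 5:
        return MODELS["implementation_complex"]  # pricier model for complex work
    return MODELS["implementation_simple"]       # default coding model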

This simple change reduced costs by ~80% for context-heavy tasks.


The Evolution: Iterative Improvements

Over 24 hours, we made 30+ commits, each addressing a specific issue:

Challenge 1: Token Bloat

Problem: Cumulative input tokens ballooned as the conversation grew (12M+ tokens over 100 iterations).

Solution: Aggressive context summarization

def should_summarize(self) -> bool:
    """Determine if context should be summarized."""
    # Force summarization if over token limit
    if self.current_usage.input_tokens > self.config.max_context_tokens:
        return True
    
    # Summarize every 10 iterations in implementation phase
    if (self.current_phase == "implementation" and 
        self.iteration % self.config.summarize_every_n_implementation == 0):
        return True
    
    return False
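
The summarization step itself isn't shown above; a minimal sketch, assuming the cheap context model is asked to compress the conversation and that extract_text is a hypothetical helper that pulls the text output from the response, looks like this:

def summarize_context(self) -> None:
    """Compress the running conversation into a short summary and start fresh.
    Prompt wording and state handling are simplified for illustration."""
    summary_prompt = (
        "Summarize the work so far for a coding agent: the task, key files, "
        "decisions taken, and what remains to be done. Be concise."
    )
    response = self.api_client.create_response(
        input_data=f"{summary_prompt}\n\n{self.current_input}",
        tools=[],                 # plain text summary, no tool calls needed
        model=MODELS["context"],  # the cheap model is good enough for this
    )
    self.current_input = extract_text(response)  # hypothetical text-extraction helper
    self.previous_response_id = None             # drop the old conversation chain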

Challenge 2: Workers Getting Stuck in Loops

Problem: Workers would read the same file 20+ times or call the same function repeatedly.

Solution: Loop detection with replanning (not just exiting)

def detect_loop(self) -> Optional[str]:
    """Detect repetitive behavior patterns."""
    # Check for repetitive file reads
    if self.loop_detector.detect_repetitive_reads(self.files_read):
        return "repetitive_file_reads"
    
    # Check for repetitive function calls
    if self.loop_detector.detect_repetitive_calls(self.recent_activity):
        return "repetitive_function_calls"
    
    # Check for lack of progress
    if self.loop_detector.detect_no_progress(self.recent_activity):
        return "no_progress"
    
    return None

# When loop detected, replan instead of exiting
if loop_reason:
    self._replan_with_guidance(loop_reason)
    # Continue with new guidance, don't exit
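
_replan_with_guidance isn't shown above; a rough sketch of the idea (the guidance wording and the detector reset are illustrative) is:

def _replan_with_guidance(self, loop_reason: str) -> None:
    """Inject corrective guidance into the next prompt instead of aborting the run."""
    guidance = {
        "repetitive_file_reads": "You have already read these files. Stop re-reading and start implementing.",
        "repetitive_function_calls": "You are repeating the same calls. Change approach or move to the next step.",
        "no_progress": "No progress detected. Re-read the task, pick one concrete next step, and execute it.",
    }[loop_reason]
    # Prepend guidance so the model replans instead of looping
    self.current_input = f"⚠️ LOOP DETECTED ({loop_reason}).\n{guidance}\n\n{self.current_input}"
    self.loop_detector.reset()  # assuming the detector exposes a reset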

Challenge 3: Workers Asking Questions Instead of Implementing

Problem: Workers would ask "Should I use X or Y?" instead of making decisions.

Solution: Question detection with forced action

def detect_question(self, text: str) -> bool:
    """Detect if worker is asking questions."""
    question_indicators = [
        "should i", "do you want", "can you confirm",
        "please choose", "which do you prefer"
    ]
    return any(indicator in text.lower() for indicator in question_indicators)

# After 3 questions, force tool use
if self.question_count >= 3:
    # Structure input to REQUIRE tool call
    input_for_api = f"""🚨 MANDATORY TOOL CALL REQUIRED:

{self.current_input}

**YOU MUST RESPOND WITH A FUNCTION CALL, NOT TEXT.**
Call a tool function now."""
    
    # Use tool_choice="required" at API level
    response = self.api_client.create_response(
        input_data=input_for_api,
        tools=self.tools,
        model=model,
        force_tool_use=True  # Forces tool_choice="required"
    )

Challenge 4: Cost Visibility

Problem: No way to track spending or identify expensive operations.

Solution: Comprehensive cost tracking

@dataclass
class CostTracker:
    """Track token usage and costs."""
    
    def calculate_cost(self, model: str) -> tuple[float, float]:
        """Calculate cost in USD and EUR."""
        pricing = self.config.get_pricing(model)
        
        # Calculate input cost (including cached tokens at 10% rate)
        input_cost = (
            (self.usage.input_tokens * pricing["input"] / 1_000_000) +
            (self.cached_tokens * pricing["input"] * 0.1 / 1_000_000)
        )
        
        # Calculate output cost
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        
        return total_usd, total_eur

Costs are logged to both human-readable task logs and machine-readable JSON:

{
  "daily": {
    "2025-12-13": {
      "usd": 0.25,
      "eur": 0.23,
      "tasks": 2
    }
  },
  "tasks": {
    "A37": {
      "usd": 0.0744,
      "eur": 0.0685,
      "count": 1
    }
  }
}

The Architecture: A Sophisticated Python Application

What started as a simple API wrapper evolved into a 1,370-line Python application with:

Core Components

1. Worker Orchestration (worker.py)

  • Two-phase execution (context → implementation)
  • Dynamic model selection based on complexity
  • Loop detection and recovery
  • Question detection and forced action

2. API Client (api_client.py)

  • OpenAI Responses API integration
  • Conversation chaining with previous_response_id
  • Tool calling support
  • Visual feedback during API calls

3. Cost Tracking (cost_tracker.py, cost_logger.py)

  • Real-time token usage tracking
  • USD/EUR cost calculation
  • Aggregated reporting (daily/weekly/monthly/yearly)
  • Per-task cost attribution

4. Smart Optimizations

  • File Caching: Avoid redundant file reads (see the sketch after this list)
  • Context Summarization: Reduce token bloat
  • Smart Model Selection: Cheap for reading, expensive for writing
  • Parallel Tool Execution: Concurrent reads during context phase
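
The file cache is the simplest of these; a minimal sketch keeps contents keyed by path and modification time so re-reads within a run are free:

import os

class FileCache:
    """Cache file contents per (path, mtime) so repeated reads cost nothing."""

    def __init__(self) -> None:
        self._cache: dict[str, tuple[float, str]] = {}

    def read(self, path: str) -> str:
        mtime = os.path.getmtime(path)
        cached = self._cache.get(path)
        if cached and cached[0] == mtime:
            return cached[1]                      # unchanged since last read
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        self._cache[path] = (mtime, content)      # refresh cache when the file changes
        return content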

Example: Running the Worker

# Simple usage
./scripts/clubhub-ai-worker.sh developer

# Auto-selects highest priority task, locks it, implements it
# Automatically merges to develop when complete

What happens under the hood:

  1. Context Gathering (iterations 1-10, gpt-5-mini)

    1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
    ⚙️  read_backlog
    ⚙️  read_file: docs/backlog/epic-A-mvp-core/A37-*.md
    ⚙️  read_file: frontend/src/pages/Members.tsx
    💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR
    
  2. Complexity Assessment

    🔍 Complexity Assessment:
       Files read: 9
       Large files (>10KB): 8
       Assessment: COMPLEX - 9 files, 8 large files
       → Model selection: gpt-5.1-codex
       → Reasoning: Complex task benefits from enhanced capability
    
  3. Implementation (iterations 11+, gpt-5.1-codex)

    1️⃣1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
    ⚙️  write_file: frontend/src/pages/Members.tsx
    ✅ Wrote 12006 bytes
    💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR
    

Challenges of Direct API Use

Challenge 1: Conversation Management

The Responses API uses previous_response_id for conversation chaining. Managing this correctly was tricky:

# Wrong: Always including previous_response_id
response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id  # ❌ Breaks on phase transitions
)

# Right: Reset on phase transitions
if transitioning_phases:
    self.previous_response_id = None  # Start a fresh conversation

response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id  # None right after a phase transition
)

Challenge 2: Tool Call Extraction

The Responses API returns tool calls in a nested structure. Extracting them correctly required careful parsing:

def extract_function_calls(self, response: Dict) -> List[Dict]:
    """Extract function calls from API response."""
    function_calls = []
    
    # Responses API structure: response.items[].content[].function_call
    for item in response.get("items", []):
        for content in item.get("content", []):
            if content.get("type") == "function_call":
                function_calls.append({
                    "id": content.get("call_id"),
                    "name": content.get("function", {}).get("name"),
                    "arguments": json.loads(content.get("function", {}).get("arguments", "{}"))
                })
    
    return function_calls

Challenge 3: Error Handling

API errors needed graceful handling without breaking the worker:

try:
    response = self.api_client.create_response(...)
except requests.RequestException as e:
    # Log error but don't crash
    print(f"⚠️  API call failed: {e}", file=sys.stderr)
    # Retry with exponential backoff or continue with cached context
    return self._handle_api_error(e)
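
A plausible home for the retry is a small backoff helper used by _handle_api_error; a minimal sketch (name, attempt count, and delays are illustrative) looks like:

import sys
import time
import requests

def call_with_retries(make_call, max_attempts: int = 4):
    """Retry a failing API call with exponential backoff before giving up."""
    for attempt in range(max_attempts):
        try:
            return make_call()
        except requests.RequestException as e:
            if attempt == max_attempts - 1:
                raise                    # out of retries; let the caller decide
            delay = 2 ** attempt         # 1s, 2s, 4s, ...
            print(f"⚠️  API call failed ({e}), retrying in {delay}s...", file=sys.stderr)
            time.sleep(delay)

# e.g. call_with_retries(lambda: self.api_client.create_response(...))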

Cost Tracking and Monitoring

Real-Time Cost Display

Every iteration shows cost:

💰 Tokens: in=7063 out=278 (reasoning=256) | Total: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR

Aggregated Reporting

Costs are tracked at multiple levels:

def log_task_cost(
    work_dir: str,
    worker_id: str,
    task_id: str,
    cost_usd: float,
    cost_eur: float,
    tokens: TokenUsage,
    model: str,
    success: bool
):
    """Log cost to both task log and cost tracking file."""
    
    # Append to human-readable task log
    task_log_entry = (
        f"{datetime.utcnow().isoformat()}Z | {worker_id} | {task_id} | "
        f"{'completed' if success else 'failed'} | "
        f"cost: ${cost_usd:.4f} USD / €{cost_eur:.4f} EUR "
        f"(tokens: in={tokens.input_tokens:,} out={tokens.output_tokens:,} "
        f"reasoning={tokens.reasoning_tokens:,}, model: {model})"
    )
    with open(f"{work_dir}/task-log.txt", "a") as log_file:  # log file name illustrative
        log_file.write(task_log_entry + "\n")
    
    # Update machine-readable JSON
    update_cost_tracking_json(
        work_dir=work_dir,
        task_id=task_id,
        cost_usd=cost_usd,
        cost_eur=cost_eur,
        success=success
    )
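
update_cost_tracking_json maintains the JSON shown earlier; a sketch of that aggregation (file name and locking omitted, both illustrative) could be:

import json
from datetime import date
from pathlib import Path

def update_cost_tracking_json(work_dir: str, task_id: str, cost_usd: float,
                              cost_eur: float, success: bool) -> None:
    """Aggregate per-day and per-task costs into the machine-readable JSON file."""
    path = Path(work_dir) / "cost-tracking.json"  # file name illustrative
    data = json.loads(path.read_text()) if path.exists() else {"daily": {}, "tasks": {}}

    today = date.today().isoformat()
    daily = data["daily"].setdefault(today, {"usd": 0.0, "eur": 0.0, "tasks": 0})
    daily["usd"] += cost_usd
    daily["eur"] += cost_eur
    daily["tasks"] += 1

    task = data["tasks"].setdefault(task_id, {"usd": 0.0, "eur": 0.0, "count": 0})
    task["usd"] += cost_usd
    task["eur"] += cost_eur
    task["count"] += 1
    # success is available here if failed runs should be tracked separately

    path.write_text(json.dumps(data, indent=2))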

Cost Optimization Results

Before optimization:

  • Single model (gpt-5.1-codex) for everything
  • Average cost per task: ~$0.15-0.30

After optimization:

  • Two-phase approach with smart model selection
  • Average cost per task: ~$0.05-0.10
  • ~60-70% cost reduction

Key Learnings

1. Direct API Control is Powerful

Moving from CLI to API gave us:

  • Fine-grained control over requests
  • Better error handling and retries
  • Cost optimization opportunities
  • Custom tool calling logic

2. Two-Phase Approach is Essential

Using cheap models for exploration and expensive models only when needed:

  • Reduces costs by 60-80%
  • Maintains quality for implementation
  • Scales better for multiple workers

3. Observability Matters

Cost tracking and logging helped us:

  • Identify expensive operations (reading large files repeatedly)
  • Optimize model selection (when to use expensive models)
  • Debug issues (why did this task cost so much?)

4. Iterative Development Works

30+ commits in 24 hours, each addressing a specific issue:

  • Start simple, add complexity as needed
  • Fix issues as they arise
  • Measure and optimize continuously

The Result: A Production-Ready AI Worker

What we built:

  • ✅ Autonomous task completion - Workers pick tasks, implement them, merge automatically
  • ✅ Cost-optimized - 60-70% cost reduction through smart model selection
  • ✅ Observable - Real-time cost tracking and detailed logging
  • ✅ Robust - Loop detection, question handling, error recovery
  • ✅ Tested - 69+ unit tests with >90% coverage

Example output:

🚀 Starting AI Worker Runner
   Project:  clubhub
   Role:     developer
   
🔍 Auto-selecting highest priority available task...
✅ Successfully locked task: A37

1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
⚙️  read_backlog
📋 Task ID updated from backlog read: auto -> A37
   💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR

[... 30 iterations later ...]

🔍 Complexity Assessment:
   Assessment: COMPLEX - 9 files, 8 large files
   → Model selection: gpt-5.1-codex

🔄 Transitioning from context gathering to implementation phase...

1️⃣1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
⚙️  write_file: frontend/src/pages/Members.tsx
✅ Wrote 12006 bytes
   💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR

✅ Task completed successfully
📤 Pushing to develop...
✅ Automatic merge and cleanup completed successfully!

Code Examples

Complete Worker Execution

# Entry point: openai-worker.py
from openai_worker.config import WorkerConfig
from openai_worker.worker import AIWorker

config = WorkerConfig.from_env()
worker = AIWorker(prompt, config)  # prompt is built by the runner script from the task and project context
exit_code = worker.run()

API Client Usage

# api_client.py
from typing import Any, Dict, List, Optional, Union

import requests

class APIClient:
    def create_response(
        self,
        input_data: Union[str, List[Dict[str, Any]]],
        tools: List[Dict[str, Any]],
        model: str,
        previous_response_id: Optional[str] = None,
        force_tool_use: bool = False,
    ) -> Dict[str, Any]:
        request_body = {
            "model": model,
            "tools": tools,
            "input": input_data,
        }
        
        if previous_response_id:
            request_body["previous_response_id"] = previous_response_id
        
        if force_tool_use:
            request_body["tool_choice"] = "required"
        
        response = requests.post(
            f"{self.base_url}/responses",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json=request_body
        )
        return response.json()
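
Calling it from the worker looks roughly like this (the constructor arguments are assumptions; the real class may pull them from config):

import os

client = APIClient(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://api.openai.com/v1",  # constructor arguments assumed
)
response = client.create_response(
    input_data="Implement user authentication for the Members page",
    tools=define_tools(),  # tool schemas as sketched earlier
    model="gpt-5.1-codex",
)
print(response.get("id"))  # passed as previous_response_id on the next call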

Cost Tracking

# cost_tracker.py
@dataclass
class CostTracker:
    def calculate_cost(self, model: str) -> tuple[float, float]:
        pricing = self.config.get_pricing(model)
        
        input_cost = (
            self.usage.input_tokens * pricing["input"] / 1_000_000 +
            self.cached_tokens * pricing["input"] * 0.1 / 1_000_000
        )
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        
        return total_usd, total_eur

What's Next?

The system is working well, but there's always room for improvement:

  • Better context compression - Summarize more aggressively
  • Smarter file caching - Cache parsed ASTs, not just raw files
  • Parallel context gathering - Multiple workers reading different files simultaneously
  • Predictive model selection - Use ML to predict optimal model before starting

But for now, we have a production-ready, cost-optimized, autonomous AI worker that can handle real development tasks.


Conclusion

What started as a workaround for CLI throttling became a complete rewrite and optimization effort. The result is a more capable, cost-effective, and observable system.

Key takeaways:

  • Direct API access gives you control and flexibility
  • Two-phase optimization dramatically reduces costs
  • Observability (cost tracking, logging) is essential
  • Iterative development works - fix issues as they arise

The migration from Codex CLI to OpenAI API wasn't just about avoiding throttling - it was about building a better system.


Built in 24 hours. 30+ commits. 1,370 lines of Python. 69+ tests. 60-70% cost reduction. Production-ready.

Saturday, December 13, 2025

💰 From Rate Limits to Autonomous Workers: Building an AI Development Team with Shell Scripts

How a Codex CLI throttle became the catalyst for building something better


The Throttle That Launched a Thousand Lines of Bash

It started, as many good engineering stories do, with frustration.

I was happily using OpenAI's Codex CLI to power my AI development workflow. Workers would pick up tasks from a backlog, implement features, run tests, and merge to develop. It was beautiful—until it wasn't.

Rate limited. Throttled. Queued.

The irony wasn't lost on me: I was being told to slow down by a tool designed to speed me up.

So I did what any reasonable engineer would do. I stared at the ceiling for five minutes, muttered something unprintable, and then asked: "What if I just talk to the API directly?"


The Pivot: From CLI to Raw API

The OpenAI Responses API is surprisingly approachable. It's essentially a conversation loop with function calling:

  1. Send a prompt
  2. Model responds (maybe with tool calls)
  3. Execute the tools, send results back
  4. Repeat until done (sketched below)
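
In Python, steps 1-4 look roughly like this sketch; the shell worker does the same thing with curl. extract_function_calls and execute_tool are illustrative helpers, and the exact shape of the tool-result items should be checked against the Responses API reference.

def run_conversation(client, prompt, tools, model, max_iterations=100):
    """Minimal sketch of the Responses API conversation loop."""
    input_data = prompt
    previous_response_id = None
    for _ in range(max_iterations):
        response = client.create_response(
            input_data=input_data,
            tools=tools,
            model=model,
            previous_response_id=previous_response_id,
        )
        previous_response_id = response.get("id")
        calls = extract_function_calls(response)
        if not calls:
            return response              # no tool calls left: the model is done
        # Execute each requested tool and feed the results back as the next input
        input_data = [
            {"type": "function_call_output",
             "call_id": call["id"],
             "output": execute_tool(call["name"], call["arguments"])}
            for call in calls
        ]
    return None                          # hit the iteration cap without finishing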

The challenge wasn't the API—it was everything around it:

  • Task selection: How does a worker know what to work on?
  • Isolation: How do multiple workers avoid stepping on each other?
  • Observability: What's happening inside that loop?
  • Cost tracking: Am I accidentally burning through my API budget?
  • Integration: How do changes get merged back?

Enter: openai-worker.sh

800+ lines of bash that probably shouldn't work as well as it does.

The architecture is beautifully stupid:

┌─────────────────────────────────────────────────────────┐
│                    AI Worker Runner                      │
│  - Loads project manifest                               │
│  - Selects/locks task from backlog                      │
│  - Creates git worktree for isolation                   │
│  - Builds role-specific prompt                          │
│  - Launches OpenAI worker                               │
│  - Merges changes back to develop                       │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                    OpenAI Worker                         │
│  - Conversation loop with Responses API                 │
│  - Function calling: read/write files, git, shell       │
│  - Token tracking with live cost estimation             │
│  - Autonomous task completion                           │
└─────────────────────────────────────────────────────────┘

The Fun Parts

Number Emoji Iterations: Because [Iteration 42] is boring, but 4️⃣2️⃣ sparks joy.

number_to_icon() {
  local num="$1"
  local result=""
  local digits=$(echo "$num" | sed 's/./& /g')
  for digit in $digits; do
    case "$digit" in
      0) result="${result}0️⃣" ;;
      1) result="${result}1️⃣" ;;
      # ... you get the idea
    esac
  done
  echo "$result"
}

Live Cost Tracking: Every iteration shows cumulative spend in USD and EUR.

5️⃣7️⃣ Making API call...
   💰 Tokens: in=145780 out=636 | Total: in=4579501 out=23199 | Cost: $5.9443 USD / €5.4688 EUR

When you're burning through 4.5 million tokens in a session, you want to see that counter tick up.

Auto-Select Mode: When the worker can't lock the requested task (someone else got there first), it doesn't sulk—it finds another high-priority task and gets to work. The prompt literally tells it:

"You are expected to be autonomous and eager to work. Do NOT wait for user input."

I'm not saying I'm training my AI workers to have a Protestant work ethic, but I'm not not saying that.


The Tools: Teaching AI to Touch Files

The Responses API's function calling is the secret sauce. We define a schema, and the model tells us what it wants to do:

{
  "name": "write_file",
  "arguments": {
    "path": "frontend/src/pages/Members.tsx",
    "content": "... 12,250 bytes of React ..."
  }
}

Our worker supports:

  • read_file / write_file - The basics
  • run_command - With timeout, for when npm test decides to contemplate infinity
  • read_directory - Because the model needs to explore
  • git_status / git_commit - Version control awareness
  • task_complete - The satisfying finish line

Each tool execution gets logged with emoji flair:

📖 Reading: frontend/src/pages/Members.tsx
✅ Read 12250 bytes

✏️  Writing: frontend/src/pages/Members.tsx  
✅ Wrote 12238 bytes

🔧 Command: npm test
✅ Exit code: 0

The Workflow: From Backlog to Merge

Here's what happens when you run ./scripts/clubhub-ai-worker.sh developer A37:

  1. Load manifest - Find project config, locate backlog
  2. Check stale locks - Clean up any abandoned tasks
  3. Lock task A37 - Mark it as in-progress with a 4-hour expiry
  4. Create worktree - Fresh git worktree branched from develop
  5. Build prompt - Include project context, coding standards, task details
  6. Launch worker - Start the API conversation loop
  7. Worker does work - Read files, write code, run tests, commit
  8. Auto-merge - Fast-forward merge to develop
  9. Cleanup - Remove worktree and branch

The whole thing is designed for parallelism. Run 5 workers, they each get their own worktree, their own branch, their own task. No conflicts until merge time.


The Model Choice: gpt-5.1-codex or Bust

We're strict about models. Only two are allowed:

  • gpt-5.1-codex (default): The workhorse. $1.25/1M input, $10.00/1M output.
  • gpt-5.2 (max): For when you need the big brain. $1.75/1M input, $14.00/1M output.

Anything else gets rejected:

if [[ "$MODEL" != "gpt-5.1-codex" ]] && [[ "$MODEL" != "gpt-5.2" ]]; then
  echo "⚠️  Warning: Model $MODEL not allowed. Using default: gpt-5.1-codex"
  MODEL="gpt-5.1-codex"
fi

No surprises. No accidental GPT-4o bills. No tears.


Lessons Learned

1. Bash Can Do Surprisingly Much

JSON parsing? Python one-liners.
HTTP requests? curl.
State management? Files and environment variables.

Is it elegant? Debatable.
Does it work? Surprisingly well.

2. Observability Is Everything

Early versions were opaque. The worker would churn for 20 minutes and I'd have no idea if it was making progress or chasing its tail.

Now every iteration shows:

  • What tool is being called
  • What file is being touched
  • How many tokens consumed
  • Running cost in real currency

The difference between "what is happening?" and "I see exactly what's happening" is about 50 lines of echo statements.

3. Autonomy Requires Guardrails

Telling an AI "work until done" is dangerous without:

  • Max iterations (100 by default)
  • Task completion markers (explicit "I'm done" signal)
  • Cost visibility (so you see that $15 bill coming)
  • Worktree isolation (so mistakes are contained)

4. Rate Limits Are a Feature, Not a Bug

Getting throttled on Codex CLI forced me to understand what I actually needed. The result is a system that:

  • Has no external dependencies beyond curl and Python
  • Gives me complete control over the conversation
  • Costs roughly the same (maybe less, without CLI overhead)
  • Scales to as many workers as my API limits allow

The Output: Today's Session

In one afternoon, we went from "rate limited on Codex CLI" to:

  • ✅ Full OpenAI Responses API integration
  • ✅ Autonomous task selection with backlog integration
  • ✅ Git worktree isolation per worker
  • ✅ Live token and cost tracking (USD/EUR)
  • ✅ Emoji iteration counters (because why not)
  • ✅ Auto-merge to develop on completion
  • ✅ Renamed "manual mode" to "auto-select mode" (words matter)

And then we let the worker loose on a real task: A37 - Member archive/reactivate.

60 iterations. 4.8 million tokens. Backend and frontend implementation. Tests updated. Merged to develop.

All while I wrote this blog post.


The Code

It's all open in ai-project-hub:

ai-project-hub/
├── tools/
│   ├── openai-worker.sh      # The API conversation loop
│   ├── ai-worker-runner.sh   # Task selection, worktree, merge
│   ├── git-branch-helpers.sh # Worktree management
│   └── task-lock-helpers.sh  # Backlog integration
└── projects/
    └── clubhub/
        └── manifest.yaml     # Project config

Is it production-ready? For my production, yes.
Would I recommend it? If you're comfortable reading bash, absolutely.


What's Next

  • Parallel worker orchestration - Launch N workers, distribute tasks
  • Smarter cost budgets - "Stop if you hit $20"
  • Cached context - Reduce token usage with conversation summaries
  • Better error recovery - Right now a failed iteration is terminal

But those are problems for another throttled afternoon.


Written by a human, with an AI worker implementing features in a separate terminal.

Total cost of that parallel session: approximately €5.47

Total value of not being rate limited: priceless
