Sunday, December 14, 2025

From Codex CLI to OpenAI API: Building a Smarter AI Worker in 24 Hours

How throttling led to a complete rewrite, cost optimization, and a more capable autonomous development agent


The Problem: Codex CLI Throttling

It started with a simple frustration: the Codex CLI was throttling my requests. I had built a shell-based workflow for autonomous AI workers to complete development tasks, but when I needed to scale up or run multiple workers, I kept hitting rate limits.

# The old approach - simple but limited
codex --prompt "Implement user authentication" --model gpt-5.1-codex
# ❌ Rate limit exceeded. Please try again later.

The throttling was killing productivity. I needed a solution that:

  • Bypassed CLI limitations
  • Gave me direct control over API calls
  • Allowed for cost optimization
  • Provided better observability

So I made the decision: migrate from Codex CLI to direct OpenAI API calls.


The Migration: Shell Scripts → Python Application

Phase 1: Direct API Replacement

The initial migration was straightforward - replace CLI calls with HTTP requests:

Before (Shell Script):

#!/bin/bash
# Simple Codex CLI wrapper
RESULT=$(codex --prompt "$PROMPT" --model "$MODEL")
echo "$RESULT"

After (Python with OpenAI Responses API):

import requests

def create_response(prompt: str, model: str, api_key: str):
    """Direct API call to OpenAI Responses API."""
    response = requests.post(
        "https://api.openai.com/v1/responses",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": model,
            "input": prompt,
            "tools": define_tools()  # Function calling support
        }
    )
    return response.json()
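
The define_tools() call above is where the worker's tool schemas live. A minimal sketch, assuming just two illustrative tools (read_file and write_file) in the Responses API function-tool format; the actual tool set, descriptions, and parameter schemas here are assumptions:

def define_tools() -> list:
    """Function schemas exposed to the model (illustrative subset)."""
    return [
        {
            "type": "function",
            "name": "read_file",
            "description": "Read a file from the repository and return its contents.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path relative to the repo root"}
                },
                "required": ["path"],
            },
        },
        {
            "type": "function",
            "name": "write_file",
            "description": "Write content to a file, creating it if needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "content": {"type": "string"},
                },
                "required": ["path", "content"],
            },
        },
    ]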

This gave us direct control, but we quickly realized we needed more sophistication.

Phase 2: Two-Phase Optimization Strategy

The breakthrough came when we noticed a pattern: workers were spending most of their time reading files, not writing code. Why use expensive models for exploration?

We implemented a two-phase approach:

  1. Context Gathering Phase (gpt-5-mini - $0.25/$2.00 per 1M tokens)

    • Read files, explore project structure
    • Understand requirements
    • Gather context
  2. Implementation Phase (Dynamic model selection)

    • Simple tasks: gpt-5.1-codex ($1.25/$10.00 per 1M tokens)
    • Complex tasks: gpt-5.2 ($1.75/$14.00 per 1M tokens)

In code, the phase-to-model mapping is a simple lookup:

def get_current_model(self) -> str:
    """Get model for current phase."""
    if self.current_phase == "context":
        return MODELS["context"]  # gpt-5-mini - cheap exploration
    else:
        # Implementation phase - use assessed complexity
        return self.implementation_model  # Dynamically selected

This simple change reduced costs by ~80% for context-heavy tasks.
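
The implementation_model returned above is chosen by a complexity assessment at the end of the context phase. A rough sketch of that decision, using the same signals the worker logs (files read, files over 10KB); the thresholds, attribute names, and MODELS keys are illustrative:

def assess_complexity(self) -> str:
    """Pick the implementation model from what the context phase observed (illustrative)."""
    files_read = len(self.files_read)  # assumed: dict of path -> size in bytes
    large_files = sum(1 for size in self.files_read.values() if size > 10_000)

    if files_read >= 8 or large_files >= 5:
        assessment = f"COMPLEX - {files_read} files, {large_files} large files"
        self.implementation_model = MODELS["implementation_complex"]
    else:
        assessment = f"SIMPLE - {files_read} files"
        self.implementation_model = MODELS["implementation_simple"]

    print(f"🔍 Complexity Assessment: {assessment} -> {self.implementation_model}")
    return self.implementation_model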


The Evolution: Iterative Improvements

Over 24 hours, we made 30+ commits, each addressing a specific issue:

Challenge 1: Token Bloat

Problem: Input tokens were growing exponentially (12M+ tokens for 100 iterations).

Solution: Aggressive context summarization

def should_summarize(self) -> bool:
    """Determine if context should be summarized."""
    # Force summarization if over token limit
    if self.current_usage.input_tokens > self.config.max_context_tokens:
        return True
    
    # Summarize every 10 iterations in implementation phase
    if (self.current_phase == "implementation" and 
        self.iteration % self.config.summarize_every_n_implementation == 0):
        return True
    
    return False
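
When should_summarize() fires, the worker condenses the conversation instead of carrying the full transcript forward. A rough sketch of that step, assuming a helper that asks the cheap context model for a summary and then restarts the conversation chain; the prompt wording and the extract_output_text() helper are assumptions:

def summarize_context(self) -> None:
    """Replace the running transcript with a compact summary (illustrative)."""
    summary_prompt = (
        "Summarize the work so far for a coding agent: task goal, files already read "
        "(with key findings), decisions made, and the immediate next steps. "
        "Be terse and omit file contents."
    )
    response = self.api_client.create_response(
        input_data=summary_prompt,
        tools=[],                                # no tool calls needed while summarizing
        model=MODELS["context"],                 # the cheap model is good enough here
        previous_response_id=self.previous_response_id,
    )
    summary = extract_output_text(response)      # assumed helper that pulls the text output

    # Start a fresh conversation seeded with the summary instead of the full history
    self.previous_response_id = None
    self.current_input = f"Context summary of previous work:\n\n{summary}"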

Challenge 2: Workers Getting Stuck in Loops

Problem: Workers would read the same file 20+ times or call the same function repeatedly.

Solution: Loop detection with replanning (not just exiting)

def detect_loop(self) -> Optional[str]:
    """Detect repetitive behavior patterns."""
    # Check for repetitive file reads
    if self.loop_detector.detect_repetitive_reads(self.files_read):
        return "repetitive_file_reads"
    
    # Check for repetitive function calls
    if self.loop_detector.detect_repetitive_calls(self.recent_activity):
        return "repetitive_function_calls"
    
    # Check for lack of progress
    if self.loop_detector.detect_no_progress(self.recent_activity):
        return "no_progress"
    
    return None

# When loop detected, replan instead of exiting
if loop_reason:
    self._replan_with_guidance(loop_reason)
    # Continue with new guidance, don't exit
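
The _replan_with_guidance() call is what keeps the worker moving instead of giving up. A minimal sketch, assuming per-reason guidance strings injected into the next prompt; the messages and the loop_detector.reset() call are assumptions:

def _replan_with_guidance(self, loop_reason: str) -> None:
    """Inject corrective guidance into the next prompt instead of exiting (illustrative)."""
    guidance = {
        "repetitive_file_reads": (
            "You have already read these files. Stop re-reading them and start "
            "implementing with the information you have."
        ),
        "repetitive_function_calls": (
            "You are repeating the same tool calls. Change approach and take the "
            "next concrete step toward completing the task."
        ),
        "no_progress": (
            "No progress detected in recent iterations. State your plan in one "
            "sentence, then execute the first step with a tool call."
        ),
    }[loop_reason]

    # Prepend guidance so it is the first thing the model sees next iteration
    self.current_input = f"⚠️ LOOP DETECTED ({loop_reason}): {guidance}\n\n{self.current_input}"
    self.loop_detector.reset()  # assumed: clear counters so the same loop isn't re-flagged immediately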

Challenge 3: Workers Asking Questions Instead of Implementing

Problem: Workers would ask "Should I use X or Y?" instead of making decisions.

Solution: Question detection with forced action

def detect_question(self, text: str) -> bool:
    """Detect if worker is asking questions."""
    question_indicators = [
        "should i", "do you want", "can you confirm",
        "please choose", "which do you prefer"
    ]
    return any(indicator in text.lower() for indicator in question_indicators)

# After 3 questions, force tool use
if self.question_count >= 3:
    # Structure input to REQUIRE tool call
    input_for_api = f"""🚨 MANDATORY TOOL CALL REQUIRED:

{self.current_input}

**YOU MUST RESPOND WITH A FUNCTION CALL, NOT TEXT.**
Call a tool function now."""
    
    # Use tool_choice="required" at API level
    response = self.api_client.create_response(
        input_data=input_for_api,
        tools=self.tools,
        model=model,
        force_tool_use=True  # Forces tool_choice="required"
    )

Challenge 4: Cost Visibility

Problem: No way to track spending or identify expensive operations.

Solution: Comprehensive cost tracking

@dataclass
class CostTracker:
    """Track token usage and costs."""
    
    def calculate_cost(self, model: str) -> tuple[float, float]:
        """Calculate cost in USD and EUR."""
        pricing = self.config.get_pricing(model)
        
        # Calculate input cost (including cached tokens at 10% rate)
        input_cost = (
            (self.usage.input_tokens * pricing["input"] / 1_000_000) +
            (self.cached_tokens * pricing["input"] * 0.1 / 1_000_000)
        )
        
        # Calculate output cost
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        
        return total_usd, total_eur
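
As a sanity check against the gpt-5-mini pricing above, the context-phase call shown later (7,063 input tokens, 278 output tokens, no cached tokens) works out like this:

# Worked example: gpt-5-mini at $0.25 input / $2.00 output per 1M tokens
input_cost  = 7_063 * 0.25 / 1_000_000   # ≈ $0.00177
output_cost =   278 * 2.00 / 1_000_000   # ≈ $0.00056
total_usd   = input_cost + output_cost   # ≈ $0.0023, matching the per-iteration log line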

Costs are logged to both human-readable task logs and machine-readable JSON:

{
  "daily": {
    "2025-12-13": {
      "usd": 0.25,
      "eur": 0.23,
      "tasks": 2
    }
  },
  "tasks": {
    "A37": {
      "usd": 0.0744,
      "eur": 0.0685,
      "count": 1
    }
  }
}

The Architecture: A Sophisticated Python Application

What started as a simple API wrapper evolved into a 1,370-line Python application with:

Core Components

1. Worker Orchestration (worker.py)

  • Two-phase execution (context → implementation)
  • Dynamic model selection based on complexity
  • Loop detection and recovery
  • Question detection and forced action

2. API Client (api_client.py)

  • OpenAI Responses API integration
  • Conversation chaining with previous_response_id
  • Tool calling support
  • Visual feedback during API calls

3. Cost Tracking (cost_tracker.py, cost_logger.py)

  • Real-time token usage tracking
  • USD/EUR cost calculation
  • Aggregated reporting (daily/weekly/monthly/yearly)
  • Per-task cost attribution

4. Smart Optimizations

  • File Caching: Avoid redundant file reads (see the sketch after this list)
  • Context Summarization: Reduce token bloat
  • Smart Model Selection: Cheap for reading, expensive for writing
  • Parallel Tool Execution: Concurrent reads during context phase
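
The file cache is the simplest of these to illustrate. A minimal sketch, assuming the read_file tool handler goes through an in-memory cache keyed by path; the class and its interface are illustrative:

class FileCache:
    """In-memory cache so repeated read_file calls don't hit the disk again."""

    def __init__(self) -> None:
        self._cache: dict = {}

    def read(self, path: str) -> str:
        if path in self._cache:
            return self._cache[path]  # cache hit: skip the disk read
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        self._cache[path] = content
        return content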

Example: Running the Worker

# Simple usage
./scripts/clubhub-ai-worker.sh developer

# Auto-selects highest priority task, locks it, implements it
# Automatically merges to develop when complete

What happens under the hood:

  1. Context Gathering (iterations 1-10, gpt-5-mini)

    1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
    ⚙️  read_backlog
    ⚙️  read_file: docs/backlog/epic-A-mvp-core/A37-*.md
    ⚙️  read_file: frontend/src/pages/Members.tsx
    💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR
    
  2. Complexity Assessment

    🔍 Complexity Assessment:
       Files read: 9
       Large files (>10KB): 8
       Assessment: COMPLEX - 9 files, 8 large files
       → Model selection: gpt-5.1-codex
       → Reasoning: Complex task benefits from enhanced capability
    
  3. Implementation (iterations 11+, gpt-5.1-codex)

    1️⃣ 1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
    ⚙️  write_file: frontend/src/pages/Members.tsx
    ✅ Wrote 12006 bytes
    💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR
    

Challenges of Direct API Use

Challenge 1: Conversation Management

The Responses API uses previous_response_id for conversation chaining. Managing this correctly was tricky:

# Wrong: Always including previous_response_id
response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id  # ❌ Breaks on phase transitions
)

# Right: Reset on phase transitions
if transitioning_phases:
    self.previous_response_id = None  # Start fresh conversation

response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id  # None right after a phase transition
)
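
In the worker, this reset happens as part of the phase transition itself. A rough sketch of how that step might be wired together; the method and helper names are illustrative:

def _transition_to_implementation(self) -> None:
    """Switch from context gathering to implementation (illustrative)."""
    print("🔄 Transitioning from context gathering to implementation phase...")

    self.implementation_model = self.assess_complexity()     # pick the model for the write phase
    self.previous_response_id = None                          # start a fresh conversation chain
    self.current_phase = "implementation"

    # Seed the new conversation with a condensed view of what was learned
    self.current_input = self.build_implementation_prompt()   # assumed helper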

Challenge 2: Tool Call Extraction

The Responses API returns tool calls in a nested structure. Extracting them correctly required careful parsing:

def extract_function_calls(self, response: Dict) -> List[Dict]:
    """Extract function calls from API response."""
    function_calls = []
    
    # Responses API structure: response.items[].content[].function_call
    for item in response.get("items", []):
        for content in item.get("content", []):
            if content.get("type") == "function_call":
                function_calls.append({
                    "id": content.get("call_id"),
                    "name": content.get("function", {}).get("name"),
                    "arguments": json.loads(content.get("function", {}).get("arguments", "{}"))
                })
    
    return function_calls
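
The other half of tool calling is returning results. A sketch of how the worker might feed tool outputs back on the next request, assuming the Responses API accepts function_call_output input items; the dispatch_tool() helper is an assumption:

def run_tools_and_build_input(self, function_calls: List[Dict]) -> List[Dict]:
    """Execute each requested tool and package the results for the next API call (illustrative)."""
    outputs = []
    for call in function_calls:
        result = self.dispatch_tool(call["name"], call["arguments"])  # assumed dispatcher
        outputs.append({
            "type": "function_call_output",
            "call_id": call["id"],
            "output": result if isinstance(result, str) else json.dumps(result),
        })
    return outputs  # passed as input_data on the next create_response call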

Challenge 3: Error Handling

API errors needed graceful handling without breaking the worker:

try:
    response = self.api_client.create_response(...)
except requests.RequestException as e:
    # Log error but don't crash
    print(f"⚠️  API call failed: {e}", file=sys.stderr)
    # Retry with exponential backoff or continue with cached context
    return self._handle_api_error(e)
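
The _handle_api_error() call above is referenced but not shown. A minimal sketch of the retry path, assuming simple exponential backoff before giving up for the current iteration; retry counts and delays are illustrative:

import sys
import time
from typing import Dict, Optional

import requests

def _handle_api_error(self, error: Exception, max_retries: int = 3) -> Optional[Dict]:
    """Retry a failed API call with exponential backoff (illustrative)."""
    for attempt in range(max_retries):
        delay = 2 ** attempt  # 1s, 2s, 4s
        print(f"⏳ Retrying in {delay}s (attempt {attempt + 1}/{max_retries})...", file=sys.stderr)
        time.sleep(delay)
        try:
            return self.api_client.create_response(
                input_data=self.current_input,
                tools=self.tools,
                model=self.get_current_model(),
            )
        except requests.RequestException as retry_error:
            error = retry_error

    # Give up on this iteration; the worker carries on with cached context
    print(f"❌ API call failed after {max_retries} retries: {error}", file=sys.stderr)
    return None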

Cost Tracking and Monitoring

Real-Time Cost Display

Every iteration shows cost:

💰 Tokens: in=7063 out=278 (reasoning=256) | Total: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR

Aggregated Reporting

Costs are tracked at multiple levels:

def log_task_cost(
    work_dir: str,
    worker_id: str,
    task_id: str,
    cost_usd: float,
    cost_eur: float,
    tokens: TokenUsage,
    model: str,
    success: bool
):
    """Log cost to both task log and cost tracking file."""
    
    # Append to human-readable task log
    task_log_entry = (
        f"{datetime.utcnow().isoformat()}Z | {worker_id} | {task_id} | "
        f"{'completed' if success else 'failed'} | "
        f"cost: ${cost_usd:.4f} USD / €{cost_eur:.4f} EUR "
        f"(tokens: in={tokens.input_tokens:,} out={tokens.output_tokens:,} "
        f"reasoning={tokens.reasoning_tokens:,}, model: {model})"
    )
    with open(f"{work_dir}/task-costs.log", "a") as log_file:  # log file name illustrative
        log_file.write(task_log_entry + "\n")
    
    # Update machine-readable JSON
    update_cost_tracking_json(
        work_dir=work_dir,
        task_id=task_id,
        cost_usd=cost_usd,
        cost_eur=cost_eur,
        success=success
    )
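
The update_cost_tracking_json() helper keeps the machine-readable file shown earlier up to date. A minimal sketch that merges the new cost into the daily and per-task buckets; the file name is illustrative, and the layout follows the JSON example above:

import json
from datetime import date
from pathlib import Path

def update_cost_tracking_json(
    work_dir: str,
    task_id: str,
    cost_usd: float,
    cost_eur: float,
    success: bool,  # accepted to mirror the call site; could gate failed-task accounting
) -> None:
    """Merge one task's cost into the aggregated JSON file (illustrative)."""
    path = Path(work_dir) / "costs.json"
    data = json.loads(path.read_text()) if path.exists() else {"daily": {}, "tasks": {}}

    today = date.today().isoformat()
    day = data["daily"].setdefault(today, {"usd": 0.0, "eur": 0.0, "tasks": 0})
    day["usd"] = round(day["usd"] + cost_usd, 4)
    day["eur"] = round(day["eur"] + cost_eur, 4)
    day["tasks"] += 1

    task = data["tasks"].setdefault(task_id, {"usd": 0.0, "eur": 0.0, "count": 0})
    task["usd"] = round(task["usd"] + cost_usd, 4)
    task["eur"] = round(task["eur"] + cost_eur, 4)
    task["count"] += 1

    path.write_text(json.dumps(data, indent=2))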

Cost Optimization Results

Before optimization:

  • Single model (gpt-5.1-codex) for everything
  • Average cost per task: ~$0.15-0.30

After optimization:

  • Two-phase approach with smart model selection
  • Average cost per task: ~$0.05-0.10
  • ~60-70% cost reduction

Key Learnings

1. Direct API Control is Powerful

Moving from CLI to API gave us:

  • Fine-grained control over requests
  • Better error handling and retries
  • Cost optimization opportunities
  • Custom tool calling logic

2. Two-Phase Approach is Essential

Using cheap models for exploration and expensive models only when needed:

  • Reduces costs by 60-80%
  • Maintains quality for implementation
  • Scales better for multiple workers

3. Observability Matters

Cost tracking and logging helped us:

  • Identify expensive operations (reading large files repeatedly)
  • Optimize model selection (when to use expensive models)
  • Debug issues (why did this task cost so much?)

4. Iterative Development Works

30+ commits in 24 hours, each addressing a specific issue:

  • Start simple, add complexity as needed
  • Fix issues as they arise
  • Measure and optimize continuously

The Result: A Production-Ready AI Worker

What we built:

  • ✅ Autonomous task completion - Workers pick tasks, implement them, merge automatically
  • ✅ Cost-optimized - 60-70% cost reduction through smart model selection
  • ✅ Observable - Real-time cost tracking and detailed logging
  • ✅ Robust - Loop detection, question handling, error recovery
  • ✅ Tested - 69+ unit tests with >90% coverage

Example output:

🚀 Starting AI Worker Runner
   Project:  clubhub
   Role:     developer
   
🔍 Auto-selecting highest priority available task...
✅ Successfully locked task: A37

1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
⚙️  read_backlog
📋 Task ID updated from backlog read: auto -> A37
   💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR

[... 30 iterations later ...]

🔍 Complexity Assessment:
   Assessment: COMPLEX - 9 files, 8 large files
   → Model selection: gpt-5.1-codex

🔄 Transitioning from context gathering to implementation phase...

1️⃣ 1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
⚙️  write_file: frontend/src/pages/Members.tsx
✅ Wrote 12006 bytes
   💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR

✅ Task completed successfully
📤 Pushing to develop...
✅ Automatic merge and cleanup completed successfully!

Code Examples

Complete Worker Execution

# Entry point: openai-worker.py
from openai_worker.config import WorkerConfig
from openai_worker.worker import AIWorker

config = WorkerConfig.from_env()
worker = AIWorker(prompt, config)  # prompt: the task prompt assembled by the runner script
exit_code = worker.run()

API Client Usage

# api_client.py
import requests
from typing import Any, Dict, List, Optional, Union


class APIClient:
    def create_response(
        self,
        input_data: Union[str, List[Dict[str, Any]]],
        tools: List[Dict[str, Any]],
        model: str,
        previous_response_id: Optional[str] = None,
        force_tool_use: bool = False,
    ) -> Dict[str, Any]:
        request_body = {
            "model": model,
            "tools": tools,
            "input": input_data,
        }
        
        if previous_response_id:
            request_body["previous_response_id"] = previous_response_id
        
        if force_tool_use:
            request_body["tool_choice"] = "required"
        
        response = requests.post(
            f"{self.base_url}/responses",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json=request_body
        )
        response.raise_for_status()  # surface HTTP errors to the caller's retry logic
        return response.json()

Cost Tracking

# cost_tracker.py
@dataclass
class CostTracker:
    def calculate_cost(self, model: str) -> tuple[float, float]:
        pricing = self.config.get_pricing(model)
        
        input_cost = (
            self.usage.input_tokens * pricing["input"] / 1_000_000 +
            self.cached_tokens * pricing["input"] * 0.1 / 1_000_000
        )
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        
        return total_usd, total_eur

What's Next?

The system is working well, but there's always room for improvement:

  • Better context compression - Summarize more aggressively
  • Smarter file caching - Cache parsed ASTs, not just raw files
  • Parallel context gathering - Multiple workers reading different files simultaneously
  • Predictive model selection - Use ML to predict optimal model before starting

But for now, we have a production-ready, cost-optimized, autonomous AI worker that can handle real development tasks.


Conclusion

What started as a workaround for CLI throttling became a complete rewrite and optimization effort. The result is a more capable, cost-effective, and observable system.

Key takeaways:

  • Direct API access gives you control and flexibility
  • Two-phase optimization dramatically reduces costs
  • Observability (cost tracking, logging) is essential
  • Iterative development works - fix issues as they arise

The migration from Codex CLI to OpenAI API wasn't just about avoiding throttling - it was about building a better system.


Built in 24 hours. 30+ commits. 1,370 lines of Python. 69+ tests. 60-70% cost reduction. Production-ready.
