From Codex CLI to OpenAI API: Building a Smarter AI Worker in 24 Hours
How throttling led to a complete rewrite, cost optimization, and a more capable autonomous development agent
The Problem: Codex CLI Throttling
It started with a simple frustration: the Codex CLI was throttling my requests. I had built a shell-based workflow for autonomous AI workers to complete development tasks, but when I needed to scale up or run multiple workers, I kept hitting rate limits.
# The old approach - simple but limited
codex --prompt "Implement user authentication" --model gpt-5.1-codex
# ❌ Rate limit exceeded. Please try again later.
The throttling was killing productivity. I needed a solution that:
- Bypassed CLI limitations
- Gave me direct control over API calls
- Allowed for cost optimization
- Provided better observability
So I made the decision: migrate from Codex CLI to direct OpenAI API calls.
The Migration: Shell Scripts → Python Application
Phase 1: Direct API Replacement
The initial migration was straightforward - replace CLI calls with HTTP requests:
Before (Shell Script):
#!/bin/bash
# Simple Codex CLI wrapper
RESULT=$(codex --prompt "$PROMPT" --model "$MODEL")
echo "$RESULT"
After (Python with OpenAI Responses API):
import requests

def create_response(prompt: str, model: str, api_key: str):
    """Direct API call to the OpenAI Responses API."""
    response = requests.post(
        "https://api.openai.com/v1/responses",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "model": model,
            "input": prompt,
            "tools": define_tools(),  # Function calling support
        },
    )
    return response.json()
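The define_tools() helper referenced above isn't shown in the post. As a rough sketch of what it might return, here is one function tool in the Responses API shape (the tool name and schema are illustrative, not the actual ones):

def define_tools() -> list[dict]:
    """Sketch of a tool list for the Responses API (hypothetical read_file tool)."""
    return [
        {
            "type": "function",
            "name": "read_file",
            "description": "Read a file from the repository.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Repo-relative path"},
                },
                "required": ["path"],
            },
        },
    ]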
This gave us direct control, but we quickly realized we needed more sophistication.
Phase 2: Two-Phase Optimization Strategy
The breakthrough came when we noticed a pattern: workers were spending most of their time reading files, not writing code. Why use expensive models for exploration?
We implemented a two-phase approach:
Context Gathering Phase (gpt-5-mini - $0.25/$2.00 per 1M tokens)
- Read files, explore project structure
- Understand requirements
- Gather context
Implementation Phase (dynamic model selection)
- Simple tasks: gpt-5.1-codex ($1.25/$10.00 per 1M tokens)
- Complex tasks: gpt-5.2 ($1.75/$14.00 per 1M tokens)
def get_current_model(self) -> str:
    """Get the model for the current phase."""
    if self.current_phase == "context":
        return MODELS["context"]  # gpt-5-mini - cheap exploration
    # Implementation phase - use assessed complexity
    return self.implementation_model  # Dynamically selected
This simple change reduced costs by ~80% for context-heavy tasks.
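The MODELS mapping and the complexity assessment aren't shown above. A minimal sketch of how they might fit together, with invented thresholds and mapping values (the post itself doesn't pin these down):

# Hypothetical mapping and selection logic - illustrative values only
MODELS = {
    "context": "gpt-5-mini",
    "simple": "gpt-5.1-codex",
    "complex": "gpt-5.2",
}

def assess_complexity(files_read: int, large_files: int) -> str:
    """Pick the implementation model from signals gathered cheaply in the context phase."""
    if files_read > 5 or large_files > 3:  # assumed thresholds
        return MODELS["complex"]
    return MODELS["simple"]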
The Evolution: Iterative Improvements
Over 24 hours, we made 30+ commits, each addressing a specific issue:
Challenge 1: Token Bloat
Problem: Input tokens ballooned as conversation context accumulated, reaching 12M+ tokens over 100 iterations.
Solution: Aggressive context summarization
def should_summarize(self) -> bool:
    """Determine if context should be summarized."""
    # Force summarization if over the token limit
    if self.current_usage.input_tokens > self.config.max_context_tokens:
        return True
    # Summarize every 10 iterations in the implementation phase
    if (self.current_phase == "implementation" and
            self.iteration % self.config.summarize_every_n_implementation == 0):
        return True
    return False
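What the summarization step itself does isn't shown. A minimal sketch, assuming the cheap context model compresses the transcript and the worker restarts the conversation from the summary (the method and extract_text helper are hypothetical):

def summarize_context(self) -> None:
    """Compress the conversation so far into a short summary (hypothetical sketch)."""
    summary_prompt = (
        "Summarize the work so far: files read, key findings, decisions made, "
        "and what remains to be done. Be concise."
    )
    response = self.api_client.create_response(
        input_data=summary_prompt,
        tools=[],
        model=MODELS["context"],  # the cheap model is fine for summarizing
        previous_response_id=self.previous_response_id,
    )
    # Start a fresh conversation seeded with the summary
    self.previous_response_id = None
    self.current_input = f"Context summary:\n{extract_text(response)}"  # extract_text is assumed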
Challenge 2: Workers Getting Stuck in Loops
Problem: Workers would read the same file 20+ times or call the same function repeatedly.
Solution: Loop detection with replanning (not just exiting)
def detect_loop(self) -> Optional[str]:
    """Detect repetitive behavior patterns."""
    # Check for repetitive file reads
    if self.loop_detector.detect_repetitive_reads(self.files_read):
        return "repetitive_file_reads"
    # Check for repetitive function calls
    if self.loop_detector.detect_repetitive_calls(self.recent_activity):
        return "repetitive_function_calls"
    # Check for lack of progress
    if self.loop_detector.detect_no_progress(self.recent_activity):
        return "no_progress"
    return None

# When a loop is detected, replan instead of exiting
if loop_reason:
    self._replan_with_guidance(loop_reason)
    # Continue with new guidance, don't exit
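The detector internals aren't shown in the post. One plausible implementation of the repetitive-read check, assuming files_read is the ordered list of paths read so far (the threshold is an invented default):

from collections import Counter

def detect_repetitive_reads(files_read: list[str], threshold: int = 5) -> bool:
    """Flag a loop when any single file has been read more than `threshold` times (sketch)."""
    if not files_read:
        return False
    _, count = Counter(files_read).most_common(1)[0]
    return count > threshold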
Challenge 3: Workers Asking Questions Instead of Implementing
Problem: Workers would ask "Should I use X or Y?" instead of making decisions.
Solution: Question detection with forced action
def detect_question(self, text: str) -> bool:
    """Detect if the worker is asking questions instead of acting."""
    question_indicators = [
        "should i", "do you want", "can you confirm",
        "please choose", "which do you prefer",
    ]
    return any(indicator in text.lower() for indicator in question_indicators)

# After 3 questions, force tool use
if self.question_count >= 3:
    # Structure the input to REQUIRE a tool call
    input_for_api = f"""🚨 MANDATORY TOOL CALL REQUIRED:
{self.current_input}
**YOU MUST RESPOND WITH A FUNCTION CALL, NOT TEXT.**
Call a tool function now."""
    # Use tool_choice="required" at the API level
    response = self.api_client.create_response(
        input_data=input_for_api,
        tools=self.tools,
        model=model,
        force_tool_use=True,  # Forces tool_choice="required"
    )
Challenge 4: Cost Visibility
Problem: No way to track spending or identify expensive operations.
Solution: Comprehensive cost tracking
from dataclasses import dataclass

@dataclass
class CostTracker:
    """Track token usage and costs."""

    def calculate_cost(self, model: str) -> tuple[float, float]:
        """Calculate cost in USD and EUR."""
        pricing = self.config.get_pricing(model)
        # Input cost (cached tokens billed at 10% of the input rate)
        input_cost = (
            (self.usage.input_tokens * pricing["input"] / 1_000_000) +
            (self.cached_tokens * pricing["input"] * 0.1 / 1_000_000)
        )
        # Output cost
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        return total_usd, total_eur
Costs are logged to both human-readable task logs and machine-readable JSON:
{
  "daily": {
    "2025-12-13": {
      "usd": 0.25,
      "eur": 0.23,
      "tasks": 2
    }
  },
  "tasks": {
    "A37": {
      "usd": 0.0744,
      "eur": 0.0685,
      "count": 1
    }
  }
}
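The helper that maintains this ledger isn't shown. A minimal sketch of what update_cost_tracking_json might do, assuming the file layout above (the filename and everything beyond the JSON keys are assumptions):

import json
from datetime import date
from pathlib import Path

def update_cost_tracking_json(work_dir: str, task_id: str,
                              cost_usd: float, cost_eur: float, success: bool) -> None:
    """Merge one task's cost into the daily/per-task JSON ledger (sketch)."""
    path = Path(work_dir) / "costs.json"  # assumed filename
    data = json.loads(path.read_text()) if path.exists() else {"daily": {}, "tasks": {}}
    today = date.today().isoformat()
    day = data["daily"].setdefault(today, {"usd": 0.0, "eur": 0.0, "tasks": 0})
    day["usd"] += cost_usd
    day["eur"] += cost_eur
    day["tasks"] += 1
    task = data["tasks"].setdefault(task_id, {"usd": 0.0, "eur": 0.0, "count": 0})
    task["usd"] += cost_usd
    task["eur"] += cost_eur
    task["count"] += 1  # `success` could also gate this; kept for signature parity
    path.write_text(json.dumps(data, indent=2))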
The Architecture: A Sophisticated Python Application
What started as a simple API wrapper evolved into a 1,370-line Python application with:
Core Components
1. Worker Orchestration (worker.py)
- Two-phase execution (context → implementation)
- Dynamic model selection based on complexity
- Loop detection and recovery
- Question detection and forced action
2. API Client (api_client.py)
- OpenAI Responses API integration
- Conversation chaining with previous_response_id
- Tool calling support
- Visual feedback during API calls
3. Cost Tracking (cost_tracker.py, cost_logger.py)
- Real-time token usage tracking
- USD/EUR cost calculation
- Aggregated reporting (daily/weekly/monthly/yearly)
- Per-task cost attribution
4. Smart Optimizations
- File Caching: Avoid redundant file reads
- Context Summarization: Reduce token bloat
- Smart Model Selection: Cheap models for reading, expensive models for writing
- Parallel Tool Execution: Concurrent reads during the context phase (see the sketch below)
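A minimal sketch of what parallel tool execution could look like during the context phase, assuming read-only tool calls are safe to run concurrently (execute_tool_call and the result shape are assumptions):

from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls_parallel(self, calls: list[dict]) -> list[dict]:
    """Run read-only tool calls concurrently during the context phase (sketch)."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(self.execute_tool_call, calls))  # dispatcher is assumed
    # Package results as Responses API function_call_output input items
    return [
        {"type": "function_call_output", "call_id": call["id"], "output": out}
        for call, out in zip(calls, outputs)
    ]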
Example: Running the Worker
# Simple usage
./scripts/clubhub-ai-worker.sh developer
# Auto-selects highest priority task, locks it, implements it
# Automatically merges to develop when complete
What happens under the hood:
1. Context Gathering (iterations 1-10, gpt-5-mini)
1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
⚙️ read_backlog
⚙️ read_file: docs/backlog/epic-A-mvp-core/A37-*.md
⚙️ read_file: frontend/src/pages/Members.tsx
💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR
2. Complexity Assessment
🔍 Complexity Assessment:
Files read: 9
Large files (>10KB): 8
Assessment: COMPLEX - 9 files, 8 large files
→ Model selection: gpt-5.1-codex
→ Reasoning: Complex task benefits from enhanced capability
3. Implementation (iterations 11+, gpt-5.1-codex)
1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
⚙️ write_file: frontend/src/pages/Members.tsx
✅ Wrote 12006 bytes
💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR
Challenges of Direct API Use
Challenge 1: Conversation Management
The Responses API uses previous_response_id for conversation chaining. Managing this correctly was tricky:
# Wrong: always including previous_response_id
response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id,  # ❌ Breaks on phase transitions
)

# Right: reset on phase transitions, then chain as usual
if transitioning_phases:
    self.previous_response_id = None  # Start a fresh conversation
response = api.create_response(
    input_data=prompt,
    previous_response_id=self.previous_response_id,
)
Challenge 2: Tool Call Extraction
The Responses API returns tool calls in a nested structure. Extracting them correctly required careful parsing:
import json
from typing import Dict, List

def extract_function_calls(self, response: Dict) -> List[Dict]:
    """Extract function calls from an API response."""
    function_calls = []
    # Responses API structure: response.items[].content[].function_call
    for item in response.get("items", []):
        for content in item.get("content", []):
            if content.get("type") == "function_call":
                function_calls.append({
                    "id": content.get("call_id"),
                    "name": content.get("function", {}).get("name"),
                    "arguments": json.loads(
                        content.get("function", {}).get("arguments", "{}")
                    ),
                })
    return function_calls
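Once extracted, each call has to be executed and its result fed back into the conversation. A rough sketch of that loop step, assuming the Responses API's function_call_output input shape (execute_tool_call is a stand-in for the worker's dispatcher):

# Execute each extracted call and return results as input items (sketch)
outputs = []
for call in function_calls:
    result = execute_tool_call(call["name"], call["arguments"])  # assumed dispatcher
    outputs.append({
        "type": "function_call_output",
        "call_id": call["id"],
        "output": result,
    })

# Continue the conversation with the tool results
response = api.create_response(
    input_data=outputs,
    tools=tools,
    model=model,
    previous_response_id=response["id"],
)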
Challenge 3: Error Handling
API errors needed graceful handling without breaking the worker:
import sys
import requests

try:
    response = self.api_client.create_response(...)
except requests.RequestException as e:
    # Log the error but don't crash the worker
    print(f"⚠️ API call failed: {e}", file=sys.stderr)
    # Retry with exponential backoff or continue with cached context
    return self._handle_api_error(e)
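_handle_api_error itself isn't shown. A minimal retry-with-backoff sketch of what it might do (retry counts and delays are assumptions; requests is imported above):

import time

def _handle_api_error(self, error: Exception, max_retries: int = 3) -> dict:
    """Retry the failed call with exponential backoff (hypothetical sketch)."""
    for attempt in range(max_retries):
        time.sleep(2 ** attempt)  # 1s, 2s, 4s between attempts
        try:
            return self.api_client.create_response(
                input_data=self.current_input,
                tools=self.tools,
                model=self.get_current_model(),
            )
        except requests.RequestException:
            continue
    # Give up on the network; surface the failure to the caller
    raise RuntimeError(f"API unreachable after {max_retries} retries: {error}")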
Cost Tracking and Monitoring
Real-Time Cost Display
Every iteration shows cost:
💰 Tokens: in=7063 out=278 (reasoning=256) | Total: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR
Aggregated Reporting
Costs are tracked at multiple levels:
from datetime import datetime

def log_task_cost(
    work_dir: str,
    worker_id: str,
    task_id: str,
    cost_usd: float,
    cost_eur: float,
    tokens: TokenUsage,
    model: str,
    success: bool,
):
    """Log cost to both the task log and the cost tracking file."""
    # Append to the human-readable task log
    task_log_entry = (
        f"{datetime.utcnow().isoformat()}Z | {worker_id} | {task_id} | "
        f"{'completed' if success else 'failed'} | "
        f"cost: ${cost_usd:.4f} USD / €{cost_eur:.4f} EUR "
        f"(tokens: in={tokens.input_tokens:,} out={tokens.output_tokens:,} "
        f"reasoning={tokens.reasoning_tokens:,}, model: {model})"
    )
    # Update the machine-readable JSON
    update_cost_tracking_json(
        work_dir=work_dir,
        task_id=task_id,
        cost_usd=cost_usd,
        cost_eur=cost_eur,
        success=success,
    )
Cost Optimization Results
Before optimization:
- Single model (gpt-5.1-codex) for everything
- Average cost per task: ~$0.15-0.30
After optimization:
- Two-phase approach with smart model selection
- Average cost per task: ~$0.05-0.10
- ~60-70% cost reduction
Key Learnings
1. Direct API Control is Powerful
Moving from CLI to API gave us:
- Fine-grained control over requests
- Better error handling and retries
- Cost optimization opportunities
- Custom tool calling logic
2. Two-Phase Approach is Essential
Using cheap models for exploration and expensive models only when needed:
- Reduces costs by 60-80%
- Maintains quality for implementation
- Scales better for multiple workers
3. Observability Matters
Cost tracking and logging helped us:
- Identify expensive operations (reading large files repeatedly)
- Optimize model selection (when to use expensive models)
- Debug issues (why did this task cost so much?)
4. Iterative Development Works
30+ commits in 24 hours, each addressing a specific issue:
- Start simple, add complexity as needed
- Fix issues as they arise
- Measure and optimize continuously
The Result: A Production-Ready AI Worker
What we built:
- ✅ Autonomous task completion - Workers pick tasks, implement them, merge automatically
- ✅ Cost-optimized - 60-70% cost reduction through smart model selection
- ✅ Observable - Real-time cost tracking and detailed logging
- ✅ Robust - Loop detection, question handling, error recovery
- ✅ Tested - 69+ unit tests with >90% coverage
Example output:
🚀 Starting AI Worker Runner
Project: clubhub
Role: developer
🔍 Auto-selecting highest priority available task...
✅ Successfully locked task: A37
1️⃣ Making API call (model: gpt-5-mini) | 🔍 Exploring project structure
⚙️ read_backlog
📋 Task ID updated from backlog read: auto -> A37
💰 Tokens: in=7063 out=278 | Cost: $0.0023 USD / €0.0021 EUR
[... 30 iterations later ...]
🔍 Complexity Assessment:
Assessment: COMPLEX - 9 files, 8 large files
→ Model selection: gpt-5.1-codex
🔄 Transitioning from context gathering to implementation phase...
1️⃣ Making API call (model: gpt-5.1-codex) | 💻 Implementing
⚙️ write_file: frontend/src/pages/Members.tsx
✅ Wrote 12006 bytes
💰 Tokens: in=20173 out=862 | Cost: $0.0167 USD / €0.0154 EUR
✅ Task completed successfully
📤 Pushing to develop...
✅ Automatic merge and cleanup completed successfully!
Code Examples
Complete Worker Execution
# Entry point: openai-worker.py
from openai_worker.config import WorkerConfig
from openai_worker.worker import AIWorker
config = WorkerConfig.from_env()
worker = AIWorker(prompt, config)
exit_code = worker.run()
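WorkerConfig.from_env() isn't shown in the post. A plausible sketch, assuming configuration comes from environment variables (the variable names and defaults here are invented):

import os
from dataclasses import dataclass

@dataclass
class WorkerConfig:
    api_key: str
    max_context_tokens: int
    summarize_every_n_implementation: int

    @classmethod
    def from_env(cls) -> "WorkerConfig":
        """Build config from environment variables (hypothetical names)."""
        return cls(
            api_key=os.environ["OPENAI_API_KEY"],
            max_context_tokens=int(os.environ.get("MAX_CONTEXT_TOKENS", "100000")),
            summarize_every_n_implementation=int(os.environ.get("SUMMARIZE_EVERY_N", "10")),
        )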
API Client Usage
# api_client.py
import requests
from typing import Any, Dict, List, Optional, Union

class APIClient:
    def create_response(
        self,
        input_data: Union[str, List[Dict[str, Any]]],
        tools: List[Dict[str, Any]],
        model: str,
        previous_response_id: Optional[str] = None,
        force_tool_use: bool = False,
    ) -> Dict[str, Any]:
        request_body = {
            "model": model,
            "tools": tools,
            "input": input_data,
        }
        if previous_response_id:
            request_body["previous_response_id"] = previous_response_id
        if force_tool_use:
            request_body["tool_choice"] = "required"
        response = requests.post(
            f"{self.base_url}/responses",
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json",
            },
            json=request_body,
        )
        return response.json()
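Chaining two turns with this client might look like the following, assuming the constructor wires up base_url and api_key (the prompts and tool list are placeholders):

client = APIClient()
first = client.create_response(
    input_data="Read the backlog and pick a task",
    tools=tools,
    model="gpt-5-mini",
)
# Chain the next turn onto the first via previous_response_id
second = client.create_response(
    input_data="Now summarize what you found",
    tools=tools,
    model="gpt-5-mini",
    previous_response_id=first.get("id"),
)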
Cost Tracking
# cost_tracker.py
from dataclasses import dataclass

@dataclass
class CostTracker:
    def calculate_cost(self, model: str) -> tuple[float, float]:
        pricing = self.config.get_pricing(model)
        input_cost = (
            self.usage.input_tokens * pricing["input"] / 1_000_000 +
            self.cached_tokens * pricing["input"] * 0.1 / 1_000_000
        )
        output_cost = self.usage.output_tokens * pricing["output"] / 1_000_000
        total_usd = input_cost + output_cost
        total_eur = total_usd * self.config.usd_to_eur
        return total_usd, total_eur
What's Next?
The system is working well, but there's always room for improvement:
- Better context compression - Summarize more aggressively
- Smarter file caching - Cache parsed ASTs, not just raw files
- Parallel context gathering - Multiple workers reading different files simultaneously
- Predictive model selection - Use ML to predict optimal model before starting
But for now, we have a production-ready, cost-optimized, autonomous AI worker that can handle real development tasks.
Conclusion
What started as a workaround for CLI throttling became a complete rewrite and optimization effort. The result is a more capable, cost-effective, and observable system.
Key takeaways:
- Direct API access gives you control and flexibility
- Two-phase optimization dramatically reduces costs
- Observability (cost tracking, logging) is essential
- Iterative development works - fix issues as they arise
The migration from Codex CLI to OpenAI API wasn't just about avoiding throttling - it was about building a better system.
Built in 24 hours. 30+ commits. 1,370 lines of Python. 69+ tests. 60-70% cost reduction. Production-ready.