Haiku 4.5 Findings
Claude Haiku 4.5, when paired with Jimmy’s Workflow, delivered 1.8x faster execution, 67% lower cost, and 5% higher quality scores than Claude Sonnet 4.5 operating without a structured workflow. This was the opposite of the expected result.
Executive summary
| Metric | Larger model (Sonnet 4.5) | Smaller model + workflow (Haiku 4.5) | Difference |
|---|---|---|---|
| Average speed | 12.9s per query | 7.6s per query | 1.8x faster |
| Cost (input/output per 1M tokens) | $3 / $15 | $1 / $5 | 67% cheaper |
| Quality score | 4.4 / 5 | 4.6 / 5 | +5% (Haiku higher) |
| Success rate | 71% | 86% | +15 percentage points |
Testing was conducted on a content processing platform with a production database of 3,833 articles across 29 RSS sources.
The synergy discovery
The central finding was not that Haiku is “good enough” — it is that Jimmy’s Workflow eliminates Haiku’s primary weakness while amplifying its primary strength.
Haiku’s weakness: Less implicit inference. Without explicit structure, Haiku makes more assumptions and misses nuance that larger models handle intuitively.
Haiku’s strength: Fast, precise pattern execution. When told exactly what to do and how to validate it, Haiku follows instructions with higher fidelity than Sonnet.
Jimmy’s Workflow provides: Explicit thinking templates, mandatory validation gates, progressive disclosure patterns, and structured output requirements.
The result is a synergy — the workflow compensates for the model’s limitation and the model excels at the workflow’s demands. Haiku does not merely tolerate the structured workflow. It thrives on it.
Quantitative results
| Metric | Sonnet 4.5 | Haiku 4.5 |
|---|---|---|
| Average response time | 12.9s | 7.6s |
| Consistency | Variable (high variance) | Consistent (low variance) |
| Speed advantage | Baseline | 1.8x faster |
Haiku was not only faster on average but more predictable. The lower variance matters for production systems where consistent response times affect user experience.
| Metric | Sonnet 4.5 | Haiku 4.5 |
|---|---|---|
| Input cost (per 1M tokens) | $3.00 | $1.00 |
| Output cost (per 1M tokens) | $15.00 | $5.00 |
| Cost ratio | Baseline | 67% cheaper |
The 67% cost reduction reflects token pricing alone. Combined with the speed improvement and higher success rate, the effective savings are larger still once retries and timeouts are accounted for.
Quality
| Metric | Sonnet 4.5 | Haiku 4.5 + workflow |
|---|---|---|
| Quality score | 4.4 / 5 | 4.6 / 5 |
| Success rate | 71% | 86% |
This was the unexpected result. The smaller model with structured workflow scored higher on quality than the larger model. The explanation lies in the behavioural differences documented below.
What works excellently with smaller models + structured workflow
Explicit thinking templates
Haiku follows thinking templates more thoroughly than Sonnet. Where Sonnet sometimes abbreviates its reasoning (treating steps as obvious), Haiku executes every step of the template with full detail. The result is more transparent, more auditable reasoning.
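The actual Jimmy’s Workflow templates are not reproduced here, so the following is a minimal, hypothetical sketch of what an explicit thinking template can look like. The step wording, placeholders, and `build_prompt` helper are all illustrative, not taken from the tested workflow.

```python
# Hypothetical thinking template: every reasoning step is spelled out so the
# model cannot treat any of them as "obvious" and skip ahead.
THINKING_TEMPLATE = """\
Task: {task}

Work through every step below in order. Do not skip or merge steps.
1. Restate the task in your own words.
2. List the inputs you have and the assumptions you are making.
3. Outline your approach before executing it.
4. Execute the approach, showing each intermediate result.
5. Validate the result against the success criteria: {criteria}
6. Output the final answer in the required format: {output_format}
"""

def build_prompt(task: str, criteria: str, output_format: str) -> str:
    """Fill the template so every expectation is explicit."""
    return THINKING_TEMPLATE.format(
        task=task, criteria=criteria, output_format=output_format
    )
```

The point of the numbered steps is precisely the behaviour described above: a model that follows instructions with high fidelity will emit all six stages, making its reasoning auditable.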
Mandatory validation gates
Haiku never skipped a validation gate during testing. Not once. Sonnet occasionally treated validation as a formality, producing abbreviated checks. Haiku treated every gate as a genuine checkpoint, running full validation each time.
Progressive disclosure patterns
When given multi-step tasks with progressive complexity, Haiku executed the progression flawlessly. Each step built cleanly on the previous one, with no shortcuts or collapsed steps.
Safety behaviour
Haiku was more cautious and more detailed in its safety checks than Sonnet. It flagged more edge cases, asked more clarifying questions, and documented more assumptions. For production systems, this is a feature, not a limitation.
Self-correction via validation gates
When Haiku’s initial output contained an error, the validation gate triggered self-correction. The structured workflow creates a feedback loop that catches and fixes mistakes before they reach the output. This self-healing property is critical for reliability.
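The feedback loop can be sketched in a few lines. This is an illustrative pattern, not the workflow’s actual implementation: `call_model`, `validate`, and the retry limit are placeholders for your model client and task-specific checks.

```python
from typing import Callable

def run_with_validation(
    call_model: Callable[[str], str],
    validate: Callable[[str], list],  # returns a list of failure descriptions
    prompt: str,
    max_attempts: int = 3,  # retry budget is an assumption
) -> str:
    """Run a prompt through a validation gate, feeding failures back
    to the model so it can self-correct before anything is emitted."""
    output = call_model(prompt)
    for _ in range(max_attempts - 1):
        failures = validate(output)
        if not failures:
            return output  # gate passed
        # Gate failed: describe the failures and ask for a corrected answer.
        correction_prompt = (
            f"{prompt}\n\nYour previous answer failed validation:\n"
            + "\n".join(f"- {f}" for f in failures)
            + "\n\nFix these issues and answer again."
        )
        output = call_model(correction_prompt)
    if validate(output):
        raise RuntimeError("Output failed validation after all attempts")
    return output
```

The design choice worth noting is that failures are surfaced as explicit text in the retry prompt, which is what turns the gate from a filter into a correction mechanism.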
What struggles without workflow structure
Implicit expectations
Without explicit structure, Haiku’s quality drops measurably. Tasks that rely on the model “knowing what you mean” without being told are where larger models have a genuine advantage. Structured workflow eliminates this gap by making all expectations explicit.
Schema assumptions
Both Haiku and Sonnet make incorrect assumptions about data schemas when not given explicit schema definitions. This is not a Haiku-specific weakness — it is a universal AI limitation that structured workflow mitigates equally for both model tiers.
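One way to mitigate this is to embed the schema directly in the prompt rather than letting the model guess. The sketch below uses an invented schema for an article database; the table and column names are hypothetical, not taken from the tested platform.

```python
import json

# Hypothetical schema for illustration only.
ARTICLE_SCHEMA = {
    "table": "articles",
    "columns": {
        "id": "integer, primary key",
        "title": "text",
        "source": "text, one of the RSS source names",
        "published_at": "ISO 8601 timestamp",
    },
}

def prompt_with_schema(task: str) -> str:
    """Prefix a task with an explicit schema so the model cannot
    invent columns or types."""
    return (
        f"{task}\n\nUse exactly this schema; do not assume other "
        f"columns exist:\n{json.dumps(ARTICLE_SCHEMA, indent=2)}"
    )
```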
Behavioural differences
| Aspect | Larger model (Sonnet 4.5) | Smaller model + workflow (Haiku 4.5) |
|---|---|---|
| Thinking detail | Moderate — sometimes abbreviates | High — executes every template step |
| Validation steps per task | 3-4 | 4-5 (more thorough) |
| Workflow compliance | Good (occasional shortcuts) | Perfect (100% compliance) |
| Edge case flagging | Moderate | High (more cautious) |
| Self-correction rate | Lower (fewer errors caught) | Higher (validation gates trigger fixes) |
| Reasoning transparency | Implicit leaps | Explicit step-by-step |
The key insight
Structure matters more than raw model capability for well-defined tasks.
This is the central finding. When the task is well-defined and the workflow is explicit, the model’s “intelligence ceiling” is less important than its ability to follow structured instructions reliably. Haiku’s perfect workflow compliance outweighs Sonnet’s superior implicit reasoning for any task where the workflow can be made explicit.
This does not mean smaller models are universally better. It means that for structured, well-defined tasks — which is the majority of production workloads — investing in workflow design pays higher returns than investing in larger models.
Practical implications
- Use smaller models for structured, well-defined tasks. If you can define the task explicitly with clear inputs, expected outputs, and validation criteria, a smaller model with a structured workflow will likely match or beat a larger model.
- Always provide explicit workflow structure. Never rely on the model “figuring out” what you want. Make thinking steps, validation criteria, and output format explicit.
- Validation gates are critical. They are not overhead — they are the mechanism that makes smaller models viable. Removing validation gates to save tokens will cost more in retries and quality failures.
- Cost savings scale linearly with query volume: the 67% per-query saving applies to every query, so total savings grow in direct proportion to load.
Projected cost savings
Assuming average token usage of 2,000 input tokens and 1,000 output tokens per query:
| Daily queries | Sonnet 4.5 daily cost | Haiku 4.5 daily cost | Daily savings | Monthly savings |
|---|---|---|---|---|
| 100 | $2.10 | $0.70 | $1.40 | $42 |
| 1,000 | $21.00 | $7.00 | $14.00 | $420 |
| 10,000 | $210.00 | $70.00 | $140.00 | $4,200 |
These figures reflect token cost only. When factoring in the higher success rate (86% vs 71%), the effective savings increase further because fewer retries are needed.
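The table above can be reproduced from the published token prices. The 2,000/1,000 token counts per query are the stated assumption; the 30-day month is inferred from the $42 monthly figure. The success-rate adjustment at the end is a simple retry model (one full retry per failed query), which is an assumption, not a measured result.

```python
PRICES = {  # USD per 1M tokens: (input, output)
    "sonnet-4.5": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}

def cost_per_query(model: str, in_tokens: int = 2_000, out_tokens: int = 1_000) -> float:
    """Token cost of a single query at the published per-1M rates."""
    in_price, out_price = PRICES[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def daily_cost(model: str, queries: int) -> float:
    return cost_per_query(model) * queries

def effective_daily_cost(model: str, queries: int, success_rate: float) -> float:
    """Adjust for retries: each failure is assumed to cost one extra query."""
    return daily_cost(model, queries) / success_rate
```

At 100 queries per day this gives $2.10 for Sonnet and $0.70 for Haiku, matching the first row of the table, and dividing by the 86% vs 71% success rates widens the gap further.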
Limitations
- Sample size: 7-query test suite. Directionally strong but not statistically robust.
- Single domain: Content processing tasks only. Code generation, creative writing, and other domains were not tested.
- Model versions: Claude Haiku 4.5 and Claude Sonnet 4.5 as available in October 2025. Model updates may change the dynamics.
- Quality assessment: Scores assigned by a single evaluator on a subjective 1-5 scale.
- Workflow dependency: Results are specific to Jimmy’s Workflow v2.1. Other structured workflow systems may produce different outcomes.
These findings inform architecture decisions. They are not a substitute for testing with your own workloads and quality criteria.