AGENTS.md is a project-level configuration file that provides AI coding assistants with complete project context, development principles, and operational guidelines. It follows the agents.md standard and works across Claude, Cursor, GitHub Copilot, and other AI tools.

What is Jimmy's Workflow?

Jimmy's Workflow is a four-phase validation system (PRE-FLIGHT, IMPLEMENT, VALIDATE, CHECKPOINT) designed to prevent AI hallucination and ensure robust implementation. Each phase has specific gates and confidence levels (HIGH/MEDIUM/LOW) that determine when human review is needed.

How do I set up multiple AI instances as a team?

Assign each AI instance a role card with identity, responsibilities, personality traits, and success criteria. Use file-based handoff protocols for coordination. Personality orchestration at the system prompt level produces measurably better results than treating AI instances as generic tools.

Prompt Testing

Prompt testing applies the same rigour to prompts that software testing applies to code. This guide covers testing strategies and implementations for the CAP (Composable Agentic Prompt) workflow across Python, TypeScript, and Rust.

The prompt testing pyramid

        /\
       /  \  E2E (5%) — Full prompt runs
      /----\
     /      \  Integration (15%) — Multi-component
    /--------\
   /          \  Component (30%) — Single lens
  /------------\
 /              \  Unit (50%) — Schema, parsing
/________________\

What to test

Dimension	What	Why
Correctness	Output matches expected	Core functionality
Schema compliance	Output structure valid	Integration reliability
Robustness	Handles edge cases	Production stability
Performance	Latency and token usage	Cost management
Determinism	Consistent results	Reproducibility
Safety	No harmful outputs	Security and compliance

Python implementation

from pydantic import BaseModel
from typing import Literal, Optional

class SecurityFinding(BaseModel):
    type: str
    severity: Literal['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']
    file: str
    line: int
    evidence: str
    cwe: Optional[str]

def test_security_lens_output_schema():
    result = run_security_lens(test_input)
    finding = SecurityFinding(**result)
    assert finding.severity in ('CRITICAL', 'HIGH', 'MEDIUM', 'LOW')

def test_sql_injection_detection():
    code = "query = 'SELECT * FROM users WHERE id=' + userId"
    result = run_security_lens(code)
    assert any(f['type'] == 'sql_injection' for f in result['findings'])

TypeScript implementation

import { z } from 'zod';

const FindingSchema = z.object({
  type: z.string(),
  severity: z.enum(['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']),
  file: z.string(),
  line: z.number(),
  evidence: z.string(),
});

test('Security lens detects SQL injection', async () => {
  const code = "query = 'SELECT * FROM users WHERE id=' + userId";
  const result = await runLens(securityLens, code);

  expect(result.findings).toContainEqual(
    expect.objectContaining({
      type: 'sql_injection',
      severity: 'CRITICAL'
    })
  );
});

test('Output matches schema', async () => {
  const result = await runLens(securityLens, testInput);
  const parsed = FindingSchema.safeParse(result.findings[0]);
  expect(parsed.success).toBe(true);
});

Rust implementation

use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct SecurityFinding {
    finding_type: String,
    severity: Severity,
    file: String,
    line: u32,
    evidence: String,
}

#[derive(Deserialize, Debug, PartialEq)]
enum Severity {
    CRITICAL,
    HIGH,
    MEDIUM,
    LOW,
}

#[test]
fn test_output_schema_valid() {
    let result = run_security_lens(test_input());
    let finding: SecurityFinding = serde_json::from_str(&result)
        .expect("Output should deserialise to SecurityFinding");
    assert!(matches!(finding.severity, Severity::CRITICAL | Severity::HIGH));
}

Testing patterns

Boundary exclusion testing

Verify that components respect their boundaries:

def test_security_lens_ignores_performance():
    """Security lens should NOT flag performance issues"""
    slow_code = "for i in range(1000000): data.append(fetch(i))"
    result = run_security_lens(slow_code)
    assert not any(f['type'].startswith('perf') for f in result['findings'])

Determinism testing

Run the same prompt multiple times and check consistency:

def test_determinism():
    results = [run_security_lens(test_input) for _ in range(5)]
    types = [set(f['type'] for f in r['findings']) for r in results]
    assert all(t == types[0] for t in types), "Results should be consistent"

Token budget testing

Track and control cost:

def test_token_budget():
    result = run_lens_with_metrics(security_lens, large_codebase)
    assert result.input_tokens < 50000, "Should stay under budget"
    assert result.output_tokens < 5000, "Output should be concise"

Full reference

The complete prompt testing guide (1200+ lines with CI/CD integration, cross-language patterns, and advanced testing strategies) is at prompt-testing-implementation-guide.md.