Skip to content

Prompt Testing

Prompt testing applies the same rigour to prompts that software testing applies to code. This guide covers testing strategies and implementations for the CAP (Composable Agentic Prompt) workflow across Python, TypeScript, and Rust.

/\
/ \ E2E (5%) — Full prompt runs
/----\
/ \ Integration (15%) — Multi-component
/--------\
/ \ Component (30%) — Single lens
/------------\
/ \ Unit (50%) — Schema, parsing
/________________\
DimensionWhatWhy
CorrectnessOutput matches expectedCore functionality
Schema complianceOutput structure validIntegration reliability
RobustnessHandles edge casesProduction stability
PerformanceLatency and token usageCost management
DeterminismConsistent resultsReproducibility
SafetyNo harmful outputsSecurity and compliance
from pydantic import BaseModel
from typing import Literal, Optional
class SecurityFinding(BaseModel):
type: str
severity: Literal['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']
file: str
line: int
evidence: str
cwe: Optional[str]
def test_security_lens_output_schema():
result = run_security_lens(test_input)
finding = SecurityFinding(**result)
assert finding.severity in ('CRITICAL', 'HIGH', 'MEDIUM', 'LOW')
def test_sql_injection_detection():
code = "query = 'SELECT * FROM users WHERE id=' + userId"
result = run_security_lens(code)
assert any(f['type'] == 'sql_injection' for f in result['findings'])
import { z } from 'zod';
const FindingSchema = z.object({
type: z.string(),
severity: z.enum(['CRITICAL', 'HIGH', 'MEDIUM', 'LOW']),
file: z.string(),
line: z.number(),
evidence: z.string(),
});
test('Security lens detects SQL injection', async () => {
const code = "query = 'SELECT * FROM users WHERE id=' + userId";
const result = await runLens(securityLens, code);
expect(result.findings).toContainEqual(
expect.objectContaining({
type: 'sql_injection',
severity: 'CRITICAL'
})
);
});
test('Output matches schema', async () => {
const result = await runLens(securityLens, testInput);
const parsed = FindingSchema.safeParse(result.findings[0]);
expect(parsed.success).toBe(true);
});
use serde::Deserialize;
#[derive(Deserialize, Debug)]
struct SecurityFinding {
finding_type: String,
severity: Severity,
file: String,
line: u32,
evidence: String,
}
#[derive(Deserialize, Debug, PartialEq)]
enum Severity {
CRITICAL,
HIGH,
MEDIUM,
LOW,
}
#[test]
fn test_output_schema_valid() {
let result = run_security_lens(test_input());
let finding: SecurityFinding = serde_json::from_str(&result)
.expect("Output should deserialise to SecurityFinding");
assert!(matches!(finding.severity, Severity::CRITICAL | Severity::HIGH));
}

Verify that components respect their boundaries:

def test_security_lens_ignores_performance():
"""Security lens should NOT flag performance issues"""
slow_code = "for i in range(1000000): data.append(fetch(i))"
result = run_security_lens(slow_code)
assert not any(f['type'].startswith('perf') for f in result['findings'])

Run the same prompt multiple times and check consistency:

def test_determinism():
results = [run_security_lens(test_input) for _ in range(5)]
types = [set(f['type'] for f in r['findings']) for r in results]
assert all(t == types[0] for t in types), "Results should be consistent"

Track and control cost:

def test_token_budget():
result = run_lens_with_metrics(security_lens, large_codebase)
assert result.input_tokens < 50000, "Should stay under budget"
assert result.output_tokens < 5000, "Output should be concise"

The complete prompt testing guide (1200+ lines with CI/CD integration, cross-language patterns, and advanced testing strategies) is at prompt-testing-implementation-guide.md.