Agentic Engineering Patterns: What Simon Willison Is Documenting
Executive Summary
Agentic engineering is maturing from ad-hoc prompting into an operational discipline. The strongest emerging pattern language centers on red/green loops, tests-first session startup, role separation, and explicit closure states. Independent guidance from practitioners and model vendors converges on the same thesis: workflow governance now determines quality more than raw code-generation capability.
Introduction & Background
The discourse around coding agents often fixates on model output quality. But in production settings, teams are discovering that output quality is only one variable: reproducibility, verification cadence, and review structure dominate outcomes. Simon Willison’s documentation effort matters because it captures this transition from clever prompting to operator-grade methods.
Methodology
This report synthesizes primary pattern documentation and corroborating practitioner/operator references. Sources were selected for practical relevance to engineering operations. Claims were included only when they provided actionable mechanism-level insight (how teams should run work), not just conceptual framing.
Key Findings
- Code generation is commoditizing; verification is not. The bottleneck is trust in outcomes, not volume of output.
- Red/green is becoming the default reliability loop. It provides a compact but durable control on agent output quality.
- Session bootstrap quality strongly predicts drift. A “first, run tests” discipline reduces context drift early.
- Linear walkthroughs create auditable system understanding. They improve onboarding and review quality in unfamiliar codebases.
- Parallel agent throughput requires governance. Without bounded review, parallelism amplifies defects.
- Vendor guidance converges on decomposition plus evaluators. Anthropic and OpenAI operator guidance align with practitioner patterns.
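The red/green loop named above can be made concrete as a small control function. This is an illustrative sketch, not Willison’s or any vendor’s actual implementation: `run_tests`, `propose_patch`, and `apply_patch` are hypothetical callbacks a team would wire to its own test runner and agent.

```python
def red_green_loop(run_tests, propose_patch, apply_patch, max_attempts=3):
    """Generic red/green control loop for agent-produced changes (illustrative).

    run_tests() -> bool        True when the suite is green.
    propose_patch(n) -> patch  Ask the agent for a candidate change.
    apply_patch(patch) -> None Apply the candidate to the working tree.
    """
    # Red phase: require a failing suite before any implementation.
    # This catches vacuous tests that could never fail.
    if run_tests():
        raise RuntimeError("expected a failing (red) suite before implementation")
    for attempt in range(1, max_attempts + 1):
        apply_patch(propose_patch(attempt))
        # Green phase: a passing suite is the acceptance gate.
        if run_tests():
            return attempt
    # Explicit failure state instead of silently shipping red output.
    raise RuntimeError("blocked: suite still red after max_attempts")


# Toy harness: a "codebase" whose test passes only once patched.
state = {"patched": False}
attempts = red_green_loop(
    run_tests=lambda: state["patched"],
    propose_patch=lambda n: {"fix": True},
    apply_patch=lambda patch: state.update(patched=patch["fix"]),
)
```

The red-phase guard is the part teams most often skip: starting from a passing suite means the loop cannot distinguish a real fix from a test that never exercised the change.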
Analysis & Discussion
The common misconception is that better prompts alone solve reliability. The evidence says otherwise: prompts matter most when embedded in enforced loops. This is the same historical pattern seen in software delivery more broadly—teams win by institutionalizing quality constraints, not by depending on individual heroics. Agentic engineering appears to be replaying this arc faster.
Recommendations & Conclusion
Adopt a minimum operating standard: tests-first startup, red/green verification for non-trivial changes, planner/implementer/verifier role separation, and explicit terminal states (done/blocked/cancelled). For orgs scaling agent usage, treat this as process infrastructure, not optional craftsmanship. The strategic upside is not just fewer failures—it is faster trustworthy iteration.
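The explicit terminal states recommended above (done/blocked/cancelled) can be enforced in an orchestration layer so that no agent session ends ambiguously. A minimal sketch, assuming a Python orchestrator; `TerminalState` and `AgentSession` are illustrative names, not from any specific tool.

```python
from enum import Enum


class TerminalState(Enum):
    DONE = "done"
    BLOCKED = "blocked"
    CANCELLED = "cancelled"


class AgentSession:
    """A unit of agent work that must end in exactly one explicit terminal state."""

    def __init__(self, task: str):
        self.task = task
        self.state = None  # None means the session is still open
        self.note = ""

    def close(self, state: TerminalState, note: str = "") -> None:
        # Closure is one-shot: a session cannot be re-closed or re-opened,
        # which keeps audit logs unambiguous.
        if self.state is not None:
            raise RuntimeError(f"session already closed as {self.state.value}")
        self.state = state
        self.note = note

    @property
    def is_open(self) -> bool:
        return self.state is None


session = AgentSession("migrate config loader")
session.close(TerminalState.BLOCKED, "missing credentials for staging")
```

Recording a reason alongside the state turns "blocked" from a dead end into a reviewable handoff.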