Google Engineering Director: 21 Design Patterns for Creating AI Agents

Author: Yanhua

Antonio Gullí, an Engineering Director at Google, wrote a 453-page book that breaks the development of AI Agents down into 21 design patterns.

But this is not a book review. My motivation for reading this book is very specific: I have written about Harness Engineering, shared the pitfalls I hit with Clawdbot, and, in “AI agents are not magic,” discussed the seven turning points that take an Agent from burning tokens to being genuinely useful. After each piece, one question remained that I had not fully thought through: is there a reusable underlying logic behind these things?

This book gave me an answer, and it was deeper than I expected.

What You Write May Not Be an Agent at All

The harshest judgment in the book is hidden in the prologue.

Most people using “AI” are only at Level 0: bare LLMs, without tools, memory, or actions. Ask one which film won Best Picture at the 2025 Oscars, and it guesses. The book states plainly: Level 0 things are not Agents.

Real Agents start one level up:

Level 1: Tool User

At this level the Agent starts using tools: search, APIs, databases. But being able to call an interface is not enough; the Agent also has to judge when to call, what to call, and how to use the results. The book gives a very specific example: when a user asks, “What new shows are there recently?”, the Agent realizes that this information is not in its training data, proactively calls the search tool, and then synthesizes the results. The key step is “realizing on its own.” No human tells it, “Go search”; it judges for itself that it needs to search. This judgment is the threshold for Level 1.
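To make that threshold concrete, here is a minimal sketch of the decide-then-call flow. It is my own illustration, not code from the book: search_tool, choose_tool, and the keyword list are hypothetical, and the keyword check is only a stand-in for a judgment that a real Agent would leave to the model itself.

```python
from typing import Callable, Dict, Optional


def search_tool(query: str) -> str:
    """Hypothetical stand-in for a real search API; returns canned text."""
    return f"[search results for: {query}]"


# Registry of tools the agent is allowed to call.
TOOLS: Dict[str, Callable[[str], str]] = {"search": search_tool}

# Words hinting that the answer depends on information newer than training data.
TIME_SENSITIVE_HINTS = ("recently", "latest", "today", "this week")


def choose_tool(question: str) -> Optional[str]:
    """Toy replacement for the model's own judgment about whether to call a tool."""
    lowered = question.lower()
    if any(hint in lowered for hint in TIME_SENSITIVE_HINTS):
        return "search"
    return None


def answer(question: str) -> str:
    tool_name = choose_tool(question)
    if tool_name is None:
        # Level 0 behaviour: answer from what the model already knows.
        return f"(from training data) answer to: {question}"
    # Level 1 behaviour: call the tool, then synthesize its output.
    observation = TOOLS[tool_name](question)
    return f"(synthesized from {tool_name}) {observation}"
```

Calling answer("What new shows are there recently?") takes the search branch, while answer("What is a factorial?") does not. The part that matters is that the branch is chosen inside the Agent, not by the caller.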

Level 2: Strategic Thinker

This level adds two elements: planning and Context Engineering. The book defines Context Engineering not as piling up information but as carefully selecting, trimming, and packaging context. It gives a clever example: when a user wants to find a coffee shop between two locations, the Agent first calls the map tool, which returns a lot of data, then judges that “only the street names are needed next,” trims the map output into a short list, and feeds that list to the local search tool. Each step reduces noise in the information.
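The coffee shop example can be sketched in a few lines. This is an illustration under my own assumptions: map_tool and local_search_tool are hypothetical stubs with made-up data, and only the trimming step in the middle is the point.

```python
from typing import Dict, List


def map_tool(origin: str, destination: str) -> Dict:
    """Hypothetical map API: returns far more detail than the next step needs."""
    return {
        "origin": origin,
        "destination": destination,
        "eta_seconds": 540,
        "route": [
            {"street": "Main St", "distance_m": 400, "geometry": "encoded-1"},
            {"street": "Oak Ave", "distance_m": 250, "geometry": "encoded-2"},
            {"street": "Main St", "distance_m": 120, "geometry": "encoded-3"},
        ],
    }


def trim_to_streets(map_output: Dict) -> List[str]:
    """Keep only the unique street names, in route order."""
    streets: List[str] = []
    for step in map_output["route"]:
        if step["street"] not in streets:
            streets.append(step["street"])
    return streets


def local_search_tool(query: str, streets: List[str]) -> List[str]:
    """Hypothetical local search: receives the short list, not the raw map data."""
    return [f"{query} near {street}" for street in streets]


raw = map_tool("Office", "Train station")
streets = trim_to_streets(raw)          # the context-trimming step
candidates = local_search_tool("coffee shop", streets)
```

The second tool never sees distances, geometry, or ETA. It receives two street names, which is all it needs.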

There is a sentence in the book that I read several times: “To achieve the highest accuracy with AI, you must provide it with short, focused, and powerful context.” Context Engineering is about doing this.

At this level, the Agent can also self-reflect. After completing a task, it reviews its work, identifies problems, and makes corrections. I will elaborate on this later.

Level 3: Multi-Agent Collaboration

The book’s stance is clear: stop thinking about creating an all-powerful super agent. The truly reliable approach is to assemble a team, like a project manager Agent + researcher Agent + designer Agent + copywriter Agent. An example given in the book is a new product launch: a “project manager Agent” coordinates everything, assigning tasks to “market research Agent,” “product design Agent,” and “marketing Agent.” The key is communication: how Agents transmit data, synchronize states, and handle conflicts. This chapter illustrates six types of communication topologies, from the simplest single Agent to the most flexible custom mixes, with explanations of what scenarios each is suitable for.

After reviewing these four levels, I suddenly understood why many people say, “My Agent is not useful.” The model is not the problem; the issue is that you are treating it like a chatbot, and it may not even have reached Level 1.

Context Engineering: The Most Underestimated Concept in the Book

I previously wrote about Harness Engineering, arguing that track design matters more than engine horsepower. After reading this book, I realized that Context Engineering is the prompt-level counterpart of Harness Engineering.

Traditional Prompt Engineering only concerns “how you ask.” The Context Engineering in the book deals with “what context is in front of the Agent before asking.” It includes four layers of information:

First Layer, System Prompt. Defines who the Agent is, what tone to use, and what boundaries to set. Most people only write this layer.

Second Layer, External Data. Documents retrieved by RAG, return values from tool calls, real-time API data. This is where most people get stuck: they know they need to feed data but don’t know how to do it without overwhelming the model.

Third Layer, Implicit Data. User identity, interaction history, environmental state. Information that you haven’t explicitly stated but the Agent should know. For example, if you tell the Agent, “Help me email John to confirm tomorrow’s meeting,” it should know what tomorrow’s meeting is from your calendar and what your relationship with John is.

Fourth Layer, Feedback Loop. After each output, the Agent automatically evaluates quality and adjusts the context strategy for the next time. The book refers to this as “automated context optimization,” and Google’s Vertex AI Prompt Optimizer is an engineering implementation of this idea.
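One way to picture the four layers is as fields of a single structure that gets assembled before every call. The sketch below is my own; ContextBundle, build_context, and the sample data are hypothetical, and it assumes external_data is already ranked by relevance so that a simple cap keeps the most useful items.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContextBundle:
    system_prompt: str                                            # layer 1
    external_data: List[str] = field(default_factory=list)       # layer 2
    implicit_data: Dict[str, str] = field(default_factory=dict)  # layer 3
    feedback_notes: List[str] = field(default_factory=list)      # layer 4


def build_context(bundle: ContextBundle, max_documents: int = 2) -> str:
    """Assemble the four layers into one prompt, capping external data."""
    documents = bundle.external_data[:max_documents]
    sections = [
        "## System\n" + bundle.system_prompt,
        "## External data\n" + "\n".join(f"- {d}" for d in documents),
        "## Implicit data\n"
        + "\n".join(f"- {k}: {v}" for k, v in bundle.implicit_data.items()),
        "## Lessons from earlier outputs\n"
        + "\n".join(f"- {n}" for n in bundle.feedback_notes),
    ]
    return "\n\n".join(sections)


bundle = ContextBundle(
    system_prompt="You are a concise email assistant.",
    external_data=[
        "Calendar: 10:00 tomorrow, project sync with John",
        "Last email from John: asks to confirm the agenda",
        "Company newsletter, March edition",
    ],
    implicit_data={"user_role": "project lead", "relationship_to_John": "teammate"},
    feedback_notes=["Previous draft was too long; keep it under 80 words."],
)
prompt = build_context(bundle)
```

With max_documents=2 the newsletter never reaches the model. The fourth layer here is just a list of notes; how those notes get written automatically is the hard part, and this sketch does not attempt it.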

Reading this, I recalled a point I made in “AI agents are not magic”: “your agent needs rules, and many of them.” Looking back, those rules were essentially a manual version of Context Engineering, which the book has systematized.

Reflection: Two Agents Are Really Better Than One

This is the most practically valuable pattern in the entire book for me.

The core of Reflection is simple: after the Agent completes a task, it reviews its work and makes corrections if it finds problems. But the implementation method is crucial. The book clearly states: the Producer and the Critic must be two different Agents, each with a different system prompt. A single persona reviewing its own work will always have blind spots. If you have the same LLM write code and then review its own code, it is likely to say, “It’s pretty good.”

The book provides a complete code example.

The Producer’s prompt is “You are a Python developer, write a function to calculate factorials, handling edge cases and exceptions.”

The Critic’s prompt is “You are a nitpicking senior engineer, review the code line by line, checking for bugs, style, missed edge cases, and areas for improvement. If it’s perfect, output CODE_IS_PERFECT; otherwise, list all the issues.”

Then there is a for loop: Producer writes code → Critic reviews → Producer makes changes based on feedback → Critic reviews again → until Critic says CODE_IS_PERFECT or the maximum iteration count is reached.

It’s that simple. But the book reminds us of a cost issue that is easily overlooked: each reflection loop is a new LLM call, and the more iterations, the more expensive it becomes. Moreover, as the conversation history expands, the context window fills up with earlier versions and critiques, leaving less room for actual reasoning. Therefore, the best practice for Reflection is: set a reasonable maximum iteration count (the book uses 3), and stop once the Critic is satisfied; do not pursue perfection.
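The loop itself fits in a dozen lines. What follows is my own minimal sketch, not the book's code: the producer and critic are passed in as plain functions (in practice each would wrap an LLM call with its own system prompt), and fake_producer and fake_critic are hypothetical stubs so the example runs without a model. Passing only the latest draft and latest critique, rather than the full history, is my assumption for keeping the context small.

```python
from typing import Callable, Optional, Tuple

STOP_TOKEN = "CODE_IS_PERFECT"

Producer = Callable[[str, str, Optional[str]], str]  # (task, last draft, feedback)
Critic = Callable[[str, str], str]                   # (task, draft) -> critique


def reflect(task: str, producer: Producer, critic: Critic,
            max_iterations: int = 3) -> Tuple[str, int]:
    """Run Producer -> Critic rounds until approval or the iteration cap."""
    draft = ""
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        draft = producer(task, draft, feedback)
        feedback = critic(task, draft)
        if STOP_TOKEN in feedback:
            return draft, iteration
    return draft, max_iterations  # cap reached; accept the latest draft


# --- Hypothetical stubs standing in for two differently prompted LLM calls ---
FIRST_DRAFT = (
    "def factorial(n):\n"
    "    return 1 if n == 0 else n * factorial(n - 1)\n"
)
SECOND_DRAFT = (
    "def factorial(n):\n"
    "    if n < 0:\n"
    "        raise ValueError('n must be non-negative')\n"
    "    return 1 if n == 0 else n * factorial(n - 1)\n"
)


def fake_producer(task: str, draft: str, feedback: Optional[str]) -> str:
    return FIRST_DRAFT if feedback is None else SECOND_DRAFT


def fake_critic(task: str, draft: str) -> str:
    if "ValueError" in draft:
        return STOP_TOKEN
    return "Negative input recurses forever; reject it explicitly."


final_code, rounds = reflect("Write a factorial function", fake_producer, fake_critic)
```

With these stubs the Critic rejects the first draft and approves the second, so the loop stops after two rounds instead of using all three.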

The application goes far beyond writing code. Writing articles, making plans, summarizing documents, solving logic problems: all of these can use the Producer-Critic model. The book lists seven application scenarios, and the core logic is the same in each: produce first, then review, and finally correct.

With Multi-Agent, More Complex Is Not Better

What I liked most about the Multi-Agent Collaboration chapter is the six communication topology diagrams. Many people jump straight into complexity, but in fact, three types are sufficient for most scenarios:

Single Agent (Independent Execution): Tasks can be broken down into independent sub-problems, and each Agent handles its own. Simple and easy to maintain.

Peer-to-Peer Network: Agents communicate directly with each other, without a central control node. Decentralized and fault-tolerant; if one Agent fails, it doesn’t affect the whole system. However, coordination costs are high, and it can become chaotic.

Supervisor (Central Coordination): A Supervisor Agent manages a group of Worker Agents. It assigns tasks, collects results, and resolves conflicts. Clear hierarchy, easy to manage. But the Supervisor is a single point of failure and a performance bottleneck.

The other three types (Supervisor-as-Tool, hierarchical, custom mix) are variations and combinations of the first three. The book states realistically: The topology you need depends on the complexity of your task. The more fragmented the task, the higher the communication costs; at a certain point, the Supervisor model can be more efficient than a hierarchical one.
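As a rough picture of the Supervisor topology, here is a sketch of my own. The worker functions are hypothetical stubs named after the book's product launch example; in a real system each would be an Agent with its own prompt and tools, and the Supervisor would also plan and resolve conflicts rather than simply fan out.

```python
from typing import Callable, Dict

Worker = Callable[[str], str]


def market_research_agent(goal: str) -> str:
    return f"market research for: {goal}"


def product_design_agent(goal: str) -> str:
    return f"design brief for: {goal}"


def marketing_agent(goal: str) -> str:
    # Simulated failure, to show how the supervisor contains it.
    raise RuntimeError("marketing tool unavailable")


class Supervisor:
    """Central coordinator: assigns the goal to each worker and collects results."""

    def __init__(self, workers: Dict[str, Worker]) -> None:
        self.workers = workers

    def run(self, goal: str) -> Dict[str, str]:
        results: Dict[str, str] = {}
        for role, worker in self.workers.items():
            try:
                results[role] = worker(goal)
            except Exception as error:
                # One failed worker is recorded, not allowed to abort the run.
                results[role] = f"FAILED: {error}"
        return results


supervisor = Supervisor({
    "market_research": market_research_agent,
    "product_design": product_design_agent,
    "marketing": marketing_agent,
})
report = supervisor.run("new product launch")
```

Workers never talk to each other here; everything passes through Supervisor.run. That is what makes the hierarchy easy to reason about, and also why the Supervisor is the bottleneck and the single point of failure.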

My experience is that many people spend 80% of their time on communication protocols when building Multi-Agent systems and forget to ask a more fundamental question: does this task really require multiple Agents? The book makes it clear that a Level 2 single Agent with Reflection is often sufficient. Level 3 is reserved for scenarios that truly cannot be handled by a single Agent.

The Three-Layer Memory Model, Which I Had Sensed but Never Named

The Memory chapter resonated with me the most because when I wrote the articles on Obsidian + Claude, I was constantly pondering a question: How should the Agent’s memory be layered?

The book provides the answer:

Session (Conversation Layer): The context window of the current conversation. This is the shortest-lived memory and disappears when the conversation ends. Long-context models simply enlarge this window, but it is still temporary, and each inference has to process the entire window, which is costly and slow.

State (State Layer): Temporary data for the current task. For example, “What is the current task?”, “How far has it progressed?”, “What data has been generated in between?”. It lives longer than the Session but is cleared when the task ends. The book provides a complete example using Google ADK’s State mechanism.

Memory (Persistent Layer): Long-term memory that spans sessions and tasks. User preferences, learned experiences, and important historical decisions are stored in a database or vector database for semantic retrieval. The book emphasizes an important point: Memory is not just about storing; it also requires designing a complete strategy for “what to store, when to store, and how to retrieve.” Storing too much creates noise, while storing too little is insufficient.
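The difference between the three layers is mostly a difference in lifetime, which a small sketch makes visible. This is my own illustration and not Google ADK's API: AgentMemory and its methods are hypothetical, the persistent layer is a plain in-process list where a real system would use a database or vector store, and keyword overlap stands in for semantic retrieval.

```python
from typing import Dict, List


class AgentMemory:
    def __init__(self) -> None:
        self.session: List[str] = []     # conversation turns; gone when the chat ends
        self.state: Dict[str, str] = {}  # scratch data for the current task
        self.memory: List[str] = []      # long-term facts; survives sessions and tasks

    def add_turn(self, text: str) -> None:
        self.session.append(text)

    def set_state(self, key: str, value: str) -> None:
        self.state[key] = value

    def remember(self, fact: str) -> None:
        """'What to store' is a design decision; here we only skip duplicates."""
        if fact not in self.memory:
            self.memory.append(fact)

    def recall(self, query: str) -> List[str]:
        """Keyword overlap as a stand-in for semantic retrieval."""
        words = set(query.lower().split())
        return [fact for fact in self.memory if words & set(fact.lower().split())]

    def end_session(self) -> None:
        self.session.clear()   # State and Memory outlive the conversation

    def end_task(self) -> None:
        self.state.clear()     # only Memory outlives the task


agent = AgentMemory()
agent.add_turn("User: draft the weekly report")
agent.set_state("progress", "outline done")
agent.remember("User prefers short emails")
agent.end_session()
state_after_session = dict(agent.state)
agent.end_task()
```

After end_session the task progress is still there; after end_task only the remembered preference remains, and agent.recall("emails") can still find it.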

In my previous article on Clawdbot, I mentioned “state files” and “workspace documents,” which are essentially hand-crafted State and Memory layers; the book has framed this task systematically.

Five Assumptions, the Fifth One Is the Most Absurd

At the end of the book, five assumptions about the future of Agents are laid out. The first four are still reasonable extrapolations: general-purpose Agents that evolve from coding to project management, deeply personalized Agents that proactively discover your needs, embodied intelligence that moves off the screen into the physical world, and Agents that become independent economic entities.

The fifth one shocked me: Transforming Multi-Agent.

You only declare a goal, such as “Create an e-commerce business selling premium coffee.” The system automatically decides: first create a “market research Agent” and a “branding Agent.” After running a round of data, it judges that the branding Agent is no longer needed and breaks it down into three new Agents: “Logo Design Agent,” “Website Building Agent,” and “Supply Chain Agent.” If the Website Building Agent becomes a bottleneck, the system will automatically duplicate three parallel Agents to work on different pages simultaneously. Throughout the process, the system continuously optimizes each Agent’s prompt and reorganizes the team structure.

The book refers to this as a “goal-driven, self-transforming multi-Agent system.” It is not executing a plan you wrote; it is generating its own plan, adjusting its plan, and reorganizing its execution team by itself.

This reminds me of Karpathy’s AutoResearch: write a program.md, define goals, metrics, boundaries, and hit “start.” Humans are outside the loop. But this book pushes it further: even how to form and reorganize the Agent team is left to the system to decide. Humans only declare “what they want.”

Three Actions You Can Take Immediately

After reading this book, here are three things you can put into practice right away:

First, add a Critic to your current Agent. Whether you are using Claude Code, CrewAI, or your own framework, add a step at the end of your existing workflow: have another Agent (using a different system prompt) review the output of the previous step. Code generation plus code review, article writing plus fact-checking, planning plus feasibility assessment. This adds one more LLM call, but in my experience the quality gain is usually well worth the cost. The Producer-Critic model in the book is plug-and-play.

Second, start doing Context Engineering, not just Prompt Engineering. Look back at the instruction files you wrote for the Agent. If they are all rules about “how you should do it” and lack context about “what environment you are facing right now,” fill that in. Tell the Agent what project it is currently in, what decisions have been made previously, and what user preferences are. The Context Engineering chapter in the book and your AGENTS.md are two expressions of the same thing.

Third, don’t rush into Multi-Agent. Bring your single Agent up to Level 2: with tools, Reflection, and Memory. The book repeatedly emphasizes that a Level 2 single Agent combined with Producer-Critic and Context Engineering can cover the vast majority of practical scenarios. Level 3 is reserved for truly cross-domain, multi-stage tasks that require parallel division of labor. Most people’s problem is not that they don’t have enough Agents, but that they haven’t optimized a single Agent.

This book has 453 pages and will be published by Springer in 2025. The code examples cover LangChain/LangGraph, Google ADK, CrewAI, and OpenAI API. The foreword is written by the Google Cloud AI VP, and there is a recommendation from the Goldman Sachs CIO, which is unexpectedly well-written.

But my reason for recommending it is not that it is comprehensive. It is that after reading, you will realize one thing: the pitfalls you have encountered with Agents over the past six months have already been organized into patterns by others. You no longer need to reinvent Reflection, guess how to layer Memory, or experiment with which communication topology to use for Multi-Agent.

Someone has drawn the map for you; all that’s left is to walk it.

Are you using AI Agents for development? What level is your current Agent at?

Original source: https://www.chaincatcher.com/en/article/2267024