AI Systems Master Learning Guide — Agents · Memory · Caching · Production

The AI Harness

AI Systems — From Theory to Production

Technical depth for engineers. Strategic clarity for leaders. Two perspectives on building AI that actually works in production.

📚
Technical Learning Guide
Concepts · Patterns · Production code

Every concept taught with intuition → technical reality → production trap → recall hook. For engineers and architects building AI systems.

💡
Thought Leadership
Analysis · Strategy · Real decisions

Strategic takes on AI adoption, security risk, and enterprise deployment — written for decision-makers who need clarity without the jargon.

Quick navigation — all technical topics
🧠
What is an Agent?
Anatomy, the Think→Act→Observe loop, agent vs. autocomplete
Part 1
📊
Agent Taxonomy
5 levels from Encyclopedia to Self-Evolving system
Part 2
🏗️
Design Patterns
Sequential, Parallel, Loop, Swarm, ReAct, HITL — when to use each
Part 3
💾
Memory Architecture Deep Dive
3-tier memory, ADK state scopes, Mem0 proactive patterns
Part 4
⚙️
Google ADK
Event-driven runtime, agent types, tools, MCP, callbacks
Part 5
🖥️
KV Cache
GPU-level caching, vLLM, LMCache, NVIDIA unified memory
Part 6
💰
Prompt Caching — Claude & Gemini
API-level caching, Claude & Gemini providers, token efficiency
Part 7
🔍
Semantic Caching
Vector similarity, context window key, LiteLLM stack
Part 8
🔒
Security & Interop
Agent identity, least privilege, A2A, MCP
Part 9
🚀
Agent Ops & Architecture
Eval datasets, LLM judges, CI/CD, full production diagram
Part 10
PART 1

What is an AI Agent?

The shift from predictor → actor, and the anatomy of an agent
🟢 The Intuition

Autocomplete vs. Employee

Old AI = a very smart autocomplete. You ask, it predicts. No awareness, no plan, no actions. It's a parrot.

An Agent = a junior employee you give a goal to. They don't need you to spell out every step. They reason, use tools, check results, and adjust — until the job is done.

The shift is from "finish my sentence" to "go do the job."
🔵 The Technical Reality

Shortest Definition

Agent = LLM in a loop + Tools to accomplish an objective

An LLM alone is stateless and passive. An Agent adds:

  • A loop around the LLM (Think → Act → Observe → repeat)
  • Tools to interact with the real world
  • Memory so it remembers what it already did
  • Orchestration to manage when to think vs. act

The Anatomy

🧠
MODEL — The Brain
Reasons, decides, plans. Quality determines the agent's ceiling. Route expensive tasks to Pro, cheap tasks to Flash.
🛠️
TOOLS — The Hands
APIs, databases, code execution, HITL confirmation. Exposed via function calling (OpenAPI / MCP).
⚙️
ORCHESTRATION — The Nervous System
Runs the Think→Act→Observe loop. Manages memory, state, and planning strategy.
🚀
DEPLOYMENT — The Body
Monitoring, logging, rate limiting, auth. Agents talk to users via GUI or to other agents via A2A.

The Think → Act → Observe Loop

1. MISSION — User goal arrives
2. SCAN — Load context + memory
3. THINK — LLM reasons, makes a plan
4. ACT — Call a tool / API
5. OBSERVE — Get result, update state (loop back to THINK until done)
6. REPORT — Final answer to user

ReAct Pattern (Most Common)

ReAct: Reason + Act Interleaved
Thought: "I need to find the halfway point first"
Action:  maps_tool(origin="Mountain View", destination="SF")
Observation: "Halfway point is Millbrae"

Thought: "Now I search for coffee in Millbrae"
Action:  search_tool(query="good coffee in Millbrae")
Observation: [results...]

Final Answer: "Try Blue Bottle in Millbrae"
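To make the loop concrete, here is a framework-free sketch of the Think→Act→Observe loop. call_llm and the tools registry are hypothetical stand-ins for your own LLM wrapper and tool functions; real agents add memory, failure handling, and observability on top.

Python: the agent loop, stripped to its skeleton (illustrative sketch)
def run_agent(mission: str, call_llm, tools: dict, max_steps: int = 10) -> str:
    """call_llm: your LLM wrapper returning a dict; tools: name -> callable registry."""
    history = [f"MISSION: {mission}"]                  # 1. MISSION + 2. SCAN
    for _ in range(max_steps):                         # cap the loop (see the trap below)
        step = call_llm(history)                       # 3. THINK — decide the next move
        if step["type"] == "final_answer":
            return step["text"]                        # 6. REPORT
        result = tools[step["tool"]](**step["args"])   # 4. ACT — call a tool
        history.append(f"Observation: {result}")       # 5. OBSERVE, then loop
    return "Stopped: hit max_steps without finishing."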
🔴 The Production Trap

People think "agent" = "just add a prompt." An agent is a complete application — state management, tool execution, failure handling, observability. Treat it like building a microservice, not writing a prompt.

Also: loops can go infinite. Always set max_llm_calls. Each "think" step costs tokens — a poorly-scoped task can burn budget fast.

💜 Recall Hook

Agent = Employee, not Autocomplete. Brain · Hands · Nervous System · Body. Mission → Scan → Think → Act → Observe → Loop.

PART 2

Agent Taxonomy: Levels 0–4

Pick the right level of agent before you write a line of code
🟢 The Intuition

Think of hiring: you wouldn't hire a contractor without knowing the job scope. Same here — pick the right level of agent for the complexity of the task.

Level | Name | What it can do | Production use | Limit
0 | Encyclopedia | Answer from training data only | FAQ chatbot on known, static content | Knowledge cutoff. Can't act.
1 | Connected Expert | LLM + tools (search, DB queries) | Support agent checking order status | One question at a time
2 | Strategist | Multi-step planning, chains outputs | Research agent, code review pipeline | Still one agent
3 | Manager | Orchestrates specialist sub-agents | Product launch: marketing + web + data agents | Coordination complexity
4 | Self-Evolving | Creates new tools/agents on the fly | Internal automation (experimental) | Not for customer-facing prod yet

Level 2 Example — Context Engineering

Level 2: Output of Step 1 feeds into Step 2
# Step 1: Find halfway point
halfway = maps_tool("Mountain View", "SF")  # → "Millbrae"

# Step 2: Use output from step 1 as input to step 2
results = search_tool(f"good coffee in {halfway}")  # ← Context Engineering

Level 3 Example — Multi-Agent Coordinator

OrchestratorAgent
├── MarketingAgent → writes press release
├── WebDevAgent → builds landing page
└── DataAgent → pulls competitor analysis
Each agent: focused system prompt + relevant tools + smaller context.
Result: cheaper, faster, more reliable than one giant agent.
🔴 The Production Trap

Don't jump to Level 3/4 for everything. Level 1 solves most business problems. The cost and complexity of multi-agent coordination only pays off at genuine workflow complexity.

💜 Recall Hook

Encyclopedia → Expert → Strategist → Manager → Self-Growing Org. Pick your level before you write a line of code.

PART 3

Multi-Agent Design Patterns

Every engineering org has an org chart. Multi-agent systems are just org charts for AI.
➡️
Sequential
Assembly line. Step A must finish before Step B. Used for ETL, ordered pipelines.
ADK: SequentialAgent
⬆️⬆️
Parallel
Multiple teams simultaneously. Used for gathering data from N sources at once.
ADK: ParallelAgent
🔄
Loop
Quality cycle. Run → Check → Repeat until good enough. Used for code that must pass tests.
ADK: LoopAgent
✏️🔍
Review/Critique
Generator + Critic. Writer + Editor. Used for content with quality control.
Pair pattern
📋
Coordinator
Project manager dispatches work. Used for complex projects with many subtasks.
ADK: RouterAgent
🌐
Swarm
Peer-to-peer collaboration. Used for research synthesis from many specialists.
Custom / A2A
🤔
ReAct
Single agent, iterative think→act→observe. Default for most agent workflows.
ADK: LlmAgent
👤
HITL
Human-in-the-loop approval gates. Required for financial, medical, legal decisions.
Compliance

Decision Table — Which Pattern for What?

Use Case | Pattern | Why
ETL: Extract → Clean → Load | Sequential | Order matters
Gather data from 5 sources simultaneously | Parallel | Speed, independent tasks
Code generation that must pass tests | Loop | Quality gate
Content with quality control | Review | Generator + critic
Complex project with many subtasks | Coordinator | Delegation at scale
Standard chat agent | ReAct | Default pattern
Financial / medical decisions | HITL | Legal/compliance requirement

ADK Code

📌 What is ADK?
ADK (Agent Development Kit) is Google’s open-source Python framework for building production AI agents — pip install google-adk. It ships with three ready-made workflow agent classes that map exactly onto the patterns above:

SequentialAgent — runs sub-agents one after another (assembly line).
ParallelAgent — runs sub-agents concurrently, merges results (parallel teams).
LoopAgent — repeats a single agent until a stop condition is met (quality loop).
RouterAgent (also called LlmAgent acting as coordinator) — uses an LLM to decide which sub-agent handles each request (coordinator/dispatcher).

The code below shows the minimal constructor call for each. In production you pass real agent instances and configure callbacks, tools, and memory on each sub-agent.
ADK: All Workflow Agents
# Assembly line
SequentialAgent(sub_agents=[ResearchAgent, DraftAgent, ReviewAgent])

# Parallel — runs simultaneously
ParallelAgent(sub_agents=[Source1Agent, Source2Agent, Source3Agent])

# Loop — iterate until condition
LoopAgent(sub_agents=[RefineAgent], max_iterations=5)

# Dynamic routing — LLM decides next agent
RouterAgent(sub_agents=[BillingAgent, SupportAgent, SalesAgent])
💜 Recall Hook

Assembly line → Parallel teams → Quality loop → Manager → Peer network. Match the pattern to the workflow shape.

PART 4

Memory Architecture Deep Dive

Where most agent systems fail in production. Memory is not a feature — it's the foundation.

The Three Memory Types

🖥️
Workspace / Session Memory
Current conversation context, scratchpad. Lives only this conversation. Like your desktop right now.
session.state
👤
User Memory
Preferences, personality, past interactions. Persists across sessions, per-user. Like your manager's notes about you.
user: prefix
🌐
Global Memory
Org-wide knowledge, learned facts, policies. Persistent. Retrieved via RAG/semantic search. Like the company wiki.
app: prefix / RAG

ADK State Scopes

📌 What is ADK, and why does it appear here?
ADK (Agent Development Kit) is Google’s open-source Python framework for building production AI agents — pip install google-adk. It has a built-in concept called session.state that maps directly onto the three memory types shown above (session / user / global), using simple key prefixes to control scope and persistence.

ADK is used here as a concrete example of how the memory scope idea is actually implemented in a real framework — not because it is the only option. LangGraph, CrewAI, AutoGen, and others handle state differently, but the underlying question (“how long does this piece of data live, and who can see it?”) is universal across all of them.
ADK: session.state prefix rules
# NO PREFIX — lives only in this session
session.state['current_intent'] = 'book_flight'

# user: prefix — persists per user across all sessions
session.state['user:preferred_language'] = 'fr'

# app: prefix — shared across ALL users of this app
session.state['app:global_discount_code'] = 'SAVE10'

# temp: prefix — lives only within THIS invocation (never persists)
session.state['temp:raw_api_response'] = {...}

Mem0 — Proactive Memory (2026)

📌 Why Mem0 appears in this section
The memory architecture concepts covered in this section — memory types, scopes, storage approaches, proactive patterns — are universal ideas that apply to any AI memory system. Mem0 is used as a concrete, production-grade example of how those ideas are actually implemented — not because it is the only option, but because it has published benchmarks, open-source code, and clear design decisions that make the tradeoffs visible. Where code says “memory.search()” or “memory.add()”, that is Mem0’s API — the same pattern applies to any equivalent memory system.
The Core Concept — Two Types of Memory
🚫 RETROSPECTIVE (most agents today)
The user is always the trigger.

Think of it like a filing cabinet.
The cabinet has all your notes — but it just sits there. It only opens when you walk up and ask for something.

Flow:
User asks a question
↓ Agent searches memory
↓ Agent responds
✅ PROSPECTIVE (what Mem0 adds)
The situation is the trigger.

Think of it like a smart colleague.
The moment you open auth.py, they say "Hey — remember last week you were blocked on the OAuth token refresh in that file?" — before you even ask.

Flow:
Something in context changes
↓ Agent proactively surfaces memory
↓ User already has context
💡 The real-world difference:
You come back to work on Monday after a week away.

🚫 Retrospective agent: You say "Hey, remind me where I was on the auth bug" → agent searches and tells you.
✅ Prospective agent: You open VS Code, and the agent immediately says "You were blocked on the OAuth token refresh in auth.py — you identified the await chain as the cause but hadn't fixed it yet."

Same information. Completely different experience. The prospective agent connects memory to the right moment.
📌 First: Understand the Two Costs Before Reading the Patterns
✅ CHEAP — Vector DB Lookup
What happens:
1. Convert query text into a number vector (embedding)
2. Search the vector DB for the nearest stored memory vectors
3. Return top-k results by similarity score

What does NOT happen: No LLM call. No reasoning. Just math (distance between number arrays).

Cost: ~1–50ms. Essentially free.
💰 EXPENSIVE — LLM Reasoning Call
What happens:
1. Gather memories (could be many)
2. Send them to an LLM with a prompt like: "What will this user need tomorrow? What’s unresolved? What’s important?"
3. LLM thinks, reasons, generates a response

What DOES happen: Tokens consumed, inference latency, API cost.

Cost: hundreds of ms to seconds + token charges.
📚 Analogy — library vs librarian:
Vector lookup is like using a library’s search index. You type a keyword, it instantly shows you the shelf location. The index is pre-built. No human involved. Fast.
LLM reasoning is like asking the librarian to read a pile of books and summarise what you’ll probably need next week. The librarian is smart, but it takes time and costs something.
The Three Patterns — How to Actually Build This
Each pattern answers a different question: When should the agent surface memories?
📅
Pattern 1
At session start
💡
Pattern 2
When context shifts
🫠
Pattern 3
After session ends
📅 PATTERN 1: SESSION-START SCAN — "Preload before the user speaks"
What it solves: The "cold start" problem — user opens the agent and has to re-explain everything from scratch.
Analogy: A doctor who reads your file before entering the room, not while you're sitting there waiting.
1
User opens the app — UI starts loading (spinner visible)
↓ simultaneously, in the background
2
Context probe — agent reads signals: What time is it? What file is open? What project are we in?
3
Targeted search — NOT "find all memories" but a precise query: "What was this user recently blocked by in project X related to auth.py on a Tuesday morning?"
4
Inject into system prompt — relevant memories become part of the agent's context before the user types anything
5
User types first message — agent already knows their context. Zero re-explaining needed. Zero latency added.
✅ The key: run the search async while the UI loads. The user waits for the spinner anyway — use that time to preload memory.
Python: session_start_scan
from datetime import datetime

def session_start_scan(memory, user_id, context, limit=3):
    hour = datetime.now().hour
    time_of_day = "morning" if hour < 12 else "afternoon" if hour < 17 else "evening"

    # NOT a generic "find all memories" query
    # A PRECISE probe shaped around what we know about right now
    query = "What was this user recently working on or blocked by"
    if context.get("project"):
        query += f" in project {context['project']}"
    if context.get("open_file"):
        query += f" related to {context['open_file']}"
    query += f"? It is {time_of_day}."

    return memory.search(query, filters={"user_id": user_id}, top_k=limit)
💡 PATTERN 2: CONTEXT-TRIGGER SCAN — "Surface memory mid-conversation when the topic shifts"
What it solves: The user switches topics mid-conversation — e.g., starts asking about a specific file, or mentions an error. Pattern 1 already ran at session start but didn't cover this new context.
Analogy: A colleague who pays attention to what you're working on right now and says "oh, that reminds me — we had that same issue last sprint."
Why you can't just search memory on every single message:
Every retrieval adds latency, and every retrieved memory you inject into the prompt costs tokens. If your agent fires a vector search on every single user message, you're spending money and time constantly — even on messages like "ok thanks" or "got it" where past memory is completely irrelevant.
The solution: put a cheap gate in front
1
User sends a message — before calling memory, run a fast intent classifier
↓ DEMAND detected
↓ NO_DEMAND
Search memory with a targeted query → inject result mid-conversation
Skip search entirely → continue normally, no cost
What triggers DEMAND (memory retrieval needed)?
📄 User mentions a file (auth.py, database.sql)
User mentions an error ("exception", "failing", "bug")
🔄 Topic clearly shifts to a new domain or task
🔗 User says "where we left off" or "last time"
⚠ The key insight from the PASK paper: knowing when NOT to fire is as important as knowing when to fire. A keyword check alone isn't enough — use an LLM classifier that reads the message in context of recent history.
Python: Intent-gated retrieval (PASK pattern)
import re

# Regex fallback for obvious signals (fast, no LLM cost)
_DEMAND_PATTERNS = [
    r"\b[\w\-]+\.(py|ts|js|sql|yaml)\b",   # file reference
    r"\b(error|bug|blocker|failing)\b",    # error signals
    r"\bwhere we left off\b",              # continuation signal
]

def detect_intent(chat, message, history) -> dict:
    # Cheap path first: obvious signals need no LLM call
    if any(re.search(p, message, re.IGNORECASE) for p in _DEMAND_PATTERNS):
        return {"decision": "DEMAND", "query": message}
    # Otherwise: recent history + message -> LLM classifies DEMAND / NO_DEMAND
    # Returns: {"decision": "DEMAND", "query": "database.py pool config"}
    # or:      {"decision": "NO_DEMAND", "query": None}
    ...
🫠 PATTERN 3: SCHEDULED REFLECTION SCAN — "Think while the user is away, so you're ready when they return"
What it solves: The problem with Pattern 1 is that you only have raw memories to search — the agent still needs to reason "which of these matters tomorrow morning?" Pattern 3 does that reasoning offline after the session, so the next session start is instant and pre-reasoned.

Analogy: After a meeting ends, a great PA goes through the notes and prepares a briefing for tomorrow morning. You don't prepare the briefing during the next meeting — you prepare it the night before.
Phase 1 — After Session Ends (offline, user is away)
Session ends
💰 EXPENSIVE LLM call
• "What is still unresolved?"
• "What decisions matter next?"
• "What would the user need cold?"
↓ Answers stored tagged [PROACTIVE]

⚠ But user is away — they don’t feel this cost at all.
Phase 2 — Next Session Start (user opens app)
New session opens
✅ CHEAP vector lookup
Search for [PROACTIVE] tag
→ Returns pre-computed results
↓ Inject instantly into context
↓ User speaks — agent already ready

No LLM reasoning. Just math (vector distance).
💡 What makes the vector lookup “cheap”:
Because no LLM is involved. All the reasoning (“what will the user need?”) already happened once, offline, in Phase 1. What gets stored in the vector DB are the answers to that reasoning — plain text strings tagged [PROACTIVE].

Phase 2 just embeds the search query ("[PROACTIVE]"), finds the nearest stored answers by vector distance, and returns them. That’s pure math on pre-stored vectors — milliseconds, essentially zero token cost.

Without Pattern 3: every session start must run an LLM to reason over raw memories — expensive, and the user waits.
With Pattern 3: one offline LLM call → pre-stored answers → every session start after that is just a fast vector lookup.
⚠ The reflection LLM runs outside the live conversation — async, after the session ends. Never add it to the live loop. It's a background job.
Python: Scheduled Reflection + Pre-computed Retrieval
def run_reflection(memory, user_id):
    # Runs AFTER session ends, async background job
    envelope = memory.get_all(user_id=user_id)
    memories = envelope.get("results", [])

    # Reflection LLM asks: "What should I surface next session?" — the expensive,
    # offline step. reflect_with_llm() is a placeholder for your own LLM wrapper.
    items = reflect_with_llm(memories)

    # Store the pre-computed answers tagged [PROACTIVE]
    memory.add(
        [{"role": "system", "content": f"[PROACTIVE] {item}"} for item in items],
        user_id=user_id,
        metadata={"type": "proactive_hint"}
    )

def on_next_session_open(memory, user_id):
    # At next session start -- zero LLM reasoning needed
    # Just a cheap lookup of pre-computed results
    return memory.search("[PROACTIVE]", user_id=user_id, limit=3)
All three patterns use vector DB search — here’s where each one fits:

Pattern 1: vector DB search (cheap) — shaped by context signals at session open
Pattern 2: intent classifier (tiny LLM gate) → if DEMAND, vector DB search (cheap)
Pattern 3: LLM reasoning (expensive, but offline after session) → stores results → next session = just vector DB search (cheap)
Pattern 3’s trick: It moves the expensive LLM reasoning to happen after the session ends — when the user is not waiting. Next time they open the app, the thinking is already done. All that’s left is a fast, cheap vector lookup of the pre-computed answer.

The Full Memory Taxonomy — A Holistic Map

📌 Two different ways to classify memory — not alternatives, both apply at once

The retrospective / prospective distinction covered in the Mem0 patterns above answers: WHEN is memory triggered? (by the user asking, or by context changing)
The episodic / semantic / procedural types below answer: WHAT kind of content gets stored?

These are two orthogonal lenses on the same system. A memory can be prospectively surfaced (WHEN) and semantic in type (WHAT) at the same time. Think of it like a filing system that has both a retrieval mechanism and a file category — you need both to describe any given memory fully.

All three cognitive types below come from the Mem0 blog (State of AI Agent Memory 2026). Mem0 handles episodic and semantic through the same extraction pipeline but handles procedural memory differently — it routes through a separate extraction prompt that focuses on distilling workflows and processes rather than facts.
The Three Cognitive Types of Memory
📞 Episodic
What happened
Specific events and facts from past interactions.

Example: "User mentioned they moved from NY to SF in March."

Mem0 scope: run_id / session_id
Extraction: standard facts pipeline
📚 Semantic
What is known
Distilled facts, preferences, and user profile.

Example: "User loves Thai food. Has a go-to Friday night spot."

Mem0 scope: user_id
Extraction: standard facts pipeline
⚙ Procedural
How to do things
Learned workflows, team processes, tool-use patterns.

Example: "Team always squash-merges PRs with Jira ticket ID prefix."

Mem0 API: memory_type="procedural_memory"
⚠ Different extraction prompt — distils procedures, not facts
Mem0 Memory Scopes — Who Owns the Memory?
Scope | Parameter | Persists | Use it for
User | user_id | Forever, all sessions | Preferences, profile, long-term facts
Agent | agent_id | Tied to that agent | Behaviors learned per specific agent instance
Session | run_id / session_id | This conversation only | Current task context, scratchpad
App / Org | app_id / org_id | Shared across all users | Company policies, org-wide knowledge, shared context
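A small sketch of how these scopes look in Mem0's open-source Python client — the user_id / agent_id / run_id parameters select the scope; the stored values here are illustrative.

Python: Mem0 scopes in practice (illustrative values)
from mem0 import Memory

memory = Memory()

# User scope — persists forever, across all sessions
memory.add([{"role": "user", "content": "I prefer replies in French."}], user_id="alice")

# Session scope — tied to one conversation
memory.add([{"role": "user", "content": "Today's task: fix the OAuth bug."}],
           user_id="alice", run_id="session-42")

# Agent scope — behaviors learned by one specific agent instance
memory.add([{"role": "assistant", "content": "Escalations go to the billing queue."}],
           agent_id="support-bot")

# Retrieval is scoped the same way
results = memory.search("language preference", user_id="alice", limit=3)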
Vector Memory vs Graph Memory — What’s the Difference?
📊 Vector Memory (Mem0 default)
Retrieves semantically similar facts.
Ask: "What do I know about this user and Python?"
Answer: "User knows Python."

Good for: preferences, facts, personal history.
Latency p95: 1.44s
🕭 Graph Memory (Mem0g)
What is Mem0g? It’s the graph-powered variant of Mem0 (“g” = graph). Instead of storing facts as independent vectors, it stores them as a knowledge graph — nodes are entities (people, tools, companies), edges are relationships (“uses”, “works at”, “migrating from”). This lets it answer queries that require traversing connections, not just finding similar text.
Retrieves facts connected through relationships.
Ask: "What do I know about this user and Python?"
Answer: "User works with Python for data pipelines, using pandas, at a company using dbt, migrating from Spark."
↑ Vector found “User knows Python”. Graph traversed the graph to pull in the full connected context: tool → purpose → stack → company → migration.
Good for: complex entity networks, medical, enterprise hierarchies.
Latency p95: 2.59s — richer but costs more
Vector vs. Graph Memory — retrieval strategy comparison (accuracy & latency benchmarks)
Vector Memory — flat retrieval: nearest-neighbor search in embedding space. Accuracy 66.9%, latency 0.2s (13× faster retrieval).
Graph Memory — multi-hop traversal: relationship-aware exploration across connected entities (works at, uses tool, likes topic…). Accuracy 68.5%, latency 2.6s (+1.6% accuracy gain).
Source: Mem0 Benchmark Suite — mem0.ai/blog/state-of-ai-agent-memory-2026
When to choose graph: When your queries are about relationships between entities (medical patient contexts, account hierarchies, system dependencies). For simple personalization — stick with vector.

How Mem0 Actually Works — The 2026 Algorithm

The problem with most memory systems: they only optimize retrieval.
The standard approach (store → embed → retrieve) misses the hard parts: knowing what to store, handling facts that change over time, and reasoning across multiple scattered memories.
Two real examples that shaped the architecture:
🍝 The Thai Restaurant Problem
User orders from the same Thai restaurant every Friday for two months. A naive system stores 8 records of "Ordered pad thai on Friday." When you ask where to book dinner, it has nothing useful to offer.

What good memory should do: distil the pattern into a semantic fact — "User loves Thai food, has a Friday night go-to spot" — weeks before you need it.
📍 The New York → San Francisco Problem
Profile says "user lives in New York." Six months later: new data shows San Francisco. Most systems overwrite — throwing away the fact that they moved.

What good memory should do: retain BOTH facts and the transition. "Your old neighborhood" → New York. "Your current location" → San Francisco.
The 4 Algorithm Changes (April 2026):
1
Single-pass ADD-only extraction — Old: two LLM passes (extract + reconcile). New: one pass, only adds. When facts change, both the old and new fact survive — so the system can reason about how things evolved, not just where they landed. Cuts extraction latency ~50%.
2
Agent-generated facts are now first-class — Old: only user statements were stored. New: when the agent says "I’ve booked your flight for March 3", that fact is stored with equal weight. This closed a major blind spot — huge gain on single-session assistant recall (+53.6 points).
3
Entity linking — Every memory is analyzed for entities (proper nouns, compound noun phrases). These get embedded in a separate lookup layer, linking all memories about the same person, place, or concept. At query time, entity matches boost ranking. Drove the +23.1 gain on multi-hop reasoning.
4
Multi-signal retrieval — Three scoring passes in parallel: semantic similarity + BM25 keyword + entity matching. Results are fused. "What meetings did I attend?" hits keyword. "What does Alice think about remote work?" hits entity. "How has attitude shifted?" requires semantic. No single signal wins all queries (a generic sketch of the fusion step follows below).
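A generic sketch of that fusion step — a simple weighted sum over the three signal scores. This is not Mem0's actual implementation; the weights and score shapes are illustrative assumptions.

Python: multi-signal score fusion (generic sketch, not Mem0's code)
def fuse_scores(semantic: dict, bm25: dict, entity: dict,
                weights=(0.5, 0.3, 0.2)) -> list:
    """Each input maps memory_id -> score from one retrieval signal."""
    fused = {}
    for scores, w in zip((semantic, bm25, entity), weights):
        for memory_id, score in scores.items():
            fused[memory_id] = fused.get(memory_id, 0.0) + w * score
    # Highest combined score first — no single signal decides the ranking alone
    return sorted(fused, key=fused.get, reverse=True)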
Results: Old vs New Algorithm
Benchmark | Old Score | New Score | Tokens/Query | Note
LoCoMo | 71.4 | 91.6 | 6,956 | +29.6 on temporal, +23.1 on multi-hop
LongMemEval | 67.8 | 93.4 | 6,787 | +53.6 on assistant recall (the blind spot, now fixed)
BEAM (1M tokens) | — | 64.1 | 6,719 | New benchmark, production-scale context volumes
BEAM (10M tokens) | — | 48.6 | 6,914 | Hardest benchmark; temporal/event-ordering remain open problems
Context: Full-context baselines on LoCoMo score 72.9% using 25,000+ tokens per query at 9.87s median / 17.12s p95 latency. Mem0’s new algorithm beats full-context accuracy and uses ~7,000 tokens — roughly 3.6× more efficient.

How We Measure Memory — The Benchmark Landscape

💡 The most important distinction in 2026:
Long-context benchmark: Give the model a single huge input (1M tokens). Ask it to find something in that fixed input. One pass. Nothing is written, nothing persists.
Memory benchmark: Give the system many conversations over time. Require it to write to a store. Later, ask it to retrieve from that store. State must persist across sessions.

A model that aces a 1M-token retrieval test is NOT proven to handle cross-session memory. They test different things. Conflating them makes a system look better than it is.
The Three Memory Benchmarks That Matter
📋 LoCoMo (2024) — 1,540 questions
Very long multi-session dialogue: avg 300 turns, 9,000 tokens, up to 35 sessions per conversation. Tests single-hop, multi-hop, open-domain, temporal, and adversarial recall. Most widely reported memory benchmark. Limitation: modest context by 2026 standards, no explicit knowledge-update scoring.
🔒 LongMemEval (2024) — 500 questions
Five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention (questions about events that never happened — system must say "I don’t know" rather than hallucinate). LongMemEval_M extends to ~500 sessions per chat history — the regime where context-stuffing breaks completely.
⚡ BEAM (ICLR 2026) — 2,000 questions across 1M and 10M token scales
The hardest benchmark. 10 categories: preference following, instruction following, information extraction, knowledge update, multi-session reasoning, summarization, temporal reasoning, event ordering, abstention, contradiction resolution. Cannot be solved by expanding the context window. This is the one that proves whether memory actually works at production scale. BEAM-1M and BEAM-10M show the gap between "works on paper" and "works in production."
What a complete memory benchmark must test (5 things):
🔄 Cross-session continuity
Surface facts from earlier sessions in later ones. Without this, it’s a chat buffer, not memory.
✏ Selective writes
Decide what to remember. Write everything → store fills up. Write nothing → no memory.
📈 Retrieval under real token budgets
Benchmarks that allow 50K tokens of injected memory per prompt don’t test production conditions.
🔀 Forgetting and update
Overwrite stale memories. Pure retrieval benchmarks miss this entirely.
👤 Per-user isolation
Keep user A’s memory out of user B’s responses. Single-user benchmarks can’t catch this failure.
Open Problems (May 2026 — not yet solved):
Memory staleness at scale — A memory about a user’s employer is highly-retrieved until the user changes jobs, at which point it’s confidently wrong. Detecting when high-relevance memories go stale (not just low-relevance ones) is unsolved.
Cross-session identity resolution — The model assumes a stable user_id. Real users interact across devices, anonymous and authenticated sessions. Figuring out if two interactions came from the same person is an open identity problem.
Privacy and consent architecture — Mem0’s docs flag this directly. What governance looks like — how users inspect, edit, delete stored memories, how teams audit storage — is currently an application-layer concern with no standard.
Multi-session reasoning at 10M token scale — BEAM-10M score is 48.6 (vs 64.1 at 1M). Temporal reasoning drops to 16.3 at 10M. These require higher-order representations of how events relate across time — fact-level and entity-level matching are not enough.
💜 Recall Hook — Full Memory Picture

Three cognitive types: Episodic (what happened) → Semantic (what is known) → Procedural (how to do things).
Four scopes: user_id (forever) · agent_id (this agent) · run_id (this session) · app_id (everyone).
Two storage types: Vector (similar facts, fast) · Graph (related entities, richer — use for complex relationships).

Algorithm (2026): ADD-only (preserve history) · Agent facts = first-class · Entity linking · Multi-signal retrieval (semantic + keyword + entity).

Benchmarks: Memory ≠ long-context. LoCoMo (multi-session) · LongMemEval (updates + abstention) · BEAM (production scale, 1M–10M tokens, cannot be solved by bigger context window).

Proactive patterns: Preload at session start (async) · Gate mid-conversation (DEMAND/NO_DEMAND) · Reflect offline after session → cheap lookup next time.

ADK scopes: No prefix=this chat · user:=this person · app:=everyone · temp:=this one call.

PART 5

Google ADK Deep Dive

pip install google-adk — the production-grade event-driven agent framework

The Two ADK Principles

📝
1. Event-Driven Architecture
Every action is an immutable Event. System processes events one by one. Makes the system auditable, debuggable, and resumable after crash.
🔀
2. Compute / Memory Separation
Agent logic is stateless and can crash. Session Service is persistent and survives crash. If server dies mid-workflow, ADK resumes from last saved state.

The Runtime Architecture

When your application calls runner.run_async(), here’s what happens inside the ADK Runtime. You never call the agent directly — you always go through the Runner, which orchestrates the entire interaction:

Agent Development Kit Runtime — the flow in words:

① Request — the 👤 User sends "Help me do X" + session_id to the Runner.
② Event Loop — the Runner (orchestrator: event processing, session management, context creation, event handling, agent invocation) drives the ⚡ Execution Logic (agent logic, LLM invocations, callbacks, tools) in an Ask ↓ / Yield ↑ cycle.
③ Stream — Events flow back to the user as they are produced.

Behind the Runner sit the Services — 📋 SessionService, 📦 ArtifactService, 🧠 MemoryService — backed by Storage (🗄️ Database or ☁️ Vertex AI).

User → Runner orchestrates → Execution Logic does the work → Response streams back. Runner ↔ Services ↔ Storage (DB / Cloud).

Runner Responsibilities

  • Session Management — creates or loads conversation history from the configured SessionService
  • Agent Invocation — determines which agent handles the message and calls its execution method
  • Event Handling — processes every Event the agent yields (persist state, handle auth, stream responses)
  • Context Creation — assembles an InvocationContext bundling session + services for the agent to use
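To make the flow concrete, here is a minimal sketch of wiring agent + SessionService + Runner together and streaming events back. It follows ADK's documented API, but treat the exact signatures (e.g. whether create_session is async) as version-dependent.

Python: calling the Runner (minimal sketch — verify signatures against your ADK version)
import asyncio
from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

agent = LlmAgent(name="helper", model="gemini-2.0-flash",
                 instruction="You are a helpful assistant.")
session_service = InMemorySessionService()
runner = Runner(agent=agent, app_name="demo_app", session_service=session_service)

async def main():
    session = await session_service.create_session(app_name="demo_app", user_id="u1")
    message = types.Content(role="user", parts=[types.Part(text="Help me do X")])
    # You never call the agent directly — the Runner orchestrates the whole turn
    async for event in runner.run_async(user_id="u1", session_id=session.id,
                                        new_message=message):
        if event.is_final_response():
            print(event.content.parts[0].text)

asyncio.run(main())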

The Event Loop — How Crash Safety Works

1
Execute
Agent runs its logic — reasons, plans, decides what action to take next.
2
Yield an Event
Agent emits an immutable Event (send message, call tool, update state). Execution pauses immediately.
3
Pause
Agent is frozen. It cannot continue until the Runner finishes processing the Event.
4
Process & Persist
Runner saves state to the database via SessionService. Server crash here = safe. State is already written before the agent resumes.
5
Resume
Runner wakes the Agent. Agent resumes from where it paused, now seeing the updated state.
Crash recovery: Step 4 persists to the DB before step 5. A server failure between 4→5 is harmless — ADK replays from the last saved event using the invocation_id.
⚙️ RunConfig

Runtime Controls

  • streaming_mode — SSE (Server-Sent Events) or NONE
  • max_llm_calls — default 500, prevents infinite loops
  • save_input_blobs_as_artifacts — auto-archive uploaded files for audit
🔄 Resumability

Crash Recovery Config

Set ResumabilityConfig(is_resumable=True) on the App object. On restart, pass the original invocation_id — ADK skips already-completed steps and resumes exactly where it failed.
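A sketch of how these two configs could be wired together. App, ResumabilityConfig, RunConfig, and StreamingMode are the names used above; the import paths and parameter names here are assumptions to check against your ADK version.

Python: RunConfig + ResumabilityConfig (hedged sketch — verify against your ADK version)
# Assumed import paths — confirm in your ADK version's docs
from google.adk.apps import App, ResumabilityConfig
from google.adk.agents.run_config import RunConfig, StreamingMode

app = App(
    name="support_app",
    root_agent=root_agent,                                      # your top-level agent
    resumability_config=ResumabilityConfig(is_resumable=True),  # crash recovery on
)

run_config = RunConfig(
    streaming_mode=StreamingMode.SSE,   # stream events as Server-Sent Events
    max_llm_calls=500,                  # hard stop against infinite loops
)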

Agent Types

LlmAgent — The Standard Agent
from google.adk.agents import LlmAgent

agent = LlmAgent(
    model="gemini-2.0-flash",
    instruction="You are a helpful customer support agent...",
    tools=[lookup_order, escalate_ticket],
    output_key="agent_response"  # auto-saves final answer to session.state
)
  • Non-deterministic — LLM decides what to do
  • Best for: language tasks, dynamic decisions, tool orchestration
WorkflowAgents — Deterministic
# Sequential — order matters
pipeline = SequentialAgent(sub_agents=[ResearchAgent, DraftAgent, ReviewAgent])

# Parallel — gather simultaneously
fetcher = ParallelAgent(sub_agents=[Source1, Source2, Source3])

# Loop — quality gate
refiner = LoopAgent(sub_agents=[CodeRefineAgent], max_iterations=5)
CustomAgent — Full Python Control
class MyOrchestrator(BaseAgent):
    async def _run_async_impl(self, ctx: InvocationContext):
        # Arbitrary Python logic — conditionals, loops, anything
        if ctx.session.state.get('user_tier') == 'enterprise':
            async for event in self.enterprise_agent.run_async(ctx):
                yield event
        else:
            async for event in self.standard_agent.run_async(ctx):
                yield event
  • before_agent_callback → Check user permissions
  • before_model_callback → Scrub PII from input
  • after_model_callback → Filter toxic content
  • before_tool_callback → Is user authorized?
  • after_tool_callback → Log for compliance
Callback Control Pattern
def guard_pii(callback_context: CallbackContext, request: LlmRequest):
    if contains_pii(request.user_input):
        # Return a response → BLOCKS the LLM call
        return LlmResponse(text="I can't process personal information.")
    return None  # Return None → ALLOWS execution to continue
Return None = proceed. Return anything = intercept and replace.

Session, State & Storage

ADK enforces strict separation: Agent logic (Compute) is stateless and can crash freely. Session State (Memory) is persisted externally and never lost.

The Session Object — Key Properties

Property | Type | What it holds
id | string | Unique conversation thread ID (e.g. "test_id_modification"). A single SessionService handles many sessions.
app_name | string | Which agent application this conversation belongs to (e.g. "id_modifier_workflow")
events | list | Chronological history of all interactions — user messages, agent replies, tool calls
state | dict | Mutable key-value scratchpad — survives across turns within this session
lastUpdateTime | timestamp | When the last event was added to this session

State Scope Prefixes — How Long Does Data Live?

Prefix | Scope | Persistence | Example
(none) | This session only | Persisted if using DB/Vertex; lost with InMemory on restart | state['current_intent'] = 'book_flight'
app: | All users & sessions for this app | Always persistent (DB/Vertex) | state['app:global_discount'] = 'SAVE10'
user: | This user across all their sessions (same app) | Always persistent (DB/Vertex) | state['user:preferred_language'] = 'fr'
temp: | This invocation only — not even this session | Never persisted to any storage | state['temp:raw_api_response'] = {...}
💡 When a SequentialAgent or ParallelAgent calls sub-agents, they all share the same invocation_id — and therefore the same temp: state. Parent and sub-agents can pass data through temp: keys within a single invocation.

SessionService — Where State is Stored

⚠️ Development only. All data lives in application memory — lost entirely on restart. Zero configuration required.
InMemorySessionService
from google.adk.sessions import InMemorySessionService

session_service = InMemorySessionService()  # nothing else needed
Managed production on Google Cloud. Uses Vertex AI Agent Engine — fully managed and scalable. Requires GCP project + storage bucket + Reasoning Engine resource.
VertexAiSessionService
# pip install vertexai  — requires GCP project, storage bucket, Reasoning Engine
session_service = VertexAiSessionService(
    project=PROJECT_ID,
    location=LOCATION
)
Self-hosted production. Connects to PostgreSQL, MySQL, or SQLite. You control the infrastructure.
DatabaseSessionService
session_service = DatabaseSessionService(
    db_url="postgresql://user:pass@host/db"
)

Session Lifecycle — One Conversation Turn

1
Start or Resume
App calls session_service.create_session() for new chats, or uses an existing session_id to resume.
2
Context Provided
Runner fetches the Session object and assembles an InvocationContext — the agent’s full view of state + history + services.
3
Agent Processing
Agent reads state, analyzes event history, generates a response, optionally writes state updates.
4
Save Interaction
Runner calls session_service.append_event(session, event). Event added to history; state updated in storage; lastUpdateTime refreshed.
5
Ready for Next Turn
Response sent to user. Updated session stored and waiting for the next message (loops back to step 1).

ADK Context Types

ADK provides the right context object automatically depending on where your code runs — you don’t create them; the framework injects them.

Context | Read State | Write State | Artifacts | Auth / Memory | Where used
ReadonlyContext | ✓ | — | — | — | Dynamic instruction providers
CallbackContext | ✓ | ✓ | ✓ | — | Guardrail callbacks (before/after model)
ToolContext | ✓ | ✓ | ✓ | ✓ | Tool functions & tool callbacks
InvocationContext | ✓ | ✓ | ✓ | ✓ | Core agent logic (_run_async_impl)
💡 InvocationContext (ctx) is the most comprehensive — direct access to ctx.session, ctx.agent, ctx.invocation_id, all backend services, and the control flag ctx.end_invocation = True to stop the entire workflow immediately.

Tools & MCP

Tools are how agents interact with the outside world. The LLM reads each tool’s docstring to decide when to call it — the docstring is the contract, not optional.

📌 Critical: Write the docstring before the implementation. The LLM’s tool-calling decision depends entirely on reading it — including when not to call this tool.
Function Tool — Best-Practice Pattern
def lookup_order(order_id: str) -> dict:
    """Retrieve the current status and details of a customer order.

    Use this when the user asks about an order, shipment, or delivery status.
    Do NOT use this for refunds — use process_refund() instead.

    Args:
        order_id: The order number from the user (e.g. "ORD-12345")
    Returns:
        dict with keys: status, estimated_delivery, items
    """
    return {"status": "shipped", "items": [...]}

For enterprise systems (Salesforce, databases, internal APIs), connect to an external MCP Server instead of writing tool code inline. McpToolset auto-discovers available tools via list_tools() and converts them to ADK tools dynamically.

McpToolset — Enterprise Data Sources
from google.adk.tools.mcp_tool.mcp_toolset import McpToolset, StdioServerParameters

# STDIO transport = local MCP server process
toolset = McpToolset(
    connection_params=StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem"],
    )
)

# McpToolset queries list_tools() and converts them to ADK tools automatically
agent = LlmAgent(tools=[toolset], ...)

When ADK calls your tool function, it injects a ToolContext object if your function signature includes it. This gives the tool powerful access beyond just returning a value:

Power | Code | Use Case
Read/Write State | tool_context.state['verified'] = True | Flag something for downstream agents to see
Request Auth | tool_context.request_credential(auth_config) | Trigger an OAuth flow if user isn't logged in
Transfer to Agent | tool_context.actions.transfer_to_agent = "SupportAgent" | Route conversation to a specialist agent
Search Memory | tool_context.search_memory(query) | Look up long-term knowledge from memory service
List Artifacts | tool_context.list_artifacts() | Discover files available in the session
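A small sketch of a tool that uses these powers — declare a tool_context parameter and ADK injects it when the tool runs. The verification logic here is a hypothetical placeholder.

Python: a tool using ToolContext (illustrative)
from google.adk.tools import ToolContext

def verify_customer(customer_id: str, tool_context: ToolContext) -> dict:
    """Verify a customer's identity before sensitive operations.

    Use this before refunds or account changes. Do NOT use it for order lookups.
    """
    verified = customer_id.startswith("CUST-")       # placeholder check, not real logic
    tool_context.state["verified"] = verified        # visible to downstream agents
    if not verified:
        # Hand the conversation to a specialist agent
        tool_context.actions.transfer_to_agent = "SupportAgent"
    return {"verified": verified}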

Scale & Deployment

🏗️ App Container

The App Class

Top-level wrapper for your entire agent workflow. Wraps the root agent and provides:

  • Centralized lifecycle management (startup/shutdown hooks)
  • Shared resource configuration — API keys, DB connection pools
  • context_cache_config — cache large system prompts to save inference costs
  • ResumabilityConfig — enable crash recovery per-app
🌐 Agent2Agent (A2A)

Cross-Network Agents

to_a2a(agent) — turns any local agent into a microservice. Generates an Agent Card (metadata description) and an API endpoint automatically.

RemoteA2aAgent — a client that reads a remote Agent Card to discover capabilities, then calls the remote agent across the network. Enables multi-agent systems spanning different services.

💜 Recall Hook — ADK in One Paragraph

Architecture: Event-driven (every action is an immutable Event) + Compute/Memory separation (agent can crash; session state cannot be lost).

Runner: Orchestrates one user interaction — manages session, invokes agents, processes events, persists state to DB. You call Runner, not Agent directly.

Event Loop: Execute → Yield → Pause → Persist to DB (step 4) → Resume. A crash after step 4 is safe — replay from last event.

State scopes: temp: (this call only) · none (this session) · user: (this person) · app: (all users).

Context types: ReadOnly=dynamic instructions · Callback=guardrails · Tool=tool functions · Invocation=core agent logic.

Tools: docstring is the contract — LLM reads it to decide when to call. Use McpToolset for enterprise data sources.

BEGINNER

KV Cache — For Beginners, Read This First

Plain-English intuition before the technical deep-dive in Part 6
📚 Intuition: Attention is just a search
Imagine you’re at a library. You walk in with a question written on a card. Every book on the shelf has a label on its spine (a short summary of what’s inside). You scan the labels, find the best matches, then open those books and read the actual content.

That’s exactly what Q, K, and V are:
Library analogy: your question card
Q — Query
“What am I looking for?”

The current token asks this. Changes on every generation step. Never cached.
Library analogy: label on the spine
K — Key
“What do I contain?”

Every past token has one. Never changes. Gets cached.
Library analogy: content inside the book
V — Value
“Here’s my actual information.”

Every past token has one. Never changes. Gets cached.
What happens each step:
① Match — dot-product Q with every K → scores that say “how relevant is each token to me right now?”
② Normalize — softmax turns the scores into weights that sum to 100%
③ Blend — multiply each V by its weight, add them all up → one context-aware output vector
❓ Common confusion: does “past tokens” mean past conversations?
No. It means tokens within the current request — not across separate conversations.

When you send: “You are a helpful assistant. Alice asked: what is the capital of France?”

The model generates one token at a time:
The → capital → of → France → is → Paris

At each step, it needs to look at everything before it. Without KV Cache → recomputes K and V for the whole prompt every step. With KV Cache → computes once, reuses forever.
Three different problems, three different tools:
▶ KV Cache
GPU memory optimization. Don’t recompute what you already computed inside this generation.
▶ Prompt Cache (Anthropic / Gemini)
API-level optimization. Don’t re-process the same system prompt across separate API requests.
▶ Mem0 / RAG
Actual memory. Remember things across separate conversations, days, and users.
💬 Say it in one sentence

“Self-attention lets each word ask ‘which other words matter for understanding me?’ — Q is the question, K is the index, V is the answer, and the KV Cache just means you stop re-reading the index from scratch every time.”

👉 Ready for the full technical picture? Continue to Part 6: KV Cache — GPU-Level Inference Caching →
PART 6

KV Cache — GPU-Level Inference Caching

KV Cache architecture & distributed memory management
📚 First: What are Q, K, and V? (The foundation)
Inside every transformer model (GPT, Gemini, Claude, Llama…) is a mechanism called self-attention. Its job: for every token being generated, figure out which earlier tokens in the context are most relevant — and by how much.

The mechanism relies on three learned weight matrices — Wq, Wk, Wv — that are trained alongside the model weights. Every token embedding xi is projected through each matrix to produce three vectors:
q(i) = Wq · x(i)     ← Query vector for token i
k(i) = Wk · x(i)     ← Key vector for token i
v(i) = Wv · x(i)     ← Value vector for token i
for i ∈ [1, T] — computed for every token in the sequence
Q — Query
“What am I looking for?”

Every token gets a Q. During generation, it’s the newest token’s Q that is dot-producted against every K in the sequence to ask: “how relevant is each past token to me?”

Q is never cached — recomputed each generation step.
K — Key
“What do I have to offer?”

Every token gets a K. The dot product Q·Kᵀ produces unnormalized attention scores (ω) that say how strongly the current token should attend to each past token.

Past K’s never change — cached.
V — Value
“What is my actual content?”

Every token gets a V. After softmax normalizes the attention scores into weights α, the V’s are weighted and summed to produce the final context vector z.

Past V’s never change — cached.
The full pipeline, step by step:
1
Embed — each token id is converted to a dense embedding vector x(i) via a lookup table.
2
Project — multiply every x(i) by Wq, Wk, Wv to get q(i), k(i), v(i). These matrices are learned during training.
3
Score — for the current token's Q, compute dot products with every K: ω(i,j) = q(i)ᵀ · k(j). Higher score = more relevant.
4
Scale & normalize — divide scores by √dk to prevent exploding gradients (the dot product variance grows linearly with dk; dividing restores it to ~1). Then softmax → weights α that sum to 1.
5
Output — weighted sum of all V’s using α as weights: z(i) = ∑ αij v(j). This is the context-enriched output for token i.
The full formula:
Attention(Q, K, V) = softmax( Q · Kᵀ / √dk ) · V
In practice, real models run Multi-Head Attention: multiple independent sets of (Wq, Wk, Wv) in parallel, each learning different relationship patterns (syntax, coreference, position…), then concatenate their outputs. Each set is one “head”.
So what’s the expensive part? Steps 2–5 must run for every token in the context window. But for tokens already seen, k(i) and v(i) never change — only the new token’s Q changes. Recomputing them from scratch on every generation step means paying enormous GPU cost on data that hasn’t moved. That’s exactly the problem KV Cache solves — store k and v for all past tokens once, reuse every step.
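A compact PyTorch sketch of one generation step with a KV cache, following the formula above: only the new token is projected, past K/V are reused from the cache, and the new K/V are appended for the next step. Illustrative single-head code, not an optimized kernel.

Python: single-head attention step with a KV cache (illustrative)
import torch

def attend_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """One generation step. x_new: (1, d_model); k_cache/v_cache: (t, d_k) from past tokens."""
    q = x_new @ Wq                                        # Q for the newest token only
    k_cache = torch.cat([k_cache, x_new @ Wk], dim=0)     # append K_new — past K's unchanged
    v_cache = torch.cat([v_cache, x_new @ Wv], dim=0)     # append V_new — past V's unchanged
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # Q·Kᵀ / √dk
    weights = torch.softmax(scores, dim=-1)               # α — sums to 1
    z = weights @ v_cache                                  # weighted sum of V's
    return z, k_cache, v_cache                             # caches are reused next step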
🟢 The Intuition

Imagine answering questions about a 500-page book. Without a cache, you re-read the entire book from page 1 every time someone asks a new question. That’s what an LLM does without KV cache — it recomputes attention (K and V vectors for every token) over the entire context on every single new token it generates.

🔵 The Technical Reality — Attention Math

For each new token, the model computes a new Q, then needs K and V tensors for all previous tokens to run the attention formula. Without caching → all K,V recomputed every time → O(n²) cost per token → GPU overloaded.

With KV cache → K and V for all previous tokens are stored in GPU HBM → the new token only computes Knew and Vnew → appends them to the cache → O(n) cost per token.

What’s cached: The static prefix (system prompt + tools) is computed once and cached. The dynamic suffix (user messages, tool results) is appended each turn. This is why prompt structure matters so much — stable content at the top = maximum cache reuse.

The Memory Budget Problem

⚠️ Real numbers: Llama 3 70B with 128k token context = ~40 GB KV cache per user. Batch of 10 users = 400 GB. More than most GPUs have.
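A back-of-envelope check on that figure, assuming a typical Llama 3 70B layout (80 layers, 8 KV heads under grouped-query attention, head dim 128, fp16 cache) — exact numbers vary by deployment:

Python: KV-cache size back-of-envelope (assumed Llama 3 70B config)
layers, kv_heads, head_dim, bytes_fp16 = 80, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 2 = one K + one V per layer/head
gb_per_user = bytes_per_token * 128_000 / 1e9                     # 128k-token context
print(f"{gb_per_user:.0f} GB per user")                           # ≈ 42 GB → ~400 GB for 10 users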
KV CACHE = FIXED MEMORY POOL
[user1 K/V blocks] [user2 K/V blocks] [...] ▓▓▓▓▓▓▓▓ ← FULL
↓ when the pool is full
K/V EVICTIONS (LRU policy) → the evicted user's next request is a CACHE MISS → full recompute → latency spike

vLLM's solution: Paged KV Cache — borrowed from OS virtual memory. Shared Prefix Pages: 1000 users sharing the same RAG context → stored once → massive memory efficiency.
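A minimal sketch of turning this on in vLLM: enable_prefix_caching lets requests that share a prompt prefix (the same system prompt or RAG context) reuse the same KV blocks. Model name and prompts are illustrative.

Python: vLLM with automatic prefix caching (illustrative)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_context = "SYSTEM: You answer questions about the attached report.\n<long RAG context>\n"
prompts = [shared_context + "Q: What was Q3 revenue?",
           shared_context + "Q: Who are the main competitors?"]   # same prefix → shared KV pages

outputs = llm.generate(prompts, SamplingParams(max_tokens=128))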

The Multi-Node Problem

SINGLE NODE: clean, automatic prefix reuse, boringly simple.

MULTI-NODE PROBLEM:
Load Balancer
├── vLLM Node 1 — KV Cache (private, GPU-local)
└── vLLM Node 2 — KV Cache (private, GPU-local)
User A turn 1 → Node 1 → KV computed, cached locally
User A turn 2 → Node 2 → ZERO cache for User A → FULL RECOMPUTE ❌
💡 LMCache treats KV cache as shared infrastructure. Every node connects to a shared tiered storage: GPU HBM (hot) → CPU RAM (warm) → S3/ONTAP (cold). Use kv_role: "kv_both" for prefill + decode.

NVIDIA Unified Memory — The Hardware Solution

Python: RMM Unified Memory for Llama 3 70B
import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import pipeline

# Enable unified memory — GPU can transparently spill to CPU RAM
# GH200: 96 GB HBM + 480 GB LPDDR @ 900 GB/s NVLink-C2C
rmm.reinitialize(managed_memory=True)
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

# Now loads without OOM — hardware handles data movement automatically
pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-70B")
🔴 The Production Trap / Economics
"Training made the headlines. Inference pays the power bill." — NetApp/HPCWire 2026

The companies that win won't throw the most GPUs at inference — they'll engineer smarter inference paths. KV reuse + quantization + tiered storage = dramatically better economics without more hardware.

💜 Recall Hook

Single node: vLLM Paged KV. Multi-node: LMCache + shared tiers (GPU→CPU→S3). Hardware limit: NVIDIA unified memory (GH200, 900 GB/s NVLink). Economics: inference > training costs now.

PART 7

Prompt Caching — Claude, Gemini & Beyond

Token efficiency, Claude cache_control, and Gemini’s implicit caching
🟢 The Intuition

KV cache is inside the inference server — transparent to you. Prompt caching is exposed to you as a developer via the API. You structure your prompt so the stable parts are a shared prefix, and every request that matches it pays ~1/10th the input token price.

"It is often said in engineering that 'cache rules everything around me', and the same rule holds for agents." — Claude Code team

The System Prompt Layout

▲ STABLE — cache hits here (layers 1–3)
▲ DYNAMIC — no cache benefit (layer 4)
1. BASE SYSTEM INSTRUCTIONS + TOOLS ← globally cached
(static — never changes between sessions or users)
2. CLAUDE.md / Memory / Project context ← cached per project
(same within one project, changes project-to-project)
3. SESSION STATE (env, MCP config, output style) ← cached per session
4. MESSAGES (user messages, tool results, assistant responses) ← GROWS each turn — NOT cached
Layers 1–3 are your prompt prefix. Put your system instructions, tools, project context, and session config here. The model caches these once and reuses them across all turns — you pay full price only the first time.
Layer 4 is the conversation history. It grows with every turn. Nothing here is cached — you pay full input token cost every request for this part.
“Long running agentic products like Claude Code are made feasible by prompt caching which allows us to reuse computation from previous roundtrips and significantly decrease latency and cost.

At Claude Code, we build our entire harness around prompt caching.”
— Anthropic Claude Code engineering team  ·  Lessons from Building Claude Code (Apr 30, 2026)

How Claude Code Does It — and What You Can Apply Everywhere

Claude Code builds its entire prompt harness around caching (the "prompt harness" is the code layer that assembles and sends the full prompt on every API call — system prompt, tool definitions, memory, and conversation history packaged together). Every agentic turn re-sends the full conversation history — system prompt, tools, memory, and all prior messages. Without caching, that re-sending would cost full input token price each time. With it, the stable parts are paid for once and reused indefinitely.

The key insight is structural discipline: place everything that does not change (system instructions, tool definitions, project context) at the top of the prompt, and everything that changes (conversation history, user input) at the bottom. The model caches from the top down — so any change in the middle invalidates everything below it.

🌍 Universal Best Practices — applies to Claude, Gemini, OpenAI and any provider with prompt caching
1
Static content first, dynamic content last. System prompt → tools → memory/context → then conversation history. Any token that changes breaks the cache for everything after it.
2
Never edit the system prompt mid-session. Instead, inject contextual updates (reminders, state changes) as a message in the conversation — not by mutating the system prompt. Mutating it = cache miss on every request that follows.
3
Keep tool definitions stable. Tool schemas are part of the cached prefix. Adding, removing, or reordering tools invalidates the entire conversation cache. Avoid dynamic tool lists.
4
Forked / parallel calls must share the same prefix. When spawning sub-agents or compaction calls, use an identical system prompt and tool set as the parent. Different prefix = no shared cache = pay full price for the whole history.
5
Monitor your cache hit rate. Claude: track cache_read_input_tokens. Gemini: check cached_token_count. A dropping hit rate is a cost alarm — treat it with the same urgency as a latency spike.

Pricing — Why This Changes Everything

📌 What does this mean in practice?
Anthropic charges differently depending on whether a token was cached or not. The numbers below are multipliers on the normal input token price.

Example: your system prompt is 10,000 tokens. Normal price = $X per request. With prompt caching, the first request costs $1.25X (cache write), but every request after that costs $0.10X (cache read) — a 90% cost reduction. For an agent that runs 1,000 turns per day with the same system prompt, this is the difference between a viable product and a product that loses money.
Action | Price
Cache write (first time) | 1.25× normal
Cache read (every hit) | 0.10× normal
Break-even point | ~2 requests
After 10 requests | ~90% cost reduction
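
The break-even arithmetic is worth checking once yourself. A toy calculation using the multipliers above:

Python: Cache write/read break-even, a toy calculation
WRITE_MULT, READ_MULT = 1.25, 0.10   # multipliers from the table above

def cost_ratio(turns: int) -> float:
    """Cost of `turns` requests with caching, divided by the cost without it."""
    with_cache = WRITE_MULT + (turns - 1) * READ_MULT   # one write, then reads
    without_cache = turns * 1.0                         # full price every time
    return with_cache / without_cache

for turns in (1, 2, 10, 100, 1000):
    print(f"{turns:>5} turns: {cost_ratio(turns):.1%} of the uncached cost")
# 1 turn: 125.0% (caching alone costs more), 2 turns: 67.5%,
# 10 turns: 21.5%, 1000 turns: ~10%, i.e. roughly a 90% reduction at scale.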

Google Gemini Caching — Two Modes

Implicit (Auto)

⚡ Always on by default for Gemini 2.5+. You do nothing.

Gemini silently checks whether the beginning of your incoming prompt matches something it has already computed recently. If it does, it reuses that computation and charges you less — automatically, with no API changes on your side.

How to take advantage of it: put your large, stable content (system instructions, a long document, a big code file) at the top of your prompt. Dynamic content (the user's question, session ID, timestamp) goes at the bottom. Since Gemini matches from the start, the more stable tokens at the front → the bigger the cache hit.

Limitation: it's best-effort, not guaranteed. Google doesn't promise a hit — it depends on whether that prefix is still warm in their infrastructure. You can't force it or inspect the cache directly.

How to verify a hit: check response.usage_metadata.cached_token_count — if it's > 0, you got a cache hit and were charged less for those tokens.
What is Explicit Caching (TTL)?

Implicit caching is best-effort — you hope Gemini still has your prefix warm. Explicit caching is a guarantee. You upload your large stable content to Gemini once, give it a name and a TTL (time-to-live), and get back a cache handle. Every subsequent request that references that handle pays only for the new tokens (your question) — the large content is never re-processed.

When to use it: you have a large document, codebase, or transcript that many requests will query. Example: a 200-page PDF you want users to ask questions about — upload it once as a cache, then every question only costs a handful of tokens instead of 100,000+.

The TTL: how long Gemini keeps the cache alive. Default is 5 minutes (300s). You can extend it. After TTL expires the cache is deleted — you'd need to re-upload to use it again.
Python: Gemini Explicit Caching — Step by Step
from google import genai
from google.genai import types

client = genai.Client()

# STEP 1 — Upload your large stable content to Gemini as a named cache.
# This is the expensive step: Gemini processes the full document once.
# 'document' is your large content (e.g. a loaded PDF, long codebase, transcript).
cache = client.caches.create(
    model="gemini-3-flash-preview",
    config=types.CreateCachedContentConfig(
        system_instruction="You are an expert analyzing transcripts.",
        contents=[document],     # the large stable content — processed once
        ttl="300s",              # keep this cache alive for 5 minutes
        display_name="my-doc-cache"  # optional human-readable name
    )
)
# cache.name is your handle, e.g. "cachedContents/abc123"

# STEP 2 — Ask a question, referencing the cache by name.
# Gemini does NOT re-process the document. You only pay for the question tokens.
# You can repeat this step many times — each call is cheap.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Please summarize this transcript",
    config=types.GenerateContentConfig(cached_content=cache.name)
)

# STEP 3 — Verify the hit. cached_token_count shows how many tokens
# were served from cache (i.e. NOT charged at full input price).
print(response.usage_metadata)
# e.g. prompt_token_count: 9, cached_token_count: 95800, candidates_token_count: 312
Key difference from Implicit: with Explicit you own the cache — you can name it, extend its TTL, reuse it across users, and delete it. Implicit is automatic but ephemeral and uncontrolled.
Service Tiers

Tier | Price | Latency | Best for
Standard | Full price | Seconds–minutes | General day-to-day apps
Priority | +75–100% | Seconds (non-sheddable) | Real-time fraud, production copilots
Flex | 50% off ✅ | 1–15 min target | Multi-step agent workflows ← best value
Batch | 50% off | Up to 24 hours | Pre-processing, bulk embeddings
Context Cache | 90% off | Faster TTFT | Recurring queries over same content
Flex is the sleeper for agent workflows: 50% discount, synchronous interface (no async code changes), and most multi-step agent chains are not latency-sensitive enough to need Priority pricing.
💜 Recall Hook

Stable top, dynamic bottom. Never change tools mid-session. Monitor cache hit rate like uptime.
Gemini: implicit=auto, explicit=TTL API. Flex tier = 50% off for agent workflows.

PART 8

Semantic Caching

Meaning-level cache hits — skip the LLM even when the wording differs
🌞 The Core Idea

Prompt caching (Parts 6–7) only helps when the exact bytes match. Semantic caching goes further: it caches LLM responses and, on the next request, checks whether the meaning is similar enough — not just whether the text is identical. If it is, it returns the cached answer and never calls the LLM at all.

User 1: "What is the return policy?"
User 2: "How do I return a product?"       ← different words, same intent
User 3: "Can I get a refund on my order?"  ← different words, same intent

Prompt caching:   3 cache misses (bytes don't match)
Semantic caching: 1 LLM call (User 1), then 2 cache hits (Users 2 & 3)

This matters most in customer-facing apps, chatbots, and Q&A agents where thousands of users ask the same things in slightly different ways. Every cache hit = zero LLM cost + much lower latency.

How It Works — Step by Step

When a new request comes in (a minimal code sketch follows these steps):
1
Embed the query. Run the user's question through an embedding model (e.g. text-embedding-3-small) to get a vector — a list of numbers that encodes the meaning.
2
Search the cache. Vector-similarity search against previously cached (query vector, LLM response) pairs stored in a vector DB like Qdrant or Redis. Returns a similarity score 0–1.
3a
Score ≥ threshold → cache hit. Return the stored LLM response immediately. No LLM call. Cost = near zero. Latency = milliseconds.
3b
Score < threshold → cache miss. Call the LLM. Store the (query vector, response) pair in the cache. This becomes a hit for the next semantically similar question.
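
A minimal in-memory sketch of this loop. The embed function is assumed to be supplied by whatever embedding model you use, and a plain Python list stands in for the vector DB:

Python: Minimal semantic cache (in-memory sketch)
import numpy as np

class SemanticCache:
    """Toy semantic cache: a plain list stands in for Qdrant / Redis.

    `embed` is whatever embedding function you use (e.g. a wrapper around
    text-embedding-3-small) and is assumed to return a list of floats.
    """

    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold          # tune per domain, see next section
        self.entries = []                   # list of (vector, cached_response)

    def lookup(self, query: str):
        """Return a cached response for a semantically similar query, else None."""
        q = np.asarray(self.embed(query))
        for vec, response in self.entries:
            sim = float(q @ vec) / (np.linalg.norm(q) * np.linalg.norm(vec))
            if sim >= self.threshold:
                return response             # hit: no LLM call
        return None                         # miss: caller invokes the LLM

    def store(self, query: str, response: str):
        self.entries.append((np.asarray(self.embed(query)), response))

In production the list becomes a vector-DB collection and lookup is an approximate nearest-neighbour query, but the hit / miss logic is identical.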

🚨 The Context Window Problem (Critical Production Gotcha)

The Lake Huron Bug

A cache key built from only the latest message is dangerously wrong. Here's why:

Session A:
  User: "What is the largest lake in North America?" → "Lake Superior." → cached
  User: "What is the second largest?" → "Lake Huron." → cached

Session B (different user, 10 minutes later):
  User: "What is the largest stadium in North America?" → "Michigan Stadium." → cached
  User: "What is the second largest?"
    → Semantic cache finds "What is the second largest?" from Session A
    → Returns "Lake Huron" ← WRONG. Context was about lakes, not stadiums.
🔴 Fix: The cache key MUST include the recent conversation context, not just the latest message. Vectorize a sliding window of the last N messages + the new message as the lookup key. This way "second largest" in a lakes conversation and "second largest" in a stadiums conversation produce different cache keys.
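
In code, the fix is simply to build the text you embed from a sliding window rather than the latest message alone. A small, hypothetical helper:

Python: Context-aware cache key (sliding window), hypothetical helper
def cache_key_text(history: list[dict], new_message: str, n: int = 4) -> str:
    """Build the text to embed as the semantic-cache key.

    Including the last `n` turns means "What is the second largest?" in a lakes
    conversation and in a stadiums conversation produce different vectors.
    """
    recent = history[-n:]                                        # sliding window
    window = "\n".join(f'{m["role"]}: {m["content"]}' for m in recent)
    return window + "\nuser: " + new_message

# Embed cache_key_text(...), not new_message alone, and search the vector store
# with that embedding.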

Similarity Score Tuning

The threshold is the single most important parameter in semantic caching. Too high = useless cache. Too low = wrong answers. You tune it per domain — a legal Q&A chatbot needs a much higher threshold than a general FAQ bot.

Threshold | Effect
Too high (e.g. 0.99) | Almost never matches. Cache fills with near-duplicate entries. You're essentially paying full LLM cost every time.
Too low (e.g. 0.70) | Matches too aggressively. Returns stale or wrong responses for questions that are superficially similar but semantically different.
Sweet spot: 0.92–0.97 | Good for general use. Fine-tune toward 0.97+ for precise domains (legal, medical); toward 0.92 for casual FAQ / support bots.

LiteLLM — What It Is and Why It Matters Here

What is LiteLLM?

LiteLLM is an open-source proxy / gateway that sits in front of your LLM API calls. It gives you a single OpenAI-compatible interface that works with Claude, Gemini, OpenAI, Mistral, and 100+ other providers — so you can switch providers without changing your application code.

For caching specifically: LiteLLM acts as a caching middleware. You point your app at LiteLLM instead of directly at the LLM API. LiteLLM checks its cache first; if there's a hit it returns instantly, otherwise it forwards to the LLM and caches the response for next time.

It supports: In-Memory, Redis (exact match), Qdrant Semantic, Redis Semantic, S3, GCS — all configured from a single config.yaml.
YAML: LiteLLM config.yaml — Exact Match (Redis) + Semantic (Qdrant)
# --- OPTION A: Exact-match caching via Redis ---
# Every byte must match. Fast and cheap. Use for deterministic queries.
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 600           # cache responses for 10 minutes
    namespace: "prod.cache"

---
# --- OPTION B: Semantic caching via Qdrant ---
# Meaning-level match. Use for natural language Q&A where wording varies.
# Qdrant is a vector DB — it stores and searches the cached query embeddings.
litellm_settings:
  cache: true
  cache_params:
    type: qdrant-semantic
    qdrant_url: "http://localhost:6333"
    similarity_threshold: 0.95  # tune this: 0.92-0.97 for most use cases
    ttl: 600
Python: Per-request cache controls (works with any LiteLLM cache type)
# These go in extra_body on any request. They let you override cache
# behaviour per-request without changing global config.

# Force a fresh LLM response — bypass cache entirely (e.g. for real-time data)
extra_body={"cache": {"no-cache": True}}

# Set a custom TTL for just this response (in seconds)
extra_body={"cache": {"ttl": 300}}

# Only accept a cached response if it was stored less than 10 minutes ago
extra_body={"cache": {"s-maxage": 600}}

# Don't store this response at all (e.g. PII-sensitive queries)
extra_body={"cache": {"no-store": True}}

# Namespace: isolate cache for this user/tenant (prevents cross-user cache leaks)
extra_body={"cache": {"namespace": "user-1234"}}
Security note: Always use namespace to isolate cache per user or tenant in multi-tenant apps. Without it, User A's cached response could be returned to User B if their queries are semantically similar — a data leakage risk.
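
For completeness, a minimal end-to-end call, assuming the proxy from the config above is running on localhost:4000 and your app uses the standard OpenAI SDK pointed at it:

Python: Calling the LiteLLM proxy with a per-request cache override
from openai import OpenAI

# The OpenAI SDK is pointed at the LiteLLM proxy, not directly at the provider.
# The model name is whatever your proxy config routes; the key is your proxy key.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm-proxy-key")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the return policy?"}],
    extra_body={"cache": {"namespace": "user-1234", "ttl": 300}},  # tenant isolation + custom TTL
)
print(response.choices[0].message.content)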

The Full 4-Layer Production Caching Stack

In a well-architected agent system, these layers sit in sequence. Each one catches requests that the layer above missed.

1
Semantic Cache (Vector DB)
Meaning-level match. Catches semantically equivalent questions even when wording differs. Must include context window in cache key. Threshold tuning required.
LiteLLM + Qdrant / Redis Semantic
↓ miss (no semantically similar cached response)
2
Exact Prompt Cache (Redis / API-level)
Byte-exact prefix match. Essentially free via Claude/Gemini APIs. Best for repeated system prompts and stable prefixes in agentic sessions.
Redis · Claude Prompt Caching · Gemini Implicit/Explicit
↓ miss (prompt prefix changed or new content)
3
LLM Inference
Actual model call. Most expensive. Unavoidable on a true cache miss. Result stored back into Layers 1 and 2 for future hits.
Pay full input + output token price
↓ (inside inference server — transparent to you)
4
KV Cache (GPU HBM → CPU RAM → S3)
Prefix reuse at GPU compute level. Handled by the inference runtime — you cannot control it directly but you benefit from it via prompt structure (static prefix first). vLLM Paged (single node), LMCache (multi-node).
Transparent — but shaped by your prompt layout
💜 Recall Hook

Semantic cache = meaning match, needs context window in key, threshold 0.92–0.97.
LiteLLM = one config, all providers, supports Redis (exact) + Qdrant (semantic).
Always namespace per tenant. Four layers: Semantic → Exact → LLM → KV (transparent).

PART 9

Security & Interoperability

Agent identity, least privilege, A2A protocol, MCP

The Three Security Concerns

💥
Rogue Actions
Agent does something harmful — deletes data, spends money, calls destructive APIs.
🔓
Data Disclosure
Agent leaks sensitive info to wrong party — user A sees user B's data.
🎭
Prompt Injection
Malicious content in tool results hijacks agent behavior — attacker controls the agent via a crafted document.

Defense-In-Depth — Two Layers

LAYER 1: Deterministic Guardrails (hardcoded rules, outside the LLM)
  --> "No purchase over $100 without human approval"
  --> "Never call DELETE endpoints"
  --> Implemented as: before_tool_callback, policy engines

LAYER 2: Reasoning-Based Defenses (AI reviewing AI)
  --> Specialized "guard model" reviews the agent's plan before execution
  --> Flags risky steps: "This action will delete all user data -- block?"
  --> Slower but catches complex, context-dependent threats
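
Layer 1 is ordinary code, which is exactly why it is reliable. A simplified sketch of the pattern (in ADK this logic would live in a before_tool_callback; the tool names and limits here are illustrative):

Python: A Layer-1 deterministic guardrail (simplified sketch)
APPROVAL_LIMIT_USD = 100
BLOCKED_METHODS = {"DELETE"}

def guard_tool_call(tool_name: str, args: dict) -> dict | None:
    """Return a replacement result to block the call, or None to let it run."""
    if tool_name == "make_purchase" and args.get("amount_usd", 0) > APPROVAL_LIMIT_USD:
        return {"status": "blocked", "reason": "Purchases over $100 need human approval."}
    if args.get("http_method") in BLOCKED_METHODS:
        return {"status": "blocked", "reason": "This agent never calls DELETE endpoints."}
    return None   # no rule fired: the tool executes normally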

Agent Identity — The Third Security Principal

The Intuition

Before agents: two things have permissions — humans (via login) and services (via service accounts).

After agents: a third category — autonomous actors that need their own identity, separate from the user who started them AND the developer who built them.

OLD WORLD:
  Human User --> OAuth/SSO
  Service    --> IAM Service Account

NEW WORLD:
  Human User --> OAuth/SSO
  Service    --> IAM Service Account
  Agent      --> Agent Identity (SPIFFE) --> own least-privilege permissions
      SalesAgent   --> CRM read/write only
      HRAgent      --> HR system read only
      FinanceAgent --> Reporting read, no transaction write
  If SalesAgent is hijacked --> can only touch CRM, nothing else (blast radius limited)

A2A Protocol — Agent-to-Agent Communication

How It Works

1. Discovery: Each agent publishes an Agent Card (JSON) — capabilities, URL, auth requirements

2. Communication: Task-based (async). Client sends a task, server streams updates back.

3. Why not REST: REST is request-response. Agents need async, streaming, long-running tasks.

ADK: A2A Setup
# Turn any local agent into an A2A microservice
a2a_server = to_a2a(my_local_agent)  # Generates Agent Card + API endpoint

# Connect to a remote agent
remote = RemoteA2aAgent(
    agent_card_url="https://shipping-agent.example.com/.well-known/agent.json"
)

MCP — Model Context Protocol

💡 Think USB-C for AI tools. Before MCP: every AI system needed custom integration for every tool. After MCP: one standard. If a tool has an MCP server, any MCP-compatible agent can use it automatically via list_tools() + call_tool().
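
A minimal client-side sketch using the official mcp Python SDK; the server command and tool name are illustrative:

Python: MCP client, list_tools() + call_tool()
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch (or connect to) an MCP server over stdio; the command is a placeholder.
    server = StdioServerParameters(command="python", args=["my_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()       # discover what the server exposes
            print([t.name for t in tools.tools])
            result = await session.call_tool("search_docs", arguments={"query": "refund policy"})
            print(result.content)

asyncio.run(main())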
💜 Recall Hook

Agents are a 3rd security principal — not user, not service. Least privilege. Deterministic guardrails first. Guard models second. A2A = async agent-to-agent. MCP = USB-C for tools.

PART 10

Agent Ops & Production Architecture

Agents in production are living systems that need continuous management
Why Traditional DevOps Fails

Traditional test: assert output == expected_output ✅ / ❌

Agent test: "Is this response good?" 🤔

Agents are stochastic — same input, different output on different runs. You cannot unit-test them like a pure function.

The Agent Ops Stack

1. DEFINE METRICS (business KPIs, not just technical)
   - Goal completion rate, user satisfaction, task latency, cost per interaction
   - Revenue impact, conversion, retention
2. BUILD EVALUATION DATASETS ("golden sets")
   - Sample from real production interactions
   - Cover the full range of use cases + edge cases
   - Domain expert review before using as ground truth
3. USE AN LLM-AS-JUDGE (sketched below)
   - Cannot assert exact match --> use a model to score quality
   - e.g., "Does this answer correctly resolve the user's intent? Score 1-5"
4. METRICS-DRIVEN DEPLOYMENT
   - Deploy new model/prompt version --> run against full eval set
   - Compare scores to production version --> Go/No-Go decision
5. TRACE WITH OPENTELEMETRY
   - For debugging: exact prompt sent, model reasoning, tool chosen, params, raw results
   - Platform: Google Cloud Trace, LangFuse, etc.
6. CLOSE THE FEEDBACK LOOP
   - User reports bad answer --> replicate --> add to eval dataset
   - Every bug becomes a test case
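
A minimal sketch of the LLM-as-judge step (item 3). The judge model and rubric are illustrative, and real evals usually request structured output rather than a bare integer:

Python: LLM-as-judge scoring (illustrative sketch)
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading an AI agent's answer.
Question: {question}
Agent answer: {answer}
Reference (golden) answer: {reference}
Does the agent's answer correctly resolve the user's intent?
Reply with a single integer score from 1 to 5."""

def judge(question: str, answer: str, reference: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",   # any strong model can act as the judge
        messages=[{"role": "user", "content": RUBRIC.format(
            question=question, answer=answer, reference=reference)}],
    )
    return int(resp.choices[0].message.content.strip())

# Score every item in the golden set, then compare the mean against the
# production baseline as the Go/No-Go signal in step 4.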

CI/CD for Agents

Trigger
Code change  /  New model  /  Prompt update
Evaluate
Run against eval dataset
Compare
LLM-as-Judge scores
New version vs. Production version
✦ Quality Gate
Latency + Cost + Quality — all pass?
🚀 Deploy
Deploy to production
♻ Feedback Loop
Monitor → Collect feedback → Update eval dataset

Full Production Architecture

👤 User / Other Systems
Web UI · Mobile · A2A Client · API
⚙️ Agent Runtime — ADK / Vertex AI Agent Engine
Runner
Event Loop
Context mgmt
Services
SessionService → PostgreSQL
ArtifactService → GCS / S3
MemoryService → Mem0 / VectorDB
Agent Pipeline
OrchestratorAgent (Coordinator)
ResearchAgent  ·  before_tool_callback (auth check)  ·  Tool: RAG / NL2SQL
DraftAgent  ·  before_model_callback (PII scrub)
ReviewAgent  ·  after_model_callback (content filter)
🧠 Inference Layer
Caching Stack
Semantic Cache
Prompt Cache
KV Cache (GPU)
Model Routing
Complex → Gemini 2.5 Pro
Simple → Gemini 2.5 Flash
Images → Specialized APIs
📊 Agent Ops
Eval datasets · LLM judges · OpenTelemetry traces · CI/CD pipeline · Model upgrade automation · Human feedback → new test cases

The Developer Mindset Shift

🧱 Bricklayer — Old Paradigm
"Step 1: do X"
"Step 2: if Y, do Z"
"Step 3: else do W"
"Step 4: format as..."
🎬 Director — New Paradigm
"You are a helpful agent."
"Your goal is X."
"You have these tools."
"Make good decisions."
Comprehensive evaluation matters more than the prompt itself. You cannot just write a good prompt and ship — you must measure, evaluate, and iterate systematically.
💜 Recall Hook

For agents, "testing" means evaluation datasets + LLM judges, not assert statements.
Every bug → new test case. Models rotate every 6 months. Build the CI/CD pipeline from day one.

REF

Sources & References

All primary sources used across this learning guide, verified May 2026
Memory Architecture (Part 4)
State of AI Agent Memory 2026
Covers: memory taxonomy (episodic/semantic/procedural), memory scopes, vector vs graph memory, actor-aware memory in multi-agent systems, OpenMemory MCP, production lessons from 18 months of releases.
mem0.ai/blog/state-of-ai-agent-memory-2026 · April 2026
Introducing the Token-Efficient Memory Algorithm
Covers: single-pass ADD-only extraction, agent-generated facts as first-class, entity linking, multi-signal retrieval (semantic + keyword + entity), benchmark results on LoCoMo / LongMemEval / BEAM.
mem0.ai/blog/mem0-the-token-efficient-memory-algorithm · April 2026 · Author: Deshraj Yadav
AI Memory Benchmarks in 2026
Covers: memory benchmark vs long-context benchmark distinction, LoCoMo, LongMemEval, BEAM benchmarks explained, what a complete memory benchmark must test, gaps in current benchmarks.
mem0.ai/blog/ai-memory-benchmarks-in-2026 · May 2026 · Author: Himanshu Sangshetti
ProMem: Proactive Memory for Long-Horizon Agent Tasks
Covers: retrospective vs prospective memory, session-start scan pattern, scheduled reflection pattern.
arXiv:2601.04463
PASK: Proactive Agent with Selective Knowledge
Covers: context-trigger scan pattern, intent classifier gate (DEMAND / NO_DEMAND), knowing when NOT to fire memory retrieval.
arXiv:2604.08000
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Covers: head-to-head comparison of 10 memory approaches on LoCoMo. Published at ECAI 2025.
arXiv:2504.19413 · ECAI 2025 · Chhikara, Khant, Aryan, Singh, Yadav
KV Cache & Distributed Memory (Part 6)
Sebastian Raschka — Understanding and Coding the Self-Attention Mechanism from Scratch
Covers: Q/K/V weight matrices, unnormalized attention scores (ω), scaled dot-product formula, why divide by √dk, softmax normalization, context vector output, multi-head attention, cross-attention. Code walkthrough in PyTorch.
sebastianraschka.com/blog/2023/self-attention-from-scratch.html · Feb 2023 · Sebastian Raschka
LMCache: KV Cache as Shared Infrastructure
Covers: tiered KV storage (GPU HBM → CPU RAM → S3), cross-node cache sharing, prefill/decode separation.
NetApp (Mar 2026)
NVIDIA Unified Memory for LLMs
Covers: RMM unified memory pool, CUDA managed memory for large model inference.
NVIDIA Developer Blog (Sep 2025)
HPCWire: Distributed Memory Architecture for LLM Inference
Covers: disaggregated prefill/decode, CXL memory expansion, production cluster architecture.
HPCWire (May 2026)
Prompt Caching — Claude & Gemini (Part 7)
Claude Code Blog — Lessons from Building Claude Code: Prompt Caching is Everything
Covers: the 5 production lessons (prompt layout, system-reminder pattern, never change tools mid-session, fork prefix sharing, monitor hit rate like uptime), compaction pattern, cache write vs read pricing, break-even analysis.
⚠ Note: this is the Anthropic engineering blog post about how their team built Claude Code — not the Claude Code user documentation at code.claude.com.
claude.com/blog/lessons-from-building-claude-code-prompt-caching-is-everything · Apr 30, 2026 · Anthropic Claude Code team
Google Gemini API — Context Caching & Implicit Caching Docs
Covers: explicit context caching (1-hour TTL), implicit/automatic caching (Gemini 2.5+), cached_token_count, cache naming and lifecycle.
Google Gemini API Docs
Semantic Caching (Part 8)
Microsoft Azure Cosmos DB — Semantic Cache for LLMs
Covers: vector similarity caching, threshold tuning, multi-tenant isolation, hybrid semantic + exact-match cache strategies.
Microsoft Azure Cosmos DB Docs
LiteLLM Proxy — Semantic Caching Documentation
Covers: semantic_cache_type configuration, similarity_threshold, caching across providers, Redis + vector store backends.
LiteLLM Proxy Docs
ARTICLES

Thought Leadership

In-depth analysis on AI topics — for technical, business, and leadership audiences. Updated as new research and tools emerge.

Security

🔒 Security & Risk
Prompt Injection is Now a Measurable Metric — And Your Agents Are at Risk
Anthropic's Opus 4.6 System Card puts hard numbers on prompt injection. 57.1% attack success rate on GUI agents. How to build quantifiable security posture.
~8 min read · Source: Anthropic Opus 4.6 System Card
🔒 Security & Risk
Why Your Agentic AI System Is Only as Secure as Its Weakest Token
One token passed through your entire agent chain isn't security — it's a single point of catastrophic failure. Five real threats, five architectural mitigations.
~10 min read · Enterprise AI Security

Leadership & Strategy

🎯 Leadership & Strategy
AI is an Execution Risk: Why Nandan Nilekani's Warning is the Wake-Up Call Leaders Need
The Aadhaar architect and Infosys co-founder: "It's not an opportunity risk, it's an execution risk." Three systemic risks most leaders are ignoring.
~7 min read · Source: Infosys AI Investor Day, MIT, Google DORA 2025
🏗️ Engineering Leadership
The 5 Real Capabilities Your Engineering Team Needs Before AI Actually Works
70–85% of AI initiatives fail. 42% abandoned most AI projects in 2025. The gap isn't the technology — it's five organizational foundations you need first.
~9 min read · Source: Deloitte, Google DORA 2025
🏗️ Engineering Leadership
How to Deploy Generative AI Safely: The System Thinking Framework Every Enterprise Needs
Building the feature is the easy part. Building the system around it that doesn't fail, hallucinate, discriminate, or get exploited — that's the real challenge. Five error classes, a practical hardening example, and what safe deployment actually looks like.
~10 min read · Enterprise AI Deployment

Tools & Comparisons

🛠️ Tools & Comparison
Claude Code vs OpenCode: Two Philosophies, One Goal
Open-source & model-agnostic vs Anthropic-native & optimized for reasoning. A detailed comparison across architecture, security, pricing, and enterprise use.
~10 min read · Source: OpenCode & Claude Code docs, personal experience
🔒 SECURITY & RISK

Prompt Injection is Now a Measurable Metric

And Your Agents Are at Risk. Anthropic's Opus 4.6 System Card proves it can be tracked as a rigorous engineering metric — not just a theoretical concern for whitepapers.

If your engineering team is building autonomous agents without quantifying prompt injection risk, you are deploying a liability.

For the last two years, prompt injection was mostly treated as a theoretical problem for security whitepapers and conferences. That's over. We aren't just talking about a chatbot generating inappropriate text. We are talking about agents processing emails, browsing the web, and executing code — where a single malicious payload hidden in a shared document can compromise any agent that reads it.

⚠️ "Prevention of prompt injection remains one of the highest priorities for the secure deployment of models in agentic systems." — Anthropic Opus 4.6 System Card

The Numbers — Opus 4.6 Evaluation Data

14.8%
Attack success rate without extended thinking (100 adaptive attempts)
21.7%
Attack success rate WITH extended thinking — counterintuitively higher
0.0%
Attack success rate for pure coding tasks (bounded environment)
57.1%
Attack success rate for GUI / Computer Use agents with safeguards enabled

1 Static Benchmarks Are Not Useful

Testing agents against a fixed dataset of known attacks doesn't work. Models easily deflect known patterns while failing entirely against novel approaches.

The only reliable test is adaptive red-teaming — an automated, AI-driven assessment in which attack agents iteratively probe the target system and adjust their tactics based on its defenses. Static benchmarks become stale within weeks.

💡 Adaptive red-teaming is the difference between "we tested it once" and "we continuously verify against evolving attacks."

2 The "Extended Thinking" Paradox

You might assume giving a model more time to "think" makes it more secure. The data says the opposite.

Without extended thinking: 14.8% attack success rate. With extended thinking enabled: 21.7% — more compute gave the model more room to talk itself into executing the malicious payload.
— Anthropic ART (Agent Red Teaming) Benchmark, Opus 4.6 System Card

This is a critical architectural lesson: don't assume that more capable or more thorough reasoning translates to better security. A chain-of-thought that spends 10 steps analyzing a malicious instruction may ultimately comply with it.

3 GUIs Are a Security Nightmare

An agent's vulnerability depends heavily on its environment. The same model, the same safeguards, wildly different outcomes:

Environment | Attack Success Rate | Attempts | What This Means
Pure coding tasks | 0.0% | 100 adaptive | Well-bounded environment — attacker can't inject through external content
No extended thinking | 14.8% | 100 adaptive | Baseline risk for any text-processing agent
With extended thinking | 21.7% | 100 adaptive | More reasoning = more surface area for manipulation
GUI / Computer Use (with safeguards) | 57.1% | 200 adaptive | Attacker controls the screen — visual content is attack surface

4 You Can't Trust the Model to Police Itself

Model-level robustness is necessary but not sufficient. You need external classifiers designed to detect prompt injection attempts before the LLM even processes the data.

Using this layered approach, Anthropic dropped their false-positive rate for browser-use tools by 15×. The lesson: the model reading the malicious payload should not be the same model tasked with ignoring it — that's a single point of failure.

How to Actually Secure Production Agents

1
Demand Quantitative Security KPIs
Integrate an adaptive attacker framework (Microsoft PyRIT, Giskard, or Gray Swan) directly into CI/CD. Throw 50–100 dynamic injection variations at agent endpoints every time a system prompt updates. Set a hard Attack Success Rate (ASR) threshold — breach it and the deployment fails, exactly like a failing unit test.
2
Deploy Multi-Agent System Observability
Standard APM tools are blind to agentic workflows. Deploy observability that captures the complete thought → action → observation loop (LangSmith, Arize Phoenix, Datadog). If an agent ingests a compromised page and tries to access an unauthorized API, your stack must see that trajectory shift and kill execution instantly.
3
Decouple Your Guardrails
Use independent sanitization layers (NVIDIA NeMo Guardrails, Meta Llama Guard, Lakera Guard) as a firewall. Deploy a smaller, high-speed classifier that scans incoming data before the primary agent ever touches it. Put a secondary validator on the output side to check intended actions before tools fire.
The shift that matters: Prompt injection is now a measurable engineering metric. Track it like uptime. Gate deployments on it like test coverage.
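
What the hard ASR gate from step 1 can look like in CI: an illustrative sketch, not tied to PyRIT, Giskard, or any specific framework. run_attack is a hypothetical callable that fires one injection variant at your agent endpoint and reports whether the payload was executed.

Python: A hard Attack Success Rate gate in CI (illustrative)
ASR_THRESHOLD = 0.05   # deployment fails if more than 5% of adaptive attacks succeed

def assert_asr_within_gate(attack_variants, run_attack) -> None:
    """Fail the pipeline, like a failing unit test, when the ASR exceeds the gate."""
    successes = sum(1 for prompt in attack_variants if run_attack(prompt))
    asr = successes / len(attack_variants)
    assert asr <= ASR_THRESHOLD, (
        f"Attack success rate {asr:.1%} exceeds the {ASR_THRESHOLD:.0%} gate"
    )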

Sources

Anthropic — Claude Opus 4.6 System Card (PDF)
Microsoft PyRIT — Python Risk Identification Toolkit
Giskard — AI Quality & Security Testing
Gray Swan — AI Red Teaming
LangSmith — LLM Observability
Arize Phoenix — ML Observability
NVIDIA NeMo Guardrails
Meta Llama Guard
Lakera Guard — AI Agent Security
🎯 LEADERSHIP & STRATEGY

AI is an Execution Risk

Why Nandan Nilekani's warning — from the architect of Aadhaar and co-founder of Infosys — is the wake-up call leaders need. Three systemic risks most organizations are ignoring right now.

Nandan Nilekani — architect of Aadhaar (India's digital identity system powering 1.3 billion people) and co-founder of Infosys — spoke at Infosys' recent AI Investor Day. He made a statement that perfectly summarizes the current state of the industry:

"It is not an opportunity risk, it's an execution risk."
— Nandan Nilekani, Co-Founder Infosys & Architect of Aadhaar

He argued that while the opportunity AI presents is massive, the real challenge lies in the messy reality of execution — modernizing legacy systems, breaking down silos, and retraining talent. The software industry still works on deterministic workflows. The AI world is inherently non-deterministic.

💡 Core Insight

AI Is a Mirror

AI reflects your organization exactly as it is. If your engineering culture is high-performing and well-aligned, AI will amplify that velocity. But if your deployment pipelines are broken, your data is messy, and your teams are siloed — AI will simply help you ship the wrong things faster, and break production more often.

Here are three systemic risks that most leaders are ignoring.

1 The Throughput Trap

⚠️ The Reality: AI adoption is driving higher software delivery throughput — but statistically linked to lower stability (higher change failure rates). Teams develop faster, but testing, QA, and security haven't evolved to safely manage this speed.

The complexity of reviewing AI-generated code is becoming unmanageable. Researchers describe it as a "dangerous and unsustainable proposition." AI writes code that passes initial review but contains subtle logic errors that only surface in production weeks later.

The Fix: Brace for Rapid Recoveries
  • Enforce CI behaviors where AI-generated code is committed in extremely small, frequent increments
  • Destigmatize rolling back — in AI-native teams, a rollback is standard operating procedure, not a failure
  • Measure "Time to Restore Service" specifically for AI-generated changes. If an AI agent hallucinates a configuration error, your pipeline should revert it automatically or with a single click

2 Ownership & Burnout

⚠️ The Reality: Despite massive productivity gains, AI adoption has zero measurable impact on reducing burnout or friction. MIT research (late 2025 / early 2026): without proper management, AI adoption leads to work intensification, cognitive fatigue, and increased burnout.

When AI saves 20% of a team's time, most organizations immediately fill that void with 20% more tickets. Even worse, AI erodes Psychological Ownership — people stop feeling like "owners" and start acting like "operators." This creates a fragile culture that looks fast but lacks deep expertise.

0
Measured reduction in burnout or friction from AI adoption alone
Multi×
Performance gain for teams with strong user-centric focus who adopt AI
↓ Drop
Performance for teams without user-centric focus who adopt AI
The Strategic Fix

Stop asking "How much faster can we ship?" Start asking "Where are we reinvesting the saved time?"

  • The Reinvestment: Explicitly align teams to reinvest AI-saved time into Valuable Work (innovation, design, user research) — not just more tickets
  • The User-Centric Approach: Connect teams to customer outcomes, not output metrics. This restores the ownership AI erodes and is the single biggest predictor of AI adoption success

3 A Hidden Talent Crisis

⚠️ The Reality: Junior developers traditionally learned by doing "boring" work — fixing minor bugs, writing boilerplate, reviewing simple PRs. AI now does that work. Matt Beane's research in Google's 2025 DORA report: default AI usage can block skill development for novices, breaking the "three-generation model" of knowledge transfer.
The Fix: "Joint Optimization" Governance

Sustainable performance requires simultaneously managing for productivity and skill development.

  • Skill-First KPI: If velocity increases 40% but "Stylistic Diversity" (a proxy for original thought) drops, flag the AI rollout as a risk — not a success
  • The Learning Loop Mandate: For critical architectural components, mandate AI for drafting but prohibit it for final reasoning. Human-in-the-loop verification isn't just quality control — it's a mechanism to force knowledge retention
💜 The Bottom Line

The technology is ready, but your organization likely isn't. AI doesn't bring revenue or productivity gains on its own — it reflects them. It will not fix your broken culture or messy data; it will magnify them. Stop treating AI as a plug-and-play solution. Treat it as a forcing function to fix your organizational foundations first.

Sources

MoneyControl — Nandan Nilekani flags implementation gap
Harvard Business Review — AI doesn't reduce work, it intensifies it (MIT Research, 2026)
Google Cloud — 2025 DORA Report
🏗️ ENGINEERING LEADERSHIP

The 5 Real Capabilities Your Team Needs Before AI Actually Works

70–85% of AI initiatives fail to meet expectations. The gap between pilot and production has become the primary bottleneck. Here's what actually needs to be in place first — based on hands-on deployment experience and Google's research.
70–85%
of AI initiatives fail to meet expectations (Deloitte)
42%
of companies abandoned most AI initiatives in 2025 — a 17% YoY increase

AI is not a silver bullet. It amplifies your existing engineering problems. If your deployment pipelines are broken and your data is a mess, AI will just help you ship the wrong things faster. Leaders need to treat AI adoption as an opportunity to clean up their mess first — then begin AI implementations.

1 Quality Internal Platforms — The "Digital Factory"

An internal developer platform removes underlying complexity, allowing teams to focus on delivering user value rather than navigating infrastructure, security, and operational hurdles.

Google's research finding: The positive effect of AI adoption on organizational performance depends directly on the quality of the internal platform.
Real-World Example — A Major Telco

A major telco company's Internal Developer Platform (IDP) eliminates complexity so developers focus on building software, not managing infrastructure. It is vendor-agnostic, follows agile methodology, and automatically detects potential issues and bottlenecks.

Start with a "minimum viable platform." Identify the 'golden path' for the most common workflow and build just enough to make that journey better. Apply Google's "shift down" strategy — move security, compliance, and infra responsibilities down into the platform itself instead of pushing them onto individual developers.

2 Healthy Data Ecosystems

Building a high-quality, unified data ecosystem drives better organizational performance than simply adopting AI on its own. No LLM can magically fix problems caused by wrong or incomplete data.

⚠️ Even state-of-the-art models like Opus 4.6 — which leads benchmarks for real-world professional tasks — will hallucinate or underperform with siloed or outdated data. Downstream AI agents will simply hallucinate at scale.
Platform | Best For
Microsoft Purview + SharePoint + Fabric | Organizations already in the Office 365 ecosystem
Google Dataplex + BigQuery + Vertex AI | Custom customer-facing apps or massive petabyte-scale product data
Atlan | Modern data stacks (Snowflake, Slack) where non-technical teams act as data stewards

3 AI-Accessible Internal Data — Context Engineering & RAG

To unlock real value in multi-agent architectures, you must move to context engineering. Generic LLMs don't know your business. RAG turns them into specialized internal experts.

Real-World Implementation — Telco/BSS Domain

A RAG system implemented where context was stored in siloed DBs, Confluence pages, and docs. When a marketing professional asks "What are the current product offers?", the system retrieves exact, up-to-date documentation to generate a highly accurate, cited answer. A generic LLM becomes a specialized internal expert.

Critical Requirement: RAG implementation is a process of trial and error. A successful implementation must involve both technical and functional teams. Functional teams provide your "ground truth" and help ground responses — they are not optional.

4 Clear and Communicated AI Stance

A clear policy provides the psychological safety needed for effective experimentation. Without it, teams either avoid AI (fear of policy violations) or misuse it (unclear boundaries).

Real-World Example — A Major Telco

A major telco company launched a GenAI platform guided by four Responsible AI principles: secure, accessible, frugal, and ethical. A named platform with a clear charter signals organizational commitment and makes the policy tangible.

Establishing an AI stance requires executive sponsorship and must be crafted by a cross-functional group of engineering, legal, security, IT, and product leaders. Assign long-term owners to verify updates and handle feedback loops.

5 User-Centric Focus

Teams must continuously align their priorities in service of the end user. A user-centric focus also improves developer quality of life — lifting job satisfaction and productivity while reducing burnout.

Make User Metrics Visible
If dashboards only show velocity and deployment frequency, the user is forgotten. Display user experience metrics alongside engineering metrics in planning meetings. Consider Google's H.E.A.R.T. framework (Happiness, Engagement, Adoption, Retention, Task Success).
Consider Spec-Driven Development (SDD)
An emerging paradigm that orients LLMs toward user needs via structured specifications. GitHub's Spec-kit creates workspace setups for common coding assistants, oriented around the spec (user requirements) rather than the implementation.
💜 The Bottom Line

The current AI hype cycle will eventually stabilize, but the disruption to software engineering is here to stay. If you want to be among the small fraction of organizations that successfully scale AI into production, stop treating it as a plug-and-play solution. Treat it as a forcing function to fix your engineering foundations.

Sources

Deloitte — AI Initiatives Research
Google Cloud — Guide to Platform Engineering
Google Cloud — Crafting an Acceptable Use Policy for Gen AI
Google Cloud — DORA + H.E.A.R.T. Framework
Martin Fowler — Spec-Driven Development
GitHub Spec-Kit
🛠️ TOOLS & COMPARISON

Claude Code vs OpenCode

One open-source and model-agnostic, the other Anthropic-native and optimized for reasoning precision. A detailed comparison for teams — and enterprises — choosing their AI coding workflow.
OPENCODE — Anomaly Innovations

Open-source & Model-Agnostic

Written in Go. Treat the AI model as a swappable component. Claude, GPT-5, Gemini, or local Ollama — bring your own API keys. Created by the team behind SST and OpenTUI.

CLAUDE CODE — Anthropic

Reasoning-First, Anthropic-Native

Node.js-based CLI. Gold standard for reasoning and speed within the terminal. Deep MCP integration, Sub-agent hierarchies. Strictly tied to Claude models.

1 Core Philosophy & Vendor Lock-In

Dimension | OpenCode | Claude Code
Model Flexibility | Any model via BYOK — Claude, GPT-5, Gemini, Ollama (local) | Strictly Claude models only
Source | Open-source, fully auditable execution loop | Closed-source, Anthropic-managed security
Vendor Risk | Workflow survives regardless of which provider is "winning" | Tied to Anthropic's pricing, terms, and model availability
Cost Strategy | Route easy tasks to cheap models, hard tasks to strong models | Single-provider pricing — can be token-hungry

2 Architecture & Capabilities

OpenCode Architecture
Claude Code Architecture
Client/Server Architecture (Go)
Decouples the UI from execution. Run the OpenCode Server on a powerful AWS instance (e.g., 100GB RAM) and connect from a lightweight laptop. Sessions persist even if you close the terminal.
Hybrid Interface
Primarily a terminal tool, but opencode web spins up a localhost dashboard to visualize complex diffs and manage active background agents.
Multi-Provider Cost Strategy
Route "easy" logic tasks to cheap models (GPT-5 mini, Gemini 2.5) and "heavy" architecture tasks to Claude Opus — optimize cost per task automatically.
Agentic Search ("Terminal Velocity")
Node.js CLI optimized for codebase exploration using bash tools (grep, find). Explores codebase topology on the fly — no upfront indexing required.
Sub-Agent Hierarchies
A main agent spawns Sub-agents for parallel tasks — one maps dependencies while another writes tests. Drastically reduces time for complex refactors.
Native MCP Support
Plug directly into PostgreSQL, Sentry, or GitHub. Example: "Claude, check Sentry for the latest auth error and write a fix." Zero configuration once connected.
Agent Skills (Lazy-Loaded)
Unlike context files (always loaded), Skills in SKILL.md only activate when needed. Keeps context lean, enables powerful domain-specific automation on demand.

3 Configuration & Extension

Feature | OpenCode | Claude Code
Project Context | Uses AGENTS.md — an emerging open standard for AI context across tools | Uses CLAUDE.md for project-specific rules and guidelines
Customization | Custom commands, SDK integration, and TUI theming | Skills (lazy-loaded workflows in SKILL.md) and Hooks (automated triggers like pre-commit)
Tooling | MGrep (4× faster grep) and direct LSP (Language Server Protocol) support for semantic code understanding | MCP (Model Context Protocol) for native connections to Postgres, Sentry, GitHub, and more
UI | Rich Terminal UI (built with BubbleTea) + a local Web Dashboard | Minimalist CLI for high speed + a Cloud-based Web Environment

4 Enterprise & Security

OpenCode Enterprise

Zero-Trust Design

SSO integration and an internal AI gateway. Being open-source means you can audit the entire execution loop — critical for high-security environments that require on-premise models (Llama, Mistral).

Claude Code

Managed Security

Relies on Anthropic's managed security. Provides interactive permission prompts for every file write or command — some find this "chatty," others value it for safety and audit trails.

5 Pricing & Realistic Utility

Dimension | OpenCode | Claude Code
Cost Model | "Economical" — pay for raw API tokens + infrastructure if self-hosting | "Premium" — requires Claude Pro or Max subscription
Cost Control | Mix-and-match models to optimize cost per task type | Sub-agents and deep reasoning burn tokens rapidly; complex refactors can cost $5–$10 in an afternoon
Free Tier | Truly free — bring your own API keys | Requires Anthropic Console account
Setup Time | Requires configuring providers, models, and optionally a server | brew install --cask claude-code → coding in 30 seconds

Verdict — Which One to Choose?

Use OpenCode if...
  • You fear lock-in — your workflow must survive regardless of which AI provider is winning
  • You need remote architecture — run AI on a powerful server, connect from a lightweight device
  • Security is paramount — you need to audit every packet or run open-weight models on-premise
  • Cost optimization matters — route tasks to the right model at the right price per task
Use Claude Code if...
  • You want "magic" — the smoothest, most intelligent experience acting like a senior engineer with full SaaS access
  • You already pay Claude Enterprise — seamless integration with existing billing and legal agreements
  • Speed of setup matters — start coding in 30 seconds, zero provider configuration
  • Deep MCP integration — native connections to your Postgres, Sentry, GitHub, Slack

Sources

OpenCode — Official Documentation
Claude Code — Anthropic Documentation
🔒 SECURITY & RISK

Why Your Agentic AI System Is Only as Secure as Its Weakest Token

A token that traverses your entire agent chain without restriction isn't an access credential. It's an open invitation — one that grants any attacker who intercepts it the keys to your entire system.

The rate at which organisations are wiring AI agents into their core systems — CRM platforms, internal APIs, MCP-connected tools — is outpacing their security thinking. The architecture has changed. The security model hasn't. And the gap between those two realities is exactly where attackers will find the door.

0 The Standard Agentic Architecture

Effective security begins with knowing what you're defending. Map the topology before you write a single policy.

👤 Human User  ·  Chat Interface
⚙️ Orchestrator  ·  Breaks task down & delegates
Agent A
Agent B
Agent C
🔌 MCP Servers  ·  Bridge to tools & company data
CRM  ·  Databases  ·  APIs  ·  External Systems

LLMs sit at every layer of this stack — embedded in the chat interface, inside the orchestrator, and inside each specialised agent. That is where the intelligence lives. It is also where the risk enters. An Identity Provider verifies who the human is and issues an access token that is supposed to govern what that user can reach downstream. Most security thinking stops here — right at the moment the token enters the chain. That is the mistake.

⚠️ A static token that passes unmodified through every node in your agent chain gives any single compromised hop the same access as the original user. Every junction becomes a liability of equal weight.

1 Credential Replay & LLM Extraction

The Risk

A valid token obtained by any means — MitM interception on an unencrypted transport layer, or a crafted prompt that coaxes an LLM into surfacing credentials from its context window — can be replayed indefinitely. The token does not expire at the point of theft.

Mitigation: Enforce TLS/mTLS on all transport and encrypt credentials at rest. More critically — keep identity tokens entirely outside the LLM's context. The model needs task instructions, not authorisation tokens. These are separate concerns and must be kept architecturally separate. An LLM that receives a token can be prompted to reveal it.

2 Rogue Agents (Spoofing)

The Risk

Multi-agent systems assume each participant is who it claims to be. Without explicit verification, a malicious agent mimicking the signature of a trusted peer can slip into the chain undetected. MCP servers that accept requests from unverified callers are not acting as boundaries — they are acting as entry points.

Mitigation: Apply the same identity requirements to agents that you apply to employees. Authentication against an Identity Provider is a prerequisite to action, not a convenience. Every link — orchestrator to sub-agent, sub-agent to MCP server — requires explicit verification. Entry-only checks are architecturally insufficient.

3 Impersonation (False Representation)

The Risk

Proving identity and proving delegation authority are distinct requirements. An agent that has passed authentication can still fabricate claims about whose instruction it is executing — unless the delegation itself is cryptographically anchored back to the Identity Provider at the time of issuance.

Mitigation: Delegation must be enforced by the Identity Provider — not self-reported by the agent. The IdP issues a composite token that mathematically binds the subject (the originating human) and the actor (the executing agent) into a single verifiable claim. Agents that assert delegation independently, without IdP involvement, represent an unvalidated trust assumption.

4 Vulnerable Token Propagation

The Risk

A token that travels unchanged across the full length of a multi-hop agent chain carries the same permissions at every hop regardless of the task being performed. Whoever intercepts it at any point inherits the same authority as the original issuing event — with no restriction to the specific action that was authorised.

Mitigation: Enforce Token Exchange at every hop. Each node surrenders the token it received and requests a freshly issued one — scoped to the specific action ahead. The IdP validates the request, issues a narrowed credential, and the original broad token is discarded. Nothing static propagates. Nothing accumulates.
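
Token Exchange is standardized as OAuth 2.0 Token Exchange (RFC 8693). A sketch of what one hop's exchange can look like; the IdP endpoint, client credentials, scope, and audience are placeholders for your provider:

Python: One hop of OAuth 2.0 Token Exchange (RFC 8693), illustrative
import requests

def exchange_token(incoming_token: str) -> str:
    """Swap the broad token this agent received for a narrow, task-scoped one."""
    resp = requests.post(
        "https://idp.example.com/oauth2/token",          # placeholder IdP token endpoint
        data={
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "subject_token": incoming_token,             # the token being surrendered
            "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
            "scope": "crm:read",                         # only what this hop needs
            "audience": "https://crm-mcp.internal",      # only who may accept it
        },
        auth=("sales-agent-client-id", "sales-agent-secret"),  # this agent's own identity
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]   # the narrowed credential; the broad one is discarded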

5 Over-Permissioning

The Risk

The access rights of an enterprise user often span multiple systems — finance, HR, internal communications. A single inherited token exposes that entire surface area to every agent in the chain, regardless of which agent actually needs which resource. One rogue node sees everything the originating user can see.

Mitigation: Embed precise Scopes in every Token Exchange. The target tool, the permitted action, and the authorised audience for that single task must be encoded into the newly issued token. A token for a marketing data query is valid for that query only. When the task completes, the token has no further utility — by design.

6 The Last-Mile Vulnerability

There is one more attack surface that rarely appears in architecture reviews: the connection between the MCP server and the business tool it proxies. MCP servers typically hold long-lived credentials for each tool they connect to — stored persistently, rarely rotated. A successful attack on the MCP server does not compromise only the MCP server. It compromises every tool whose credentials it holds.

Mitigation: Remove persistent credentials from the MCP server entirely. Replace them with vault-based retrieval — when a tool connection is required, the MCP server requests a short-lived credential from a secure vault, uses it for the duration of that specific operation, and discards it. The vault is the only system that holds credentials. The MCP server is never a credential store.

7 How to Think About This End-to-End

1
Token never touches the LLM. The LLM gets instructions, not authorization. Transport is encrypted with mTLS.
2
Every agent authenticates. No unverified agent can join the chain.
3
Every hop uses Token Exchange. Broad initial tokens are swapped for narrow, task-specific ones at each step. Delegation is cryptographically proven.
4
Every token is scoped. Least privilege is enforced mathematically, not just by policy.
5
The MCP server holds no secrets. The vault issues temporary credentials just-in-time, and they expire automatically.
Agentic AI systems don't execute deterministic code paths — they make decisions. Securing a non-deterministic system means you cannot enumerate every threat vector in advance. What you can do is architect it so that no single point of failure, wherever it occurs, produces a catastrophic outcome.

8 What This Means for Enterprise AI Deployments

The organisations that deploy agentic AI at scale without significant incidents will be those that treated security as a design-time constraint — not a pre-launch audit. The attack patterns described above are not theoretical edge cases. They are the predictable outcome of applying a perimeter-security mindset to a distributed, non-deterministic, multi-hop architecture.

⚠️ The AI security incidents that will define the next two years will not be caused by model vulnerabilities. They will be caused by an access token that should have expired travelling through five agents unchanged — and a credential store in an MCP server that was never designed to be one.
🏗️ ENGINEERING LEADERSHIP

How to Deploy Generative AI Safely

The System Thinking Framework Every Enterprise Needs. Building a generative AI feature is the easy part. Building the system around it that doesn't fail, hallucinate, discriminate, or get exploited — that is the real engineering challenge.

The instinct to ship generative AI features using the same process as traditional software is understandable. It is also wrong. The moment your system is non-deterministic, draws on external data at runtime, and influences decisions that affect real people, the conventional develop-test-ship cycle stops being a reliable safety mechanism.

1 The Foundational Shift: Design Systems, Not Just Software

The most common mistake in AI deployment is treating the model as the system. It isn't. The model is one component. The system is the model, the people who act on its outputs, the processes that route those outputs into decisions, and the data that flows through all of it. Failures almost never originate inside the model — they originate at the interfaces between these layers.

Seams
Where AI failures actually happen — between model, humans, data, and process — not inside the model itself
Living
A Safety Plan must be a living document — not a Confluence page written once and never revisited

This has three immediate implications for how you structure your AI programme:

Map every failure mode — component by component — before you build
This is not a task for a shared document that no one maintains. Every failure mode — software bugs, data quality gaps, adversarial misuse, operator bias, public exploitation — needs a named owner, a documented mitigation, and a review cycle. An undocumented failure mode is an unowned risk. An unowned risk will eventually materialise.
The threat surface doesn't close after launch
Each capability extension, each new data integration, each new user segment you enable creates a fresh attack surface. Failure analysis is ongoing work — not project initiation work. Build it into your sprint cadence, not your kickoff deck.
Constrain scope the way you would for a capable but unseasoned team member
Assign the model a specific, bounded function. Build escalation paths and review checkpoints around it the same way you would for a junior analyst with elevated data access. The broader and vaguer the mandate, the more unpredictable the output — this applies equally to humans and to language models.

2 The Five Classes of AI Errors — and Why All Five Matter

Effective defence starts with precise categorisation of the problem. Treating all AI failures as variants of “the model was wrong” prevents you from building targeted mitigations. There are five structurally different failure modes in generative AI systems — each demanding a distinct response.

# | Error Class | What It Looks Like | Severity
1 | Garbage In, Garbage Out | Bad, corrupted, or stale input data produces faulty outputs regardless of model capability | High
2 | Misinterpreted Data | A proxy metric looks correct but isn't — e.g. "accused of a crime" used instead of "convicted of a crime". Introduces systemic bias that's nearly invisible until it surfaces in outcomes | Critical
3 | Hallucinations (False Positives) | The model confidently generates information never present in the source. In high-stakes settings — credit, medical, legal — this is a liability | High
4 | Omissions (False Negatives) | Critical context is silently dropped. The human receiving the summary has no way to know something important is missing — which can reverse the safety of a decision | High
5 | Unexpected Preferences | The model develops unintended patterns in how it weights or summarises data — reflecting demographic skews or embedded biases. Only visible statistically across a large sample. Can persist undetected for months | Insidious
💡 The dominant error type in your deployment context should be the primary input to your test case design, your interface decisions, and the signals you build into your monitoring stack.

3 Testing AI Systems is Fundamentally Different From Testing Software

Writing an AI-powered feature is frequently quicker than writing the equivalent deterministic logic. The difficulty is not in the build — it is in the verification.

❌ Traditional Software
Given a specific input, the output is always the same. Test suites map inputs to expected outputs. Pass or fail is binary and deterministic.
✓ AI Systems
The same input can produce different outputs across runs. Testing requires adversarial thinking and iterative red-teaming. Edge cases must be adjudicated by humans — not just asserted in a test file.
Critical Best Practice

Separate test case authorship from answer authorship. The team that built the system will unconsciously write tests the system can pass. An independent group evaluating the same scenarios — without knowledge of how the system works — produces a far more adversarial and useful test set. Where the two groups disagree on the correct answer for an edge case, the system cannot resolve that ambiguity. Resolve it in your evaluation criteria first.

4 A Practical Example: Hardening an AI Copilot for Bank Loan Officers

Take a concrete case: an AI tool that pulls applicant information, produces an analysis, and surfaces a structured summary for a lending decision made by a human. It is high-stakes — regulated, consequential, and dependent on external data sources outside your direct control.

Data Gathering → Revision Flow ⟳ → AI Analysis → Presentation → Human Decision
The Revision Flow is the component teams most often skip in early design and most regret later. When upstream data is updated, the revision flow ensures the change is handled explicitly — not silently absorbed into stale outputs the decision-maker has no way to detect.
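A minimal sketch of the staleness check behind that flow, assuming each upstream source exposes a last-updated timestamp (the source names and fields are illustrative):

```python
# Hypothetical sketch: invalidate and re-run the analysis when any upstream
# source changed after the analysis was generated, instead of silently
# presenting a stale summary to the decision-maker.
from datetime import datetime, timezone

def needs_revision(analysis_generated_at: datetime,
                   source_timestamps: dict[str, datetime]) -> list[str]:
    """Return the sources that were updated after the analysis was produced."""
    return [name for name, updated_at in source_timestamps.items()
            if updated_at > analysis_generated_at]

generated = datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)
sources = {
    "credit_report":  datetime(2024, 4, 30, 18, 0, tzinfo=timezone.utc),
    "income_records": datetime(2024, 5, 2, 11, 0, tzinfo=timezone.utc),  # updated later
}
stale = needs_revision(generated, sources)
if stale:
    print(f"Re-run analysis; updated sources: {stale}")
```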

Mitigating data risks — ask four questions of every data source (a validation sketch follows the list):

Is it complete?
Resolved at pipeline level — schema validation, completeness checks, mandatory field enforcement
Is it correct?
Resolved through verification against source-of-truth systems and structured revision handling
Is it current?
Resolved through timestamp validation and explicit re-fetch logic when the source data changes
Is it adversarial?
All external data is untrusted by default. Red-team every document ingestion path for XPIA — cross-prompt injection attacks where malicious instructions are embedded in content the model will read and act on
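The sketch below turns the four questions into pipeline gates. The required fields, the 30-day staleness window, and the injection patterns are all illustrative; the regex scan is only a first line of defence against XPIA, not a complete one, and correctness still requires cross-checking against the source-of-truth system:

```python
# Hypothetical sketch: completeness, currency, and adversarial checks as
# pipeline gates. Correctness (verification against source-of-truth systems)
# is not shown here.
import re
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = {"applicant_id", "income", "credit_score"}   # completeness
MAX_AGE = timedelta(days=30)                                   # currency
INJECTION_PATTERNS = [                                         # adversarial (XPIA)
    r"ignore (all|previous) instructions",
    r"system prompt",
]

def validate_source(record: dict, fetched_at: datetime, raw_text: str) -> list[str]:
    """Return a list of issues; fetched_at must be timezone-aware."""
    issues = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"incomplete: missing {sorted(missing)}")
    if datetime.now(timezone.utc) - fetched_at > MAX_AGE:
        issues.append("stale: re-fetch from source of truth")
    if any(re.search(p, raw_text, re.IGNORECASE) for p in INJECTION_PATTERNS):
        issues.append("possible prompt injection in ingested document")
    return issues
```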

Mitigating hallucinations and omissions: Prompt the model to cite its sources and walk through its reasoning before returning results — this reduces, but does not eliminate, fabrication. The more robust control lives in the presentation layer: make source material visible alongside the AI's summary. When the model generates something that wasn't in the original document, a user who can see that document will catch the discrepancy. A user who cannot has no mechanism to detect it.
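A minimal sketch of both halves: a prompt that asks for quoted citations, and a grounding check in the presentation layer that flags any quote not actually present in the source document. The tag format and prompt wording are illustrative:

```python
# Hypothetical sketch: require quoted citations, then verify each quoted span
# appears in the source document before rendering the summary alongside it.
import re

CITE_PROMPT = (
    "Summarise the application below. For every factual claim, quote the exact "
    "sentence from the document it came from, inside <quote> tags.\n\n{document}"
)

def unsupported_quotes(model_output: str, source_document: str) -> list[str]:
    """Quotes the model claims to have cited but that are not in the source."""
    quotes = re.findall(r"<quote>(.*?)</quote>", model_output, re.DOTALL)
    return [q.strip() for q in quotes if q.strip() not in source_document]
```

Any quote returned by the check is a fabrication candidate; the interface can highlight it instead of presenting it with the same confidence as grounded content.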

⚠️ Interface design is a mitigation layer. Treat it as one.

5 The Human-in-the-Loop Assumption

The assumption that a human reviewer makes the system safe is incorrect. Humans carry their own cognitive biases into the loop. Presented with an AI recommendation, they will tend to anchor on it rather than reason from first principles. And unless the system explicitly captures the human's rationale and final decision, the audit trail reflects only the model's output — not the judgment call that followed it.

Give reviewers calibrated reference cases before they see the AI's recommendation. A reviewer with a reference standard is less likely to defer uncritically to the model's output.
Run a sample of decisions through two independent reviewers simultaneously and escalate where they disagree — this surfaces both model errors and gaps in your evaluation criteria.
The audit record must capture the human's reasoning and final determination alongside the AI's recommendation — as a single linked entry. A log that captures only the model's output does not document the decision. It documents the input to one.
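A minimal sketch of that linked entry as a single record; field names are illustrative, and the point is that the model's recommendation and the human's rationale are written together, not to separate logs:

```python
# Hypothetical sketch: one audit entry linking the AI recommendation,
# its cited sources, and the human's reasoning and final determination.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class DecisionRecord:
    case_id: str
    ai_recommendation: str
    ai_cited_sources: list[str]
    reviewer_id: str
    reviewer_rationale: str          # the judgment call, captured explicitly
    final_decision: str              # may differ from the AI recommendation
    decided_at: str

record = DecisionRecord(
    case_id="LOAN-1042",
    ai_recommendation="approve",
    ai_cited_sources=["credit_report_2024-04", "income_records_2024-05"],
    reviewer_id="analyst-17",
    reviewer_rationale="Income records updated after analysis; re-verified manually.",
    final_decision="approve",
    decided_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(record), indent=2))
```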

6 What This Means for Your AI Programme

The organisations that deploy AI without serious incident are not the ones with the most thorough pre-launch review process. They are the ones that made safety a design constraint at the architecture stage.

Red teaming, adversarial test construction, interface design for transparency, and observability instrumentation are engineering deliverables — not process milestones. Teams that categorise them as governance overhead discover, consistently, that governance catches failures after they have already happened.

⚠️ The AI failures that carry the largest cost are rarely dramatic. They are the ones that ran for six months before anyone noticed — a bias pattern embedded in outputs, a hallucination repeated at scale across decision records, a data quality gap that no pipeline alert was built to surface.
✉ CONTACT

Get in Touch

Questions, feedback, or just want to connect?
Whether you spotted something worth fixing, have a topic you'd like covered, or simply found this useful — feel free to reach out.
sac.anand2@gmail.com