White Paper

Governed Memory: The Missing Infrastructure for Enterprise AI Agents

Schema-Enforced Memory, Semantic Routing, and Consistency for Enterprise Multi-Agent Systems

Hamed Taheri|March 2026
16 controlled experiments · production API · deployed across multiple organizations
Does it capture everything?
99.6%
Fact recall
Across 5 content types. One pass yields both free-form atomic facts and typed schema-enforced properties simultaneously.
Does it route correctly?
92%
Routing precision
65% of the guideline library discarded before the reasoning gate. Sub-second on the fast path.
Does it save cost?
50.3%
Token reduction
Governance context savings across a 5-step workflow. Steps 2 and 5 achieve 86–90% per-step savings via session-aware delivery.
Is it safe?
0%
True data leakage
Across 3,800 scoped results under adversarial conditions. 100% adversarial compliance across 50 deliberate breach attempts.
Is it production-grade?
74.8%
LoCoMo accuracy
Independent benchmark. Outperforms Mem0 (65%), Zep (54%), and OpenAI Memory (53%). Deployed across multiple organizations.
Abstract

Governed Memory: A Shared Layer for Accuracy and Compliance Across Agentic Workflows

Enterprise AI deploys dozens of autonomous agent nodes across workflows, each acting on the same entities with no shared memory and no common governance. We identify five structural challenges arising from this memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops.

We present Governed Memory, a shared memory and governance layer addressing this gap through four mechanisms: a dual memory model combining open-set atomic facts with schema-enforced typed properties; tiered governance routing with progressive context delivery; reflection-bounded retrieval with entity-scoped isolation; and a closed-loop schema lifecycle with AI-assisted authoring and automated per-property refinement.

We validate each mechanism through controlled experiments (N=250, five content types): 99.6% fact recall with complementary dual-modality coverage; 92% governance routing precision; 50% token reduction from progressive delivery; zero cross-entity leakage across 500 adversarial queries; 100% adversarial governance compliance; and output quality saturation at approximately seven governed memories per entity. On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy, confirming that governance and schema enforcement impose no retrieval quality penalty. The system is in production at Personize.ai.

1. Introduction

1.1 The Memory Governance Gap

Enterprise AI adoption does not produce a single agent. It produces dozens of autonomous agent nodes distributed across workflows, tools, and teams: enrichment pipelines, outbound sequences, support automation, scoring models, research agents, and operational automations. Each node reads or writes information about the same entities—the same customers, companies, and deals—yet these nodes share neither a common memory of the entities they act upon nor a common governance layer enforcing organizational policies, compliance rules, and quality standards.

In this setting, retrieval quality is necessary but insufficient. The organization faces five structural challenges that no single-agent memory system addresses:

1

Memory silos across agent workflows

The enrichment agent discovers a CTO is evaluating three vendors. The outbound sequence agent, executing hours later, sends a generic cold email. The support agent resolves a critical pain point. Months later, the renewal agent re-surfaces it as a selling feature. Each workflow node acts on the same entities but shares no context with the others. Organizational intelligence accumulates nowhere.

2

Governance fragmentation across teams and tools

Sales builds AI outreach with one system prompt embedding brand voice. Support runs a bot with compliance rules copied from a Notion doc last quarter. Marketing uses a separate workflow with its own tone guidelines. When legal updates the data handling policy, no mechanism propagates it to the 14 agent configurations across three teams. There is no versioning, no single source of truth, and no way to ensure all agents operate under the same organizational rules.

3

Unstructured memory as a downstream dead end

Free-text memories can be retrieved by similarity and pasted into a prompt. Beyond that, they are terminal. They cannot be filtered by buying stage, ranked by deal value, routed to conditional workflows, synchronized to a CRM, or aggregated across thousands of entities. Without schema-enforced typed properties, memory is useful for prompt augmentation but unusable by any downstream system that requires structured, queryable data.

4

Context redundancy in autonomous multi-step execution

Modern agents operate in autonomous loops—planning, acting, observing, re-planning—without human intervention between steps. Each step may invoke governance routing independently. Without session awareness, the same compliance policy is re-injected into every step, consuming context window capacity that should be reserved for task-specific reasoning and degrading model attention on fresh instructions.

5

Silent quality degradation without operational feedback

Schemas age. Models get updated. Content types shift. New agent workflows produce data the schema was not designed for. No per-property accuracy monitoring exists. No extraction confidence is tracked over time. No schema drift is detected. The organization discovers the problem when a CRM field has been wrong for three months or a downstream pipeline quietly stops producing useful output.

We term this the memory governance gap: the absence of an infrastructure layer governing what agents store, how stored information is typed and queried, which organizational policies reach which agent, how context is delivered across autonomous execution steps, and whether the system is performing reliably.

1.3 Contributions

1.3 Contributions

This paper makes four contributions, each addressing one or more of the five challenges above:

1

A dual memory taxonomy with formal quality gates (addresses memory silos and the downstream dead-end problem)

We distinguish open-set memory (coreference-resolved atomic facts stored as vector embeddings) from schema-enforced memory (typed property values governed by organizational schemas with confidence scores), processed in a single extraction pass with automated quality gates. The shared store enables any agent across the organization to read and write entity memory through a common interface.

2

Tiered governance routing with progressive context delivery (addresses governance fragmentation and context redundancy)

A mechanism for selecting which organizational context should be injected into an agent's context window, supporting a fast governance-aware hybrid path (~850ms average) and a full two-stage LLM selection path (~2–5s), with session-aware delta delivery that tracks previously injected context across autonomous multi-step executions.

3

Reflection-bounded retrieval with entity-scoped isolation (addresses memory silos)

An iterative protocol checking evidence completeness and generating targeted follow-up queries within bounded rounds, combined with CRM-key-based entity scoping that enforces hard isolation across tenants and entities.

4

Schema lifecycle management with closed-loop self-evaluation (addresses the downstream dead-end and silent quality degradation)

A lifecycle spanning AI-assisted schema authoring, interactive enhancement, criteria-based rubric scoring with execution logging, and automated per-property schema refinement.

The Problem

The Memory Governance Gap

When multiple agents share a memory substrate across thousands of entity records, retrieval quality is necessary but insufficient. The real challenges are structural.

Schema Compliance

One agent stores "deal value" as free text; another expects a typed number. Downstream systems break silently.

Context Appropriateness

Your support agent gets the full brand playbook when it only needs the escalation policy. Wasted tokens, conflicting instructions.

Delivery Redundancy

The same compliance policy injected into every step of a multi-turn execution, consuming context window budgets.

Quality Opacity

No mechanism to detect that extraction quality is degrading or that governance routing is missing critical policies. Failures are silent.

RAG addresses retrieval relevance. It leaves four gaps open: no governance over what is stored, no organizational context routing, no session-aware delivery, and no quality feedback loop.

Four Contributions

What the Paper Introduces

Four integrated mechanisms that close the memory governance gap — each validated through controlled experiments and production deployment.

01

Dual Memory Model

Capture everything, lose nothing

A single extraction pass produces both open-set atomic facts (vector-embedded) and schema-enforced typed properties — simultaneously. Neither modality alone is sufficient; the combination ensures 38% of facts captured exclusively through open-set extraction aren't lost to rigid schemas.

99.6%
Fact recall across 5 content types
82.8%
Combined coverage (dual modality)
02

Governance Routing

The right context to the right agent

A tiered router selects which organizational context — policies, guidelines, templates — reaches each agent. A fast path (~200–400ms, zero LLM tokens) for real-time agents and a full two-stage path (~2–5s, chain-of-thought analysis) for batch workflows. Progressive delivery tracks what's been injected and sends only deltas.

92%
Routing precision across 20 task types
50.3%
Token savings via progressive delivery
03

Reflection-Bounded Retrieval

Completeness checking with bounded cost

An iterative protocol checks evidence completeness and generates targeted follow-up queries within bounded rounds. But we report an honest finding: when data is absent from the store, no amount of retrieval sophistication helps. Data completeness outweighs retrieval sophistication.

+25.7pp
Completeness gain on hard multi-hop queries
Principle
Invest in data completeness first
04

Self-Improving Schema Lifecycle

Schemas that get better automatically

AI-assisted authoring bootstraps schemas from natural language. Automated evaluation scores every interaction against domain-specific rubrics. A three-phase refinement pipeline diagnoses underperforming properties and generates targeted improvements — all without human intervention.

+47pp
Discovery rate lift from well-authored definitions
Closed-loop
Author → Evaluate → Refine → Repeat
Benchmark Results

Measured Against the Industry's Most Rigorous Benchmark

LoCoMo tests long-term conversational memory across 272 sessions and 1,542 questions — spanning single-hop recall, multi-hop reasoning, temporal understanding, and open-ended inference.

LoCoMo Overall Accuracy

OpenAI Memory
53%
Zep
54%
Mem0
65%
Governed Memory
74.8%
Human Baseline
87.9%
83.6%
Open-Ended Inference — Exceeds Human Performance

On the largest category in the benchmark (841 questions), the system outperforms the human baseline of 75.4% by +8.2 percentage points — reasoning about accumulated knowledge rather than merely retrieving facts.

83.3%
Conflict Detection — When Facts Contradict Over Time

Across 30 conflict pairs where the same entity changed its database, budget, or cloud provider, the system surfaced the fresh claim in 83.3% of cases — with recency decay scoring stale entries 10–100× lower than fresh ones.

Architecture

Four Layers, One Feedback Loop

Content enters at Layer 1, flows up through governance and retrieval, and Layer 4 feeds refined schemas back — closing the quality loop. Each layer can be independently configured.

Layer 1Dual Memory StoreThe Write Path
Open-Set Memory
Atomic, coreference-resolved facts with temporal anchoring
Schema-Enforced Memory
Typed property values with confidence scores
Quality Gates
Self-containment, coreference, temporal checks
Content Redaction
Two-phase PII & sensitive data filtering
Layer 2Governance RoutingContext Selection
Embedding Pre-Filter
Cosine similarity candidate reduction
LLM Structured Selection
Priority classification: critical vs supplementary
Progressive Delivery
Session-aware delta injection — send only what's new
Session State
24h TTL, tracks what was already injected
Layer 3Governed RetrievalThe Read Path
Entity-Scoped Vector Search
CRM key filtering + metadata filters
Reflection Loop
Completeness check + follow-up queries (bounded rounds)
Merge & Deduplicate
Cross-round result merging by ID
Entity Context
Contextual entity enrichment service
Layer 4Schema Lifecycle & QualityThe Feedback Loop
AI-Assisted Authoring
Generate schemas from natural language
Self-Evaluation
Domain-specific rubric scoring + execution logs
Per-Property Diagnosis
Identify underperforming schema properties
Automated Refinement
Generate targeted improvements — no human needed
Write path flows up from Layer 1
Layer 4 refines Layer 1 schemas — closing the loop
The Math, Demystified

Three Metrics Worth Understanding

Not abstract benchmarks. Each one answers a question a production team actually asks — with a real-world example of what it looks like when it fails.

Defect Rate

(Pronoun Errors + Temporal Conflicts) / Total Generated Facts

How often does the system hallucinate or contradict itself because of noisy retrieval?

8.4%
Standard RAG
6.3%
Governed Memory
25% relative reduction
Real-world example

A meeting moved from Tuesday to Friday. If the system still reports Tuesday — because a stale memory scored higher in vector similarity — that's a Temporal Defect. Standard RAG carries an 8.4% baseline rate of this. Quality gates bring it to 6.3% — a 25% relative reduction in hallucinations caused by noisy context.

Signal-to-Noise Ratio

Relevant Guidelines Injected / Irrelevant Similarity Matches Discarded

For every useful rule injected into an agent's context, how much noise comes with it?

1.1 : 1
Standard RAG
4.2 : 1
Governed Memory
4× denser signal
Real-world example

Standard RAG scores 1.1:1 — roughly equal parts signal and noise. The LLM must reason through irrelevant, outdated, or contradictory context to find the useful part. Governed Memory's reasoning gate achieves 4.2:1. The downstream LLM is never confused by stale context because the gate correctly discarded it with 94.5% precision.

Compliance Rate

1 − (Policy Violations / Total Adversarial Scenarios)

Can the system be tricked into leaking sensitive data under pressure?

50
Violation attempts
0
Leaks
100% compliance
Real-world example

50 deliberate attempts to get the system to reveal a CEO's personal cell phone number across easy, medium, and hard difficulty tiers. Zero leaks — not because a guardrail triggered every time, but because the reasoning gate maintained a negative constraint baseline that scrubbed the response regardless of how the query was framed.

What It Does

18 Capabilities. Built for Production Enterprise.

Most enterprise memory deployments — RAG pipelines, single-agent stores, retrieval frameworks — address 2 to 4 of these. Governed Memory addresses all 18, in production, as an integrated system.

Memory & Extraction
6
Atomic fact extraction
Pulls self-contained facts from calls, emails, transcripts, chats, and documents — 99.6% recall
Coreference resolution
Resolves pronouns and entity references before storage — no ambiguous 'he said' in memory
Temporal anchoring
Converts relative expressions to absolute timestamps — 'last week' becomes a traceable date
Extraction quality gates
Scores every batch for self-containment, coreference clarity, and temporal precision before writing
Schema-enforced memory
Typed property values with confidence scores — deal values stay numbers, not unstructured text
Dual open-set + structured store
Free-form insights and typed properties from one pass — 38% of facts would be lost in a schema-only system
Governance & Routing
4
Organizational governance routing
Selects which policies, guidelines, and templates reach each agent based on task — 92% precision
Progressive context delivery
Session-aware delta injection — 50.3% fewer governance tokens across a 5-step workflow
Multi-agent, multi-tenant
Multiple agents sharing one governed memory layer across thousands of entity records simultaneously
AI-assisted schema authoring
Generate complete property schemas from natural language in seconds — no manual field definition
Retrieval & Quality
4
Reflection-bounded retrieval
Iterative completeness checks with bounded rounds — +25.7 percentage points on hard multi-hop queries
Domain-specific self-evaluation
Rubric-based scoring with execution trace capture — identifies whether low scores stem from recall or generation
Automated schema refinement
Three-phase pipeline diagnoses underperforming properties and rewrites them — no human intervention required
Background memory consolidation
Periodic merging of near-duplicates and pruning of stale entries — store stays lean for fast reads
Security & Infrastructure
4
Entity-scoped isolation
Hard pre-filter by CRM key — semantic similarity cannot override entity ownership boundaries
Content redaction (PII/secrets)
Two-phase pipeline strips API keys, SSNs, and contact data before and after LLM extraction
Extraction provenance tracking
SHA-256 content hash, model ID, chunk position — every memory is fully traceable to its source
Standalone entity context API
Token-budgeted entity snapshot for any downstream consumer — no full integration required
Production-Proven

Every Result Comes From Production APIs, Not Lab Experiments

Deployed across multiple organizations, spanning sales, support, and research workflows. All experiments executed against the production API using synthetic identities.

0%
True cross-entity leakage

500 queries × 5 types = 3,800 scoped results under adversarial conditions. The 2.74% flagged were all false positives — contacts sharing name tokens like 'Aisha Chen' and 'Aisha Singh'. Hard email-key pre-filtering enforces ownership before vector search runs.

200–400ms
Fast routing latency

Governance-aware hybrid routing with zero LLM tokens. 65% of the guideline library discarded before the reasoning gate — surgical precision at production speed.

Two-Phase
PII redaction pipeline

Scrubs secrets and PII before AND after LLM extraction. 4 sensitivity tiers across API keys, financial PII, identity PII, and contact data. 3 anonymization strategies: redact, mask, or hash.

83.1%
Deduplication rate — 5 sources

When we ingested a contact's email thread after their discovery call and follow-up call, the email thread added zero new memories. Every fact was already known. 162 of 195 extraction attempts were duplicates.

Defining a New Discipline

This paper introduces terminology and abstractions intended to serve as a reference architecture for enterprise memory governance — a discipline that didn't have a name until now.

Governed MemoryGovernance RoutingProgressive Context DeliveryMemory Quality GatesSchema Lifecycle ManagementReflection-Bounded Retrieval

Read the Full Paper

The overview above covers the what. The paper covers the why, the how, and the honest limitations — including the results that surprised us.

  • Why reflection only helps when the data is already there — and what to invest in instead
  • Why 38% of enterprise knowledge is permanently lost in a schema-only system
  • How naming a policy 'Sales: Discovery Call' vs 'Policy_1' doubles its discovery rate
  • 7 formal algorithms with pseudocode — 16 controlled experiments with ground truth
  • Deliberate negative results reported transparently

Hamed Taheri · Personize AI · February 2026