Before You Begin

Executive Summary

Small Domain — Large Consequences

Progressive summarisation destroys transactional data. Fix: persistent case facts block, never summarised, included in every prompt.

"Lost in the middle" effect: place key summaries at the beginning of long inputs.

Three valid escalation triggers: explicit human request, policy gaps, inability to progress. Two unreliable triggers the exam uses as distractors: sentiment analysis, self-reported confidence scores.

Error anti-patterns: silent suppression and workflow termination on single failures.

TASK STATEMENT 5.1

Context Preservation

The Progressive Summarisation Trap

Condensing conversation history compresses numerical values, dates, and customer expectations into vague summaries.

"Customer wants a refund of $247.83 for order #8891 placed on March 3rd" becomes "customer wants a refund for a recent order." The specific data is lost forever.

Fix: Extract transactional facts into a persistent case facts block. Include it in every prompt unchanged. Never allow the summarisation pipeline to touch it.
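The fix above can be sketched as a prompt builder that keeps the facts block outside the summarisation pipeline. All names here (`CaseFacts`, `build_prompt`) are illustrative, not from any specific framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the facts block is immutable once created
class CaseFacts:
    order_id: str
    amount: str
    order_date: str
    request: str

    def render(self) -> str:
        return (
            "CASE FACTS (verbatim, never summarised):\n"
            f"- Order: {self.order_id}\n"
            f"- Amount: {self.amount}\n"
            f"- Order date: {self.order_date}\n"
            f"- Request: {self.request}"
        )

def build_prompt(facts: CaseFacts, summary: str, latest_message: str) -> str:
    # The facts block is prepended unchanged to every prompt; only the
    # conversation summary comes from the summarisation pipeline.
    return (f"{facts.render()}\n\nCONVERSATION SUMMARY:\n{summary}"
            f"\n\nLATEST MESSAGE:\n{latest_message}")

facts = CaseFacts("#8891", "$247.83", "March 3rd", "refund")
prompt = build_prompt(facts, "Customer wants a refund for a recent order.",
                      "Any update on my refund?")
```

Even if the summary degrades over many turns, the exact amount, order number, and date survive in every prompt.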

The Lost in the Middle Effect

Models process the beginning and end of long inputs reliably. Findings buried in the middle of a long context may be missed entirely. Fix: place key findings summaries at the beginning of the input. Use explicit section headers throughout.
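A minimal sketch of that assembly order, front-loading the key findings and labelling every section (the header strings are assumptions, not a required format):

```python
def assemble_input(key_findings: list[str], documents: list[str]) -> str:
    # Key findings go first, where the model attends reliably;
    # every section gets an explicit header.
    sections = ["KEY FINDINGS (read first):"]
    sections += [f"- {finding}" for finding in key_findings]
    for i, doc in enumerate(documents, 1):
        sections.append(f"SOURCE DOCUMENT {i}:")
        sections.append(doc)
    return "\n".join(sections)

text = assemble_input(
    ["Refund approved under policy 4.2"],
    ["...long source document...", "...another long document..."],
)
```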

Tool Result Trimming

When an order lookup returns 40 or more fields but you need only 5, trim the result to relevant fields before appending it to context. Accumulated verbose results exhaust the token budget before useful work is complete.
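A minimal trimming sketch; the field names in `RELEVANT_FIELDS` are hypothetical:

```python
# Keep only the fields the task actually needs before appending
# a tool result to context.
RELEVANT_FIELDS = {"order_id", "status", "total", "refund_eligible", "placed_at"}

def trim_tool_result(raw: dict) -> dict:
    return {k: v for k, v in raw.items() if k in RELEVANT_FIELDS}

raw = {f"field_{i}": i for i in range(40)}  # simulate a 40-field lookup
raw.update({"order_id": "#8891", "status": "delivered", "total": 247.83})
trimmed = trim_tool_result(raw)
```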

TASK STATEMENT 5.2

Escalation and Ambiguity Resolution

The Three Valid Escalation Triggers

  • Customer explicitly requests a human — honour immediately. Do not attempt to resolve first.
  • Policy exceptions or gaps — the request falls outside documented policy.
  • Inability to make meaningful progress — the agent cannot advance the resolution.
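The three triggers above can be expressed as a simple escalation gate; note that sentiment and self-reported confidence are deliberately absent. All names are illustrative:

```python
def should_escalate(explicit_human_request: bool,
                    within_documented_policy: bool,
                    made_progress_recently: bool) -> bool:
    if explicit_human_request:        # honour immediately, never resolve first
        return True
    if not within_documented_policy:  # policy exception or gap
        return True
    if not made_progress_recently:    # agent cannot advance the resolution
        return True
    return False
```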

Do Not Use These as Escalation Triggers

Sentiment-based escalation: frustration does not correlate with case complexity. A frustrated customer with a simple issue should receive a resolution, not an escalation.

Self-reported confidence scores: the model is often incorrectly confident on hard cases and uncertain on easy ones, making its own confidence a poor routing signal.

Ambiguous Customer Matching

When multiple customers match a search query, ask for additional identifiers — email address, phone number, or order number. Do not select a customer based on heuristics like "most recent" or "most active."
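A sketch of that rule: on an ambiguous match, return a clarification request rather than picking by heuristic. The return shape is an assumption for illustration:

```python
def resolve_customer(matches: list[dict]) -> dict:
    if len(matches) == 1:
        return {"resolved": matches[0]}
    # Zero or multiple matches: ask for a disambiguating identifier
    # instead of guessing by recency or activity.
    return {
        "clarification_needed": True,
        "message": ("I found more than one matching account. Could you "
                    "confirm your email address, phone number, or an "
                    "order number?"),
    }

result = resolve_customer([{"id": 1, "name": "J. Smith"},
                           {"id": 2, "name": "J. Smith"}])
```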

TASK STATEMENT 5.3

Error Propagation

Structured Error Context

When propagating an error upward, always include: the failure type, what was attempted with specific parameters, partial results gathered before the failure, and potential alternative approaches.
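One possible shape for that structured error context; the field names are assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class ToolError:
    failure_type: str         # e.g. "timeout", "auth_error"
    attempted: str            # what was tried, with specific parameters
    partial_results: list     # anything gathered before the failure
    alternatives: list        # suggested next approaches

err = ToolError(
    failure_type="timeout",
    attempted="search_journals(query='geothermal', limit=20)",
    partial_results=[{"title": "Geothermal overview", "source": "web"}],
    alternatives=["retry with limit=5", "fall back to web search"],
)
```

A coordinator receiving this record can retry with the suggested parameters or continue with the partial results, rather than guessing what went wrong.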

Error Anti-Patterns — Both Are Wrong

Silent suppression: returning empty results marked as success. The coordinator cannot recover because it does not know a failure occurred.

Workflow termination: killing the entire pipeline on a single failure. This throws away all partial results gathered before the failure.

Coverage Annotations

Synthesis output should explicitly note which findings are well-supported and which areas have gaps. Stating "this section on geothermal energy is limited due to unavailable journal access" is far more useful than silently omitting the section.

TASK STATEMENT 5.4

Codebase Exploration

Strategies for managing context during exploration:

  • Scratchpad files: write key findings to a file; reference it for subsequent questions rather than keeping everything in context.
  • Subagent delegation: spawn subagents for specific investigations; the main agent keeps high-level coordination only.
  • Summary injection: summarise findings from one phase before spawning subagents for the next.
  • /compact: reduce context usage when it fills with verbose discovery output.

TASK STATEMENT 5.5

Human Review and Confidence Calibration

The Aggregate Metrics Trap

A 97% overall accuracy figure can hide a 40% error rate on a specific document type. Always validate accuracy by document type and field segment before automating any extraction pipeline.
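A minimal sketch of the per-segment breakdown; the counts below are invented purely to illustrate how a strong aggregate can mask a weak segment:

```python
from collections import defaultdict

def accuracy_by_segment(records):
    # records: iterable of (document_type, correct: bool)
    totals, hits = defaultdict(int), defaultdict(int)
    for doc_type, correct in records:
        totals[doc_type] += 1
        hits[doc_type] += int(correct)
    return {t: hits[t] / totals[t] for t in totals}

records = ([("invoice", True)] * 97 + [("invoice", False)] * 3
           + [("handwritten", True)] * 6 + [("handwritten", False)] * 4)
per_segment = accuracy_by_segment(records)
# Aggregate accuracy looks strong, but the handwritten segment fails
# 40% of the time.
```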

Field-Level Confidence Calibration

  • The model outputs a confidence score per field.
  • Calibrate thresholds using labelled validation sets with verified ground truth.
  • Route low-confidence fields to human review.
  • Prioritise limited reviewer capacity on the highest-uncertainty items.
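The steps above can be sketched as a threshold search over a labelled validation set: find the lowest confidence threshold at which auto-accepted fields meet a target accuracy, then route everything below it to review. Numbers and names are illustrative:

```python
def calibrate_threshold(validation, target_accuracy=0.99, step=0.01):
    # validation: list of (confidence, was_correct) with verified ground truth
    threshold = 0.0
    while threshold <= 1.0:
        accepted = [ok for conf, ok in validation if conf >= threshold]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
        threshold = round(threshold + step, 2)
    return 1.0  # nothing meets the target: review everything

def route(field_confidence, threshold):
    return "auto_accept" if field_confidence >= threshold else "human_review"

validation = ([(0.95, True)] * 99 + [(0.95, False)] * 1
              + [(0.5, False)] * 20)
t = calibrate_threshold(validation)
```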

TASK STATEMENT 5.6

Information Provenance

Structured Claim-Source Mappings

Every finding should carry: the claim, the source URL, the document name, a relevant excerpt, and the publication date. Downstream agents must preserve and merge these mappings through synthesis. Without this structure, attribution dies during summarisation.
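One possible record shape for the claim-source mapping, with a merge step that concatenates rather than drops sources; all keys and values are illustrative:

```python
def make_finding(claim, url, document, excerpt, published):
    return {"claim": claim, "source_url": url, "document": document,
            "excerpt": excerpt, "published": published}

def merge_findings(*finding_lists):
    # Synthesis step: merge findings without losing any attribution.
    merged = []
    for lst in finding_lists:
        merged.extend(lst)
    return merged

a = [make_finding("X grew 12% in 2023", "https://example.com/a",
                  "Report A", "...grew 12% year on year...", "2024-01-10")]
b = [make_finding("X grew 9% in 2023", "https://example.com/b",
                  "Report B", "...rose 9% over 2023...", "2024-02-02")]
merged = merge_findings(a, b)
```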

Conflict Handling

When two credible sources report different statistics, do not arbitrarily select one. Annotate the output with both values and their source attributions, and let the consumer make the informed choice.
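A sketch of that annotation step, surfacing both values with attribution instead of selecting one (the output format is an assumption):

```python
def annotate_conflict(metric, findings):
    # findings: list of {"value": ..., "source": ...}
    values = {f["value"] for f in findings}
    if len(values) > 1:
        listed = "; ".join(f'{f["value"]} ({f["source"]})' for f in findings)
        return f"CONFLICT on {metric}: {listed}"
    only = findings[0]
    return f'{metric}: {only["value"]} ({only["source"]})'

note = annotate_conflict("2023 growth",
                         [{"value": "12%", "source": "Report A"},
                          {"value": "9%", "source": "Report B"}])
```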

Preferred output format by content type:

  • Financial data: tables
  • News and narrative content: prose
  • Technical findings: structured lists

Hands-On Build Exercise

Build: Coordinator with Persistent Context

  • Build a coordinator with two subagents.
  • Implement a persistent case facts block that is never summarised — include it verbatim in every prompt.
  • Simulate a timeout in one subagent and verify the coordinator receives structured error context with partial results.
  • Test with two conflicting sources and verify the synthesis output preserves both values with attribution.
  • Verify the case facts block survives the full conversation without compression or data loss.