Before You Begin

Executive Summary

Two Words Govern This Entire Domain

Be explicit. Vague instructions produce vague results. Specific categorical criteria produce reliable, predictable behaviour.

Few-shot examples are the highest-leverage technique for consistency — not more instructions, not confidence thresholds.

tool_use with JSON schemas eliminates syntax errors — but not semantic errors or fabrication. Schema design matters.

Batch API: latency-tolerant workflows. Synchronous API: blocking workflows. Never swap these.

TASK STATEMENT 4.1

Explicit Criteria

Vague vs Explicit — The Core Distinction

Wrong: "Be conservative." — The model cannot operationalise this.

Wrong: "Only report high-confidence findings." — Self-reported confidence is poorly calibrated.

Right: "Flag comments only when claimed behaviour contradicts actual code behaviour. Report bugs and security vulnerabilities. Skip minor style preferences and local conventions."

The False Positive Trust Problem

High false positive rates in one category destroy trust in all categories. The fix is to temporarily disable the high false-positive category while iterating on its prompts. This restores trust in the remaining categories while you improve the problematic one.

Severity Calibration

Define explicit severity criteria with concrete code examples for each level — not prose descriptions. Show what "critical" looks like in actual code versus what "minor" looks like. The model generalises from examples far more reliably than from abstract prose.
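One way to encode such a rubric is as a structure that pairs each severity level with an actual code snippet and a one-line reason, then renders it into the prompt. A minimal sketch; the level names, snippets, and rendering format below are all illustrative assumptions, not a prescribed format:

```python
# Illustrative severity rubric: each level is defined by a concrete code
# example plus a reason, not a prose description. Snippets are hypothetical.
SEVERITY_EXAMPLES = {
    "critical": {
        "code": 'query = f"SELECT * FROM users WHERE id = {user_input}"',
        "why": "SQL injection: untrusted input interpolated into a query.",
    },
    "minor": {
        "code": "def getUserName(self): ...",
        "why": "Naming deviates from snake_case; no functional impact.",
    },
}

def severity_section() -> str:
    """Render the rubric into a prompt section the model can generalise from."""
    lines = []
    for level, ex in SEVERITY_EXAMPLES.items():
        lines.append(
            f"{level.upper()} example:\n    {ex['code']}\n  Reason: {ex['why']}"
        )
    return "\n".join(lines)
```

The rendered section is then embedded in the system prompt alongside the categorical criteria.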

TASK STATEMENT 4.2

Few-Shot Prompting

Few-shot examples are the single most effective technique for achieving consistency. When detailed instructions alone produce inconsistent results, adding 2 to 4 targeted examples resolves the ambiguity.

When to Deploy Few-Shot Examples

  • Detailed instructions alone produce inconsistent formatting or judgment calls.
  • The model handles ambiguous cases inconsistently across invocations.
  • Extraction tasks produce empty or null fields for information that exists in the document.

How to Construct Good Examples

  • Provide 2 to 4 targeted examples covering the ambiguous cases specifically.
  • Each example must show the reasoning for why one action was chosen over plausible alternatives — not just the correct output.
  • This teaches generalisation to novel patterns, not just pattern-matching pre-specified cases.
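The construction above can be sketched as a message list that interleaves worked examples before the real input. The example inputs, labels, and reasoning strings here are hypothetical; the point is that each assistant turn shows the chosen action and the reasoning, not just the output:

```python
# Each few-shot example pairs an input with the chosen action AND the
# reasoning behind it, so the model can generalise to novel cases.
FEW_SHOT_EXAMPLES = [
    {
        "input": "# TODO: remove after migration\nlegacy_sync()",
        "output": "SKIP",
        "reasoning": "A TODO is a process note, not a contradiction "
                     "between claimed and actual behaviour.",
    },
    {
        "input": "# returns None on failure\nreturn -1",
        "output": "FLAG",
        "reasoning": "The comment claims None on failure but the code "
                     "returns -1: claimed behaviour contradicts actual.",
    },
]

def build_messages(task_input: str) -> list[dict]:
    """Interleave worked examples before the real input."""
    messages = []
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant",
                         "content": f"{ex['output']}: {ex['reasoning']}"})
    messages.append({"role": "user", "content": task_input})
    return messages
```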

TASK STATEMENT 4.3

Structured Output with tool_use

Approach | Eliminates Syntax Errors? | Eliminates Semantic Errors?
tool_use with JSON schema | Yes | No — field placement, fabrication, and value errors remain possible.
Prompt-based JSON request | No — malformed JSON is possible | No

Schema Design Principles

{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },           // required
    "total_amount": { "type": "number" },             // required
    "discount_code": { "type": ["string", "null"] },  // nullable — prevents fabrication
    "payment_method": {                               // enum with fallback
      "type": "string",
      "enum": ["card", "bank_transfer", "cash", "other"]
    },
    "payment_notes": { "type": "string" }             // freeform for "other" cases
  },
  "required": ["invoice_number", "total_amount", "payment_method"]
}
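The schema above can be wrapped as a tool definition and checked locally. A sketch under assumptions: the tool name is hypothetical, the {"name", "description", "input_schema"} shape follows the Anthropic tools parameter, and the validator below is a deliberately minimal stand-in for full JSON Schema validation:

```python
# Wrap the invoice schema as a tool_use definition and validate an
# extraction locally (required fields + enum membership only).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "discount_code": {"type": ["string", "null"]},
        "payment_method": {"type": "string",
                           "enum": ["card", "bank_transfer", "cash", "other"]},
        "payment_notes": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount", "payment_method"],
}

record_invoice_tool = {
    "name": "record_invoice",  # hypothetical tool name
    "description": "Record fields extracted from an invoice.",
    "input_schema": INVOICE_SCHEMA,
}

def validate(extraction: dict) -> list[str]:
    """Return a list of error strings; empty means the extraction passed."""
    errors = [f"missing required field: {f}"
              for f in INVOICE_SCHEMA["required"] if f not in extraction]
    method = extraction.get("payment_method")
    allowed = INVOICE_SCHEMA["properties"]["payment_method"]["enum"]
    if method is not None and method not in allowed:
        errors.append(f"payment_method {method!r} not in {allowed}")
    return errors
```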

TASK STATEMENT 4.4

Validation-Retry Loops

When extraction fails validation, send back the original document, the failed extraction, and the specific validation error. The model uses this structured feedback to self-correct.
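The loop described above can be sketched as follows. Both `extract` (the model call) and `validate` are injected stand-ins here; the feedback dictionary shape is an assumption:

```python
# Minimal validation-retry loop: on failure, resend the original document
# plus the failed attempt and the specific validation errors.
def extract_with_retry(document: str, extract, validate, max_retries: int = 2):
    feedback = None
    attempt = extract(document, feedback)
    for _ in range(max_retries):
        errors = validate(attempt)
        if not errors:
            return attempt
        # Structured feedback: failed output + exact error, alongside the
        # original document passed on every call.
        feedback = {"failed_extraction": attempt, "validation_errors": errors}
        attempt = extract(document, feedback)
    return attempt  # best effort after retries; caller decides what to do
```

Note the loop only helps with the fixable scenarios: if the information is genuinely absent from the document, no amount of feedback produces it.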

Scenario | Retry Effective?
Format mismatches — value in wrong field | Yes — the model can reposition values with error context.
Structural output errors | Yes — the model can fix structure when shown the specific error.
Information genuinely absent from the source document | No — retrying cannot produce information that does not exist.

The exam presents both fixable and unfixable scenarios. Identify which is which before recommending a retry approach.

TASK STATEMENT 4.5

Batch Processing

Message Batches API | Value
Cost saving | 50% compared to synchronous API
Processing window | Up to 24 hours
Latency SLA | None guaranteed
Multi-turn tool calling | Not supported within a single request
Request correlation | Use custom_id to match requests to responses
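The custom_id correlation can be sketched with simplified stand-in shapes. In the real Batch API each request carries a `custom_id` and results may come back in any order; the `params` payload below is abbreviated and the ids are illustrative:

```python
# Build batch requests keyed by a stable id, then match out-of-order
# results back to their source documents via custom_id.
def build_requests(documents: dict[str, str]) -> list[dict]:
    """documents maps a stable id (e.g. a filename) to its text."""
    return [
        {"custom_id": doc_id,
         "params": {"messages": [{"role": "user", "content": text}]}}
        for doc_id, text in documents.items()
    ]

def correlate(documents: dict[str, str], results: list[dict]) -> dict[str, dict]:
    """Match results, which may arrive in any order, to their documents."""
    by_id = {r["custom_id"]: r for r in results}
    return {doc_id: by_id[doc_id] for doc_id in documents if doc_id in by_id}
```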

The Matching Rule

Synchronous API: blocking workflows — pre-merge checks, anything developers or automated pipelines wait for.

Batch API: latency-tolerant workflows — overnight reports, weekly audits, nightly test generation.

Do not use batch processing for anything that blocks a waiting human or automated process.

TASK STATEMENT 4.6

Multi-Instance Review

A model reviewing its own output in the same session retains the reasoning context that produced it — making it less likely to question its own decisions. An independent instance with no prior context catches significantly more subtle issues.

Multi-Pass Architecture

  • Per-file local analysis passes: consistent depth per file.
  • Separate cross-file integration pass: catches data flow issues that span multiple files.
  • Prevents attention dilution and contradictory findings across files.

Confidence-Based Routing

  • The model self-reports confidence per finding.
  • Route low-confidence findings to human review.
  • Calibrate thresholds using labelled validation sets with verified ground truth data.
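The routing step above reduces to a threshold split. A minimal sketch; the 0.8 threshold is an assumed placeholder that should come from calibration on a labelled validation set, not from guessing:

```python
# Split findings into auto-accepted and human-review queues based on the
# model's self-reported confidence. Threshold is an assumed placeholder.
CONFIDENCE_THRESHOLD = 0.8  # calibrate against ground-truth data

def route(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    auto, human = [], []
    for f in findings:
        (auto if f["confidence"] >= CONFIDENCE_THRESHOLD else human).append(f)
    return auto, human
```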

Hands-On Build Exercise

Build: Extraction Pipeline with Validation-Retry

  • Create an extraction tool with a JSON schema that includes required, optional, and nullable fields, plus an enum with an "other" option.
  • Implement a validation-retry loop that sends back the original document, failed extraction, and specific validation error.
  • Process 10 documents through the Message Batches API using custom_id for correlation.
  • Add few-shot examples for documents with varied structures (inline citations vs bibliographies).
  • Compare extraction quality before and after adding the few-shot examples. Document the difference.