Before You Begin
Executive Summary
Two Words Govern This Entire Domain
Be explicit. Vague instructions produce vague results. Specific categorical criteria produce reliable, predictable behaviour.
Few-shot examples are the highest-leverage technique for consistency — not more instructions, not confidence thresholds.
tool_use with JSON schemas eliminates syntax errors — but not semantic errors or fabrication. Schema design matters.
Batch API: latency-tolerant workflows. Synchronous API: blocking workflows. Never swap these.
TASK STATEMENT 4.1
Explicit Criteria
Vague vs Explicit — The Core Distinction
Wrong: "Be conservative." — The model cannot operationalise this.
Wrong: "Only report high-confidence findings." — Self-reported confidence is poorly calibrated.
Right: "Flag comments only when claimed behaviour contradicts actual code behaviour. Report bugs and security vulnerabilities. Skip minor style preferences and local conventions."
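The "Right" criteria above can be assembled programmatically. A minimal sketch, where the helper function and criteria lists are illustrative, not from any SDK:

```python
# Hypothetical helper: turn explicit categorical criteria into a system
# prompt. Vague adverbs ("be conservative") never appear; every instruction
# names a concrete category of behaviour.

def build_review_prompt(flag, report, skip):
    """Assemble a reviewer system prompt from explicit categorical criteria."""
    lines = ["You are a code reviewer."]
    lines.append("Flag a comment only when: " + "; ".join(flag) + ".")
    lines.append("Report: " + "; ".join(report) + ".")
    lines.append("Skip: " + "; ".join(skip) + ".")
    return "\n".join(lines)

prompt = build_review_prompt(
    flag=["claimed behaviour contradicts actual code behaviour"],
    report=["bugs", "security vulnerabilities"],
    skip=["minor style preferences", "local conventions"],
)
print(prompt)
```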
The False Positive Trust Problem
High false positive rates in one category destroy trust in all categories. The fix is to temporarily disable the high false-positive category while iterating on its prompts. This restores trust in the remaining categories while you improve the problematic one.
Severity Calibration
Define explicit severity criteria with concrete code examples for each level — not prose descriptions. Show what "critical" looks like in actual code versus what "minor" looks like. The model generalises from examples far more reliably than from abstract prose.
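One way to keep severity levels example-driven is to store a concrete code snippet per level and render them into the prompt. The snippets and helper below are hypothetical illustrations:

```python
# Hypothetical severity calibration: each level is defined by a concrete
# code example, not a prose description.

SEVERITY_EXAMPLES = {
    "critical": 'query = f"SELECT * FROM users WHERE id = {user_id}"  # SQL injection',
    "minor": "userName = get_user_name()  # camelCase in a snake_case codebase",
}

def severity_section():
    """Render the severity levels as prompt text, one example per level."""
    parts = []
    for level, example in SEVERITY_EXAMPLES.items():
        parts.append(f"{level.upper()} looks like:\n    {example}")
    return "\n".join(parts)

print(severity_section())
```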
TASK STATEMENT 4.2
Few-Shot Prompting
Few-shot examples are the single most effective technique for achieving consistency. When detailed instructions alone produce inconsistent results, adding 2 to 4 targeted examples resolves the ambiguity.
When to Deploy Few-Shot Examples
- Detailed instructions alone produce inconsistent formatting or judgment calls.
- The model handles ambiguous cases inconsistently across invocations.
- Extraction tasks produce empty or null fields for information that exists in the document.
How to Construct Good Examples
- Provide 2 to 4 targeted examples covering the ambiguous cases specifically.
- Each example must show the reasoning for why one action was chosen over plausible alternatives — not just the correct output.
- This teaches generalisation to novel patterns, not just pattern-matching pre-specified cases.
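The construction rules above can be sketched as prior conversation turns, with each assistant turn spelling out the reasoning. The example cases and verdicts below are hypothetical:

```python
# Few-shot examples embedded as prior turns. Each assistant turn shows the
# reasoning for the choice, not just the correct output.

FEW_SHOT = [
    {"role": "user",
     "content": "Comment says 'returns sorted list'; code returns a set."},
    {"role": "assistant",
     "content": "Flag. Reasoning: the claimed behaviour (sorted list) "
                "contradicts the actual return type (unordered set)."},
    {"role": "user",
     "content": "Variable named tmp in a 3-line helper."},
    {"role": "assistant",
     "content": "Skip. Reasoning: this is a minor style preference; "
                "no claimed behaviour contradicts the code."},
]

def with_few_shot(new_case):
    """Prepend the worked examples to a new case."""
    return FEW_SHOT + [{"role": "user", "content": new_case}]

msgs = with_few_shot("Docstring claims thread-safe; function mutates "
                     "a shared dict without a lock.")
print(len(msgs))
```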
TASK STATEMENT 4.3
Structured Output with tool_use
| Approach | Eliminates Syntax Errors? | Eliminates Semantic Errors? |
|---|---|---|
| tool_use with JSON schema | Yes | No — field placement, fabrication, and value errors remain possible. |
| Prompt-based JSON request | No — malformed JSON is possible | No |
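A sketch of a tool definition for structured extraction. The outer shape (name, description, input_schema) matches the Messages API tools format; the field names and enum values are hypothetical:

```python
# Extraction tool definition. The input_schema constrains output syntax;
# a nullable field and an "other" enum option reduce pressure to fabricate.

EXTRACT_TOOL = {
    "name": "record_invoice",
    "description": "Record fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string"},
            "currency": {"type": "string",
                         "enum": ["USD", "EUR", "GBP", "other"]},
            # Nullable: the model can return null rather than invent a value.
            "po_number": {"type": ["string", "null"]},
        },
        "required": ["invoice_id", "currency"],
    },
}

# Schema enforcement guarantees well-formed JSON, not correct values:
# client.messages.create(..., tools=[EXTRACT_TOOL],
#                        tool_choice={"type": "tool", "name": "record_invoice"})
print(EXTRACT_TOOL["input_schema"]["required"])
```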
Schema Design Principles
- Make fields nullable where the information may be absent, so the model can return null instead of fabricating a value.
- Give enum fields an "other" option so values outside the expected set are not forced into a wrong category.
- Mark only genuinely mandatory fields as required; a required field the document may lack invites fabrication.
TASK STATEMENT 4.4
Validation-Retry Loops
When extraction fails validation, send back the original document, the failed extraction, and the specific validation error. The model uses this structured feedback to self-correct.
| Scenario | Retry Effective? |
|---|---|
| Format mismatches — value in wrong field | Yes — the model can reposition values with error context. |
| Structural output errors | Yes — the model can fix structure when shown the specific error. |
| Information genuinely absent from the source document | No — retrying cannot produce information that does not exist. |
The exam presents both fixable and unfixable scenarios. Identify which is which before recommending a retry approach.
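The loop above can be sketched as follows. `call_model` and `validate` are placeholders: the first would invoke the extraction tool, the second returns a specific error string or None:

```python
# Minimal validation-retry loop. On failure, the retry call receives all
# three pieces of structured feedback: the original document, the failed
# extraction, and the specific validation error.

def extract_with_retry(document, call_model, validate, max_retries=2):
    extraction = call_model(document=document)
    for _ in range(max_retries):
        error = validate(extraction)
        if error is None:
            return extraction
        extraction = call_model(
            document=document,
            failed_extraction=extraction,
            validation_error=error,
        )
    return extraction
```

Note the loop cannot conjure information absent from the source; it only repairs format and structure errors the error message pinpoints.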
TASK STATEMENT 4.5
Batch Processing
| Message Batches API | Value |
|---|---|
| Cost saving | 50% discount relative to the synchronous API |
| Processing window | Up to 24 hours |
| Latency SLA | None guaranteed |
| Multi-turn tool calling | Not supported within a single request |
| Request correlation | Use custom_id to match requests to responses |
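The custom_id correlation pattern looks like this. The request shape (custom_id plus params) matches the Batches API; the document ids are hypothetical, the model id may differ, and the create call is left commented out:

```python
# Build batch requests keyed by custom_id so out-of-order results can be
# matched back to their source documents.

documents = {
    "doc-001": "First contract text...",
    "doc-002": "Second contract text...",
}

requests = [
    {
        "custom_id": doc_id,  # echoed back on each result for correlation
        "params": {
            "model": "claude-sonnet-4-20250514",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": text}],
        },
    }
    for doc_id, text in documents.items()
]

# batch = client.messages.batches.create(requests=requests)
# Results arrive within the 24-hour window, in any order; match each
# result to its document via result.custom_id.
print([r["custom_id"] for r in requests])
```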
The Matching Rule
Synchronous API: blocking workflows — pre-merge checks, anything developers or automated pipelines wait for.
Batch API: latency-tolerant workflows — overnight reports, weekly audits, nightly test generation.
Do not use batch processing for anything that blocks a waiting human or automated process.
TASK STATEMENT 4.6
Multi-Instance Review
A model reviewing its own output in the same session retains the reasoning context that produced it — making it less likely to question its own decisions. An independent instance with no prior context catches significantly more subtle issues.
Multi-Pass Architecture
- Per-file local analysis passes: consistent depth per file.
- Separate cross-file integration pass: catches data flow issues that span multiple files.
- Prevents attention dilution and contradictory findings across files.
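The orchestration can be sketched as two distinct passes. `review_file` and `review_integration` are placeholders for separate model calls:

```python
# Multi-pass review orchestration: per-file passes for consistent depth,
# then one cross-file pass over everything for data-flow issues.

def multi_pass_review(files, review_file, review_integration):
    findings = []
    # Pass 1: one call per file keeps analysis depth consistent.
    for path, source in files.items():
        findings.extend(review_file(path, source))
    # Pass 2: a separate call sees all files together, catching
    # cross-file issues the per-file passes cannot.
    findings.extend(review_integration(files))
    return findings
```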
Confidence-Based Routing
- The model self-reports confidence per finding.
- Route low-confidence findings to human review.
- Calibrate thresholds using labelled validation sets with verified ground truth data.
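The routing step reduces to a threshold comparison. The threshold value below is hypothetical; in practice it comes from calibration on the labelled validation set:

```python
# Route findings by self-reported confidence: above the threshold is
# auto-accepted, below goes to human review.

AUTO_THRESHOLD = 0.8  # hypothetical; calibrate against ground truth

def route(findings, threshold=AUTO_THRESHOLD):
    auto, human = [], []
    for finding in findings:
        if finding["confidence"] >= threshold:
            auto.append(finding)
        else:
            human.append(finding)
    return auto, human

auto, human = route([
    {"id": 1, "confidence": 0.95},
    {"id": 2, "confidence": 0.40},
])
print(len(auto), len(human))
```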
Hands-On Build Exercise
Build: Extraction Pipeline with Validation-Retry
- Create an extraction tool whose JSON schema includes required, optional, and nullable fields, plus enum values with an "other" option.
- Implement a validation-retry loop that sends back the original document, failed extraction, and specific validation error.
- Process 10 documents through the Message Batches API using custom_id for correlation.
- Add few-shot examples for documents with varied structures (inline citations vs bibliographies).
- Compare extraction quality before and after adding the few-shot examples. Document the difference.