The Anatomy of Agentic AI Task Examples
How to build an example library that actually makes your agents smarter.
While building Moltin, I've learned that the difference between an agent that ships and one that dies in a Jupyter notebook often has nothing to do with the algorithm. It's about the examples you collect.
Now I watch teams make the same mistakes we did. They spin up agents, connect APIs, write elaborate prompts. Then they wonder why the system can't handle basic variations of the task it's supposed to automate. The problem isn't the AI. It's that nobody captured the right examples when it mattered.
Most organizations building agentic AI workflows don't fail because their models are weak. They fail because they never built a proper library of task examples. Or they captured examples that were too generic to be useful, too specific to generalize, or missing the context that makes the difference between success and hallucination.
This isn't a theoretical problem. When your agent starts routing customer support tickets to the wrong department or approving expense reports it should flag, you'll trace the failure back to the same root cause: you didn't document the edge cases when you first saw them. You assumed the AI would figure it out.
It won’t.
What Makes a Good Agentic Task Example
A good task example isn't just input and output. That's what most people capture, and it's why their examples don't help. You need the full anatomy.
Start with the input, but be specific about format and constraints. If your agent processes customer emails, don't just paste "customer inquiry about refund." Include the actual email with all its typos, the subject line, the timestamp, whether it's a reply in a thread. Real inputs are messy. Your examples should reflect that.
Context is where most teams fall short. An agent deciding whether to escalate a support ticket needs to know more than the ticket content. It needs the customer's lifetime value, their support history, the current queue depth, whether it's during business hours. In agentic systems, this is called context engineering, and you need to document every piece that influences the decision.
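As a sketch of what "full input plus context" can look like when captured, here's a minimal Python structure. The field names (customer_lifetime_value, queue_depth, and so on) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class TicketExample:
    """One captured support-ticket example: raw input plus decision context."""
    # Raw input, preserved exactly as received -- typos, threading, and all.
    raw_email: str
    subject: str
    timestamp: str
    is_thread_reply: bool
    # Every piece of context that influences the escalation decision.
    context: dict = field(default_factory=dict)

example = TicketExample(
    raw_email="hi, i was chraged twice for my order?? pls refund asap",
    subject="RE: Order #88123",
    timestamp="2024-01-15T09:42:00Z",
    is_thread_reply=True,
    context={
        "customer_lifetime_value": 4800.00,
        "prior_tickets_90d": 3,
        "queue_depth": 27,
        "business_hours": True,
    },
)
```

Note that the typo in the email is preserved deliberately: the messy input is the example.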
Decision Points
Decision points are the skeleton of your example. At what moment does the agent need to choose between paths? What information triggers that choice? If you're building a procurement agent that approves or rejects purchase requests, the decision point isn't just "approve or reject." It's the moment the agent checks whether the vendor is on the approved list, whether the amount exceeds department budget, whether similar requests were recently approved. Each decision point needs documentation.

1. Check Vendor Approval Status
Condition: Vendor must be in approved list OR have exception approval
Data sources: vendor_database, exception_log
Possible outcomes: approved_vendor / unapproved_vendor / vendor_not_found
Actual outcome: approved_vendor
Reasoning: Vendor found in approved list with active status
Expected Output
Expected outputs should include not just the final result but the intermediate steps. If your agent approves a purchase request, what emails does it send? What fields does it update in your ERP system? What audit trail does it create? I've seen agents produce the right answer through completely wrong reasoning. Without documented intermediate steps, you can't catch that.

Primary Decision
Decision: APPROVED
Confidence: 95%

Intermediate Steps
Validated vendor: Acme Software Ltd (ID: VND-4521)
Checked budget: $2,500 against $8,000 available in software_licenses
Verified approval authority: Auto-approve under $5,000 threshold
Generated PO: PO-2024-001234

Actions Taken
Create Purchase Order
System: ERP
PO number: PO-2024-001234
Status: Approved
Send Notifications
System: Email
Recipients: john.doe@company.com, finance@company.com
Template: expense_approved
PO attached: Yes
Update Budget
System: budget_tracker
Category: software_licenses
Amount reserved: $2,500

Audit Trail
Decision timestamp: 2024-01-15 14:25:30
Agent version: v2.3.1
Rules applied: vendor_check_v2, budget_check_v3, auto_approve_v1
Human review required: No
Response time: 1,250 ms
Failure Modes
Failure modes matter more than success cases. Every good example includes what could go wrong and how the agent should handle it. The vendor isn't on the approved list but it's a subsidiary of one that is. The budget shows as exceeded but that's because of a pending reimbursement. The request came from a VP who technically doesn't have authority for this category but everyone knows she's acting for the CFO who does. These aren't edge cases. They're Tuesday.

Vendor Not in Approved List (But Is Subsidiary)
Detection: Vendor name lookup returns null
Correct behavior: Check for parent company relationship before rejecting
Risk level: Medium
Frequency: Occasional
Metadata
Metadata ties everything together. Who captured this example and when? What version of the workflow does it represent? Has this example been used for training, evaluation, or both? Is it a real interaction or a synthetic one you created for testing? Six months from now, when you're debugging why the agent behaves oddly on certain inputs, this metadata will save you hours of archaeology.

Edge Case Information
Is this an edge case? No
Edge case type: N/A
Related edge cases: None

Validation
Human verified: Yes by sarah.jones@company.com on 2024-01-15
Matches policy: Yes
Policy reference: EXP-POL-2024-v3 Section 4.2
Discrepancies: None

Usage Metrics
Used in training: Yes
Used in evaluation: Yes
Times referenced: 5
Similar examples: EXP-2024-008, EXP-2024-015
Success rate on similar inputs: 94%

Notes
Capture Notes
Standard case that represents typical engineering department software purchase. Good baseline example for training.
Reviewer Notes
Clean example with all decision points clearly documented. Use as template for similar expense approval examples.

Update History
2024-01-15 14:30 by jane.smith@company.com: Initial creation from production run
When to Capture Examples
Timing is everything. Capture too early and you're documenting hypothetical scenarios that don't reflect reality. Capture too late and you've already deployed an agent that's learned from incomplete data.
Initial Scoping: Building Your Foundation
The scoping phase is when you define what the agent will do. This is your first chance to capture examples, and you should treat it like requirements gathering because that's exactly what it is.
Sit with the people who do the task manually today. Don't just ask them to describe it. Watch them do it. Record five to ten complete walkthroughs. Note every decision they make, every system they check, every time they pause to think. That pause is a decision point your agent will need to handle.
Imagine you're documenting a customer onboarding workflow. You'd likely work with the ops team and document what they tell you they do. But what they describe is the happy path. The actual process includes a dozen conditional branches they've internalized so completely they forget to mention them. You'd only catch this when the agent starts onboarding customers incorrectly and someone says, "Well, obviously it should check for that," in a Teams channel in front of leadership.
Not obvious if you never documented it.
During scoping, capture examples that span the full range of typical inputs. Don't optimize for the most common case only. You want the 80th percentile, the 95th percentile, and at least one example from the long tail. That outlier case will teach your agent how to handle variation.
Learning From the Weird Stuff
Edge cases emerge during testing and early deployment. This is when you discover all the scenarios your scoping didn't cover. The key is to have a system for capturing them immediately.
We built a feedback loop management system into Moltin. Whenever someone on the team encounters unexpected agent behavior, they can capture it with a brief explanation and a single click. Moltin grabs the input, the agent's output, what the human would have done instead, and the person's explanation of why the agent got it wrong. It takes 30 seconds.
Before we had this system, edge cases lived in people's heads or got mentioned in passing during stand-ups. "Oh yeah, the agent did something weird yesterday, but I fixed it manually." That manual fix was free training data you just threw away.
Some edge cases signal systematic problems. If your expense approval agent consistently flags legitimate receipts from a particular vendor, that's not just one bad example. That's a pattern. Capture it, but also investigate the underlying cause. Maybe the vendor's invoice format confuses the OCR. Maybe their business name doesn't match what's in your vendor database. Fix the root cause and document both the edge case and the fix.
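A capture routine like this can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not Moltin's actual implementation; everything automatable is filled in automatically, and only the two judgment fields come from the human:

```python
import json
import time
import uuid

def capture_edge_case(agent_input, agent_output, human_correction, explanation):
    """Record unexpected agent behavior as a structured edge-case example.

    Automates the mechanical fields; asks the human only for what should
    have happened and why the agent got it wrong.
    """
    record = {
        "example_id": str(uuid.uuid4()),
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "input": agent_input,
        "agent_output": agent_output,
        "human_correction": human_correction,
        "explanation": explanation,
        "status": "staged",  # awaits review before joining the library
    }
    return json.dumps(record)  # in practice: POST to your example store

raw = capture_edge_case(
    agent_input="Invoice from AcmeSub GmbH, $1,200",
    agent_output="REJECTED: vendor not in approved list",
    human_correction="APPROVED: AcmeSub is a subsidiary of approved vendor Acme",
    explanation="Agent did not check parent-company relationships",
)
```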
Mining Gold From Mistakes
When your agent fails, that failure is more valuable than ten successful runs. Failures reveal gaps in your example library that success can’t show you.
Root cause analysis for agent failures looks different than traditional software debugging. You’re not looking for a bug in the code. You’re looking for a gap in the agent’s understanding of the task.
What context was missing?
What decision point did it mishandle?
What failure mode did you not anticipate?
Document the failure in full detail before you fix anything. Capture the complete input, the context available to the agent, every intermediate step it took, and where it went wrong. Then document what the correct behavior should have been and why. This becomes a teaching example.
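A failure record documented to that standard might look like the following. The IDs, policy reference, and field names are hypothetical, reusing the expense-approval scenario from earlier in this article:

```python
# A fully documented failure, captured before any fix was applied.
failure_record = {
    "failure_id": "FAIL-2024-0031",  # hypothetical ID scheme
    "input": "Expense: $6,200 team offsite, submitted by VP of Ops",
    "context_available": ["budget_remaining", "approver_chain"],
    "intermediate_steps": [
        "Matched category: travel",
        "Checked budget: within limit",
        "Auto-approved without authority check",  # <- the gap
    ],
    "where_it_went_wrong": "Skipped approval-authority check above $5,000",
    "correct_behavior": "Route to CFO delegate for amounts over $5,000",
    "why": "Policy requires human sign-off above the auto-approve threshold",
}
```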
Here’s a pattern I’ve seen repeatedly: an agent fails, someone patches the prompt or tweaks a parameter, the agent starts working again, everyone moves on. Six months later, the agent fails the same way on a slightly different input. Nobody remembers the original failure or the fix. You end up debugging the same problem twice.
Treat failures as permanent additions to your example library. Use them in regression testing. When you update the agent, test against your failure library first.
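Testing against a failure library can be as simple as replaying every documented failure through the new agent version and refusing to ship until the list of regressions is empty. A minimal sketch, with a stand-in agent function for illustration:

```python
def regression_check(agent_fn, failure_library):
    """Replay every documented failure against a new agent version.

    Returns the IDs of examples the agent still gets wrong.
    """
    still_failing = []
    for example in failure_library:
        if agent_fn(example["input"]) != example["correct_behavior"]:
            still_failing.append(example["failure_id"])
    return still_failing

# Hypothetical one-example library and a toy agent that fixes the failure.
library = [{
    "failure_id": "FAIL-2024-0031",
    "input": "expense over threshold",
    "correct_behavior": "escalate",
}]
fixed_agent = lambda text: "escalate" if "over threshold" in text else "approve"
```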
Letting Reality Guide You
Users will tell you when your agent gets it wrong. The question is whether you’re listening in a way that generates useful examples.
Thumbs up and thumbs down ratings are useless by themselves. A thumbs down tells you that the human felt that the agent failed. It doesn’t tell you how or why. You need structured feedback that captures what the user expected versus what they got.
When you review feedback in Moltin as an admin, keep three questions in mind:
What did the agent do?
What should it have done?
Why does the difference matter?
Users can easily fill out feedback right from their chat session. Every submission should be reviewed. We are seeing about 60% become new examples that improve the agent. The other 40% reveal confusion about what the agent is supposed to do, which is also valuable.
Pay attention to the requests users make right after the agent acts. If your email triage agent routes a message to sales and the recipient immediately forwards it to support, that’s feedback. The sequence of events tells you the agent got it wrong even if the user never files a formal complaint.
Implicit feedback is everywhere if you look for it. Manual overrides of agent decisions. Cases where a user reviews the agent’s work and makes small edits. Patterns where certain types of tasks consistently get pulled back from the agent and handled manually.
All of these are examples waiting to be captured.
Practical Documentation Techniques
You need to build a system that captures rich detail without becoming a full-time job. The best system is one your team will actually use under deadline pressure.

Templates That Scale
Start with a structured template. Every example follows the same format. This makes examples easy to review, compare, and convert into training data later.

Your template should have sections for input, context, decision points, output, failure modes, and metadata. Within each section, use consistent field names. If one example calls something “customer_priority” and another calls it “account_tier,” you’ll waste time reconciling them later.

{
  "example_id": "EXP-2024-001",
  "example_type": "training | evaluation | debugging | stakeholder",
  "workflow_version": "2.3.1",
  "created_date": "2024-01-15T14:30:00Z",
  "created_by": "jane.smith@company.com",
  "last_updated": "2024-01-15T14:30:00Z",
  "status": "active | archived | deprecated",
  "task_description": {
    "title": "Approve departmental expense request",
    "category": "expense_approval",
    "complexity": "standard | complex | edge_case",
    "tags": ["expense", "approval", "international_vendor"]
  }
}

Make the template fill itself in where possible. If you’re capturing examples from a production system, pull the input and context automatically. Generate a timestamp, capture the user ID, record the agent version. Anything you can automate, automate.
Ask humans only for the parts that require judgment.
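One way to enforce that split is a skeleton generator: the mechanical fields are filled automatically, and the judgment fields are created blank so a human has to complete them. This is an illustrative sketch, not a fixed API:

```python
import os
import time

AGENT_VERSION = "2.3.1"  # in practice, read from your deployment config

def new_example_skeleton(raw_input, context):
    """Pre-fill every field that can be automated; leave judgment to humans."""
    return {
        # Automated fields
        "created_date": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "created_by": os.environ.get("USER", "unknown"),
        "workflow_version": AGENT_VERSION,
        "input": raw_input,
        "context": context,
        # Human-judgment fields, deliberately left blank for review
        "decision_points": None,
        "expected_output": None,
        "failure_modes": None,
        "notes": "",
    }

skeleton = new_example_skeleton("customer inquiry text", {"queue_depth": 12})
```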
We offer a Markdown schema for our examples to be added within Moltin. This is converted into a JSON structure before being saved. Structured data is easier to search, analyze, and feed into training pipelines than free text. But you should also include a notes field for context that doesn’t fit the schema. Sometimes the most important detail is something you didn’t plan for.
Balancing Detail With Maintainability
Too much detail and nobody will capture examples. Too little and the examples won't help. You need to find the right level.
For common scenarios, you can use abbreviated documentation. Once you've captured three examples of standard expense approvals with all the detail, subsequent examples can reference the standard pattern and note only what's different.
"Same as example 0042 but with international vendor."
For unusual scenarios, go deep. These are the examples that teach your agent to handle variation. Link to relevant policy documents. Quote the exact clause in the employee handbook that governs this situation. Future you will be grateful.
Version your examples. When the workflow changes, you don't want to lose the old examples, but you also don't want to train on outdated data. We tag each example with the workflow version it represents. When we update the agent, we review examples from the previous version and decide whether to update, archive, or delete each one.
Collaborative Capture
Example capture can’t be one person’s job. It needs to be a team practice.
Make it easy for anyone to contribute. The procurement specialist who notices the agent mishandles a particular vendor type should be able to capture that example without filing a ticket and waiting for someone from the AI team to get around to it.
But you also need curation. Not every contributed example is useful. Some are duplicates. Some document the same underlying pattern. Some are too specific to generalize. Assign someone to review new examples weekly and organize them into your library.
One approach is a two-tier system. Anyone can add an example to the staging area. Once a week, someone from your AI team reviews staged examples, tags them appropriately, and promotes them to Moltin. Examples that need clarification go back to the contributor with questions. This keeps the library clean without creating bottlenecks.
Structuring Examples for Different Purposes
Not all examples serve the same purpose. How you structure them depends on what you'll use them for.
Teaching the Agent
Examples used for training need to be clean, consistent, and representative. You’re teaching the agent patterns, so you want examples that clearly illustrate those patterns without too much noise.
For training data, focus on the input-output relationship and the intermediate reasoning steps. Context matters, but you can often simplify it. If your agent approves expenses based on ten different factors, but only three factors matter for 90% of decisions, your training examples should emphasize those three.
Balance your training set. If 95% of your examples are approvals and 5% are rejections, your agent will be biased toward approval. You might need to create synthetic examples or oversample the minority class.
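Oversampling the minority class can be as simple as duplicate-sampling it until the labels are balanced. A sketch using only the standard library (the 95/5 approve/reject split mirrors the example above):

```python
import random

def oversample_minority(examples, label_key="decision", seed=7):
    """Duplicate-sample each minority class until all labels are balanced."""
    random.seed(seed)  # fixed seed so the training set is reproducible
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to top the group up to the target count.
        balanced.extend(random.choices(group, k=target - len(group)))
    random.shuffle(balanced)
    return balanced

train = (
    [{"decision": "approve"} for _ in range(95)]
    + [{"decision": "reject"} for _ in range(5)]
)
balanced = oversample_minority(train)
```

Synthetic examples are usually better than raw duplication when you can write them, but duplication is the honest baseline.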
Label your training examples clearly. Don’t just mark them “correct.” Explain why they’re correct. What principle or policy does this example illustrate? What pattern should the agent learn? These labels help you audit your training data later when the agent behaves unexpectedly.
Measuring Performance
Evaluation examples are your test suite. They tell you whether changes to the agent make things better or worse.
These examples should be stable. Don't change them unless the underlying task changes. You want to be able to compare agent performance across weeks or months, and that requires a consistent benchmark.
Include both typical cases and edge cases in your evaluation set. A good split is 70% typical, 20% challenging but not rare, 10% true edge cases. This gives you a sense of both baseline performance and how well the agent handles difficulty.
For each evaluation example, define success criteria precisely. "Correct output" is too vague. Does correct mean exact match? Semantically equivalent? Within certain parameters? The more specific your criteria, the more useful your evaluation metrics will be.
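Those three notions of "correct" can be made explicit in the evaluation harness itself. A sketch of one way to encode them; the criteria vocabulary here is an assumption, not a standard:

```python
def matches(expected, actual, criteria):
    """Check an agent output against an example's declared success criteria.

    criteria is one of:
      "exact"                        -- outputs must match exactly
      ("fields", [names])            -- only the listed fields must match
      ("tolerance", name, delta)     -- numeric field within +/- delta
    """
    if criteria == "exact":
        return expected == actual
    if isinstance(criteria, tuple) and criteria[0] == "fields":
        return all(expected[f] == actual.get(f) for f in criteria[1])
    if isinstance(criteria, tuple) and criteria[0] == "tolerance":
        _, name, delta = criteria
        return abs(expected[name] - actual[name]) <= delta
    raise ValueError(f"unknown criteria: {criteria!r}")

expected = {"decision": "APPROVED", "amount": 2500, "po": "PO-001"}
```

Semantic equivalence for free-text outputs needs a model-based judge rather than a comparison like this, but the principle is the same: the example declares its own pass condition.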
Understanding Failures
When something goes wrong in production, you need examples that help you diagnose the problem quickly.
Debugging examples should include everything. Full input, complete context, every intermediate step, the final output, and what went wrong. These are not clean training examples. They’re crime scenes. You’re preserving evidence.
Tag debugging examples with the symptoms.
“Agent approved expense over budget limit.”
“Agent routed technical question to sales.”
“Agent generated response in wrong language.”
These tags let you find similar failures quickly when you’re troubleshooting.
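A symptom-tag lookup over the debugging library is trivial to build once the tags exist. A minimal sketch with hypothetical IDs and tags:

```python
debug_examples = [
    {"id": "DBG-001", "tags": ["over_budget", "expense"]},
    {"id": "DBG-002", "tags": ["wrong_routing", "sales"]},
    {"id": "DBG-003", "tags": ["over_budget", "international_vendor"]},
]

def find_by_symptom(tag, library=debug_examples):
    """Return the IDs of past failures that share a symptom tag."""
    return [ex["id"] for ex in library if tag in ex["tags"]]
```

In a real library you'd back this with a database index, but the point is the access pattern: symptom in, prior crime scenes out.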
Cross-reference debugging examples with the fixes you apply. When you solve a problem, link the solution back to the examples that revealed it. This builds institutional knowledge. Six months from now, when a new team member encounters a similar issue, they can see not just what went wrong but how it was fixed.
Building Trust
Some examples exist primarily to show non-technical stakeholders what the agent does and how it works.
These examples should tell a story. Walk through a complete scenario from start to finish. Show the input, explain what the agent considers, describe the decision it makes, and demonstrate the outcome. Make it concrete enough that someone who doesn’t understand AI can follow along.
Use stakeholder examples to manage expectations. If your agent handles routine cases but escalates complex ones, show examples of both. If it sometimes gets things wrong in predictable ways, document those too. Transparency builds trust more than pretending the system is perfect.
Update stakeholder examples when the agent’s behavior changes. These examples shape how people think about what the agent can do. Outdated examples create confusion and erode trust when reality doesn’t match the documentation.
Building Your Library From Day One
Start capturing examples before you spin up your first workflow. The examples come first. They define what success looks like.
In your first week, capture 20 examples by hand. Sit with the people who do the task today. Watch them work. Document exactly what they do. These 20 examples become your specification. They tell you what the agent needs to learn.
Set up your capture infrastructure early. The simple Slack workflow. The JSON schema. The staging and review process. Make it easy to add examples from day one because you'll never have time to go back and add them later.
Treat your example library as a product. It needs maintenance. Examples become stale. Workflows change. Policies update. Schedule quarterly reviews where you go through your library and archive or update examples that no longer reflect current reality.
The teams that ship great agentic AI systems aren't the ones with the fanciest agents and models. They're the ones that captured the right examples at the right time and built a library that grows smarter as the agent does. Start building yours today, because the best time to capture an example is the first time you see it.