Your AI Agent Has an Onboarding Problem
We treat AI agents like savants who should just figure it out. But the agents that actually work in production are the ones that get onboarded like new hires.
When you hire someone new, you don't hand them a laptop and say "be smart and figure it out."
You give them:
- Onboarding docs
- Standard operating procedures
- Access to the right tools
- Examples of good work
- Escalation paths when they're stuck
Yet that's exactly how most teams ship AI agents.
A system prompt. A handful of tools. And a prayer.
Then they're surprised when the agent is inconsistent, unreliable, and breaks in production.
The System Prompt Delusion
The default approach to making an agent better is predictable:
The agent isn't doing X right? Add more instructions to the system prompt.
So teams write longer prompts. Then longer ones. Then they hit 4,000 words and wonder why the agent still hallucinates, skips steps, and produces different outputs for the same input.
Here's the uncomfortable truth: a system prompt is not onboarding. It's a pep talk.
A 4,000-word system prompt is the equivalent of a hiring manager reading a monologue at a new employee for 30 minutes straight and expecting them to retain everything perfectly.
No human works this way. Neither do LLMs.
Why Prompts Fail at Scale
There are three structural reasons prompt-only agents break down.
1. Prompts Are Advisory, Not Enforcing
A prompt says "you should format responses as JSON."
The model usually does it. Except when the conversation gets long. Or the user asks something unusual. Or the model updates and behavior shifts slightly.
Prompts suggest behavior. They don't enforce it.
A new hire who reads the employee handbook once doesn't memorize every policy. They follow procedures because the procedures are embedded in their workflow — checklists, templates, approval flows.
Agents need the same thing.
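What a prompt can only suggest, the surrounding workflow can enforce. Here's a minimal sketch (TypeScript; the callModel function, the field names, and the retry policy are all illustrative, not a prescribed API): the JSON requirement is checked in code and retried on failure, so it no longer depends on the model remembering an instruction.
interface AnalysisResult {
  summary: string;
  rowCount: number;
}

// Validation happens in code: reject anything that isn't the shape we asked for.
function parseResult(raw: string): AnalysisResult | null {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.summary === "string" && typeof parsed.rowCount === "number") {
      return parsed as AnalysisResult;
    }
  } catch {
    // not valid JSON at all
  }
  return null;
}

// The retry loop is the enforcement the prompt alone can't provide.
async function getValidatedResult(callModel: (hint?: string) => Promise<string>): Promise<AnalysisResult> {
  for (let attempt = 0; attempt < 3; attempt++) {
    const raw = await callModel(attempt > 0 ? "Return only valid JSON." : undefined);
    const result = parseResult(raw);
    if (result) return result;
  }
  throw new Error("Model did not produce valid output after 3 attempts");
}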
2. Everything-Upfront Doesn't Scale
When you onboard a new hire, you don't dump the entire company knowledge base on them day one. You give them what they need for their first task. Then their second. Knowledge is progressively disclosed.
Most agent prompts do the opposite. They front-load everything:
- All instructions for all scenarios
- All edge cases
- All formatting rules
- All tool usage guidelines
The result:
- Token usage explodes
- The model "forgets" instructions buried in the middle of long prompts
- Irrelevant context competes with relevant context
- Reasoning quality degrades
This is the well-documented "lost in the middle" problem: models perform worst on information buried in the middle of a long context.
3. Prompts Can't Carry Procedures
Some tasks require multi-step procedures. Not just "what to do" but "how to do it, in what order, with what tools, handling what edge cases."
Consider a data analysis task. The procedure might be:
- Validate the uploaded file format
- Check for missing values and decide on imputation strategy
- Run descriptive statistics
- Generate visualizations for numerical columns
- Output a structured summary
A prompt can list these steps. But it can't:
- Guarantee the agent follows them in order
- Provide the actual scripts to execute
- Ensure the right dependencies are available
- Handle failures at each step differently
Prompts describe work. Procedures encode it.
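To make "encode" concrete, here's a rough sketch (TypeScript; the step names, scripts, and exec callback are hypothetical) of that procedure expressed as data plus a runner. The ordering and the per-step failure handling live in code, which a prompt can only describe.
// Each step declares its script and its own failure policy; order is fixed in data.
type Step = { name: string; script: string; onError: "abort" | "skip" | "retry" };

const procedure: Step[] = [
  { name: "validate",  script: "scripts/validate.py",  onError: "abort" },
  { name: "analyze",   script: "scripts/analyze.py",   onError: "retry" },
  { name: "visualize", script: "scripts/visualize.py", onError: "skip"  },
];

// exec is whatever runs a script in your environment (a sandbox call, a subprocess, ...).
async function runProcedure(
  file: string,
  exec: (script: string, arg: string) => Promise<string>,
): Promise<void> {
  for (const step of procedure) {
    try {
      await exec(step.script, file);
    } catch (err) {
      if (step.onError === "abort") throw err;                                   // stop the whole run
      if (step.onError === "retry") { await exec(step.script, file); continue; } // one retry, then propagate
      console.warn(`Step "${step.name}" failed, skipping`, err);                 // skip: log and move on
    }
  }
}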
What Good Onboarding Actually Looks Like
If prompts are pep talks, what does real onboarding for an agent look like?
The same thing it looks like for humans: structured, progressive, tool-equipped training.
Structured Knowledge Packages
Instead of a monolithic prompt, break domain knowledge into discrete units — each one representing a specific capability.
Think of these as the "playbooks" you'd hand a new hire:
data-analysis/
├── SKILL.md # When and how to use this capability
├── scripts/
│ ├── validate.py # Validation procedure
│ ├── analyze.py # Analysis procedure
│ └── visualize.py # Visualization procedure
└── requirements.txt # Dependencies needed
Each package contains:
- When to use it (description and trigger conditions)
- How to do it (step-by-step instructions)
- What tools to use (executable scripts)
- What's needed (dependencies, resources)
This is the concept Anthropic formalized as Agent Skills — and it's the most important architectural shift in how we build production agents.
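To ground that, the SKILL.md for the package above might look something like this. The frontmatter carries the name and description used for discovery, as in the Agent Skills format; the body layout here is illustrative rather than canonical:
---
name: data-analysis
description: Analyze uploaded CSV files, handle missing values, and return a structured statistical summary.
---

# Data Analysis

Use this skill when the user uploads a tabular file and asks for insights,
statistics, or a summary.

## Steps
1. Run scripts/validate.py to confirm the file parses as CSV.
2. Run scripts/analyze.py to impute missing values and compute statistics.
3. Run scripts/visualize.py on the numerical columns.
4. Explain the JSON output to the user in plain language.

## Edge cases
- Empty file or unreadable encoding: stop and ask for a new upload.
- Very large files: sample rows before analyzing and say so in the summary.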
Progressive Disclosure
The critical design principle: don't load everything at once.
Layer 1 — Discovery: The agent sees a short name and description. Minimal tokens. Always present.
Layer 2 — Instructions: When the agent decides a skill is relevant, it loads the full workflow.
Layer 3 — Resources: Scripts, templates, and references are accessed only during execution.
This mirrors how humans work. You don't memorize the entire employee manual. You know what resources exist and where to find them when needed.
For agents, this means:
// Layer 1: Agent sees "data-analysis" with a one-line description
// Token cost: ~20 tokens per skill
// Layer 2: Agent decides to use it, loads full instructions
// Token cost: ~200-500 tokens
// Layer 3: Agent executes scripts in a sandbox
// Token cost: 0 (runs outside the context window)
Compare this to stuffing everything into a system prompt:
// Traditional approach: everything upfront
// Token cost: ~2,000-5,000 tokens per capability
// × 10 capabilities = 20,000-50,000 tokens before the user says anything
Progressive disclosure keeps context small without sacrificing depth. The agent knows what it can do without paying the token cost of how to do everything upfront.
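In practice, Layer 2 can be as small as a function the agent triggers when it decides a skill is relevant. A minimal sketch (TypeScript/Node; the directory layout matches the package structure above, the function names are hypothetical):
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const SKILLS_DIR = "./skills";

// Layer 1: the always-present index, one name + one-line description per skill.
function skillIndex(): string {
  return readdirSync(SKILLS_DIR)
    .map((name) => {
      const skillFile = readFileSync(join(SKILLS_DIR, name, "SKILL.md"), "utf8");
      const description = skillFile
        .split("\n")
        .find((line) => line.startsWith("description:"))
        ?.replace("description:", "")
        .trim() ?? "";
      return `- ${name}: ${description}`;
    })
    .join("\n");
}

// Layer 2: the full instructions, loaded into context only when the agent asks for this skill.
function loadSkill(name: string): string {
  return readFileSync(join(SKILLS_DIR, name, "SKILL.md"), "utf8");
}
Layer 1 is assembled once on the server and dropped into the system prompt; Layer 2 enters the context only on demand.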
Executable Procedures
This is where agent onboarding diverges from human onboarding in a powerful way.
Humans read SOPs and interpret them. Agents can execute them directly.
Instead of telling an agent "use pandas to analyze the CSV and handle missing values with median imputation," you give it an actual script:
#!/usr/bin/env python3
"""Analyze a CSV file: impute missing numeric values, return a JSON summary."""
import json
import sys

import pandas as pd


def analyze(file_path):
    df = pd.read_csv(file_path)

    # Record missing-value counts before imputation so the report reflects the raw data
    missing_before = df.isnull().sum().to_dict()

    # Fill numeric gaps with the column median
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    return json.dumps({
        "shape": {"rows": len(df), "columns": len(df.columns)},
        "dtypes": df.dtypes.astype(str).to_dict(),
        "summary": df.describe().to_dict(),
        "missing_before": missing_before,
    }, indent=2)


if __name__ == "__main__":
    print(analyze(sys.argv[1]))
The agent doesn't interpret the procedure. It runs it. Same input, same output, every time.
This is the difference between:
- An agent that sounds competent
- An agent that behaves deterministically
The Reliability Gap
Let's make this concrete.
Task: "Analyze this CSV and give me insights."
Prompt-only agent (no onboarding):
- Run 1: Returns a narrative summary with bullet points
- Run 2: Returns a table with statistics
- Run 3: Tries to write Python but hallucinates a library name
- Run 4: Returns bullet points but misses the missing values
- Run 5: Gives a great answer (the one you showed in the demo)
Onboarded agent (with a data-analysis skill):
- Run 1: Validates file, runs analysis script, returns structured JSON
- Run 2: Same
- Run 3: Same
- Run 4: Same
- Run 5: Same
The onboarded agent isn't smarter. It has less room to improvise on the parts that need to be deterministic, and full freedom to reason on the parts that benefit from intelligence (like interpreting results and answering follow-up questions).
This is the same principle behind good SOPs in any organization. You don't constrain thinking. You constrain process.
Why This Isn't Just "Better Prompting"
Someone reading this might think: "This is just prompt engineering with extra steps."
It's not. There are structural differences:
1. Separation of concerns. Knowledge lives outside the conversation. It's versioned, tested, and deployed independently — like code, not like text pasted into a chat window.
2. Conditional loading. Knowledge enters the context only when relevant. A prompt-only agent pays the token cost for every capability on every request.
3. Executable artifacts. Skills contain runnable scripts, not just descriptions of what to run. The agent delegates execution to deterministic code instead of generating it on the fly.
4. Portability. A well-structured skill works across models. Switch from GPT-4o to Claude to Llama — the skill doesn't change. Try that with a prompt tuned for one model's quirks.
This is the difference between configuration and architecture.
How to Start Onboarding Your Agents
You don't need a framework to start. The mental model is enough.
Step 1: Identify Repeatable Tasks
Look at what your agent does repeatedly. Data analysis. Report generation. Code review. Email drafting. These are your skill candidates.
Step 2: Write the Playbook
For each task, document:
- When should the agent use this skill?
- What are the exact steps?
- What tools or scripts are needed?
- What does good output look like?
- What edge cases exist?
Step 3: Make It Executable
Turn descriptions into scripts. Instead of "calculate statistics using pandas," write the actual Python script. Instead of "format the output as JSON," write a template.
Step 4: Load Progressively
Don't stuff all playbooks into the system prompt. Give the agent an index of available skills (name + description), and load the full playbook only when the agent decides it's relevant.
const systemPrompt = `You are a data assistant. You have access to these skills:
- data-analysis: Analyze CSV files with statistical summaries
- visualization: Generate charts from datasets
- data-cleaning: Clean and transform messy data
When a user's request matches a skill, use it.`;
// Full skill instructions are injected only when the agent
// calls the corresponding tool
Step 5: Isolate Execution
If your skills include executable scripts, run them in sandboxes. Don't execute user-influenced code on your production server.
This is where infrastructure matters. You need isolated environments per session, automatic dependency installation, file handling, and cleanup.
You can build this yourself with Docker and a queue, or use managed infrastructure like Bluebag that handles sandbox orchestration, session isolation, and skill execution out of the box.
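If you do roll it yourself, the core of it is small. A minimal sketch (TypeScript/Node; the image, resource limits, and paths are illustrative): each script runs in a short-lived container with no network access, a read-only mount of the skill code, and a per-session scratch directory.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Run one skill script in a throwaway container: no network, capped resources,
// skills mounted read-only, and a scratch directory scoped to this session.
// Dependency installation from each skill's requirements.txt is omitted for brevity.
async function runSkillScript(sessionId: string, scriptPath: string, inputFile: string): Promise<string> {
  const { stdout } = await run("docker", [
    "run", "--rm",
    "--network", "none",
    "--memory", "512m",
    "--cpus", "1",
    "-v", `${process.cwd()}/skills:/skills:ro`,
    "-v", `/tmp/sessions/${sessionId}:/work`,
    "-w", "/work",
    "python:3.12-slim",
    "python", `/skills/${scriptPath}`, inputFile,
  ], { timeout: 60_000 });
  return stdout;
}
Installing each skill's dependencies, collecting output files, and cleaning up sessions are the parts that grow from here; that's the infrastructure work mentioned above.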
The Bigger Picture
The AI agent ecosystem is going through the same evolution that web development went through.
Early web apps were monolithic scripts with everything inline — HTML, CSS, JavaScript, SQL queries, business logic. It worked for demos. It didn't scale.
Then we got separation of concerns. MVC frameworks. Component architectures. CI/CD. Testing. The code got more structured, and applications got more reliable.
AI agents are at the "monolithic script" stage. Everything lives in one prompt. Knowledge, behavior, formatting rules, tool instructions, edge cases — all inline, all competing for attention in the context window.
Agent Skills are the separation of concerns moment for AI agents.
Knowledge becomes modular. Behavior becomes testable. Capabilities become portable. The agent prompt shrinks to what it should have always been: personality and routing logic.
Conclusion
The agents that work in production aren't the ones with the longest prompts or the biggest context windows.
They're the ones that got onboarded.
Structured knowledge. Progressive disclosure. Executable procedures. Isolated execution.
The same principles that make human teams reliable make AI agents reliable.
Stop writing longer prompts. Start onboarding your agents.
Resources
- Anthropic: Equipping Agents with Skills — The original engineering post on Agent Skills
- Agent Skills Specification — Open standard for cross-platform agent skills
- Lost in the Middle (arXiv) — Research on how LLMs handle long contexts
- Bluebag — Production infrastructure for agent skills
Building agents that need to work beyond the demo? Start here.