Anthropic Was Right About Agent Skills
Prompts don’t make agents reliable. Structure does. A deep dive into what Anthropic got right with Agent Skills.
Why prompts were never enough for production agents
Here's something that most people building AI agents run into:
Ask the same agent the same question twice, and you won't always get the same answer.
Nothing changed. Same input, but different outputs.
This is the nature of large language models.
LLMs are probabilistic systems. They don't guarantee identical outputs for identical inputs, especially when you introduce multi-step reasoning, tools, or long conversations.
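You can watch this happen with a few lines of code. Here's a minimal sketch using Anthropic's Python SDK; the model ID is a placeholder and the prompt is invented for illustration:

```python
# pip install anthropic
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = "Write a one-sentence tagline for a note-taking app."

# Same model, same prompt, default sampling settings. Run it twice
# and the two completions will frequently differ.
for run in (1, 2):
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use your model ID
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"Run {run}: {response.content[0].text}")
```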
This explains why so many "impressive" agents quietly fall apart in production.
The Problem with Prompt-Only Agents
The industry's first instinct was obvious:
Add a better system prompt.
So we did:
- Longer instructions
- Stricter rules
- Examples packed into context
System prompts started looking like mini textbooks. This approach has some impact, but it runs into two hard limits.
1. Prompts Don't Enforce Behavior
Prompts suggest what an agent should do. They don't encode what it must do.
If an agent is asked to "Generate a promo video script", a prompt-only agent improvises:
- structure
- tone
- CTA
- priorities
Run it twice and you'll often get two very different answers.
The agent isn't wrong. It's inferring, because nothing tells it how this task is actually done.
2. More Context ≠ More Reliability
When prompts fail, teams add more context.
But context windows are fragile:
- too much text
- too many competing instructions
- irrelevant details always loaded

Eventually:

- token usage explodes
- reasoning degrades
- reliability still doesn’t improve
At some point, the problem becomes how knowledge is structured and loaded.
The Gap Everyone Missed

LLMs are excellent at general knowledge.
They know:
- what React is
- what marketing is
- what a promo video looks like

But production systems require procedural knowledge:

- how your team writes React
- how your company structures promos
- how decisions are made, step by step
This gap between general knowledge and procedural expertise is where most agents fail.
What Anthropic Got Right
In their engineering blog post, "Equipping agents for the real world with Agent Skills", Anthropic didn't try to solve this by "prompting harder".
They changed the unit of intelligence.
Instead of putting everything into one prompt, they introduced Agent Skills: structured packages that represent real workflows, not just instructions.
A Skill is a representation of how work is actually done.
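Concretely, a skill is just a directory. Here's a minimal sketch of one for the promo-script example used below, loosely following the SKILL.md convention from Anthropic's spec: YAML frontmatter holds the discovery metadata, the markdown body holds the instructions, and everything else is a resource. The skill's content is invented for illustration:

```python
from pathlib import Path

# A minimal, invented skill package.
# Layout: skills/promo-video-script/{SKILL.md, resources/...}
# The frontmatter (name, description) is what the agent always sees;
# the body and resources are loaded only on demand.
SKILL_MD = """\
---
name: promo-video-script
description: Writes promo video scripts using our team's structure, tone, and CTA rules.
---

# Promo Video Script

1. Open with a one-line hook (12 words max).
2. Name the problem the product solves.
3. Show the product in action, in 2-3 beats.
4. Close with the approved CTA in resources/cta.md.
"""

skill = Path("skills/promo-video-script")
(skill / "resources").mkdir(parents=True, exist_ok=True)
(skill / "SKILL.md").write_text(SKILL_MD)
(skill / "resources" / "cta.md").write_text("Start your free trial at example.com.\n")
```

The point isn't the file format. It's that structure, tone, and the CTA are now encoded, not inferred.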
Progressive Disclosure (The Core Insight)
The most important part of Anthropic's approach is progressive disclosure.
Rather than loading everything all the time, Skills are accessed in layers:
- Discovery
  - Name + description
  - Minimal tokens
  - Always present
- Instructions
  - Full workflow
- Resources
  - Scripts, templates, references
  - Accessed only when needed
This design does something subtle but critical:
It keeps the model's context small without sacrificing depth.
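Here's a rough sketch of how a host application might implement those three layers over skill packages like the one above. None of this is Anthropic's implementation; it just makes the layering concrete:

```python
from pathlib import Path

def discovery_layer(skills_root: str = "skills") -> str:
    """Layer 1: only name + description from each SKILL.md frontmatter.
    This is the cheap, always-present index the model sees every turn."""
    lines = []
    for skill_md in Path(skills_root).glob("*/SKILL.md"):
        meta = {}
        body = skill_md.read_text()
        # Naive frontmatter parse: key: value pairs between '---' markers.
        if body.startswith("---"):
            header = body.split("---", 2)[1]
            for line in header.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        name = meta.get("name", skill_md.parent.name)
        lines.append(f"- {name}: {meta.get('description', '')}")
    return "Available skills:\n" + "\n".join(lines)

def instructions_layer(name: str, skills_root: str = "skills") -> str:
    """Layer 2: the full workflow, loaded only when the skill is invoked."""
    body = (Path(skills_root) / name / "SKILL.md").read_text()
    return body.split("---", 2)[2] if body.startswith("---") else body

def resource_layer(name: str, relative_path: str, skills_root: str = "skills") -> str:
    """Layer 3: scripts, templates, references, read only when the
    instructions actually point at them."""
    return (Path(skills_root) / name / relative_path).read_text()
```

Every conversation pays for the discovery index; the token cost of the full workflow and its resources is incurred only on the turns that need them.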
Why This Changes Agent Reliability
Let's use a simple example.
Task: Generate a promo video script
Without Skills:
- the agent invents structure
- output varies between runs
- no enforced format
- no consistent decision logic
With a Skill:
- the structure is predefined
- the workflow is explicit
- edge cases are handled consistently
The agent isn't "thinking harder".
It's executing a known procedure.
That's the difference between:
- an agent that sounds smart
- an agent that behaves predictably
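Putting the pieces together, prompt assembly for this task might look like the sketch below, reusing the hypothetical discovery_layer and instructions_layer functions from earlier (again, an illustration, not Anthropic's implementation): the discovery index rides along on every request, and the full promo workflow enters context only for this task.

```python
from anthropic import Anthropic

client = Anthropic()

# Always present: the cheap index of every skill (layer 1).
system_prompt = discovery_layer()

# Loaded for this task only: the full promo workflow (layer 2).
system_prompt += "\n\nActive skill instructions:\n" + instructions_layer("promo-video-script")

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user", "content": "Generate a promo video script for our app."}],
)
print(response.content[0].text)
```

The output still comes from a probabilistic model, but the procedure it executes is now fixed: same steps, same format, same CTA source, every run.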
The Real Shift Anthropic Made
Anthropic didn't try to make agents smarter.
They made them more reliable.
They accepted the reality that:
- LLMs are non-deterministic
- prompts are advisory
- context is fragile
And they designed a system that works with those constraints instead of fighting them.
Why This Matters
Most agents today fail for the same reason: they rely on text to do the job of structure.
Agent Skills showed a different path:
- encode workflows
- load knowledge conditionally
- separate behavior from conversation
This wasn't a prompt optimization.
It was an architectural correction.
Conclusion
The industry tried to solve probabilistic systems with more text.
Anthropic didn't.
They recognized that reliable agents come from structured, procedural knowledge.
That insight is why Agent Skills matter, and why they've since been published as an open standard for cross-platform portability.
And it's why, going forward, agents that work in production will look very different from the ones we're building today.