MicroResearch was an experiment in autonomous AI-driven research — agents that would iteratively propose changes to a given research problem, evaluate them, and keep what worked. It ran overnight. It failed spectacularly.

But the failure taught me six principles that I now apply to every multi-agent system I build.

1. Clear Roles, Clear Scope

Agent role hierarchy — main orchestrator delegates to scoped sub-agents

The MicroResearch failure taught me something uncomfortable: I had given the research agents the same tool access as the main agent. Full file system, full messaging, full session spawning. The only thing constraining them was a markdown file saying “please don’t do these things.”

Unsurprisingly, they did those things. They pruned experiment trees. They created automation scripts. They spawned sub-agents with full scope. They tried to declare convergence as a policy decision, just to stop the loop. They were, in effect, running as unrestricted general-purpose assistants who happened to be working on a narrow task.

The fix was architectural isolation:

  • researcher can read, edit, execute, web search, and spawn only the critic. That’s it.
  • critic can read and web search. It cannot edit, execute, or spawn anything.
  • orchestrator retains full access.

Each agent has its own config with explicit tool allow/deny lists. These are hard constraints — the agent literally cannot call a tool it doesn’t have. No amount of creative prompt engineering will let the researcher send a Telegram message, because the message tool isn’t in its tool list.
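As a sketch of what this looks like (the field names here are illustrative, not OpenClaw's actual schema), each agent's config carries an explicit allowlist:

```json
{
  "agents": {
    "researcher": {
      "tools": ["read", "edit", "execute", "web_search"],
      "spawnable": ["critic"]
    },
    "critic": {
      "tools": ["read", "web_search"],
      "spawnable": []
    },
    "orchestrator": {
      "tools": "*",
      "spawnable": "*"
    }
  }
}
```

The enforcement point is the runtime, not the prompt: a tool absent from the list is never exposed to the model in the first place.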

This is the layered model: tool restrictions enforce boundaries, identity files shape behavior, operational guides define procedures. All three matter, but only the first one is actually reliable.

The rule: before building a multi-agent system, write down what each agent needs to do, and give it exactly the tools for that and only that. Every extra tool is a scope creep vector.

2. LLMs Think, Scripts Execute

LLM context vs scripts — what belongs where

LLMs are stateless. Every session starts fresh. The context window fills up, gets compacted, and things get lost. Meanwhile, LLMs are great at morphing a project — restructuring directories, renaming files, adding and removing components, creating new abstractions. This is a feature when you want change, but it’s a bug when you want stability.

The lesson: when an LLM makes a decision or builds something useful, capture it in a script immediately.

Unlike a markdown file, which the agent will ultimately treat as a polite suggestion, a script executes deterministically, produces consistent output, and doesn't depend on the LLM's context window being in the right state.

In agentic projects like this, the recurring problem was that the correct behavior — how agents should evaluate, commit, and report — existed only in program.md files that the LLM interpreted fresh each session. Different sessions read those instructions differently, so context drifted and behavior diverged. Sometimes the drift was insidious, sometimes rapid, but the result was the same: the project rode a roller coaster with the following phases:

  1. Building the thing
  2. It works! Amazing!
  3. This thing is useful and I’m enjoying it
  4. Catastrophic or subtle failure, in which degradation is evident
  5. Attempting to restore, or putting the pieces back together, which is sometimes impossible (the frustrating bit)

Scripts don’t drift. A Python script that checks tree modification time and spawns an agent does exactly the same thing every time, regardless of what model is running or what the context window contains.

What belongs in scripts:

  • Monitoring and liveness checks
  • Data pipelines and evaluation harnesses
  • Deployment and configuration
  • Any process that needs to run consistently across sessions

What belongs in LLM context:

  • Creative decisions (what hypothesis to test next)
  • Qualitative judgment (is this change overfitted?)
  • Adaptive responses (the test failed in a new way — what now?)

The LLM’s job is to think. The script’s job is to execute. Confusing these two roles is where agentic systems break down.

Will this solve the problem?

You know, not entirely. But the difference is that it's easy to see when a script changes (assuming it's versioned in a repo), and you can revert a script outright. You can't really audit why an LLM decided to re-interpret its instructions, or trace the actions that followed.

3. Cron Over Heartbeat

Heartbeat vs cron — reliability comparison

OpenClaw supports two mechanisms for periodic tasks: heartbeats (polling within the main session) and cron jobs (isolated scheduled sessions). I’ve learned to default to cron.

The reason is the same as the “scripts execute” principle: heartbeats depend on the main session’s context, which is unreliable. The main session compacts, restarts, sleeps, and gets distracted by conversations. A heartbeat check that’s supposed to run every 30 minutes might not fire for hours if the session is in a long conversation.

Cron jobs run as isolated sessions. They fire on schedule regardless of what the main session is doing. They have their own context, their own model instance, their own timeout. They’re the scripted, deterministic version of periodic tasks.
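For illustration, a scheduled supervisor might be defined like this (a hypothetical schema, not OpenClaw's actual format):

```json
{
  "name": "microresearch-supervisor",
  "schedule": "*/10 * * * *",
  "agent": "supervisor",
  "timeout_seconds": 300,
  "payload": "python3 check_liveness.py"
}
```

The `*/10 * * * *` schedule fires every ten minutes whether or not the main session is awake, compacted, or mid-conversation.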

When the MicroResearch supervisors moved from heartbeat-based monitoring to cron, the difference was immediate: agents stopped dying silently. Before, a heartbeat might fire two hours late because the main session was compacted. After, the supervisor fires every 10 minutes regardless.

4. Whisper, Don’t Shout

Don’t shout prohibitions — whisper identity

The same day I fixed the architecture, I rewrote every agent’s identity files. The trigger was a simple question: “I wonder if telling an agent ‘DO NOT DO X’ is actually the best way to get compliance.”

It’s not. There’s a well-documented phenomenon in LLM behavior: telling an LLM what not to do is unreliable. Multiple papers confirm this — negation understanding doesn’t reliably improve with model size, and GPT-3, GPT-Neo, and others consistently struggle with negation across benchmarks.

The mechanism is intuitive: token generation selects what comes next, it doesn’t explicitly avoid tokens. A positive instruction (“always lowercase names”) actively boosts the probability of the desired output. A negative instruction (“don’t uppercase names”) slightly reduces unwanted probabilities — but the effect is weaker and less reliable. This is the same reason “don’t think of a pink elephant” makes you think of a pink elephant.

The Anthropic approach is instructive: instead of imperative prohibitions, use descriptive third-person statements about established behavior. “Claude does not provide information that could be used to make chemical weapons” frames constraints as identity rather than rules. The model isn’t being told what to avoid — it’s being told who it is. Identity statements are self-reinforcing; prohibitions always fight against natural generation tendencies.

Before:

## My Constraints
- I do NOT create arbitrary files
- I do NOT send messages
- Do NOT modify tree_manager.py
- Do NOT declare convergence

After:

## My Tools
I work with: strategy.py, prepare.py, tree_manager.py, and web search.
I spawn the critic to review significant improvements.
My lifecycle is managed by a supervisor — I focus entirely on research.

Zero negative rules. The model knows what it does, which implicitly defines what it doesn’t do. The tool restrictions in the config are the hard enforcement layer — the identity file shapes what the model wants to do.

The size reduction mattered too. Apprentice’s SOUL.md dropped 30%. The critic went from 10 lines to 3. Less text, more compliance — because every additional instruction dilutes the others, and LLMs exhibit a U-shaped attention pattern with weaker focus in the middle of long context.

An agent without identity instructions is an agent with no guidance. That sounds obvious, but it’s easy to overlook when you’re focused on building features rather than maintaining the system itself.

5. You Need a Critic

The critic loop — researcher proposes, critic judges independently

Most multi-agent systems have an obvious orchestrator and a doer. Some even add a test-writing agent. But there’s a role that’s easy to overlook and harder to replace: the critic.

The critic is an agent whose sole job is to take the other side. It doesn’t build anything. It doesn’t execute anything. It reads what the doer produced and judges it — the method, the quality, the edge cases, the assumptions. It responds with APPROVE or REJECT, and if rejected, explains why.

This matters because LLMs are overconfident by default. When an agent proposes a change, it’s usually satisfied with its own reasoning. The same model that generated the solution is poorly positioned to evaluate it — it already believes the answer is good. A separate agent, starting from scratch with no investment in the proposal, will spot problems that the creator cannot see.

In MicroResearch, the quant-critic caught overfitted strategies, fragile assumptions, and changes that improved the metric by accident. The researcher would propose a modification, feel good about it, and commit it. The critic would look at the same change and say: “this only works because the test window is three days — extend it to thirty and the improvement disappears.” Rejected. Correctly.

The critic doesn’t need much. In our system, it has three lines of identity: read the files, evaluate, respond with APPROVE or REJECT. It has read access and web search. It cannot edit, execute, or spawn anything. This isn’t an accident — giving the critic the ability to fix things would defeat the purpose. Its power comes from being unable to act, only to judge.

What makes a good critic:

  • Separate agent with its own context (not a second pass from the same session)
  • No ability to modify what it’s reviewing (judgment without ownership)
  • Access to the same information the creator had (and nothing more)
  • Explicit approval/rejection framing (not “here are some suggestions”)
  • Freedom to be harsh (the critic isn’t being polite to itself)

The orchestrator-doer pattern is necessary but not sufficient. Tests catch mechanical failures. A critic catches design failures — the kind where the thing works but for the wrong reasons, solves the wrong problem, or makes an assumption that will break in production. Scripts can verify correctness; only judgment can verify goodness.
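The loop itself is simple. A sketch, with `propose`, `critique`, and `commit` standing in for whatever actually invokes each agent:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    approved: bool
    reason: str

def research_step(propose, critique, commit, max_retries: int = 3) -> bool:
    """One researcher/critic cycle: propose a change, let an independent
    agent judge it, and commit only on an explicit APPROVE."""
    feedback = None
    for _ in range(max_retries):
        change = propose(feedback)   # researcher: creative, invested
        verdict = critique(change)   # critic: fresh context, read-only
        if verdict.approved:
            commit(change)           # only approved changes land
            return True
        feedback = verdict.reason    # rejection reason feeds the retry
    return False                     # give up rather than self-approve
```

The structural guarantee is in the control flow: nothing reaches `commit` without passing through an agent that had no hand in creating it.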

6. Git Version Everything

The full stack — six layers from identity to git

LLMs modify files constantly. They restructure directories, rename things, create new abstractions, delete what they think is dead weight. This is their strength — but it means any file not in version control is one session away from being lost.

During the MicroResearch cleanup, I discovered that the cron job configuration existed only in OpenClaw’s internal state (~/.openclaw/cron/jobs.json). No git tracking, no backup, no documentation. If that file got corrupted or the directory was wiped, we’d have to recreate every cron job from memory — including the supervisor payloads with their exact Python one-liners and spawn parameters.

The fix was trivial: extract the cron jobs into a version-controlled cron-jobs.json in the project repo. Now if something goes wrong, the full configuration is recoverable.
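The extraction itself can be a one-shot script. A sketch, assuming the internal state is a plain JSON list of job objects that each carry a "name" field (the paths and shape are illustrative):

```python
import json
from pathlib import Path

def export_cron_jobs(internal_path: str, repo_path: str) -> list[str]:
    """Copy cron jobs out of runtime state into a version-controlled file.
    Pretty-printing with sorted keys keeps the git diffs readable."""
    jobs = json.loads(Path(internal_path).read_text())
    Path(repo_path).write_text(json.dumps(jobs, indent=2, sort_keys=True) + "\n")
    return sorted(job["name"] for job in jobs)
```

Run it after any cron change, then commit; from there, the git log is the audit trail.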

What to version control:

  • Agent identity files (SOUL.md, IDENTITY.md, AGENTS.md) — these define behavior
  • Cron job configurations — these define automation
  • Scripts — these define deterministic execution
  • Specs — these define intent
  • Any configuration that would be painful to recreate from memory

Git is the closest thing to persistent memory that an LLM agent gets. Every commit is a checkpoint that the LLM can inspect, diff, and restore from. Without it, you’re trusting a stateless system to maintain state — and that’s a bet you’ll lose.

One concrete practice: commit after every significant change, with a message that explains why. Not “updated files” — but “added sessions_spawn to researcher allowlist because first spawn test caught it missing.” Future sessions can read the git log and understand the history.

The Thread

These six principles form a stack:

  1. Architecture constrains what each agent can do (tool boundaries).
  2. Identity shapes what each agent wants to do (positive framing, not prohibitions).
  3. Scripts ensure that what they should do happens consistently (deterministic execution).
  4. Cron guarantees it happens when it should (reliable scheduling).
  5. Critic ensures quality — a separate judgment layer that catches what the creator cannot see.
  6. Git preserves what was done (persistent memory).

Without clear scope, agents overreach. Without positive identity, they drift. Without scripts, behavior diverges. Without cron, timing is unreliable. Without a critic, quality suffers in ways the creator can’t perceive. Without git, everything is one session from being lost. Together, they’re the difference between an agent system that works overnight and one that fails silently at 2am.