When I started building my Chief of Staff system, I did what most people do. I read about AI memory management and started setting up a vector database.
About three days in, I stopped and asked myself: what problem am I actually solving?
The problem is context persistence. I want each agent to know relevant things from past sessions without me re-explaining them every time. That’s a real problem worth solving. But vector DBs bring a lot of machinery — embedding models, retrieval tuning, cosine similarity thresholds, cost at scale — to what was, for my use case, a very small problem.
My agents are six people running a personal content and research operation. The memory requirements are modest. What they need to know fits in a text file.
The two-tier pattern
The memory architecture I use now has two tiers.
Tier 1 — Daily Logs. After every task, the agent writes a log to agents/{name}/memory/YYYY-MM-DD.md. What did I do? What decisions did I make? What would I want to remember next time? Append-only. One file per day.
Tier 2 — Curated Memory. Each agent has a memory/MEMORY.md file. I edit this during the weekly performance review. It contains distilled learnings, the things that actually change future behavior. Not everything from the logs, just what matters.
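Here’s a minimal sketch of the two tiers in Python, assuming the directory layout above; the helper names and the seven-day tail of recent logs are my choices, not part of any framework:

```python
from datetime import date
from pathlib import Path

AGENTS_DIR = Path("agents")  # assumed layout: agents/{name}/memory/

def append_daily_log(agent: str, entry: str) -> None:
    """Tier 1: append a task log to today's file. Append-only, one file per day."""
    log_dir = AGENTS_DIR / agent / "memory"
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"{date.today().isoformat()}.md"
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry.rstrip() + "\n\n")

def load_memory(agent: str, recent_days: int = 7) -> str:
    """Tier 2 plus a short tail of Tier 1: the text handed to the agent at session start."""
    mem_dir = AGENTS_DIR / agent / "memory"
    curated = mem_dir / "MEMORY.md"
    parts = [curated.read_text(encoding="utf-8")] if curated.exists() else []
    for log_path in sorted(mem_dir.glob("????-??-??.md"))[-recent_days:]:
        parts.append(log_path.read_text(encoding="utf-8"))
    return "\n\n".join(parts)
```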
The human curation step is what makes this work. Without it, you end up with memory bloat: the agent reads a 40-page log history every session, the context fills up with old information, and the signal-to-noise ratio degrades. The curation forces a judgment call: what from last week is actually worth carrying forward?
I’m the one making that call. That’s intentional. The agent can tell me what happened. The judgment about what’s worth remembering is mine.
Anthropic’s research and the teams behind Manus and OpenClaw (Shubham Saboo’s platform) converged on the same two-tier architecture independently. When multiple separate implementations arrive at the same pattern, that’s a decent signal the pattern is right.
The cost math
Local disk costs roughly $0.02 per gigabyte per month. Managed vector databases run $50 to $200 per gigabyte per month.
For my use case (six agents writing daily logs over eight months), I’d estimate I have maybe 30MB of memory files. That’s effectively free to store. The equivalent vector DB setup would have cost me somewhere between $1.50 and $6 per month in storage, which isn’t much, but the setup friction and ongoing maintenance burden are the real costs.
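Spelled out, the storage side of that arithmetic looks like this (the per-gigabyte prices are the rough figures above):

```python
memory_gb = 30 / 1024              # ~30 MB of markdown memory files, as a fraction of a GB

local_disk = memory_gb * 0.02      # ≈ $0.0006/month on local disk: effectively free
vector_db_low = memory_gb * 50     # ≈ $1.46/month at the low end of managed vector DB pricing
vector_db_high = memory_gb * 200   # ≈ $5.86/month at the high end
```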
More importantly: I can read my memory files. I can grep them. I can version-control them with git. When something unexpected happens with an agent’s output, I can look back through the daily logs and find exactly when a behavior changed. With a vector DB, that kind of investigation would require querying the database and interpreting distance scores.
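grep usually does the job, but as an illustration, a few lines of Python can answer the “when did this start?” question directly; the agent name and search phrase here are hypothetical:

```python
from pathlib import Path

def first_mention(agent: str, phrase: str) -> str | None:
    """Scan the dated daily logs in order and return the first day a phrase appears."""
    for log_path in sorted(Path(f"agents/{agent}/memory").glob("????-??-??.md")):
        if phrase.lower() in log_path.read_text(encoding="utf-8").lower():
            return log_path.stem  # the filename is the date
    return None

print(first_mention("dwight", "confidence interval"))  # e.g. "2025-06-14", or None
```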
Transparency isn’t just an aesthetic preference. When you’re debugging agent behavior, being able to read the raw memory is much faster than trying to understand what was retrieved from an embedding store.
The Saboo pattern — standing on shoulders
The overall architecture I’m running comes from Shubham Saboo (@Saboo_Shubham_ on X). His Chief of Staff agent, Monica, manages a team on his platform OpenClaw. His framing — six agents, a coordinator, weekly performance reviews, the SOUL.md / AGENTS.md / MEMORY.md separation — is what I built on.
I want to be specific about this. The Chief of Staff pattern as applied to personal agent orchestration is Saboo’s contribution. I adapted it for Claude Code and for my own content and research operation. The memory architecture I described above follows his two-tier approach. The performance review loop I’ll describe below is his pattern applied with my specifics.
Building on someone else’s pattern isn’t copying. It’s how good software gets built. But you should know who did the original thinking.
The performance review loop
This is the piece that surprised me most. When I first built the system, I thought the interesting engineering was in the agent definitions, the SOUL.md files, the tool configs, the MCP integrations.
Six months in, I think the performance review loop is the most important part of the whole system.
The weekly cycle: Jim reads each agent’s daily logs and recent output. He grades each agent against a rubric: quality of output, adherence to brand voice, handling of edge cases, appropriate flagging of uncertainty. He writes an individual review to agents/{name}/performance/YYYY-MM-DD-review.md and rolls the individual reviews up into a consolidated report. I read the consolidated report, provide feedback, and Jim updates each agent’s SOUL.md and AGENTS.md.
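As a sketch of what one review looks like as data (the rubric dimensions come from the cycle above; the field names, the 1-to-5 scale, and the markdown layout are my assumptions, not Jim’s literal format):

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path

@dataclass
class Review:
    agent: str
    week_ending: date
    # Rubric dimensions from the weekly cycle, each scored 1-5 (the scale is an assumption).
    output_quality: int
    brand_voice: int
    edge_case_handling: int
    uncertainty_flagging: int
    notes: str = ""

    def write(self) -> Path:
        """Write the review to the path the weekly cycle expects."""
        path = Path(f"agents/{self.agent}/performance/{self.week_ending.isoformat()}-review.md")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(
            f"# Weekly review: {self.agent}\n\n"
            f"- Output quality: {self.output_quality}/5\n"
            f"- Brand voice: {self.brand_voice}/5\n"
            f"- Edge cases: {self.edge_case_handling}/5\n"
            f"- Uncertainty flagged appropriately: {self.uncertainty_flagging}/5\n\n"
            f"{self.notes}\n",
            encoding="utf-8",
        )
        return path
```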
Without that loop, the system stagnates. Agents run the same patterns over and over. Problems that show up in week one are still showing up in week eight because nothing in the harness ever changed. The review is what turns “a collection of agents” into “a team that gets better.”
This is the “treat your agent like a new hire, not a tool” framing from Saboo. New hires get onboarding, feedback, and performance reviews. They improve. Agents that don’t get that treatment just keep doing what they’re doing.
What surprised me
I expected the hardest part to be the initial setup: writing the SOUL.md files, configuring the MCP integrations, building the content pipeline. That was about a weekend of work.
The ongoing work of reading the reviews, providing feedback, and updating the instruction files turned out to be easier than I expected and more important than I predicted.
The system got noticeably better over the first six weeks just from review cycles. Karen’s copy got tighter when I added a specific anti-AI editorial pass to her AGENTS.md. Dwight’s research briefs got more reliable when I added explicit guidance about hedging uncertain statistics. Kelly’s tweet adaptations improved when I identified that she was editorializing the punchline before it landed.
All of those are findings from performance reviews. None of them would have happened if the review loop didn’t exist.
The simple question I ask before adding complexity
Every few weeks, I consider whether some part of the system should get more sophisticated. An embedding model for retrieval. A proper task queue. Semantic search over the memory files.
The question I ask: what problem does this solve that I’m actually experiencing right now?
Most of the time, the answer is “I’m not experiencing that problem, I just read about it.” The markdown files are fine. The two-tier memory is working. The performance review is running.
The simplest thing that works is still working. That’s worth protecting.
