• Why I Stopped Defaulting to OpenAI

    If you look at my GitHub history, there’s an arc.

    The earliest projects use OpenAI. AI Resume Improver — one of my first AI-integrated projects — runs on GPT-4. It uses the OpenAI SDK, an /api/analysis/analyze route, and a multi-phase pipeline. Classic early-2025 pattern.

    The middle-period projects switch to Gemini. AI Newsy, Personal Newsletter, the content-gen project — all Gemini. gemini-2.0-flash for summarization, Google Generative AI library, GitHub Actions on a cron schedule.

    The current projects are almost all Claude. IndieOS, PillarPost, SocialButter — all using @anthropic-ai/sdk. The Chief of Staff system is pure Claude. RSS Feeds mentions Claude Code in its README.

    I didn’t make these switches based on benchmarks. I made them based on what kept working better for the specific problem in front of me. That’s a different thing.

    Why I started with OpenAI

    In 2025, GPT-4 was the default for anyone building AI-integrated products. The API was stable, the documentation was good, the examples were everywhere. If you wanted to build something and ship it, OpenAI was the obvious starting point.

    That’s not a criticism. The defaults exist for reasons. I built AI Resume Improver because I wanted to learn how to integrate an AI API into a web product. GPT-4 was the right choice for that: well-documented, mature SDK, lots of examples. I wasn’t trying to pick the best model for production at scale. I was trying to learn.

    Why I moved to Gemini

    AI Newsy was a different kind of project. I wanted something that ran on a schedule: automated news digestion, AI-generated summaries, delivered by email. GitHub Actions as the orchestration layer, Python as the execution layer.

    At the time, Gemini Flash was faster and cheaper for the summarization tasks I was doing, and Google’s infrastructure for scheduled Python jobs integrated well with the GitHub Actions setup I was using. It wasn’t a philosophical switch, it was a performance-and-cost decision for a specific workload.
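
    The shape of that setup, sketched: a workflow on a cron trigger that runs a Python script. This is a minimal illustration of the pattern, not the actual AI Newsy workflow; the file names, schedule, and secret names are placeholders.

    ```yaml
    # .github/workflows/daily-digest.yml — minimal sketch of the pattern
    # (illustrative file and script names, not the real AI Newsy workflow)
    name: daily-digest

    on:
      schedule:
        - cron: "0 13 * * *"   # once a day; pick your own time
      workflow_dispatch: {}     # allow manual runs too

    jobs:
      digest:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.12"
          - run: pip install -r requirements.txt
          - run: python digest.py   # fetch sources, summarize with Gemini, send the email
            env:
              GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
    ```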

    I used Gemini for Personal Newsletter too, mostly out of inertia from AI Newsy. They were similar architectures solving similar problems. Same stack made sense.

    Why I landed on Claude

    The shift started with content work. When I built PillarPost, a tool that takes raw ideas and generates LinkedIn post variations, I needed something that could hold a nuanced content framework, understand subtle tone differences between post types, and produce output that actually sounded like me rather than like an AI.

    I’d been following Shubham Saboo’s work (@Saboo_Shubham_), who originated the Chief of Staff agent pattern I’ve built on. His stack runs on Claude. That gave me a practical reason to test it seriously.

    The difference I noticed, and this is personal experience rather than benchmark data: Claude handled ambiguous or open-ended writing tasks better than what I’d seen from GPT-4 at the time. The outputs felt more considered. When I gave it a content framework and asked it to generate variations, it actually used the framework rather than pattern-matching to a generic LinkedIn post format.

    For the Chief of Staff system, I wasn’t making a model choice, I was choosing where to build the harness. Claude Code is a native environment for Claude models. The agent primitives (subagents, SOUL.md, AGENTS.md, slash commands) are Claude-native. Using Claude models in a Claude Code harness isn’t a coincidence; it’s the intended pattern.

    Where OpenAI still wins for me

    This isn’t a conversion story. I still use OpenAI for two things.

    YouTube Transcript uses OpenAI Whisper for audio transcription. Whisper is still the best tool I’ve found for turning YouTube audio into clean, punctuated transcripts. I tested Claude’s audio capabilities for this workload. Whisper produced better results. I’m not switching for the sake of consistency.
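
    For reference, the Whisper call itself is small. A minimal sketch with the OpenAI Python client (not the project’s actual pipeline; the filename is a placeholder):

    ```python
    # Minimal Whisper transcription sketch (not YouTube Transcript's actual code).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("episode-audio.mp3", "rb") as audio_file:  # placeholder filename
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )

    print(transcript.text)  # clean, punctuated text comes back in one field
    ```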

    For long-running terminal tasks and agentic shell loops, GPT-5.5 (released April 23, 2026) is genuinely stronger: Terminal-Bench 82.7% versus Claude Opus 4.7’s 69.4%. If I have a task that involves extended shell execution, unattended browsing, or multi-step command-line work, I’ll route it there. The benchmark gap at that specific job is real.

    The point is picking the right tool for the job, not picking a team and staying loyal. I was defaulting to OpenAI because that’s what I knew. Now I’m defaulting to Claude for the orchestration and content work because that’s what keeps working better. Both of those defaults are provisional.

    The trust angle

    There’s a dimension to this that goes beyond benchmarks. Anthropic’s April 2026 pricing split tests (Claude Code was removed from the $20 Pro plan before being partly rolled back) were a reminder that trust in a vendor involves more than whether their model performs. It involves whether they communicate well, whether they grandfather existing customers when plans change, whether they treat their users like adults.

    The April 23 postmortem (three Claude Code regressions, acknowledged honestly after weeks of community frustration) was, ironically, a trust-builder. A company that publishes a thorough postmortem and resets subscriber usage limits is doing something many vendors don’t. It doesn’t erase the regression. But it’s a different response than denial.

    I wrote separately about what I took from that incident. For this series, the short version: the model I use most matters less than whether I can build on the vendor’s platform with reasonable confidence that the platform is being maintained honestly.

    What this means if you’re just starting

    If you’re building your first AI-integrated project today, OpenAI is still a defensible default. The API is mature, the documentation is excellent, the SDKs are well-maintained.

    If you’re building agent systems (orchestrated, multi-turn, persona-driven), Claude Code is where I’d start. The agent primitives feel native there in a way they don’t elsewhere.

    If you’re doing audio transcription, Whisper is still the best option I’ve tested.

    None of that is permanent. The model landscape shifts fast enough that “best for this job right now” is probably the only honest framing. Build the harness with portability in mind (LiteLLM as the model-routing layer, MCP for tool portability) so swapping models doesn’t require rewriting your system.
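
    A sketch of what that routing layer can look like with LiteLLM: one function, model chosen by task type. The model IDs are placeholders; swap in whatever your providers expose.

    ```python
    # Minimal LiteLLM routing sketch: the harness decides which model gets the task.
    # Model IDs are placeholders, not recommendations.
    from litellm import completion

    ROUTES = {
        "terminal": "openai/gpt-4o",                        # long-running shell/agentic work
        "content": "anthropic/claude-sonnet-4-20250514",    # drafting and orchestration
    }

    def run_task(task_type: str, prompt: str) -> str:
        model = ROUTES.get(task_type, ROUTES["content"])
        response = completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    ```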

    The model I started with eight months ago is not the model I run my most important work on today. That’s fine. It’s supposed to change.


  • MCP Is the Glue I Didn’t Know I Needed

    I want to tell you about a Tuesday morning.

    Dwight finishes a research brief. The brief lands in intel/data/. Jim dispatches Karen to write a blog post from it. Karen drafts the post, saves it to content/drafts/, and publishes it to Notion (a real Notion URL, in my publishing workspace, with the right page type and the right parent). Jim sends me a Telegram message: “Draft ready. Here’s the link.”

    I’m making coffee. My phone buzzes. I open the link.

    Nothing I just described required me to open a browser, copy text, navigate to Notion, paste anything, or check a queue. The agents went from research to draft to published page to notification with no manual handoff.

    That’s MCP. Not the spec, the experience.

    What MCP actually is

    Model Context Protocol is an open standard Anthropic published in December 2024. It defines how AI models connect to external tools and data sources. The analogy that works best for me: it’s like a universal adapter. Instead of each AI vendor building custom integrations for every tool (Notion, Slack, GitHub, Gmail, Google Calendar, etc.), MCP defines a standard protocol that any tool can implement once.

    As of early 2026, MCP had 97 million monthly downloads. I’m personally one of those downloads. Knowing you’re one of 97 million users of something is a different relationship to a statistic than reading it in a press release.

    The spec has been adopted widely: Google’s A2A protocol and Microsoft’s Agent Framework 1.0 both interoperate with MCP. The Linux Foundation governs the standard now. It’s not going anywhere.

    How it works in my setup

    In Claude Code, MCP servers are configured in .mcp.json at the project root. Each agent can also specify mcpServers in its YAML frontmatter — scoped so that agent only gets the tools it needs.
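
    For reference, a .mcp.json entry looks roughly like this. The server package and environment variable names vary per server and are illustrative here, not a copy of my file:

    ```json
    {
      "mcpServers": {
        "notion": {
          "command": "npx",
          "args": ["-y", "@notionhq/notion-mcp-server"],
          "env": { "NOTION_TOKEN": "ntn_..." }
        }
      }
    }
    ```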

    Here’s what my current integrations look like:

    Notion — the publishing gate. Every piece of finished content (blog drafts, research briefs, newsletter sections) gets published to Notion before a task is considered done. The Notion MCP server handles the API calls. Karen publishes blog drafts. Dwight publishes research briefs. Rachel publishes newsletter drafts. None of them need to open a browser to do this.

    This matters more than it sounds. Before this integration, “done” meant “file saved to repo.” After it, “done” means “Notion URL exists.” That’s a meaningfully different standard. A file in a repo is a working artifact. A Notion page is a deliverable I can open, share, and review.

    Telegram — mobile access. The Telegram MCP integration lets agents send me messages, and lets me dispatch tasks from my phone. When a long overnight task finishes, I get a Telegram message with the result. When I’m away from my desk and want to queue something, I can send a message and it lands in the session.

    The Claude Code Channels docs describe this as the canonical mobile-access pattern for Claude Code. /plugin install telegram@claude-plugins-official is the command. It requires Bun. Setup takes maybe 20 minutes.

    Gmail and Calendar — context. Not for sending emails — just for reading. When an agent needs to know what’s on my calendar today, or whether I’ve already addressed something in an email thread, it can check. This kind of ambient context makes briefings more relevant. Dwight’s daily intel summary is more useful when it knows I have calls in the morning and can flag high-priority items accordingly.

    What I tried that didn’t work

    Canva.

    The vision was: agent designs the visual header for a blog post, publishes it to Canva, links it to the Notion draft. Design integrated into the publishing workflow.

    In practice: the designs had garbled text, wrong brand colors, and elements I couldn’t edit in the way I expected. The integration worked technically — the agent could interact with Canva — but the outputs weren’t usable. I’d fix them manually and think “this is saving me no time.”

    I stopped using Canva through MCP for finished design work. I use it occasionally for rough templating, with the expectation that I’ll fix it before anything gets published.

    This is worth saying out loud: MCP makes the connection possible, but the quality of the connection depends on what the external tool can actually do programmatically. Some tools (Notion, Telegram, GitHub) are well-suited to API-driven interaction. Others (design tools that require pixel-level judgment) aren’t there yet.

    The failure taught me something: MCP is most valuable for information exchange and structured actions. It’s less valuable for anything where the quality of the output depends on subjective visual judgment.

    The protocol layer view

    Here’s the part that’s easy to miss from inside the day-to-day: MCP changes the architecture of what’s possible, not just what’s convenient.

    Before MCP, each AI vendor had to build custom integrations for every tool. Claude had a Notion integration. GPT had a Notion integration. They were separate, maintained separately, with different APIs and different behaviors.

    MCP means you build the Notion server once, and any MCP-compatible model can use it. You build the Slack server once, and it works across vendors. You build the GitHub server once, and it works in Claude Code, Cursor, the OpenAI Agents SDK, Microsoft Agent Framework.

    This is what makes harnesses portable. If you switch from Claude to GPT-5.5 for a specific task, the tools stay the same. You’re swapping the model, not the entire integration stack.

    That’s not theoretical. It’s what LiteLLM plus MCP actually enables in practice: model-portable, tool-portable agent systems. Build the harness once, route to the best model for the job.

    Why the number matters

    97 million monthly downloads of the MCP spec.

    What that means, practically: every major tool worth connecting to has either already built MCP support or is building it. The network effects are in place. The bet has been made. You’re not an early adopter anymore, you’re using the thing that won.

    The right mental model is infrastructure. When you use HTTPS, you’re not thinking about the spec. You’re thinking about whether your browser connected. MCP is becoming that, the protocol you don’t think about because it just works.

    I think about it a little, because I built the connections. But when I’m dispatching a task to Karen and a Notion URL appears in my Telegram 20 minutes later, I’m not thinking about the Model Context Protocol. I’m thinking about whether the draft is good.

    That’s the point.


  • What Cowork Changed About How I Run Agents

    In March 2026, I asked Claude to analyze my code projects, all 16 repos, and generate a structured intelligence report. Complexity estimates, patterns across projects, the AI model evolution visible in my GitHub history over eight months.

    The report came back while I was doing something else. I opened it later and it was exactly what I’d asked for: organized by project, cross-referenced for patterns, with an executive summary and a “content flywheel” diagram that traced how several of my projects fed into each other.

    The file header says: “Report generated by Claude in Cowork mode — 2026-03-24.”

    I want to talk about what that meant, and why it changed how I think about which tasks are worth assigning to agents.

    What Cowork actually is

    Cowork is Anthropic’s managed-agent platform. The relevant difference from a regular Claude Code session: agents run in the cloud. Your laptop doesn’t need to be on. You can close the lid, and the work continues.

    Claude Code Routines, launched April 14, 2026 and currently in research preview, is Anthropic’s answer to the same problem at the task-scheduling level. A Routine is a saved configuration: a prompt, one or more source repos, a set of connectors. You set a trigger (scheduled, API, GitHub webhook) and it runs without you.

    OpenAI Codex has had similar functionality in its Automations (scheduled tasks, API triggers, parallel execution), which went generally available in April 2026. Three major platforms converged on the same pattern at almost the same time: define the agent, set the trigger, let the cloud handle execution.

    The architecture converging is not a coincidence. The constraint being removed is real.

    The constraint I didn’t know I had

    Before cloud execution, every agent task required a running session. Which meant: me at my computer, a terminal open, actively waiting.

    That constraint shaped what I asked agents to do more than I realized. A 40-minute research task is different when you have to sit there than when you can walk away. A task that requires fetching and processing 16 repos is something you’d think twice about on a Wednesday afternoon but wouldn’t hesitate to kick off before dinner if you know it’ll be done by morning.

    I wasn’t making a principled decision to only assign certain tasks to agents. I was making a friction-based decision. The friction was masquerading as judgment.

    When the friction goes away, you start discovering the actual scope of what’s worth automating. For me, that meant asking agents to do things I’d previously considered too long to wait for: comprehensive project analyses, multi-source research syntheses that pull from a dozen different documents, weekly reporting that takes 30+ minutes but runs on a schedule.

    What’s now scheduled

    My current Routines setup (still expanding): the daily research pipeline runs on a cron schedule. Dwight scans Reddit’s trending topics, produces research briefs, and updates the intel briefing. That used to require me to kick it off. Now it runs whether I’m at my desk or not.

    Routine limits vary by plan. The base allowance is 5 Routines/day; since Claude Code has been removed from the $20 Pro tier, that in practice means Max at $100/month minimum. Max 5x allows 15. Team and Enterprise go to 25. For my use case, 5 is enough for daily operations. If I were running this for a team, I’d want more.

    The practical ceiling isn’t the plan limit, it’s what you can realistically review. A Routine that runs and produces output you don’t read isn’t useful. I check my briefings every morning. Five scheduled tasks I actually read beats twenty I scroll past.

    What it means for how you think about agent work

    The mental shift I didn’t expect: when tasks run asynchronously, you start thinking about agents more like employees and less like tools.

    When you’re running a session and watching the output appear, you’re supervising. You’re in the loop at each step. The agent is running for you, in your presence.

    When you set a Routine to run overnight and review the output in the morning, the dynamic shifts. The agent did work. You’re reviewing what it produced. That’s a management relationship, not a supervision relationship.

    Shubham Saboo’s framework for this (he’s the originator of the Chief of Staff pattern I’ve built on) is to treat agents like new hires: give them a workspace, scoped credentials, clear instructions, and a review cadence. The cloud execution model makes that analogy more literal. The new hire didn’t need you watching over their shoulder all day. They just needed to know what to do.

    The honest limitations

    Async execution introduces new failure modes. A synchronous task fails and you see it immediately. An async task fails and you might not find out until you look at the output the next morning.

    Stopping conditions matter more. Max turns. Token budgets. Clear success criteria in the Routine’s prompt. Without those, a stuck agent can run for a long time before you notice.

    I’ve had two failed overnight Routines in the last two months. One got stuck in a loop fetching a URL that kept timing out. One produced output that was technically correct but missed the scope: it answered the question I asked rather than the question I meant to ask. Both were harness failures: the first needed a retry limit, the second needed a better task boundary.

    I fixed both. They’re in the AGENTS.md now.

    The report that started this

    That code-projects analysis from March (16 repos, a structured intelligence report, generated while I was away from my desk) was the moment that made this concrete for me. Not an exciting demo. Just a task I’d been meaning to do for months that was too tedious to sit through, now done.

    The file is at intel/data/aj-code-projects.md. I still reference it. It’s where the model evolution table in Post 6 of this series comes from.

    Cloud execution didn’t change what agents can do. It changed which tasks I was willing to ask them to do. That’s a bigger shift than it sounds.


  • Building a Chief of Staff Out of Markdown Files

    When I started building my Chief of Staff system, I did what most people do. I read about AI memory management and started setting up a vector database.

    About three days in, I stopped and asked myself: what problem am I actually solving?

    The problem is context persistence. I want each agent to know relevant things from past sessions without me re-explaining them every time. That’s a real problem worth solving. But vector DBs bring a lot of machinery — embedding models, retrieval tuning, cosine similarity thresholds, cost at scale — to what was, for my use case, a very small problem.

    My agents are six people running a personal content and research operation. The memory requirements are modest. What they need to know fits in a text file.

    The two-tier pattern

    The memory architecture I use now has two tiers.

    Tier 1 — Daily Logs. After every task, the agent writes a log to agents/{name}/memory/YYYY-MM-DD.md. What did I do. What decisions did I make. What would I want to remember next time. Append-only. One file per day.

    Tier 2 — Curated Memory. Each agent has a memory/MEMORY.md file. I edit this during the weekly performance review. It contains distilled learnings, the things that actually change future behavior. Not everything from the logs, just what matters.
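
    A Tier 1 entry is nothing fancy. Something like this, written to agents/dwight/memory/ with the date as the filename (contents invented for illustration):

    ```markdown
    ## What I did
    - Scanned trending sources and produced a research brief on agent memory patterns.
    - Updated the intel briefing with three new items.

    ## Decisions I made
    - Skipped two paywalled sources rather than summarizing from their headlines.

    ## What I'd want to remember next time
    - The intel briefing changed its section order last week; follow the new template.
    ```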

    The human curation step is what makes this work. Without it, you end up with memory bloat: the agent reads a 40-page log history every session, the context fills up with old information, and the signal-to-noise ratio degrades. The curation forces a judgment call: what from last week is actually worth carrying forward?

    I’m the one making that call. That’s intentional. The agent can tell me what happened. The judgment about what’s worth remembering is mine.

    Anthropic’s researchers, and separately the teams behind Manus and OpenClaw (Shubham Saboo’s platform), converged on the same two-tier architecture independently. When multiple separate implementations arrive at the same pattern, that’s a decent signal the pattern is right.

    The cost math

    Local disk costs roughly $0.02 per gigabyte per month. Managed vector databases run $50 to $200 per gigabyte per month.

    For my use case, six agents writing daily logs over eight months, I’d estimate I have maybe 30MB of memory files. That’s effectively free to store. The equivalent vector DB setup would have cost me somewhere between $1.50 and $6 per month in storage, which isn’t much, but the setup friction and ongoing maintenance burden are the real costs.

    More importantly: I can read my memory files. I can grep them. I can version-control them with git. When something unexpected happens with an agent’s output, I can look back through the daily logs and find exactly when a behavior changed. With a vector DB, that kind of investigation would require querying the database and interpreting distance scores.

    Transparency isn’t just an aesthetic preference. When you’re debugging agent behavior, being able to read the raw memory is much faster than trying to understand what was retrieved from an embedding store.

    The Saboo pattern — standing on shoulders

    The overall architecture I’m running comes from Shubham Saboo (@Saboo_Shubham_ on X). His Chief of Staff agent, Monica, manages a team on his platform OpenClaw. His framing — six agents, a coordinator, weekly performance reviews, the SOUL.md / AGENTS.md / MEMORY.md separation — is what I built on.

    I want to be specific about this. The Chief of Staff pattern as applied to personal agent orchestration is Saboo’s contribution. I adapted it for Claude Code and for my own content and research operation. The memory architecture I described above follows his two-tier approach. The performance review loop I’ll describe below is his pattern applied with my specifics.

    Building on someone else’s pattern isn’t copying. It’s how good software gets built. But you should know who did the original thinking.

    The performance review loop

    This is the piece that surprised me most. When I first built the system, I thought the interesting engineering was in the agent definitions, the SOUL.md files, the tool configs, the MCP integrations.

    Six months in, I think the performance review loop is the most important part of the whole system.

    The weekly cycle: Jim reads each agent’s daily logs and recent output. He grades each agent against a rubric: quality of output, adherence to brand voice, handling of edge cases, appropriate flagging of uncertainty. He writes an individual review to agents/{name}/performance/YYYY-MM-DD-review.md. I read the consolidated report, provide feedback, and Jim updates each agent’s SOUL.md and AGENTS.md.
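
    A review file ends up looking something like this (scores and notes invented for illustration, but the rubric is the real one):

    ```markdown
    # Karen, weekly review, 2026-04-12 (illustrative)

    - Output quality: 4/5. Two drafts shipped; one needed a structural rewrite.
    - Brand voice: 3/5. The second draft drifted generic in the closing section.
    - Edge cases: 4/5. Flagged a missing stat instead of estimating it.
    - Flagging uncertainty: 5/5.

    ## Recommended instruction changes
    - Add an explicit editorial pass for closing paragraphs to Karen's AGENTS.md.
    ```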

    Without that loop, the system stagnates. Agents run the same patterns over and over. Problems that show up in week one are still showing up in week eight because nothing in the harness ever changed. The review is what turns “a collection of agents” into “a team that gets better.”

    This is the “treat your agent like a new hire, not a tool” framing from Saboo. New hires get onboarding, feedback, and performance reviews. They improve. Agents that don’t get that treatment just keep doing what they’re doing.

    What surprised me

    I expected the hardest part to be the initial setup, writing the SOUL.md files, configuring the MCP integrations, building the content pipeline. That was about a weekend of work.

    The ongoing work, reading the reviews, providing feedback, updating the instruction files, turned out to be easier than I expected and more important than I predicted.

    The system got noticeably better over the first six weeks just from review cycles. Karen’s copy got tighter when I added a specific anti-AI editorial pass to her AGENTS.md. Dwight’s research briefs got more reliable when I added explicit guidance about hedging uncertain statistics. Kelly’s tweet adaptations improved when I identified that she was editorializing the punchline before it landed.

    All of those are findings from performance reviews. None of them would have happened if the review loop didn’t exist.

    The simple question I ask when someone wants to add complexity

    Every few weeks, I consider whether some part of the system should get more sophisticated. Different embedding model for retrieval. A proper task queue. Semantic search over the memory files.

    The question I ask: what problem does this solve that I’m actually experiencing right now?

    Most of the time, the answer is “I’m not experiencing that problem, I just read about it.” The markdown files are fine. The two-tier memory is working. The performance review is running.

    The simplest thing that works is still working. That’s worth protecting.


  • My Claude Code Setup, In Plain English

    I want to walk through my actual Claude Code setup. Not what Claude Code can do in theory, what mine does in practice, what every file is for, and what I’d do differently if I were starting today.

    This is a specific setup built on a specific pattern. The pattern was originated by Shubham Saboo (@Saboo_Shubham_ on X), whose Monica agent, a Chief of Staff that manages a team of specialized AI agents, is the conceptual source of what I built. Monica runs on OpenClaw, Saboo’s platform. My version runs on Claude Code. The architecture is his. I want to be clear about that.

    The shape of the thing

    The repo is called chief_of_staff. At the root: CLAUDE.md (instructions for the main session), AGENTS.md (shared rules every agent follows), MEMORY.md (Jim’s curated long-term memory), HEARTBEAT.md (system health).

    Inside .claude/agents/: six subagent definition files, one each for Dwight, Karen, Kelly, Erin, Rachel, and Ross. (Jim, the Chief of Staff, is defined by CLAUDE.md at the root and runs as the main session.)

    Each agent file is a markdown document with a YAML frontmatter block and a system prompt. The frontmatter specifies: name, description, which tools the agent can use, which model it runs on, whether it has MCP server access.
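
    A trimmed-down example of the shape. The frontmatter keys follow Claude Code’s subagent format; the values here are illustrative rather than copied from my repo.

    ```markdown
    ---
    name: dwight
    description: Research agent. Produces sourced research briefs and updates the intel briefing.
    tools: WebSearch, WebFetch, Read, Write
    model: sonnet
    ---

    You are Dwight, the research agent.

    - Produce research briefs with cited sources. Save them to intel/data/.
    - Flag uncertain statistics explicitly rather than estimating.
    - Read memory/MEMORY.md before starting, and write a daily log when you finish.
    ```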

    That’s the architecture Paweł Huryn described in April 2026: “Sub-agents (.claude/agents/). Each folder is a self-contained agent. Its own CLAUDE.md.” One sentence that captures the whole pattern. I’d add: and its own tool list, its own memory, its own identity.

    The three files every agent has

    Each agent gets three supporting files beyond the definition file. This is Saboo’s pattern, and I’ve found reasons to keep all three.

    SOUL.md — identity, principles, persona. Who is this agent. What does it care about. How does it talk. For my research agent Dwight, SOUL.md describes someone intense and thorough. For Karen the copywriter, it’s sharp and direct. These files almost never change. When they do, it’s because a performance review revealed that an agent’s identity was producing the wrong outputs.

    AGENTS.md — operational instructions. What this agent does, how it does it, what it should never do. This file changes all the time. Every time something goes wrong with an agent’s output (it drifts off-task, it uses a format I didn’t want, it estimates when it should have flagged uncertainty), I update AGENTS.md. The line I added after Karen started speculating on data gaps: “Flag research gaps explicitly rather than estimating. Coordinate with Dwight before finalizing data-dependent claims.” That’s a harness decision born from a specific failure.

    memory/MEMORY.md — curated long-term memory. I edit this. The agent reads it. It captures things the agent needs to know about its role, the brand, past decisions, lessons from past tasks. Not everything, just what actually changes future behavior.

    The SOUL.md / AGENTS.md split is the piece I’d explain most to someone starting out. SOUL.md is who the agent is. AGENTS.md is how it works. One is identity, the other is procedure. When something goes wrong with output quality, it’s almost always an AGENTS.md issue. When something goes wrong with tone or voice, it might be SOUL.md. Keeping them separate means I can tune one without touching the other.

    The daily log and why it exists

    Every agent writes a daily log after completing a task. The format is loose: what did I do, what decisions did I make, what would I want to remember next time.

    These files go to agents/{name}/memory/YYYY-MM-DD.md. They accumulate. During the weekly performance review, Jim (my chief of staff agent) reads them, grades each agent’s output, and writes up a review. I read the review, provide feedback, and Jim updates the agent’s SOUL.md and AGENTS.md based on my notes.

    That loop — task → daily log → performance review → updated instructions — is what Saboo means when he talks about managing agents like employees. His explicit advice: don’t hand them the keys to everything on day one. Give them a workspace. Give them scoped access. Review their work. Update their instructions. Saboo put it this way: “Treat your agents like new hires, not tools. Give them just enough context. Then get out of their way.”

    The daily log is how I keep the review from being based on my memory of what happened. It’s the agent’s own record.

    Slash commands

    Slash commands live in .claude/commands/. Each one is a markdown file. When I type /review in a Claude Code session, it reads review.md and runs the performance review workflow.

    I have three: /review (weekly review for all agents or a named one), /status (dashboard of current priorities and system health), /feedback {agent} {notes} (apply feedback to a specific agent’s files).

    The review one is the most important. Before it existed, I would periodically look at agent output, notice something was off, and maybe update the instructions. No real cadence. When /review existed as a slash command — meaning there was a defined process I could trigger in one step — the review actually happened weekly. The command didn’t create discipline. It lowered the friction enough that discipline could form.
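
    For reference, .claude/commands/review.md is just a prompt in a file. Mine boils down to something like this (paraphrased for illustration, not pasted):

    ```markdown
    Run the weekly performance review for $ARGUMENTS (default: all agents).

    1. Read each agent's daily logs in agents/{name}/memory/ from the past week,
       plus their recent output.
    2. Grade against the rubric: output quality, brand voice, edge-case handling,
       and whether uncertainty was flagged appropriately.
    3. Write each review to agents/{name}/performance/YYYY-MM-DD-review.md.
    4. Produce a consolidated report and wait for my feedback before changing
       any SOUL.md or AGENTS.md files.
    ```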

    MCP

    Model Context Protocol is how Claude Code connects to external systems. In my setup, the main integrations are Notion (publishing pipeline, completed drafts go here before they’re considered done), Telegram (mobile access, I can dispatch tasks from my phone and get updates in a channel), and Gmail/Calendar for context that helps agents understand my schedule and priorities.

    Each agent’s frontmatter can specify which MCP servers it has access to. Karen has Notion access because she needs to publish drafts. Dwight has web search access because research requires it. Not every agent needs every integration. The scoping is intentional. Post 7 in this series goes deeper into how MCP changed the way I think about workflows.

    What I’d do differently

    Two things.

    First: I’d write the task boundary paragraphs earlier. For each agent: this agent handles X. It succeeds when Y. It fails when Z. It should never touch W. I didn’t write those until I had specific failures to respond to. I could have written them before the failures if I’d thought harder about scope upfront.

    Second: I’d start with one agent, not six. Saboo’s warning about this is explicit: “Do not try to build six agents on day one.” I more or less followed it: I built Jim first, then Dwight, then the rest. But even that felt fast. The agents that work best are the ones I’ve run through more review cycles. The ones I haven’t stress-tested enough still surprise me.

    The file that explains everything

    If you want to understand how the system works, read CLAUDE.md at the repo root. It’s the instruction file for the Jim session, the main orchestration layer. It describes my team, my principles for dispatching work, how I conduct reviews, how I handle feedback. It’s about 180 lines.

    HumanLayer’s guidance on CLAUDE.md suggests staying under 300 lines; frontier models reliably follow 150-200 instructions before compliance degrades. I stay well under that, and I trim aggressively when something falls out of use.

    The whole setup is built on the observation from Post 1: the harness matters more than the model. Every file in this repo is part of the harness.


  • Claude Code, Codex, and Cursor: Three Tools, Three Jobs

    People ask me which tool they should use for AI coding. The honest answer is that I use all three, and the question I ask isn’t “which one is better”, it’s “which problem am I trying to solve right now?”

    Here’s how they actually divide in my workflow.

    Cursor: where I write code

    Cursor is my IDE. I’ve been using it since Cursor 3 shipped in early 2026, and the main thing I use it for is code editing: inline suggestions, multi-file edits, refactoring within a project.

    The interaction model that matters most for my work is Agent mode (Cmd+I on Mac). This is Cursor’s autonomous mode: it reads files, edits multiple files in a single pass, runs terminal commands. It’s an agent in the sense that it makes decisions, not just completions.

    But Cursor’s agent is one agent. It’s the IDE’s agent, shaped by rules you define. Mine lives in .cursor/rules/ — project-level .mdc files that specify how it should behave in this repo. When I’m working in a new project, my first step is usually writing the rules file: what’s the stack, what patterns do I want it to follow, what should it avoid.
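
    A rules file is short. Roughly this shape, in Cursor’s .mdc format (the frontmatter keys are Cursor’s; the contents are an illustrative sketch, not my exact rules):

    ```markdown
    ---
    description: Project conventions for this repo
    globs: ["**/*.ts", "**/*.tsx"]
    alwaysApply: true
    ---

    - Stack: Next.js + TypeScript. Follow the existing folder structure under src/.
    - Prefer small, focused edits; don't refactor files you weren't asked to touch.
    - Never edit generated files or anything under node_modules/.
    - Ask before adding a new dependency.
    ```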

    I use Plan Mode for anything more complex than a small edit. Before touching code, the agent researches the files, asks clarifying questions, writes a detailed plan. I review the plan. Then it executes. That discipline, research before action, saves me a lot of cleanup.

    Where Cursor shines: when the task is well-scoped and the context is the current project. Code completion, refactoring, writing tests for something I just built. It’s fast, it’s IDE-native, and $20/month is a reasonable entry point.

    Where I don’t use it: for system-level agent work. Cursor’s agent isn’t a persona. There’s no SOUL.md for Cursor’s agent. It doesn’t have its own memory between sessions in the same way Claude Code subagents do. For the kind of orchestration work I do with the Chief of Staff system (dispatching tasks to named agents, running performance reviews, managing the content pipeline), Claude Code is the right tool.

    Claude Code: where I build systems

    Claude Code is where my actual operating system lives. The repo I’ve described in earlier posts in this series (Jim, Dwight, Karen, Kelly, and the rest) runs entirely in Claude Code.

    What makes Claude Code different from Cursor for this use case is the subagent architecture. Each agent in .claude/agents/ is a separate entity with its own model specification, tool access, and system prompt. When Jim dispatches a task to Dwight, Dwight spins up in his own context window with his own tool access. He doesn’t know what Karen is doing. He doesn’t have access to Jim’s full memory. The scoping is the point.

    This architecture (a named agent with a defined identity, specific tools, scoped access) doesn’t have a direct equivalent in Cursor. Cursor’s rules system gets you some of the way there, but it’s one agent shaped by context, not multiple agents with defined boundaries.

    Claude Code is also where I interact with the system on a daily basis. I open a session, Jim is already loaded from CLAUDE.md, I type a request, and the Chief of Staff decides how to route it. That experience, where the AI already knows the context because the context is written into the repo, is the practical payoff of everything I described in Posts 1 through 3.

    The trade-off: Claude Code requires Claude Max ($100/month as of March 2026). That’s a meaningful cost difference from Cursor’s $20. If I were starting fresh today with a limited budget and a single project, I’d probably start with Cursor. If I were running an agent team for an ongoing operation, Claude Code is where I’d be.

    Codex: for the work I don’t watch

    Codex is OpenAI’s cloud-based coding agent. The thing that makes it different from Claude Code and Cursor is execution model: it runs in the cloud, without your laptop, triggered by a schedule, an API call, or a GitHub webhook.

    I use Codex for tasks I want to run asynchronously. An agent-generated code project analysis. Research that takes 30-40 minutes and doesn’t need my attention. Parallel tasks where I want multiple agents working at the same time.

    The practical constraint before cloud execution was that any agent task required my laptop to be open and a session to be running. That’s a significant constraint on what you’re willing to ask agents to do. If it takes 45 minutes and requires me to be at my computer, I might decide it’s not worth it. If it runs while I’m doing something else and I review the output later, the calculus changes.

    Claude Code Routines (launched April 14, 2026, still in research preview) solves the same problem from the Anthropic side. Scheduled runs, API triggers, GitHub webhook support. I’ve started using those too — more on this in Post 5, which is specifically about Cowork and what changed when agents stopped needing my laptop to be on.

    Where I still use OpenAI in my stack: YouTube Transcript, one of my earlier projects, uses OpenAI Whisper for audio transcription. Whisper is still the best tool I’ve found for that specific job. I’m not religious about staying on one vendor. I use what works.

    The honest comparison

    The benchmark-scoreboard version of this comparison is: Claude Opus 4.7 leads on code review and architecture work (SWE-Bench Verified 87.6%), GPT-5.5 leads on terminal-based and long-running agentic tasks (Terminal-Bench 82.7%). Cursor is primarily a coding assistant, not a benchmark-optimized research subject.

    That framing matters for people choosing at the API level. For most day-to-day work, both frontier models are good enough that the harness and the task design matter more than the model.

    The more useful comparison for my actual use: each tool has a native habitat.

    Cursor’s native habitat is the code editor, file-level context, fast iterations on existing code. Claude Code’s native habitat is agent orchestration, persona-driven systems, multi-agent pipelines. Codex’s native habitat is cloud-executed async tasks, parallel work, scheduled automation.

    None of them does the other two’s jobs as well. That’s why I use all three.

    A note on frameworks

    I deliberately left LangChain, CrewAI, and the agent frameworks out of this post. The developer series has a separate piece on the framework decision (spoiler: for most use cases, the right answer is no framework at all). Claude Code, Cursor, and Codex are product surfaces, not framework choices. They’re worth understanding on their own terms.


  • The Harness Matters More Than the Model

    I spent the first few weeks of building with AI doing what most people do: reading benchmarks. GPT-4 or Claude? Which one passed more coding tests? Which one was smarter?

    It took me longer than I’d like to admit to realize I was solving the wrong problem.

    The model is maybe 20% of the outcome. The harness is the other 80%. I had the ratio completely backwards.

    What I mean by “the harness”

    Update: Addy Osmani, Director at Google Cloud AI, put it directly in April 2026: “A decent model with a great harness beats a great model with a bad harness. The gap between what today’s models can do and what you see them doing is largely a harness gap.”

    The harness is everything around the model. System prompts. Memory structure. Tool access. Stopping conditions. The description you write for each tool, which the model reads before deciding what to call. The SOUL.md file that tells an agent who it is. The AGENTS.md file that tells it what it’s supposed to do and what it should never touch.
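
    To make “the description you write for each tool” concrete: at the API level, the description is literally a field the model reads before choosing an action. A minimal sketch with the Anthropic Python SDK follows; the tool itself is hypothetical.

    ```python
    # The tool description is harness, not model: the model reads it before deciding
    # whether and how to call the tool. The tool here is hypothetical.
    import anthropic

    client = anthropic.Anthropic()

    tools = [{
        "name": "save_research_brief",
        "description": (
            "Save a finished research brief to intel/data/. Only use this after "
            "sources are cited and uncertain statistics are flagged. Never use it "
            "for drafts or partial notes."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "filename": {"type": "string", "description": "kebab-case, .md extension"},
                "content": {"type": "string"},
            },
            "required": ["filename", "content"],
        },
    }]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # any current model; the harness is the point
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "Brief me on agent memory patterns."}],
    )
    ```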

    You don’t see the harness when you demo a product. You see the model’s output. But the harness is where the work actually lives.

    Viv Trivedy put it in a line that’s been stuck in my head since I first saw it: “If you’re not the model, you’re the harness.”

    That’s your job. Not to pick the smartest model. To build the thing that makes any decent model work.

    The evidence I can’t argue with

    Here’s what moved me from “I think the harness matters” to “the harness is the whole game”:

    Terminal Bench 2.0 is a standardized benchmark for how well agents complete terminal-based tasks. One team on its leaderboard was ranked in the top 30. They changed only their harness: same model, different system prompts, different tool structure. They jumped to the top 5.

    Osmani documented a different example: one developer optimized their Claude Code harness using CLI and Insforge Skills. Same Claude model before and after. The workload moved from 10.4 million tokens and 10 errors at $9.21, to 3.7 million tokens and zero errors at $2.81. A 3.27x cost reduction. Zero model change.

    The variable in both cases wasn’t which AI they picked. It was how they told the AI to work.

    What this looked like for me

    I have a repo called everything-claude-code — it’s a fork I made to study how Claude Code’s harness is structured. Then I built my own.

    The system I run now has a Chief of Staff agent named Jim (more on that in Post 2) who manages a team of six specialized agents. Each one has a SOUL.md file (identity, principles, persona) and an AGENTS.md file (operational instructions). Jim reads both before dispatching any task.

    I didn’t design that structure because I thought it was clever. I designed it because of specific things that went wrong without it.

    Early on, I gave agents too much access and not enough context. They would drift, helpfully doing related things they weren’t asked to do. I’d ask for a research brief and get a research brief plus a draft blog post plus three suggested follow-up topics.

    So I added task boundary paragraphs to every AGENTS.md. A few sentences each: this agent handles X. It succeeds when Y. It fails when Z. It should never touch W.
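
    The shape of one of those paragraphs, as an illustrative version rather than a paste from the repo:

    ```markdown
    ## Task boundary

    This agent handles research briefs. It succeeds when a brief with cited sources
    lands in intel/data/. It fails if any statistic is estimated rather than sourced.
    It never drafts content, suggests follow-up topics, or edits another agent's files.
    ```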

    That’s a harness decision. It solved a real problem. Every line in a good system prompt traces to a specific past failure, what Osmani calls “The Ratchet Pattern.” You don’t design the perfect harness up front. You ship v0.1, watch it fail, write down what failed, and ship v0.2.

    Why most people don’t do this

    The model is the exciting part. New model drops, the benchmark posts come out, the comparison threads run. Nobody writes threads about refining system prompts for three weeks.

    But the product you ship is the harness. The model is infrastructure. You didn’t pick your cloud provider because it was the most interesting cloud; you picked it because the infrastructure works and you can build on top of it.

    The mental model shift that helped me: stop asking “which AI is smarter” and start asking “what does this agent need to know to do this job well?”

    One is a spec sheet comparison. The other is job design.

    The pattern I keep seeing

    When an agent produces bad output, the first question most people ask is “should I switch models?” In my experience, that’s almost never the right answer.

    The right questions are:

    • Does the agent have a clear task boundary?
    • Does the tool description actually explain when and how to use the tool?
    • Is there a stopping condition, or is the agent running until it decides it’s done?
    • What’s in the memory file — and does it reflect what the agent actually learned from past sessions?

    These are harness questions. They’re slower to answer than “use GPT-5 instead of Claude.” They’re also the ones that matter.

    I’ve shipped five production projects in the last eight months. The model versions I used on each one don’t matter much to me now. The harness decisions I made, the files I wrote, the memory structure I built, the task boundaries I enforced, those I’d make almost the same way again.


    This is the first post in a seven-part series about what I’ve actually learned building with AI tools over eight months: Claude Code, Cowork, Codex, Cursor, and the projects underneath them. The next post goes deeper into my specific Claude Code setup — what each file is for, why it exists, and what I’d do differently.

    If you’re building with AI tools and the model keeps frustrating you, it might not be the model.


  • How to Grow Your Career: 5 Tips to Take Your Skills to the Next Level

    Growing your career doesn’t just happen overnight. It takes time, effort, and dedication — not to mention a lot of patience.

    Career growth takes many employees years to achieve. But the good news is that there are ways you can accelerate the process.

    If you are ready to take your skills and career to the next level, we’ve got you covered with these 5 tips on how to grow your career as a digital marketing professional.

    Even if you feel like you are at a standstill in your current role, these tips will help you accelerate your growth as soon as tomorrow.

    Research and understand your career path

    To start growing your skills, you need to first understand your career path. What is your end goal? What are you working towards?

    If you can answer these questions, you are one step closer to achieving what you want in your career. Knowing your career path will help you create a path to get there.

    Start by researching your company’s organizational chart, or the skills hierarchies for your job title. This will give you a better sense of where you are on the organizational chart and what you need to do to progress.

    You can also use transferable skills to identify areas where you can start growing your skills. This process might feel like a lot of research, but it will be well worth it in the end.

    Knowing where you want to be in your career will help you focus your efforts and make decisions that will help get you there.

    Build your network

    A strong network is one of the most valuable assets you can have in your career. It can help you find new opportunities, get valuable advice, and even find a mentor to guide you along the way.

    It is important to create a solid foundation for your network. This includes finding and connecting with people both inside and outside of your organization.

    Start by doing some research around your organization. Are there any internal networking events or groups you can join to get to know your colleagues better?

    You may also want to consider joining industry-related groups or associations in your field. Playing an active role in your network is just as important as having one.

    Take the time to truly get to know people and what they do. Ask them about their career path, challenges they have faced, and their advice for overcoming them.

    Building relationships takes time, so be patient and persistent in this process.

    Identify a skill you need to develop

    One of the best ways to accelerate your growth is to identify a skill you want to develop. No matter what your current job title is, there are always areas to develop.

    Start with the basics: communication, management, leading others, time management, and critical thinking. You may also want to consider skill areas that are more technical, such as coding or data analysis.

    While these may not be applicable for every digital marketing role, many companies are starting to require data science skills. The best way to identify the skill you want to develop is to take an honest look at yourself and your current skill set.

    What do you excel at? What do you struggle with? What do you wish you could do better?

    Take the lead on new projects

    With so many skills to develop, you may be wondering how you will ever have time to take on new projects.

    However, taking the lead on new projects is a great way to accelerate your growth. Taking the lead on a new project can help you become more comfortable taking on more responsibility.

    It also gives you an opportunity to experiment with new skills. Taking the lead on a project may not be an option at all times, but when it is, make sure to take advantage of it.

    Discovering new opportunities to accelerate your growth, even within your current role, will help you grow your skills faster.

    Commit to growth and take action

    One of the biggest ways you can accelerate your growth is by committing to it and taking action.

    If you want to grow your skills, it’s essential that you make time for it. This could mean scheduling time for reading books or taking online courses outside of work hours.

    Taking the time to learn and develop new skills will help you grow your career and move up the organizational chart. At times, you may feel like you are not making any progress.

    But remember, it takes time to grow new skills. Be patient and don’t give up. As you become more committed to growing your skills and learning new things, you will start to see the results.

    Before you know it, you will be at the level you dreamed of when you started growing your skills.

    Conclusion

    Your career has a lot of potential, and you can grow your career as a digital marketing professional.

    To start growing your skills, you first need to research and understand your career path.

    Next, you need to build your network, identify a skill you want to develop, and take the lead on new projects.

    Finally, you need to commit to growth and take action by exploring new opportunities for growth and learning new things.


  • 5 Tips for Professional Growth: Essential Steps to Take for Professional Coaching

    As a leader, you are always looking for ways to improve yourself as well as your team.

    In this digital age where knowledge is abundant and easily accessible, staying ahead of the curve and learning new skills is easier than ever.

    I want to share with you how you can identify areas of your career that can be improved, offer ideas on how to grow in those areas, and give you some actionable steps to take so that you’ll see improvement in no time.

    I’ve outlined 5 essential steps for professional coaching that will have a positive impact on your professional growth.

    Identify Which Areas of Your Life Require Growth

    Before we dive into the tips, let’s first identify which areas of your life require improvement.

    Keep in mind that professional growth doesn’t have to be limited to your career; it can also include areas such as your relationships, health and wellness, and finances.

    It’s important to remember that you’re more than your job. If you’re feeling stuck in one area of your life, it can have an effect on other areas, so it’s wise to address them all.

    To identify areas of your life that can be improved, take a look at your strengths, weaknesses, and goals. Once you’ve examined these areas, you’ll be able to see where you can improve.

    Build a Culture of Coaching and Mentorship

    One of the best ways for you to grow is to encourage your team members to grow as well.

    By building a culture of coaching and mentorship, you can encourage your team members to learn new skills, grow in their roles, and become better versions of themselves.

    You can accomplish this by setting clear expectations that everyone on the team is responsible for helping each other grow in their roles.

    You can also offer your direct reports the opportunity to choose the coaching and mentorship method that works best for them.

    By setting clear expectations and encouraging your team members to grow, you’ll create a culture that allows everyone to learn and grow.

    Take Time to Learn From Those Around You

    Another great way to grow is by learning from your peers and colleagues. As a leader, you have access to a variety of leaders who can help you grow in your roles.

    You can take advantage of networking opportunities and connect with others who can provide insights and offer advice. This can be done through in-person and online networking as well as attending industry events.

    By building relationships with leaders in your field, you can receive advice on improving in your roles, discover new learning opportunities, and receive feedback on your strengths and areas that can be improved.

    Establish Clear Professional Goals

    Before you can start working towards improving in an area, you first need to identify what you want to achieve. This is where setting clear professional goals comes in.

    When setting goals, it’s important to make sure they are SMART goals: Specific, Measurable, Attainable, Relevant, and Time-bound.

    By setting clear goals, you’ll be able to track your progress and stay accountable for achieving what you’ve set out to do. This is a great way to stay focused on your professional goals and continuously work towards improving.

    Seek Out Information From Trusted Sources

    There are countless resources that can offer advice and tips on how to improve in certain areas of your professional life. You can seek out information from trusted sources by reading books, attending webinars, and subscribing to podcasts.

    By reading books and articles written by thought leaders in your industry, you can gain new insights and discover new ways of improving in your roles.

    By taking advantage of webinars and podcasts, you can learn from experts and gain knowledge from their experiences.

    By seeking out information from trusted sources, you can learn new skills, discover new ways of improving, and stay on top of trends in your industry.

    Conclusion

    There are many benefits of executive coaching, but it’s up to each individual to make the most of it. This article discussed the essential steps of taking advantage of the coaching process.

    By following these steps, you can ensure that you’re getting the most out of your coaching sessions and that you’re applying the lessons you’re learning to your daily life.


  • Finding the Right Executive Coach For You

    Thinking of hiring an executive coach? It’s natural to feel a bit daunted at the thought. After all, coaching is about personal development, and it can be challenging to put yourself on the line in front of someone you don’t know very well.

    How much do you trust this person to keep your secrets? Do you feel comfortable opening up and being vulnerable with them? Are they someone you think will be able to help you see things differently?

    If any or all of these questions make you hesitant, it’s time to find out more about coaching before making that final decision. Executive coaches can be a great asset for leaders looking to take their performance to the next level.

    These professionals bring a completely unbiased perspective into your working life. They are neutral third parties who are focused on helping you grow as an individual by focusing exclusively on your needs as an employee and identifying areas where your current habits may be preventing you from achieving peak performance.

    Why Hire An Executive Coach?

    An executive coach can be a great addition to your team if you feel like you could use some extra support. Before you dive into the hiring process, it can be helpful to know what kind of coaching is out there and what kind of benefits it can bring to the table.

    Here are a few common reasons why people hire an executive coach:

    • Support and accountability when making major changes. Whether you’re trying to improve your work habits or make a big career change, having an outside source of accountability and support can be incredibly helpful. A coach can keep you focused on your goals and provide encouragement and motivation every step of the way.
    • A better understanding of yourself and your strengths. As you progress in your career and take on new projects and initiatives, it can be easy to lose sight of what makes you unique and special. An executive coach can help you better understand yourself and find ways to bring out your best self at work.
    • A sounding board when you’re facing challenges. No one is perfect, and it can be incredibly helpful to have a neutral third party to talk to when you need help figuring out how to deal with a challenging situation.

    An executive coach can act as your sounding board and help you come up with a plan of action.

    How to find the right coach for you

    As with anything in life, there is no one-size-fits-all solution when it comes to hiring an executive coach. Finding the right fit comes down to a combination of factors, including the coach’s expertise and experience, your budget, and how you feel when you speak with a few different coaches.

    If you feel like you need more direction on how to find the right coach for you, here are a few tips to keep in mind:

    • Think about your current needs and goals. Before you start looking for coaches, it’s important to get clear on what your needs and goals are. For example, you may be interested in a coach who specializes in helping people make career transitions. However, if your goal is to become a better manager in your current job, that same coach might not be the best fit for you.
    • Narrow the search by specialty and functionality. Once you’re clear on your goals and needs, the next step is to narrow down your search by specialty and functionality. For example, you can search online for coaches who specialize in executive coaching and/or career change coaching. Once you have a list of potential coaches, you can then narrow the search further by functionality. In other words, you can focus on coaches who specialize in one particular area, such as assisting with career transitions.

    4 Tips For Finding The Right Coaching Fit

    Once you’ve found a few potential coaches to consider, it’s time to narrow down your search further.

    Here are a few tips for doing just that:

    • Look for coaches who specialize in executive coaching. When searching for coaches, make sure to clearly define the type of coaching you’re interested in finding. Many coaches offer a variety of services, but it’s important to find someone who specializes in executive coaching.
    • Consider your budget. It’s never easy to talk about money, but it’s important to remember that you get what you pay for. If you go bargain shopping for a coach who charges $100 per hour and doesn’t have much experience, you probably won’t get much value out of the experience. Ideally, you want to work with someone who charges a fair rate and has the appropriate level of experience and education.
    • Consider scheduling parameters and availability. How flexible is the coach? Do you have to work the hours they’re available, or are they willing to be a bit more flexible?
    • Make sure the coaching relationship feels right. When you talk to the coaches on your short list, it’s important to pay attention to how you feel.

    Pay attention to your gut, and don’t rush into a decision just because you’ve found someone who ticks off a few boxes. It’s important to find someone you feel comfortable working with and who you think can help you achieve your goals.

    3 Questions To Ask Before Hiring An Executive Coach

    Once you’ve narrowed down your search and are ready to move forward with interviewing prospective coaches, it can be useful to have a list of questions prepared.

    Not only does this help you clarify your needs and expectations, but it also gives you an opportunity to get a better sense of the coaches you’re considering working with.

    Here are a few questions to ask when interviewing potential coaches:

    • What are your coaching specialties? Make sure the coach is a good fit for your needs.
    • What is your coaching style and approach? Do you prefer a more direct style or are you more comfortable with a coach who is a bit more on the indirect side?
    • What have your past clients said about working with you? Ask for testimonials from past clients, and make sure they are a good fit for your needs.

    Takeaway

    Executive coaching can be a great addition to your professional life, but you have to take the time to find the right coach for you. Start by thinking about your current needs and goals and then narrow your search by specialty and functionality.

    Once you’ve found a few potential coaches, make sure to ask the right questions and get a better sense of the coaches you’re interested in working with. With a bit of preparation, you’re sure to find the right coach for you.