Welcome back to Agentic Coding Weekly. Here are the updates on agentic coding tools, models, and workflows for the week of Mar 15-21, 2026.
Executive Summary:
OpenAI released GPT-5.4 Mini and Nano. Mini hits 60.0% on Terminal-Bench 2.0, beating Sonnet 4.6. Prices have gone up compared to GPT-5 Mini and Nano.
Cursor launched Composer 2, their own coding model fine-tuned on Kimi K2.5. Scores 61.7% on Terminal-Bench 2.0, beating Opus 4.6.
Workflow of the week is Stavros’s multi-agent setup: Opus for planning, Sonnet for implementation, and multiple reviewers for critique.
Worth reading: one post on why sufficiently detailed specs turn into code, one on why codegen is not the same as productivity, and an Ask HN thread on professional AI-assisted coding.
Meanwhile, if you've been feeling like you're not making progress fast enough, remember that some things just take time (from Armin Ronacher, creator of the Flask Python framework).
1. Tooling and Model Updates
GPT-5.4 Mini and Nano
OpenAI's latest small models. Check the announcement.
On SWE-Bench Pro (Public): GPT-5.4-nano-xhigh (52.4%) < GPT-5.4-mini-xhigh (54.4%) < GPT-5.4-xhigh (57.7%)
On Terminal-Bench 2.0: Haiku 4.5 (41.0%) < GPT-5.4-nano-xhigh (46.3%) < Sonnet 4.6 (59.1%) < GPT-5.4-mini-xhigh (60.0%) < GPT-5.4-xhigh (75.1%)
The performance of these small models is improving, and prices are rising along with it.
GPT-5 Mini: Input $0.25 / Output $2.00
GPT-5 Nano: Input $0.05 / Output $0.40
GPT-5.4 Mini: Input $0.75 / Output $4.50
GPT-5.4 Nano: Input $0.20 / Output $1.25
For context, here's the broader price comparison between Anthropic, OpenAI, and Google:
Anthropic: Opus 4.6 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5
OpenAI: 5.4 $2.5/$15 ($5/$22.5 for >200K context), 5.4 Mini $0.75/$4.50, 5.4 Nano $0.20/$1.25
Google: 3.1 Pro $2/$12 ($3/$18 for >200K context), 3 Flash $0.50/$3, 3.1 Flash Lite $0.25/$1.50
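To make these per-million-token rates concrete, here's a small sketch that estimates what a single agentic session would cost under each provider's list price. The token counts in the example are made-up illustrative numbers, not measurements:

```python
# Per-million-token list prices (input $/M, output $/M) from the comparison above.
# Tiered >200K-context pricing is ignored for simplicity.
PRICES = {
    "Opus 4.6": (5.00, 25.00),
    "Sonnet 4.6": (3.00, 15.00),
    "GPT-5.4 Mini": (0.75, 4.50),
    "GPT-5.4 Nano": (0.20, 1.25),
    "Gemini 3 Flash": (0.50, 3.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session at the published per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Hypothetical session: 2M input tokens, 200K output tokens.
for model in PRICES:
    print(f"{model}: ${session_cost(model, 2_000_000, 200_000):.2f}")
```

The gap compounds quickly in agentic loops, where input tokens dominate because the growing context is re-sent on every turn.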
Composer 2
Cursor released their own coding model. Composer 2 is a fine-tuned version of Kimi K2.5, priced at $0.50/M input and $2.50/M output tokens. On Terminal-Bench 2.0 it scores 61.7%, above Opus 4.6 (58.0%) and below GPT-5.4 (75.1%). Check the announcement.
2. Workflow of the Week
The workflow this week is from Stavros, who wrote up his multi-agent setup in a blog post that's worth reading in full.
Stavros uses three distinct roles in OpenCode: an architect, a developer, and one to three reviewers, each backed by a different model. The architect is Claude Opus 4.6, the strongest model he has access to. The developer is Sonnet 4.6, cheaper and more token-efficient. Reviewers rotate between Codex, Gemini, and Opus depending on how important the project is.
Planning needs strong reasoning but doesn't burn many tokens since it's mostly chat. Implementation burns a lot of tokens but doesn't need the strongest model if the plan is detailed enough. And using different models for review actually catches more issues because they have different blind spots.

Here's how a typical run looks:
Stavros talks to the architect about a specific feature or bugfix and goes back and forth, sometimes for up to 30 minutes, until the plan is nailed down to individual files and functions. He actively steers the conversation, correcting the LLM when it's wrong or when it suggests something that technically works but doesn't fit his codebase. The architect is instructed not to start anything until Stavros explicitly says "approved".
Once approved, the architect splits work into task files with low-level detail and hands off to the developer.
The developer implements strictly what's in the plan, then calls the reviewers.
Reviewers independently critique the diff against the plan. If they agree, feedback gets integrated.
If they disagree, it escalates to the architect, which Stavros finds is good at filtering out pedantic suggestions that aren't worth the effort.
Each role (architect, developer, and reviewer) is basically a skill file. He writes the agent skill files by hand. His take is that asking an LLM to write instructions for itself is circular and doesn't actually improve output.
When he's not familiar enough with the technology to properly steer the architect, bad decisions compound. The LLM builds on its own mistakes, and ends up in that loop where it keeps saying "I know why, let me fix it" while making things worse. His fix is to invest the time understanding the architecture during planning, even in unfamiliar territory. The upfront cost saves a lot of pain later.
The post also has an annotated transcript from a real coding session which I found helpful in seeing this workflow in action.
Link to full post: How I write software with LLMs
3. Community Picks
A Sufficiently Detailed Spec Is Code
Uses OpenAI's Symphony project as a case study to argue against the misconception that specification documents are simpler than the corresponding code.
If you try to make a specification document precise enough to reliably generate a working implementation, you necessarily contort the document into code, or something strongly resembling code (like highly structured, formal English).
Codegen Is Not Productivity
Argues that high-volume code generation is a liability rather than a productivity gain. Has a nice collection of links in the appendix, highly recommend. Read the post.
Ask HN: How Is AI-Assisted Coding Going for You Professionally?
432 points, 614 comments. Check the discussion on HN.
That’s it for this week. I write this weekly on Mondays. If this was useful, subscribe below: