ACW Monthly Brief: May 2026

Welcome to the first edition of ACW Monthly Brief. It's one email to catch you up on all the meaningful developments in agentic coding from the past month.

Reading time: ~20 minutes

In this issue, we will cover:

Executive summary for the entire month
Agentic coding trends that I observed over the course of the last month
Models and benchmarks: models released last month, including Opus 4.8 and Gemini 3.5 Flash, where these models sit on the agentic coding benchmarks, and an intro to the newly released DeepSWE benchmark.
Tool updates and new features in agentic coding tools like Claude Code, Codex, and Antigravity
Interesting projects people are building with agentic coding
Best agentic coding workflows worth trying from across the web
Reading list for the month with 5 must-reads
Miscellaneous updates and some fun stuff

Non-AI disclaimer: I have written every single sentence in this article myself. See my reasoning here. If you spot any problems with word choices or writing cadence, it's solely because I am bad at writing. Unlike your AI agents, if you tell me about those issues or your feedback by replying to this email, I'll remember them and put them into practice without you having to say "make no mistakes"!

1. Executive Summary

Tokens are getting both more expensive and cheaper. The frontier proprietary models got more expensive. Gemini 3.5 Flash is 3x more expensive than its predecessor. GPT 5.5 was already 2x GPT 5.4, and Opus 4.7 was already ~1.4x Opus 4.6 because it burns more tokens. On the other hand, open-weight models like DeepSeek and MiMo cut prices and are roughly 10-30x cheaper.
There were five major model releases: Opus 4.8, Gemini 3.5 Flash, Composer 2.5, Qwen 3.7 Max, and the open-weight Step 3.7 Flash. Most were incremental updates. Gemini 3.5 Flash was disappointing, costing more than 3.1 Pro to run the Artificial Analysis suite while still ranking below it.
Both Codex and Claude Code added a /goal command that keeps the agent working toward a verifiable objective instead of stopping after a single turn.
Google dropped the IDE part with Antigravity 2.0, turning it into an agent manager like Claude and Codex desktop. Antigravity CLI is replacing Gemini CLI and Gemini CLI will stop working from June 18, 2026.
Anthropic doubled the 5-hour rate limits for Claude Code after a SpaceX compute partnership, and a promotion with 50% higher weekly limits runs until July 13. They also formalized the subscription policy so that from June 15 your plan covers only interactive TUI use, with separate credits for programmatic and third-party use.
Claude Code added a new dynamic workflow feature to turn a large multi-agent task into a JavaScript orchestration script. Instead of Claude manually coordinating work turn-by-turn, Claude writes a workflow script for the task, the runtime executes it, and many subagents can be fanned out across phases such as discovery, implementation, review, and synthesis.
A new benchmark, DeepSWE, was released for longer-horizon SWE tasks. GPT 5.5 tops this benchmark and ranks higher than Opus 4.8. It's worth a look, but it doesn't reflect the experience you'd actually get from Claude Code or Codex CLI, so I wouldn't advise picking your day-to-day tools based on just this benchmark.

2. Trends Over the Month

Trend 1: Tokens are getting both more expensive and cheaper

Gemini 3.5 Flash is three times more expensive than its predecessor in the Flash series. GPT 5.5, released near the end of April, was twice the price of GPT 5.4. Opus 4.7, released around mid-April, was technically the same price as Opus 4.6 but consumed more tokens, so it worked out to around 1.4 times more expensive.

Enterprises are now paying API pricing for both OpenAI and Anthropic along with $20 per user. They are not getting the benefits that individual users get from $200 or $100 subsidized subscriptions.

So costs are increasing, but on the other hand, open-weight models like DeepSeek and MiMo have lowered their pricing and are now 10-30x cheaper.

Given the reports of Uber going past their entire annual AI budget just in May and Microsoft canceling their Claude usage, it'd be interesting to see how this increase in pricing affects AI adoption in enterprises. Are there going to be stricter caps on AI usage, or are companies going to move toward open-weight models?

Trend 2: Making coding agents work longer

There have been ongoing efforts since last year to make coding agents work for longer. This month, both Codex and Claude Code added the /goal command, which stops them from quitting after a single turn and keeps them working until the stated goal is reached. The goal command is similar to the Ralph loop concept, but now it is built into the harnesses.

Trend 3: No AI lab has an IDE product

Well, to be fair, only Google had an IDE product with Antigravity 1.0, but with the release of 2.0 they abandoned the IDE part, and it's now an agent manager similar to Claude Desktop and Codex Desktop. Given the importance of coding and the market fit the AI labs have found for it, it is interesting to see that none of them offer an IDE product.

Do they truly see no value worth pursuing in building a good IDE? Or do they truly feel that IDEs are not going to be the future? We're only gonna interact with our codebases through agents, not IDEs?

It is also kind of interesting that we have seen demos from different labs where they use a collection of agents to build an operating system from scratch or build a C compiler or port Bun from Zig to Rust. But nobody has tried to build an IDE that provides an excellent UX and is not a resource-heavy Electron-based IDE.

3. Models and Benchmarks

There were five major releases last month: four proprietary and one open-weight. The releases were mostly incremental updates, except Gemini 3.5 Flash, which was a disappointment in both pricing and performance.

Opus 4.8 (May 28) - An incremental upgrade over Opus 4.7 with no change in pricing. Anthropic says 4.8 is more honest and more likely to flag uncertainty than to claim unsupported progress. It's roughly four times less likely than 4.7 to let flaws in code it wrote pass unremarked. Based on personal vibes, I am finding it more usable than 4.7, which felt like a downgrade from 4.6. In the 4.8 announcement post, they also said that they are preparing to release Claude Mythos for all customers in the "coming weeks".
Gemini 3.5 Flash (May 19) - Beats Gemini 3.1 Pro on almost all benchmarks. Priced at $1.5 / $9 per million input / output tokens, it's 3x more expensive than the last Flash model, 3 Flash, and close to 3.1 Pro's $2 / $12. It's quite fast (over 200 tokens per second) but also more verbose. Running the Artificial Analysis test suite costs $1551 for 3.5 Flash vs $892 for 3.1 Pro, while 3.5 Flash still ranked lower than 3.1 Pro.
Composer 2.5 (May 18) - Cursor's further fine-tuned version of Kimi K2.5. Benchmarks put it close to Opus 4.7 and GPT 5.5 but priced much lower at $0.50/M input and $2.50/M output tokens. Context window is 200k tokens.
Qwen 3.7 Max (May 20) - Latest release in the Max series, the proprietary variant of the Qwen models. It scores 69.7% on Terminal Bench 2.0 and 60.6% on SWE Bench Pro. Pricing is $2.5 / $7.5 per million input / output tokens, and the context size is 1 million tokens.
Step 3.7 Flash (May 29) - Open-weights model from StepFun. 198B Mixture-of-Experts (MoE) with 11B active parameters. 256k context window. It loves to use the word "wait" in the reasoning. Tried the car wash test with this model and this is a short excerpt of the model’s reasoning from the test: "Wait no, wait—wait, no, wait: if you drive, you get in your car, drive 50m to the car wash, then... wait, but if you drive to the car wash, you're bringing the car with you, which is the point, right? Wait wait, no, wait the goal is to wash the car. Oh right!". It got the answer right though. You can check the full model reasoning and response in the GitHub Gist. I'll wait!

Other than the new model releases, DeepSeek and MiMo reduced their prices.

Originally, DeepSeek V4 Pro was available at 75% off through the API until May 31st, which put the discounted pricing at $0.435 / $0.87 per million input / output tokens. They have now made the discount permanent, so the API pricing stays at $0.435 / $0.87.
MiMo V2.5 Pro was priced at $1 / $3 per million input / output tokens. They have discounted the price starting May 27 to match the DeepSeek V4 Pro pricing of $0.435 / $0.87. They also reduced cache-hit input pricing by 98%, bringing it down from $0.2 per million tokens to $0.0036. For agentic coding tasks, since most of the tokens consumed are cache reads, this model is truly cheap.
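
To see why the cache-hit price matters so much for agentic work, here is a rough cost sketch. The prices are the MiMo V2.5 Pro figures quoted above. The token counts and the 90% cache-hit share are made-up assumptions for illustration, not measurements from a real session.

```python
def session_cost(input_tokens, output_tokens, cache_hit_share,
                 input_price, cache_hit_price, output_price):
    """Return the USD cost of one session. All prices are per 1M tokens."""
    cached = input_tokens * cache_hit_share   # input tokens served from cache
    uncached = input_tokens - cached          # input tokens billed at full price
    total = (uncached * input_price
             + cached * cache_hit_price
             + output_tokens * output_price)
    return total / 1_000_000

# Hypothetical agentic session (assumed numbers, not measured):
INPUT_TOKENS = 20_000_000   # total input tokens across all turns
OUTPUT_TOKENS = 200_000     # total output tokens
CACHE_HIT_SHARE = 0.9       # share of input tokens that are cache reads

# Old MiMo V2.5 Pro pricing: $1 input, $0.2 cache-hit input, $3 output.
old = session_cost(INPUT_TOKENS, OUTPUT_TOKENS, CACHE_HIT_SHARE, 1.0, 0.2, 3.0)
# New pricing: $0.435 input, $0.0036 cache-hit input, $0.87 output.
new = session_cost(INPUT_TOKENS, OUTPUT_TOKENS, CACHE_HIT_SHARE, 0.435, 0.0036, 0.87)

print(f"old: ${old:.2f}  new: ${new:.2f}  ratio: {old / new:.1f}x")
```

Under these assumptions the session is more than five times cheaper than before, which is a bigger drop than the headline input and output prices alone suggest, because the cache-read line shrinks the most. Change CACHE_HIT_SHARE to match your own usage before drawing conclusions.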

Turning to the benchmarks, here's how the newly released Opus 4.8 and Gemini 3.5 Flash did on Simon Willison's pelican question, "Generate an SVG of a pelican riding a bicycle":

Opus 4.8 xhigh - Pelican riding a bicycle

Gemini 3.5 Flash xhigh - Pelican riding a bicycle

And for comparison, this is what Opus 4.7 and GPT 5.5 produced:

Opus 4.7 xhigh - Pelican riding a bicycle

GPT 5.5 xhigh - Pelican riding a bicycle

As for the current state of agentic coding benchmarks, these are the current standings of the top proprietary and open-weight models. The first five rows are the models released last month.

Model	SWE-bench Pro	Terminal-Bench 2.1	Pricing - per 1M input / output tokens
Claude Opus 4.8	69.2%	66.1%	$5 / $25
Qwen 3.7 Max	60.6%	69.7%	$2.5 / $7.5
Step 3.7 Flash	56.3%	59.6%	$0.2 / $1.15
Gemini 3.5 Flash	55.1%	76.2%	$1.5 / $9
Composer 2.5	-	69.3%	$0.5 / $2.5
Claude Opus 4.7	64.3%	66.1%	$5 / $25
GPT 5.5	58.6%	78.2%	$5 / $30
Kimi K2.6	58.6%	-	$0.95 / $4
GLM 5.1	58.4%	-	$1.4 / $4.4
MiMo V2.5 Pro	57.2%	-	$0.435 / $0.87
DeepSeek V4 Pro	55.4%	-	$0.435 / $0.87
Gemini 3.1 Pro	54.2%	70.3%	$2 / $12

Talking about benchmarks, there is a new one in town. You might have seen something like the bar plot below floating around. Let's look into what it is first and then where it falls short.

DeepSWE Benchmark

DeepSWE benchmark leaderboard

This is a benchmark for longer-horizon tasks than SWE-Bench Pro, with more diverse, contamination-free tasks. It contains a total of 113 tasks from 91 open-source repositories across five languages. About 90% of the tasks are in TypeScript, Go, and Python (roughly 30% each), and the rest are in JavaScript and Rust.

This benchmark differs a lot from SWE Bench in how the tasks are defined and how they are evaluated. Instead of giving the model the full GitHub issue, which carries a lot of additional context, it gives only a short task prompt, and the model needs to figure out the rest itself. The implementation is then evaluated by a handwritten set of tests. In SWE Bench, tests are derived from the repository state after the issue was fixed and the commit was merged. Passing those repository tests doesn't reliably show whether the implementation was correct or merely satisfied the tests without actually solving the problem.

During the implementation, each model gets the same system prompt and only a bash tool. The obvious limitation is that it doesn't test the true quality of the model or the kind of experience you would get in day-to-day implementation. This is because of two reasons. First, every model uses the same mini-SWE-agent harness, which is very, very minimal. They did this to keep the comparison fair and to test only the capability of the model, not the harness. But this also means that since each model is not getting all the tools that it was trained with, we cannot definitively say it's the true capability of the model. Similarly, models are not getting their own custom system prompt, so, it's hard to say the performance matches the true capability of the model.

Second, the harness is just as important as the model these days, and for these implementation tasks the models are not getting their purpose-built harness. In my opinion, the benchmark doesn't reflect the experience you would actually get with these models when using Claude Code CLI or Codex CLI. Because of the limited number of samples and the limitations of the harness, I'd say it's not the ultimate benchmark for picking your day-to-day agentic coding tools and models just yet. Speaking of tools...

4. Tool Updates

Antigravity

Along with 3.5 Flash, Google also launched Antigravity 2.0 at Google I/O. AGY 2.0 is a parallel agent manager similar to Claude and Codex desktop. There's no IDE inside 2.0. They also announced Antigravity CLI, a closed-source coding agent that uses the same harness as Antigravity 2.0 and is replacing Gemini CLI. Gemini CLI will stop working from June 18, 2026.

The 2.0 auto-update was poorly handled and broke people's workflows without any warning.

Dynamic Workflows in Claude Code

Along with Opus 4.8, Anthropic also released a new workflow feature to turn a large multi-agent task into a JavaScript orchestration script. Instead of Claude manually coordinating work turn-by-turn, Claude writes a workflow script for the task, the runtime executes it, and many subagents can be fanned out across phases such as discovery, implementation, review, and synthesis.

To trigger a dynamic workflow, we can run a saved workflow command like /deep-research, include the word workflow in our prompt, or enable /effort ultracode which allows Claude to decide when to use workflows automatically.

From my understanding, this dynamic workflow is different from using subagents or agent teams because those are coordinated by Claude and all the task context stays in the main agent's context window or in a shared task list. In a workflow, the task context is in the JavaScript code and there is no main agent coordinating everything.
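
To make the idea concrete, here is a toy sketch of a phase-based fan-out where the orchestration lives in code. It is written in Python rather than the JavaScript that Claude Code generates, and run_subagent, run_phase, and workflow are hypothetical names of my own, not part of any real Claude Code API. Treat it as an illustration of the pattern only.

```python
import asyncio

async def run_subagent(role, task):
    """Hypothetical stand-in for launching one subagent and awaiting its result."""
    await asyncio.sleep(0)  # placeholder for the real agent work
    return f"[{role}] {task}"

async def run_phase(role, tasks):
    """Fan out one subagent per task and wait for all of them to finish."""
    return await asyncio.gather(*(run_subagent(role, t) for t in tasks))

async def workflow(modules):
    # Each phase consumes the previous phase's output. The task state lives
    # in these variables, not in a coordinating agent's context window.
    findings = await run_phase("discovery", modules)
    patches = await run_phase("implementation", findings)
    reviews = await run_phase("review", patches)
    # Synthesis: a single subagent summarizes all the reviews.
    return await run_subagent("synthesis", f"summarize {len(reviews)} reviews")

result = asyncio.run(workflow(["auth", "billing", "search"]))
print(result)
```

The point of the sketch is the shape: the script decides how many subagents run in each phase and how results flow between phases, so the whole plan can be read and rerun like any other program.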

As for use cases, the docs say to use it for tasks that need more agents than one conversation can coordinate, or when we want the orchestration codified as a script that we can read and rerun. Of course, all of this consumes a shit ton of tokens. The recent rewrite of Bun from Zig to Rust was done using dynamic workflows. Boris, creator of Claude Code, also shared a few tasks for which he has been using workflows in this HN comment.

Agent View in Claude Code

New way to control and manage multiple Claude Code sessions from a single screen by running claude agents. GUI-based IDEs and ADEs like Zed, Cursor, Codex, and Claude desktop apps have had this for a while. Now it's time for TUIs to get it.

/goal Command

Both Codex and Claude Code added the /goal slash command for goal-directed autonomous work. Instead of stopping after each response, Codex/Claude Code keeps working toward a concrete objective.

To use it, run /goal with a verifiable objective. For example, /goal all tests in test/auth pass and the lint step is clean.

To decide if the goal has been achieved, after Claude finishes responding, Claude Code sends the goal and conversation to a Haiku model, which returns a yes-or-no decision and a short reason. "Yes" stops the loop; "no" sends the reason back to Claude, which is asked to keep working. I'd imagine something similar happens in Codex as well.
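
The loop described above can be sketched roughly as follows. This only illustrates the control flow: run_agent_turn and evaluate_goal are hypothetical stand-ins for the coding agent and the smaller evaluator model, not real Claude Code or Codex APIs, and the max_turns cap is my own addition so the toy example always terminates.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    done: bool    # the evaluator's yes-or-no decision
    reason: str   # the short reason returned with the decision

def run_goal_loop(goal, run_agent_turn, evaluate_goal, max_turns=10):
    """Keep running agent turns until the evaluator says the goal is met."""
    transcript = []
    feedback = None
    for turn in range(1, max_turns + 1):
        transcript.append(run_agent_turn(goal, feedback))
        verdict = evaluate_goal(goal, transcript)
        if verdict.done:
            return turn, transcript   # "yes" stops the loop
        feedback = verdict.reason     # "no" feeds the reason back to the agent
    raise RuntimeError(f"goal not reached after {max_turns} turns")

# Toy stand-ins so the sketch runs without any model calls.
def fake_agent_turn(goal, feedback):
    return f"worked on: {feedback or goal}"

def fake_evaluator(goal, transcript):
    remaining = 3 - len(transcript)
    if remaining <= 0:
        return Verdict(True, "all checks pass")
    return Verdict(False, f"{remaining} checks still failing")

turns, transcript = run_goal_loop(
    "all tests in test/auth pass", fake_agent_turn, fake_evaluator
)
print(turns, transcript[-1])
```

This also shows why the objective should be verifiable: the evaluator only sees the goal and the conversation, so a vague goal gives it nothing concrete to say yes or no to.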

Claude Code Limits and Subscription Changes

Anthropic partnered with SpaceX to increase compute capacity and announced higher Claude usage limits. They have doubled the five-hour rate limits for Claude Code on Pro, Max, Team, and seat-based Enterprise plans. Default weekly limits are unchanged though, so this mostly helps with bursty sessions rather than total weekly usage. However, for a short period until July 13, they have also increased the weekly limits by 50%.

After the fiasco around using Claude subscriptions with third-party tools like openclaw and OpenCode a couple of months ago, Anthropic has also formalized the policy for such use cases. Starting June 15, 2026, a Claude subscription will cover only interactive use (TUI) of Claude Code. We will get a separate monthly credit equal to our subscription amount for programmatic usage like claude -p, third-party apps using the Claude Agent SDK, or tools like openclaw. You have to claim the credits manually once; after that, they refresh automatically each cycle.

Quick Hits

llama.cpp now supports MTP (Multi-Token Prediction) based speculative decoding. If you run Qwen3.6 27B or Qwen3.6 35B-A3B locally, you can expect about a 2x speed-up in token throughput.
Codex is now available in the ChatGPT mobile app. Requires Codex desktop app running on one of your machines. Not the Codex CLI, the Codex app.
xAI launched their own proprietary CLI coding agent, Grok Build CLI. Early beta, available only to SuperGrok Heavy subscribers ($300/month plan).
antirez built a specialized native inference engine for DeepSeek V4 Flash optimized for Apple Silicon Metal. Clean, minimal codebase worth reading. Runs at 2-bit quants on 128GB Macs, 4-bit on 256GB.

And now before we move on, found this on Reddit:

5. What People are Building

Interesting projects where agentic coding played a major role:

The Emacsification of Software - Agentic coding allows people to build hyper-specific software for themselves, i.e., Emacsification. The author built a markdown viewer specific to their needs.
I Let AI Build a Tool to Help Me Figure Out What Was Waking Me Up at Night - A bit over-engineered, but who am I to judge. The project is pretty cool, and the journey is often more important than the destination.

6. Workflows to Try This Month

5 best workflow patterns I found on HN/Reddit/Twitter/Web:

Workflow from HN user @momojo:

When exploring a new idea or tool, my go to prompt is

In a single index.html, no dependencies, sparse styling, create an app that <idea>

Workflow from Vicki Boykis:

It takes extra concerted effort to move from just generating answers to using the tool deliberately. Here’s what’s worked for me so far:

Writing the initial implementation myself and asking the agent to review the code, then going through comment by comment and manually making the changes
Using the agent to keep asking questions about pieces of the code I don’t understand instead and pull up relevant documentation and PRs.
Asking the agent to think about implementing two approaches and choosing between them and then critiquing the other approach
Discussing an agent’s proposed implementation with another person instead
Starting to use the agent only after I’ve spent 20 minutes on the problem

Workflow from Thariq: The Unreasonable Effectiveness of HTML

I’ve started preferring HTML as an output format instead of Markdown and increasingly see this being used by others on the Claude Code team. Just ask Claude to “make a HTML file” or “make a HTML artifact”.

To start with some examples, you can see a bunch here: https://thariqs.github.io/html-effectiveness

Example Prompts:

I'm not sure what direction to take the onboarding screen. Generate 6 distinctly different approaches — vary layout, tone, and density — and lay them out as a single HTML file in a grid so I can compare them side by side. Label each with the tradeoff it's making.
Create a thorough implementation plan in a HTML file, be sure to make some mockups, show data flow and add important code snippets I might want to review. Make it easy to read and digest.

Use Cases:

Exploration and planning: When you're not sure what you want yet. Ask the agent to fan out across several directions and lay them next to each other so you can point at one.
Code review & understanding: Let the agent render the change as an annotated diff, draw the module as boxes and arrows, or write the PR description your reviewers actually want.
Design: HTML is the medium your design system ships in, so it's the natural format for talking about it.
Prototyping: A throwaway page with the real easing curve or the real click-through tells you in five seconds what a paragraph of prose never could.
Illustrations & diagrams: Ask for the figures for a post or a flowchart of a process and get vector art you can tweak by hand or paste straight into the final document.

Workflow from Simon Willison:

Something I've been trying recently for non-throwaway code is extensive refactoring, without typing any code myself but by closely directing the coding agent.

Prompts like "move the code relating to SQL query analysis into a new file", "look for opportunities to use pytest parametrize to remove duplication in that test", "rename method X to Y".

Early indications are that this is helping a lot with the problem where it's easy to churn out thousands of lines of code and not really have it stick in my head, even if I review every line of it.

Reviewing code and actively refactoring it is less tedious and more mentally engaging than reviewing code without changes.

If this was a human collaborator I'd be worried that I'm just creating busywork for them, but I don't care about busywork for LLMs!

The goal is to produce code that I understand and that I can remember just well enough that I get an updated mental model to help me productively make future decisions about the codebase.

Workflow from HN user @bottlepalm:

I've hit this point with AI where it's not a simple process, but a long drawn out back and forth.

I'll use AI to design the implementation of a medium sized, cross cutting feature. Review all the details, maybe iterate on just that. Then implement with Claude 4.7 Max - which runs slower, but does a better job. Then review the implementation, then have Codex GPT 5.5 xhigh fast review it - which almost always finds corner cases. Have Claude fix those - Claude is better at writing intuitive maintainable code versus Codex overengineered/shortcut filled code. Codex is better at finding/fixing bugs and doing reviews.

Then repeat with fresh Claude/Codex instances having them both review the current staged changes and getting feedback, handling the feedback. Then covering it in tests. I mean overall I still implement the feature faster than coding it manually, but I spend a majority of the time going back and forth with reviews, handling corner cases and at the finish end up with what I feel a really solid implementation of whatever feature I'm working on.

7. Reading List

5 must-reads for this month:

Using AI to write better code more slowly - Agentic coding is not only for shipping as fast as possible. Take your time.
Human Bottlenecks - Addresses this issue: "if only I could wire up the right prompts and the right tools in the right harness, I could have an agent that would boost my productivity 10x, or fix my problems with therapy, or make me more social, or more knowledgeable". This article was an absolute joy to read. Perfect 5/7.
Is this sustainable? - What it's like to be a senior engineer at an org, all-in on Gen AI and agentic coding, three years in. This excerpt is unusually funny: "The other thing that gave way was thinking time. There's very little of it in my working day now".
Why senior developers fail to communicate their expertise - Explains the two conflicting needs in a company: the need to keep existing services stable and the need to move fast with new services and products and how these two affect developers.
Local AI Needs to be the Norm - We should reach for cloud models only when the task truly needs it. Shows how to use on-device Apple foundation models.

If you have more time:

8. And the Rest

Miscellaneous updates and some fun stuff:

Andrej Karpathy is now at Anthropic. He is joining the pre-training team. My guess is that he will work on expanding on the autoresearch idea he published a couple months ago to automate LLM training experiments.
Burn, baby, burn (those tokens) - Script to burn tokens to get to the top of the token leaderboards.
Continue? Y/N - A 60-second game about AI agent permission fatigue.
How fast is N tokens per second really? - A simple HTML page that visualizes how fast 10, 100, or 800 TPS looks in practice.
I'll get back to you on that one, and that one, and this one - Because of the higher velocity, senior engineers don’t have mental models of what’s happening in the codebase. This makes it harder to articulate things when they're in decision meetings.
Our AI spending has gotten so high that layoffs wouldn't make a meaningful difference.
Tell me about your agentic coding setup - This question and its variations are popping up in interviews. If you're interviewing, you should prepare for this one in advance.

That's all the important agentic coding updates from May 2026. If you remember just one thing from this issue, remember that you don't have to ship as fast as possible with agentic coding. Take your time and ship higher-quality code and higher-quality products that spark joy in the user.

This month, keep an eye out for:

An email from Anthropic on June 8 about claiming the credits for programmatic and third-party use of your Claude subscription. You have to do this manually once.
Gemini 3.5 Pro. At Google I/O, they announced the Pro model will come in June.
GPT 5.6. Based on some rumors, seems like the next GPT model is coming soon.
Claude Mythos. In the Opus 4.8 release announcement, Anthropic said that they are preparing for the Mythos release in the "coming weeks", which means it could be available in June.

If you want the full monthly brief every month, upgrade to the paid tier for $120/year.

That's all. See you on the first of next month.

— Prashant