This is the monthly premium issue for paid subscribers, distilling the previous month of agentic coding updates into a single email. Read this one email every month to stay on top of all things agentic coding.

You should be able to expense this under your team's learning and development budget. You can send this email or Slack message to your manager to ask for approval.

Welcome to the latest edition of ACW Monthly Brief. It's one email to catch you up on all the meaningful developments in agentic coding from June 2026.

Reading time: ~20 minutes

In this issue, we will cover:

  1. Executive summary to catch up on the whole month in two minutes, covering model releases, benchmarks, tool updates, and the main takeaway from June.

  2. Agentic coding trends that I observed over the course of the last month: loops are the new way to work with coding agents, LLM costs are going up in 4 different ways, and Anthropic is no longer the low-drama company.

  3. Important model releases from June, including Fable 5/Mythos 5, GLM 5.2, GPT 5.6, Sonnet 5, MiniMax M3, Qwen 3.7 Plus, Ornith 1.0, Nemotron 3 Ultra, and a few smaller releases.

  4. Brief explanation of FrontierCode, a new benchmark from Cognition that tries to measure whether an LLM-generated PR would actually be merged.

  5. Tool updates from Claude Code, Codex, GitHub Copilot, Cursor, and Anthropic’s vulnerability detection harness.

  6. Trending open-source tools from June that I found useful, including projects for running multiple agents, tracking agent state, generating technical tutorials, doing semantic diffs, and reviewing code from the CLI.

  7. 5 best agentic coding workflows from across the web, including manual compaction, multi-agent workspaces, agent-agnostic planning artifacts, and loops for staying in sync with coding agent implementations.

  8. Reading list for the month with 5 must-reads on a TDD agent skill, token compression, local coding agents, AI etiquette, and using goal loops. Plus a longer list if you want to go deeper.

  9. Miscellaneous updates, including IPO filings, enterprise token budgets, Cursor acquisition news, and a few fun links from the agentic coding world.

Non-AI disclaimer: I have written every single sentence in this article myself. See my reasoning here.

1. Executive Summary

  1. June was a busy month for agentic coding. We got Fable 5 / Mythos 5, GLM 5.2, GPT 5.6, Sonnet 5, MiniMax M3, Qwen 3.7 Plus, and several other models. Frontier models are still ahead, but open-weight models are now good enough that you should be evaluating them for your use cases.

  2. Fable 5 / Mythos 5 is the new top model on agentic coding benchmarks. It scores 80.3% on SWE-bench Pro and 88.0% on Terminal Bench 2.1, a big jump over Opus 4.8 (69.2% and 66.1%). There are caveats, though: higher pricing, at least 30-day data retention, no ZDR (zero data retention), export-control drama, and Anthropic’s willingness to silently serve worse models.

  3. GLM 5.2 is now the leading open-weights model. It scores 62.1% on SWE-bench Pro and 81.0% on Terminal Bench 2.1, its performance sits somewhere between Opus 4.7 and 4.8, and it is much cheaper. It is text-only, so it cannot replace multimodal workflows, but for many coding agent tasks it is now a serious alternative.

  4. GPT 5.6 is not publicly available yet. The Sol variant scores 88.8% on Terminal Bench 2.1, slightly ahead of Fable 5 / Mythos 5. Sol is priced the same as GPT 5.5 ($5 / $30), which is lower than Anthropic’s top model ($10 / $50).

  5. Sonnet 5 is competitive with GLM 5.2 on the benchmarks, but much more expensive. The model has a new tokenizer which produces around 30% more tokens for the same input text. So even though the listed price is the same as previous Sonnet models, the effective price is higher.

  6. FrontierCode is a new benchmark from Cognition. It tries to measure whether an LLM-generated PR would actually be merged, not just whether it passes correctness checks. Opus 4.8 scored only 13.4% on the hardest Diamond subset when the benchmark launched, and Fable 5 / Mythos 5 later reached 29.3%.

  7. GitHub launched a Copilot desktop app for running multiple agents in parallel, Copilot moved to usage-based billing, Cursor launched an iOS app with a privacy-mode caveat, Anthropic published a vulnerability detection harness, and Codex CLI had a logging bug that could burn through SSD endurance.

  8. The main takeaway from June: do not build your entire coding workflow around a single frontier provider. Claude and GPT are still extremely good, but pricing, policy, availability, and trust issues are now real operational risks. Open-weight models, especially GLM 5.2, are good enough to benchmark seriously against your own use cases.

Current state of agentic coding benchmarks:

These are the current standings of top proprietary and open-weights models. Models marked (new) were released in June.

| Model | SWE-bench Pro | Terminal Bench 2.1 | Pricing (per 1M input / output tokens) |
|---|---|---|---|
| Fable 5 / Mythos 5 (new) | 80.3% | 88.0% | $10 / $50 |
| Sonnet 5 (new) | 63.2% | 80.4% | $3 / $15 |
| GLM 5.2 (new) | 62.1% | 81.0% | $1.4 / $4.4 |
| MiniMax M3 (new) | 59.0% | 66.0% | $0.3 / $1.2 |
| Qwen 3.7 Plus (new) | 57.6% | 70.3% | $0.32 / $1.28 |
| GPT 5.6 Sol (new) | - | 88.8% | $5 / $30 |
| GPT 5.6 Terra (new) | - | 84.3% | $2.5 / $15 |
| GPT 5.6 Luna (new) | - | 82.5% | $1 / $6 |
| Claude Opus 4.8 | 69.2% | 66.1% | $5 / $25 |
| Qwen 3.7 Max | 60.6% | 69.7% | $2.5 / $7.5 |
| GPT 5.5 | 58.6% | 78.2% | $5 / $30 |
| Kimi K2.6 | 58.6% | - | $0.95 / $4 |
| GLM 5.1 | 58.4% | - | $1.4 / $4.4 |
| MiMo V2.5 Pro | 57.2% | - | $0.435 / $0.87 |
| DeepSeek V4 Pro | 55.4% | - | $0.435 / $0.87 |
| Gemini 3.1 Pro | 54.2% | 70.3% | $2 / $12 |


2. Agentic Coding Trends

Trend 1: Using loops is so fetch

We now use loops with coding agents instead of manually prompting every single turn. Armin Ronacher explained it best in his post The Coming Loop:

The pattern is the same everywhere though: work is put into a queue of sorts, a machine picks it up, attempts it, stops, and then some harness decides whether that was actually the end. If not, the harness continues the same session, injects another message, starts a fresh session with modified context, or sends the task to another machine. The task stays alive beyond the point where the model by itself would normally have said: “I am done.”

We are progressing from simple turn-based conversations to goal-based, time-based, and proactive loops. This post from Anthropic is a good starting point for exploring loop use cases.

Trend 2: LLMs are getting expensive in 4 different ways

Bigger model classes are being released. Models like Mythos are much larger than anything that was available so far, and of course they are priced higher. Mythos 5 / Fable 5 costs $10 / $50 per million input / output tokens, which is double the price of Opus models ($5 / $25).

Smaller models are being phased out. The smallest model in the GPT 5.6 series is Luna, which costs $1 / $6 per million input / output tokens. GPT-5.4-mini, on the other hand, costs $0.75 / $4.5, and GPT-5-mini costs $0.25 / $2. When these older, smaller models are discontinued, we have to move to the newer ones and spend more money, even for tasks that don't need that much intelligence.

Models produce more tokens, and more tokens mean more cost. Opus 4.7 and Sonnet 5 use a new tokenizer that produces around 30% more tokens for the same text. Even if the base price of the model stays the same, the effective price goes up.

Base prices are increasing. This is the most obvious way the price for the same class of model goes up. Gemini 3.5 Flash is priced at $1.5 / $9 per million input / output tokens, while Gemini 3 Flash was priced at $0.50 / $3.00 and Gemini 2.5 Flash at $0.30 / $2.50.
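To make the tokenizer and base-price effects concrete, here is a small cost calculator in Python. It is a sketch under stated assumptions: the workload (2M input and 200k output tokens) is made up, and it assumes a new tokenizer inflates input and output token counts by the same 30%. The prices are Sonnet 5's list price from the benchmark table ($3 / $15) and the Gemini Flash prices quoted above.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             price_in: float, price_out: float,
             token_inflation: float = 1.0) -> float:
    """Dollar cost of a workload, with prices given per 1M tokens.

    token_inflation models a tokenizer that emits more tokens for the same
    text (1.3 means 30% more tokens). Assumption: it applies to input and
    output alike.
    """
    base = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return base * token_inflation


# Hypothetical workload: the same text, measured as 2M input and 200k output tokens.
old = cost_usd(2_000_000, 200_000, 3, 15)                       # same list price, old tokenizer
new = cost_usd(2_000_000, 200_000, 3, 15, token_inflation=1.3)  # same list price, ~30% more tokens
print(f"${old:.2f} -> ${new:.2f}")  # prints "$9.00 -> $11.70"

# Base price increase: Gemini 3 Flash vs Gemini 3.5 Flash on the same workload.
flash_3 = cost_usd(2_000_000, 200_000, 0.50, 3.00)
flash_35 = cost_usd(2_000_000, 200_000, 1.5, 9)
print(f"${flash_3:.2f} -> ${flash_35:.2f}")  # prints "$1.60 -> $4.80"
```

The same list price with about 30% more tokens makes identical work about 30% more expensive, and the Flash price change triples the bill for the same workload.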

Trend 3: Anthropic is no longer the drama-free company

I gradually switched from GPT models to Claude about a year and a half to two years ago because there was just too much drama and hype-driven marketing at OpenAI. Anthropic, on the other hand, quietly focused on shipping great models for coding.

Now it seems like Anthropic has adopted the same techniques, with much more user hostility and unclear communication on top. Over the last six months, Anthropic has banned third-party coding harnesses, banned OpenClaw users without clear guidance on its usage policy, grandstanded over a government contract in which it was still willing to do mass surveillance of non-US people, constantly overhyped and anthropomorphized its models, silently served a nerfed model, and steganographically marked requests. These are just a few of the things that have convinced me to use open-weights models for more and more tasks.

3. Model Releases

June was a busy month with several model releases, but four were noteworthy: Mythos 5 / Fable 5, GLM 5.2, GPT 5.6, and Sonnet 5. Of these four, GPT 5.6 is not publicly available yet.

