Welcome back to Agentic Coding Weekly. Here are the updates on agentic coding tools, models, and workflows for the week of May 31 - Jun 6, 2026.
Executive Summary:
MiniMax released M3. 1M context. 59.0% on SWE-Bench Pro. $0.60 / $2.40 per million input / output tokens.
Nvidia released Nemotron 3 Ultra, an open-source model with 550B total / 55B active parameters and 1M context.
Qwen 3.7 Plus was released. 57.6% on SWE-Bench Pro.
GitHub Copilot launched a desktop app for managing multiple coding agents in parallel. Copilot also switched to usage-based billing.
Worth reading: A hand-written TDD skill for coding agents and /goal loop use cases.
Meanwhile, if you work with coding agents on a Python project, you have probably noticed that they tend to edit pyproject.toml directly and add whatever dependency version they happen to know about. I add these lines to my AGENTS.md file whenever I'm working on a Python project, which fixes the issue.
We use uv to manage dependencies.
Use `uv add` and `uv add --dev` to manage dependencies.
Don't edit `pyproject.toml` directly.
1. Tooling and Model Updates
MiniMax M3
Newest model release from MiniMax. 1M context window. $0.60 / $2.40 per million input / output tokens. Weights not public yet.
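To get a feel for what that pricing means per request, here is a minimal sketch. The token counts in the example are hypothetical, and the calculation assumes flat per-token pricing with no caching discounts or tiers, which the announcement may not guarantee.

```python
# Prices quoted above for MiniMax M3, in USD per million tokens.
M3_INPUT_PRICE_PER_M = 0.60
M3_OUTPUT_PRICE_PER_M = 2.40


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request in USD.

    Assumes flat pricing: no cache discounts, no long-context surcharge.
    """
    input_cost = input_tokens / 1_000_000 * M3_INPUT_PRICE_PER_M
    output_cost = output_tokens / 1_000_000 * M3_OUTPUT_PRICE_PER_M
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical agent turn: 200K tokens of context in, 20K tokens out.
    print(f"${request_cost(200_000, 20_000):.3f}")
```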
On coding benchmarks, M3 scores 59.0% on SWE-Bench Pro, ahead of GPT-5.5 (58.6), Gemini 3.1 Pro (54.2), and the open-weight DeepSeek V4 Pro (55.4), GLM 5.1 Thinking (58.4), and Kimi K2.6 Thinking (58.6), but behind Claude Opus 4.8 (69.2). On Terminal-Bench 2.1 it reaches 66.0, behind GPT-5.5 (78.2), Opus 4.8 (74.6), and Gemini 3.1 Pro (70.3).
The pelican legs are too short unfortunately:

MiniMax M3 xhigh - Pelican riding a bicycle
Nemotron 3 Ultra
New release in Nvidia's fully open model series where weights, training recipe, and training data are all public. 550 billion total parameters, 55 billion active, with 1M context support.
Qwen 3.7 Plus
New proprietary model from the Qwen team. Scores 57.6% on SWE-Bench Pro and 70.3% on Terminal-Bench 2.0.
MAI-Code-1-Flash
A new proprietary coding model from Microsoft. It's a mixture-of-experts model with 137 billion total parameters and 5 billion active parameters. The context window is 256k tokens. It scores 51.2% on SWE-Bench Pro and 54.8% on Terminal-Bench 2.0. Both scores are lower than those of Qwen 3.6 27B, which you can run on your own machine. Doesn't sound worth using for day-to-day coding tasks.
GitHub Copilot Updates
GitHub Copilot launched a desktop app similar to the Claude and Codex desktop apps and Antigravity 2.0. It's an agent manager where you can run multiple agents in parallel in git worktrees.
Also, GitHub switched Copilot to usage-based billing for both individuals and enterprises starting June 1. You'll pay API pricing after consuming base credits, which are tiny.
Anthropic's Vulnerability Detection Harness
Anthropic published a reference implementation for automatically finding and fixing vulnerabilities in C/C++ codebases. The docs say, "expect ~10K uncached input tokens/min and ~2K output tokens/min per agent. You can scale parallelism up to your account's ITPM limit (roughly 10 agents per 100K ITPM)". It will be expensive.
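A rough way to see how expensive: the sketch below turns the quoted per-agent rates into an hourly cost. The rates come from the docs quote above. The prices in the example are hypothetical placeholders, not any model's actual pricing, and the estimate ignores cached input tokens entirely, so plug in your own numbers.

```python
# Per-agent rates quoted in the harness docs (approximate).
INPUT_TOKENS_PER_MIN = 10_000   # uncached input tokens per agent per minute
OUTPUT_TOKENS_PER_MIN = 2_000   # output tokens per agent per minute


def max_agents(itpm_limit: int) -> int:
    """Number of agents that fit under an input-tokens-per-minute limit."""
    return itpm_limit // INPUT_TOKENS_PER_MIN


def hourly_cost(agents: int, input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated USD per hour for `agents` running in parallel.

    Prices are USD per million tokens. Cached input is not counted.
    """
    input_tokens = agents * INPUT_TOKENS_PER_MIN * 60
    output_tokens = agents * OUTPUT_TOKENS_PER_MIN * 60
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )


if __name__ == "__main__":
    agents = max_agents(100_000)  # matches the "10 agents per 100K ITPM" figure
    # Hypothetical prices: $5 input / $25 output per million tokens.
    print(agents, hourly_cost(agents, 5.0, 25.0))
```

Under those placeholder prices, ten agents come to about $60 per hour, split evenly between input and output tokens.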
Quick Updates
paseo is an open-source orchestrator to run coding agents in parallel from desktop and mobile.
Alibaba released an open-source code review CLI tool, Open Code Review. It can also be installed as a skill in Claude Code and Codex CLI.
Uber put a $1,500 limit on monthly token spend per AI coding tool for each employee. Considering enterprises now pay the API cost for OpenAI and Anthropic, this limit roughly translates to an individual Claude Max 20x ($200/month) subscription.
2. Community Picks
My Agent Skill for Test-Driven Development
A TDD skill based on what the author calls the specify-encode-fulfill loop. Unlike most of the popular agent skills, which are AI-written slop, this one is actually hand-written, concise, and useful. Link to the SKILL.md.
Backpressure is All You Need
Covers how to let coding agents validate more of their own work before we have to step in. Good examples of using the /goal loop. Read the post.
Ask HN: What was your "oh shit" moment with GenAI?
Good thread. Found this comment interesting:
I (co-author of InstructGPT/RLHF/ChatGPT) helped train some of the first "magic" models at OpenAI and it was a wild ride. We were a pretty sane + skeptical team and we weren't totally convinced the models were as general as they seemed, but the query that convinced me (and later got included in the RLHF paper) was "Why is it important to eat socks after meditating?" (something that almost certainly did not appear on the internet before).
That’s it for this week. I write this weekly on Mondays. If this was useful, subscribe below:
— Prashant




