Welcome back to Agentic Coding Weekly. Here are the updates on agentic coding tools, models, and workflows worth your attention for the week of Feb 1-7, 2026.

Executive Summary:

  • Opus 4.6 launched with 1M context. SWE-bench Verified at 80.8%, Terminal-Bench 2.0 at 65.4%.

  • GPT-5.3-Codex released within 30 minutes of Opus 4.6. 57% on SWE-Bench Pro, 77.3% on Terminal-Bench 2.0. Available now on paid ChatGPT plans; API coming soon.

  • Claude Code gets agent teams feature to run multiple Claude Code instances as a coordinated team. One session acts as the lead, the others work as teammates. Useful for parallel reviews, competing hypotheses, or building modules simultaneously.

  • Qwen3-Coder-Next is a new open-weight model that runs locally on 46GB at 4-bit. Scores 70.6% on SWE-bench Verified.

  • Codex app for macOS ships. Manage multiple agents, built-in worktree support, run parallel work without terminal juggling.

  • Worth reading: Mitchell Hashimoto's AI adoption journey and Anthropic's experiment building a C compiler with 16 parallel Claudes ($20k, 100k lines, builds the Linux kernel).

1. Tooling and Model Updates

Claude Opus 4.6

Anthropic released Claude Opus 4.6, its newest Opus-class model. Opus 4.6 sustains agentic tasks longer, operates more reliably in larger codebases, and is better at code review and debugging. It scores 80.8% on SWE-bench Verified and 65.4% on Terminal-Bench 2.0, compared to Opus 4.5's 80.9% and 59.8% respectively. Check the announcement.

What's new:

  • First Opus-class model with a 1M-token context window, available only via the API. Premium pricing applies past 200k tokens ($10/$37.50 per million input/output tokens).

  • Four effort levels now: low, medium, high (default), and max. Max effort level is Opus 4.6 exclusive.

  • US-only inference available at 1.1× pricing for compliance-sensitive workloads.
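To make the long-context pricing concrete, here's a back-of-envelope cost sketch using the premium rates quoted above. One assumption to flag: this treats the entire request as billed at the premium rate once input exceeds 200k tokens; check Anthropic's pricing page for the exact tiering.

```python
# Rough cost estimate for a long-context Opus 4.6 request at the
# premium rates above ($10 input / $37.50 output per million tokens).
# Assumption: the whole request is billed at premium rates once input
# exceeds 200k tokens -- verify against Anthropic's pricing docs.

PREMIUM_INPUT_PER_M = 10.00
PREMIUM_OUTPUT_PER_M = 37.50

def long_context_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost for a request billed entirely at premium rates."""
    return (input_tokens / 1_000_000) * PREMIUM_INPUT_PER_M + \
           (output_tokens / 1_000_000) * PREMIUM_OUTPUT_PER_M

# e.g. feeding an 800k-token codebase and getting a 20k-token review back:
print(f"${long_context_cost(800_000, 20_000):.2f}")  # $8.75
```

At these rates, a handful of full-context requests per day adds up quickly, which is worth keeping in mind before pointing the 1M window at an entire monorepo.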

Within half an hour of the Opus 4.6 launch, OpenAI released GPT-5.3-Codex.

GPT-5.3-Codex

OpenAI's most capable agentic coding model to date, 25% faster than GPT-5.2-Codex.

Sets new industry highs: 57% on SWE-Bench Pro and 77.3% on Terminal-Bench 2.0. Available on paid ChatGPT plans across the app, CLI, IDE extension, and web. API access coming soon.

Qwen3-Coder-Next

An open-weight model from the Qwen team, designed specifically for coding agents and local development. Built on Qwen3-Next-80B-A3B-Base with 256K context. Check the announcement.

Scores 70.6% on SWE-bench Verified and 36.2% on Terminal-Bench 2.0, impressive for only 3B active parameters. For reference, Kimi K2.5 (1T total params, 32B active) remains the best open-weight agentic coding model at 76.8% on SWE-bench Verified.

Runs on 46GB RAM/VRAM/unified memory at 4-bit quantization. Unsloth has a guide for running it locally with llama.cpp and using it with Codex or Claude Code.

Claude Code Updates

Agent teams (experimental): Instead of a single agent working sequentially, a lead agent can delegate to multiple teammates working in parallel. Each has its own context window and can message others directly. Useful for parallel code reviews, investigating bugs with competing hypotheses, or building independent modules simultaneously. More on this in the Workflow section.

Insights command: New /insights command reads your message history from the past month, summarizes your projects and usage patterns, and suggests workflow improvements. Announcement.

Fast mode (research preview): Makes Opus 4.6 respond up to 2.5x faster at 6x the price. Not covered by Pro/Max subscriptions, billed entirely through extra usage.

Expensive, but Anthropic is offering $50 in extra usage to Pro and Max users, which can be used to try out fast mode. Claim it at Settings > Usage before February 16. Fast mode docs.

Xcode integration: Xcode 26.3 introduces native integration with the Claude Agent SDK, the same harness powering Claude Code. You get full Claude Code capabilities directly in Xcode, including subagents, background tasks, and plugins. Announcement.

Codex App for macOS

A new interface for managing multiple agents, running work in parallel, and collaborating over long-running tasks. Agents run in separate threads organized by projects, so you can switch between tasks without losing context. Built-in worktree support lets multiple agents work on the same repo without conflicts. Think of it as an alternative to running Codex in multiple terminal/tmux windows while manually maintaining worktrees. Announcement.

OpenAI is also doubling rate limits for Codex on Plus, Pro, Business, Enterprise, and Edu plans for two months.

Codex is also now available in Xcode 26.3 release candidate.

2. Workflow of the Week

Claude Code shipped an experimental feature, Agent Teams, to run multiple Claude Code instances as a coordinated team. One session acts as the lead, the others work as teammates, each with its own context window, able to message each other directly. Useful for parallel code reviews, investigating bugs with competing hypotheses, or building independent modules simultaneously.

Agent teams are disabled by default. To enable it, set CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS to 1 in your ~/.claude/settings.json file.

// settings.json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}

Example prompt to try:

I'm designing a CLI tool that helps developers track TODO comments across their codebase. Create an agent team to explore this from different angles: one teammate on UX, one on technical architecture, one playing devil's advocate.

Claude handles spawning, task assignment, and coordination. Each teammate loads the project's CLAUDE.md and MCP servers but doesn't inherit the lead's conversation history, so be specific in spawn prompts: include file paths and context.

Practical tips: Keep teammates working on separate files to avoid overwrites. If the lead starts doing work itself, press Shift+Tab for delegate mode to restrict it to coordination only. Start with read-only tasks like reviews before trying parallel implementation. Also, each teammate is a separate Claude instance, so token usage adds up fast.
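Since teammates don't share context, the team's token usage is roughly additive. A minimal sketch of that budgeting, with all figures hypothetical (plug in your own observed per-session numbers):

```python
# Rough token-budget sketch for an agent team. Every number below is
# hypothetical -- substitute your own observed per-session usage.

def team_token_estimate(lead_tokens: int, teammate_tokens: int,
                        teammates: int) -> int:
    """Total tokens: one lead session plus N independent teammate sessions.
    Teammates don't inherit the lead's context, so each pays its own way."""
    return lead_tokens + teammates * teammate_tokens

# A review that costs ~200k tokens as a single session can easily run
# 4x that with a lead plus three teammates:
print(team_token_estimate(200_000, 200_000, 3))  # 800000
```

The multiplier is the number of teammates, not the amount of work, which is why read-only tasks are a cheap way to find out whether the coordination overhead pays off for your project.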

Read the docs for more details on the Agent Teams feature.

3. Community Picks

My AI Adoption Journey

Mitchell Hashimoto (HashiCorp cofounder) details his three-phase journey adopting AI for coding, moving from initial inefficiency with conversational chatbots to a workflow-altering shift once he focused on coding agents.

I really liked the end-of-day agents pattern and would recommend checking it out:

Block out the last 30 minutes of every day to kick off one or more agents. My hypothesis was that perhaps I could gain some efficiency if the agent can make some positive progress in the times I can't work anyways. Basically: instead of trying to do more in the time I have, try to do more in the time I don't have.

Building a C Compiler with a Team of Parallel Claudes

After Cursor's experiment building a browser from scratch with long-running autonomous agents, this is Anthropic's attempt at building a C compiler using Opus 4.6.

Researcher Nicholas Carlini set up 16 Claude agents running in parallel. No human in the loop for most of it. Over two weeks and about $20k in API costs, the agents produced a 100k-line compiler that can build the Linux kernel on three architectures.

The compiler successfully builds many projects but not all. Not yet a drop-in replacement for a real compiler. A GitHub issue pointing out that a hello world program didn't compile became briefly popular.

4. Weekly Quiz

SWE-bench Verified is the most commonly reported coding benchmark, so it's important to understand what it actually measures. Check your understanding with these five questions; the explanations will help fill any knowledge gaps.

The answers and explanations are just below the fifth question. Finalize your answers before scrolling further.

Q1: What does SWE-bench actually test?

A) An LLM's ability to write code from scratch given a natural language specification
B) An LLM agent's ability to generate a patch that resolves a real GitHub issue in an existing codebase
C) An LLM's ability to review pull requests and suggest improvements
D) An LLM's ability to write and run unit tests for open-source projects

Q2. SWE-bench sources its tasks from how many open-source repositories, and in which language?

A) 50 repositories across Python, JavaScript, and TypeScript
B) 12 Python repositories
C) 25 Python and Java repositories
D) 100+ repositories across multiple languages

Q3. In SWE-bench, an agent gets some information and has to figure out the fix. But what exactly does the agent get to work with?

A) The issue description, the codebase, and the failing unit tests
B) The issue description, the codebase, and the original PR discussion
C) The issue description and the codebase only
D) The issue description, the codebase, and a diff of the expected solution

Q4. OpenAI collaborated with the SWE-bench authors to create a subset, SWE-bench Verified. What was the primary goal?

A) To make the benchmark harder by adding more complex tasks
B) To create a larger dataset with more diverse programming languages
C) To filter out problematic samples that were causing the original benchmark to underestimate model capabilities
D) To add multi-file editing tasks that better represent real-world development

Q5. How is a proposed solution by an agent for an issue evaluated in SWE-bench Verified?

A) By comparing the generated patch to the original PR diff using exact match
B) By running FAIL_TO_PASS tests (which should now pass) and PASS_TO_PASS tests (which should still pass)
C) By having human reviewers grade the solution for correctness
D) By running the full repository test suite and checking for zero failures

That’s it for this week. I write this weekly on Mondays. If this was useful, subscribe below:
