Welcome to Blank Metal’s Weekly AI Headlines.
Each week, our team shares the AI stories that caught our attention—the articles, announcements, and insights we’re actually discussing internally. We curate the best of what we’re reading and add the context that matters: what happened, why it matters, and what to do about it.
The Frontier Reloads
Anthropic shipped twice in one day: a new Claude Opus aimed squarely at catching its own mistakes, and a Claude Code feature that lets a single session orchestrate hundreds of agents against a problem too big for any one of them. The pattern under both: the frontier is competing less on raw capability and more on reliability at scale, the thing that actually decides whether you can put an agent in production.
A New Claude Opus Lands With a Focus on Catching Its Own Mistakes
What: Anthropic released Claude Opus 4.8 on May 28. Pricing holds at $5 per million input tokens and $25 per million output, with a new fast mode at $10/$50 that costs roughly a third of what the prior fast tier did. The headline gain is reliability: Anthropic reports the model is about four times less likely than Opus 4.7 to let a flaw in its own code pass unremarked. It scores 84% on Online-Mind2Web, is the first model to break 10% on the all-pass standard of the Legal Agent Benchmark, and is the only model to complete every case end-to-end on the “Super-Agent” benchmark. It ships with effort control in claude.ai and Cowork and dynamic workflows in Claude Code.
So What: The number that matters here isn’t a capability score, it’s the self-correction rate. For agentic work, the failure mode that costs you money isn’t the model being incapable—it’s the model being confidently wrong and shipping it anyway. A 4x drop in unremarked-flaw rate is a direct attack on the review burden that makes production agents expensive to run. Flat pricing on a more reliable model also means your cost per correct output drops even though the sticker price didn’t move, which is the metric that actually belongs in your build-vs-buy math.
Now What: If you’re running coding or agentic workloads in production, re-run your eval suite against 4.8 before you assume your harness needs more guardrails. Some of the human review you built around 4.7 may now be redundant cost. Watch the self-check reliability gain specifically; that’s the lever that changes how much oversight a given workflow requires.
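To make the cost-per-correct-output point concrete, here is a minimal Python sketch. The per-attempt cost and both pass rates are invented for illustration (they are not Anthropic figures), and the function name is our own. It assumes a failed attempt is simply retried at the same cost; substitute the pass rates from your own eval suite.

```python
def cost_per_correct_output(cost_per_attempt: float, pass_rate: float) -> float:
    """Expected spend to get one correct output, assuming failures are retried."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


# Hypothetical numbers for illustration only.
COST_PER_ATTEMPT = 0.40  # dollars of tokens per task; unchanged because pricing is flat
old_pass_rate = 0.80     # assumed share of tasks done correctly on the older model
new_pass_rate = 0.95     # assumed share of tasks done correctly on the newer model

old_cost = cost_per_correct_output(COST_PER_ATTEMPT, old_pass_rate)
new_cost = cost_per_correct_output(COST_PER_ATTEMPT, new_pass_rate)

# prints: old: $0.500  new: $0.421  change: -15.8%
print(f"old: ${old_cost:.3f}  new: ${new_cost:.3f}  change: {new_cost / old_cost - 1:+.1%}")
```

With the sticker price held constant, each point of pass rate shows up directly in the per-result cost, and that is before counting any human review time you no longer need.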
Claude Code Adds “Dynamic Workflows” to Orchestrate Hundreds of Agents
What: Alongside Opus 4.8, Anthropic shipped dynamic workflows in Claude Code. Instead of a single agent or a fixed set of subagents, Claude writes its own orchestration script on the fly—decomposing a large problem, spawning tens to hundreds of parallel subagents, and validating each result independently before delivering an answer. It targets codebase-scale jobs: bug hunts across services, migrations spanning hundreds of files, verified security audits, and language ports across thousands of files. Anthropic cites Bun’s Zig-to-Rust port as a proof point: 750,000 lines of Rust, first commit to merge in 11 days, and 99.8% of existing tests passing.
So What: This is the difference between an agent that does a task and a system that decomposes a project. The constraint on agentic work has been coordination: one agent loses the thread on anything that spans more than a handful of files. Auto-decomposition plus independent verification is how you get reliable work at the scale of an actual migration or audit instead of a toy example. The verification step is the part that matters. Running agents in parallel is easy; validating each result independently before reporting is what makes the output trustworthy.
Now What: If you’ve got a migration, a framework upgrade, or a security audit sitting in the backlog because it’s too big to staff, this is the class of work that just became tractable. Pick one bounded, well-tested codebase and run it as a pilot, with the test pass rate as your scoreboard. Teams with strong existing test coverage will get the most out of this first; teams without it should read the verification requirement as a reason to build that coverage now.
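The fan-out-then-verify shape described above can be sketched in a few lines of Python. This is not how Claude Code implements dynamic workflows; `port_file` and `verify` are hypothetical stand-ins for a subagent and an independent check (in practice, running that unit's tests), and the file names are made up. One simulated failure is included so the scoreboard has something to catch.

```python
from concurrent.futures import ThreadPoolExecutor


def port_file(path: str) -> str:
    """Hypothetical stand-in for one subagent porting one file."""
    # Simulate an agent that silently produces a bad result for one input.
    if path == "parser.zig":
        return ""
    return path.replace(".zig", ".rs")


def verify(path: str, result: str) -> bool:
    """Hypothetical independent check; in practice, run that unit's tests."""
    return result == path.replace(".zig", ".rs")


def run_workflow(paths: list[str], max_workers: int = 8) -> dict[str, list[str]]:
    """Fan out one unit of work per file, then accept only verified results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(port_file, paths))  # order matches `paths`
    report: dict[str, list[str]] = {"accepted": [], "rejected": []}
    for path, result in zip(paths, results):
        report["accepted" if verify(path, result) else "rejected"].append(path)
    return report


files = ["lexer.zig", "parser.zig", "codegen.zig"]
report = run_workflow(files)
pass_rate = len(report["accepted"]) / len(files)
print(report, f"pass rate: {pass_rate:.0%}")
```

The design choice worth copying is that nothing is reported as done until a separate check has accepted it, so the pass rate measures verified work rather than claimed work.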
Agents Move Into Every Role
The agent left the codebase this week. OpenAI repositioned Codex as a knowledge-work platform where non-developers are now its fastest-growing users; Microsoft put an always-on agent inside Teams; and Perplexity built one that decides on its own what to run locally versus in the cloud. Different surfaces, one direction: the agentic harness that was built for engineers is becoming the way everyone else works too.
OpenAI Pushes Codex Out of Engineering and Into Knowledge Work
What: On June 2, OpenAI repositioned Codex from a coding tool to a general knowledge-work platform. It now has more than 5 million weekly active users, up more than 6x since the February desktop launch, with non-developers making up roughly 20% of users and growing more than 3x faster than developers. OpenAI launched six role-specific plugins—data analytics, creative production, sales, product design, public-equity investing, and investment banking—bundling 62 apps and 110 skills, plus “Sites” for building shareable interactive pages and “annotations” for refining docs, sheets, and slides in place. Named users include Zapier and NVIDIA. More plugins—corporate finance, private equity, marketing strategy, strategy consulting, legal—are on the way.
So What: The signal isn’t the feature list, it’s the user mix. When non-developers are the fastest-growing segment of a tool built for engineers, the line between “coding agent” and “work agent” has stopped meaning anything. The same harness that writes code—plan, act, verify, iterate—turns out to be how you do financial modeling, sales ops, and analysis. This collapses a procurement question for you: you may not need a separate AI tool per function if the agentic platform your engineers already use also covers the analysts and the operators.
Now What: If you’re deciding where AI tooling lives in your org, stop scoping it as an engineering line item. Map the role-specific plugins against your actual functions (finance, sales, ops) and pressure-test whether one platform covers more of your headcount than your current per-team point solutions. The roles OpenAI is shipping plugins for next are a fair preview of which of your departments are about to be in scope.
Microsoft Launches Scout, an Always-On AI Coworker in Teams
What: On June 2, Microsoft introduced Scout, an always-on AI agent that lives in Microsoft Teams and reads your work messages, calendar, and email to automate tasks, resolve meeting conflicts, and draft replies. It’s an OpenClaw-style agent. Microsoft named Omar Shahine corporate VP of the effort and is framing the product as “your company essentially hires your assistant.” It’s launching to a small customer group; the desktop app currently requires an active GitHub Copilot subscription. Microsoft’s own internal sales org is the largest and fastest-growing user group. Scout lands opposite Google’s Gemini Spark, a similar always-on agent. Microsoft flags prompt injection as the main risk and is mitigating it with a limited rollout and admin tracking tools.
So What: The shift here is from agent-as-tool to agent-as-standing-presence. Scout doesn’t wait to be prompted—it watches your work surface continuously and acts. That’s a meaningfully different security and governance posture than a chat window, which is exactly why Microsoft is gating the rollout and shipping admin controls first. The prompt-injection risk they name out loud is the real cost of an agent that reads everything: the same access that makes it useful makes it an attack surface.
Now What: If you’re evaluating always-on agents for your team, lead with the governance question, not the capability one. Ask what the agent can read, what it can act on without confirmation, and what audit trail your admins get. Microsoft is shipping those controls deliberately, which tells you they’re the gating factor for a sensitive or regulated environment. Treat the human-confirmation boundary as a config decision you own, not a vendor default you accept.
Perplexity Splits Agent Tasks Between On-Device and Cloud Models
What: On June 2, Perplexity said its Mac-native agentic system, Perplexity Computer, will split a single task between an on-device compact model and frontier cloud models—automatically, task by task—rather than making you choose local or cloud upfront. Perplexity calls it “hybrid agentic inference.” A local model decides when sensitive data such as financial, health, or personal files should stay on the device, while the cloud handles work that needs full frontier capability. The feature is positioned on privacy and token efficiency and is set to arrive in July 2026.
So What: This is an architecture answer to two problems buyers actually have: cost and data residency. Routing the cheap, sensitive, or local-context work to an on-device model and reserving the expensive cloud model for what genuinely needs it is the same token-economics discipline that makes any agent deployment affordable at scale. The privacy framing matters more—an agent that can keep regulated data on the device by default changes what’s deployable in environments where sending everything to a cloud model is a non-starter.
Now What: If data residency or per-token cost is what’s blocking an agent rollout for you, hybrid local/cloud routing is the pattern to watch and to ask your vendors about. The design question to bring to any evaluation: who decides what stays local, on what rule, and can you audit it? An automatic split is only a privacy win if you can see and control the routing logic.
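As a rough illustration of what "see and control the routing logic" could mean, here is a small Python sketch of a router that records every decision. It is not Perplexity's design: their system reportedly uses a local model to make the call, while this sketch swaps in a static tag rule to keep the example short. The class, tag names, and task IDs are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumed policy: the sensitive categories named in the announcement.
SENSITIVE_TAGS = frozenset({"financial", "health", "personal"})


@dataclass
class Router:
    """Decides where a task runs and records why, so the rule can be audited."""

    audit_log: list = field(default_factory=list)

    def route(self, task_id: str, tags: set) -> str:
        matched = sorted(tags & SENSITIVE_TAGS)
        target = "local" if matched else "cloud"
        # Log the decision and the tags that triggered it.
        self.audit_log.append({"task": task_id, "target": target, "matched": matched})
        return target


router = Router()
print(router.route("summarize-tax-return", {"financial", "pdf"}))  # local
print(router.route("research-competitors", {"web"}))               # cloud
print(router.audit_log)
```

Whatever mechanism a vendor actually uses, the questions to ask map onto this shape: where the rule lives, who can change it, and whether each decision leaves a record you can inspect.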
The Receipts Start Coming In
The question shifted from “can it” to “did it pay.” A Thrive Holdings company put $1B behind the bet that AI changes the unit economics of accounting, with tax-season numbers to back it; OpenAI sent a former enterprise-software CEO on the road to close business in person; and SemiAnalysis explained why the gains are real even when they don’t show up in the P&L. Three angles on the same hard question every board is now asking.
A Thrive Holdings Company Bets $1B on an AI-Powered Accounting Roll-Up
What: Thrive Holdings, a spinoff of Joshua Kushner’s Thrive Capital, is committing $1B to acquiring local accounting firms through its operating company Current, run by former Mattress Firm CEO Steve Stagner. It’s a Berkshire-style long hold that leaves minority stakes with local partners, explicitly not a buy-and-flip. Current has already acquired around 50 practices. The case for the model is in the tax-season numbers from its “Tax AI” system: 7,000 returns processed through the AI, an average 31% time savings, up to 98% data-entry accuracy (an error rate as low as 2%) against a typical 10-15% human error rate, and one preparer who went from 180 hours to 15. OpenAI assigned a dedicated team and, over one weekend, let Codex run for 48 hours, testing hundreds of solutions.
So What: This is the clearest worked example yet of AI changing the unit economics of a services business, not just the productivity of an individual worker. The roll-up thesis only works if AI structurally lowers the cost of delivering the service—and a 31% time savings with higher accuracy is exactly that. The detail that should register for any operator is that the value didn’t come from buying a model license; it came from a focused engineering push against a specific, repetitive, high-volume workflow. The model was the easy part.
Now What: If you operate a services business with repetitive, high-volume work (accounting, claims, underwriting, document review), this is the template: pick the single highest-volume workflow, measure its current time and error cost, and engineer against it before you generalize. The ROI case here is built on one workflow done well, not a platform deployed broadly. That’s the sequencing that makes the number real.
OpenAI’s Revenue Chief Spends Six Months Selling Enterprises in Person
What: OpenAI’s chief revenue officer Denise Dresser—former Slack CEO, who joined in December 2025—has spent roughly six months traveling globally to sell enterprises on OpenAI, reportedly taking around 400 customer meetings in her first 90 days. The reporting frames the push against OpenAI’s enterprise growth targets and a potential IPO, with Dresser saying the enterprise business is accelerating. (The 400-meetings figure comes via secondary coverage of a paywalled report, so treat it as directional.)
So What: The tell isn’t the meeting count, it’s that the most aggressive consumer-AI company on earth decided enterprise revenue requires a former enterprise-software CEO on planes doing in-person sales. That’s an admission that adoption at the org level isn’t a self-serve motion—it runs through procurement, security review, and change management, the same friction that has always governed enterprise software. That’s leverage for you: vendors competing this hard for your enterprise commitment are vendors you can negotiate with on price, terms, and support.
Now What: If you’re in an enterprise AI buying cycle, recognize that you’re in a seller’s-effort market and use it. The labs are spending real go-to-market money to land enterprise logos, which means now is the moment to push on pricing, dedicated support, and contractual commitments rather than accept list terms. The same dynamic that put a revenue chief on a plane to see you is the dynamic that gives you room at the table.
SemiAnalysis Argues AI’s Value Is Real but Hidden From the Numbers
What: A May 29 SemiAnalysis piece by Malcolm Spittler and Dylan Patel makes the case for “dark output”: AI-generated economic value that’s real but invisible in GDP, prices, and labor statistics, because services get measured by receipts and wages rather than units of work. They split it in two: substitution dark output, roughly $1.5T in labor-cost tasks current AI could augment or automate, and new dark output, work that was too expensive to do before AI and is likely larger over time. They draw the analogy to Solow’s productivity paradox and to the 2013 GDP revision that added about 3.6% (roughly $560B) to the U.S. accounts by counting R&D and IP, and cite Anthropic’s Economic Index showing 37% of usage tokens in computer and math work against flat measured software investment.
So What: This is the analytical frame for the question every board is asking: if everyone’s using AI, why isn’t it in the P&L yet? Part of the answer is that the gains show up as work that didn’t happen—reviews not needed, analyses done in-house instead of outsourced, things attempted that weren’t worth attempting before. None of that generates a line item. The risk for an operator is the inverse: measuring AI ROI only by what shows up in cost-out reporting understates the value and can kill a program that’s actually working.
Now What: If you’re being asked to justify AI spend, stop reporting only the costs you cut and start counting the work that’s now getting done that wasn’t before: the analyses you would have skipped, the reviews you would have outsourced, the questions you can now afford to ask. That new output is where most of the value is hiding, and it won’t show up in a savings spreadsheet unless you deliberately put it there.
Who Controls the Ground Truth
Agents are only as good as the data underneath them, and this week two companies drew opposite-facing lines around it. Lowe’s made the case that a clean internal semantic layer is what makes agents trustworthy; Strava locked its data behind authentication and a paywall to stop agents from taking it for free. Inside the walls and outside them, the same lesson: whoever controls the data controls whether the agents work—and who gets to use them.
Lowe’s Says a Semantic Data Layer Is What Makes Its Agents Useful
What: Lowe’s told The Information, in reporting around May 29, that it’s using semantic data and knowledge graphs to make its AI agents more useful across shopping, store operations, and finance. The core idea is using a semantic layer to standardize how business metrics are defined—what “revenue” means, for instance—so agents read enterprise data correctly instead of guessing. The story places Lowe’s as a customer-side data point in the broader fight among Microsoft, Databricks, and SAP over who controls the enterprise semantic layer.
So What: This is the unglamorous prerequisite that determines whether agents work at all. An agent querying enterprise data is only as good as the definitions underneath it—give it ambiguous metrics and it will confidently return wrong answers that look right. The reason “point an agent at your data warehouse” disappoints in practice is almost always this: the data layer was never made legible enough for an agent to reason over. Lowe’s is naming the actual bottleneck out loud.
Now What: If your agent pilots are returning plausible-but-wrong answers on your own data, the problem is probably your semantic layer, not your model. Before you invest in a better model or a fancier retrieval setup, standardize the business-metric definitions agents will read. That’s the work that turns a demo into something the finance team will trust. Whoever owns that semantic layer in your stack owns whether your agents can be believed.
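For a sense of what "standardize the definitions" looks like in practice, here is a minimal Python sketch of a governed metric registry. It is our own illustration, not Lowe's implementation or any vendor's semantic-layer API; the table, column, and dimension names are invented. The idea is that an agent asks for a metric by name and the query is built from the one agreed definition, and anything undefined is refused rather than guessed.

```python
# Hypothetical governed definitions; table and column names are invented.
METRICS = {
    "revenue": {
        "description": "Net sales after returns and discounts, excluding tax",
        "expression": "SUM(net_amount)",
        "table": "sales_orders",
        "dimensions": {"region", "store_id", "fiscal_week"},
    },
}


def build_query(metric: str, group_by: str) -> str:
    """Build SQL from the governed definition instead of letting an agent guess."""
    if metric not in METRICS:
        raise KeyError(f"no governed definition for metric {metric!r}")
    definition = METRICS[metric]
    if group_by not in definition["dimensions"]:
        raise ValueError(f"{group_by!r} is not an approved dimension for {metric!r}")
    return (
        f"SELECT {group_by}, {definition['expression']} AS {metric} "
        f"FROM {definition['table']} GROUP BY {group_by}"
    )


# prints: SELECT region, SUM(net_amount) AS revenue FROM sales_orders GROUP BY region
print(build_query("revenue", "region"))
```

A request for an undefined metric such as "margin" raises an error here, which is the behavior you want: a visible gap in the definitions is cheaper than a confident wrong answer.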
Strava Locks Down Its Data and Charges for API Access Ahead of an IPO
What: On June 1, TechCrunch reported Strava is moving previously public data (public profiles, fitness-club listings) behind authentication and adding a flat $11.99/month fee for all developer API access, replacing a free tiered program. Its developer community grew from 185,000 to 241,000 members year over year. Strava is retiring some endpoints with a 90-day grace period and adding MCP support for structured AI access. CEO Michael Martin says unchecked AI scraping “could be the death knell of the public internet,” cites repeated site-performance hits, and singles out Perplexity for routing scraping through aggregators after being refused a licensing deal. Strava filed confidentially for an IPO earlier this year.
So What: This is what data ownership looks like as a deliberate strategy, not a privacy afterthought. Strava is doing two things at once: pulling its data behind authentication so agents can’t take it for free, and adding MCP so agents can get it through a controlled, paid door. That’s the emerging shape of the agentic web—not open scraping, but metered, authenticated access on the data owner’s terms. For any company sitting on proprietary data, the lesson is that “publicly accessible” and “free for agents to consume” are about to be separate decisions you make on purpose.
Now What: If your company holds data that others, or their agents, currently pull for free, this is the week to decide your posture: what goes behind authentication, what you expose through a controlled interface like MCP, and what you charge for. The advantage isn’t keeping data locked away; it’s controlling the terms of access while still making it usable. Treat agent access as a product decision, not an IT setting.


