The JET A‑1 Blog publishes practical insights on business process AI-automation, Make.com systems, Airtable architecture, workflow optimization, schema design, and operational efficiency for businesses. This page serves as the central hub for articles that help companies eliminate manual work, improve accuracy, and scale through automation.

Agents are good at tasks and terrible at jobs

'55% of employers regret AI-driven layoffs' [Q1 2026]

Why AI Agents Keep Failing at Real Work: The Memory Wall Explained

_____________

AI commentator Nate Jones has been digging into where agent deployments are actually breaking down.

His core findings, drawn from recent research and a vivid real-world disaster, are worth your time before your organisation goes further down the agent deployment path.


THE SHORT VERSION

>> Agents excel at tasks where context is handed to them. They fail at jobs where context has to be carried, accumulated, and applied over time.

>> Tasks come with context provided. Jobs require you to bring your own. In Nate's framing: AI is now pretty good at tasks. It is not yet good at jobs. And most enterprise deployments are asking it to do jobs.

>> Agents can execute brilliantly within a well-defined frame. They cannot construct or maintain that frame themselves. That distinction is the whole problem.

>> The humans who will matter most in an agentic world are not the ones who use agents best. They are the ones who hold the organisational context that keeps agents from going sideways.

>> Agents show a 97.5% failure rate on real freelance work (Remote Labor Index) versus near-expert performance on structured benchmarks. That gap is about context, not capability.

>> 75% of frontier LLMs actively break previously working code when asked to maintain a codebase over time. We almost exclusively benchmark writing code, not sustaining it.

>> AI agents are getting genuinely capable. The problem is that the people and organisations deploying them are not keeping pace with what safe deployment actually requires.

>> A power tool that fails silently is far more dangerous than a mediocre tool that fails obviously. Agents are becoming the former.

THE FULLER VERSION

There is a gap at the heart of the current AI agent story.

AI agents are genuinely impressive at short, well-defined tasks. But the moment you ask them to do something that looks like an actual job (sustained, contextual, evolving over time) they fall apart with a reliability that should give every enterprise deployment team serious pause.

The root cause has a name: the memory wall. Nate Jones has been making this case with some force recently, and his analysis cuts through a lot of the noise around agent capability claims.

The Shape of the Problem:

A real job has an arc. It unfolds over months, sometimes years. It accumulates context: decisions made, constraints understood, relationships navigated, institutional knowledge absorbed.

Even in fast-moving tech environments where the average software role lasts 18 to 24 months, that's an enormous amount of contextual depth compared to what an AI agent carries into any given session.

Most agent runs last an hour or two. The best-case scenarios stretch to weeks.

That's a structural mismatch between what agents are and what jobs actually require. And the evidence for how badly that mismatch plays out in practice is now coming from multiple directions.

Failure Mode One: Agents Don't Know What World They're In:

A vivid illustration of the memory wall isn't a benchmark; it's a disaster. When Alexey Grigorev's AI coding agent demolished his production database, wiping 1.9 million rows of student data in seconds, it didn't make a single technical error. Every action it took was logically sound.

The problem was that the agent had no idea it was operating in a live production environment rather than a temporary staging setup. That distinction (one of the most fundamental in all of software operations) existed only in the engineer's head.

Jones puts this case at the centre of his analysis, and the reason is clear. It's not a story about a reckless agent or a careless engineer. Grigorev made a series of entirely reasonable requests. The agent made a series of entirely logical decisions. The disaster happened in the gap between the two, in the organisational context that was never communicated and never asked for.

This is the contextual understanding failure. Agents operate on what they can see.

They cannot infer the organisational significance of what they're touching.

They don't know which infrastructure is load-bearing and which is disposable. They don't know that a configuration file unpacked from an archive represents a live system rather than a historical record.

And critically, they don't know what they don't know, so they don't ask.

The agent decided that demolishing everything at once was "cleaner and simpler" than removing resources one at a time. In isolation, that's a reasonable call. In context, it was catastrophic. And the context was never provided.
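To make the missing question concrete, here is a minimal Python sketch of a pre-flight check that forces the environment to be stated before anything destructive runs. It is an illustration only, not a description of the tooling involved in the incident: the names ActionRequest, review_action, and DESTRUCTIVE_VERBS, and the environment labels, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of verbs treated as destructive (an assumption for this sketch).
DESTRUCTIVE_VERBS = {"destroy", "drop", "delete", "truncate"}


@dataclass
class ActionRequest:
    verb: str                          # what the agent wants to do
    target: str                        # what it wants to do it to
    environment: Optional[str] = None  # None means nobody told the agent


def review_action(request: ActionRequest) -> str:
    """Return 'allow', 'ask', or 'block' for a proposed agent action."""
    if request.verb.lower() not in DESTRUCTIVE_VERBS:
        return "allow"
    if request.environment is None:
        # The agent does not know what world it is in, so it must ask a human.
        return "ask"
    if request.environment == "production":
        return "block"
    return "allow"
```

The point of the sketch is the middle branch: when the environment is unknown, the default is a question to a human rather than the "cleaner and simpler" option.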

Failure Mode Two: Maintaining Code Is a Different Skill Than Writing It:

The SWE-CI benchmark, developed by a team at Alibaba, is the first study to measure something the industry has largely ignored: not whether AI can write code, but whether it can maintain it over time.

The methodology is rigorous. It draws on one hundred real codebases, each with an average development history of 233 days and 71 consecutive updates. The agent's job is to evolve the codebase forward (adding features, fixing bugs, adapting to new requirements) the way real software actually gets built across months and years of active development.

The results are stark. 75% of frontier models tested break previously working features during maintenance.

Three out of four of the best AI models available, when asked to sustain a codebase over time, actively make things worse. The benchmark specifically penalises agents whose early decisions compound into technical debt later, and almost all of them accumulate that debt.
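What "breaking previously working features" means can be shown with a small Python sketch that compares test results before and after each update. This is not SWE-CI's scoring method, which the summary above does not spell out; the function names and the sample history are assumptions made purely for illustration.

```python
from typing import Dict, List

Results = Dict[str, bool]  # test name -> did it pass?


def find_regressions(before: Results, after: Results) -> List[str]:
    """Tests that passed before the update but fail (or are missing) after it."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def count_breaking_updates(history: List[Results]) -> int:
    """Number of consecutive updates that broke at least one passing test."""
    return sum(
        1 for prev, curr in zip(history, history[1:])
        if find_regressions(prev, curr)
    )


# Hypothetical history of three snapshots (two updates).
history = [
    {"login": True, "export": True},
    {"login": True, "export": True, "search": True},   # feature added, nothing broken
    {"login": False, "export": True, "search": True},  # login regressed
]
```

Generation benchmarks only look at the newest test ("search" here). A maintenance view also asks whether "login" still works after the change that added it.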

This matters because the entire narrative around AI replacing software jobs is built on benchmarks that measure code generation, not code stewardship.

Writing a function from scratch and maintaining a system with six months of architectural decisions baked into it are two genuinely different things.

AI has gotten genuinely good at the first. It has not gotten good at the second. And the second is what most of the actual work of software development consists of.

Failure Mode Three: Real Projects Require Context You Can't Hand Over in a Brief:

The Remote Labor Index, produced by Scale AI and the Center for AI Safety, tested frontier agents on 240 real freelance projects sourced from Upwork: video production, architecture, 3D modelling, game development, data analysis.

These were end-to-end engagements, averaging $630 per project and 29 hours of human completion time.

The best-performing agent completed 2.5% of projects at a quality a paying client would accept.

A 97.5% failure rate on real work.

The number becomes even more instructive when set against a contrasting benchmark. GDPval, built by OpenAI, shows the same class of models approaching expert-level quality and completing tasks a hundred times faster than humans.

Both results are genuine.

The difference is the setup: GDPval provides the model with everything it needs (a detailed brief, a defined deliverable format, explicit criteria for what good looks like).

The Remote Labor Index hands the model a client brief and some files and says: figure it out.

Jones draws this contrast sharply: a task comes with context provided, and AI is now quite good at tasks.

A job requires you to bring your own context, to understand not just what is being asked but why, in what environment, against what history, with what constraints.

That gap is precisely the gap between a benchmark result and a real deployment outcome.

What These Three Failures Have in Common:

The thread running through all three failure modes is the same: agents are operating without the contextual depth that real work requires. They can execute brilliantly within a well-defined frame.

They cannot construct or maintain that frame themselves.

They don't know the organisational history.

They don't know which decisions are load-bearing.

They don't know what changed six months ago and why it still matters today.

Better models are not the answer here. The fix requires humans who hold that context deliberately encoding it into the systems agents operate within: through documentation, through guardrails, and through rigorous evaluation infrastructure that tells an agent, before it acts, what it is and is not allowed to touch.
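As one possible shape for such a guardrail, the Python sketch below encodes "what an agent may touch" as a small written policy that is checked before any action. The POLICY contents, the resource naming scheme, and the may_touch function are hypothetical; a real deployment would need its own inventory of what is load-bearing.

```python
from fnmatch import fnmatchcase

# Hypothetical policy written down by the people who hold the context.
POLICY = {
    "protected": ["prod/*", "*/state.json"],  # load-bearing: never touch
    "disposable": ["staging/*", "tmp/*"],     # safe to change or remove
}


def may_touch(resource: str, policy: dict = POLICY) -> bool:
    """Default-deny check: only explicitly disposable resources are allowed."""
    if any(fnmatchcase(resource, pattern) for pattern in policy["protected"]):
        return False
    return any(fnmatchcase(resource, pattern) for pattern in policy["disposable"])
```

The design choice worth noting is the default: a resource nobody has classified is treated as off-limits, so gaps in the documentation fail safe instead of failing silently.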

Agents will eventually get better at long-term contextual work.

What matters right now is what happens in the meantime, as increasingly powerful agents are deployed into environments where the context that keeps them safe lives exclusively in human heads, with no clear plan for how to bridge that gap.

_________

Nate Jones's full analysis is here: https://www.youtube.com/watch?v=awV2kJzh8zk

© 202x | JET A-1 automation