Can I trust my gut on AI?

A colleague introduced me to the excellent Money Stuff newsletter, written by Matt Levine at Bloomberg. The July 3rd edition was “AI Is Bad at Snacks”. It contained a delightful tale of an AI lab trialling having Claude, a large language model (LLM), operate a “small automated shop” (aka a fridge rigged up as a vending machine). The model was in charge of pricing and stock management, among other shopkeeping tasks. The long and the short of it was that the AI did not do particularly well. After recounting the LLM’s various mistakes and failures (which included asking for money to be sent to the wrong bank account), Matt says:
I am always wrong in my intuitions about what tasks will be easy or hard for AI. If you had asked me a week ago, “which task will be easier for a computer, driving a car or operating a vending machine,” I would have said “driving a car requires integrating lots of visual and other data in real time and seems really complicated, while vending machines just … are … computers? And have been operated by computers for decades?” But, nope, vending machine operation is still hard for AI.
Instinctively that seems as if it should be right. But taking a step back, self-driving cars have had years of investment, testing, and development of specialist technology just to get to the point where they are now on the road in a few cities (and a ride costs several times the price of a standard taxi). The vending machine was being run by an LLM, which is not really a specialist tool, and with what appears to be no previous experience. If my colleagues put me in charge of the office vending machine I probably wouldn’t do that good a job either, although I hopefully would not be tricked by my colleagues/customers into giving away a tungsten cube (which cost a pretty penny) for free. It would not be fair to compare my vending performance to my driving performance, where after many hours of practice I am undoubtedly among the 80% of self-assessed above-average drivers.
Not long after, I saw a study by Joel Becker, Nate Rush, Beth Barnes and David Rein of METR (https://arxiv.org/abs/2507.09089). If you have some time I would highly recommend reading the paper – there is a lot of interesting stuff in there.
The study tested the impact of AI tools on experienced developers dealing with real tasks in projects they had worked on for a substantial period of time (on average five years). The developers were all at least moderately experienced at using AI, and were allowed to use whatever tools they wanted, with the study providing a couple of options as well. The idea was to see how different the results would be from industry benchmarks or synthetic tasks, which can be misleading: a benchmark that measures output, for example, might score an AI producing 50 lines of code as productive when 5 lines would have done the same job (something we humans can be guilty of too!). Essentially, the authors were taking the testing out of the lab and into the field, to measure real-world productivity changes.
The study solicited forecasts from economics and machine learning experts, as well as the developers themselves:
- The economics experts expected that resolving issues would be 39% faster when using AI, and the machine learning experts 38% faster.
- Before starting each task, the developers estimated how long it would take, and on average expected to resolve the issue 24% faster when using AI.
- After the study finished, the developers were again asked how the AI had impacted the time spent on tasks; on average they estimated a 20% time saving.
In fact, on average it took the developers 19% longer to complete the task when using AI.

Why was this? There were two explanations that stood out:
- The paper broke down the tasks and looked at how time might be saved in one area (e.g. searching, coding) while new work was added elsewhere (e.g. reviewing the AI’s code, prompting the AI). This is a key point – a time saving in one area does not necessarily result in a saving across the whole task, especially where new work is created by having the tool do part of the job (see the sketch after this list).
- The study focused on experienced developers and complex code repositories; the authors comment that this made it harder for the AI to speed up tasks, both because of the scale of the codebases and because the AI lacked “tacit” knowledge and context. It would be interesting to compare the results to studies on delegating tasks to less experienced colleagues – working with someone more senior is often a critical part of how professionals get to the point where, for example, they can competently review the AI’s code, and beat it on speed of task completion.
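To make the first point concrete, here is a minimal sketch of that time accounting. The numbers are entirely hypothetical – they are illustrative only, not figures from the paper: the familiar activities get faster, but the tool also creates new activities, and the total can still come out slower.

```python
# Hypothetical time budget for one task, in minutes.
# Illustrative numbers only -- NOT taken from the METR paper.

without_ai = {
    "searching the codebase": 30,
    "writing code": 60,
    "testing and debugging": 30,
}

with_ai = {
    "searching the codebase": 15,   # faster: the AI helps locate relevant code
    "writing code": 35,             # faster: the AI drafts some of it
    "testing and debugging": 30,    # unchanged
    "prompting the AI": 20,         # new work created by the tool
    "reviewing AI output": 30,      # new work created by the tool
}

before, after = sum(without_ai.values()), sum(with_ai.values())
print(f"Without AI: {before} min, with AI: {after} min "
      f"({(after - before) / before:+.0%} change)")
# Without AI: 120 min, with AI: 130 min (+8% change)
```

Every individual line item either improves or stays flat, yet the task as a whole takes longer – roughly the accounting the paper describes.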
What I like about these stories is that they show how good we are at convincing ourselves. In the first, because LLMs are very good at appearing human-like, it is easy to convince ourselves that they have human-like understanding and judgment. That is not the case, and we should not expect it – the machine does not “experience” the world in the way that we do, so why should it have those attributes? But once you accept that, and remember that these tools are just part of a wider toolkit, you can start to work out where best to deploy them. In the second, even the experienced developers thought they had saved time after completing the tasks nearly 20% more slowly! And yes, they really were saving time in some areas, but the very same time-saving technology had created more work elsewhere. A great reminder that we are just as prone to “hallucinations” as the machines.
But it is good to see that the right tool, used for the right job or the right part of it, does save time – something I have found to be the case with the less trendy but very effective document automation tools, as well as some of the AI title diligence tools. If you want them to replace you, you are going to be out of luck; but if you want them to save you time at the outset of a transaction, or to assist at specific stages, they can be invaluable.


