    Seven replies to the viral Apple reasoning paper and why they fall short (garymarcus.substack.com)
    320 points by spwestwood - 23 hours ago

  • I'm glad to read articles like this one, because I think it is important that we pour some cold water on the hype cycle.

    If we want to get serious about using these new AI tools, then we need to come out of the clouds and get real about their capabilities.

    Are they impressive? Sure. Useful? Yes, probably in a lot of cases.

    But we cannot continue the hype this way, it doesn't serve anyone except the people who are financially invested in these tools.

    by bluefirebrand - 22 hours ago
  • Why do we keep posting stuff from Gary? He's been wrong for decades but he keeps writing this stuff.

    As far as I can tell he's the person that people reach for when they want to justify their beliefs. But surely being this wrong for this long should eventually lead to losing one's status as an expert.

    by hiddencost - 22 hours ago
  • In case anyone else missed the original paper (and discussion):

    https://news.ycombinator.com/item?id=44203562

    by hrldcpr - 22 hours ago
  • This doesn't rebut anything from the best critique of the Apple paper.

    https://arxiv.org/abs/2506.09250

    by avsteele - 22 hours ago
  • The quote from the Salesforce paper is important: “agents displayed near-zero confidentiality awareness”.
    by skywhopper - 22 hours ago
  • This doesn’t address the primary issue: they had no methodology for choosing puzzles that weren’t in the training set. They claimed to have chosen puzzles that aren’t, but they didn’t explain why they believe that. The whole point of the paper was to test LLM reasoning on untrained cases, yet there’s no reason to expect such puzzles not to be part of the training set, and if you have no way of telling whether they are, the paper isn’t going to work out.
    by bowsamic - 22 hours ago
  • AI hype-bros like to complain that real AI experts are more concerned with debunking current AI than with improving it - but the truth is that debunking bad AI IS improving AI. Science is a process of trial and error which only works by continuously questioning the current state.
    by mentalgear - 21 hours ago
  • The key insight is that LLMs can 'reason' when they've seen similar solutions in training data, but this breaks down on truly novel problems. This isn't reasoning exactly, but close enough to be useful in many circumstances. Repeating solutions on demand can be handy, just like repeating facts on demand is handy. Marcus gets this right technically but focuses too much on emotional arguments rather than clear explanation.
    by labrador - 21 hours ago
  • Most of the objections and their counterarguments seem like either poor objections (e.g. ad hominem against the first listed author) or seem to be subsumed under point 5. It’s annoying that the post spends so much effort on the other objections when the important discussion is the one to be had in point 5:

    I.e. to what extent are LLMs able to reliably make use of writing code or using logic systems, and to what extent does hallucinating / providing faulty answers in the absence of such tool access demonstrate an inability to truly reason (I’d expect a smart human to just say “that’s too much” or “that’s beyond my abilities” rather than do a best effort faulty answer)?

    by ummonk - 21 hours ago
  • Good article offering some critique of Apple's paper and of Gary Marcus specifically.

    https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-gen...

    by wohoef - 21 hours ago
  • In classic ML, you never evaluate against data that was in the training set. In LLMs, everything is in the training set. Doesn't this seem wrong?
    by brcmthrowaway - 20 hours ago
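
    A minimal sketch of the held-out evaluation described above, on synthetic data with scikit-learn (the dataset and model choice here are purely illustrative):

      # Classic ML practice: hold out data the model never sees during training,
      # then report accuracy only on that held-out split.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=0)

      model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      print("held-out accuracy:", model.score(X_test, y_test))

      # For an LLM trained on a web-scale corpus there is no comparable
      # guarantee that a benchmark puzzle was absent from the training data.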
  • > 1. Humans have trouble with complex problems and memory demands. True! But incomplete. We have every right to expect machines to do things we can’t. [...] If we want to get to AGI, we will have to do better.

    I don't get this argument. The paper is about "whether RLLMs can think". If we grant "humans make these mistakes too", but also "we still require this ability in our definition of thinking", aren't we saying "thinking in humans is an illusion" too?

    by thomasahle - 20 hours ago
  • > 5. A student might complain about a math exam requiring integration or differentiation by hand, even though math software can produce the correct answer instantly. The teacher’s goal in assigning the problem, though, isn’t finding the answer to that question (presumably the teacher already know the answer), but to assess the student’s conceptual understanding. Do LLM’s conceptually understand Hanoi? That’s what the Apple team was getting at. (Can LLMs download the right code? Sure. But downloading code without conceptual understanding is of less help in the case of new problems, dynamically changing environments, and so on.)

    Why is he talking about "downloading" code? The LLMs can easily "write" out the code themselves.

    If the student wrote a software program for general differentiation during the exam, they obviously would have a great conceptual understanding.

    by thomasahle - 20 hours ago
  • The last paragraph:

    >Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.

    by baxtr - 19 hours ago
  • We built planes—critics said they weren't birds. We built submarines—critics said they weren't fish. Progress moves forward regardless.

    You have a choice: master these transformative tools and harness their potential, or risk being left behind by those who do.

    Pro tip: Endless negativity from the same voices won't help you adapt to what's coming—learning will.

    by starchild3001 - 19 hours ago
  • > Puzzles a child can do

    Certainly, I couldn't solve the Towers of Hanoi with 8 disks purely in my head without being able to write down the state at every step or having the physical puzzle in front of me. Are we comparing apples to apples?

    by neoden - 18 hours ago
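
    For context, the complete solution is a few lines of recursion, but the 8-disk case already takes 2**8 - 1 = 255 moves, which is what makes reciting it from memory hard (a standard sketch, not tied to the paper's setup):

      def hanoi(n, source, target, spare, moves):
          """Append the moves that transfer n disks from source to target."""
          if n == 0:
              return
          hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks
          moves.append((source, target))              # move the largest disk
          hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks

      moves = []
      hanoi(8, "A", "C", "B", moves)
      print(len(moves))  # 255 moves for 8 disks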
  • I find it weird that people are taking the original paper to be some kind of indictment against LLMs. LLMs failing at the Tower of Hanoi problem at higher disk counts isn't new; the paper used an existing method that had been applied before.

    It was simply comparing the effectiveness of reasoning and non-reasoning models on the same problem.

    by Illniyar - 17 hours ago
  • I think the Apple paper is practically a hack job - the problem was set up in such a way that the reasoning models must do all of their reasoning before outputting any of their results. Imagine a human trying to solve something this way: you’d have to either memorize the entire answer before speaking or come up with a simple pattern you could do while reciting that takes significantly less brainpower - and past a certain size/complexity, it would be impossible.

    And this isn’t how LLMs are used in practice! Actual agents do a thinking/reasoning cycle after each tool-use call. And I guarantee even these 6-month-old models could do significantly better if a researcher followed best practices.

    by jes5199 - 16 hours ago
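
    A rough sketch of the tool-use loop the comment refers to, where the model gets to re-reason after every tool result; call_model and run_tool are hypothetical placeholders rather than any particular vendor's API:

      def agent_loop(task, call_model, run_tool, max_steps=20):
          """Alternate reasoning and tool use instead of emitting one giant answer."""
          transcript = [{"role": "user", "content": task}]
          for _ in range(max_steps):
              step = call_model(transcript)  # model reasons over everything so far
              if step.get("final_answer"):
                  return step["final_answer"]
              result = run_tool(step["tool"], step["args"])  # e.g. run code, check a move
              transcript.append({"role": "tool", "content": result})  # feed result back
          return None  # gave up after max_steps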
  • It's easy to check if a black-box AI can reason: give it a checkerboard pattern, or something more complex, and see if it can come up with a compact formula that generates the pattern. You can't bullshit your way through this problem, and it's easy to verify the answer, yet none of these so-called researchers attempt to do this.
    by akomtu - 15 hours ago
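
    The "compact formula" for a plain checkerboard is a one-liner, which is what makes the proposed test easy to verify (a sketch; the grid size is arbitrary):

      def checkerboard(rows, cols):
          """A cell is filled exactly when the sum of its coordinates is odd."""
          return [[(r + c) % 2 for c in range(cols)] for r in range(rows)]

      for row in checkerboard(4, 4):
          print("".join("#" if cell else "." for cell in row))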
  • I'm shorting Apple.
    by revskill - 15 hours ago
  • The only real point is number 5.

    > Huge vindication for what I have been saying all along: we need AI that integrates both neural networks and symbolic algorithms and representations

    This is basically agents which is literally what everyone has been talking about for the past year lol.

    > (Importantly, the point of the Apple paper goal was to see how LRM’s unaided explore a space of solutions via reasoning and backtracking, not see how well it could use preexisting code retrieved from the web.

    This is a false dichotomy. The thing that Apple tested was dumb, and downloading code from the internet is also dumb. What would've been interesting is whether, given the problem, a reasoning agent could solve it with access to a coding environment.

    > Do LLM’s conceptually understand Hanoi?

    Yes, and the paper didn't test for this. The paper basically tested the equivalent of whether a human can do Hanoi in their head.

    I feel like what the author is advocating for is basically a neural net that can send instructions to an ALU/CPU, but I haven't seen anything promising that shows it's better than just giving an agent access to a terminal.

    by hellojimbo - 12 hours ago
  • > just as humans shouldn’t serve as calculators

    But they definitely could, and were [0]. You just employ several and cross-check, with each one also able to double-check and correct errors.

    LLMs cannot double-check, and using several won't really help (I suspect ultimately for the same reason: exponential multiplication of errors [1]).

    [0] https://en.wikipedia.org/wiki/Computer_(occupation)

    [1] https://www.tobyord.com/writing/half-life

    by Dzugaru - 8 hours ago
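
    A quick illustration of the error compounding the comment points to, under the simplifying assumption of a fixed, independent per-step accuracy:

      # If each step is correct with probability p and errors are never caught,
      # the chance the whole chain is correct decays exponentially with length.
      for p in (0.99, 0.95):
          for steps in (10, 50, 100):
              print(f"p={p}, steps={steps}: P(all correct) = {p ** steps:.3f}")
      # e.g. p=0.99 over 100 steps gives ~0.366; p=0.95 over 100 steps gives ~0.006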
  • To summarise: we spent billions to make intelligent machines and when they're asked to solve toy problems all we get is excuses.
    by YeGoblynQueenne - 3 hours ago
  • > We have every right to expect machines to do things we can’t.

    Not really, this makes little sense in general, but also when it comes to this specific type of machine. In general: you can have a machine that is worse than humans at everything it does and yet still be immensely valuable because it's very cheap.

    In this specific case:

    > AGI should be a step forward

    Nope, read the definition. Matching human-level intelligence, warts and all, will by definition count as reaching AGI.

    > in many cases LLMs are a step backwards

    That's ok, use them in cases where it's a step forward, what's the big deal?

    > note the bait and switch from “we’re going to build AGI that can revolutionize the world” to “give us some credit, our systems make errors and humans do, too”.

    Ah, well, again, not really; the author just has an unrealistic model of the minimum requirements for a revolution.

    by eviks - 3 hours ago
