This week Anthropic finally dropped Fable 5, the first model in its new Mythos-class tier. The pitch was clear: this is the most intelligent generally available Claude model, sitting above Opus 4.8 on the capability ladder.
The hype machine kicked in fast. The official Claude account posted on X that Fable 5’s “capabilities exceed those of any model we’ve ever made generally available.” Andrej Karpathy, the former OpenAI co-founder who joined Anthropic last month, called the release “a major-version-bump-deserving step change forward.” Matt Shumer, founder of OthersideAI and HyperWrite, posted a one-shot Minecraft clone built in custom Three.js and declared on X that “Fable has solved 3D worldbuilding… utterly insane.”
They converged on nearly everything, including the answers.
But there was also backlash. Fable 5 costs $10 per million input tokens and $50 per million output tokens, exactly double Opus 4.8. It ships with safety classifiers that route cybersecurity, biology, and chemistry prompts to the less capable Opus 4.8 instead.
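To put those list prices in concrete terms, here is a minimal sketch of the per-request arithmetic. The Fable 5 rates are the ones quoted above. The Opus 4.8 rates are an assumption derived from the “exactly double” claim rather than from a price sheet, and the token counts are made-up illustrative values, not measurements from my tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of one request at these rates."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


FABLE_5 = Pricing(input_per_million=10.0, output_per_million=50.0)
# Assumption: half of Fable 5's rates, inferred from "exactly double".
OPUS_4_8 = Pricing(input_per_million=5.0, output_per_million=25.0)

# Hypothetical request size, chosen only to make the arithmetic visible.
input_tokens, output_tokens = 200_000, 20_000

fable_cost = FABLE_5.cost(input_tokens, output_tokens)
opus_cost = OPUS_4_8.cost(input_tokens, output_tokens)
print(f"Fable 5: ${fable_cost:.2f}  Opus 4.8: ${opus_cost:.2f}")
```

Because both rates double, the ratio stays at 2x no matter how a request splits between input and output tokens. Real sessions can still land above or below 2x, since the two models won’t necessarily consume the same number of tokens on the same task.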
Researchers found a disclosure buried in the model’s 319-page system card. Fable would quietly degrade its own responses on frontier AI research tasks without telling the user. People were not happy about this, so Anthropic walked that policy back within a day. Even Karpathy, in the same post praising the release, conceded the safeguards were tuned too trigger-happy at launch.
With strong media machines on both sides, where does that leave Fable 5? I had to find out for myself by testing the model against Opus 4.8 (which was the crowd favorite last week). I ran two tests; one was pure reasoning, and the other was hands-on coding. I used the same prompts for both models. I’ll paste the prompts below in case you want to rerun my tests and see what results your machine produces.
The two tests
Test one was a reasoning task. I pointed both models at pandas issue #32265: the np.nan vs. pd.NA debate. Pandas has two ways of representing “this value isn’t here,” and core maintainers have been arguing since 2020 about whether they should be distinct concepts or a single concept. The issue has more than 150 comments, a trail of downstream bug reports, and still no resolution. The prompt asked each model to read the full thread, summarize the disagreement, catalog the damage, and commit to an actual recommendation.
Test two was a coding task. I cloned jsonpickle into two separate directories, one for Fable 5 and one for Opus 4.8. Jsonpickle is a 16-year-old Python serialization library that pulls roughly 20 million downloads a month. Each model got the same prompt: Read the full codebase, identify legacy code and security concerns, produce a ranked modernization plan, implement the highest-impact and lowest-risk changes, and verify nothing broke.
Same answer, sharper framing
Both models did something I didn’t expect. They identified three camps in the debate (only two were obvious). Both models also traced how those positions shifted over six years rather than treating the argument as a snapshot. Then both models independently landed on the same recommendation: keep NaN representable, treat it as missing by default, and offer a keyword opt-out.
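The shape of that recommendation can be sketched in plain Python. This is an illustration of the idea only: `NA`, `is_missing`, and the `nan_as_missing` keyword are hypothetical names invented for this example, not pandas API and not the wording of any proposal in the thread.

```python
import math


class _NAType:
    """Hypothetical stand-in for a dedicated 'missing value' sentinel."""

    def __repr__(self) -> str:
        return "NA"


NA = _NAType()


def is_missing(value: object, *, nan_as_missing: bool = True) -> bool:
    """Report whether a value counts as missing.

    NaN stays representable as an ordinary float. It is treated as
    missing by default, and the keyword lets callers opt out.
    """
    if value is NA:
        return True
    if isinstance(value, float) and math.isnan(value):
        return nan_as_missing
    return False


values = [1.0, float("nan"), NA]
default_view = [is_missing(v) for v in values]
strict_view = [is_missing(v, nan_as_missing=False) for v in values]
print(default_view)  # [False, True, True]
print(strict_view)   # [False, False, True]
```

The default keeps existing “NaN means missing” code working, while the opt-out serves callers who need NaN to remain a distinct floating-point result.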
The differences were in the framing. Opus split the debate into two separable questions: whether NaN and NA are different concepts, and whether `isna` treats NaN as missing. It argued that the thread conflated the two. Opus delivered correct information in a simple, straightforward way.
The comments on the issue show that by 2024, nearly everyone agreed on the destination but nobody converted that agreement into a vote and a merged keyword. Fable 5 went deeper into the history and produced the sharper diagnosis of why nothing ever shipped, calling it “consensus without ratification.” It also caught a detail Opus missed. Maintainers were freezing even uncontroversial bug fixes out of fear they’d have to be rolled back. The indecision blocked work that was valid under any resolution.
The costs for both models were similar, as I would expect on a task this small. Fable 5 cost $2.55 with 4 minutes 22 seconds of API time. Opus 4.8 cost $2.18 with 5 minutes 44 seconds of API time. Fable 5 cost about 17% more, and though the difference was small in absolute terms here, I can see how it would add up at scale.
Modernizing a 16-year-old library
Both models took the same disciplined approach. Both established a green baseline of all 348 passing tests before touching anything. They found the same two standout bugs: a custom `ClassNotFoundError` that inherited from `BaseException` rather than `Exception`, making it invisible to standard error handling, and an import crash in an extension module. Both verified their fixes with behavioral tests beyond just rerunning the suite. I independently confirmed the `ClassNotFoundError` fix worked on my machine. They also recommended a proper deprecation cycle rather than just deleting a long-unused compatibility module.
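The base class matters because a plain `except Exception` handler does not catch anything that derives directly from `BaseException`. Here is a minimal sketch using two hypothetical stand-in classes (not jsonpickle’s actual source) to show the difference.

```python
class BrokenNotFoundError(BaseException):
    """Stand-in that mimics the bug: derives from BaseException."""


class FixedNotFoundError(Exception):
    """Stand-in that mimics the fix: derives from Exception."""


def caught_by_standard_handler(exc_type: type) -> bool:
    """Return True if a plain `except Exception` handler catches exc_type."""
    try:
        try:
            raise exc_type("class not found")
        except Exception:
            # The handler most application code actually writes.
            return True
    except BaseException:
        # Only reached when the inner handler let the error through.
        return False


broken_caught = caught_by_standard_handler(BrokenNotFoundError)
fixed_caught = caught_by_standard_handler(FixedNotFoundError)
print(broken_caught, fixed_caught)  # False True
```

Python reserves direct `BaseException` subclasses for things like `KeyboardInterrupt` and `SystemExit`, which is why a library error placed there slips past ordinary error handling.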
The results weren’t identical, though. They diverged at the margins. Opus implemented one fix Fable 5 deprioritized: removing a dead Django backend entry. Their diffs also showed different instincts about what counted as low-risk. Fable 5 leaned toward deletion (7 lines of code added, 31 removed), Opus toward addition (14 lines of code added, 5 removed).
The costs started to differ on this test. Fable 5 cost $12.19 with about 12 minutes of API time. Opus 4.8 cost $5.80 with about 13 minutes of API time.
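For anyone who wants the ratios behind those numbers, here is a quick sketch that works them out from the figures I reported for both tests. The coding-task API times are the rounded “about 12” and “about 13” minutes, so that time ratio is approximate.

```python
# (cost in USD, API time in seconds) per model, per test.
runs = {
    "reasoning": {
        "fable": (2.55, 4 * 60 + 22),
        "opus": (2.18, 5 * 60 + 44),
    },
    "coding": {
        "fable": (12.19, 12 * 60),  # "about 12 minutes"
        "opus": (5.80, 13 * 60),    # "about 13 minutes"
    },
}

ratios = {}
for task, result in runs.items():
    fable_cost, fable_secs = result["fable"]
    opus_cost, opus_secs = result["opus"]
    ratios[task] = {
        "cost": fable_cost / opus_cost,
        "time": fable_secs / opus_secs,
    }
    print(
        f"{task}: Fable 5 cost {ratios[task]['cost']:.2f}x Opus 4.8, "
        f"API time {ratios[task]['time']:.2f}x"
    )
```

On both tests Fable 5 finished in less API time, but the cost gap widened from roughly 1.2x on the reasoning task to roughly 2.1x on the coding task.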
Here’s one more thing I felt was worth noting. Partway into the task, Fable 5 hit one of its own safety classifiers, and Claude Code automatically switched my session to Opus 4.8. I had to update my settings to remove some of the safeguards, but by then some of the Fable work had actually been done by Opus. I can’t say exactly which pieces came from which model, but Fable completed 85% of the coding test.
Smaller gap than the hype
Wait, what? How could they both follow similar logic and produce similar results? Here’s my straight speculation. Some of this convergence is explainable. AI produces results through pattern matching, not critical thinking. Both models followed the same prompt to read the codebase first, run the tests, and explain every decision. Fable 5 and Opus 4.8 are siblings.
They were built by the same company with the same training philosophy and likely overlapping data. Shared engineering instincts are expected. The codebase matters, too. `jsonpickle`’s core is small enough that any thorough audit will surface the same short list of high-impact bugs.
I’m aware that I ran two tests, not two hundred, and I can only judge based on the two tests in this blog. But based on what I saw here, the gap between Fable 5 and Opus 4.8 is smaller than the launch-day hype suggests. Fable’s analysis was sharper, but only by a small margin. It was more precise in its diagnosis and slightly more thorough on history. Opus produced equally correct results with cleaner structure and, on the coding task, at less than half the price.
For a solo developer doing occasional deep analysis or codebase work, Opus delivers most of the value at a fraction of the cost (not to mention that Fable 5 won’t be available in subscriptions). I’ll hypothesize that Fable’s edge becomes meaningful at scale, where wall-time savings compound, and on problems where the last few percent of analytical precision is required.
The post Fable 5 vs Opus 4.8: The real stakes, not the spec sheet appeared first on The New Stack.