
What you may have missed about GPT-5


Before OpenAI released GPT-5 last Thursday, CEO Sam Altman said its capabilities made him feel “useless relative to the AI.” He said working on it carries a weight he imagines the developers of the atom bomb must have felt.

As tech giants converge on models that do more or less the same thing, OpenAI’s new offering was supposed to give a glimpse of AI’s newest frontier. It was meant to mark a leap toward the “artificial general intelligence” that tech’s evangelists have promised will transform humanity for the better. 

Against those expectations, the model has mostly underwhelmed. 

People have highlighted glaring mistakes in GPT-5’s responses, countering Altman’s claim at the launch that it works like “a legitimate PhD-level expert in anything any area you need on demand.” Early testers have also found problems with OpenAI’s promise that GPT-5 automatically works out which type of AI model is best suited to your question: a reasoning model for more complicated queries, or a faster model for simpler ones. Altman seems to have conceded that this feature is flawed and takes away user control. There is good news, too: the model seems to have eased the problem of ChatGPT sucking up to users, with GPT-5 less likely to shower them with over-the-top compliments.
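The automatic model selection OpenAI describes is simple to state but hard to get right. As a rough illustration only (not OpenAI’s actual router), a dispatcher that picks between a fast model and a reasoning model from a crude heuristic might look like the sketch below; the heuristic and the model names are assumptions.

```python
# Illustrative sketch of routing between a fast model and a reasoning model.
# The heuristic and the model names are assumptions for illustration;
# OpenAI has not published how GPT-5's real router decides.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def looks_complex(question: str) -> bool:
    """Crude stand-in for a router: flag questions that hint at multi-step work."""
    signals = ("prove", "step by step", "debug", "analyze", "compare")
    return len(question) > 400 or any(s in question.lower() for s in signals)

def answer(question: str) -> str:
    # Hypothetical model pair; substitute whichever fast/reasoning models you use.
    model = "gpt-5" if looks_complex(question) else "gpt-5-mini"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

print(answer("What is the capital of France?"))
```

A production router presumably uses a learned classifier rather than keyword matching, which is part of why early users found its choices hard to predict.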

Overall, as my colleague Grace Huckins pointed out, the new release represents more of a product update—providing slicker and prettier ways of conversing with ChatGPT—than a breakthrough that reshapes what is possible in AI. 

But there’s one other thing to take from all this. For a while, AI companies didn’t make much effort to suggest how their models might be used. Instead, the plan was to simply build the smartest model possible—a brain of sorts—and trust that it would be good at lots of things. Writing poetry would come as naturally as organic chemistry. Getting there would be accomplished by bigger models, better training techniques, and technical breakthroughs. 

That has been changing: The play now is to push existing models into more places by hyping up specific applications. Companies have been more aggressive in their promises that their AI models can replace human coders, for example (even if the early evidence suggests otherwise). A possible explanation for this pivot is that tech giants simply have not made the breakthroughs they’ve expected. We might be stuck with only marginal improvements in large language models’ capabilities for the time being. That leaves AI companies with one option: Work with what you’ve got.

The starkest example of this in the launch of GPT-5 is how much OpenAI is encouraging people to use it for health advice, one of AI’s most fraught arenas. 

In the beginning, OpenAI mostly didn’t play ball with medical questions. If you tried to ask ChatGPT about your health, it gave lots of disclaimers warning you that it was not a doctor, and for some questions it would refuse to give a response at all. But as I recently reported, those disclaimers began disappearing as OpenAI released new models. Its models will now not only interpret X-rays and mammograms for you but also ask follow-up questions that lead toward a diagnosis.

In May, OpenAI signaled it would try to tackle medical questions head on. It announced HealthBench, a way to evaluate how good AI systems are at handling health topics as measured against the opinions of physicians. In July, it published a study it participated in, reporting that a cohort of doctors in Kenya made fewer diagnostic mistakes when they were helped by an AI model. 
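HealthBench’s published design grades free-text answers against rubric criteria written by physicians. The sketch below is a toy version of that style of rubric scoring; the criteria, weights, and keyword check are invented for illustration, and the real benchmark uses a model-based grader rather than string matching.

```python
# Toy illustration of rubric-based grading in the spirit of HealthBench:
# each criterion carries a weight, and a response's score is the share of
# weighted criteria it satisfies. Criteria, weights, and the keyword check
# are invented here; the real benchmark uses a model grader.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    weight: int                 # physician-assigned importance
    keywords: tuple             # crude stand-in for the grader's judgment

    def met_by(self, response: str) -> bool:
        return any(k in response.lower() for k in self.keywords)

rubric = [
    Criterion("Advises seeing a clinician for persistent symptoms", 5, ("see a doctor", "clinician")),
    Criterion("Avoids giving a definitive diagnosis", 3, ("could be", "might be")),
    Criterion("Asks a clarifying follow-up question", 2, ("how long", "any other symptoms")),
]

def score(response: str) -> float:
    earned = sum(c.weight for c in rubric if c.met_by(response))
    return earned / sum(c.weight for c in rubric)

print(score("That could be several things; how long has it lasted? Please see a doctor."))
```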

With the launch of GPT-5, OpenAI has begun explicitly telling people to use its models for health advice. At the launch event, Altman welcomed on stage Felipe Millon, an OpenAI employee, and his wife, Carolina Millon, who had recently been diagnosed with multiple forms of cancer. Carolina spoke about asking ChatGPT for help with her diagnoses, saying that she had uploaded copies of her biopsy results to ChatGPT to translate medical jargon and asked the AI for help making decisions about things like whether or not to pursue radiation. The trio called it an empowering example of shrinking the knowledge gap between doctors and patients.

With this change in approach, OpenAI is wading into dangerous waters. 

For one, it’s using evidence that doctors can benefit from AI as a clinical tool, as in the Kenya study, to suggest that people without any medical background should ask the AI model for advice about their own health. The problem is that lots of people might ask for this advice without ever running it by a doctor (and are less likely to do so now that the chatbot rarely prompts them to).

Indeed, two days before the launch of GPT-5, the Annals of Internal Medicine published a paper about a man who stopped eating salt and began ingesting dangerous amounts of bromide following a conversation with ChatGPT. He developed bromide poisoning—which largely disappeared in the US after the Food and Drug Administration began curbing the use of bromide in over-the-counter medications in the 1970s—and then nearly died, spending weeks in the hospital. 

So what’s the point of all this? Essentially, it’s about accountability. When AI companies move from promising general intelligence to offering humanlike helpfulness in a specific field like health care, they raise a second, still-unanswered question: What happens when mistakes are made? As things stand, there’s little indication that tech companies will be held liable for the harm caused.

“When doctors give you harmful medical advice due to error or prejudicial bias, you can sue them for malpractice and get recompense,” says Damien Williams, an assistant professor of data science and philosophy at the University of North Carolina Charlotte. 

“When ChatGPT gives you harmful medical advice because it’s been trained on prejudicial data, or because ‘hallucinations’ are inherent in the operations of the system, what’s your recourse?”

This story originally appeared in The Algorithm, our weekly newsletter on AI. To get stories like this in your inbox first, sign up here.


Claude Opus 4.1


Today we're releasing Claude Opus 4.1, an upgrade to Claude Opus 4 on agentic tasks, real-world coding, and reasoning. We plan to release substantially larger improvements to our models in the coming weeks.

Opus 4.1 is now available to paid Claude users and in Claude Code. It's also on our API, Amazon Bedrock, and Google Cloud's Vertex AI. Pricing is the same as Opus 4.

Claude Opus 4.1

Opus 4.1 advances our state-of-the-art coding performance to 74.5% on SWE-bench Verified. It also improves Claude’s in-depth research and data analysis skills, especially around detail tracking and agentic search.

[Chart: Claude's progress on a popular coding evaluation]

GitHub notes that Claude Opus 4.1 improves across most capabilities relative to Opus 4, with particularly notable performance gains in multi-file code refactoring. Rakuten Group finds that Opus 4.1 excels at pinpointing exact corrections within large codebases without making unnecessary adjustments or introducing bugs, with their team preferring this precision for everyday debugging tasks. Windsurf reports Opus 4.1 delivers a one standard deviation improvement over Opus 4 on their junior developer benchmark, showing roughly the same performance leap as the jump from Sonnet 3.7 to Sonnet 4.

[Table: benchmark comparison of Claude Opus 4.1 with prior Claude models and other public models]

Getting started

We recommend upgrading from Opus 4 to Opus 4.1 for all uses. If you’re a developer, simply use claude-opus-4-1-20250805 via the API. You can also explore our system card, model page, pricing page, and docs to learn more.
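A minimal call with that model ID, using the Anthropic Python SDK; the prompt and token limit below are placeholders.

```python
# Minimal sketch: calling Claude Opus 4.1 through the Anthropic Python SDK.
# Requires `pip install anthropic` and an ANTHROPIC_API_KEY in the environment;
# the prompt and output budget are placeholders.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-1-20250805",  # model ID from this announcement
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize what changed between Opus 4 and Opus 4.1."}
    ],
)
print(message.content[0].text)
```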

As always, your feedback helps us improve, especially as we continue to release new and more capable models.

Appendix

Data sources

Benchmark reporting

Claude models are hybrid reasoning models. The benchmarks reported in this blog post show the highest scores achieved with or without extended thinking. We’ve noted below for each result whether extended thinking was used:

  • No extended thinking: SWE-bench Verified, Terminal-Bench
  • The following benchmarks were reported with extended thinking (up to 64K tokens): TAU-bench, GPQA Diamond, MMMLU, MMMU, AIME
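For the extended-thinking results noted above, the Messages API takes an explicit thinking budget. A hedged sketch of such a request follows; the prompt is a placeholder, and the budget shown is smaller than the 64K used in the benchmark runs so that it fits under a single request's output limit.

```python
# Sketch: requesting extended thinking via the Messages API. The benchmark runs
# above used thinking budgets of up to 64K tokens; this example uses a smaller
# budget so that it stays below max_tokens.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-1-20250805",
    max_tokens=20000,
    thinking={"type": "enabled", "budget_tokens": 16000},  # must be less than max_tokens
    messages=[{"role": "user", "content": "How many primes are there below 200?"}],
)
# The reply interleaves "thinking" blocks with the final "text" blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```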

TAU-bench methodology

Scores were achieved with a prompt addendum to both the Airline and Retail Agent Policies instructing Claude to better leverage its reasoning abilities while using extended thinking with tool use. The model is encouraged to write down its thoughts as it solves the problem (distinct from our usual thinking mode) during the multi-turn trajectories, to make the best use of its reasoning abilities. To accommodate the additional steps Claude takes when it thinks more, the maximum number of steps (counted by model completions) was increased from 30 to 100; most trajectories completed in under 30 steps, and only one exceeded 50.
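Stripped of the prompt details, the harness change amounts to raising a cap on model completions in a multi-turn agent loop. A simplified sketch, with hypothetical call_model and run_tools stand-ins for the real evaluation plumbing:

```python
# Simplified sketch of a multi-turn agent loop with a cap on model completions,
# mirroring the 30 -> 100 step change described above. call_model and run_tools
# are hypothetical stand-ins for the real evaluation harness.
MAX_STEPS = 100  # raised from 30 to leave room for the extra thinking turns

def run_trajectory(task, call_model, run_tools):
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):        # each iteration is one model completion
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply.text})
        if reply.tool_calls:          # the agent wants to act in the environment
            history.extend(run_tools(reply.tool_calls))
        else:                         # no tool call: the agent considers itself done
            return history
    return history                    # hit the step cap without finishing
```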

SWE-bench methodology

For the Claude 4 family of models, we continue to use the same simple scaffold that equips the model with solely the two tools described in our prior releases here—a bash tool, and a file editing tool that operates via string replacements. We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. On all Claude 4 models, we report scores out of the full 500 problems. Scores for OpenAI models are reported out of a 477 problem subset.
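As a rough picture of that two-tool scaffold, here is how a bash tool and a string-replacement editor could be declared as custom tool schemas for the Messages API; the names and schemas are illustrative assumptions, not Anthropic's exact scaffold.

```python
# Illustrative declarations of the two tools the scaffold description mentions:
# a bash tool and a string-replacement file editor. Names and JSON schemas are
# assumptions for illustration.
tools = [
    {
        "name": "bash",
        "description": "Run a shell command in the repository and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "edit_file",
        "description": "Replace an exact string in a file with a new string.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_str": {"type": "string"},
                "new_str": {"type": "string"},
            },
            "required": ["path", "old_str", "new_str"],
        },
    },
]
```

In a real harness, these declarations are passed via the tools parameter of the Messages API, with a loop that executes each tool call and feeds the result back to the model.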


AI Is Forcing the Return of the In-Person Job Interview

Google, Cisco, and McKinsey have reintroduced in-person interviews to combat AI-assisted cheating in virtual technical assessments. Coda Search/Staffing reports that client requests for face-to-face meetings have surged to 30% this year, up from 5% in 2024. A Gartner survey of 3,000 job seekers found that 6% admitted to interview fraud, including having someone else stand in for them, while the FBI has warned of thousands of North Korean nationals using false identities to secure remote positions at U.S. technology companies. Google CEO Sundar Pichai confirmed in June that the company now requires at least one in-person interview round for certain roles to verify that candidates possess genuine coding skills.

Read more of this story at Slashdot.


Anthropic just made its latest move in the AI coding wars


The AI coding wars are heating up. One of the main battlegrounds? "Context windows," or an AI model's working memory: the amount of text it can take into account when it's coming up with an answer. On that front, Anthropic just gained some ground. Today, the AI startup announced a 5x increase in its context window as it races to compete with OpenAI, Google, and other major players.

Context windows are measured in tokens, and Anthropic's new context window for Claude Sonnet 4, one of its most powerful AI models, can handle 1 million tokens. For reference, Anthropic has said in the past that a 500k context window can handle about 100 half-ho …
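Whether a given repository or document pile actually fits in that window is easy to check up front with the SDK's token-counting endpoint. A small sketch follows; the model snapshot ID and the input file are assumptions.

```python
# Sketch: checking whether a large input fits in a 1-million-token window by
# counting tokens before sending. The model snapshot ID and the input file
# are assumptions for illustration.
import anthropic

client = anthropic.Anthropic()
CONTEXT_WINDOW = 1_000_000  # Claude Sonnet 4's expanded window, per the story

with open("big_codebase_dump.txt") as f:
    text = f.read()

count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": text}],
)
print(f"{count.input_tokens} tokens; fits: {count.input_tokens <= CONTEXT_WINDOW}")
```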

Read the full story at The Verge.


Microsoft is Trying To Poach Meta AI Talent and Offering Multimillion-Dollar Pay Packages

Microsoft has compiled a spreadsheet of Meta AI employees by name, location and position as part of an aggressive recruiting push to sustain its AI-driven march toward a $4 trillion market valuation, according to internal documents viewed by Business Insider. The company created a "critical AI talent" designation enabling top offers within 24 hours and mandated matching Meta's compensation packages, which OpenAI CEO Sam Altman says reach $100 million signing bonuses and recently hit $250 million total packages. Microsoft AI under Mustafa Suleyman and CoreAI under ex-Meta engineering boss Jay Parikh have deployed special recruiting teams making multimillion-dollar offers with multimillion-dollar on-hire bonuses, while the company maintains flat headcount after cutting thousands of employees this year.

Read more of this story at Slashdot.


Threads now has more than 400 million monthly active users

X, meanwhile, has north of 600 million monthly active users, according to previous statements made by its former CEO, Linda Yaccarino.