Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Meet the Finalists: JetBrains x Codex Hackathon


Put a capable coding model inside a developer’s primary workspace, and the IDE stops being a place where you write code. It becomes a place where you direct an agent, watch how it reasons, manage what it pays attention to, and decide when its output is worth shipping. That was the defining theme of the inaugural JetBrains x Codex Hackathon: across roughly 40 submissions over a single weekend, teams explored what it actually means to build with AI natively inside the IDE – not bolted on top of it. The six finalists came up with some of the most compelling answers.

🥇 First Place: hyperreasoning – Aditya Mangalampalli

Most coding agents call the model once and hope for the best. As Aditya puts it: “LLMs spend a lot of time thinking in circles.” Hyperreasoning replaces the single shot with something closer to a search: the system drafts several possible approaches to a task, then a learned controller decides which to expand, which to cut, and which to verify against tests. Compiler errors and failing tests feed back into how the controller weighs its options.
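The controller's loop can be sketched as a small best-first search. Everything below is illustrative: `draft`, `expand`, and `verify` are stand-ins for the model calls and test runs the real system would make, not hyperreasoning's actual API.

```python
import heapq

def hyperreasoning_search(draft, expand, verify, width=3, max_rounds=4):
    """Best-first search over candidate solutions (illustrative sketch).

    draft()      -> list of initial candidate solutions
    expand(cand) -> list of refined candidates derived from cand
    verify(cand) -> (passed: bool, score: float) from tests/compiler feedback
    """
    # Seed the frontier with several drafts instead of a single shot.
    frontier = []
    for cand in draft():
        passed, score = verify(cand)
        if passed:
            return cand
        heapq.heappush(frontier, (-score, cand))  # max-heap via negated score

    for _ in range(max_rounds):
        if not frontier:
            break
        _, best = heapq.heappop(frontier)   # expand the most promising path
        for cand in expand(best)[:width]:   # cut: keep only `width` refinements
            passed, score = verify(cand)
            if passed:                      # verification gates acceptance
                return cand
            heapq.heappush(frontier, (-score, cand))
    return None
```

The key move is that `verify` feeds scores back into the priority queue, so compiler errors and failing tests directly reshape which branch gets expanded next.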

Inside the IDE, a tool window renders the search live, so you can watch which paths the controller explored before settling on one. The argument the project makes is that a smaller local model wrapped in this kind of verified search loop can hold its own against much larger frontier models at meaningfully lower cost — with the IDE serving as the place where reasoning becomes visible and directable, rather than a black box that returns code.

🥈 Second Place: Scopecreep – Bhavik Sheoran, Kenneth Ross, Roman Javadyan, Joon Im

Hardware bring-up is a tool-juggling exercise: schematic viewer in one window, vendor apps for the oscilloscope and power supply in others, a terminal talking to the device, a spreadsheet collecting results. Scopecreep collapses that into a single JetBrains tool window. Hand it a circuit schematic and an agent works through testing the board – picking signals worth measuring, capturing the readings, and producing a report.

The design choice worth noticing: when the agent decides a probe needs to be placed, the session pauses and shows the engineer exactly where to put it. The engineer places the probe physically and clicks Resume. It’s the right call for real instruments on a real bench: autonomous where a computer can be trusted, human-in-the-loop where the work touches the physical world.

🥉 Third Place: mesh-code – Ayush Ojha, Coco Cao, Kush Ise, AL DRAM

Switch machines mid-task, and your coding agent starts over. mesh-code fixes that by giving agents shared memory of an in-progress project – what’s been tried, what’s been decided, what’s still pending – so a session that begins on one laptop can continue from another, with whichever agent happens to be available. Codex is one of the agents that can plug in.
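A minimal sketch of what that shared memory might look like, assuming a plain JSON file as the transport. The schema here is invented for illustration; mesh-code's actual format and sync mechanism are not described in the post.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ProjectMemory:
    """Shared record of an in-progress task that any agent can resume
    (hypothetical schema)."""
    tried: list = field(default_factory=list)     # approaches already attempted
    decided: list = field(default_factory=list)   # decisions later agents must respect
    pending: list = field(default_factory=list)   # work still to do

    def save(self, path):
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))
```

A session on one laptop appends to `tried`, `decided`, and `pending`; a different agent on another machine calls `load` and picks up where the first left off.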

Latent Signal – Periscope

Long agent sessions accumulate dead weight: tool outputs nobody needs anymore, dead ends, context that was useful ten turns ago and isn’t now. Periscope, built on Wes McKinney’s open-source agentsview, is a JetBrains plugin that shows what’s actually filling up an agent’s working memory turn by turn – and recommends what to do about it, whether that’s continuing, rewinding to a better branching point, compacting, forking, or handing off entirely. It works with Codex and most other coding agents, and everything stays local.
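The turn-by-turn accounting can be approximated in a few lines. The thresholds and turn format below are invented for illustration, not Periscope's actual heuristics:

```python
def recommend(turns, budget=100_000, stale_after=10):
    """Inspect an agent transcript and suggest a context action
    (hypothetical heuristic).

    turns: list of dicts like {"tokens": int, "age": int, "kind": "tool"|"chat"}
    """
    total = sum(t["tokens"] for t in turns)
    # Tool output that has aged past `stale_after` turns is likely dead weight.
    stale = sum(t["tokens"] for t in turns
                if t["kind"] == "tool" and t["age"] > stale_after)
    if total < budget * 0.5:
        return "continue"     # plenty of room left in the window
    if stale > total * 0.3:
        return "compact"      # most of the window is stale tool output
    return "fork"             # window is full of live context: branch or hand off
```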

SecureLoop – Abhiram Sribhashyam, Rahul Marri, Peyton Li

Security incident response is still mostly copy-paste: stack trace into a chat window, repo context explained by hand, a fix written and committed in the hope it’s safe. SecureLoop turns that into a controlled loop inside JetBrains. When something breaks in production, the agent gathers the relevant code, the project’s security rules, and the state of its dependencies, then asks Codex for a structured diagnosis and a proposed fix. That fix runs through automated checks before any pull request opens.

The PR opens automatically. The merge does not. SecureLoop surfaces everything that informed the decision – the diff, the policy it bumped into, the test that proved the patch – inside the IDE for the developer to approve or reject. As the team put it: “Codex fully makes the PR ready for you, and it remains human-in-the-loop where you have to approve or deny.”

The team’s bigger thesis is a security-policy.md file that lives in the repo alongside README.md, spelling out a project’s specific rules for handling secrets, errors, and risky patterns. Coding agents read it before suggesting changes, so the question stops being “what’s a good fix?” and becomes “what’s an acceptable fix under this codebase’s rules?”
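A sketch of how an agent might enforce such a file, assuming a toy `forbid:` rule syntax. The team's actual security-policy.md format is not specified in the post, so both the format and the function names here are hypothetical:

```python
import re

def load_policy(text):
    """Parse rule lines like `forbid: <regex>  # reason` from a
    security-policy.md-style file (hypothetical format)."""
    rules = []
    for line in text.splitlines():
        if line.strip().startswith("forbid:"):
            pattern = line.split("forbid:", 1)[1].split("#", 1)[0].strip()
            rules.append(re.compile(pattern))
    return rules

def check_diff(added_lines, rules):
    """Return the added lines of a proposed change that violate any rule."""
    return [ln for ln in added_lines if any(r.search(ln) for r in rules)]
```

Run before a fix is proposed, a check like this turns "what's a good fix?" into "what's an acceptable fix under this codebase's rules?".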

Pinpoint – Het Patel

Frontend feedback delivered through a chat window is unavoidably vague. “Move that element” or “change that color” leaves the agent guessing which element you actually mean. Pinpoint takes that piece of the ambiguity off the table: developers drop pins directly on a live page, attach a comment to each, and send the whole batch to the agent with precise on-page context attached. The agent now knows exactly which element you meant – even if it still has to figure out what change you want.
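The pin batch is essentially structured data. A minimal sketch of what such a payload might look like, with an invented schema (Pinpoint's real wire format is not documented here):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Pin:
    """One annotation dropped on a live page (hypothetical schema)."""
    selector: str   # CSS selector resolved from the clicked element
    x: int          # page coordinates where the pin was dropped
    y: int
    comment: str    # the note attached to this pin

def batch_payload(pins):
    """Bundle all pins into one structured message for the agent,
    so each comment arrives with its exact on-page target."""
    return json.dumps({"annotations": [asdict(p) for p in pins]}, indent=2)
```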

The project ships in two pieces: one for annotating web pages in a browser, and a desktop companion for marking up anything visible on screen – useful when the interface in question isn’t a web page.

What the finalists show

Looking across these six projects, a clear pattern emerges. Codex embedded in the IDE isn’t just a faster way to write code – it’s a reasoning layer you can watch think, a structured output engine you can direct, a participant in workflows that span hardware instruments, production alerts, shared session state, and context windows. And the IDE becomes the place where all of that comes together: visible, controllable, and version-controlled.

That’s the possibility these teams spent a weekend proving out, and it’s only the beginning.

View the full submission gallery.


Breaking the code: Multi-stage ‘code of conduct’ phishing campaign leads to AiTM token compromise


Phishing campaigns continue to grow more sophisticated and refined, blending social engineering, delivery and hosting infrastructure, and authentication abuse to remain effective against evolving security controls. A large-scale credential theft campaign observed by Microsoft Defender Research exemplifies this trend, using code of conduct-themed lures, a multi-step attack chain, and legitimate email services to distribute fully authenticated messages from attacker-controlled domains.

The campaign targeted tens of thousands of users, primarily in the United States, and directed them through several stages of CAPTCHA and intermediate staging pages designed to reinforce legitimacy while filtering out automated defenses. The lures in this campaign used polished, enterprise-style HTML templates with structured layouts and preemptive authenticity statements, making them appear more credible than typical phishing emails and increasing their plausibility as legitimate internal communications. Because the messages contained concerning accusations and repeated time-bound action prompts, the campaign created a sense of urgency and pressure to act.  


The attack chain ultimately led to a legitimate sign-in experience that was part of an adversary‑in‑the‑middle (AiTM) phishing flow, which allowed the attackers to proxy the authentication session and capture authentication tokens that could provide immediate account access. Unlike traditional credential harvesting, AiTM attacks intercept authentication traffic in real time, bypassing non-phishing-resistant multifactor authentication (MFA).

In this blog, we’re sharing our analysis of this campaign’s lures, infrastructure, and techniques. Organizations can defend against financial fraud initiated through phishing emails by educating users about phishing lures, investing in advanced anti-phishing solutions like Microsoft Defender for Office 365 and configuring essential email security settings, and encouraging users to employ web browsers that support SmartScreen. Organizations can also enable network protection, which lets Windows use SmartScreen as a host-based web proxy.

Multi-step social engineering campaign leading to credential theft

Between April 14 and 16, 2026, the Microsoft Defender Research team observed a series of sophisticated phishing campaigns targeting more than 35,000 users across over 13,000 organizations in 26 countries, with the majority of targets located in the United States (92%). The campaign did not focus on a single vertical but instead impacted a broad range of industries, most notably healthcare & life sciences (19%), financial services (18%), professional services (11%), and technology & software (11%). Messages were distributed in multiple distinct waves between 06:51 UTC on April 14 and 03:54 UTC on April 16.

Figure 1. Timeline of campaign messages sent by hour
Figure 2. Campaign recipients by country and industry

Emails in this campaign posed as internal compliance or regulatory communications, using display names such as “Internal Regulatory COC”, “Workforce Communications”, and “Team Conduct Report”. Subject lines included “Internal case log issued under conduct policy” and “Reminder: employer opened a non-compliance case log”.

Message bodies claimed that a “code of conduct review” had been initiated, referenced organization-specific names embedded within the text, and instructed recipients to “open the personalized attachment” to review case materials. At the top of each message, a notice stated that the message had been “issued through an authorized internal channel” and that links and attachments had been “reviewed and approved for secure access”, reinforcing the email’s purported legitimacy. To further support the confidentiality of the supposed review, the end of each message contained a green banner stating that the contents had been encrypted using Paubox, a legitimate service associated with HIPAA-compliant communications.

Figure 3. Sample phishing email

Analysis of the sending infrastructure indicated that the campaign emails were sent using a legitimate email delivery service, likely originating from a cloud-hosted Windows virtual machine. The messages were sent from multiple sender addresses using domains that are likely attacker-controlled.

Each campaign email included a PDF attachment with filenames such as Awareness Case Log File – Tuesday 14th, April 2026.pdf and Disciplinary Action – Employee Device Handling Case.pdf. The attachment provided additional context about the supposed conduct review, including a summary of the review process and instructions for accessing supporting documentation. Recipients were directed to click a “Review Case Materials” link within the PDF, which initiated the credential harvesting flow.

Figure 4. PDF attachment

When clicked, users were initially directed to one of two attacker-controlled domains (for example, acceptable-use-policy-calendly[.]de or compliance-protectionoutlook[.]de). These landing pages displayed a Cloudflare CAPTCHA, presented as a mechanism to validate that the user was coming “from a valid session”. This CAPTCHA likely served as a gating mechanism to impede automated analysis and sandbox detonation. 

Figure 5. CAPTCHA challenge

After completing the CAPTCHA, users were redirected to an intermediate site designed to prepare them for the final stage of the attack. This page informed users that the requested documentation was encrypted and required account authentication. While this stage of the attack has several hallmarks of device code phishing, we were only able to confirm the AiTM portion of the attack chain.

Figure 6. Intermediate site asking users to click “Review & Sign”

After clicking the provided “Review & Sign” button, users were presented with a sign-in prompt requesting their email address.

Figure 7. Prompt directing users to enter their email address

After submission, users were required to complete a second CAPTCHA involving image selection.

Figure 8. Second CAPTCHA challenge

Once these steps were completed, users were shown a message indicating that verification was successful and that their “case” was being prepared.

Figure 9. Message telling users that “Verification completed successfully”

Following these steps, users were redirected to a third site hosting the final stage of the attack. Analysis of the underlying code indicates that the final destination varied depending on whether the user accessed the workflow from a mobile device or a desktop system.

Figure 10. Code used to redirect users based on platform

On the final page, users were informed that all materials related to their code of conduct review had been “securely logged”, “time-stamped”, and “maintained within the organization’s centralized compliance tracking system”. They were then prompted to schedule a time to discuss the case, which required signing in to their account.

Figure 11. Final page instructing users to sign in

Selecting the “Sign in with Microsoft” option redirected users to a Microsoft authentication page, initiating an AiTM session hijacking flow designed to capture authentication tokens and compromise user accounts.

Mitigation and protection guidance

Microsoft recommends the following mitigations to reduce the impact of this threat. Check the recommendations card for the deployment status of monitored mitigations.

  • Review the recommended settings for Exchange Online Protection and Microsoft Defender for Office 365 to ensure your organization has established essential defenses and knows how to monitor and respond to threat activity.
  • Invest in user awareness training and phishing simulations. Attack simulation training in Microsoft Defender for Office 365, which also includes simulating phishing messages in Microsoft Teams, is one approach to running realistic attack scenarios in your organization.
  • Enable Zero-hour auto purge (ZAP) in Defender for Office 365 to quarantine sent mail in response to newly acquired threat intelligence and retroactively neutralize malicious phishing, spam, or malware messages that have already been delivered to mailboxes.
  • Responders could also manually check for and purge unwanted emails containing URLs and/or Subject fields that are similar, but not identical, to those of known bad messages. Investigate malicious email that was delivered in Microsoft 365 and use Threat Explorer to find and delete phishing emails.
  • Turn on Safe Links and Safe Attachments in Microsoft Defender for Office 365.
  • Enable network protection in Microsoft Defender for Endpoint.
  • Encourage users to use Microsoft Edge and other web browsers that support Microsoft Defender SmartScreen, which identifies and blocks malicious websites, including phishing sites, scam sites, and sites that host malware.
  • Enable password-less authentication methods (for example, Windows Hello, FIDO keys, or Microsoft Authenticator) for accounts that support password-less. For accounts that still require passwords, use authenticator apps like Microsoft Authenticator for multifactor authentication (MFA). Refer to this article for the different authentication methods and features.
  • Configure automatic attack disruption in Microsoft Defender XDR. Automatic attack disruption is designed to contain attacks in progress, limit the impact on an organization’s assets, and provide more time for security teams to remediate the attack fully.

Microsoft Defender detections

Microsoft Defender customers can refer to the list of applicable detections below. Microsoft Defender coordinates detection, prevention, investigation, and response across endpoints, identities, email, and apps to provide integrated protection against attacks like the threat discussed in this blog.

Tactic: Initial access
Observed activity: Phishing emails
Microsoft Defender coverage: Microsoft Defender for Office 365
– A potentially malicious URL click was detected
– A user clicked through to a potentially malicious URL
– Suspicious email sending patterns detected
– Email messages containing malicious URL removed after delivery
– Email messages removed after delivery
– Email reported by user as malware or phish

Tactic: Persistence
Observed activity: Threat actors sign in with stolen valid identities
Microsoft Defender coverage: Microsoft Entra ID Protection
– Anomalous Token
– Unfamiliar sign-in properties
– Unfamiliar sign-in properties for session cookies

Microsoft Defender for Cloud Apps
– Impossible travel activity

Microsoft Security Copilot

Microsoft Security Copilot is embedded in Microsoft Defender and provides security teams with AI-powered capabilities to summarize incidents, analyze files and scripts, summarize identities, use guided responses, and generate device summaries, hunting queries, and incident reports.

Customers can also deploy AI agents, including Microsoft Security Copilot agents, to perform security tasks efficiently.

Security Copilot is also available as a standalone experience where customers can perform specific security-related tasks, such as incident investigation, user analysis, and vulnerability impact assessment. In addition, Security Copilot offers developer scenarios that allow customers to build, test, publish, and integrate AI agents and plugins to meet unique security needs.

Threat intelligence reports

Microsoft Defender XDR customers can use the following threat analytics reports in the Defender portal (requires license for at least one Defender XDR product) to get the most up-to-date information about the threat actor, malicious activity, and techniques discussed in this blog. These reports provide the intelligence, protection information, and recommended actions to prevent, mitigate, or respond to associated threats found in customer environments.

Microsoft Security Copilot customers can also use the Microsoft Security Copilot integration in Microsoft Defender Threat Intelligence, either in the Security Copilot standalone portal or in the embedded experience in the Microsoft Defender portal to get more information about this threat actor.

Hunting queries

Microsoft Defender XDR customers can run the following advanced hunting queries to find related activity in their networks:

Campaign emails by sender address

The following query identifies emails associated with this campaign using a message’s sending email address.

EmailEvents
| where SenderMailFromAddress in ("cocpostmaster@cocinternal.com", "nationaladmin@gadellinet.com", "nationalintegrity@harteprn.com", "m365premiumcommunications@cocinternal.com", "documentviewer@na.businesshellosign.de")

Indicators of compromise

Indicator | Type | Description | First seen | Last seen
compliance-protectionoutlook[.]de | Domain | Domain hosting malicious campaign content | 2026-04-14 | 2026-04-16
acceptable-use-policy-calendly[.]de | Domain | Domain hosting malicious campaign content | 2026-04-14 | 2026-04-16
cocinternal[.]com | Domain | Domain hosting sender email address | 2026-04-14 | 2026-04-16
gadellinet[.]com | Domain | Domain hosting sender email address | 2026-04-14 | 2026-04-16
harteprn[.]com | Domain | Domain hosting sender email address | 2026-04-14 | 2026-04-16
cocpostmaster[@]cocinternal.com | Email address | Email address used to send campaign emails | 2026-04-14 | 2026-04-16
nationaladmin[@]gadellinet.com | Email address | Email address used to send campaign emails | 2026-04-14 | 2026-04-16
nationalintegrity[@]harteprn.com | Email address | Email address used to send campaign emails | 2026-04-14 | 2026-04-16
m365premiumcommunications[@]cocinternal.com | Email address | Email address used to send campaign emails | 2026-04-14 | 2026-04-16
documentviewer[@]na.businesshellosign.de | Email address | Email address used to send campaign emails | 2026-04-14 | 2026-04-16
Awareness Case Log File – Monday 13th, April 2026.pdf | Filename | Name of PDF attachment containing phishing link | 2026-04-14 | 2026-04-14
Awareness Case Log File – Tuesday 14th, April 2026.pdf | Filename | Name of PDF attachment containing phishing link | 2026-04-15 | 2026-04-15
Awareness Case Log File – Wednesday 15th, April 2026.pdf | Filename | Name of PDF attachment containing phishing link | 2026-04-16 | 2026-04-16
5DB1ECBBB2C90C51D81BDA138D4300B90EA5EB2885CCE1BD921D692214AECBC6 | SHA-256 | File hash of campaign PDF attachment | 2026-04-14 | 2026-04-16
B5A3346082AC566B4494E6175F1CD9873B64ABE6C902DB49BD4E8088876C9EAD | SHA-256 | File hash of campaign PDF attachment | 2026-04-14 | 2026-04-16
11420D6D693BF8B19195E6B98FEDD03B9BCBC770B6988BC64CB788BFABE1A49D | SHA-256 | File hash of campaign PDF attachment | 2026-04-14 | 2026-04-16

Learn more

For the latest security research from the Microsoft Threat Intelligence community, check out the Microsoft Threat Intelligence Blog.

To get notified about new publications and to join discussions on social media, follow us on LinkedIn, X (formerly Twitter), and Bluesky.

To hear stories and insights from the Microsoft Threat Intelligence community about the ever-evolving threat landscape, listen to the Microsoft Threat Intelligence podcast.

The post Breaking the code: Multi-stage ‘code of conduct’ phishing campaign leads to AiTM token compromise appeared first on Microsoft Security Blog.


Register now for OpenClaw: After Hours @ GitHub


OpenClaw, one of the fastest-growing open source projects, has already picked up over 350,000 stars and an early community of builders exploring what agentic systems can actually do in practice.

That’s why, on June 3, 2026, we are hosting OpenClaw: After Hours at GitHub HQ in San Francisco. The event will take place during Microsoft Build 2026.

This evening is a chance to bring the OpenClaw community together in the same room.

We’ll kick things off in the early evening with a fireside conversation featuring Peter Steinberger, the ClawFather and creator of OpenClaw, followed by a panel with OpenClaw maintainers and ecosystem builders sharing what’s working—and what’s not—when shipping real agentic systems.

Later in the evening, we’ll move into a series of fast-paced lightning talks and close things out with a relaxed happy hour to connect with other builders.

If you have been following the project or building with it yourself, this is a good chance to meet others, trade notes, and get your claws into what people are actually shipping.

👉 For the full agenda and speaker lineup, please see the registration page.

📍 GitHub HQ, 275 Brannan St., San Francisco
🗓 June 3, 5:30 p.m. – 9 p.m.
📺 Livestream: twitch.tv/github

Drinks and snacks will be provided. There will be a lot here to chew on. No shellfish behavior please. And bring your sharp ideas!

Spots are limited, so register early and come ready to share what you are working on.

‼️ Please note: Submitting a registration does not guarantee attendance. We’ll follow up to confirm successful registrations.

What is OpenClaw?

OpenClaw is an open source framework for building and running agentic systems, focused on giving developers real control over how agents execute tasks in the wild. It provides the core pieces for orchestrating tools, managing state, and handling long running workflows, so you can move beyond prompt demos and ship systems that actually do work. It’s also probably convinced more than a few people to buy a Mac Mini just to run “one small experiment” that somehow turned into a permanent setup.

Hear more about OpenClaw from the creator himself, Peter Steinberger: 

The post Register now for OpenClaw: After Hours @ GitHub appeared first on The GitHub Blog.


We Gave Agents IDE-Native Search Tools. They Got Faster and Cheaper.


We ran the same coding tasks with and without prebundled tooling, across multiple models and languages. Here’s what changed.

Eval-driven development

IDE-native search reduced latency, cost, and budget overruns.

The comparison below uses paired task-level deltas. Aggregate medians and totals are shown for orientation. Budget overruns are tasks that exceeded the USD 0.50 per-task cap.

  • Median latency: −8.33% (83.11s → 79.03s)
  • P95 latency: −16.44% (268.71s → 213.17s)
  • Total cost: −5.60% (USD 44.17 → USD 41.67)
  • Budget overruns: −33.28% (6.67% → 4.44%)

Why We Built This

When coding agents search code, they default to shell tools. grep and find work, but they’re blind to project structure, symbol boundaries, and language semantics. The agent burns tokens sifting through noisy output and making follow-up calls to narrow things down.

So we tried something obvious: what if the agent could use the IDE’s own search instead?

We built a prebundled skill that pairs a search prompt with a unified MCP tool. One tool, four modes: file search, text search, regex, and symbol lookup. A universal router dispatches calls to the right backend.
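The router itself is a thin dispatch layer. A minimal sketch with stand-in backends and invented mode names (the real backends would query the IDE's indices and project model):

```python
def make_search_tool(backends):
    """Unified search entry point dispatching to mode-specific backends
    (illustrative sketch).

    backends: dict mapping mode -> callable(query) -> list of results
    """
    valid = {"files", "text", "regex", "symbols"}

    def search(mode, query):
        if mode not in valid:
            raise ValueError(f"unknown mode: {mode!r}")
        return backends[mode](query)  # router picks the right backend

    return search
```

The point of the single entry point is that the agent learns one tool signature while the router handles the fan-out to four different search implementations.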

MCP Tools

Functions the agent calls via an MCP server during task execution. IDE-native tools can tap into indices, ASTs, and project models that shell tools cannot see.

Skills

Packaged agent behaviors: a prompt plus orchestration logic. A skill can work on its own, use tools, or ship bundled with the tools it needs.

Nothing ships by default until the eval says it should. We tested four different configurations of this tooling before picking one.

Methodology

The eval pipeline spins up an MCP server alongside the IDE so the agent has access to the configured tools and skills. We run identical coding tasks with and without tooling, then compare with paired delta analysis.

We track four things: quality, latency, cost, and budget discipline. Quality asks whether all tests passed. Latency tracks median and P95 task time. Cost converts token consumption into dollars. Budget discipline tracks how often a single task exceeds the USD 0.50 budget cap.

We report improvement deltas only when they pass our significance threshold: p < 0.05, paired test with 95% confidence intervals. Metrics without a significant change are either omitted from the charts or called out explicitly. We tried four configuration variants, selected the one with the best latency and cost tradeoff, then re-ran it on different models and languages to check that the results held.
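For large samples, the paired-delta check amounts to asking whether the confidence interval around the mean per-task difference excludes zero. A simplified sketch using a normal approximation; the team's actual method is a paired t-test, which handles small samples properly:

```python
from statistics import mean, stdev
from math import sqrt

def paired_delta(baseline, treated, z=1.96):
    """Paired per-task deltas with a normal-approximation 95% CI
    (sketch; assumes a large sample, unlike a proper paired t-test).

    Returns (mean_delta, ci_low, ci_high, significant).
    """
    diffs = [t - b for b, t in zip(baseline, treated)]
    m = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean delta
    lo, hi = m - z * se, m + z * se
    # Significant at roughly the 5% level if the CI excludes zero.
    return m, lo, hi, (lo > 0 or hi < 0)
```

Pairing matters here: comparing per-task differences cancels out the large task-to-task variance that would swamp an unpaired comparison of the two aggregate distributions.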

Eval frame

Same tasks, same grading, one controlled difference.

Quality All-tests-passed rate, checked before performance claims.
Latency Median and P95 task duration, compared with paired deltas.
Cost Token use converted to dollars across the task set.
Budget discipline Share of tasks exceeding the USD 0.50 single-task cap.

Results

The selected configuration was a prebundled search skill plus a unified IDE-native tool and universal router. Compared with the no-tooling baseline, it reduced latency and cost without producing a statistically significant quality change.

Baseline vs. tooling

Absolute metrics moved in the right direction.

  • Median latency: 83.11s → 79.03s (−8.33%)
  • P95 latency: 268.71s → 213.17s (−16.44%)
  • Total cost: USD 44.17 → USD 41.67 (−5.60%)
  • Budget overruns: 6.67% → 4.44% (−33.28%)

No statistically significant change in quality. All shown deltas passed the significance threshold.

Trace snapshots

The difference is visible in the agent’s path through the project.

These are shortened traces from cases that improved in both time and cost. The baseline spends more steps discovering context; the prebundled setup gets to the relevant files faster.

Service comments and replies
Prompt: Update service and controller layers for comments and replies.
Before (no prebundled IDE search):
  agent> list files -> search x2 -> list files x2
  agent> jar inspect x5 -> javap -> jar inspect -> javap x5
  agent> curl download -> decompile -> search -> find files x2
  agent> read 9 files -> edit file x8 -> respond
  Time: 472s
After (prebundled skill and unified search):
  agent> read SKILL.md -> search x3 -> read 5 files
  agent> read FeatureController.java -> read 4 files
  agent> edit file x2 -> respond
  Time: 127s

Jackson key deserializer
Prompt: Preserve detailed error messages from a custom key deserializer.
Before (broad code walk):
  agent> list files -> search x2 -> read README.md
  agent> search x5 -> read DeserializationContext.java
  agent> search x4 -> read StdDeserializer.java
  agent> search -> read DeserializerCache.java
  agent> read MapEntryDeserializer.java -> read JsonMappingException.java
  agent> edit file -> respond
  Time: 150s
After (targeted search):
  agent> read SKILL.md -> search x3
  agent> read MapDeserializer.java
  agent> read StdKeyDeserializer.java
  agent> read DeserializationContext.java
  agent> edit file -> respond
  Time: 34s

Configuration Explorer

We tested four tool configurations before choosing the final shape. Lower latency and lower total cost are better, so the lower-left corner of the plot is the target.

Configuration search

The selected option had the best latency while preserving cost reduction.

Configurations compared: Baseline, 4 Search Tools, Unified Search Tool, 4 Tools + Router, and Unified Tool + Router. Median latency across configurations ranged from 78s to 84s, and total cost from USD 39.50 to USD 45.00.

Cross-Model Validation

We re-ran the experiment with GPT 5.4 on Java and Kotlin codebases. The pattern holds: latency and cost both drop. Kotlin saw the biggest cost improvement, with total cost falling 13.48%.

Cross-model check

The effect held beyond the original run.

Codex 5.2
  • Median latency: −8.33%
  • Total cost: −5.60%
  • P95 latency: −16.44%

GPT 5.4, Java
  • Median latency: −3.75%
  • Total cost: −4.07%
  • P95 latency: −13.00%

GPT 5.4, Kotlin
  • Median latency: −6.92%
  • Total cost: −13.48%
  • P95 latency: not significant

Missing bars mean that metric was not statistically significant for that model and language.

How Models Adopt Tooling

Codex sends 91% of its search calls through the new IDE-native tool. Claude is a different story: Opus uses it for about half its searches, and Haiku only 28%, preferring grep and find instead.

This makes sense. Claude already has strong built-in code search, so it leans on what it knows. Codex doesn’t, so it grabs the better tool when one is available. The takeaway: prebundled tooling fills gaps. Where the model already has good search, it adds less. Where search is weak, it makes a real difference.

Tool adoption

Models do not use new tools at the same rate.

  • Codex: IDE Search 91%, grep 8%, find 1%
  • Claude Opus: IDE Search 53%, grep 28%, find 19%
  • Claude Haiku: IDE Search 28%, grep 33%, find 39%

What’s Next

The eval pipeline works. Now we’re using it.

We’re running the same experiment on smaller models next. Our hunch is that they’ll benefit even more, since they have less built-in search capability to fall back on.

The current results are strongest on Java and Kotlin. We’re expanding to Python, .NET, and TypeScript with bigger sample sizes.

Meanwhile, the winning configuration is being prepared for the integrated IntelliJ IDEA MCP Server, so agent sessions can use IDE-native tooling when the server is enabled.

The next step is to turn this feature on by default in upcoming AI Assistant plugin updates.

Want to try it before the default rollout?

  1. Set these registry keys to true: llm.chat.agent.codex.mcp.idea, llm.chat.agent.skills.settings.enabled, and llm.agents.contrib.bundled.skills.sync.enabled.
  2. In AI Assistant, choose Codex for the best results.
  3. Ask the agent to find something across the current project.

Measure first, ship second, keep measuring after. That’s the whole approach.

Apple introduces a new Pride Collection

The new Apple Watch Pride Edition Sport Loop, watch face, and iPhone and iPad wallpaper celebrate LGBTQ+ communities during Pride Month and beyond.


Developing internal skills for recurring documentation processes like release notes

My hypothesis this year around AI was that if I develop some agent skills to speed up repeatable processes, it might clear up my bandwidth and free up time for me to work on non-repeatable doc tasks. It appears to be working.