AI is forcing the data industry to consolidate — but that’s not the whole story

While AI may be the catalyst behind the recent wave of data company M&A, the market was ripe for consolidation.

Windows 11 has finally overtaken Windows 10 as the most used desktop OS

Microsoft has finally crossed an important milestone for Windows 11, months ahead of Windows 10’s end-of-support cutoff date. StatCounter data, spotted by Windows Central, now lists Windows 11 as the most used desktop operating system nearly four years after its release, with 52 percent of the market compared to 44.59 percent for Windows 10.

Windows 11 became the most popular OS for PC gaming in September, but overall adoption had still been lagging behind Windows 10 until now. Leaked data in October 2023 also revealed Windows 11 was used by more than 400 million devices at the time, a slower adoption pace than Windows 10 — which took just a year to reach 400 million devices compared to Windows 11’s two-year period.

Part of the slow adoption is down to Windows 11’s hardware requirements. While Microsoft offered a free upgrade to Windows 10 users, millions of machines have been left behind due to stricter CPU and security requirements. Microsoft has been trying to convince the owners of these machines to upgrade their hardware in order to get Windows 11, sometimes with a full-screen prompt.

Windows 10 is due to reach end of support on October 14th, and Microsoft recently revealed it would give away a free year of extra security updates to consumers if they were willing to enable Windows Backup and sync their Documents folder to OneDrive. If you don’t want to do this, you’ll have to pay $30 for a year of updates, or redeem 1,000 Microsoft Rewards points.


Don’t Build Chatbots — Build Agents With Jobs


You wouldn’t give every employee root access to your production servers. So why give AI agents unfiltered access to your business?

There’s a growing fantasy in the AI world: fully autonomous agents that can handle any task, in any domain, with zero oversight. It’s a compelling vision, but it’s also one of the biggest reasons enterprise adoption continues to stall.

Large language models (LLMs) are powerful, but fundamentally unreliable. This isn’t just about occasional errors; hallucination is an “inevitable” limitation of the underlying tech. One prompt might return something useful, the next complete nonsense. A recent Stanford study found hallucination rates as high as 88% in legal scenarios. Businesses see this, and trust is fragile. Promising open-ended intelligence only distracts from the value these systems can reliably deliver today.

If you want useful AI, you have to work with the grain of the technology. That means embracing constraints: closed-world problems, purpose-built agents, scoped tools, strong evals and layered governance.

LLMs Will Never Know You Unless You Tell Them

We love to compare LLMs to human brains. They seem smart, articulate, even insightful. But as much as we like the comparison, LLMs aren’t like people, and expecting them to behave that way sets you up for disappointment.

In the first wave of AI, systems were trained for one job at a time.

You’d take your company’s data, label it and train a model to do something very specific, like predict churn or classify support tickets. The result was a model that really understood your domain, because it had your data baked right into its weights. The trade-off was flexibility. You couldn’t use that same model to write blog posts or analyze images. It was good at one thing and only that thing.

The purpose-built model training pipeline.

Foundation models are the opposite.

They’re flexible, capable of handling a huge range of tasks. But they weren’t trained on your systems, your processes or your customer data. So while they’re great generalists, they’re really very dumb about your business.

That’s where prompt design and contextualization come in. To get useful results, you have to give the model everything it needs to know, every single time. There’s no memory between prompts. Every interaction starts from scratch.

Contextualizing prompts for app-specific answers.

The key is this: Reliability doesn’t come for free. It has to be engineered. And it’s not binary. Reliability is a spectrum, and what’s “reliable enough” depends entirely on the use case.

If you’re fully automating financial transactions or medical records processing, you may need near-perfect accuracy. In that world, LLMs either aren’t ready (and may never be) or you need to wrap them in many layers of validation, testing and fallback logic to make them safe. On the other hand, if you’re generating meeting notes, surfacing helpful support docs or summarizing internal comms, an occasional error may be acceptable as long as the overall signal is useful.

Understanding that difference is what separates AI that works in the real world from prototypes that fall apart in production. If you want consistent, reliable outputs, treat context like a first-class input. Keep it tight, stay on task and give the model just enough to succeed at the job you’ve asked it to do.
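
Here is a minimal sketch of what treating context as a first-class input can look like in code. Everything in it is illustrative: the customer record, the helper names and the message format are assumptions, not any particular vendor's API.

def fetch_customer_record(customer_id: str) -> str:
    # Stand-in for a real lookup; returns only the context this task needs.
    return "Plan: Pro tier. Open tickets: 2 (billing, login). Renewal: 2025-09-01."

def build_messages(task: str, customer_id: str) -> list[dict]:
    """Assemble everything the model needs for this one call.
    Nothing persists between calls, so business context is re-supplied every time."""
    record = fetch_customer_record(customer_id)
    return [
        {"role": "system",
         "content": "You summarize support history. Use only the data provided; "
                    "answer 'unknown' if something is missing."},
        {"role": "user",
         "content": f"Customer record:\n{record}\n\nTask: {task}"},
    ]

# Every interaction starts from scratch: same pattern, fresh context.
messages = build_messages("List this customer's open issues.", "cust-123")
# Pass `messages` to whichever chat completion API you use.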

LLMs Are the New OS; Don’t Give Everyone Terminal Access

In his speech “Software is Changing (Again),” Andrej Karpathy, former director of AI at Tesla, compares LLMs to operating systems. Just as apps run on macOS or Windows, AI apps now run on GPT, Claude and Gemini. In that analogy, ChatGPT is the terminal: powerful, flexible, and open-ended.

But terminals aren’t for end users. They’re for developers. In real systems, we don’t give people raw access; we build structured interfaces. The same applies here. AI needs guardrails, not root access.

You wouldn’t let employees poke around your production database using raw SQL. You’d give them scoped access through tools, like dashboards or apps, that help them do their job without exposing everything underneath. The same logic applies here.

For business AI systems, the ideal UX isn’t a chat box. It’s not about someone typing questions into a prompt window and hoping for a helpful answer. The real opportunity is in AI systems that are always on, quietly working behind the scenes. These systems don’t wait for a human to ask a question. They react to signals: a new support ticket, a customer abandoning their cart, an incident alert. And when those signals show up, they act with purpose, context and constraints.

This shift, from human-prompted to signal-driven, requires a different approach to UX.

You’re not building general-purpose chatbots. You’re building agents with a job to do. That means giving them access to just the right data, at just the right time, for a clearly defined purpose.
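
As a rough illustration of the shift from human-prompted to signal-driven, here is a small Python sketch. The event names, fields and handlers are hypothetical; the point is that each signal routes to a scoped agent with a clearly defined job instead of a general chat window.

HANDLERS = {}

def on(signal_type):
    # Register a handler ("agent") for one kind of business signal.
    def register(func):
        HANDLERS[signal_type] = func
        return func
    return register

@on("support_ticket.created")
def triage_ticket(event: dict) -> None:
    # A purpose-built agent would classify and route the ticket here.
    print(f"Triaging ticket {event['id']} about {event['subject']!r}")

@on("cart.abandoned")
def recover_cart(event: dict) -> None:
    # A different agent, with different data access and a different job.
    print(f"Drafting follow-up for customer {event['customer_id']}")

def dispatch(event: dict) -> None:
    handler = HANDLERS.get(event["type"])
    if handler:
        handler(event)

dispatch({"type": "support_ticket.created", "id": "T-101", "subject": "login fails"})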

To effectively use AI, prioritize tackling closed-world problems.

Focus on Closed-World Problems

One of the biggest mistakes teams make with AI agents is trying to solve open-world problems when they should be solving closed-world ones.

Open-world problems are things like “teach me math” or “write me something interesting.” There’s no clear input, no fixed output and no single way to measure whether the result is good or bad. The model might give you something clever or it might completely miss the mark.

Closed-world problems are different. You can define the inputs, the expected outputs and the criteria for success. Think about things like:

  • Processing an insurance claim
  • Troubleshooting an IT ticket
  • Onboarding a new customer or employee

These are bounded tasks with rules, constraints and measurable outcomes. That’s what makes them suitable for LLM-based systems, not just because they’re easier to automate, but because they’re easier to trust.

It’s no coincidence that code generation has been one of the most successful LLM use cases so far. The inputs are clear, the outputs are testable and correctness is verifiable. You can run the code. That same pattern of clear expectations and tight feedback loops is exactly what makes closed-world business use cases viable.

When you focus on closed-world problems, you get:

  • Better testability: You can write test cases and know what a good response looks like.
  • More explainability: It’s easier to debug or audit the system when things go wrong.
  • Tighter guardrails: You can limit what the AI is allowed to see, say or do.

The more you can reduce ambiguity, the more reliable your AI systems become. It’s all about designing with intent. Solve problems where the path is clear, the scope is defined and the stakes are known. Add as much determinism to the inherently nondeterministic process as you can. That’s how you build AI that delivers results instead of surprises.

Purpose-Built Agents, Not General Chatbots

Once you’ve narrowed the problem space, the next step is to break it down.

Trying to build a single, all-knowing AI that handles everything from customer service to sales forecasting is a fast track to complexity and chaos. Instead, treat AI like software: modular, composable and scoped to specific jobs.

That’s where purpose-built agents come in.

A purpose-built agent is an AI system with a clearly defined responsibility, like triaging support tickets, monitoring system logs or generating weekly sales reports. Each agent is optimized to do one thing well.

Building purpose-built agents.

And just like in software, the power comes from composition.

Take a closed-world problem like processing an insurance claim. It’s not just one step. It’s a series of structured, interconnected tasks: validating inputs, checking eligibility, fetching relevant policy details, summarizing the case and escalating exceptions. Instead of building one monolithic agent to handle all of it, you can design atomic agents, each handling a specific piece, and orchestrate them into a multiagent system.

This kind of decomposition makes your AI systems more reliable, more secure and easier to evolve. And just like with software microservices, the magic happens not just within each agent but in the way they work together.
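
To make the decomposition concrete, here is a stripped-down sketch in Python. The claim fields, agent functions and escalation rules are invented for illustration; in a real system each step would wrap an LLM call plus its own scoped tools.

from dataclasses import dataclass, field

@dataclass
class Claim:
    claim_id: str
    policy_id: str
    amount: float
    notes: list[str] = field(default_factory=list)
    escalated: bool = False

# Each atomic agent owns one narrow, testable responsibility.
def validate_inputs(claim: Claim) -> Claim:
    if claim.amount <= 0:
        claim.escalated = True
        claim.notes.append("Invalid amount; needs human review.")
    return claim

def check_eligibility(claim: Claim) -> Claim:
    # Stand-in for a policy lookup behind a scoped tool.
    if not claim.policy_id.startswith("POL-"):
        claim.escalated = True
        claim.notes.append("Policy not found or inactive.")
    return claim

def summarize_case(claim: Claim) -> Claim:
    claim.notes.append(f"Claim {claim.claim_id} for {claim.amount:.2f} summarized.")
    return claim

def process_claim(claim: Claim) -> Claim:
    # Orchestrator: a fixed order of agents, with early exit to a human on escalation.
    for step in (validate_inputs, check_eligibility, summarize_case):
        claim = step(claim)
        if claim.escalated:
            break
    return claim

print(process_claim(Claim("CLM-42", "POL-7", 180.0)).notes)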

Build Tools for LLMs, Not People

Once you’ve broken down your system into purpose-built agents, the next step is giving them the right tools, just like you would for any team.

LLMs rely entirely on what you expose to them and how well it’s described. So if you want your agents to behave predictably, your tools need to be designed with that in mind.

Thinking loop to decide on tool use.

And it’s not just a matter of exposing your existing API endpoints. A generic tool, like an open-ended SQL interface to your production database, might seem powerful, but it’s incredibly hard for an LLM to use safely.

Imagine what it takes for an agent to write a query with a generic SQL tool:

  1. Ask for the schema.
  2. Parse a large, potentially messy schema response.
  3. Infer the right table to use.
  4. Guess at the necessary joins across related tables.
  5. Construct the correct SELECT statement.
  6. Try to decide how much data to return.
  7. Format the result in a useful way.
  8. Handle edge cases or ambiguous fields.

Each of those steps introduces risks like wrong assumptions, incomplete context, ambiguous naming and high potential for hallucination. Worse, if the query fails, most agent frameworks will retry the entire sequence, often with slight prompt tweaks. That leads to token bloat, cascading retries and increased cost without improving the result.

You end up with all the downsides of open-world problems: unclear intent, wide decision space, unpredictable behavior. Like agents, tools should be purpose-built. They should be designed specifically to help the agent solve a well-scoped task. Think “fetch today’s unshipped orders” instead of “run any query you want.”

The more you reduce ambiguity, the more reliably the model can get the job done.

That means building tools that are:

  • Strongly typed: No ambiguity in what goes in or what comes out.
  • Constrained: Small, focused tools are easier for agents to reason about and harder to misuse.
  • Self-describing: Tools should include metadata, examples and descriptions to help the model know when and how to use them.
  • Access-controlled: Just like with users, not every agent should have access to every tool. Scope matters.

This is where protocols like the Model Context Protocol (MCP) come in. MCP helps standardize the way tools are defined and described so agents can reason about them more effectively. If you do this right, you’re giving LLMs the right context to use tools safely and correctly.
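
As a sketch of what a strongly typed, constrained, self-describing tool can look like, here is a small server written against the MCP Python SDK's FastMCP helper. The tool name, its backing data and the exact SDK surface are assumptions on my part (check the SDK docs for your version); the point is that the agent sees a typed signature and a description, not a raw SQL interface.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-tools")

@mcp.tool()
def get_unshipped_orders(limit: int = 50) -> list[dict]:
    """Return today's unshipped orders (id, customer, total)."""
    # Stand-in for a parameterized query owned by the tool, not by the model.
    orders = [
        {"id": "ORD-1010", "customer": "Globex", "total": 42.00, "shipped": True},
        {"id": "ORD-1009", "customer": "Acme", "total": 129.50, "shipped": False},
    ]
    return [o for o in orders if not o["shipped"]][:limit]

if __name__ == "__main__":
    mcp.run()  # exposes the tool over MCP (stdio transport by default)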

When you design tools correctly, you can force reliability. The model stops improvising and starts operating more like software should: with clear rules, defined behavior and predictable outcomes.

Governance, Testing and the Need for AI Testers

In traditional software, testing is about whether the output is correct. With AI agents, that’s just the beginning. You also need to test how the agent discovers tools, how it decides to use them and whether it uses them correctly.

That means spending real time on evaluations.

If you’ve built your agents and tools to be purpose-built and scoped, then writing good evals should be straightforward. You know the inputs, you know the expected outputs, and you can run consistent checks across edge cases and common workflows. This isn’t something you can just eyeball. You need repeatable, deterministic tests, just like any other production system.
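
A minimal eval harness for a scoped agent might look like the sketch below: known inputs, expected outputs, repeatable checks. The triage_ticket function is a hypothetical agent entry point; in practice it would wrap the LLM call and its tools.

EVAL_CASES = [
    {"ticket": "I was charged twice this month", "expected_queue": "billing"},
    {"ticket": "The app crashes when I upload a photo", "expected_queue": "bugs"},
    {"ticket": "How do I export my data?", "expected_queue": "how-to"},
]

def triage_ticket(text: str) -> str:
    # Placeholder agent logic so the harness runs end to end.
    text = text.lower()
    if "charge" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bugs"
    return "how-to"

def run_evals() -> None:
    failures = []
    for case in EVAL_CASES:
        got = triage_ticket(case["ticket"])
        if got != case["expected_queue"]:
            failures.append((case["ticket"], case["expected_queue"], got))
    print(f"{len(EVAL_CASES) - len(failures)}/{len(EVAL_CASES)} cases passed")
    for ticket, expected, got in failures:
        print(f"FAIL: {ticket!r} expected {expected}, got {got}")

if __name__ == "__main__":
    run_evals()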

And for many use cases, human-in-the-loop should be part of the system. You should think through what can be fully autonomous and what requires human oversight. You may need people involved for escalation, validation and learning. Let AI handle the routine, predictable tasks, and let humans step in when things get messy.

Control Is a Feature, Not a Limitation

LLMs are most effective when they’re scoped, structured and grounded in clear context. Predictability is what makes AI reliable at scale. If you want AI that delivers real results, design for control and purpose, not open-ended freedom.

The post Don’t Build Chatbots — Build Agents With Jobs appeared first on The New Stack.


Getting creative with Coding Challenges

An experiment to level up your coding skills on Stack Overflow, while learning in a space that welcomes creative problem-solving. Discover how we built it.

MCP Vulnerability Exposes the AI Untrusted Code Crisis


A critical vulnerability in Anthropic’s widely used MCP Inspector tool allows attackers to execute arbitrary code on developer machines simply by tricking them into visiting a malicious website. With over 5,000 forked repositories affected and a CVSS score of 9.4, this represents one of the first major security crises in the AI development ecosystem.

It also points to major gaps in trust that will need to be closed for nascent agentic AI interoperability architectures to work securely, and for marketplaces for AI agents to reach broad adoption.

Untrusted Code Is Quickly Spreading Across Enterprises

This vulnerability is just the latest example of how important it will become to have a safe way to run AI-generated code and adjacent components from this emerging ecosystem.

Most organizations have rigorous approval processes before allowing arbitrary code to run in their environments, whether from open source projects or vendor solutions. Yet with this new wave of tools, we’re simultaneously allowing thousands of employees to constantly update codebases with arbitrary, untrusted AI-generated code, or wiring those codebases and applications to mechanisms that can alter or modify their behavior.

This isn’t about stopping the use of AI coding agents or sacrificing the massive productivity gains they provide. Instead, we should standardize better ways to run untrusted code safely across our software development pipelines.

The Developer Machine: A Gateway to Everything

When security teams think about protecting their infrastructure, they focus on production environments, CI/CD pipelines and customer-facing systems. But there’s a massive blind spot: the developer’s local machine. This isn’t just another endpoint; it’s a treasure trove of access credentials, source code, internal documentation and often direct connections to production infrastructure.

A developer’s machine can store SSH keys for production servers, database connection strings, API keys, source code for proprietary applications, internal documentation, architectural diagrams, VPN connections to internal networks and cached credentials for cloud platforms. A successful compromise of a developer’s machine doesn’t just affect one person; it can serve as the initial access vector for a devastating supply chain attack or data breach.

MCP Vulnerability: Case Study in Modern Attack Vectors

The recently disclosed vulnerability (CVE-2025-49596) in Anthropic’s MCP Inspector serves as a case study in how modern attack vectors exploit our trust in developer tools. Here’s how the attack works:

The Attack Chain:

  • Target setup: Developer runs MCP Inspector with default settings (happens automatically with the mcp dev command).
  • Exploitation: A malicious website uses JavaScript to send requests to http://0.0.0.0:6277.
  • Code execution: The request triggers arbitrary commands on the developer’s machine.
  • Full compromise: Attacker gains complete access to the development environment.

This vulnerability allows remote code execution simply by tricking a developer into visiting a malicious website. What makes this particularly dangerous:

  • No user interaction is required beyond visiting a webpage
  • Bypasses traditional security controls by targeting localhost services
  • Exploits a 19-year-old browser flaw (0.0.0.0-day) that remains unpatched
  • Targets legitimate tools that developers use daily

As AI development tools gain adoption across enterprises, a new class of systems is emerging to support them that can execute code on behalf of developers. This includes AI code assistants generating and running code snippets, MCP servers providing AI systems access to local tools and data, automated testing tools executing AI-generated test cases and development agents performing complex multistep operations. Each of these represents a potential code execution pathway that often bypasses traditional security controls. The risk isn’t just that AI-generated code can be unintentionally harmful or outright malicious; it’s that these new systems also create pathways for untrusted code execution.

AI development tools also amplify existing security risks by creating new attack pathways to exploit known vulnerabilities. Traditional web application flaws, for instance, can now be triggered through AI-generated code or automated development agents, expanding the potential reach of previously contained threats. We are already seeing offensive AI companies and solutions out there seeking to capitalize on this.

The Broader Untrusted Code Problem

This vulnerability isn’t an isolated incident; it’s an early warning of a much larger problem. The AI development ecosystem is introducing new categories of systems that can execute code on behalf of developers. These include package dependencies with potentially malicious post-install scripts, third-party libraries that may contain vulnerabilities or backdoors, open source projects where malicious commits can be hidden in plain sight, development tools that connect to external services and code samples copied from forums, documentation or tutorials. Each of these represents a potential entry point for attackers who understand that developer machines are high-value targets.

Isolation by Default

The answer isn’t to stop using AI-generated code or to avoid external code; it’s to implement proper isolation for all untrusted code execution, in developer environments and production systems alike. If you’re betting your security on container isolation alone, you’re betting on a Linux namespace doing something it was never designed to do. Real isolation requires hardware-level separation. Container isolation is convenient, not secure. The sooner we acknowledge that, the sooner we can build systems that actually protect our workloads.

While we have focused here on development environments, the same principles apply to production environments. The kind of isolation needed in development environments is gaining validation through Apple’s Containerization Framework. However, a broader shift needs to occur to consistently treat both development and production systems with the same expectation for runtime isolation, since both AI-generated code and new components in the AI development stack can be potentially malicious. It’s also important to make isolation the default, not the exception.

The MCP vulnerability is more than a single flaw; it highlights how deeply our development environments are becoming intertwined with AI systems that can generate, interpret and execute code autonomously.

As agent-based AI architectures continue to evolve, their ability to interoperate across tools, services and platforms mirrors the complex web of transitive dependencies found in modern software supply chains. Just as a weakness in a deeply nested library can compromise an entire application, an AI system with unchecked execution privileges can expose organizations to widespread and potentially devastating risks.

The objective is not to stop the adoption of AI-driven tools or to distrust their capabilities. It is to recognize that their power lies in connectivity and autonomy, which also introduces systemic vulnerabilities. As this new AI development ecosystem matures, we must learn from decades of software security failures and build in protections from the ground up. Interoperability in AI systems must be treated with the same security-first mindset we apply to software dependencies, or we risk repeating the same mistakes on a much larger and faster scale.

*The vulnerability discussed in this post has been patched in MCP Inspector version 0.14.1. Developers using MCP Inspector should ensure they’re running the latest version and review their development environment security practices.

The post MCP Vulnerability Exposes the AI Untrusted Code Crisis appeared first on The New Stack.


Expectations for Agentic Coding Tools: Testing Gemini CLI


Before I launch Gemini CLI, Google’s open source AI terminal app, let’s look at what the “quality-of-life” expectations are for agentic applications. Now that we have several of these tools — Claude Code, Warp and OpenAI Codex are other examples — we have a better sense of what a developer needs from them.

Firstly, it needs to be easy to get started on the command line in your terminal. Developers are still the primary target audience for agentic apps, so environment variables or flags for options are fine. But getting straight in is vital.

For example, connecting your API key to your account can be done via an environment variable or in a web console. Knowing when you are running out of tokens (whether freely given or paid for) is now an important gauge.

When we hit the start button, we need a simple session intro summary so that we know at least the following things:

  • The model in use;
  • The project directory;
  • Any other pertinent permission or account information, or if a working file is being watched.

A working file in the project directory where assumptions based on the project are written and can be tracked (like the Claude.md file) is an important innovation to move beyond a session life cycle into a project life cycle.

Permission boundaries have to be respected, and in general we are in the early days of deciding when to allow the large language model (LLM) to change files, and where. I’ve argued that forcing vibe coders to use git is a bit malign — but then again, if you fail to plan you are clearly planning to fail.

Showing us an execution plan the LLM will follow to fulfill your request feels good but has not yet proven essential. Unless this is done, though, the exact tactics an LLM will use are opaque. A simple checkbox list will suffice.

A quit session summary showing time, requests and tokens used is great. Full accounts can really only be tracked on a user page.

There are plenty of other features that will creep into the above list, but we need to be aware of backsliding as well as genuinely useful innovations.

Starting up Gemini

As with all cloud-based LLMs, we must show our fealty before we get access to the precious tokens. Go to Google AI Studio to generate a key. Currently you are given 100 requests a day (check the other tier limits here).

We can install Gemini via npm at the terminal:

npm install -g @google/gemini-cli


Next, set your API key as an environment variable — I’m doing it here in the command line on my MacBook:
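
For reference, that step looks roughly like this on macOS or Linux; Gemini CLI reads the key from the GEMINI_API_KEY environment variable, and the placeholder value is of course your own key:

export GEMINI_API_KEY="YOUR_API_KEY"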

Then type the command gemini and we are off:

As I mentioned in the quality-of-life section above, this does the important thing of pointing at the active model (Gemini-2.5 Pro in this case) as well as reflecting the project directory.

The theme selection screen disappears as soon as you press return, but I assume you can bring it back. It takes up quite a lot of space on the introduction screen.

Like Claude Code, there is a markdown file — GEMINI.md in this case — for request customization. I won’t use it in this post.

What does “no sandbox” mean? The bad news is that Gemini starts off with no restrictions as to where your AI may roam. I’m afraid that isn’t very sensible, but Gemini gives you fairly straightforward options. The good news is that we can use macOS Seatbelt, which starts off with a sensible policy of restricting access to within the project directory.

So I’ll exit this session (type /quit) and we can restart with this basic security.

The quit screen provides some of the stats I referred to earlier:

We can use Seatbelt by just setting an environment variable in this session, then adding a flag:
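
Concretely, that amounts to something like the following. The SEATBELT_PROFILE variable and the --sandbox flag reflect my reading of the CLI's sandbox options at the time of writing, so treat them as assumptions and check the current docs:

export SEATBELT_PROFILE=permissive-open
gemini --sandbox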

Now we are good to go, as we have our seatbelt on.

As I did with Codex in a recent post, let’s try out the merge of two JSON files. As before, I’m looking for how the structure supports me, as much as the outcome. If you don’t want to read the previous post, imagine I have a city website that uses JSON data. I have a JSON file called original_cities.json:

{ 
  "cities": [ 
    { 
      "id": "London", 
      "text": "London is the capital of the UK", 
      "image": "BigBen" 
    }, 
    {
      "id": "Berlin", 
      "text": "Great night club scene", 
      "image": "Brandonburg Gate",
      "imageintended": "Reichstag" 
    }, 
    { 
       "id": "Paris", 
       "text": "Held the Olympics of 2024", 
       "image": "EifelTower", 
    } 
  ] 
}


The spelling errors and formatting error (extra comma) are intentional; we want to see if we can bait the LLM.

I also have another file, called updated_cities.json:

{
  "cities": [
    {
      "id": "London",
      "text": "London is the capital and largest city in Great Britain",
      "image": "BigBen"
    },
    {
      "id": "Berlin",
      "text": "Great night club scene but a small population",
      "image": "BrandenburgGate",
      "imageintended": "Reichstag"
    },
    {
      "id": "Paris",
      "text": "Held the Olympics of 2024",
      "image": "NotreDame"
    },
    {
      "id": "Rome",
      "text": "The Eternal City",
      "image": "TheColleseum"
    }
  ]
}


I want to update the first file with the contents of the second. This simulates slightly out-of-sync working. I have one condition: I want any updated image references (that I may not have yet) copied into a key called “imageintended” so that I don’t use the data and cause a crash.

Essentially all the merge should do is add the Rome entry to the first file and introduce the new image references without overwriting the existing image key.
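
For clarity, this is the merged result I expect based on my reading of the request (my own sketch, not the tool's verbatim output):

{
  "cities": [
    {
      "id": "London",
      "text": "London is the capital and largest city in Great Britain",
      "image": "BigBen"
    },
    {
      "id": "Berlin",
      "text": "Great night club scene but a small population",
      "image": "Brandonburg Gate",
      "imageintended": "BrandenburgGate"
    },
    {
      "id": "Paris",
      "text": "Held the Olympics of 2024",
      "image": "EifelTower",
      "imageintended": "NotreDame"
    },
    {
      "id": "Rome",
      "text": "The Eternal City",
      "image": "TheColleseum"
    }
  ]
}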

So my project folder looks like this. Note, I haven’t created a GEMINI.md file:

I’ll use the same request I gave to Codex:

“please update the JSON file original_cities.json with the contents of the file updated_cities.json but if the ‘image’ field is different, please update or write a new ‘imageintended’ field with the new value instead”

So let’s see what it does. This task may look specific, but is actually a bit vague, which reflects a request from the average human.

After getting confused about its project file, it gave me a perfectly good answer:

Updating text, adding the new entry and not overwriting any values in the “image” key — all done. It didn’t try to fix inconsequential spelling and didn’t get confused by the trailing comma. It was far quicker than Codex as well.

I checked the file, and indeed the changes were made. Before it answered, it didn’t quite make a plan, but gave me a fairly basic explanation of what it would do:

As the outcome was entirely correct, the process didn’t really matter. But only by checking intentions can you really correct LLM “thinking” when it takes the wrong path.

I’ll exit to show the final expenditure summary:

Conclusion

As I said, this isn’t a direct LLM comparison, but Gemini gave me an efficient agentic experience. I’m sure Google can plug in any of the missing quality-of-life issues I mentioned (specifically, some running stats on token usage), but it is definitely ready for action right now. There is a growing coterie of agentic terminal applications out there for developers to try, and Gemini CLI is a solid addition to that list.

The post Expectations for Agentic Coding Tools: Testing Gemini CLI appeared first on The New Stack.
