Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Full-stack and server-side UX experiments: Testing beyond the frontend

UI/UX design primarily focuses on creating usable and productive visual interfaces, and designers often evaluate designs using frontend-focused UX testing. However, frontend visuals and other frontend communication channels, such as voice interfaces, aren’t the only factors that affect your product’s UX: the server-side implementation and every other full-stack component affect it too. UX testing isn’t solely about evaluating UIs, so you should also evaluate UX factors in the backend to create a user-centric and profitable product.

Let’s understand what full-stack and server-side UX experiments are, their benefits, and how to go beyond frontend testing to improve the UX of any digital product.

What are full-stack and server-side experiments?

In UI/UX design, full-stack experimentation, also known as end-to-end (e2e) experimentation, refers to running UX experiments that cover both the frontend and backend of a digital product. It encourages product teams to run A/B, A/B/n, or multivariate tests on the whole product stack, covering all client-side and server-side components rather than the frontend or backend alone:

  • Frontend: Testing the UX effects of layout, control, text, color, and typography modifications in a primarily visual product, or of voice and gesture interface adjustments in a multimodal product
  • Backend: Testing UX effects of API, integration, database, algorithm, flow, service, and infrastructure modifications

Server-side experiments are pure backend tests that developers and AI engineers run to evaluate backend modifications and to optimize backend logic against UX or organizational requirements. An example is running an A/B test on two versions of a recommendation algorithm without any UI modifications.

Scopes of full-stack UX and server-side experiments.
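
A server-side A/B test like the recommendation example needs a way to decide, on the backend, which version each user gets. The sketch below shows one common approach, hashing the user ID so that assignment is stable across requests. The function names, the experiment name, and the two return values are hypothetical placeholders, not part of any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Map a user to a variant deterministically, so the same user always
    gets the same backend logic for the lifetime of the experiment."""
    # Salt the hash with the experiment name so that different experiments
    # split the user base independently of each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


def recommend(user_id: str) -> str:
    # Hypothetical stand-ins for two recommendation algorithm versions.
    variant = assign_variant(user_id, "reco-algorithm-test")
    if variant == "A":
        return "personalized"  # version A: preference-based suggestions
    return "popular"           # version B: popularity-based suggestions
```

Because the UI never changes, the only visible difference between users is which items come back, which is why the assignment has to be stable: a user who flips between variants on every request would contaminate both groups.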

Full-stack experiments vs. server-side experiments

Server-side experiments focus on evaluating the UX effects of backend modifications without UI updates, while full-stack experiments evaluate both frontend and backend modifications:

Comparison factor | Full-stack experiments | Server-side experiments
Scope | Frontend and backend | Backend
Goal | Improving UX through end-to-end optimization | Optimizing backend logic without modifying UI/UX foundation
New UI prototypes are needed | Yes | No
Noticeable product UI changes | Yes | Minimal, usually element ordering and count
Experiment conclusion | The winning design version will be deployed with frontend and backend changes | The winning backend logic will be deployed after a UX evaluation
Designer-developer collaboration | Higher | Moderate or low
Example | A/B testing two login flows | A/B testing two search algorithms

Why do designers care about full-stack and server-side experiments?

Earlier digital products were less personalized and less dynamic, drawing on static, rule-driven data sources without AI. Modern products, by contrast, pursue user-centricity and engagement through AI-powered personalization and highly adaptive interfaces that surface optimized suggestions.

Modern digital product interfaces are heavily influenced by data and logic from backend components, so purely UI-oriented testing isn’t enough to achieve a complete, effective UX test. As a result, modern product teams have to conduct full-stack experiments, covering the whole product, and they have to conduct server-side experiments, targeting backend optimization.

Example use cases

Here are some practical example use cases to understand when you can consider using full-stack and server-side experiments:

  • Recommendation engines: Conducting an A/B test with a server-side experiment to evaluate user engagement scores for two ecommerce product recommendation engine algorithms. Version A focuses more on suggesting unique products based on user preferences, and version B prioritizes popular products
  • Onboarding logic: Testing the effectiveness of two onboarding flows of a social media app: Version A asks the user to select video tags to build the initial video feed. Version B immediately starts the video feed with known basic data, like location and trends. A full-stack experiment evaluates versions A and B
  • Pricing logic: Doing a full-stack multivariate test to maximize user retention rate and revenue per user based on base pricing, discounts, billing cycle, etc., by running experimental flows with subscription UI adjustments
  • Authentication and recovery flows: Conducting a full-stack A/B test for new authentication and recovery flows after integrating a new experimental API server version that improves security factors. One version uses the original API server and original authentication UI, while the other version adheres to new server rules using the new UI

Challenges in full-stack and server-side experiments

Improving UX beyond the frontend layer creates better products, but you’ll face the following challenges if you do so:

  • Slowness in collaboration: You can run frontend-focused A/B tests faster since you’ll have to collaborate with developers only when you need to deploy A/B test variants, but full-stack and server-side experiments can be slow since designers and developers frequently have to work collaboratively
  • Feasibility issues: Technical limitations can block a particular full-stack or server-side experiment; not every experiment is practically possible
  • Limited designer visibility: Designers may see backend-related adjustments as a black box and struggle to extract UX insights, e.g., if search algorithm B didn’t perform well, UX reasoning is difficult since there are no visible foundational UI changes

Collaboration tips for designers

Full-stack and server-side UX experiments go beyond the frontend, where your role sits, and into the product backend, which is maintained by developers, architects, system administrators, AI engineers, and security engineers. Here are some collaboration tips to help you conduct full-stack and server-side experiments smoothly:

  • Co-define the hypothesis: Full-stack experiments require both frontend and backend modifications, and server-side experiments affect UX, so if you use p-value evaluation, define the test hypothesis collaboratively to plan a feasible, UX-friendly test
  • Identify the experimentation responsibilities: Identify whether the test is full-stack or server-side. If it’s a full-stack experiment, you’ll have to create new designs; if it’s a server-side experiment, you’ll have to identify possible UX implications
  • Use collaboration tools: Collaboration isn’t solely about having meetings. Maintain shared documentation, notes, and even design sketches, and skip unwanted meetings to boost everyone’s productivity
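
To make the "p-value evaluation" in the first tip concrete, here is a minimal sketch of a two-sided, two-proportion z-test, one common way to compare conversion rates between variants A and B. The function name and the choice of test are assumptions for illustration; your team may agree on a different metric or statistical method when co-defining the hypothesis.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)
```

Agreeing on the inputs up front (what counts as a conversion, how many users each variant needs, and the significance threshold) is the part that benefits most from designers and developers working together.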

Conclusion

UX is holistic: you can’t do a complete UX evaluation of a modern, highly dynamic product just by evaluating its frontend. By collaborating with the people who maintain the backend, you can use full-stack and server-side experiments to evaluate UX factors beyond the frontend and improve the UX of the whole product.

FAQs

Full-stack vs. server-side experiment: which experiment should I use?

It depends on the product enhancement goals set by product managers.

Do full-stack and server-side experiments benefit only products that use AI?

No. Products built with a simple, rule-based, non-AI backend can also use them to optimize how the backend and frontend work together.

The post Full-stack and server-side UX experiments: Testing beyond the frontend appeared first on LogRocket Blog.

5 T-SQL features that should already exist (2026 SQL Server wish list)

SQL Server is one of the most mature database platforms in the world, but that doesn’t mean it’s finished. With T-SQL now spanning Fabric, Azure SQL Database, Synapse, and SQL Managed Instance, the gap between what the language can do and what developers wish it could do is more visible than ever.

From clunky flat-file imports to licensing models that require a legal degree to navigate, there are real pain points that slow down data engineers and DBAs daily. This is Ed Pollack’s frank, opinionated look at five meaningful improvements that would make SQL Server and T-SQL genuinely better – not just for power users, but for anyone building on Microsoft’s data platform in 2026.

It’s mid-2026, and Microsoft SQL Server is an exceptionally mature database product. Even more important today is that T-SQL is becoming a multi-platform language – used in Fabric, Azure SQL Database, Synapse, SQL Managed Instance, and a variety of other applications.

While each variant of T-SQL and each database platform is different in its implementation and details, there remain features that I continue to wish for – and request – on a regular basis.

This article contains my thoughts on what is missing, what can be improved, and why. These are ultimately my opinions and will differ from others who have pondered this before me, and those who may do so in the future. As a T-SQL junkie, these hopes and dreams skew towards the T-SQL surface area.

Hopefully, when we look back on this in a few years, checkmarks can be placed next to some of these requests!

Before we begin: a quick note on compatibility

SQL Server and T-SQL are large and ever-evolving platforms. Microsoft is thorough in documenting compatibility with different products, as well as with ANSI standards. If you’re interested in learning more about these details, grab a (large) cup of coffee and check out their docs here.

Because of this, T-SQL supports a whole lot of different syntaxes, providing many different ways to solve most problems. There are also many features that are deprecated or maintained solely for backwards-compatibility. When choosing a solution, there’s value in picking one that’s currently supported (and will be for the foreseeable future).

Now, onto the list!

Wish #1: Import from compressed file formats

Moving data around for analytics, reporting, or machine learning/AI can be a hassle. This sort of analytic data can get quite large and the file formats I want to store it in are the ones that are as small and efficient as possible. Compression is key here, so CSV and Excel are not the formats I really want to deal with.

There’s no need for bells and whistles: data will be written to files once, moved around, and then imported into SQL Server. Any downstream operations will benefit from smaller file sizes. Additionally, fewer computing resources will be needed overall to manage these files. This is especially true if files are to be copied, moved, archived, or ingested by multiple systems.

Compatibility with other systems

For maximum compatibility with other systems – such as Synapse, Fabric, or other data warehouses/data lakes – the Apache Parquet file format is ideal. There are other highly compressed file formats out there that provide different optimizations and more features – the likes of Iceberg, ORC, Delta Lake, or Avro – but these are designed to support transactional write operations. Also, parquet files can be converted easily into Delta Lake or other formats if needed.

You may also be interested in:

Reading and writing parquet files in SQL Server

While this article I wrote last year was a fun experiment, it was a workaround, and I’d love to see parquet support built natively into OPENROWSET or bcp in SQL Server. Azure has supported parquet files for quite a while now, so why not SQL Server?

I’d love to be able to execute code like this:

SELECT
  *
FROM OPENROWSET(BULK N'D:\MyDataFiles\CritcalSalesData_06_03_2026.parquet',
FORMAT = 'PARQUET') AS PARQUETDATA;

As a big bonus, parquet files already contain metadata about column and row structure. There is no need for format files, field terminators, code pages, and other minutiae that occupy us when dealing with CSV or other plain text formats. Parquet includes details on data types, values, NULLability, row counts, and more!

What does this mean when importing parquet files? The benefits explained

This means that, when importing parquet files, many of the nightmares we’ve experienced in the past are avoided. This includes:

  • What data type is each column?
    • Is it an INT or a string? DATE or DATETIME?

  • What are the length, scale, and precision for each column?
    • Is that a VARCHAR(50) or VARCHAR(100)? DECIMAL(8,2), or DECIMAL(10,4)?

  • Will column values containing delimiters such as commas or apostrophes break my import?

  • My 100 column format file needs updating…

  • Can a column be NULL?
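
Several of the questions above can be reproduced in a few lines. The sketch below is a Python illustration using only the standard csv module, with made-up sample rows; it round-trips typed data through CSV and shows what a plain-text format forgets along the way. It says nothing about how SQL Server's own import tools behave.

```python
import csv
import io

rows = [
    [1, "Widgets, large", 19.99, None],  # None plays the role of SQL NULL
    [2, "Gadgets", 5.00, ""],            # an empty string, not NULL
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

# Read them back the way an import process would.
buffer.seek(0)
loaded = list(csv.reader(buffer))

# Every value comes back as text: the integer and decimal columns are strings.
all_text = all(isinstance(value, str) for row in loaded for value in row)

# NULL and the empty string have become indistinguishable.
null_vs_empty_lost = loaded[0][3] == loaded[1][3] == ""

# The embedded comma survived only because the writer quoted the field.
comma_preserved = loaded[0][1] == "Widgets, large"
```

A format that carries its own schema answers these questions inside the file itself, which is the appeal of the metadata described above.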

This is the year 2026. Why does importing data sometimes feel like I am physically adjusting jumpers on an old hard drive? A descriptive metadata-driven file-format like parquet solves so many of the pain points that data engineers feel every day. The effort to implement this in SQL Server would not be immense.

Wish #2: Arrays

Arrays are an ANSI SQL standard data structure, familiar to software developers and anyone that has ever spent enough time in the mathematical world. At a high level, an array is a data type that can contain numbered elements.

For example, creating a simple array in C# looks like this:

int[] Top10CustomerIdList = new int[10];

This code creates an array that will contain 10 integers. Once created, they can be given values:

Top10CustomerIdList[0] = 561;

Top10CustomerIdList[1] = 17;

Top10CustomerIdList[2] = 280;

Arrays can be multi-dimensional and thereby support far more involved mathematical and computational problems:

int[,] CubeStats = {{17, 5, 16},{-5, 2, 19},{7, 0, 58}};

The value at a given position in a two-dimensional array can be returned easily enough:

Console.WriteLine(CubeStats[0, 1]);

This prints the value at index 0 of the first dimension (the first triplet) and index 1 of the second (the second value within it), which is the number 5.

The current state of arrays in databases

PostgreSQL has arrays implemented natively, usable in scalar or table-based operations. For example, a table can contain an array as one of its columns:

CREATE TABLE OrderSummary (
OrderId SERIAL PRIMARY KEY, -- SERIAL is itself an auto-incrementing integer type
OrderDate DATE,
OrderLabels VARCHAR(50) ARRAY[3]);

From here, any row’s OrderLabels column can be populated on insert, and any of its three possible elements can be updated individually:

-- OrderId is omitted from the column list so that SERIAL can generate it
INSERT INTO OrderSummary (OrderDate, OrderLabels)
VALUES (
'6/3/2026',
'{"Priority", "Gold", "Special"}');

UPDATE OrderSummary
	SET OrderLabels[1] = 'Silver'
WHERE OrderId = 1;

While SQL Server does not support arrays, it’s not too difficult to fake them using tables, sequences, delimited strings, JSON, XML, or some other trickery. Vectors can also be used to store array-like data, so long as the contents are numeric data types. These solutions work, but are never quite the same, either in terms of usability or performance.

My biggest complaint about the array replacements in SQL Server is that they are either prone to bad data, perform poorly, or both. JSON, XML, and delimited lists are often used in place of arrays – especially multi-dimensional ones. Their weakness is that the actual data structures are not a native list of values, but a blob of text that also contains the values we want.

Their primary purposes are different, too. JSON is the perfect solution to transmit an unstructured document between apps without the need to deconstruct or reconstruct it first. A delimited list makes the most sense when its original source is also a delimited list.

Why arrays should be added to SQL Server

What are my top reasons for wanting arrays in SQL Server?

  • Native support for sets of objects in any number of dimensions.

  • The ability to natively index an array-object to improve performance.

  • The database engine automatically parses element data types/sizes to ensure they fit correctly into the array.

  • More effective for structured data than a blob of text.
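
The difference between a native typed array and "a blob of text" is easy to see outside the database. The sketch below is a Python analogy using the standard array module and made-up customer IDs, not a proposal for T-SQL syntax. It contrasts a delimited string, which accepts a bad element silently, with a typed array, which rejects it at insert time.

```python
from array import array

# A delimited string accepts anything; the bad element is only discovered
# later, when some query finally tries to parse it.
delimited_ids = "561,17,two-eighty"
try:
    parsed = [int(part) for part in delimited_ids.split(",")]
    parse_error = None
except ValueError as exc:
    parse_error = exc

# A typed array ('i' = signed int) validates every element on the way in.
typed_ids = array("i", [561, 17, 280])
try:
    typed_ids.append("two-eighty")
    insert_error = None
except TypeError as exc:
    insert_error = exc
```

In both cases an error eventually surfaces, but only the typed structure keeps the bad value from being stored in the first place.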

Note that Azure SQL Database and SQL Server 2025+ support the VECTOR data type. While there are similarities between arrays and vectors, they are not the same. Arrays are built for general support of element lists for any data types, whereas vectors are specialized for storing AI embeddings and managing the strictly numeric data for semantic search models.

Arrays take a bit of getting used to. Those who are not already familiar with how they work might find them awkward, but I sincerely hope they make their way to the Microsoft data platform in the future. They provide better, easier solutions to many common problems.

Wish #3: OVERLAPS

This predicate can be used when evaluating dates and times – returning TRUE if two time ranges overlap with each other, and FALSE if they do not. It’s a use-case that I find I need to evaluate more often than a younger version of me may have anticipated!

For example, imagine that there was an outage of a software provider from 7:15-7:30 on June 3, 2026, and I wanted to find all API calls that have any overlap in execution start/end with the outage time. The basic syntax in PostgreSQL would look like this:

SELECT
	*
FROM APICalls
WHERE (StartTime, EndTime) OVERLAPS ('6/3/2026 07:15:00', '6/3/2026 07:30:00');

Now compare that to the code required to get the same result in SQL Server:

SELECT
	*
FROM APICalls
WHERE (StartTime <= '6/3/2026 07:15:00' AND EndTime >= '6/3/2026 07:30:00')
OR (StartTime >= '6/3/2026 07:15:00' AND EndTime <= '6/3/2026 07:30:00')
OR (StartTime <= '6/3/2026 07:15:00' AND EndTime >= '6/3/2026 07:15:00')
OR (StartTime <= '6/3/2026 07:30:00' AND EndTime >= '6/3/2026 07:30:00');

This code handles four distinct scenarios:

  • The API call began before the outage and ended after the outage was over

  • The API call began during the outage and ended before the outage was over

  • The API call began before the outage and ended after the outage began

  • The API call began during the outage and ended after the outage was over

While there are a variety of different ways to code this, the versions without OVERLAPS are far more elaborate. More importantly, coding to cover all overlapping scenarios is mentally taxing – and where most mistakes creep in.
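
One of those different ways is worth spelling out: provided each range is well-formed (start no later than end), the four scenarios reduce to two comparisons, "the call starts no later than the outage ends, and ends no earlier than the outage starts". The Python sketch below checks that claim exhaustively on a small grid of integer "times". The function names and the grid are illustrative assumptions.

```python
from itertools import product


def overlaps_four_way(start, end, outage_start, outage_end):
    """The four-scenario predicate from the T-SQL query above."""
    return (
        (start <= outage_start and end >= outage_end)
        or (start >= outage_start and end <= outage_end)
        or (start <= outage_start and end >= outage_start)
        or (start <= outage_end and end >= outage_end)
    )


def overlaps_two_way(start, end, outage_start, outage_end):
    """Equivalent shorter form: each range starts before the other ends."""
    return start <= outage_end and end >= outage_start


# Exhaustively compare both predicates on a small grid of integer "times",
# keeping only well-formed ranges (start <= end).
OUTAGE = (4, 6)
mismatches = [
    (s, e)
    for s, e in product(range(11), repeat=2)
    if s <= e and overlaps_four_way(s, e, *OUTAGE) != overlaps_two_way(s, e, *OUTAGE)
]
```

Both predicates here use inclusive endpoints, matching the T-SQL version. PostgreSQL documents OVERLAPS as treating each period as half-open (start <= time < end), so a call that ends at exactly the moment the outage begins is one edge case where the results can differ.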

Even when working with date tables to assist in performing complex date math, I’d rather have a built-in function that does exactly what I want, than need to cobble it together from a loosely-visualized graphic in my mind.

Wish #4: Simplify licensing of SQL Server

This is a SQL-Server-specific wish: please make licensing simpler! In Azure, things have a monthly cost that can be quantified and estimated based on hardware and software needs. For SQL Server outside of Azure SQL Database, however, licensing is complicated by the fact that there are many, many licensing models available.

Why is that a problem?

As organizations grow, the number of SQL Server instances to license can get large quickly. The job of licensing SQL Server therefore becomes mission critical. Some of the most common questions asked include:

  • Which licensing model should we use? Per-core? CAL? Something else?

  • What is Software Assurance and how does it work?

  • Does a subscription model make sense?

  • Should licensing be determined by the virtual or physical host?

  • When is it OK to use SQL Server Developer Edition?

  • How do HA/DR copies work? Are they licensed differently?

  • Enterprise or Standard – which is the correct one to use?

  • Can Azure Arc help?

  • What is License Mobility?

  • Do I need to hire a legal or technical contractor to deal with this for me?

That’s a LOT of complexity – and it’s only the tip of the iceberg! It leads to frequent licensing mistakes, and organizations can easily end up paying far too much for their SQL Servers. Others under-license and are left in an awkward position when they are audited.

For more information on current licensing methodology from Microsoft, these docs provide some quality reading, assuming you have some leftover coffee from earlier on-hand!

It seems like each time Microsoft makes changes to its licensing, the result is more confusion. Therefore, a HUGE wish I have is to see licensing simplified. It should be no more complex than spinning up an Azure SQL Database. For example, I can jump right into Azure and get an estimate on what a database will cost per month, just like this:

Sure, I may not like the resulting price, but it’s clear as I adjust the various levers and knobs in this calculator what the cost will look like.

Wish #5: Cloud Blob Storage for everything

As a best practice, relational/transactional data should reside in a relational/transactional database. Files should reside in file systems. In the ancient past, file systems implied hard drives attached to servers that were given drive letters. Those drives were monitored and managed not only so that space didn’t run out, but to maintain high availability (as much as the system allowed) throughout.

The latter part of that description was where scalability and maintainability would become challenging. Drives fill up. Disks fail. Things breaking is a part of the world of hardware. It can’t be avoided. All hardware eventually fails – that’s a generally accepted truth in computing.

Cloud storage fixed that. No longer do we need to carefully manage the nitty-gritty of storage ourselves – why not let Microsoft, Amazon, Google, and other cloud vendors deal with it for us?!

Cloud storage in SQL Server 2025 improved the situation

SQL Server 2025 brought with it the ability to back up a database to Azure Blob Storage and Amazon S3. This was a huge improvement as those storage targets are cheap and made for large files. Once there, backup files can be used, copied, moved, and managed in all the ways that files in a filesystem can be managed, with all of the maintainability benefits offered by cloud storage vendors.

For more info on backing up databases to Amazon S3 or Azure Blob Storage, check out the official documentation here and here.

This was a good start – but what would be great would be for those cloud storage technologies to be able to be used for more database operations involving files. Here are my top requests.

Cloud storage improvements I’d like to see in SQL Server

  • Filestream-like storage to a cloud URL

  • Log Shipping can use cloud-stored backup files for all of its needs

  • Mirror backups automatically to secondary URLs

  • Format files for bcp/Bulk Insert can reside in a cloud URL

  • Import and Export files from SQL Server via a cloud URL

  • Anywhere else we interact with files via SQL

If a database is already hosted in the cloud, then these requests become simple ones. “Let me connect to that storage right over there instead of this storage, PLEASE!”

Picture a scenario when I want to do something like this:

An image showing Ed's example scenario: the 'Import Flat File' option is selected from a menu in SQL Server.

My expectation is that I can load a file from any reasonable location available to this server:

An image showing the next part of Ed's imaginary scenario. This time, it's simply an image of Ed's file search menu, and all the files he can potentially load into SQL Server.

Unfortunately, my choices include only locally attached drives. For maintainability’s sake, it would be far better for files, scripts, and other related content to be centralized in a single easily-accessible, secure cloud location. Hopefully, the trend started in Azure SQL Database – and then in SQL Server – continues, and storage targets become seen as a more unified surface area that includes URLs, Blob storage, and other commonly-used cloud solutions.

SQL Server has come a long way, but there’s still work to be done

SQL Server and T-SQL have come a long way. The gap between where the platform is and where it could be, however, is still real – and felt daily by developers and data engineers.

Native Parquet support would eliminate decades-worth of flat-file frustration in a single feature. Arrays would give developers a proper, performant alternative to the JSON and XML workarounds that have been patched into workflows for years. OVERLAPS would turn four-condition date logic into one readable line. Simpler licensing would remove a source of risk and confusion that no technical team should have to carry. And broader cloud storage integration would bring SQL Server’s file handling in line with the cloud-first world most teams already live in.

Several of these requests already exist in competing platforms or adjacent Microsoft products. Let’s hope that, if I were to revisit this list in a few years, most of it has earned a checkmark.

What would you like to see added to SQL Server? Do you agree or disagree with any of my points? Let me know in the comments below!

The post 5 T-SQL features that should already exist (2026 SQL Server wish list) appeared first on Simple Talk.

ASP.NET Core Request Paths Reference

This is the modern ASP.NET Core update of my original ASP.NET Request Paths Reference.

That older reference remains useful for applications built on traditional ASP.NET and System.Web. A separate article is needed because ASP.NET Core replaced those request APIs, separates URL components differently, and commonly runs behind reverse proxies or under a configured base path.

Consider this request:

https://www.example.com/MyApplication/MyFolder/MyPage?key=value

In this example, /MyApplication is the application's configured base path. ASP.NET Core exposes that prefix as Request.PathBase, and exposes the remaining path inside the application as Request.Path.

PathBase is not automatically the first URL segment. It is a configured or forwarded path prefix for the application. It is often empty when the app runs at the domain root, such as https://www.example.com/. It can also contain one or more segments, such as /MyApplication or /apps/customers, when the app runs under that URL prefix.

Request path properties

In a controller, Razor Page, middleware, or endpoint, use properties from HttpRequest:

Property | Example value
Request.Scheme | https
Request.Host | www.example.com
Request.PathBase | /MyApplication
Request.Path | /MyFolder/MyPage
Request.QueryString | ?key=value
Request.Query | Request.Query["key"] is value
Request.Protocol | HTTP/2 or the active protocol
PathBase contains the path prefix that has been separated from the request path. Path contains the remaining request path used by routing inside the application.

For example, if the incoming URL path is /MyApplication/MyFolder/MyPage:

Configuration | Request.PathBase | Request.Path
No base path | empty | /MyApplication/MyFolder/MyPage
Base path is /MyApplication | /MyApplication | /MyFolder/MyPage
Base path is /MyApplication/MyFolder | /MyApplication/MyFolder | /MyPage

The split can come from middleware such as UsePathBase, hosting under a sub-application, or forwarded proxy configuration such as X-Forwarded-Prefix.

var request = HttpContext.Request;

var pathWithinApplication = request.Path;
var fullPath = request.PathBase + request.Path + request.QueryString;

For the example request:

pathWithinApplication = /MyFolder/MyPage
fullPath              = /MyApplication/MyFolder/MyPage?key=value
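
The way one incoming path divides into a base and a remainder can be modelled in a few lines. The sketch below is a simplified, language-neutral illustration written in Python. The function name is hypothetical, the case-insensitive comparison is an assumption, and this is not ASP.NET Core's actual implementation. It splits only on a whole-segment boundary, which reproduces the three configurations in the table above.

```python
def split_path_base(raw_path: str, base_path: str) -> tuple[str, str]:
    """Split raw_path into (path_base, path) at a whole-segment boundary."""
    base = base_path.rstrip("/")
    if not base:
        return "", raw_path
    # Assumption: compare case-insensitively, and only accept whole segments,
    # so a base of "/MyApp" does not swallow the start of "/MyApplication".
    matches = raw_path.lower().startswith(base.lower())
    remainder = raw_path[len(base):] if matches else ""
    if matches and (remainder == "" or remainder.startswith("/")):
        return raw_path[:len(base)], remainder
    # No match: the whole path stays in the routing portion.
    return "", raw_path
```

For example, calling it with "/MyApplication/MyFolder/MyPage" and a base of "/MyApplication" yields the pair ("/MyApplication", "/MyFolder/MyPage"), mirroring the second row of the table.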

Build the absolute request URL

ASP.NET Core deliberately keeps URL components separate. To create the absolute display URL, use GetDisplayUrl() from Microsoft.AspNetCore.Http.Extensions:

using Microsoft.AspNetCore.Http.Extensions;

var absoluteUrl = Request.GetDisplayUrl();
// https://www.example.com/MyApplication/MyFolder/MyPage?key=value

When the application runs behind a reverse proxy, configure forwarded headers so Scheme, Host, and the generated absolute URL reflect the original client request.

Generate application links

Do not manually concatenate paths when generating links to endpoints. Use ASP.NET Core link generation with Url.Action:

var path = Url.Action("Details", "Customers", new { id = 42 });
var absolute = Url.Action(
    "Details",
    "Customers",
    new { id = 42 },
    protocol: Request.Scheme);

For endpoint routing outside a controller, inject LinkGenerator.

Which property should you use?

  • Use Request.Path to inspect the route path inside the application.
  • Use Request.PathBase + Request.Path when the mounted base path matters.
  • Use Request.Query to read parsed query-string values.
  • Use GetDisplayUrl() when you need the complete incoming URL for diagnostics or display.
  • Use Url, LinkGenerator, or endpoint names when generating application links.

For traditional ASP.NET APIs such as Request.RawUrl, Request.ApplicationPath, and Request.Url, see the original ASP.NET Request Paths Reference.

AI inference is obviously profitable

Many people claim that AI inference is unprofitable to serve, and thus must be subsidized by an ocean of dumb money from investors who believe that some future AI model will come to dominate the world economy. When that dumb money goes away, so will AI products. According to this view, LLMs are just inherently too expensive (in terms of money, power, and water) to be used in consumer products. In fact, they can only be used today by externalizing the costs: money onto VC funds and now retail ETF investors, power onto electric utility consumers, and water onto the communities where datacenters are built.

There are good reasons to dislike AI, but this really isn’t one of them. In fact, AI inference is obviously profitable.

Doing the math demonstrates that inference is profitable

Frontier AI providers are reporting 70%-80% gross margins on inference, but maybe we can’t trust them. Let’s do some very rough estimates on the actual cost.

An Nvidia A100 consumes 400W of power under full load. In practice, even a carefully-tuned inference server will not be at full load all the time, so this is an upper bound. Suppose you’re running a dense 70B model [1], which will fit comfortably (unquantized) on four A100s at around 2M tokens per hour. At industrial power prices, that’s about 13c/hr in the USA. Suppose (pessimistically) cooling is the same cost. That’s 26c/hr, or about 13 cents per million output tokens [2].

Let’s amortize the cost of the GPUs, since that’s going to be the most expensive part. An A100 costs about $20k, so the four-GPU server is about $80k. If each A100 lasts around five years [3], you’ll have to make $16k/yr in profit to recoup your capital investment (or about $1.80 per hour). At lower utilization, it’ll take longer to recoup, but your GPUs will also last longer. Either way, at 2M tokens per hour, your overall inference costs come to about one dollar per million tokens.

GPT-5.4-mini charges $4.50 per million tokens, and stronger OpenAI or Anthropic models are three to six times as expensive. It’s hard to make a direct comparison because we don’t know the size of OpenAI or Anthropic models, but the claimed 70% or 80% profit margin is extremely plausible.
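
The arithmetic behind these figures is short enough to write down. The sketch below recomputes the estimate in Python from the numbers quoted above. The one added assumption is an industrial electricity price of roughly 8 cents per kWh, which is what "13c/hr" implies for a 1.6 kW server. Comparing the result against the $4.50 list price is only indicative, for the reason just given: the size of the commercial model is unknown.

```python
HOURS_PER_YEAR = 24 * 365

# Figures taken from the estimate above.
gpus = 4
watts_per_gpu = 400
tokens_per_hour = 2_000_000
gpu_price_usd = 20_000
gpu_lifetime_years = 5
list_price_per_million = 4.50

# Assumption: roughly 8 cents per kWh, the rate implied by "13c/hr" for 1.6 kW.
power_price_per_kwh = 0.08

power_per_hour = gpus * watts_per_gpu / 1000 * power_price_per_kwh
energy_per_hour = 2 * power_per_hour  # cooling assumed to cost as much as power
capital_per_hour = gpus * gpu_price_usd / gpu_lifetime_years / HOURS_PER_YEAR

millions_of_tokens_per_hour = tokens_per_hour / 1_000_000
energy_per_million = energy_per_hour / millions_of_tokens_per_hour
total_per_million = (energy_per_hour + capital_per_hour) / millions_of_tokens_per_hour
gross_margin = 1 - total_per_million / list_price_per_million

print(f"energy:  ${energy_per_million:.2f} per million tokens")
print(f"total:   ${total_per_million:.2f} per million tokens")
print(f"margin:  {gross_margin:.0%}")
```

Under these inputs, energy comes to about 13 cents and the total to just over a dollar per million tokens, which puts the implied margin against a $4.50 price in the high 70s, inside the claimed 70% to 80% range.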

Open LLMs demonstrate that inference is profitable

What if you don’t trust my estimates either? Let’s look at the pricing of open-weights Chinese LLMs. DeepSeek have claimed a bit over 80% profit margin on inference for DeepSeek-R1. Since their API pricing for R1 is less than half that of OpenAI or Anthropic [4], that suggests that my estimates above for inference cost might be too high. Cooling at scale is probably cheaper than power, R1 only has half the active parameters of a dense 70B model, modern GPUs are more efficient than the A100, and there are significant economies of scale in inference.

Since DeepSeek’s models are available for anyone to download, they can’t get away with extracting a large profit margin. One of the other inference providers would undercut them with the same model. Inference costs for DeepSeek-V4-Pro on the market are around 87 cents per million output tokens, which is probably pretty close to the actual cost of serving the model.

For AI labs, inference must subsidize training

All of this doesn’t mean that OpenAI or Anthropic are profitable. Those companies are making huge capital investments that may or may not pan out, and are spending enormous amounts of money on talent and compute to train brand-new models and retain users.

They’re doing crazy things like offering per-month subscription models for nearly unlimited inference, which is almost certainly not profitable. If you used an API token instead of your Anthropic subscription in Claude Code, you’d pay ten times the cost. But that doesn’t mean API-based Claude Code couldn’t be a good deal. Some people are already using DeepSeek’s inference API for agentic coding, because once you take away the huge profit margin it’s cheaper than the equivalent per-month subscription.

Why won’t OpenAI or Anthropic lower their prices? Supposedly OpenAI has thought about it, but for an AI lab, inference has to subsidize training costs. A company like OpenAI has to fund the production of new models from the inference margins on existing models (at least partially). That’s why the margins on inference are so high: the AI labs are trying to squeeze out every dollar so they can stay alive in the training arms race.

However, inference only has to subsidize training costs for an AI lab. If you’re merely an inference provider, you don’t have to do any training at all. Therefore, even if OpenAI and Anthropic go out of business, whoever snaps up the rights to their frontier models will be able to continue selling Opus and GPT inference at a profit [5]. The AI bubble popping will not mean the end of the inference business, because AI inference is obviously profitable.


  1. Expensive frontier models are probably mixture-of-experts, not dense, which is tougher to estimate. However, I think a 70B dense model and a MoE with 70B active params will come out to basically the same numbers at scale (though the MoE will require more GPU memory and thus a greater upfront cost). Are frontier models around 70B params? Nobody outside the AI labs really knows, but my guess is that 70B is probably larger than a Haiku/mini class model.

  2. I think it’s reasonable to estimate the cost of output tokens only, since they’re by far the most expensive part of serving inference. Input tokens are cheaper for two reasons: transformers let you prefill them in parallel, and for most real-world use cases they can be aggressively cached in the KV cache.

  3. It’s common (and wrong) to estimate GPU lifespan at three years. I wrote a lot about this in AI GPUs probably live longer than three years.

  4. Again, this is just a guess, since we don’t know what OpenAI or Anthropic model is equivalent in size to R1.

  5. I do wonder if Anthropic would be able to prevent other people from being able to access the model if the company goes out of business. Anthropic is currently in debt to Broadcom, Google, and a bunch of private equity firms. Would they get the Mythos and Opus weights, over Dario’s protestations?


Evals aren't a step at the end. They run the whole way through

There’s a version of building an AI app that goes like this. You build the thing, you get it mostly working, and then someone says “should we evaluate it?” and you bolt some evals on at the end like a spoiler on a hatchback. It runs, the numbers look fine, you ship.

That works about as well as writing all your tests the night before launch. Evals aren’t a stage you do once at the end. They’re something that runs the whole way through, doing a different job at each step. Get that idea straight and the rest of this falls into place.

Infographic: evals across the lifecycle - requirements, design, implementation, testing and deployment, with where evals come in at each stage

Want it to hand? Download the infographic as a PDF.

The whole thing runs left to right, from build and pre-release over on the left to production and live on the right. Same lifecycle you already know from normal software. The difference is that an AI app is non-deterministic, so “does it work” isn’t a yes or no you can answer once. You have to keep asking, at every stage, and an eval is how you ask.

Requirements

Before you write a line of code, you have to decide what the app is actually for and what a good answer even looks like. This sounds obvious and almost everyone skips it.

The trap with AI is that “good” feels self-evident until you try to pin it down. A good support reply is… helpful? Polite? Correct? Grounded in the actual docs? Those are four different things, and an answer can nail one and fail the others. So this is where you turn fuzzy intentions into concrete success criteria, and start gathering the examples that show what good looks like. That collection becomes your golden dataset, which everything downstream gets measured against. I wrote a whole post on building an eval you can trust, because it’s the foundation the rest sits on.
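To make that concrete, here is a minimal sketch of what pinned-down criteria and a golden example might look like in code. The class names, criterion names, and thresholds are all hypothetical, chosen only to mirror the four qualities mentioned above.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One entry in the golden dataset (hypothetical shape)."""
    question: str
    reference_answer: str
    source_doc: str  # the doc a grounded answer should draw on


@dataclass
class SuccessCriteria:
    """Minimum score per criterion; names and bars are illustrative."""
    min_scores: dict[str, float]

    def failures(self, scores: dict[str, float]) -> list[str]:
        """Return the criteria whose score falls below the bar."""
        return [
            name
            for name, bar in self.min_scores.items()
            if scores.get(name, 0.0) < bar
        ]


criteria = SuccessCriteria(
    {"helpful": 0.8, "polite": 0.9, "correct": 0.95, "grounded": 0.9}
)

# An answer can do well on three criteria and still fail the fourth.
scores = {"helpful": 0.9, "polite": 0.95, "correct": 0.97, "grounded": 0.4}
print(criteria.failures(scores))  # ['grounded']
```

Keeping each criterion as a separate score is what lets a reply that is helpful and polite but ungrounded show up as a failure instead of averaging out to "fine".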

Design

Now you choose how to build it. Which prompts, which model, what architecture. And here’s where most people go on vibes - they try a prompt, it looks good in the playground, they ship it.

The better move is to make those decisions on evidence. You’ve got a golden dataset from the last step, so use it. Run two candidate prompts against it and see which actually scores higher. Compare a cheaper model against an expensive one on the cases that matter to you, not on a leaderboard built from someone else’s data. Design becomes a series of small head-to-head bake-offs instead of a series of hunches, and you end up committing to choices you can defend.
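A bake-off like this can be very small. The sketch below assumes a toy keyword-match scorer and two stand-in functions in place of real prompt and model calls; all names and data are hypothetical, and a real scorer would be whatever eval you trust.

```python
from statistics import mean
from typing import Callable

# (question, phrase the answer must contain); a toy golden dataset.
GoldenCase = tuple[str, str]

golden: list[GoldenCase] = [
    ("How do I reset my password?", "settings"),
    ("What is the refund window?", "30 days"),
]


def score(output: str, expected: str) -> float:
    """Toy scorer: 1.0 if the expected phrase appears, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def bake_off(
    candidates: dict[str, Callable[[str], str]], cases: list[GoldenCase]
) -> dict[str, float]:
    """Run every candidate over every golden case and average the scores."""
    return {
        name: mean(score(run(question), expected) for question, expected in cases)
        for name, run in candidates.items()
    }


# Stand-ins for "prompt A + model" and "prompt B + model".
def prompt_a(question: str) -> str:
    if "password" in question:
        return "Go to Settings and choose Reset password."
    return "Please contact support."


def prompt_b(question: str) -> str:
    if "password" in question:
        return "Open Settings, then Security."
    return "Refunds are accepted within 30 days."


results = bake_off({"prompt_a": prompt_a, "prompt_b": prompt_b}, golden)
winner = max(results, key=results.get)
print(results, winner)
```

The same harness works for comparing a cheaper model against an expensive one: swap the candidate functions and keep the golden cases fixed.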

Implementation

You build the app. You wire up the tools, the retrieval, the integrations, all the moving parts. Nothing surprising here, except for one thing you have to do that you might not be used to: instrument it.

Tracing isn’t optional for an AI app. When something goes wrong - and it will - you need to see what the app did, step by step, to figure out why. That means every span, every trace, every session is captured and inspectable. If you don’t have that, debugging an AI app is just guessing with extra steps. (Span, trace and session are the three levels you’ll be looking at, and they’re worth understanding properly - here’s a post on exactly that.) Get the instrumentation in now, while you’re building, not after you’ve shipped and you’re trying to bolt it on under pressure.
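To show the shape of that instrumentation, here is a hand-rolled sketch of spans grouped under a trace and a session. It is illustrative only: the `Tracer` and `Span` classes are hypothetical, and a real app would more likely use an existing tracing library than roll its own.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of work, tagged with the trace and session it belongs to."""
    name: str
    trace_id: str
    session_id: str
    attributes: dict = field(default_factory=dict)
    duration_s: float = 0.0


class Tracer:
    """Collects spans in memory so they can be inspected later."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.spans: list[Span] = []

    def new_trace(self) -> str:
        """Start a new trace, i.e. one request through the app."""
        return uuid.uuid4().hex

    @contextmanager
    def span(self, name: str, trace_id: str, **attributes):
        record = Span(name, trace_id, self.session_id, dict(attributes))
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the span even if the step raised.
            record.duration_s = time.perf_counter() - start
            self.spans.append(record)


tracer = Tracer(session_id="session-1")
trace_id = tracer.new_trace()

with tracer.span("retrieve", trace_id, query="refund policy") as s:
    s.attributes["doc_count"] = 3  # stand-in for a real retrieval call

with tracer.span("generate", trace_id, model="example-model") as s:
    s.attributes["output_chars"] = 120  # stand-in for a real model call

for recorded in tracer.spans:
    print(recorded.name, recorded.attributes)
```

The point of the three IDs is that a bad answer can be walked backwards: from the session, to the single request, to the step that went wrong.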

Testing

Now you validate before you ship, the same way you’d unit-test normal code. The twist is that you can’t assert on an exact output, because the output changes every run. So the shape of the test changes too.

Normal testing goes arrange, act, assert. You set things up, you run the code, you check the result is exactly what you expected. Eval-driven development adds a step: arrange, act, evaluate, assert. You set things up, you run the app, you run an eval on whatever came back, and then you assert on the eval’s score. You’re not checking the output equals a fixed string. You’re checking the output scores above the bar you set in the requirements. Same testing instinct, adjusted for a world where the answer is never byte-for-byte the same twice.
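Here is what the four steps might look like as a plain test function. This is a sketch under stated assumptions: `run_app` is a stub standing in for the real app, `evaluate` is a toy fact-coverage scorer, and the 0.8 bar is a made-up value for whatever was agreed in the requirements.

```python
PASS_BAR = 0.8  # hypothetical bar, set back in the requirements stage


def run_app(question: str) -> str:
    """Stub for the real app; a real one would call a model."""
    return "You can request a refund within 30 days of purchase."


def evaluate(output: str, required_facts: list[str]) -> float:
    """Toy eval: the fraction of required facts mentioned in the output."""
    hits = sum(1 for fact in required_facts if fact.lower() in output.lower())
    return hits / len(required_facts)


def test_refund_answer_meets_bar() -> None:
    # Arrange
    question = "What is your refund policy?"
    required_facts = ["refund", "30 days"]

    # Act
    output = run_app(question)

    # Evaluate
    score = evaluate(output, required_facts)

    # Assert on the score, not on the exact output string
    assert score >= PASS_BAR


test_refund_answer_meets_bar()
```

Because the assertion is on the score, the test keeps passing when the wording changes between runs, and fails only when the answer stops covering what it needs to.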

Deployment

You ship, and your app meets real users. Real users are messier than anything you tested with. They ask things you never imagined, in ways you never anticipated, and the neat distribution of inputs you tuned everything against goes out the window.

So the evals don’t stop at the door. They follow the app into production. You run online evals on live traffic, scoring real responses as they go out, and you alert the moment quality drops. This is the part that turns “it worked when we tested it” into “it’s still working right now”, which is the only version that actually matters once people depend on it.
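One simple way to sketch that alerting is a rolling average over the most recent eval scores. The window size, threshold, and class name below are all assumptions for illustration; the text does not prescribe a particular alerting rule.

```python
from collections import deque


class QualityMonitor:
    """Tracks recent online eval scores and flags a drop in quality."""

    def __init__(self, window: int, threshold: float) -> None:
        self.scores: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def rolling_mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire."""
        self.scores.append(score)
        # Wait for a full window so one early bad score does not alert.
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.rolling_mean() < self.threshold


monitor = QualityMonitor(window=3, threshold=0.8)

# Scores as they might arrive from evals on live responses.
live_scores = [0.9, 0.9, 0.9, 0.5, 0.4]
alerts = [monitor.record(score) for score in live_scores]
print(alerts)  # [False, False, False, True, True]
```

A small window reacts faster but is noisier; in practice the window and threshold would be tuned against how much score variation the app shows on a normal day.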

The thread that runs through it

Notice what carries the whole way along. The golden dataset and the success criteria you defined at the very start are the same yardstick you use to compare designs, to test before shipping, and to judge live traffic. Define “good” once, properly, and it pays off at every stage after.

That’s the real point. Evals aren’t a gate at the end that says yes or no. They’re a feedback loop you run continuously, from the first requirement to live production traffic, and the job they do shifts as you move along. Bolt them on at the end and you’ll catch the odd bug. Thread them through from the start and you actually know, at every step, whether the thing works - which, for something as slippery as an AI app, is about the most valuable thing you can know.


Release v0.100.2

Installer Hashes

Description | Filename | sha256 hash
Per user - x64 | PowerToysUserSetup-0.100.2-x64.exe | 945FDF327E4D38E4CED61B0727B7AB8A1222958782982052989DDF7CB7096F62
Per user - ARM64 | PowerToysUserSetup-0.100.2-arm64.exe | 5554984D72C30D5FD26355569D1CA8D63B74F0062EE4E59E0186E28216C837AE
Machine wide - x64 | PowerToysSetup-0.100.2-x64.exe | 73C04AAC8052420111FE5CDC0098EC8415D87CBDBD42DE253E9AF959781CBF9E
Machine wide - ARM64 | PowerToysSetup-0.100.2-arm64.exe | E78BBE00472612FA5BF91D8B253425F674DDD21A32C96B8A120681726CFF40F0

Highlights

This patch release fixes a Command Palette memory leak identified in v0.100.1. Check out the v0.100.1 notes for the full list of changes.

Command Palette

  • Reverted a Performance Monitor dock refresh change that forced item refreshes on every metric update in #48835
  • Fixed a memory leak in the Performance Monitor dock extension by reusing stable network upload/download band items instead of creating new list items on each refresh in #48880