Two months ago, we announced increases to the prices of some Raspberry Pi 4 and 5 products. These were driven by an unprecedented rise in the cost of LPDDR4 memory, thanks to competition for memory fab capacity from the AI infrastructure roll-out.

Price rises have accelerated as we enter 2026, and the cost of some parts has more than doubled over the last quarter. As a result, we now need to make further increases to our own pricing, affecting all Raspberry Pi 4 and 5, and Compute Module 4 and 5, products that have 2GB or more of memory.
| Memory density | Price increase |
|---|---|
| 1GB | – |
| 2GB | $10 |
| 4GB | $15 |
| 8GB | $30 |
| 16GB | $60 |
Raspberry Pi 500 and 500+ are affected, but not Raspberry Pi 400, which remains our lowest-cost all-in-one PC at $60. We have also been able to protect the pricing of 1GB products, including the $35 1GB Raspberry Pi 4 variant, and the $45 1GB Raspberry Pi 5 variant that we launched in December.
We don’t anticipate any changes to the price of Raspberry Pi Zero, Raspberry Pi 3, and other older products, as we currently hold several years’ inventory of the LPDDR2 memory that they use.
2026 looks likely to be another challenging year for memory pricing, but we are working hard to limit the impact. We’ve said it before, but we’ll say it again: the current situation is ultimately a temporary one, and we look forward to unwinding these price increases once it abates.
Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities such as question answering, reasoning, and coding, but your application has specific requirements: latency budgets, hardware constraints, quality thresholds. How do you know if Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have sufficient memory for Phi-3.5 Mini?
The answer lies in empirical testing: running actual models on your hardware with your workload patterns. This article demonstrates building a comprehensive model benchmarking platform using FLPerformance, Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions.
Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis.
You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. This is why purpose-built tooling is essential.
Performance is multi-dimensional. A model might excel at throughput (tokens per second) but suffer at latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance: a model with variable response times (high p95/p99 latency) creates a poor user experience even if averages look good. Measuring all dimensions simultaneously enables informed tradeoffs.
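To make these dimensions concrete, here is a small illustrative calculation (the numbers are hypothetical, not measurements of any real model) showing how TTFT, TPOT, total latency, and throughput relate, using the same formulas as the measurement code later in this article:

```js
// Hypothetical numbers, for illustration only.
const ttftMs = 250;           // time to first token
const tpotMs = 20;            // time per output token after the first
const tokensGenerated = 100;

// Total latency = TTFT + time to generate the remaining tokens
const totalMs = ttftMs + (tokensGenerated - 1) * tpotMs; // 250 + 99 * 20 = 2230 ms

// Steady-state throughput, as computed by the executor below: 1000 / TPOT
const tokensPerSecond = 1000 / tpotMs;                   // 50 tok/s

console.log({ totalMs, tokensPerSecond });
```

A model with a lower TTFT will feel snappier in a chat UI even when its total latency for long outputs is higher, which is exactly the tradeoff these metrics expose.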
Hardware matters enormously. Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results.
Concurrency reveals bottlenecks. A model handling one request excellently might struggle with ten concurrent requests. Real applications experience variable load, and measuring only single-threaded performance misses critical scalability constraints. Controlled concurrency testing reveals these limits.
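The BenchmarkExecutor shown later accepts a concurrency option; one way such a load pattern could be driven (a sketch, with a hypothetical runOne helper that issues a single request and resolves with its latency) is to fire fixed-size batches of requests in parallel:

```js
// Sketch: run `total` requests in batches of `concurrency` and collect latencies.
// `runOne` is a hypothetical async function that performs one inference request
// and resolves with its latency in milliseconds.
async function runConcurrent(runOne, { total = 50, concurrency = 10 } = {}) {
  const latencies = [];
  for (let done = 0; done < total; done += concurrency) {
    const batchSize = Math.min(concurrency, total - done);
    const batch = await Promise.all(
      Array.from({ length: batchSize }, () => runOne())
    );
    latencies.push(...batch);
  }
  return latencies; // feed into the same percentile statistics used for single-threaded runs
}
```

Comparing the percentiles at concurrency 1 against concurrency 10 makes queueing and saturation effects visible.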
Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues.
Comparison requires controlled experiments. Different prompts, different times of day, and different system loads all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors.
FLPerformance implements a clean separation between orchestration, measurement, and presentation:
The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations.
The backend Node.js/Express server orchestrates tests and captures metrics. It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs.
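As one illustration of the real-time progress channel, here is a minimal broadcast helper (a sketch assuming the ws package; the message shape and event name are placeholders, not FLPerformance's actual protocol):

```js
// Sketch: push benchmark progress to all connected dashboard clients over WebSocket.
import { WebSocketServer, WebSocket } from 'ws';

const wss = new WebSocketServer({ port: 8081 }); // port chosen arbitrarily for this sketch

export function broadcastProgress(progress) {
  // progress: { modelId, promptId, iteration, totalIterations } -- illustrative shape
  const message = JSON.stringify({ type: 'benchmark:progress', ...progress });
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(message);
  }
}
```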
Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages the service lifecycle (starting, stopping, health checking) and handles model operations (downloading, loading into memory, and unloading). It provides OpenAI-compatible inference APIs for consistent request formatting across models.
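For reference, a minimal initialization sketch following the SDK's documented pattern (the model alias is just an example; confirm the exact API surface against the foundry-local-sdk documentation):

```js
// Sketch: start Foundry Local, ensure a model is available, and talk to it
// through an OpenAI-compatible client pointed at the local endpoint.
import { FoundryLocalManager } from 'foundry-local-sdk';
import { OpenAI } from 'openai';

const manager = new FoundryLocalManager();
const modelInfo = await manager.init('phi-3.5-mini'); // example alias; downloads/loads if needed

const client = new OpenAI({
  baseURL: manager.endpoint, // local inference endpoint managed by Foundry Local
  apiKey: manager.apiKey,    // placeholder key expected by the OpenAI client
});

const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: 'user', content: 'Say hello.' }],
});
console.log(response.choices[0].message.content);
```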
The architecture supports comparing multiple models in a single run by loading them one at a time, running identical benchmarks, and aggregating the results for comparison:
User Initiates Benchmark Run
↓
Backend receives {models: [...], suite: "default", iterations: 10}
↓
For each model:
1. Load model into Foundry Local
2. Execute benchmark suite
- For each prompt in suite:
* Run N iterations
* Measure TTFT, TPOT, total time
* Track errors and timeouts
* Calculate tokens/second
3. Aggregate statistics (mean, p50, p95, p99)
4. Unload model
↓
Store results with metadata
↓
Return comparison data to frontend
↓
Visualize performance metrics
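Expressed as code, that per-model loop might look like the following sketch (loadModel and unloadModel stand in for the project's Foundry Local lifecycle wrappers and are passed in here to keep the sketch self-contained; BenchmarkExecutor is defined in the next section):

```js
// Sketch of the orchestration loop: models are benchmarked sequentially so that
// only one model occupies memory at a time.
async function runComparison({ client, models, prompts, options, loadModel, unloadModel }) {
  const allResults = [];
  for (const modelId of models) {
    await loadModel(modelId);                                            // 1. load into Foundry Local
    const executor = new BenchmarkExecutor(client, options);
    allResults.push(await executor.runBenchmarkSuite(modelId, prompts)); // 2-3. run suite + aggregate
    await unloadModel(modelId);                                          // 4. free memory for the next model
  }
  return allResults;                                                     // persisted, then sent to the frontend
}
```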
Accurate performance measurement requires instrumentation that captures multiple dimensions without introducing measurement overhead:
// src/server/benchmark.js
import { performance } from 'perf_hooks';
export class BenchmarkExecutor {
constructor(foundryClient, options = {}) {
this.client = foundryClient;
this.options = {
iterations: options.iterations || 10,
concurrency: options.concurrency || 1,
timeout_ms: options.timeout_ms || 30000,
warmup_iterations: options.warmup_iterations || 2
};
}
async runBenchmarkSuite(modelId, prompts) {
const results = [];
// Warmup phase (exclude from results)
console.log(`Running ${this.options.warmup_iterations} warmup iterations...`);
for (let i = 0; i < this.options.warmup_iterations; i++) {
await this.executeMeasuredPrompt(modelId, prompts[0].text);
}
// Actual benchmark runs
for (const prompt of prompts) {
console.log(`Benchmarking prompt: ${prompt.id}`);
const measurements = [];
for (let i = 0; i < this.options.iterations; i++) {
const measurement = await this.executeMeasuredPrompt(
modelId,
prompt.text
);
measurements.push(measurement);
// Small delay between iterations to stabilize
await sleep(100);
}
results.push({
prompt_id: prompt.id,
prompt_text: prompt.text,
measurements,
statistics: this.calculateStatistics(measurements)
});
}
return {
model_id: modelId,
timestamp: new Date().toISOString(),
config: this.options,
results
};
}
async executeMeasuredPrompt(modelId, promptText) {
const measurement = {
success: false,
error: null,
ttft_ms: null, // Time to first token
tpot_ms: null, // Time per output token
total_ms: null,
tokens_generated: 0,
tokens_per_second: 0
};
try {
const startTime = performance.now();
let firstTokenTime = null;
let tokenCount = 0;
// Streaming completion to measure TTFT
const stream = await this.client.chat.completions.create({
model: modelId,
messages: [{ role: 'user', content: promptText }],
max_tokens: 200,
temperature: 0.7,
stream: true
});
for await (const chunk of stream) {
if (chunk.choices[0]?.delta?.content) {
if (firstTokenTime === null) {
firstTokenTime = performance.now();
measurement.ttft_ms = firstTokenTime - startTime;
}
tokenCount++;
}
}
const endTime = performance.now();
measurement.total_ms = endTime - startTime;
measurement.tokens_generated = tokenCount;
if (tokenCount > 1 && firstTokenTime) {
// TPOT = time after first token / (tokens - 1)
const timeAfterFirstToken = endTime - firstTokenTime;
measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1);
measurement.tokens_per_second = 1000 / measurement.tpot_ms;
}
measurement.success = true;
} catch (error) {
measurement.error = error.message;
measurement.success = false;
}
return measurement;
}
calculateStatistics(measurements) {
const successful = measurements.filter(m => m.success);
const total = measurements.length;
if (successful.length === 0) {
return {
success_rate: 0,
error_rate: 1.0,
sample_size: total
};
}
const ttfts = successful.map(m => m.ttft_ms).sort((a, b) => a - b);
const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b);
const totals = successful.map(m => m.total_ms).sort((a, b) => a - b);
const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0);
return {
success_rate: successful.length / total,
error_rate: (total - successful.length) / total,
sample_size: total,
ttft: {
mean: mean(ttfts),
median: percentile(ttfts, 50),
p95: percentile(ttfts, 95),
p99: percentile(ttfts, 99),
min: Math.min(...ttfts),
max: Math.max(...ttfts)
},
tpot: tpots.length > 0 ? {
mean: mean(tpots),
median: percentile(tpots, 50),
p95: percentile(tpots, 95)
} : null,
total_latency: {
mean: mean(totals),
median: percentile(totals, 50),
p95: percentile(totals, 95),
p99: percentile(totals, 99)
},
throughput: {
mean_tps: mean(throughputs),
median_tps: percentile(throughputs, 50)
}
};
}
}
function mean(arr) {
return arr.reduce((sum, val) => sum + val, 0) / arr.length;
}
function percentile(sortedArr, p) {
const index = Math.ceil((sortedArr.length * p) / 100) - 1;
return sortedArr[Math.max(0, index)];
}
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
This measurement infrastructure captures time to first token (TTFT), time per output token (TPOT), total latency, throughput in tokens per second, and success/error rates, then aggregates them into mean, median, p95, and p99 statistics for each prompt.
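For orientation, here is a minimal usage sketch of the executor; the endpoint URL and model name are placeholders (in practice the base URL and key come from the Foundry Local SDK):

```js
// Sketch: run the default suite against a single model with the executor above.
import { readFileSync } from 'node:fs';
import { OpenAI } from 'openai';
import { BenchmarkExecutor } from './src/server/benchmark.js';

const suite = JSON.parse(readFileSync('./benchmarks/suites/default.json', 'utf8'));

// Placeholder endpoint; take baseURL/apiKey from the Foundry Local SDK in real use.
const client = new OpenAI({ baseURL: 'http://localhost:5272/v1', apiKey: 'not-needed' });

const executor = new BenchmarkExecutor(client, { iterations: 10, warmup_iterations: 2 });
const report = await executor.runBenchmarkSuite('phi-3.5-mini', suite.prompts);

console.log(report.results.map(r => ({ prompt: r.prompt_id, stats: r.statistics })));
```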
Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases:
// benchmarks/suites/default.json
{
"name": "default",
"description": "General-purpose benchmark covering diverse scenarios",
"prompts": [
{
"id": "short-factual",
"text": "What is the capital of France?",
"category": "factual",
"expected_tokens": 5
},
{
"id": "medium-explanation",
"text": "Explain how photosynthesis works in 3-4 sentences.",
"category": "explanation",
"expected_tokens": 80
},
{
"id": "long-reasoning",
"text": "Analyze the economic factors that led to the 2008 financial crisis. Discuss at least 5 major causes with supporting details.",
"category": "reasoning",
"expected_tokens": 250
},
{
"id": "code-generation",
"text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.",
"category": "coding",
"expected_tokens": 150
},
{
"id": "creative-writing",
"text": "Write a short story (3 paragraphs) about a robot learning to paint.",
"category": "creative",
"expected_tokens": 200
}
]
}
This suite covers multiple dimensions: output length (from a five-token factual answer to roughly 250 tokens of analysis) and task category (factual recall, explanation, reasoning, code generation, and creative writing).
For production applications, create custom suites matching your actual workload:
{
"name": "customer-support",
"description": "Simulates actual customer support queries",
"prompts": [
{
"id": "product-question",
"text": "How do I reset my password for the customer portal?"
},
{
"id": "troubleshooting",
"text": "I'm getting error code 503 when trying to upload files. What should I do?"
},
{
"id": "policy-inquiry",
"text": "What is your refund policy for annual subscriptions?"
}
]
}
Raw numbers don't reveal insights—visualization makes patterns obvious. The frontend implements several comparison views:
Comparison Table shows side-by-side metrics:
// frontend/src/components/ResultsTable.jsx
export function ResultsTable({ results }) {
  return (
    <table>
      <tbody>
        <tr><th>Model</th><th>TTFT (ms)</th><th>TPOT (ms)</th><th>Throughput (tok/s)</th><th>P95 Latency</th><th>Error Rate</th></tr>
        {results.map(result => (
          <tr key={result.model_id}>
            <td>{result.model_id}</td>
            <td>{result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)})</td>
            <td>{result.stats.tpot?.median.toFixed(1) || 'N/A'}</td>
            <td>{result.stats.throughput.median_tps.toFixed(1)}</td>
            <td>{result.stats.total_latency.p95.toFixed(0)} ms</td>
            <td className={result.stats.error_rate > 0.05 ? 'error' : 'success'}>{(result.stats.error_rate * 100).toFixed(1)}%</td>
          </tr>
        ))}
      </tbody>
    </table>
  ); }
Latency Distribution Chart reveals performance consistency:
// Using Chart.js for visualization (rendered via the react-chartjs-2 Bar component)
import { Chart as ChartJS, CategoryScale, LinearScale, BarElement, Tooltip, Legend } from 'chart.js';
import { Bar } from 'react-chartjs-2';

ChartJS.register(CategoryScale, LinearScale, BarElement, Tooltip, Legend);

export function LatencyChart({ results }) {
const data = {
labels: results.map(r => r.model_id),
datasets: [
{
label: 'Median (p50)',
data: results.map(r => r.stats.total_latency.median),
backgroundColor: 'rgba(75, 192, 192, 0.5)'
},
{
label: 'p95',
data: results.map(r => r.stats.total_latency.p95),
backgroundColor: 'rgba(255, 206, 86, 0.5)'
},
{
label: 'p99',
data: results.map(r => r.stats.total_latency.p99),
backgroundColor: 'rgba(255, 99, 132, 0.5)'
}
]
};
  return <Bar data={data} />;
}
Recommendations Engine synthesizes multi-dimensional comparison:
export function generateRecommendations(results) {
const recommendations = [];
// Find fastest TTFT (best perceived responsiveness)
const fastestTTFT = results.reduce((best, r) =>
r.stats.ttft.median < best.stats.ttft.median ? r : best
);
recommendations.push({
category: 'Fastest Response',
model: fastestTTFT.model_id,
reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms`
});
// Find highest throughput
const highestThroughput = results.reduce((best, r) =>
r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best
);
recommendations.push({
category: 'Best Throughput',
model: highestThroughput.model_id,
reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}`
});
// Find most consistent (lowest p95-p50 spread)
const mostConsistent = results.reduce((best, r) => {
const spread = r.stats.total_latency.p95 - r.stats.total_latency.median;
const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median;
return spread < bestSpread ? r : best;
});
recommendations.push({
category: 'Most Consistent',
model: mostConsistent.model_id,
reason: 'Lowest latency variance (p95-p50 spread)'
});
return recommendations;
}
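One possible way to surface these recommendations in the dashboard (the component and class names here are illustrative, not taken from the FLPerformance frontend):

```jsx
// Sketch: render the "best model for..." cards produced by generateRecommendations.
export function Recommendations({ results }) {
  const recs = generateRecommendations(results);
  return (
    <div className="recommendations">
      {recs.map(rec => (
        <div key={rec.category} className="recommendation-card">
          <h4>{rec.category}</h4>
          <p><strong>{rec.model}</strong></p>
          <p>{rec.reason}</p>
        </div>
      ))}
    </div>
  );
}
```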
Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team.
Critical principles for model evaluation: measure every dimension (TTFT, TPOT, total latency, throughput, and error rate), test on the hardware you will actually deploy to, collect enough samples to report meaningful p50/p95/p99 percentiles, benchmark with prompts that mirror your real workload, and compare models under identical, controlled conditions.
The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware.
With over 20 years of software development experience, Kevin Griffin is a passionate and versatile leader, trainer, and consultant in the .NET ecosystem. He has worked with various industries, from the United States Military to health care to ticket brokering, delivering high-quality solutions and empowering his clients and teams to succeed.
In his day job, he is the CTO at Shows On Sale, where he oversees the company's technical strategy and direction. Kevin also served as President of the .NET Foundation for its most recent term, and Microsoft has named him a Microsoft MVP at least 16 times. He speaks at numerous conferences and serves on the board of the Stir Trek Conference series.
Mentioned In This Episode
LinkedIn
DNF Summit 2025 Video
Website
Dev Fest - Hampton Roads, Norfolk/Virginia Beach, Virginia.
Stir Trek May 1st.
Want to Learn More?
Visit AzureDevOps.Show for show notes and additional episodes.
Ever get asked "how do LLMs work?" at a party and freeze? We walk through the full pipeline: tokenization, embeddings, inference — so you understand it well enough to explain it. Walk away with a mental model that you can use for your next dinner party.
Full show notes at fragmentedpodcast.com.
We'd love to hear from you. Email is the best way to reach us, or you can check our contact page for other ways.
We want to hear all the feedback: what's working, what's not, topics you'd like to hear more on. We want to make the show better for you, so let us know!
FYI: We transitioned from Android development to AI starting with Ep. #300. Listen to that episode for the full story behind our new direction.