Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Command Palette Dock is an astonishing proposal for a new PowerToys utility

We absolutely love PowerToys here at BetaNews, and we are far from being alone. Microsoft’s collection of utilities for Windows has legions of fans who eagerly await new releases and updates. With an incredibly busy team of developers working on the project, PowerToys is a fast-moving piece of software that evolves at a greater pace than most. There are many popular modules included in the PowerToys utility suite, with Command Palette being one of the most loved. And if a recent proposal on GitHub moves forward, it is about to get a serious enhancement. One of PowerToys’ key developers, Niels… [Continue Reading]

More memory-driven price rises


Two months ago, we announced increases to the prices of some Raspberry Pi 4 and 5 products. These were driven by an unprecedented rise in the cost of LPDDR4 memory, thanks to competition for memory fab capacity from the AI infrastructure roll-out.

Price rises have accelerated as we enter 2026, and the cost of some parts has more than doubled over the last quarter. As a result, we now need to make further increases to our own pricing, affecting all Raspberry Pi 4 and 5, and Compute Module 4 and 5, products that have 2GB or more of memory.

Memory density    Price increase
1GB               no increase
2GB               $10
4GB               $15
8GB               $30
16GB              $60

Raspberry Pi 500 and 500+ are affected, but not Raspberry Pi 400, which remains our lowest-cost all-in-one PC at $60. We have also been able to protect the pricing of 1GB products, including the $35 1GB Raspberry Pi 4 variant, and the $45 1GB Raspberry Pi 5 variant that we launched in December.

We don’t anticipate any changes to the price of Raspberry Pi Zero, Raspberry Pi 3, and other older products, as we currently hold several years’ inventory of the LPDDR2 memory that they use.

Looking ahead

2026 looks likely to be another challenging year for memory pricing, but we are working hard to limit the impact. We’ve said it before, but we’ll say it again: the current situation is ultimately a temporary one, and we look forward to unwinding these price increases once it abates.

The post More memory-driven price rises appeared first on Raspberry Pi.


Benchmarking Local AI Models


Introduction

Selecting the right AI model for your application requires more than reading benchmark leaderboards. Published benchmarks measure academic capabilities (question answering, reasoning, coding), but your application has specific requirements: latency budgets, hardware constraints, and quality thresholds. How do you know if Phi-4 provides acceptable quality for your document summarization use case? Will Qwen2.5-0.5B meet your 100ms response time requirement? Does your edge device have sufficient memory for Phi-3.5 Mini?

The answer lies in empirical testing: running actual models on your hardware with your workload patterns. This article demonstrates building a comprehensive model benchmarking platform using FLPerformance, Node.js, React, and Microsoft Foundry Local. You'll learn how to implement scientific performance measurement, design meaningful benchmark suites, visualize multi-dimensional comparisons, and make data-driven model selection decisions.

Whether you're evaluating models for production deployment, optimizing inference costs, or validating hardware specifications, this platform provides the tools for rigorous performance analysis.

Why Model Benchmarking Requires Purpose-Built Tools

You cannot assess model performance by running a few manual tests and noting the results. Scientific benchmarking demands controlled conditions, statistically significant sample sizes, multi-dimensional metrics, and reproducible methodology. Here is why purpose-built tooling is essential.

Performance is multi-dimensional. A model might excel at throughput (tokens per second) but suffer at latency (time to first token). Another might generate high-quality outputs slowly. Your application might prioritize consistency over average performance: a model with variable response times (high p95/p99 latency) creates a poor user experience even if the averages look good. Measuring all dimensions simultaneously enables informed tradeoffs.

Hardware matters enormously. Benchmark results from NVIDIA A100 GPUs don't predict performance on consumer laptops. NPU acceleration changes the picture again. Memory constraints affect which models can even load. Test on your actual deployment hardware or comparable specifications to get actionable results.

Concurrency reveals bottlenecks. A model handling one request excellently might struggle with ten concurrent requests. Real applications experience variable load; measuring only single-threaded performance misses critical scalability constraints. Controlled concurrency testing reveals these limits, as the sketch below illustrates.
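As a rough illustration of controlled concurrency testing, here is a minimal sketch of a batch-style runner; the executePrompt callback is a stand-in for whatever issues a single inference request, and the batching approach is illustrative rather than the project's implementation.

// Sketch: run `total` requests at a fixed concurrency level and capture
// wall-clock time plus per-request latency under that load.
async function runAtConcurrency(executePrompt, total, concurrency) {
  const latencies = [];
  const started = Date.now();

  for (let i = 0; i < total; i += concurrency) {
    const batch = Array.from(
      { length: Math.min(concurrency, total - i) },
      async () => {
        const t0 = Date.now();
        await executePrompt();            // one inference request
        latencies.push(Date.now() - t0);  // latency as experienced under load
      }
    );
    await Promise.all(batch);             // wait for the whole batch to finish
  }

  return { wall_clock_ms: Date.now() - started, latencies };
}

Comparing the latency distribution at concurrency 1 versus 10 exposes the queueing and memory pressure that single-request runs hide.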

Statistical rigor prevents false conclusions. Running a prompt once and noting the response time tells you nothing about performance distribution. Was this result typical? An outlier? You need dozens or hundreds of trials to establish p50/p95/p99 percentiles, understand variance, and detect stability issues.

Comparison requires controlled experiments. Different prompts, different times of day, and different system loads all introduce confounding variables. Scientific comparison runs identical workloads across models sequentially, controlling for external factors.

Architecture: Three-Layer Performance Testing Platform

FLPerformance implements a clean separation between orchestration, measurement, and presentation:

The frontend React application provides model management, benchmark configuration, test execution, and results visualization. Users add models from the Foundry Local catalog, configure benchmark parameters (iterations, concurrency, timeout values), launch test runs, and view real-time progress. The results dashboard displays comparison tables, latency distribution charts, throughput graphs, and "best model for..." recommendations.
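As a sketch of how the frontend might kick off a run, the snippet below POSTs the run configuration to the backend; the /api/benchmarks route and the payload shape are assumptions for illustration, not the project's documented API.

// frontend sketch: launch a benchmark run from the configuration UI
async function startBenchmarkRun(models, suite = 'default', iterations = 10) {
  const response = await fetch('/api/benchmarks', {        // hypothetical route
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ models, suite, iterations })
  });
  if (!response.ok) throw new Error(`Failed to start run: ${response.status}`);
  return response.json();   // e.g. an identifier used to follow progress updates
}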

The backend Node.js/Express server orchestrates tests and captures metrics. It manages the single Foundry Local service instance, loads/unloads models as needed, executes benchmark suites with controlled concurrency, measures comprehensive metrics (TTFT, TPOT, total latency, throughput, error rates), and persists results to JSON storage. WebSocket connections provide real-time progress updates during long benchmark runs.
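Persisting results is the simplest part of that pipeline. A minimal sketch follows, assuming runs are written to timestamped files under a data/results directory; the layout is an assumption, and the project may organize storage differently.

// server sketch: persist one model's benchmark run to JSON storage
import { writeFile, mkdir } from 'fs/promises';
import path from 'path';

export async function saveRunResults(run) {
  const dir = path.join('data', 'results');
  await mkdir(dir, { recursive: true });                        // ensure storage directory exists
  const file = path.join(dir, `${run.model_id}-${Date.now()}.json`);
  await writeFile(file, JSON.stringify(run, null, 2), 'utf8');  // human-readable JSON
  return file;   // path can be surfaced to the frontend for later retrieval
}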

Foundry Local SDK integration uses the official foundry-local-sdk npm package. The SDK manages the service lifecycle (starting, stopping, health checks) and handles model operations (downloading, loading into memory, unloading). It provides OpenAI-compatible inference APIs for consistent request formatting across models.
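The wiring between the SDK and an OpenAI-compatible client looks roughly like the sketch below, which follows the pattern from the published Foundry Local quickstart; treat the exact class and property names as assumptions to verify against the SDK documentation.

// Sketch: start Foundry Local via the SDK and point an OpenAI client at it
import { FoundryLocalManager } from 'foundry-local-sdk';
import OpenAI from 'openai';

const manager = new FoundryLocalManager();
const modelInfo = await manager.init('phi-3.5-mini');   // starts the service and loads the model

const client = new OpenAI({
  baseURL: manager.endpoint,   // local inference endpoint exposed by Foundry Local
  apiKey: manager.apiKey       // local key; no cloud credentials involved
});

// `client` can now be handed to the benchmark executor, which issues
// chat.completions requests against the locally loaded model.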

The architecture supports comparing multiple models in a single run by loading them one at a time, running identical benchmarks against each, and aggregating the results for comparison:

User Initiates Benchmark Run
    ↓
Backend receives {models: [...], suite: "default", iterations: 10}
    ↓
For each model:
    1. Load model into Foundry Local
    2. Execute benchmark suite
       - For each prompt in suite:
         * Run N iterations
         * Measure TTFT, TPOT, total time
         * Track errors and timeouts
         * Calculate tokens/second
    3. Aggregate statistics (mean, p50, p95, p99)
    4. Unload model
    ↓
Store results with metadata
    ↓
Return comparison data to frontend
    ↓
Visualize performance metrics
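The loop in the diagram maps onto a small amount of orchestration code. A minimal sketch follows; loadSuite, loadModel, unloadModel, and the onProgress callback are illustrative stand-ins for whatever helpers the project actually uses.

// server sketch: run the same suite against each model in turn
import { BenchmarkExecutor } from './benchmark.js';    // defined in the next section

export async function runComparison({ models, suite, iterations }, client, onProgress) {
  const prompts = await loadSuite(suite);              // read benchmarks/suites/<suite>.json
  const comparison = [];

  for (const modelId of models) {
    await loadModel(modelId);                          // 1. load model into Foundry Local
    onProgress?.({ model: modelId, status: 'running' });

    const executor = new BenchmarkExecutor(client, { iterations });
    const run = await executor.runBenchmarkSuite(modelId, prompts);  // 2-3. execute and aggregate

    comparison.push(run);
    await unloadModel(modelId);                        // 4. free memory before the next model
    onProgress?.({ model: modelId, status: 'done' });  // progress pushed to the frontend (e.g. over WebSocket)
  }

  return comparison;                                   // comparison data for visualization
}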

Implementing Scientific Measurement Infrastructure

Accurate performance measurement requires instrumentation that captures multiple dimensions without introducing measurement overhead:

// src/server/benchmark.js
import { performance } from 'perf_hooks';

export class BenchmarkExecutor {
  constructor(foundryClient, options = {}) {
    this.client = foundryClient;
    this.options = {
      iterations: options.iterations || 10,
      concurrency: options.concurrency || 1,
      timeout_ms: options.timeout_ms || 30000,
      warmup_iterations: options.warmup_iterations || 2
    };
  }
  
  async runBenchmarkSuite(modelId, prompts) {
    const results = [];
    
    // Warmup phase (exclude from results)
    console.log(`Running ${this.options.warmup_iterations} warmup iterations...`);
    for (let i = 0; i < this.options.warmup_iterations; i++) {
      await this.executePrompt(modelId, prompts[0].text);
    }
    
    // Actual benchmark runs
    for (const prompt of prompts) {
      console.log(`Benchmarking prompt: ${prompt.id}`);
      const measurements = [];
      
      for (let i = 0; i < this.options.iterations; i++) {
        const measurement = await this.executeMeasuredPrompt(
          modelId, 
          prompt.text
        );
        measurements.push(measurement);
        
        // Small delay between iterations to stabilize
        await sleep(100);
      }
      
      results.push({
        prompt_id: prompt.id,
        prompt_text: prompt.text,
        measurements,
        statistics: this.calculateStatistics(measurements)
      });
    }
    
    return {
      model_id: modelId,
      timestamp: new Date().toISOString(),
      config: this.options,
      results
    };
  }
  
  async executeMeasuredPrompt(modelId, promptText) {
    const measurement = {
      success: false,
      error: null,
      ttft_ms: null,  // Time to first token
      tpot_ms: null,  // Time per output token
      total_ms: null,
      tokens_generated: 0,
      tokens_per_second: 0
    };
    
    try {
      const startTime = performance.now();
      let firstTokenTime = null;
      let tokenCount = 0;
      
      // Streaming completion to measure TTFT
      const stream = await this.client.chat.completions.create({
        model: modelId,
        messages: [{ role: 'user', content: promptText }],
        max_tokens: 200,
        temperature: 0.7,
        stream: true
      });
      
      for await (const chunk of stream) {
        if (chunk.choices[0]?.delta?.content) {
          if (firstTokenTime === null) {
            firstTokenTime = performance.now();
            measurement.ttft_ms = firstTokenTime - startTime;
          }
          tokenCount++;
        }
      }
      
      const endTime = performance.now();
      measurement.total_ms = endTime - startTime;
      measurement.tokens_generated = tokenCount;
      
      if (tokenCount > 1 && firstTokenTime) {
        // TPOT = time after first token / (tokens - 1)
        const timeAfterFirstToken = endTime - firstTokenTime;
        measurement.tpot_ms = timeAfterFirstToken / (tokenCount - 1);
        measurement.tokens_per_second = 1000 / measurement.tpot_ms;
      }
      
      measurement.success = true;
      
    } catch (error) {
      measurement.error = error.message;
      measurement.success = false;
    }
    
    return measurement;
  }
  
  calculateStatistics(measurements) {
    const successful = measurements.filter(m => m.success);
    const total = measurements.length;
    
    if (successful.length === 0) {
      return {
        success_rate: 0,
        error_rate: 1.0,
        sample_size: total
      };
    }
    
    const ttfts = successful.map(m => m.ttft_ms).filter(v => v !== null).sort((a, b) => a - b);
    const tpots = successful.map(m => m.tpot_ms).filter(v => v !== null).sort((a, b) => a - b);
    const totals = successful.map(m => m.total_ms).sort((a, b) => a - b);
    const throughputs = successful.map(m => m.tokens_per_second).filter(v => v > 0);
    
    return {
      success_rate: successful.length / total,
      error_rate: (total - successful.length) / total,
      sample_size: total,
      
      ttft: {
        mean: mean(ttfts),
        median: percentile(ttfts, 50),
        p95: percentile(ttfts, 95),
        p99: percentile(ttfts, 99),
        min: Math.min(...ttfts),
        max: Math.max(...ttfts)
      },
      
      tpot: tpots.length > 0 ? {
        mean: mean(tpots),
        median: percentile(tpots, 50),
        p95: percentile(tpots, 95)
      } : null,
      
      total_latency: {
        mean: mean(totals),
        median: percentile(totals, 50),
        p95: percentile(totals, 95),
        p99: percentile(totals, 99)
      },
      
      throughput: {
        mean_tps: mean(throughputs),
        median_tps: percentile(throughputs, 50)
      }
    };
  }
}

function mean(arr) {
  return arr.reduce((sum, val) => sum + val, 0) / arr.length;
}

function percentile(sortedArr, p) {
  const index = Math.ceil((sortedArr.length * p) / 100) - 1;
  return sortedArr[Math.max(0, index)];
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

This measurement infrastructure captures:

  • Time to First Token (TTFT): Critical for perceived responsiveness—users notice delays before output begins
  • Time Per Output Token (TPOT): Determines generation speed after first token—affects throughput
  • Total latency: End-to-end time—matters for batch processing and high-volume scenarios
  • Tokens per second: Overall throughput metric—useful for capacity planning
  • Statistical distributions: Mean alone masks variability—p95/p99 reveal tail latencies that impact user experience
  • Success/error rates: Stability metrics—some models timeout or crash under load

Designing Meaningful Benchmark Suites

Benchmark quality depends on prompt selection. Generic prompts don't reflect real application behavior. Design suites that mirror actual use cases:

// benchmarks/suites/default.json
{
  "name": "default",
  "description": "General-purpose benchmark covering diverse scenarios",
  "prompts": [
    {
      "id": "short-factual",
      "text": "What is the capital of France?",
      "category": "factual",
      "expected_tokens": 5
    },
    {
      "id": "medium-explanation",
      "text": "Explain how photosynthesis works in 3-4 sentences.",
      "category": "explanation",
      "expected_tokens": 80
    },
    {
      "id": "long-reasoning",
      "text": "Analyze the economic factors that led to the 2008 financial crisis. Discuss at least 5 major causes with supporting details.",
      "category": "reasoning",
      "expected_tokens": 250
    },
    {
      "id": "code-generation",
      "text": "Write a Python function that finds the longest palindrome in a string. Include docstring and example usage.",
      "category": "coding",
      "expected_tokens": 150
    },
    {
      "id": "creative-writing",
      "text": "Write a short story (3 paragraphs) about a robot learning to paint.",
      "category": "creative",
      "expected_tokens": 200
    }
  ]
}

This suite covers multiple dimensions:

  • Length variation: Short (5 tokens), medium (80), long (250)—tests models across output ranges
  • Task diversity: Factual recall, explanation, reasoning, code, creative—reveals capability breadth
  • Token predictability: Expected token counts enable throughput calculations

For production applications, create custom suites matching your actual workload:

{
  "name": "customer-support",
  "description": "Simulates actual customer support queries",
  "prompts": [
    {
      "id": "product-question",
      "text": "How do I reset my password for the customer portal?"
    },
    {
      "id": "troubleshooting",
      "text": "I'm getting error code 503 when trying to upload files. What should I do?"
    },
    {
      "id": "policy-inquiry",
      "text": "What is your refund policy for annual subscriptions?"
    }
  ]
}
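Loading a suite at runtime is straightforward; the sketch below shows one way the loadSuite helper referenced earlier could read and validate these files, assuming they live under benchmarks/suites/ as in the examples above.

// server sketch: load a suite definition and hand its prompts to the executor
import { readFile } from 'fs/promises';

export async function loadSuite(name) {
  const raw = await readFile(`benchmarks/suites/${name}.json`, 'utf8');
  const suite = JSON.parse(raw);
  if (!Array.isArray(suite.prompts) || suite.prompts.length === 0) {
    throw new Error(`Suite "${name}" contains no prompts`);
  }
  return suite.prompts;   // [{ id, text, category?, expected_tokens? }, ...]
}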

Visualizing Multi-Dimensional Performance Comparisons

Raw numbers don't reveal insights—visualization makes patterns obvious. The frontend implements several comparison views:

Comparison Table shows side-by-side metrics:

// frontend/src/components/ResultsTable.jsx
export function ResultsTable({ results }) {
  return (
    <table>
      <thead>
        <tr>
          <th>Model</th><th>TTFT (ms)</th><th>TPOT (ms)</th><th>Throughput (tok/s)</th><th>P95 Latency</th><th>Error Rate</th>
        </tr>
      </thead>
      <tbody>
        {results.map(result => (
          <tr key={result.model_id}>
            <td>{result.model_id}</td>
            <td>{result.stats.ttft.median.toFixed(0)} (p95: {result.stats.ttft.p95.toFixed(0)})</td>
            <td>{result.stats.tpot?.median.toFixed(1) || 'N/A'}</td>
            <td>{result.stats.throughput.median_tps.toFixed(1)}</td>
            <td>{result.stats.total_latency.p95.toFixed(0)} ms</td>
            <td className={result.stats.error_rate > 0.05 ? 'error' : 'success'}>{(result.stats.error_rate * 100).toFixed(1)}%</td>
          </tr>
        ))}
      </tbody>
    </table>
  );
}

Latency Distribution Chart reveals performance consistency:

// Using Chart.js for visualization (here via the react-chartjs-2 wrapper)
import { Bar } from 'react-chartjs-2';

export function LatencyChart({ results }) {
  const data = {
    labels: results.map(r => r.model_id),
    datasets: [
      {
        label: 'Median (p50)',
        data: results.map(r => r.stats.total_latency.median),
        backgroundColor: 'rgba(75, 192, 192, 0.5)'
      },
      {
        label: 'p95',
        data: results.map(r => r.stats.total_latency.p95),
        backgroundColor: 'rgba(255, 206, 86, 0.5)'
      },
      {
        label: 'p99',
        data: results.map(r => r.stats.total_latency.p99),
        backgroundColor: 'rgba(255, 99, 132, 0.5)'
      }
    ]
  };
  
  return (
    <Bar data={data} />
  );
}

Recommendations Engine synthesizes multi-dimensional comparison:

export function generateRecommendations(results) {
  const recommendations = [];
  
  // Find fastest TTFT (best perceived responsiveness)
  const fastestTTFT = results.reduce((best, r) => 
    r.stats.ttft.median < best.stats.ttft.median ? r : best
  );
  recommendations.push({
    category: 'Fastest Response',
    model: fastestTTFT.model_id,
    reason: `Lowest median TTFT: ${fastestTTFT.stats.ttft.median.toFixed(0)}ms`
  });
  
  // Find highest throughput
  const highestThroughput = results.reduce((best, r) =>
    r.stats.throughput.median_tps > best.stats.throughput.median_tps ? r : best
  );
  recommendations.push({
    category: 'Best Throughput',
    model: highestThroughput.model_id,
    reason: `Highest tok/s: ${highestThroughput.stats.throughput.median_tps.toFixed(1)}`
  });
  
  // Find most consistent (lowest p95-p50 spread)
  const mostConsistent = results.reduce((best, r) => {
    const spread = r.stats.total_latency.p95 - r.stats.total_latency.median;
    const bestSpread = best.stats.total_latency.p95 - best.stats.total_latency.median;
    return spread < bestSpread ? r : best;
  });
  recommendations.push({
    category: 'Most Consistent',
    model: mostConsistent.model_id,
    reason: 'Lowest latency variance (p95-p50 spread)'
  });
  
  return recommendations;
}

Key Takeaways and Benchmarking Best Practices

Effective model benchmarking requires scientific methodology, comprehensive metrics, and application-specific testing. FLPerformance demonstrates that rigorous performance measurement is accessible to any development team.

Critical principles for model evaluation:

  • Test on target hardware: Results from cloud GPUs don't predict laptop performance
  • Measure multiple dimensions: TTFT, TPOT, throughput, consistency all matter
  • Use statistical rigor: Single runs mislead—capture distributions with adequate sample sizes
  • Design realistic workloads: Generic benchmarks don't predict your application's behavior
  • Include warmup iterations: Model loading and JIT compilation affect early measurements
  • Control concurrency: Real applications handle multiple requests—test at realistic loads
  • Document methodology: Reproducible results require documented procedures and configurations

The complete benchmarking platform with model management, measurement infrastructure, visualization dashboards, and comprehensive documentation is available at github.com/leestott/FLPerformance. Clone the repository and run the startup script to begin evaluating models on your hardware.

Resources and Further Reading


Kevin Griffin: Engineering for System Uptime - Episode 387

1 Share

With over 20 years of software development experience, Kevin Griffin is a passionate and versatile leader, trainer, and consultant in the .NET ecosystem. He has worked across industries ranging from the United States military to healthcare to ticket brokering, delivering high-quality solutions and empowering his clients and teams to succeed.

In his day job, he is the CTO at Shows On Sale, where he oversees the technical strategy and direction of the company. Kevin has also served as President of the .NET Foundation for the last term, and Microsoft has named him a Microsoft MVP at least 16 times. He speaks at numerous conferences and is a board member of the Stir Trek conference series as well.

Mentioned In This Episode

LinkedIn
DNF Summit 2025 Video 
Website
Dev Fest - Hampton Roads, Norfolk/Virginia Beach, Virginia.
Stir Trek May 1st.

Want to Learn More?
Visit AzureDevOps.Show for show notes and additional episodes.





Download audio: https://traffic.libsyn.com/clean/secure/azuredevops/Episode_387.mp3?dest-id=768873

303 - How LLMs Work - the 20 minute explainer


Ever get asked "how do LLMs work?" at a party and freeze? We walk through the full pipeline: tokenization, embeddings, inference — so you understand it well enough to explain it. Walk away with a mental model that you can use for your next dinner party.

Full show notes at fragmentedpodcast.com.

Show Notes

Words -> Tokens:

Tokens -> Embeddings:

Embeddings -> Inference:

Get in touch

We'd love to hear from you. Email is the best way to reach us, or you can check our contact page for other ways.

We want to hear all the feedback: what's working, what's not, and topics you'd like to hear more on. We want to make the show better for you, so let us know!

Co-hosts:

FYI: We transitioned from Android development to AI starting with Ep. #300. Listen to that episode for the full story behind our new direction.





Download audio: https://cdn.simplecast.com/audio/20f35050-e836-44cd-8f7f-fd13e8cb2e44/episodes/d0e8970a-e53d-453d-92be-1061cc6a1077/audio/edc5a514-6970-472d-a6c5-ae9fab11751e/default_tc.mp3?aid=rss_feed&feed=LpAGSLnY

AI and OpenClaw (née Clawdbot) and tools like Claude Code and GitHub Copilot for non-technical folks

From: Scott Hanselman
Duration: 9:45
Views: 2,238
