Nvidia unveiled a push into inference with a Groq-based chip and Samsung production that could diversify manufacturing beyond TSMC. SEC filings show a sharp rise in AI-agent risk disclosures, signaling growing legal and business concern about agents disrupting SaaS. Copyright disputes stalled Seedance 2.0's global launch, while new AI ventures and the Gemini-powered AskMaps underscore rapid commercialization and labor-market anxieties.
Fernando Rojo, Head of Mobile at Vercel, joins Mazen and Robin to discuss building the V0 mobile app with React Native and Expo, including tech stack decisions, performance optimization, and how the project pushed Vercel to improve its native tooling.
Infinite Red is an expert React Native consultancy located in the USA. With over a decade of React Native experience and deep roots in the React Native community (hosts of Chain React and the React Native Newsletter, core React Native contributors, creators of Ignite and Reactotron, and much, much more), Infinite Red is the best choice for helping you build and deploy your next React Native app.
Show: @pythonbytes@fosstodon.org / @pythonbytes.fm (bsky)
Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Monday at 10am PT. Older video versions available there too.
Finally, if you want an artisanal, hand-crafted digest of every week of the show notes in email form, add your name and email to our friends-of-the-show list. We'll never share it.
Dan Blanchard, maintainer of a Python character encoding detection library called chardet, released a new version of the library under a new software license. (LGPL → MIT)
Dan is allowed to make this change because v7 is a complete “clean room” rewrite using AI
BTW, v7 is WAY better:
The result is a 48x increase in detection speed for a library that lives in the hot loops of many projects. That will lead to noticeable performance gains for literally millions of users (the package gets ~130M downloads per month).
It paves a path towards inclusion in the standard library (assuming they don’t institute policies against using AI tools).
Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
An individual claiming to be Mark Pilgrim, the original creator of the library, opened an issue in the project's GitHub repo arguing that Blanchard had no right to change the software license, citing the LGPL requirement that the license remain unchanged.
A 'complete rewrite' is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a 'clean room' implementation).
Blanchard disagreed, citing how versions 7.0.0 and 6.0.0 compare when subjected to JPlag, a tool for detecting software plagiarism.
Blanchard told The Register he had wanted to get chardet added to the Python standard library for more than a decade, since it's a core dependency of most Python projects.
A browser plugin that improves the GitHub experience
A sampling
Adds a build/CI status icon next to the repo’s name.
Adds a link back to the PR that ran the workflow.
Enables Tab and Shift+Tab for indentation in comment fields.
Auto-resizes comment fields to fit their content so they no longer show scroll bars.
Highlights the most useful comment in issues.
Changes the default sort order of issues/PRs to Recently updated.
But really, it’s a huge list of improvements
Michael #3: pgdog: PostgreSQL connection pooler, load balancer and database sharder
PgDog is a proxy for scaling PostgreSQL.
It supports connection pooling, load balancing queries and sharding entire databases.
Written in Rust, PgDog is fast, secure and can manage thousands of connections on commodity hardware.
Features
PgDog is an application layer load balancer for PostgreSQL
Health Checks: PgDog maintains a real-time list of healthy hosts. When a database fails a health check, it's removed from the active rotation and queries are re-routed to other replicas
Single Endpoint: PgDog can detect writes (e.g. INSERT, UPDATE, CREATE TABLE, etc.) and send them to the primary, leaving the replicas to serve reads
Failover: PgDog monitors Postgres replication state and can automatically redirect writes to a different database if a replica is promoted
Sharding: PgDog is able to manage databases with multiple shards
This week on the show, Scott talks to Philip Kiley about his new book, Inference Engineering. Inference Engineering is your guide to becoming an expert in inference. It contains everything that Philip has learned in four years of working at Baseten. This book is based on the hundreds of thousands of words of documentation, blogs, and talks he's written on inference; interviews with dozens of experts from our engineering team; and countless conversations with customers and builders around the world.
The NVIDIA RTX PRO 4500 Blackwell Server Edition brings GPU acceleration to the world's most widely adopted enterprise data center and edge computing platforms. It offers a significant performance increase compared to traditional CPU-only servers. For Red Hat customers, this server edition provides compact acceleration across the Red Hat AI portfolio, including Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat AI Enterprise. This gives organizations a practical path to build, optimize, deploy, and scale AI workloads across enterprise datacenter and edge environments.
Optimized for Red Hat AI
The NVIDIA RTX PRO 4500 Blackwell Server Edition is a reliable choice for compact, power-efficient AI deployments. It provides inference performance without adding unnecessary operational complexity. For Red Hat AI users, it offers a practical mix of memory capacity, performance, and efficiency for running modern models in enterprise datacenter and edge environments.
This hardware also stands out as a compelling successor to the NVIDIA L4 for this type of deployment. With more memory, greater performance headroom, and support for low-precision inference, organizations can better tune model size, throughput, latency, and overall deployment efficiency to match workload requirements.
Quantization provides much of that value. 8-bit integer (INT8) is a widely adopted option for inference, while 4-bit integer (INT4) helps fit larger models into more constrained memory footprints. FP8 has also become increasingly important for modern accelerator-based deployments. Blackwell supports NVFP4, giving Red Hat AI users flexibility for advanced model optimization and inference.
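As a back-of-the-envelope illustration of why these lower-precision formats matter, a model's weight footprint scales linearly with bits per parameter. A minimal sketch (the function name and the 70B example are ours, not from the article; real footprints run slightly higher once block scale factors, KV cache, and runtime overhead are counted):

```python
# Rough weight-memory arithmetic for quantized inference.
# Counts weights only -- KV cache, activations, and runtime
# overhead are deliberately excluded.
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9  # decimal GB

for fmt, bits in [("FP16", 16), ("FP8/INT8", 8), ("NVFP4/INT4", 4)]:
    print(f"70B @ {fmt}: ~{weight_memory_gb(70, bits):.0f} GB")
# 70B @ FP16: ~140 GB
# 70B @ FP8/INT8: ~70 GB
# 70B @ NVFP4/INT4: ~35 GB
```

This is why a 70B-class model that would never fit on a 32 GB card at FP16 becomes a candidate for a two-GPU NVFP4 deployment.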
Configure the RTX PRO 4500 Blackwell Server Edition on Red Hat AI Enterprise
To use the RTX PRO 4500 Blackwell Server Edition in Red Hat OpenShift, install the Node Feature Discovery and the NVIDIA GPU Operator (Figure 1).
Figure 1: Search for and select the NVIDIA GPU Operator from the OpenShift Software Catalog.
Set these parameters in the NVIDIA GPU Operator installation UI:
Set the driver version in the NVIDIA GPU Operator ClusterPolicy to 580.126.16 (version 595 will be the officially supported NVIDIA driver release). Enter this value in the driver version field to deploy the required driver image tag across the cluster.
Enter nvcr.io/nvidia in the repository field of the ClusterPolicy so the operator pulls the container from the correct registry.
Enter driver in the image field of the ClusterPolicy to reference the correct driver container image.
Set kernelModuleType to open in the NVIDIA GPU Operator ClusterPolicy to use open GPU kernel modules during installation.
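Taken together, the parameters above map onto the driver section of the ClusterPolicy spec. The following is a hedged sketch, not a complete policy: field names follow the NVIDIA GPU Operator ClusterPolicy CRD, and a generated policy contains many more fields.

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    repository: nvcr.io/nvidia   # registry to pull the driver container from
    image: driver                # driver container image name
    version: "580.126.16"        # driver image tag deployed across the cluster
    kernelModuleType: open       # use open GPU kernel modules
```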
You can also edit the ClusterPolicy directly to set these parameters.
Once installed, you can use the RTX PRO 4500 Blackwell Server Edition with OpenShift (Figure 2).
Figure 2: Use the terminal to verify the installation of the NVIDIA GPUs using the nvidia-smi command.
Running nvidia-smi from the NVIDIA driver daemonset in the OpenShift web console confirms that both NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs are detected correctly.
Verify the hardware
This validation environment uses Red Hat OpenShift 4.20.15.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.20.15 True False 24m Cluster version is 4.20.15
The deployment uses a single-node Red Hat OpenShift cluster running Kubernetes 1.33.6.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
redhat-validation-02-gpu01 Ready control-plane,master,worker 6h24m v1.33.6
Figure 3: Use the Node Feature Discovery Operator to manage hardware-specific labeling within your cluster.
After you install the Node Feature Discovery Operator (Figure 3), the node identifies as hosting an NVIDIA PCI device with Single Root I/O Virtualization (SR-IOV) capabilities.
The NVIDIA GPU Operator deploys into the nvidia-gpu-operator project.
$ oc project nvidia-gpu-operator
Now using project "nvidia-gpu-operator" on server "https://api.launchpad.nvidia.com:6443".
During installation, the NVIDIA GPU Operator starts components in sequence. These include the driver daemonset, container toolkit, device plug-in, NVIDIA Data Center GPU Manager (DCGM), GPU Feature Discovery, node status exporter, and operator validator.
Once the installation completes, verify that the NVIDIA GPU Operator components are operational. These include the driver daemonset, MIG Manager, and the node status exporter.
Running nvidia-smi confirms that OpenShift exposes the NVIDIA RTX PRO 4500 Blackwell Server Edition. The output shows driver version 580.126.16 and CUDA 13.0, with the GPUs idle and ready for workload validation.
$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- nvidia-smi
Tue Mar 10 20:46:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.16 Driver Version: 580.126.16 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 4500 Blac... On | 00000000:17:00.0 Off | 0 |
| N/A 33C P8 16W / 165W | 0MiB / 32623MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA RTX PRO 4500 Blac... On | 00000000:63:00.0 Off | 0 |
| N/A 34C P8 17W / 165W | 0MiB / 32623MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Verify the full GPU names with the following command:
$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
nvidia-smi --query-gpu=name --format=csv
name
NVIDIA RTX PRO 4500 Blackwell Server Edition
NVIDIA RTX PRO 4500 Blackwell Server Edition
At idle, the NVIDIA RTX PRO 4500 Blackwell Server Edition reports temperatures of 33–34°C and a power draw of 16–17 W against a 165 W power limit.
Topology reporting shows that the GPUs and Mellanox NICs are attached within the same platform fabric, with both GPUs sharing NUMA affinity and standard PCIe-based connectivity.
$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
nvidia-smi topo -m
GPU0 GPU1 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE SYS SYS SYS 0-31,64-95 0 N/A
GPU1 NODE X NODE SYS SYS SYS 0-31,64-95 0 N/A
NIC0 NODE NODE X SYS SYS SYS
NIC1 SYS SYS SYS X PIX NODE
NIC2 SYS SYS SYS PIX X NODE
NIC3 SYS SYS SYS NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
MIG is disabled, compute mode remains in the default setting, and both persistence mode and ECC are enabled.
The Red Hat AI inference CUDA image supports NVIDIA's NVFP4 quantization format on RTX PRO 4500 Blackwell-based GPUs. This allows for efficient, low-cost large-model inference with vLLM. NVFP4 is a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture that uses hardware acceleration.
We have reliably deployed NVFP4-quantized models. Using Red Hat AI Inference Server 3.3, results for completions, tool calling, reasoning, and accuracy are consistent with the original full-precision models. Tests confirm good accuracy for RedHatAI/Qwen3-30B-A3B-NVFP4 (TP=1) and RedHatAI/Llama-3.3-70B-Instruct-NVFP4 (TP=2).
| Model name | Completions | Chat completion | Tool calling | Accuracy |
|---|---|---|---|---|
| RedHatAI/Qwen3-30B-A3B-NVFP4 | Yes | Yes | Yes | 80% |
| RedHatAI/Llama-3.3-70B-Instruct-NVFP4 | Yes | Yes | Yes | 93% |
The following is a sample deployment that serves the model using Red Hat AI Inference Server. An init container downloads the model weights from Hugging Face, and the main container launches vLLM with tensor parallelism across two GPUs with tool-calling support enabled.
Create the necessary resources, such as a Hugging Face secret for authentication (needed for gated models) and a persistent volume claim for caching the model weights, and then apply the deployment:
# Create the HF token secret
oc create secret generic hf-token-secret \
  --from-literal=HUGGING_FACE_TOKEN=<your-token>

# Create a PVC for model caching
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF
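The deployment itself can be sketched as follows. This is a hedged outline of the setup described above, not the exact manifest from the validation: the image reference is a placeholder for the Red Hat AI Inference Server CUDA image, and the tool-call parser value depends on the model family.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-33-70b-nvfp4
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-33-70b-nvfp4
  template:
    metadata:
      labels:
        app: llama-33-70b-nvfp4
    spec:
      initContainers:
      # Init container: download the model weights from Hugging Face
      # into the shared cache volume before vLLM starts.
      - name: download-model
        image: <red-hat-ai-inference-server-image>
        command: ["huggingface-cli", "download", "RedHatAI/Llama-3.3-70B-Instruct-NVFP4"]
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: HUGGING_FACE_TOKEN
        - name: HF_HOME
          value: /models
        volumeMounts:
        - name: model-cache
          mountPath: /models
      containers:
      # Main container: serve the model with vLLM, tensor-parallel
      # across both GPUs, with tool calling enabled.
      - name: vllm
        image: <red-hat-ai-inference-server-image>
        command: ["vllm", "serve", "RedHatAI/Llama-3.3-70B-Instruct-NVFP4",
                  "--port", "9000",
                  "--tensor-parallel-size", "2",
                  "--enable-auto-tool-choice",
                  "--tool-call-parser", "llama3_json"]
        env:
        - name: HF_HOME
          value: /models
        ports:
        - containerPort: 9000
        resources:
          limits:
            nvidia.com/gpu: 2
        volumeMounts:
        - name: model-cache
          mountPath: /models
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
```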
Use the following commands and outputs to validate model completions, chat performance, and accuracy benchmarks.
1. Completion (POST /v1/completions)
curl -s -X POST http://localhost:9000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "RedHatAI/Qwen3-30B-A3B-NVFP4",
"prompt": "The capital of France is",
"max_tokens": 32,
"temperature": 0.0
}' | jq -r '.choices[0].text'
" Paris. The capital of the United Kingdom is London. The capital of the United States is Washington, D.C. The capital of Germany is Berlin. The capital",
2. Chat Completion - single turn:
curl -s -X POST http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Qwen3-30B-A3B-NVFP4",
    "messages": [{"role": "user", "content": "What is the capital of France? Answer in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.0
  }'
HTTP STATUS: 200
3. Accuracy (gsm8k):
local-completions ({'model': 'RedHatAI/Qwen3-30B-A3B-NVFP4', 'base_url': 'http://localhost:9000/v1/completions', 'num_concurrent': 100, 'tokenized_requests': False}), gen_kwargs: ({'max_gen_toks': 4048}), limit: None, num_fewshot: None, batch_size: 16
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.9067|± |0.0080|
| | |strict-match | 5|exact_match|↑ |0.9052|± |0.0081|
Performance validation
After confirming accuracy with NVFP4 quantized models, we validated performance characteristics using the GuideLLM benchmarking tool. The tests measured throughput and latency across five NVFP4 models deployed with Red Hat AI Inference Server 3.3 on the RTX PRO 4500 Blackwell Server Edition GPUs. See the full list of NVFP4 quantized models from Red Hat.
Test configuration
The validation used a standardized workload profile with 1,000 input tokens and 1,000 output tokens per request. We tested multiple concurrency levels to identify throughput limits and latency behavior under load. Each concurrency level ran for 2-4 minutes to ensure stable measurements.
All deployments used a dual-replica configuration with tensor parallelism set to 1 (TP=1), meaning each replica ran on a single GPU.
Performance results
The following table shows peak throughput and peak SLO-compliant throughput for each model. Peak SLO-compliant concurrency is the highest level where P99 Time to First Token (TTFT) is at or below 3,000 ms and P99 Inter-Token Latency (ITL) is at or below 80 ms.
Performance summary for the 2x NVIDIA RTX PRO 4500 Blackwell Server Edition. Throughput values represent output tokens per second. SLO-compliant concurrency is the maximum concurrent requests while maintaining SLO (P99 TTFT ≤ 3s, P99 ITL ≤ 80ms).
| Model | Size | Peak throughput (tok/s) | Peak concurrency | Peak SLO-compliant throughput (tok/s) | Peak SLO-compliant concurrency | P99 TTFT (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | 8B | 3,515 | 225 | 2,847 | 100 | 2,645 | 31 |
| Qwen3-8B | 8B | 2,966 | 225 | 2,421 | 100 | 2,531 | 32 |
| Qwen3-14B | 14B | 2,225 | 150 | 1,339 | 50 | 2,719 | 33 |
| Mistral-Small-3.2-24B | 24B | 1,625 | 170 | 688 | 30 | 2,137 | 34 |
| Qwen3-32B | 32B | 666 | 50 | 333 | 20 | 2,076 | 43 |
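The SLO-compliant operating point defined above can be derived mechanically from a concurrency sweep. A minimal sketch with illustrative numbers (not benchmark data), loosely in the spirit of the 8B results:

```python
# Pick the highest concurrency whose P99 latencies meet the SLO
# (P99 TTFT <= 3,000 ms and P99 ITL <= 80 ms).
# Each sample: (concurrency, throughput_tok_s, p99_ttft_ms, p99_itl_ms)
def peak_slo_compliant(samples, max_ttft_ms=3000, max_itl_ms=80):
    compliant = [s for s in samples
                 if s[2] <= max_ttft_ms and s[3] <= max_itl_ms]
    return max(compliant, key=lambda s: s[0], default=None)

# Illustrative sweep (made-up numbers):
sweep = [
    (50,  1500, 1200, 22),
    (100, 2847, 2645, 31),   # last level still inside the SLO
    (150, 3100, 3400, 45),   # TTFT blows the 3 s budget
    (225, 3515, 5200, 70),
]
print(peak_slo_compliant(sweep))  # (100, 2847, 2645, 31)
```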
Key findings:
The 8B models demonstrate linear throughput scaling up to 100 concurrent requests and maintain sub-3 second P99 response times.
The 14B model provides a balance between capability and performance, supporting up to 50 concurrent requests within the prescribed SLO.
The 24B and larger models are best suited for lower-concurrency workloads where model capability is prioritized over throughput.
The scaling behavior for these models across concurrent requests is shown in Figure 4, and the comparison of peak versus SLO-compliant throughput is shown in Figure 5.
Figure 4: Output throughput versus concurrency comparison for various models, highlighting peak and SLO-compliant operating points. Blue stars mark peak throughput for each model. Green stars mark the peak SLO-compliant operating point (P99 TTFT ≤ 3 s, P99 ITL ≤ 80 ms).
Figure 5: Comparison of peak versus SLO-compliant throughput for Llama, Qwen, and Mistral models. SLO-compliant throughput represents the maximum sustainable performance while meeting strict latency SLOs.
Conclusion
The NVFP4 quantized models running on dual RTX PRO 4500 Blackwell Server Edition GPUs deliver high-speed inference performance across various model sizes. This platform demonstrates that 4-bit NVFP4 quantization, combined with modern GPU architecture and optimized inference engines, delivers reliable AI inference at scale.
Red Hat OpenShift AI
With the accelerator environment already prepared and validated, the next step is to add Red Hat OpenShift AI so teams can start using those resources for model serving, inference, and other AI workflows at scale. This is the point where the validated hardware configuration becomes available through the OpenShift AI experience and can be used by data scientists, developers, and platform teams.
Install Red Hat OpenShift AI from the Software Catalog using the stable-3.x channel and version 3.3.0. Once installed, the platform can make use of the available accelerator resources for AI workloads (Figure 6).
Figure 6: The Red Hat OpenShift AI operator installation interface within the Software Catalog.
To make the NVIDIA RTX PRO 4500 Blackwell Server Edition available as a reusable accelerator option in Red Hat OpenShift AI, we created a dedicated hardware profile. In OpenShift AI, hardware profiles define the resource configuration that users can select for workbenches and other AI workloads, combining CPU, memory, and accelerator settings into a single reusable profile.
For this configuration, we created a profile named NVIDIA RTX PRO 4500 Blackwell Server Edition and associated it with the accelerator resource identifier nvidia.com/gpu. We then defined the default and allowed resource ranges for CPU, memory, and GPU allocation. In this example, the profile was configured with a default of 2 CPU cores, 16 GiB of memory, and 1 GPU, with support for scaling to 8 CPU cores, 32 GiB of memory, and 2 GPUs as required (Figure 7).
Figure 7: The Hardware profiles interface displaying configured node resource limits for CPU, memory, and NVIDIA accelerators.
After the profile is updated, it is listed as an enabled hardware profile in OpenShift AI and can be used as a standard accelerator-backed configuration for supported workloads (Figure 8).
Figure 8: The enabled hardware profile is now available in the Red Hat OpenShift AI dashboard for workload allocation.
For example, we created a distributed training job using Kubeflow Trainer to fine-tune a large language model (LLM) on Red Hat OpenShift AI using two NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs. Figure 9 illustrates the training configuration and metrics during the distributed model fine-tuning process directly from a Jupyter notebook using TensorBoard.
Figure 9: Distributed model training job metrics using Kubeflow Trainer with the RTX PRO 4500 Blackwell Server Edition.
Figure 10 displays the OpenShift web console observability dashboard, which allows you to monitor the GPU metrics in real time and shows the high utilization of the two NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs during the fine-tuning job.
Figure 10: OpenShift AI GPU metrics.
Summary and next steps
The NVIDIA RTX PRO 4500 Blackwell Server Edition provides a clear upgrade path for teams moving beyond the NVIDIA L4. By using the NVFP4 format on Red Hat OpenShift, you can maximize inference efficiency while maintaining a compact hardware footprint. Use the configuration steps in this guide to begin validating Blackwell-class workloads in your environment.
Building a Remote MCP Server with .NET 10 and Prompts
Model Context Protocol (MCP) gives AI clients a standard way to discover and call server capabilities. Most examples focus on tools first, but this project demonstrates an equally important direction: prompt-first server design.
In this post, we walk through a practical ASP.NET Core implementation that exposes:
tool operations (for executable tasks)
reusable prompts (for instruction templates and interaction quality)
HTTP MCP transport for remote clients
lightweight operational endpoints for health and diagnostics
By the end, you will understand how this server is structured, how prompts are registered and invoked, and how to expand the design for production scenarios.
Why Prompt-First MCP Servers Matter
Tools answer what the server can do. Prompts shape how those capabilities are used.
In assistant systems, quality often depends on strong instruction templates:
consistent phrasing
clear required input
predictable output structure
fallback behavior when user requests are ambiguous
When prompts are exposed through MCP, clients can discover them dynamically and compose better experiences without hardcoding all instructions in the client itself.
This creates a cleaner split of responsibilities:
server owns domain guidance and prompt governance
client owns orchestration and user interface
Get Started
The complete, production-ready source code is available on GitHub:
src/McpServer/WeatherService: external weather API integration
scripts: PowerShell automation for MCP protocol calls
Runtime and Transport Design
The server is configured with both HTTP transport and stdio transport. HTTP is the primary remote integration path, while stdio can be useful for local tool-chain and debugging workflows.
At startup, the app:
reads FUNCTIONS_CUSTOMHANDLER_PORT (default 8081)
binds Kestrel to 0.0.0.0:port
registers tools, resources, and prompts in the MCP pipeline
maps /mcp for protocol traffic
maps /api/healthz for liveness checks
This split is operationally useful:
/mcp handles AI protocol traffic
/api/healthz supports probes and platform health monitors
Prompt Architecture Deep Dive
Prompt definitions live in a dedicated prompt container class decorated with an MCP prompt type attribute.
Prompt Container Responsibilities
The prompt class has three key responsibilities:
define discoverable prompt methods
document methods and arguments with metadata
optionally log invocation context for observability
The implementation includes dependency-injected logging, making it straightforward to trace usage patterns in real deployments.
Prompt 1: default_prompt
Purpose:
provide a baseline system instruction for concise, reliable responses
Behavior:
returns a compact instruction block emphasizing brevity and non-hallucination
includes explicit uncertainty handling by asking for clarification when needed
Why this is valuable:
offers a reusable default instruction that clients can apply consistently
avoids duplicating base instruction text across multiple clients
Prompt 2: weather_query_guide
Purpose:
teach users or orchestrators how to ask weather questions effectively
Behavior:
returns structured guidance with examples and best practices
accepts optional userContext and appends it when provided
Why argumentized prompts are powerful:
one prompt template can adapt to locale, user preferences, or interaction history
client-side logic remains simple while server controls guidance quality
Prompt 3: weather_data_interpretation
Purpose:
standardize how weather tool output should be interpreted and presented
Behavior:
explains expected weather fields and presentation strategy
includes recommendation patterns, for example suggesting umbrella advice when appropriate
Why this helps:
decouples raw tool output from user-facing narrative quality
improves consistency across multi-client environments
Prompt Discovery and Invocation Flow
MCP interaction is straightforward:
Client calls prompts/list
Server returns prompt metadata including names, descriptions, and argument schema
Client calls prompts/get with a selected prompt name (and optional arguments)
Server returns prompt content ready for orchestration
This means client authors can build dynamic UIs or agent planners that adapt to server capabilities at runtime, without shipping fixed prompt catalogs.
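Concretely, the two protocol calls above are plain JSON-RPC 2.0 messages. A hedged sketch in Python (the method names come from the MCP specification; the prompt name and argument mirror this server's weather_query_guide):

```python
import json

# JSON-RPC 2.0 payloads for the MCP prompt flow described above.
list_prompts = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "prompts/list",
}

get_prompt = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "prompts/get",
    "params": {
        "name": "weather_query_guide",
        "arguments": {"userContext": "metric units, UK locale"},
    },
}

# POST these to the server's /mcp endpoint (for example with curl or an
# MCP client library); the streamable HTTP transport also expects an
# Accept header allowing both application/json and text/event-stream.
print(json.dumps(get_prompt, indent=2))
```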
Local Run and Verification
From repo root:
dotnet restore .\src\McpServer\McpServer.slnx
dotnet run --project .\src\McpServer\McpServer\McpServer.csproj
Default MCP endpoint:
http://0.0.0.0:8081/mcp
Health check:
http://localhost:8081/api/healthz
Then, in a second terminal, verify protocol behavior with scripts.
Script-Driven MCP Validation
The scripts folder includes ready-to-use protocol calls for tools, prompts, and optional resources.