Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.
149750 stories · 33 followers

What to Expect from Nvidia This Week

1 Share
From: AIDailyBrief
Duration: 8:25
Views: 92

Nvidia unveiled a push into inference with a Groq-based chip and Samsung production that could diversify manufacturing outside TSMC. SEC filings show a sharp rise in AI agent risk disclosures, signaling growing legal and business concern about agent disruption to SaaS. Copyright disputes stalled Seedance 2.0's global launch, while new AI ventures and Gemini-powered AskMaps underscore rapid commercialization and labor market anxieties.

The AI Daily Brief helps you understand the most important news and discussions in AI.
Subscribe to the podcast version of The AI Daily Brief wherever you listen: https://pod.link/1680633614
Get it ad free at http://patreon.com/aidailybrief
Learn more about the show https://aidailybrief.ai/

Read the whole story
alvinashcraft
41 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

RNR 356 - How Vercel Built the v0 App with React Native


Fernando Rojo, Head of Mobile at Vercel, joins Mazen and Robin to discuss building the V0 mobile app with React Native and Expo, including tech stack decisions, performance optimization, and how the project pushed Vercel to improve its native tooling.

 

Show Notes

 

Connect With Us!

 

This episode is brought to you by Infinite Red!

Infinite Red is an expert React Native consultancy located in the USA. With over a decade of React Native experience and deep roots in the React Native community (hosts of Chain React and the React Native Newsletter, core React Native contributors, creators of Ignite and Reactotron, and much, much more), Infinite Red is the best choice for helping you build and deploy your next React Native app.





Download audio: https://cdn.simplecast.com/media/audio/transcoded/1208ee61-9c16-43c1-bc4c-ca790717f4a8/2de31959-5831-476e-8c89-02a2a32885ef/episodes/audio/group/4d720856-c07a-44c8-9822-4c94c9ddde64/group-item/91aae137-57d0-420e-a36c-6565425ed1ca/128_default_tc.mp3?aid=rss_feed&feed=hEI_f9Dx

#473 A clean room rewrite?

Topics covered in this episode:
Watch on YouTube

About the show

Sponsored by us! Support our work through:

Michael #1: chardet, AI, and licensing

  • Thanks Ian Lessing
  • Wow, where to start?
  • A bit of legal precedent research.
  • Chardet dispute shows how AI will kill software licensing, argues Bruce Perens on the Register
  • Also see this GitHub issue.
  • Dan Blanchard, maintainer of a Python character encoding detection library called chardet, released a new version of the library under a new software license. (LGPL → MIT)
  • Dan is allowed to make this change because v7 is a complete “clean room” rewrite using AI
  • BTW, v7 is WAY better:
    • The result is a 48x increase in detection speed for a project that lives in the hot loops of many projects. That will lead to noticeable performance increases for literally millions of users (the package gets ~130M downloads per month).
    • It paves a path towards inclusion in the standard library (assuming they don’t institute policies against using AI tools).
    • Thread-safe detect() and detect_all() with no measurable overhead; scales on free-threaded Python 3.13t+
  • An individual claiming to be Mark Pilgrim, the original creator of the library, opened an issue in the project's GitHub repo arguing that Blanchard had no right to change the software license, citing the LGPL requirement that the license remain unchanged.
  • The issue argues that a 'complete rewrite' is irrelevant, since the maintainers had ample exposure to the originally licensed code (i.e., this is not a 'clean room' implementation).
  • Blanchard disagreed, citing how version 7.0.0 and 6.0.0 compare when subjected to JPlag, a library for detecting plagiarism.
  • Blanchard told The Register he had wanted to get chardet added to the Python standard library for more than a decade, since it's a core dependency of most Python projects.

Brian #2: refined-github

  • Suggested by Matthias Schöttle
  • A browser plugin that improves the GitHub experience
  • A sampling
    • Adds a build/CI status icon next to the repo’s name.
    • Adds a link back to the PR that ran the workflow.
    • Enables tab and shift tab for indentation in comment fields.
    • Auto-resizes comment fields to fit their content so they no longer show scroll bars.
    • Highlights the most useful comment in issues.
    • Changes the default sort order of issues/PRs to Recently updated.
  • But really, it’s a huge list of improvements

Michael #3: pgdog: PostgreSQL connection pooler, load balancer and database sharder

  • PgDog is a proxy for scaling PostgreSQL.
  • It supports connection pooling, load balancing queries and sharding entire databases.
  • Written in Rust, PgDog is fast, secure and can manage thousands of connections on commodity hardware.
  • Features
    • PgDog is an application layer load balancer for PostgreSQL
    • Health Checks: PgDog maintains a real-time list of healthy hosts. When a database fails a health check, it's removed from the active rotation and queries are re-routed to other replicas
    • Single Endpoint: PgDog can detect writes (e.g. INSERT, UPDATE, CREATE TABLE, etc.) and send them to the primary, leaving the replicas to serve reads
    • Failover: PgDog monitors Postgres replication state and can automatically redirect writes to a different database if a replica is promoted
    • Sharding: PgDog is able to manage databases with multiple shards

Brian #4: Agentic Engineering Patterns

Extras

Brian:

Michael:

Joke: Ergonomic keyboard

Also pretty good and related:

Links





Download audio: https://pythonbytes.fm/episodes/download/473/a-clean-room-rewrite.mp3

Inference Engineering with Baseten's Philip Kiely


This week on the show, Scott talks to Philip Kiely about his new book, Inference Engineering. Inference Engineering is your guide to becoming an expert in inference. It contains everything that Philip has learned in four years of working at Baseten. This book is based on the hundreds of thousands of words of documentation, blogs, and talks he's written on inference; interviews with dozens of experts from Baseten's engineering team; and countless conversations with customers and builders around the world.





Download audio: https://r.zen.ai/r/cdn.simplecast.com/media/audio/transcoded/75c667ea-2739-4306-96be-e15097ef0853/24832310-78fe-4898-91be-6db33696c4ba/episodes/audio/group/4e2f255e-1fb0-4e72-9b8f-9df9bb8ac5a7/group-item/6ca9fcdf-689a-4c68-94fe-bc08d7372e82/128_default_tc.mp3?aid=rss_feed&feed=gvtxUiIf

Configure NVIDIA Blackwell GPUs for Red Hat AI workloads


The NVIDIA RTX PRO 4500 Blackwell Server Edition brings GPU acceleration to the world's most widely adopted enterprise data center and edge computing platforms. It offers a significant performance increase compared to traditional CPU-only servers. For Red Hat customers, this server edition provides compact acceleration across the Red Hat AI portfolio, including Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat AI Enterprise. This gives organizations a practical path to build, optimize, deploy, and scale AI workloads across enterprise data center and edge environments.

Optimized for Red Hat AI

The NVIDIA RTX PRO 4500 Blackwell Server Edition is a reliable choice for compact, power-efficient AI deployments. It provides inference performance without adding unnecessary operational complexity. For Red Hat AI users, it offers a practical mix of memory capacity, performance, and efficiency for running modern models in enterprise data center and edge environments.

This hardware also stands out as a compelling successor to the NVIDIA L4 for this type of deployment. With more memory, greater performance headroom, and support for low-precision inference, organizations can better tune model size, throughput, latency, and overall deployment efficiency to match workload requirements.

Quantization provides much of that value. 8-bit integer (INT8) is a widely adopted option for inference, while 4-bit integer (INT4) helps fit larger models into more constrained memory footprints. FP8 has also become increasingly important for modern accelerator-based deployments. Blackwell supports NVFP4, giving Red Hat AI users flexibility for advanced model optimization and inference.
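As a rough illustration of why bit width matters here, weight memory scales linearly with precision. The sketch below uses a hypothetical 30B-parameter model and ignores KV cache, activations, and per-block scale overhead:

```shell
# Back-of-envelope weight memory for a hypothetical 30B-parameter model.
# bytes ~= params * bits / 8; KV cache and quantization scales are ignored.
params_b=30   # parameters, in billions
for bits in 16 8 4; do
  gb=$(( params_b * bits / 8 ))
  echo "${bits}-bit weights: ~${gb} GB"
done
```

At 4 bits, a 30B-class model's weights fit within the 32 GB on a single RTX PRO 4500, which is why NVFP4 is attractive on this card.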

NVIDIA RTX PRO Servers with RTX PRO 4500 Blackwell Server Edition are also featured as part of the updated NVIDIA Enterprise AI Factory validated design and the NVIDIA AI Data Platform, a customizable reference design for building modern storage systems for enterprise agentic AI.

Configure the RTX PRO 4500 Blackwell Server Edition on Red Hat AI Enterprise

To use the RTX PRO 4500 Blackwell Server Edition in Red Hat OpenShift, install the Node Feature Discovery and the NVIDIA GPU Operator (Figure 1).

Software Catalog in Red Hat OpenShift showing a search for NVIDIA GPU Operator with one result provided by NVIDIA Corporation.
Figure 1: Search for and select the NVIDIA GPU Operator from the OpenShift Software Catalog.
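If you prefer the CLI to the console, both operators can also be installed through OLM objects. The manifest below is a sketch for the GPU Operator; the channel, catalog source, and namespace values are assumptions based on typical defaults, so verify them against your cluster's catalog with `oc get packagemanifests -n openshift-marketplace | grep gpu-operator`:

```yaml
# Sketch: OLM Namespace, OperatorGroup, and Subscription for the NVIDIA GPU
# Operator. Channel and catalog values are assumptions; verify in your cluster.
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-gpu-operator-group
  namespace: nvidia-gpu-operator
spec:
  targetNamespaces:
    - nvidia-gpu-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: gpu-operator-certified
  namespace: nvidia-gpu-operator
spec:
  channel: stable                 # assumption; check available channels
  name: gpu-operator-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
```

Apply it with `oc apply -f` on the saved file; the Node Feature Discovery Operator installs the same way from its own catalog entry.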

Set these parameters in the NVIDIA GPU Operator installation UI:

  1. Set the driver version in the NVIDIA GPU Operator ClusterPolicy to 580.126.16 (version 595 will be the officially supported NVIDIA driver release). Enter this value in the driver version field to deploy the required driver image tag across the cluster.
  2. Enter nvcr.io/nvidia in the repository field of the ClusterPolicy so the operator pulls the container from the correct registry.
  3. Enter driver in the image field of the ClusterPolicy to reference the correct driver container image.
  4. Set kernelModuleType to open in the NVIDIA GPU Operator ClusterPolicy to use open GPU kernel modules during installation.

You can also edit the ClusterPolicy directly and add these parameters:

$ oc edit clusterpolicy
driver:
   version: 580.126.16
   image: driver
   repository: nvcr.io/nvidia
   kernelModuleType: open

Once installed, you can use the RTX PRO 4500 Blackwell Server Edition with OpenShift (Figure 2).

Red Hat OpenShift terminal showing nvidia-smi output for two NVIDIA RTX PRO 4500 Blackwell GPUs with no running processes found.
Figure 2: Use the terminal to verify the installation of the NVIDIA GPUs using the nvidia-smi command.

Running nvidia-smi from the NVIDIA driver daemonset in the OpenShift web console confirms that both NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs are detected correctly.

Verify the hardware

This validation environment uses Red Hat OpenShift 4.20.15.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.20.15   True        False         24m     Cluster version is 4.20.15

The deployment uses a single-node Red Hat OpenShift cluster running Kubernetes 1.33.6.

$ oc get nodes
NAME                         STATUS   ROLES                         AGE     VERSION
redhat-validation-02-gpu01   Ready    control-plane,master,worker   6h24m   v1.33.6
Node Feature Discovery Operator installation modal in the Red Hat OpenShift Software Catalog highlighting the install button and software components.
Figure 3: Use the Node Feature Discovery Operator to manage hardware-specific labeling within your cluster.

After you install the Node Feature Discovery Operator (Figure 3), the node is identified as hosting an NVIDIA PCI device with Single Root I/O Virtualization (SR-IOV) capabilities.

$ oc describe node/redhat-validation-02-gpu01 | grep pci-10de
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true

The NVIDIA GPU Operator deploys into the nvidia-gpu-operator project.

$ oc project nvidia-gpu-operator
Now using project "nvidia-gpu-operator" on server "https://api.launchpad.nvidia.com:6443".

During installation, the NVIDIA GPU Operator starts components in sequence. These include the driver daemonset, container toolkit, device plug-in, NVIDIA Data Center GPU Manager (DCGM), GPU Feature Discovery, node status exporter, and operator validator.

$ oc get pods
NAME                                           READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-sftmv                    0/1     Init:0/1   0          2m21s
gpu-operator-595d9f95cf-rv2jr                  1/1     Running    0          13m
nvidia-container-toolkit-daemonset-5h99p       1/1     Running    0          2m21s
nvidia-dcgm-exporter-6trh8                     0/1     Init:0/2   0          2m21s
nvidia-dcgm-r5gsn                              0/1     Init:0/1   0          2m21s
nvidia-device-plugin-daemonset-j7s74           0/1     Init:0/1   0          2m21s
nvidia-driver-daemonset-9.6.20250925-0-cdrcf   2/2     Running    0          2m28s
nvidia-node-status-exporter-5wflx              1/1     Running    0          2m27s
nvidia-operator-validator-vbwlr                0/1     Init:0/4   0          2m21s

Once the installation completes, verify that the NVIDIA GPU Operator components are operational. These include the driver daemonset, MIG Manager, and the node status exporter.

$ oc get pods
NAME                                           READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-sftmv                    1/1     Running     0             3m16s
gpu-operator-595d9f95cf-rv2jr                  1/1     Running     0             14m
nvidia-container-toolkit-daemonset-5h99p       1/1     Running     0             3m16s
nvidia-cuda-validator-pv4mv                    0/1     Completed   0             42s
nvidia-dcgm-exporter-6trh8                     1/1     Running     2 (22s ago)   3m16s
nvidia-dcgm-r5gsn                              1/1     Running     0             3m16s
nvidia-device-plugin-daemonset-j7s74           1/1     Running     0             3m16s
nvidia-driver-daemonset-9.6.20250925-0-cdrcf   2/2     Running     0             3m23s
nvidia-mig-manager-w5ncg                       1/1     Running     0             23s
nvidia-node-status-exporter-5wflx              1/1     Running     0             3m22s
nvidia-operator-validator-vbwlr                1/1     Running     0             3m16s

Running nvidia-smi confirms that OpenShift exposes the NVIDIA RTX PRO 4500 Blackwell Server Edition. The output shows driver version 580.126.16 and CUDA 13.0, with the GPUs idle and ready for workload validation.

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf   -- nvidia-smi
Tue Mar 10 20:46:45 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.16             Driver Version: 580.126.16     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX PRO 4500 Blac...    On  |   00000000:17:00.0 Off |                    0 |
| N/A   33C    P8             16W /  165W |       0MiB /  32623MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX PRO 4500 Blac...    On  |   00000000:63:00.0 Off |                    0 |
| N/A   34C    P8             17W /  165W |       0MiB /  32623MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify the full GPU names with the following command:

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi --query-gpu=name --format=csv
name
NVIDIA RTX PRO 4500 Blackwell Server Edition
NVIDIA RTX PRO 4500 Blackwell Server Edition

At idle, the NVIDIA RTX PRO 4500 Blackwell Server Edition reports temperatures of 32–33°C and a power draw of approximately 17 W against a 165 W power limit.

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,power.limit,fan.speed --format=csv
index, name, temperature.gpu, power.draw [W], power.limit [W], fan.speed [%]
0, NVIDIA RTX PRO 4500 Blackwell Server Edition, 32, 16.74 W, 165.00 W, [N/A]
1, NVIDIA RTX PRO 4500 Blackwell Server Edition, 33, 17.40 W, 165.00 W, [N/A]

Each GPU exposes 32 GB of memory:

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi --query-gpu=index,name,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv
index, name, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA RTX PRO 4500 Blackwell Server Edition, 0 %, 0 %, 32623 MiB, 0 MiB, 32128 MiB
1, NVIDIA RTX PRO 4500 Blackwell Server Edition, 0 %, 0 %, 32623 MiB, 0 MiB, 32128 MiB

At idle, the graphics and streaming multiprocessor (SM) clocks run at 180 MHz, with memory clocks at 405 MHz.

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi --query-gpu=index,name,clocks.current.graphics,clocks.current.sm,clocks.current.memory --format=csv
index, name, clocks.current.graphics [MHz], clocks.current.sm [MHz], clocks.current.memory [MHz]
0, NVIDIA RTX PRO 4500 Blackwell Server Edition, 180 MHz, 180 MHz, 405 MHz
1, NVIDIA RTX PRO 4500 Blackwell Server Edition, 180 MHz, 180 MHz, 405 MHz

Topology reporting shows that the GPUs and Mellanox NICs are attached within the same platform fabric, with both GPUs sharing NUMA affinity and standard PCIe-based connectivity.

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi topo -m
        GPU0    GPU1    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NODE     X      NODE    SYS     SYS     SYS     0-31,64-95      0               N/A
NIC0    NODE    NODE     X      SYS     SYS     SYS
NIC1    SYS     SYS     SYS      X      PIX     NODE
NIC2    SYS     SYS     SYS     PIX      X      NODE
NIC3    SYS     SYS     SYS     NODE    NODE     X 
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3

MIG is disabled, compute mode remains in the default setting, and both persistence mode and ECC are enabled.

$ oc exec -it nvidia-driver-daemonset-9.6.20250925-0-cdrcf -- \
  nvidia-smi --query-gpu=index,mig.mode.current,compute_mode,persistence_mode,ecc.mode.current --format=csv
index, mig.mode.current, compute_mode, persistence_mode, ecc.mode.current
0, Disabled, Default, Enabled, Enabled
1, Disabled, Default, Enabled, Enabled

Run Red Hat AI inference

Use the registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0-1771898916 container image to run Red Hat AI Inference Server 3.3.

The Red Hat AI inference CUDA image supports NVIDIA's NVFP4 quantization format on RTX PRO 4500 Blackwell-based GPUs. This allows for efficient, low-cost large-model inference with vLLM. NVFP4 is a 4-bit floating-point format introduced with the NVIDIA Blackwell architecture that uses hardware acceleration.

We have reliably deployed NVFP4-quantized models using Red Hat AI Inference Server 3.3; results for completions, tool calling, reasoning, and accuracy are consistent with the original full-precision models. Tests confirm good accuracy for RedHatAI/Qwen3-30B-A3B-NVFP4 (TP1) and RedHatAI/Llama-3.3-70B-Instruct-NVFP4 (TP2).

|Model name                           |Completions|Chat completion|Tool calling|Accuracy|
|-------------------------------------|-----------|---------------|------------|-------:|
|RedHatAI/Qwen3-30B-A3B-NVFP4         |Yes        |Yes            |Yes         |     80%|
|RedHatAI/Llama-3.3-70B-Instruct-NVFP4|Yes        |Yes            |Yes         |     93%|

The following is a sample deployment that serves the model using Red Hat AI Inference Server. An init container downloads the model weights from Hugging Face, and the main container launches vLLM with tensor parallelism across two GPUs with tool-calling support enabled.

Create the necessary resources, such as a Hugging Face secret for authentication (needed for gated models) and a persistent volume claim for caching the model weights, and then apply the deployment:

# Create the HF token secret
oc create secret generic hf-token-secret \
  --from-literal=HUGGING_FACE_TOKEN=<your-token>
# Create a PVC for model caching
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 100Gi
EOF
# Create model deployment with basic configuration
oc apply -f - <<EOF
kind: Deployment
apiVersion: apps/v1
metadata:
  name: llm-deploy-929
  namespace: test-rhaiis
  labels:
    app: rhaiis-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rhaiis-runner
  template:
    metadata:
      name: rhaiis-runner
      labels:
        app: rhaiis-runner
    spec:
      restartPolicy: Always
      initContainers:
        - name: download
          command:
            - /bin/bash
            - '-c'
          env:
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: HF_HOME
              value: /mnt/model
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HUGGING_FACE_TOKEN
          volumeMounts:
            - name: cache-volume
              mountPath: /mnt/model
          terminationMessagePolicy: File
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0-1771898916
          args:
            - huggingface-cli download RedHatAI/Qwen3-30B-A3B-NVFP4
      imagePullSecrets:
        - name: quay-secrets
      containers:
        - resources:
            limits:
              cpu: '16'
              memory: 30Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: 10m
              memory: 29Gi
              nvidia.com/gpu: '1'
          name: rhaiis
          command:
            - /bin/bash
            - '-c'
          env:
            - name: HF_HUB_OFFLINE
              value: '0'
            - name: HF_HOME
              value: /mnt/model
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: HUGGING_FACE_TOKEN
          ports:
            - containerPort: 8000
              protocol: TCP
          volumeMounts:
            - name: cache-volume
              mountPath: /mnt/model
            - name: dshm
              mountPath: /dev/shm
          image: registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0-1771898916
          args:
            - vllm serve RedHatAI/Qwen3-30B-A3B-NVFP4 --uvicorn-log-level debug --trust-remote-code --enable-chunked-prefill --tensor-parallel-size 1 --max-model-len 10000
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: model-cache
        - name: dshm
          emptyDir:
            medium: Memory
EOF

Use the following commands and outputs to validate model completions, chat performance, and accuracy benchmarks.

1. Completion (POST /v1/completions)
  curl -s -X POST http://localhost:9000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "RedHatAI/Qwen3-30B-A3B-NVFP4",
"prompt": "The capital of France is",
"max_tokens": 32,
"temperature": 0.0
}' | jq -r '.choices[0].text'
 Paris. The capital of the United Kingdom is London. The capital of the United States is Washington, D.C. The capital of Germany is Berlin. The capital
2. Chat completion (single turn):
 curl -s -X POST http://localhost:9000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
        "model": "RedHatAI/Qwen3-30B-A3B-NVFP4",
        "messages": [{"role": "user", "content": "What is the capital of France? Answer in one sentence."}],
        "max_tokens": 64,
        "temperature": 0.0
    }'
HTTP STATUS: 200
3. Accuracy (gsm8k):
local-completions ({'model': 'RedHatAI/Qwen3-30B-A3B-NVFP4', 'base_url': 'http://localhost:9000/v1/completions', 'num_concurrent': 100, 'tokenized_requests': False}), gen_kwargs: ({'max_gen_toks': 4048}), limit: None, num_fewshot: None, batch_size: 16
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9067|±  |0.0080|
|     |       |strict-match    |     5|exact_match|↑  |0.9052|±  |0.0081|
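Tool calling can be exercised through the same OpenAI-compatible endpoint. This sketch builds a request payload with a hypothetical get_weather tool (the tool schema, file path, and port are illustrative, not part of the deployment above) and validates the JSON before sending:

```shell
# Build a tool-calling request for the OpenAI-compatible chat endpoint.
# get_weather is a hypothetical example tool for illustration only.
cat > /tmp/tool_call.json <<'EOF'
{
  "model": "RedHatAI/Qwen3-30B-A3B-NVFP4",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "tool_choice": "auto",
  "max_tokens": 64
}
EOF
python3 -m json.tool /tmp/tool_call.json > /dev/null && echo "payload OK"
# Send it against the running server (adjust host/port for your route):
# curl -s -X POST http://localhost:9000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @/tmp/tool_call.json
```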

Performance validation

After confirming accuracy with NVFP4 quantized models, we validated performance characteristics using the GuideLLM benchmarking tool. The tests measured throughput and latency across five NVFP4 models deployed with Red Hat AI Inference Server 3.3 on the RTX PRO 4500 Blackwell Server Edition GPUs. See the full list of NVFP4 quantized models from Red Hat.

Test configuration

The validation used a standardized workload profile with 1,000 input tokens and 1,000 output tokens per request. We tested multiple concurrency levels to identify throughput limits and latency behavior under load. Each concurrency level ran for 2-4 minutes to ensure stable measurements.

All deployments used a dual-replica configuration with tensor parallelism set to 1 (TP=1), meaning each replica ran on a single GPU.
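A concurrency sweep like the one described above can be driven by a small loop. The guidellm flags shown are assumptions based on recent versions of the tool, so confirm them with `guidellm benchmark --help`; the loop below only echoes the commands as a dry run:

```shell
# Dry run: print one benchmark command per concurrency level using the
# 1,000-in / 1,000-out profile. Flag names are assumptions; verify locally.
TARGET="http://localhost:9000"
MODEL="RedHatAI/Qwen3-30B-A3B-NVFP4"
for conc in 10 25 50 100 150 225; do
  echo "guidellm benchmark --target $TARGET --model $MODEL" \
       "--rate-type concurrent --rate $conc --max-seconds 180" \
       "--data prompt_tokens=1000,output_tokens=1000"
done
```

Remove the `echo` (keeping the quoting intact) once the flags are confirmed for your guidellm version.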

Performance results

The following table shows peak throughput and peak SLO-compliant throughput for each model. Peak SLO-compliant concurrency is the highest level where P99 Time to First Token (TTFT) is at or below 3,000 ms and P99 Inter-Token Latency (ITL) is at or below 80 ms.

Performance summary for the 2x NVIDIA RTX PRO 4500 Blackwell Server Edition. Throughput values represent output tokens per second. SLO-compliant concurrency is the maximum number of concurrent requests that maintains the SLO (P99 TTFT ≤ 3 s, P99 ITL ≤ 80 ms).

|Model                |Size|Peak throughput (tok/s)|Peak concurrency|Peak SLO-compliant throughput (tok/s)|Peak SLO-compliant concurrency|P99 TTFT (ms)|P99 ITL (ms)|
|---------------------|---:|----------------------:|---------------:|------------------------------------:|-----------------------------:|------------:|-----------:|
|Llama-3.1-8B         |  8B|                  3,515|             225|                                2,847|                           100|        2,645|          31|
|Qwen3-8B             |  8B|                  2,966|             225|                                2,421|                           100|        2,531|          32|
|Qwen3-14B            | 14B|                  2,225|             150|                                1,339|                            50|        2,719|          33|
|Mistral-Small-3.2-24B| 24B|                  1,625|             170|                                  688|                            30|        2,137|          34|
|Qwen3-32B            | 32B|                    666|              50|                                  333|                            20|        2,076|          43|
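The SLO gate used here reduces to two threshold checks, sketched below with P99 values taken from the table:

```shell
# SLO check: P99 TTFT <= 3000 ms and P99 ITL <= 80 ms.
slo_ok() {  # usage: slo_ok <p99_ttft_ms> <p99_itl_ms>
  if [ "$1" -le 3000 ] && [ "$2" -le 80 ]; then
    echo "SLO met"
  else
    echo "SLO violated"
  fi
}
slo_ok 2645 31   # Llama-3.1-8B at its SLO-compliant peak -> SLO met
slo_ok 2076 43   # Qwen3-32B at its SLO-compliant peak    -> SLO met
```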

Key findings:

  • The 8B models demonstrate linear throughput scaling up to 100 concurrent requests and maintain sub-3 second P99 response times.
  • The 14B model provides a balance between capability and performance, supporting up to 50 concurrent requests within the prescribed SLO.
  • The 24B and larger models are best suited for lower-concurrency workloads where model capability is prioritized over throughput.

The scaling behavior for these models across concurrent requests is shown in Figure 4, and the comparison of peak versus SLO-compliant throughput is shown in Figure 5.

Line chart comparing LLM output throughput across concurrent requests, showing Llama-3.1-8B achieving the highest peak performance.
Figure 4: Output throughput versus concurrency comparison for various models, highlighting peak and SLO-compliant operating points. Blue stars mark peak throughput for each model. Green stars mark the peak SLO-compliant operating point (P99 TTFT ≤3s, P99 ITL ≤80ms).
Bar chart comparing peak and SLO-compliant throughput across five LLMs, with Llama-3.1-8B achieving the highest performance.
Figure 5: Comparison of peak versus SLO-compliant throughput for Llama, Qwen, and Mistral models. SLO-compliant throughput represents the maximum sustainable performance while meeting strict latency SLOs.

Conclusion

The NVFP4 quantized models running on dual RTX PRO 4500 Blackwell Server Edition GPUs deliver high-speed inference performance across various model sizes. This platform demonstrates that 4-bit NVFP4 quantization, combined with modern GPU architecture and optimized inference engines, delivers reliable AI inference at scale.

Red Hat OpenShift AI

With the accelerator environment already prepared and validated, the next step is to add Red Hat OpenShift AI so teams can start using those resources for model serving, inference, and other AI workflows at scale. This is the point where the validated hardware configuration becomes available through the OpenShift AI experience and can be used by data scientists, developers, and platform teams.

Install Red Hat OpenShift AI from the Software Catalog using the stable-3.x channel and version 3.3.0. Once installed, the platform can make use of the available accelerator resources for AI workloads (Figure 6).

OpenShift AI installation dialog in the Software Catalog showing selections for the stable-3.x channel and version 3.3.0.
Figure 6: The Red Hat OpenShift AI operator installation interface within the Software Catalog.

To make the NVIDIA RTX PRO 4500 Blackwell Server Edition available as a reusable accelerator option in Red Hat OpenShift AI, we created a dedicated hardware profile. In OpenShift AI, hardware profiles define the resource configuration that users can select for workbenches and other AI workloads, combining CPU, memory, and accelerator settings into a single reusable profile.

For this configuration, we created a profile named NVIDIA RTX PRO 4500 Blackwell Server Edition and associated it with the accelerator resource identifier nvidia.com/gpu. We then defined the default and allowed resource ranges for CPU, memory, and GPU allocation. In this example, the profile was configured with a default of 2 CPU cores, 16 GiB of memory, and 1 GPU, with support for scaling to 8 CPU cores, 32 GiB of memory, and 2 GPUs as required (Figure 7).

Hardware profiles table in OpenShift AI showing resource settings for a profile with CPU, memory, and GPU identifiers.
Figure 7: The Hardware profiles interface displaying configured node resource limits for CPU, memory, and NVIDIA accelerators.
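For reference, a workload admitted under this profile's defaults carries a pod resources stanza equivalent to the following sketch. Note that Kubernetes requires extended resources such as nvidia.com/gpu to have matching request and limit values, so selecting 2 GPUs changes both fields:

```yaml
# Sketch of the resources stanza implied by the profile defaults
# (2 cores / 16 GiB / 1 GPU, scalable to 8 cores / 32 GiB / 2 GPUs).
resources:
  requests:
    cpu: "2"
    memory: 16Gi
    nvidia.com/gpu: "1"   # extended resource: request must equal limit
  limits:
    cpu: "8"
    memory: 32Gi
    nvidia.com/gpu: "1"
```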

After the profile is updated, it is listed as an enabled hardware profile in OpenShift AI and can be used as a standard accelerator-backed configuration for supported workloads (Figure 8).

Hardware profiles list in Red Hat OpenShift AI showing the newly created NVIDIA RTX PRO 4500 Blackwell Server Edition profile as enabled.
Figure 8: The enabled hardware profile is now available in the Red Hat OpenShift AI dashboard for workload allocation.

For example, we created a distributed training job using Kubeflow Trainer to fine-tune a large language model (LLM) on Red Hat OpenShift AI with two NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs. Figure 9 shows the training configuration and metrics during distributed fine-tuning, visualized with TensorBoard directly from a Jupyter notebook.

Four line graphs in TensorBoard showing increasing training epoch and decreasing gradient norm, learning rate, and loss over time.
Figure 9: Distributed model training job metrics using Kubeflow Trainer with the RTX PRO 4500 Blackwell Server Edition.

Figure 10 displays the OpenShift web console observability dashboard, which allows you to monitor the GPU metrics in real time and shows the high utilization of the two NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs during the fine-tuning job.

OpenShift console Metrics dashboard displaying a stacked area chart tracking real-time utilization peaks for two NVIDIA GPU instances.
Figure 10: OpenShift AI GPU metrics.

Summary and next steps

The NVIDIA RTX PRO 4500 Blackwell Server Edition provides a clear upgrade path for teams moving beyond the NVIDIA L4. By using the NVFP4 format on Red Hat OpenShift, you can maximize inference efficiency while maintaining a compact hardware footprint. Use the configuration steps in this guide to begin validating Blackwell-class workloads in your environment.

Learn more about the NVIDIA RTX PRO 4500 Blackwell Server Edition GPU and view the technical specifications.

The post Configure NVIDIA Blackwell GPUs for Red Hat AI workloads appeared first on Red Hat Developer.


Building a Remote MCP Server with .NET 10 and Prompts

Model Context Protocol (MCP) gives AI clients a standard way to discover and call server capabilities. Most examples focus on tools first, but this project demonstrates an equally important direction: prompt-first server design.

In this post, we walk through a practical ASP.NET Core implementation that exposes:

  • tool operations (for executable tasks)
  • reusable prompts (for instruction templates and interaction quality)
  • HTTP MCP transport for remote clients
  • lightweight operational endpoints for health and diagnostics

By the end, you will understand how this server is structured, how prompts are registered and invoked, and how to expand the design for production scenarios.

Why Prompt-First MCP Servers Matter

Tools answer what the server can do. Prompts shape how those capabilities are used.

In assistant systems, quality often depends on strong instruction templates:

  • consistent phrasing
  • clear required input
  • predictable output structure
  • fallback behavior when user requests are ambiguous

When prompts are exposed through MCP, clients can discover them dynamically and compose better experiences without hardcoding all instructions in the client itself.

This creates a cleaner split of responsibilities:

  • server owns domain guidance and prompt governance
  • client owns orchestration and user interface

Get Started

The complete, production-ready source code is available on GitHub:

remote-MCP-servers-using-dotnet-sdk-prompts

Clone it locally:

git clone https://github.com/azurecorner/remote-MCP-servers-using-dotnet-sdk-prompts.git

In the repository you’ll find:

  • Full source code for tool, prompt, and resource implementations
  • PowerShell scripts for testing MCP capabilities
  • Configuration files for local VS Code MCP client integration
  • Examples ready to extend and adapt for your own domain

Project Overview

This repository contains a remote MCP server built on:

  • ASP.NET Core
  • ModelContextProtocol.AspNetCore
  • .NET 10

Main capabilities:

  • MCP endpoint at /mcp
  • health check endpoint at /api/healthz
  • weather tool integration through Open-Meteo
  • prompt discovery and retrieval through prompts/list and prompts/get

Core source layout:

  • src/McpServer/McpServer: MCP host app, tools, prompts, resources, startup
  • src/McpServer/WeatherService: external weather API integration
  • scripts: PowerShell automation for MCP protocol calls

Runtime and Transport Design

The server is configured with both HTTP transport and stdio transport. HTTP is the primary remote integration path, while stdio can be useful for local toolchain and debugging workflows.

At startup, the app:

  1. reads FUNCTIONS_CUSTOMHANDLER_PORT (default 8081)
  2. binds Kestrel to 0.0.0.0:port
  3. registers tools, resources, and prompts in the MCP pipeline
  4. maps /mcp for protocol traffic
  5. maps /api/healthz for liveness checks

This split is operationally useful:

  • /mcp handles AI protocol traffic
  • /api/healthz supports probes and platform health monitors
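The startup wiring above can be sketched as follows. This is a condensed illustration based on the ModelContextProtocol.AspNetCore SDK; the tool container name (McpServerTools) is an assumption for illustration, so consult the repository's Program.cs for the authoritative version.

```csharp
var builder = WebApplication.CreateBuilder(args);

// 1-2. Read the host-provided port (default 8081) and bind Kestrel to 0.0.0.0.
var port = Environment.GetEnvironmentVariable("FUNCTIONS_CUSTOMHANDLER_PORT") ?? "8081";
builder.WebHost.UseUrls($"http://0.0.0.0:{port}");

// 3. Register the MCP pipeline: HTTP transport plus tools and prompts.
//    McpServerTools is a hypothetical name; see the repository for the real types.
builder.Services
    .AddMcpServer()
    .WithHttpTransport()
    .WithTools<McpServerTools>()
    .WithPrompts<McpServerPrompts>();

var app = builder.Build();

// 4-5. Map the protocol endpoint and the liveness probe separately.
app.MapMcp("/mcp");                          // AI protocol traffic
app.MapGet("/api/healthz", () => "Healthy"); // platform health monitors

app.Run();
```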

Prompt Architecture Deep Dive

Prompt definitions live in a dedicated prompt container class decorated with an MCP prompt type attribute.

Prompt Container Responsibilities

The prompt class has three key responsibilities:

  • define discoverable prompt methods
  • document methods and arguments with metadata
  • optionally log invocation context for observability

The implementation includes dependency-injected logging, making it straightforward to trace usage patterns in real deployments.

Prompt 1: default_prompt

Purpose:

  • provide a baseline system instruction for concise, reliable responses

Behavior:

  • returns a compact instruction block emphasizing brevity and non-hallucination
  • includes explicit uncertainty handling by asking for clarification when needed

Why this is valuable:

  • offers a reusable default instruction that clients can apply consistently
  • avoids duplicating base instruction text across multiple clients
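As a sketch, default_prompt might be implemented like this; the actual instruction text in the repository may differ:

```csharp
[McpServerPrompt(Name = "default_prompt")]
[Description("Baseline system instruction for concise, reliable responses")]
public string DefaultPrompt() =>
    """
    You are a concise, reliable assistant.
    - Keep answers short and direct.
    - Never invent facts; state clearly when you are unsure.
    - If the request is ambiguous, ask a clarifying question before answering.
    """;
```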

Prompt 2: weather_query_guide

Purpose:

  • teach users or orchestrators how to ask weather questions effectively

Behavior:

  • returns structured guidance with examples and best practices
  • accepts optional userContext and appends it when provided

Why argumentized prompts are powerful:

  • one prompt template can adapt to locale, user preferences, or interaction history
  • client-side logic remains simple while server controls guidance quality

Prompt 3: weather_data_interpretation

Purpose:

  • standardize how weather tool output should be interpreted and presented

Behavior:

  • explains expected weather fields and presentation strategy
  • includes recommendation patterns, for example suggesting umbrella advice when appropriate

Why this helps:

  • decouples raw tool output from user-facing narrative quality
  • improves consistency across multi-client environments

Prompt Discovery and Invocation Flow

MCP interaction is straightforward:

  1. Client calls prompts/list
  2. Server returns prompt metadata including names, descriptions, and argument schema
  3. Client calls prompts/get with a selected prompt name (and optional arguments)
  4. Server returns prompt content ready for orchestration

This means client authors can build dynamic UIs or agent planners that adapt to server capabilities at runtime, without shipping fixed prompt catalogs.
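On the wire, this flow is plain JSON-RPC over the /mcp endpoint. For example, retrieving the weather_query_guide prompt with its optional argument looks like this (request shown; the server's response wraps the prompt content in a messages array):

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "prompts/get",
  "params": {
    "name": "weather_query_guide",
    "arguments": { "userContext": "user is planning a weekend trip" }
  }
}
```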

Local Run and Verification

From repo root:

dotnet restore .\src\McpServer\McpServer.slnx
dotnet run --project .\src\McpServer\McpServer\McpServer.csproj

Default MCP endpoint:

  • http://localhost:8081/mcp (Kestrel binds to 0.0.0.0:8081)

Health check:

  • http://localhost:8081/api/healthz

Then, in a second terminal, verify protocol behavior with scripts.

Script-Driven MCP Validation

The scripts folder includes ready-to-use protocol calls for tools, prompts, and optional resources.

List prompts:

.\scripts\list-mcp-server-prompts.ps1

Get a specific prompt:

.\scripts\call-mcp-prompt.ps1 -PromptName "default_prompt"

Get weather query guide:

.\scripts\call-mcp-prompt.ps1 -PromptName "weather_query_guide"

Get weather interpretation guide:

.\scripts\call-mcp-prompt.ps1 -PromptName "weather_data_interpretation"

Validate tools too:

.\scripts\list-mcp-server-tools.ps1
.\scripts\call-mcp-tool.ps1 -toolName "ping" -toolParams @{ message = "hello from test" }
.\scripts\call-mcp-tool.ps1 -toolName "get_weather" -toolParams @{ city = "Paris" }

How to Add a New Prompt Correctly

Recommended process:

  1. Add a method to the prompt container class.
  2. Decorate it as an MCP prompt.
  3. Add clear descriptions on method and arguments.
  4. Return string or Task<string>.
  5. Keep output deterministic and easy for clients to reuse.
  6. Verify with prompts/list and prompts/get.

For example, a new prompt added by following these steps might look like:
[McpServerPrompt]
[Description("Provides concise guidance for travel weather planning")]
public Task<string> TravelWeatherGuide(
    [Description("Destination city")] string city,
    [Description("Optional number of travel days")] int? days = null)
{
    var output = $"""
    # Travel Weather Guide

    Destination: {city}
    Duration: {(days.HasValue ? $"{days} day(s)" : "unspecified")}

    Ask for:
    - daily highs/lows
    - rain probability
    - wind conditions
    """;

    return Task.FromResult(output);
}

Prompts are defined in src/McpServer/McpServer/McpServerPrompts.cs and registered in src/McpServer/McpServer/Program.cs with:

.WithPrompts<McpServerPrompts>();
