Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

MariaDB innovation: binlog_storage_engine, 48-core server, Insert Benchmark


MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB). See this blog post for more details on the new feature.

My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a large (48-core) server. Storage on this server has a low fsync latency while the small server has high fsync latency.

tl;dr

  • binlog storage engine makes some things better without making other things worse
  • binlog storage engine doesn't make all write-heavy steps faster because the commit path isn't the bottleneck in all cases on a server with storage that has low fsync latency

tl;dr for a CPU-bound workload

  • the l.i0 step (load in PK order) is ~1.3X faster with binlog storage engine
  • the l.i2 step (write-only with smaller transactions) is ~1.5X faster with binlog storage engine
tl;dr for an IO-bound workload
  • the l.i0 step (load in PK order) is ~1.08X faster with binlog storage engine
Builds, configuration and hardware

I compiled MariaDB 12.3.1 from source.

The server has 48 cores and 128G of RAM. Storage is 2 NVMe devices with ext4 (discard enabled) and RAID. The OS is Ubuntu 22.04. AMD SMT is disabled. The SSD has low fsync latency.

I tried 4 my.cnf files. The z12c configs enable the binlog storage engine and the dw0 variants disable the InnoDB doublewrite buffer:
  • z12b_sync
  • z12c_sync
    • my.cnf.cz12c_sync_c32r128 (z12c_sync) is like cz12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
  • z12b_sync_dw0
  • z12c_sync_dw0
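For reference, a hedged sketch of what the relevant my.cnf fragment might look like. The option value and the companion sync setting are assumptions, not copied from the tested configs:

```ini
[mysqld]
# New in MariaDB 12.3: store the binlog in InnoDB rather than in raw files
binlog_storage_engine = innodb
# Assumed sync-on-commit setting; with the binlog inside InnoDB,
# the InnoDB redo log fsync is the only fsync on the commit path
innodb_flush_log_at_trx_commit = 1
```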
The Benchmark

The benchmark is explained here. It was run with 20 clients for two workloads:
  • CPU-bound - the database is cached by InnoDB, but there is still much write IO
  • IO-bound - most, but not all, benchmark steps are IO-bound
The benchmark steps are:

  • l.i0
    • insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 10M for CPU-bound and 200M for IO-bound.
  • l.x
    • create 3 secondary indexes per table. There is one connection per client.
  • l.i1
    • use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
  • l.i2
    • like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
    • Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
  • qr100
    • use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 3600 seconds.
  • qp100
    • like qr100 except uses point queries on the PK index
  • qr500
    • like qr100 but the insert and delete rates are increased from 100/s to 500/s
  • qp500
    • like qp100 but the insert and delete rates are increased from 100/s to 500/s
  • qr1000
    • like qr100 but the insert and delete rates are increased from 100/s to 1000/s
  • qp1000
    • like qp100 but the insert and delete rates are increased from 100/s to 1000/s
Results: summary

The performance reports are here for CPU-bound and IO-bound.

The summary sections from the performance reports have 3 tables. The first shows absolute throughput for each DBMS tested × benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes across configurations. The third table makes it easy to see which DBMS+configs failed to meet the SLA. From the third table for the IO-bound workload I see that there were failures to meet the SLA for qp500, qr500, qp1000 and qr1000.

I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base), where $me is the result for some config and $base is the result for the base config.

When relative QPS is > 1.0 then performance improved relative to the base config. When it is < 1.0 then there is a regression. The Q in relative QPS measures:
  • insert/s for l.i0, l.i1, l.i2
  • indexed rows/s for l.x
  • range queries/s for qr100, qr500, qr1000
  • point queries/s for qp100, qp500, qp1000
Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.
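The relative QPS arithmetic above can be sketched in a few lines; the throughput numbers here are illustrative, not measured:

```python
def relative_qps(qps_me: float, qps_base: float) -> float:
    """Relative QPS: > 1.0 is an improvement over the base, < 1.0 is a regression."""
    return qps_me / qps_base

# Illustrative values only: base config vs config under test on one step
base = 100_000  # inserts/s for the base config
me = 132_000    # inserts/s for the config under test
print(round(relative_qps(me, base), 2))  # → 1.32
```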

I often use context switch rates as a proxy for mutex contention.

Results: CPU-bound

The summary is here.
  • Disabling the InnoDB doublewrite buffer doesn't improve performance.
With and without the InnoDB doublewrite buffer enabled, enabling the binlog storage engine improves throughput a lot for two of the write-heavy steps while there are only small changes on the other two write-heavy steps:
  • l.i0, load in PK order, gets ~1.3X more throughput
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 1/2
      • KB written to storage per insert (wkbpi) is a bit smaller
      • context switches per insert (cspq) are reduced by about 1/3
  • l.x, create secondary indexes, is unchanged
    • when the binlog storage engine is enabled (see here)
      • storage writes per insert (wpi) are reduced by about 4/5
      • KB written to storage per insert (wkbpi) are reduced almost in half
      • context switches per insert (cspq) are reduced by about 1/4
  • l.i1, write-only with larger transactions, is unchanged
  • l.i2, write-only with smaller transactions, gets ~1.5X more throughput
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                         l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128      1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128      1.32  1.02  0.99  1.52  1.01   1.02   1.01   1.02   1.01    1.01
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128  1.00  0.94  1.00  1.03  1.03   1.02   1.03   1.02   1.03    1.02
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128  1.31  1.04  1.00  1.55  1.01   1.02   1.02   1.02   1.02    1.02

Results: IO-bound

The summary is here.
  • For the read-write steps the insert SLA was not met for qr500, qp500, qr1000 and qp1000 as those steps needed more IOPS than the storage devices can provide.
  • Disabling the InnoDB doublewrite buffer improves throughput by ~1.25X on the l.i2 step (write-only with smaller transactions) but doesn't change performance on the other steps.
    • as expected there is a large reduction in KB written to storage (see wkbpi here)
  • Enabling the binlog storage engine improves throughput by 9% and 8% on the l.i0 step (load in PK order) but doesn't have a significant impact on other steps.
    • with the binlog storage engine there is a large reduction in storage writes per insert (wpi), a small reduction in KB written to storage per insert (wkbpi) and small increases in CPU per insert (cpupq) and context switches per insert (cspq) -- see here
The second table from the summary section has been inlined below. That table shows relative throughput which is: (QPS for my config / QPS for z12b_sync)

dbms                                         l.i0  l.x   l.i1  l.i2  qr100  qp100  qr500  qp500  qr1000  qp1000
ma120301_rel_withdbg.cz12b_sync_c32r128      1.00  1.00  1.00  1.00  1.00   1.00   1.00   1.00   1.00    1.00
ma120301_rel_withdbg.cz12c_sync_c32r128      1.09  1.01  0.99  1.01  0.99   0.99   0.97   0.97   1.00    0.98
ma120301_rel_withdbg.cz12b_sync_dw0_c32r128  1.01  1.01  1.01  1.25  1.01   1.04   0.80   1.31   0.94    0.90
ma120301_rel_withdbg.cz12c_sync_dw0_c32r128  1.08  1.01  1.00  1.26  0.99   1.04   0.68   1.31   0.93    0.90





Read the whole story
alvinashcraft
22 minutes ago
reply
Pennsylvania, USA
Share this story
Delete

ClawRoute Technical Architecture: How Smart Model Routing Works


Overview

ClawRoute is a distributed AI routing system that intelligently routes requests across multiple LLM providers using a unified 0-100 scoring system, Thompson Sampling for exploration/exploitation balance, circuit breakers for fault tolerance, predictive rate limiting, and multi-provider support. The system optimizes for cost, speed, and reliability while providing zero-configuration developer APIs.

Core Architecture

1. Request Router (router.py)

The main entry point that receives requests and routes them based on:

  • Unified 0-100 quality score (task-specific weights)
  • Cost optimization
  • Latency requirements
  • Availability and health status

Key Features:

  • Unified Scoring System: All models rated 0-100 with weights adjusted per task type
  • Thompson Sampling: Balances exploration and exploitation for model selection
  • Smart Fallback: Automatic switching when primary model underperforms
  • Global Distribution: Routes to geographically closest healthy endpoints

2. Provider Adapters

Modular adapters for each LLM provider:

OpenAI Adapter

  • GPT-3.5, GPT-4, GPT-4 Turbo support
  • API key rotation and rate limit handling

Anthropic Adapter

  • Claude 3 family support
  • API key management

Google Adapter

  • Gemini Pro/Ultra support

Custom Endpoints

  • Self-hosted OpenAI-compatible models
  • Local LLM deployments

3. Unified 0-100 Scoring System

Every model response receives a score from 0-100 based on five dimensions, with weights that adjust based on task type:

final_score = (0.25 * relevance) + (0.20 * coherence) + (0.20 * completeness) + 
              (0.15 * latency_score) + (0.10 * cost_efficiency) + (0.10 * task_specific)

Scoring Dimensions (0-100 each):

  • Relevance: Does response address the prompt? (semantic similarity)
  • Coherence: Is response logically structured and consistent?
  • Completeness: Does it fully answer the question?
  • Latency Score: Normalized response time (faster = higher score)
  • Cost Efficiency: Quality per dollar spent
  • Task Specific: Custom dimension based on use case

Task-Specific Weight Examples:

  • Coding Tasks: Quality weight increased to 0.35, latency reduced to 0.10
  • Creative Writing: Relevance weight 0.30, coherence 0.25
  • Data Analysis: Completeness weight 0.30, cost efficiency 0.15
  • Real-time Chat: Latency weight 0.25, relevance 0.20
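A sketch of how per-task overrides might be applied on top of the default weights from the formula above. Mapping "quality" to the completeness dimension and renormalizing so weights sum to 1.0 are assumptions, not documented ClawRoute behavior:

```python
# Default weights from the final_score formula above
DEFAULT_WEIGHTS = {
    "relevance": 0.25, "coherence": 0.20, "completeness": 0.20,
    "latency": 0.15, "cost_efficiency": 0.10, "task_specific": 0.10,
}

# Per-task overrides; "coding" echoes the example above (assumed mapping)
TASK_OVERRIDES = {
    "coding": {"completeness": 0.35, "latency": 0.10},
}

def get_task_weights(task_type):
    weights = dict(DEFAULT_WEIGHTS)
    weights.update(TASK_OVERRIDES.get(task_type, {}))
    # Renormalize so the weights always sum to 1.0
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}
```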

4. Thompson Sampling for Model Selection

Instead of static routing, ClawRoute treats each model as a "bandit arm" and uses Thompson Sampling to balance exploration and exploitation:

For each request:
  1. Sample from each model's Beta(α, β) distribution
     where α = successes + 1, β = failures + 1
  2. Select model with highest sampled value
  3. Execute request
  4. Observe outcome (score 0-100)
  5. Update distribution:
        if score >= threshold: α += 1
        else: β += 1

This dynamically shifts traffic toward better-performing models while still testing alternatives.
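The loop above can be sketched as runnable Python; the success threshold and the model names are illustrative assumptions:

```python
import random

class ThompsonRouter:
    """Pick among models by sampling each arm's Beta(successes+1, failures+1)."""

    def __init__(self, models, threshold=60):
        self.stats = {m: {"alpha": 1, "beta": 1} for m in models}
        self.threshold = threshold  # score >= threshold counts as a success

    def select(self):
        # Sample each model's Beta distribution; pick the highest draw
        draws = {m: random.betavariate(s["alpha"], s["beta"])
                 for m, s in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, model, score):
        # Shift the distribution toward the observed outcome
        if score >= self.threshold:
            self.stats[model]["alpha"] += 1
        else:
            self.stats[model]["beta"] += 1

router = ThompsonRouter(["gpt-4", "claude-3", "gemini"])
model = router.select()
router.update(model, score=85)  # 85 >= 60, so this counts as a success
```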

5. Circuit Breaker Pattern

Prevents cascading failures with three states:

CLOSED ──[failures ≥ threshold]──▶ OPEN
   ▲                                 │
   │                            [timeout]
   │                                 ▼
   └──────[probe success]──── HALF-OPEN

Configuration:

  • Failure threshold: 5 consecutive low scores (< 60)
  • Timeout: 30 seconds before half-open
  • Half-open: Allow one test request
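A minimal sketch of that state machine with the configuration above, assuming a 0-100 score is reported after every call:

```python
import time

class CircuitBreaker:
    """Three-state breaker keyed on low scores rather than exceptions."""

    def __init__(self, failure_threshold=5, timeout=30.0, score_floor=60):
        self.failure_threshold = failure_threshold  # consecutive low scores
        self.timeout = timeout                      # seconds before OPEN -> HALF-OPEN
        self.score_floor = score_floor              # scores below this are failures
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def allow(self):
        if self.state == "OPEN":
            if time.time() - self.opened_at >= self.timeout:
                self.state = "HALF-OPEN"  # admit exactly one probe request
                return True
            return False
        if self.state == "HALF-OPEN":
            return False  # a probe is already in flight
        return True  # CLOSED

    def record(self, score):
        if score < self.score_floor:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
        else:
            self.failures = 0
            self.state = "CLOSED"
```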

6. Predictive Rate Limiting

Learns provider limits from 429 responses:

import time
from collections import deque

class AdaptiveRateLimiter:
    def __init__(self, provider):
        self.provider = provider
        self.window = 60          # seconds
        self.requests = deque()   # timestamps of recent requests
        self.limit = None         # learned from 429 responses
        self.safety_margin = 0.8  # stay under 80% of the learned limit

    def allow_request(self):
        now = time.time()
        # Drop timestamps that have aged out of the window
        while self.requests and self.requests[0] < now - self.window:
            self.requests.popleft()

        # Predictive check: stop before reaching the learned limit
        if self.limit and len(self.requests) >= self.limit * self.safety_margin:
            return False

        self.requests.append(now)
        return True

    def record_429(self):
        # A 429 reveals the provider's per-window limit; remember it
        self.limit = max(1, len(self.requests))

7. Multi-Provider Abstraction

Unified interface hides provider differences:

response = clawroute.generate(
    prompt="Explain RSA encryption",
    task_type="coding",  # Adjusts scoring weights
    max_tokens=500
)

Provider Capabilities Matrix:

Provider     Models         Avg Score (0-100)  Cost/1K Tokens  RPM Limit
OpenAI       GPT-4 Turbo    88                 $0.03           10,000
Anthropic    Claude 3 Opus  92                 $0.075          1,000
Google       Gemini Ultra   85                 $0.015          2,000
Self-hosted  Llama 3 70B    82                 $0.002          Unlimited

Technical Implementation

Request Flow

def route_request(request):
    # 1. Apply task-specific weights
    weights = get_task_weights(request.task_type)

    # 2. Thompson Sampling selects candidate models
    candidates = thompson_sample(request.context)

    # 3. Filter by circuit breaker state
    healthy = [m for m in candidates if circuit_breaker[m].state == "CLOSED"]

    # 4. Check predictive rate limits
    available = [m for m in healthy if rate_limiter[m].allow_request()]

    # 5. Select highest expected score among the remaining candidates
    selected = max(available, key=lambda m: m.beta_distribution.mean())

    # 6. Execute and score
    response = providers[selected].call(request)
    score = score_response(response, weights)

    # 7. Update learning systems
    update_thompson(selected, score)
    update_rate_limiter(selected, response.headers)
    return response

Scoring Algorithm

def score_response(response, weights):
    # `request` refers to the routing request in the enclosing scope
    scores = {
        'relevance': semantic_similarity(response, request.prompt) * 100,
        'coherence': coherence_model.score(response) * 100,
        'completeness': completeness_check(response, request) * 100,
        'latency': normalize_latency(response.latency) * 100,
        'cost_efficiency': (base_quality / response.cost) * 100,
        'task_specific': task_specific_scorer[request.task_type](response)
    }

    return sum(scores[k] * weights[k] for k in weights)

Deployment & Scaling

Horizontal Scaling

  • Stateless router instances behind load balancer
  • Shared Redis for scoring history and rate limit tracking
  • Consistent hashing for provider affinity
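Provider affinity via consistent hashing can be sketched as a simple hash ring; the vnode count and provider names are illustrative, not ClawRoute's actual implementation:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring mapping request keys to provider endpoints."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First vnode clockwise from the key's hash position
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]
```

Because only the keys adjacent to a removed node move, adding or removing a provider remaps a small fraction of traffic instead of reshuffling everything.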

Database Schema

model_performance (
    model_id, 
    timestamp, 
    task_type, 
    score_0_100,
    latency_ms,
    cost_usd,
    success_bool
)

rate_limit_state (
    provider, 
    window_start, 
    request_count, 
    learned_limit
)

Monitoring

  • Real-time score distributions per model
  • Alert on scoring distribution shifts (model drift)
  • Track cost savings vs baseline routing
  • Latency and success rate dashboards

Performance Impact

A/B Test Results (vs Round Robin)

Metric             Round Robin  ClawRoute  Improvement
Avg Score (0-100)  76.2         84.7       +11.2%
Cost per 1K req    $12.40       $8.90      -28.2%
P95 Latency        3.2s         2.1s       -34.4%
Success Rate       96.8%        99.3%      +2.6%

Task-Specific Gains

  • Code Generation: 22% higher quality scores
  • Customer Support: 18% faster responses
  • Content Creation: 15% better coherence

Getting Started

Install via npm:

npm install @clawroute/sdk

Initialize with providers:

import { ClawRoute } from '@clawroute/sdk';

const ai = new ClawRoute({
  providers: {
    openai: { apiKey: process.env.OPENAI_API_KEY },
    anthropic: { apiKey: process.env.ANTHROPIC_API_KEY },
    google: { apiKey: process.env.GOOGLE_API_KEY }
  },
  scoring: {
    // Optional: customize task weights
    taskWeights: {
      coding: { relevance: 0.30, coherence: 0.15, completeness: 0.25, 
               latency: 0.10, cost: 0.10, taskSpecific: 0.10 }
    }
  }
});

// Route automatically based on task type
const result = await ai.generate({
  prompt: "Create a Python function to calculate fibonacci",
  taskType: "coding",
  maxTokens: 200
});

Future Enhancements

  • Online Learning: Real-time weight adjustment based on user feedback
  • Multi-Objective Optimization: Pareto frontier for cost vs quality
  • Prompt Caching: Semantic caching for repeated queries
  • Edge Deployment: Regional model providers for lower latency

ClawRoute is open source under MIT License. Visit github.com/clawhub/clawroute for documentation and examples.

ClawRoute: Intelligent AI routing that learns and adapts to deliver the best model for every request.


Introducing “vibe design” with Stitch

Stitch is evolving into an AI-native platform that allows anyone to create, iterate, and collaborate on high-fidelity UI.

Has Agile lost its way? How AI-powered DevSecOps can help [Q&A]

Agile. What started 25 years ago as a movement for responsiveness and customer value now often gets bogged down by backlogs, burndowns, and bloated frameworks. Teams find themselves saying, “we’re Agile, but…” -- a clear sign of compromise. So has Agile lost its way and what can be done about it? We spoke to Bryan Ross, field CTO at GitLab, who argues that the future of Agile lies not in replacing it, but in using AI platforms to finally deliver on its founding principles. BN: After 25 years of Agile, what are the main challenges facing traditional Agile planning today?… [Continue Reading]

Observability for AI Systems: Strengthening visibility for proactive risk detection


Adoption of Generative AI (GenAI) and agentic AI has accelerated from experimentation into real enterprise deployments. What began with copilots and chat interfaces has quickly evolved into powerful business systems that autonomously interact with sensitive data, call external APIs, connect to consequential tools, initiate workflows, and collaborate with other agents across enterprise environments. As these AI systems become core infrastructure, establishing clear, continuous visibility into how these systems behave in production can help teams detect risk, validate policy adherence, and maintain operational control.

Observability is one of the foundational security and governance requirements for AI systems operating in production. Yet many organizations don’t understand the critical importance of observability for AI systems or how to implement effective AI observability. That mismatch creates potential blind spots at precisely the moment when visibility matters most.

In February, Microsoft Corporate Vice President and Deputy Chief Information Security Officer, Yonatan Zunger, blogged about expanding Microsoft’s Secure Development Lifecycle (SDL) to address AI-specific security concerns. Today, we continue the discussion with a deep dive into observability as a necessity for the secure development of GenAI and agentic AI systems.

For additional context, read the Secure Agentic AI for Your Frontier Transformation blog that covers how to manage agent sprawl, strengthen identity controls, and improve governance across your tenant.

Observability for AI systems

In traditional software, client apps make structured API calls and backend services execute predefined logic. Because code paths follow deterministic flows, traditional observability tools can surface straightforward metrics like latency, errors, and throughput to track software performance in production.

GenAI and agentic AI systems complicate this model. AI systems are probabilistic by design and make complex decisions about what to do next as they run. This makes it much harder to rely on a predictable, finite set of success and failure modes. We need to evolve the types of signals and telemetry collected so that we can accurately understand and govern what is happening in an AI system.

Consider this scenario: an email agent asks a research agent to look up something on the web. The research agent fetches a page containing hidden instructions and passes the poisoned content back to the email agent as trusted input. The email agent, now operating under attacker influence, forwards sensitive documents to unauthorized recipients, resulting in data exfiltration.

In this example, traditional health metrics stay green: no failures, no errors, no alerts. The system is working exactly as designed… except a boundary between untrusted external content and trusted agent context has been compromised.

This illustrates how AI systems require a unique approach to observability. Without insights into how context was assembled at each step—what was retrieved, how it impacted model behavior, and where it propagated across agents—there is no way to detect the compromise or reconstruct what occurred.

Traditional monitoring, built around uptime, latency, and error rates, can miss the root cause here and provide limited signal for attribution or reconstruction in AI-related scenarios. This is an example of one of the new categories of risk that the SDL must now account for, and it is why Microsoft has incorporated enhanced AI observability practices within our secure development practices.

Traditional observability versus AI observability

Observability of AI systems means the ability to monitor, understand, and troubleshoot what an AI system is doing, end-to-end, from development and evaluation to deployment and operation. Traditional services treat inputs as bounded and schema-defined. In AI systems, input is assembled context. This includes natural language instructions plus whatever the system pulls in and acts on, such as system and developer instructions, conversation history, outputs returned from tools, and retrieved content (web pages, emails, documents, tickets).

For AI observability, context is key: capture which input components were assembled for each run, including source provenance and trust classification, along with the resulting system outputs.

Traditional observability is often optimized for request-level correlation, where a single request maps cleanly to a single outcome, with correlation captured inside one trace. In AI systems, dangerous failures can unfold across many turns. Each step looks harmless until the conversation ramps into disallowed output, as we’ve seen in multi-turn jailbreaks like Crescendo.

For AI observability, best practices call for propagating a stable conversation identifier across turns, preserving trace context end-to-end, so outcomes can be understood within the full conversational narrative rather than in isolation. This is “agent lifecycle-level correlation,” where the span of correlation should be the same as the span of persistent memory or state within the system.

Defining AI system observability

Traditional observability is built on logs, metrics, and traces. This model works well for conventional software because it’s optimized around deterministic, quantifiable infrastructure and service behavior such as availability, latency, throughput, and discrete errors.

AI systems aren’t deterministic. They evaluate natural language inputs and return probabilistic results that can differ subtly (or significantly) from execution to execution. Logs, metrics, and traces still apply here, but what gets captured within them is different. Observability for AI systems updates traditional observability to capture AI-native signals.

Logs, metrics, and traces indicate what happened in the AI system at runtime.

  • Logs capture data about the interaction: request identity context, timestamp, user prompts and model responses, which agents or tools were invoked, which data sources were consulted, and so on. This is the core information that tells you what happened. User prompts and model responses are often the earliest signal of novel attacks before signatures exist, and are essential for identifying multi-turn escalation, verifying whether attacks changed system behavior, adjudicating safety detections, and reconstructing attack paths. User-prompt and model-response logs can reveal the exact moment an AI agent stops following user intent and starts obeying attacker-authored instructions from retrieved content.
  • Metrics measure traditional performance details like latency, response times, and errors as well as AI-specific information such as token usage, agent turns, and retrieval volume. This information can reveal issues such as unauthorized usage or behavior changes due to model updates.
  • Traces capture the end-to-end journey of a request as an ordered sequence of execution events, from the initial prompt through response generation. Without traces, debugging an agent failure means guessing which step went wrong.

AI observability also incorporates two new core components: evaluation and governance.

  • Evaluation measures response quality, assesses whether outputs are grounded in source material, and evaluates whether agents use tools correctly. Evaluation gives teams measurable signals to help understand agent reliability, instruction alignment, and operational risk over time.
  • Governance is the ability to measure, verify, and enforce acceptable system behavior using observable evidence. Governance uses telemetry and control plane mechanisms to ensure that the system supports policy enforcement, auditability, and accountability.

These key components of observability give teams improved oversight of AI systems, helping them ship with greater confidence, troubleshoot faster, and tune quality and cost over time.  

Operationalizing AI observability through the SDL

The SDL provides a formal mechanism by which technology leaders and product teams can operationalize observability. The following five steps can help teams implement observability in their AI development workflows.

  1. Incorporate AI observability into your secure development standards. Observability standards for GenAI and agentic AI systems should be codified requirements within your development lifecycle; not discretionary practices left to individual teams.
  2. Instrument from the start of development. Build AI-native telemetry into your system at design time, not after release. Aligning with industry conventions for logging and tracing, such as OpenTelemetry (OTel) and its GenAI semantic conventions, can improve consistency and interoperability across frameworks. For implementation in agentic systems, use platform-native capabilities such as Microsoft Foundry agent tracing (in preview) for runtime trace diagnostics in Foundry projects. For Microsoft Agent 365 integrations, use the OTel-based Microsoft Agent 365 Observability SDK (in Frontier preview) to emit telemetry into Agent 365 governance workflows.
  3. Capture the full context. Log user prompts and model responses, retrieval provenance, what tools were invoked, what arguments were passed, and what permissions were in effect. This detail can help security teams distinguish a model error from an exploited trust boundary and enables end-to-end forensic reconstruction. What to capture and retain should be governed by clear data contracts that balance forensic needs against privacy, data residency, retention requirements, and compliance with legal and regulatory obligations, with access controls and encryption aligned to enterprise policy and risk assessments.
  4. Establish behavioral baselines and alert on deviation. Capture normal patterns of agent activity—tool call frequencies, retrieval volumes, token consumption, evaluation score distributions—through Azure Monitor and Application Insights or similar services. Alert on meaningful departures from those baselines rather than relying solely on static error thresholds.
  5. Manage enterprise AI agents. Observability alone cannot answer every question. Technology leaders need to know how many AI agents are running, whether those agents are secure, and whether compliance and policy enforcement are consistent. Observability, when coupled with unified governance, can support improved operational control. Microsoft Foundry Control Plane, for example, consolidates inventory, observability, compliance with organization-defined AI guardrail policies, and security into one role-aware interface; Microsoft Agent 365 (in Frontier preview) provides tenant-level governance in the Microsoft 365 admin plane.

To learn more about how Microsoft can help you manage agent sprawl, strengthen identity controls, and improve governance across your tenant, read the Secure Agentic AI for Your Frontier Transformation blog.

Benefits for security teams

Making enterprise AI systems observable transforms opaque model behavior into actionable security signals, strengthening both proactive risk detection and reactive incident investigation.

When embedded in the SDL, observability becomes an engineering control. Teams define data contracts early, instrument during design and build, and verify before release that observability is sufficient for detection and incident response. Security testing can then validate that key scenarios such as indirect prompt injection or tool-mediated data exfiltration are surfaced by runtime protections and that logs and traces enable end-to-end forensic reconstruction of event paths, impact, and control decisions.  

Many organizations already deploy inference-time protections, such as Microsoft Foundry guardrails and controls. Observability complements these protections, enabling fast incident reconstruction, clear impact analysis, and measurable improvement over time. Security teams can then evaluate how systems behave in production and whether controls are working as intended.

Adapting traditional SDL and monitoring practices for non-deterministic systems doesn’t mean reinventing the wheel. In most cases, well-known instrumentation practices can be simply expanded to capture AI-specific signals, establish behavioral baselines, and test for detectability. Standards and platforms such as OpenTelemetry and Azure Monitor can support this shift.

AI observability should be a release requirement. If you cannot reconstruct an agent run or detect trust-boundary violations from logs and traces, the system may not be ready for production.

The post Observability for AI Systems: Strengthening visibility for proactive risk detection appeared first on Microsoft Security Blog.


CTOs Face Pressure to Deliver AI Gains, but Productivity Isn’t There Yet


How are CTOs feeling about AI?

According to Andy Skipper, founder of CTO Craft, they’re experiencing fear, uncertainty, and doubt.

And if the technical leaders of companies are feeling that way, what can the rest of us expect? Certainly, we dream of productivity boosts and an AI El Dorado – but that’s not the reality.

That’s why we sat down with Skipper to talk about how CTOs should manage expectations for AI, and how to navigate the hype versus reality.

Stakeholders and investors are watching CTOs closely, and the pressure is rising

Many CTOs, Skipper notes, are navigating intense pressure from non-technical stakeholders and investors alike, especially with the massive resources being invested in AI and LLM technologies.

He is careful to temper expectations here:

AI is not going to reduce costs or increase productivity in the way some non-technical people think just yet. It’s getting there, but it’s not there yet.

At the same time, Skipper points out a surprising upside: AI is giving engineering leaders a chance to reconnect with the code and architecture without writing all the code themselves:

One of the things you have to accept as an engineering leader is that you are going to get further away from the code the more senior you become. AI gives people an opportunity to get back to architecture and development work, even if they aren’t coding themselves.

The CTO role can be isolating

When Skipper became a CTO for the first time, he quickly realized just how isolating the role could be. There was nowhere for tech leaders to share challenges, get support, or navigate the non-technical side of the job.

That gap inspired him to start CTO Craft, now a community helping senior engineering leaders navigate team dynamics, strategy, and AI.

When I was a CTO for the first time, I didn’t have somebody who I could talk to about the issues I was seeing or compare notes with people who had similar challenges. That’s what CTO Craft is all about – helping people understand where the challenges come from and understand they’re not alone in having those challenges.

As a coach and mentor, Skipper works closely with CTOs around the world, helping them deal with issues like burnout, communication with non-technical stakeholders, and, lately, how to adapt in the AI era.

The most common CTO mistake? Always chasing the newest technologies

Many first-time CTOs struggle with burnout, overextending themselves to shield teams from stress, and balancing hands-on coding with high-level responsibilities. He explains:

A lot of the people that I work with directly are suffering from burnout. First-time CTOs commonly miss out on self-preservation. And usually that’s a combination of too much expectation of their own energy levels, their own abilities, backlogs…

And after overextending themselves, first-time CTOs often make another common mistake: chasing the newest technologies. While adopting the latest tools and frameworks can seem exciting, Skipper warns that it’s not always the best choice for fast-moving teams trying to scale.

“Using bleeding-edge tech can slow you down, make systems harder to maintain, and even complicate hiring because the talent pool for newer technologies might be limited,” he explains.

As a coach, Skipper says these are just some of the recurring challenges he sees among engineering leaders, alongside a range of other operational and people-related issues.

Engineering skills alone won’t make you a CTO

For aspiring engineering leaders, Skipper highlights that growing into a successful CTO requires more than technical excellence: commercial understanding, communication, coaching, and vision-setting are just as crucial:

The difference between a good engineering manager and a great CTO is understanding how technology drives business success, while still inspiring and guiding your teams.

But technical and business skills are only part of the picture. Motivation and team management are equally critical. Skipper stresses that not everyone is motivated by the same things, and leaders need to understand individual drivers:

Having a vision in the first place is very important. But when it comes to actually bringing individuals along on the journey, they all need to be worked with differently. You can’t just set it and expect everyone to be motivated.

He also warns against a common mistake among CTOs: trying to shield their teams from the challenges of a pivot or rapid change. While the instinct is understandable, it often backfires and drains the leader’s emotional energy. Instead, transparency and realistic communication are key:

Being transparent, being realistic, measuring your words, not being super negative about everything, but still being realistic, I think all these things are really important.

The need for a support network, not another tech stack

Skipper believes resilience and peer support are crucial for engineering leaders navigating the complexity of the CTO role. Sharing experiences and learning from others can help leaders realize they’re not alone when facing difficult decisions.

Looking ahead, however, he admits that the pace of technological change makes it hard to predict what the role will look like in the future.

Five years from now, I honestly have no idea what the role of a CTO will look like. The way we build software is already changing rapidly, especially with AI. But the fundamentals like setting a vision, communicating it clearly, and connecting technology with business outcomes, will always remain essential.

For Skipper, that uncertainty makes peer support crucial: it helps leaders adapt, learn, and navigate a fast-changing profession.

Ultimately, he believes the most important skill for CTOs is the ability to keep learning and tackle challenges without going it alone.

*Infobip, the global communications API leader that launched ShiftMag, was an Event Partner at CTO Craft 2026.

The post CTOs Face Pressure to Deliver AI Gains, but Productivity Isn’t There Yet appeared first on ShiftMag.
