Most MCP servers in the wild are single-instance processes. That's fine when they're driving a local Claude or VS Code session — but it's the wrong shape for a production agent fleet that has to absorb traffic spikes, ride through deploys, and survive instance failures.
The good news: the MCP spec already grew up. The 2025-06-18 revision formalizes stateless HTTP transport (and the current 2025-11-25 revision keeps it), which means a single request carries everything the server needs to answer. No long-lived connection, no in-process session table, no sticky-session hacks to keep a client glued to one box.
That tiny protocol change unlocks something big: you can stick an MCP server behind App Service's built-in load balancer and scale it like any other web API. This post walks through how, with a runnable sample.
Sample: seligj95/app-service-mcp-stateless-scale-python. One azd up and you have a stateless FastAPI MCP server running on three App Service instances behind the platform load balancer, with a staging slot, Application Insights, and a k6 script that visualizes load distribution from the client side.
Why "stateless" is the whole story
Earlier MCP transports leaned on persistent connections — SSE channels and WebSocket-style sessions where the server held per-client state in memory (open tools, subscriptions, partial streams). That model is great for a local IDE talking to a local process. It's hostile to load balancing, because routing a follow-up request to a different instance breaks the session.
The stateless HTTP transport flips that. Each request is a complete JSON-RPC envelope (initialize, tools/list, tools/call), every response is self-contained, and the server is allowed to forget the client between requests. Any instance can serve any call. That is the property a load balancer needs.
In the sample, every tool is a pure function of its arguments — whoami reports the serving instance, lookup_fact reads a static dictionary, compute_primes runs a sieve. None of them touches per-client memory. That's not a constraint of the protocol; it's a discipline you adopt to keep statelessness intact.
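To make "pure function of its arguments" concrete, here is a minimal sketch of what two such tools could look like. The `FACTS` table and both function bodies are my own illustration, not code copied from the sample; the point is only that nothing here reads or writes per-client state.

```python
from typing import Any, Dict, List

# Hypothetical static data; the sample's real dictionary may differ.
FACTS: Dict[str, str] = {
    "mcp": "Model Context Protocol",
    "arr": "Application Request Routing",
}


def lookup_fact(topic: str) -> Dict[str, Any]:
    """Read-only lookup: the table is never mutated on behalf of a client."""
    key = topic.strip().lower()
    return {"topic": key, "fact": FACTS.get(key), "found": key in FACTS}


def compute_primes(limit: int) -> List[int]:
    """Sieve of Eratosthenes: return every prime <= limit."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # Cross out multiples of p, starting at p*p.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]
```

Because the output depends only on the inputs, the same call gives the same answer on any instance, which is what lets the load balancer route freely.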
Why App Service, and not Functions or AKS
A few defaults made App Service the right home for a scaled MCP server:
- Always On. Reasoning tools call into LLMs and external APIs; latencies routinely sit in the multi-second range. On the Consumption plan, Functions caps a single execution at five minutes by default (ten at most) and aggressively scales workers to zero between bursts, which kills warm caches. App Service keeps the process resident.
- Horizontal scale is one parameter. Pick a Premium SKU, set the plan's capacity to N, and you have N instances behind a managed load balancer. No VMSS to declare, no ingress controller to wire up, no Service to reconcile.
- Deployment slots. Swap a warmed-up staging slot into production for zero-downtime deploys. Critical when your "API" is an LLM tool surface that an agent is actively driving.
- Easy Auth. OAuth 2.1 in front of the MCP endpoint without writing the flow yourself — turn on the App Service authentication blade and point it at Entra ID. The sample leaves this off so the deploy is one command, but the wiring is a checkbox away.
The TL;DR: it's PaaS that already knows how to run a long-lived, stateless process at horizontal scale, which is exactly the shape of a scaled MCP server.
The FastAPI MCP server, end-to-end stateless
The whole transport is one POST handler. The full source is in main.py, but here are the load-bearing pieces:
@app.post("/mcp")
async def mcp_endpoint(request: Request):
    body = await request.json()
    method = body.get("method", "")
    msg_id = body.get("id")
    if method == "initialize":
        return {"jsonrpc": "2.0", "id": msg_id, "result": _server_info()}
    if method == "tools/list":
        return {"jsonrpc": "2.0", "id": msg_id, "result": {"tools": [...]}}
    if method == "tools/call":
        params = body.get("params", {})
        result = await MCP_TOOLS[params["name"]]["function"](**params.get("arguments", {}))
        return {
            "jsonrpc": "2.0",
            "id": msg_id,
            "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
        }
There is no session table. There is no client_id cookie. There is no AsyncIterator held open between requests. initialize, tools/list, and tools/call all return in a single round trip, which is the shape App Service's load balancer expects.
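One thing the excerpt leaves out is what happens when a request names a method or tool the server doesn't know. A minimal sketch of that fall-through, written as a framework-free function so it is easy to test, might look like this. The `TOOLS` registry and its `echo` tool are hypothetical, the tools here are synchronous for brevity (the sample's are async), and the error codes are the standard JSON-RPC 2.0 ones (`-32601` method not found, `-32602` invalid params).

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry for illustration; the sample's MCP_TOOLS has its own shape.
TOOLS: Dict[str, Callable[..., Any]] = {
    "echo": lambda text="": {"echo": text},
}


def _error(msg_id: Any, code: int, message: str) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 error response."""
    return {"jsonrpc": "2.0", "id": msg_id, "error": {"code": code, "message": message}}


def dispatch(body: Dict[str, Any]) -> Dict[str, Any]:
    """Answer one request using only what is in `body` (no stored state)."""
    msg_id = body.get("id")
    method = body.get("method", "")
    if method == "tools/call":
        params = body.get("params", {})
        name = params.get("name", "")
        tool = TOOLS.get(name)
        if tool is None:
            return _error(msg_id, -32602, f"Unknown tool: {name}")
        result = tool(**params.get("arguments", {}))
        return {
            "jsonrpc": "2.0",
            "id": msg_id,
            "result": {"content": [{"type": "text", "text": json.dumps(result)}]},
        }
    # Anything unrecognized gets a well-formed error instead of an empty reply.
    return _error(msg_id, -32601, "Method not found")
```

Since `dispatch` looks at nothing but its argument, calling it twice with the same body gives the same answer, on any instance.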
The most useful debugging tool in the sample is whoami:
async def tool_whoami() -> Dict[str, Any]:
    return {
        "instance_id": os.environ.get("WEBSITE_INSTANCE_ID", "local"),
        "hostname": socket.gethostname(),
        ...
    }
WEBSITE_INSTANCE_ID is unique per App Service worker. Call whoami a few times from your MCP client and the value rotates — that's the load balancer working. If it doesn't rotate, something is pinning your traffic (almost always the ARR Affinity cookie; we'll get there).
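If you'd rather script that check than click through a client, a small stdlib-only sketch could call the tool repeatedly and tally the instance IDs. I'm assuming the tool is registered under the name `whoami` and that responses follow the handler's shape shown earlier (a JSON string inside `result.content[0].text`); adjust if your server differs.

```python
import json
import urllib.request
from collections import Counter
from typing import Any, Dict, Iterable


def whoami_request(msg_id: int) -> Dict[str, Any]:
    """Build a tools/call envelope for the (assumed) 'whoami' tool."""
    return {
        "jsonrpc": "2.0",
        "id": msg_id,
        "method": "tools/call",
        "params": {"name": "whoami", "arguments": {}},
    }


def extract_instance_id(response: Dict[str, Any]) -> str:
    """Pull instance_id out of the JSON text payload of a tool result."""
    text = response["result"]["content"][0]["text"]
    return json.loads(text)["instance_id"]


def tally(responses: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many responses each instance served."""
    return Counter(extract_instance_id(r) for r in responses)


def call_whoami(base_url: str, msg_id: int) -> Dict[str, Any]:
    """POST one request to <base_url>/mcp and return the parsed response."""
    req = urllib.request.Request(
        f"{base_url}/mcp",
        data=json.dumps(whoami_request(msg_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Usage (needs a deployed app):
# print(tally(call_whoami("https://<your-app>.azurewebsites.net", i) for i in range(30)))
```

One caveat: `urllib.request` does not keep cookies between calls by default, so this client never replays an affinity cookie. It shows how the platform spreads cookie-less traffic, not whether a cookie-keeping client would be pinned.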
The Bicep that actually makes it scale
The infra is a P0v3 plan with capacity: 3, a web app with affinity disabled, and a staging slot on the same plan:
resource appServicePlan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: name
  sku: {
    name: 'P0v3'
    capacity: instanceCount
  }
  properties: { reserved: true }
}
resource web 'Microsoft.Web/sites@2024-04-01' = {
  name: name
  properties: {
    serverFarmId: appServicePlanId
    httpsOnly: true
    clientAffinityEnabled: false
    siteConfig: {
      linuxFxVersion: 'PYTHON|3.11'
      alwaysOn: true
      healthCheckPath: '/health'
      appCommandLine: 'python -m uvicorn main:app --host 0.0.0.0 --port 8000'
    }
  }
}
resource staging 'Microsoft.Web/sites/slots@2024-04-01' = {
  parent: web
  name: 'staging'
  properties: { }
}
The single most important line in that template is clientAffinityEnabled: false. App Service defaults to on, which sets the ARRAffinity cookie and pins every subsequent request from a given client to the instance that handled the first one. That default exists because legacy ASP.NET apps used in-process session state. Stateless MCP does not. Leaving affinity on silently undoes everything we just built.
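A quick way to confirm affinity is really off is to look at the `Set-Cookie` headers on a response: with affinity enabled, App Service issues cookies named `ARRAffinity` and `ARRAffinitySameSite`. The helper below is a small sketch of that check; the function name is mine, and it deliberately parses only the cookie name rather than the full cookie grammar.

```python
from typing import Iterable, List

# Cookie names issued by App Service's affinity feature (compared case-insensitively).
AFFINITY_COOKIES = {"arraffinity", "arraffinitysamesite"}


def affinity_cookies(set_cookie_headers: Iterable[str]) -> List[str]:
    """Return the names of any affinity cookies found in Set-Cookie header values."""
    found = []
    for header in set_cookie_headers:
        # A Set-Cookie value looks like "name=value; attr; attr=...".
        name = header.split(";", 1)[0].split("=", 1)[0].strip()
        if name.lower() in AFFINITY_COOKIES:
            found.append(name)
    return found


# Usage with urllib (needs a deployed app):
# import urllib.request
# with urllib.request.urlopen("https://<your-app>.azurewebsites.net/health") as resp:
#     print(affinity_cookies(resp.headers.get_all("Set-Cookie") or []))
```

An empty list is what you want to see once `clientAffinityEnabled` is `false`.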
Premium v3 (P0v3) is a sensible floor for two reasons: it includes both Always On and deployment slots. Always On needs at least the Basic tier and slots need at least Standard, so the Free and Shared tiers give you neither.
Application Insights without writing telemetry code
The sample drops three lines of bootstrap into main.py:
from azure.monitor.opentelemetry import configure_azure_monitor
if os.environ.get("APPLICATIONINSIGHTS_CONNECTION_STRING"):
    configure_azure_monitor(logger_name="mcp")
The Azure Monitor OpenTelemetry distro auto-instruments FastAPI and outbound HTTP. Every request span the instrumented app emits is tagged with cloud_RoleInstance, which Application Insights populates from WEBSITE_INSTANCE_ID. That makes the question "is traffic actually spreading across my instances?" a one-liner in Logs:
requests
| where timestamp > ago(15m)
| where name contains "/mcp"
| summarize count() by cloud_RoleInstance
| order by count_ desc
If you see three roughly-equal rows, you're done. If you see one row, your client is sending ARRAffinity cookies — turn affinity off and redeploy.
Deploy
azd auth login
azd up
That provisions the resource group, plan, web app, staging slot, Log Analytics workspace, and Application Insights resource, then deploys the Python app via Oryx. The output prints both WEB_URI and WEB_STAGING_URI. Open the production URI — the home page renders the instance ID that served it. Refresh. The ID changes.
To swap the staging slot into production with no downtime:
az webapp deployment slot swap \
  --resource-group <rg> --name <app> \
  --slot staging --target-slot production
App Service warms the staging instances, redirects traffic, and the old production becomes the new staging — the classic blue-green pattern, but free.
Prove it scales
The sample ships a k6 script that hammers /mcp with tools/call requests and tags every response with the instance_id the server returned:
BASE_URL=https://<your-app>.azurewebsites.net \
  k6 run --summary-export=summary.json loadtest/k6-mcp.js
jq '.metrics.mcp_instance_hits.values' summary.json
The output groups hits per instance tag. On a three-instance plan with a 60-second steady load you should see something close to:
{
  "count": 1842,
  "instance0d3e2f...": 614,
  "instance7a91bc...": 612,
  "instance19f0c4...": 616
}
Roughly 33% on each box — the App Service load balancer round-robining new connections, with no help from the application.
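If you want to turn "roughly 33%" into a pass/fail check (say, in CI), a short sketch like the following would do it. It takes only the per-instance counts (drop the `count` total first). The 10% relative tolerance is an arbitrary assumption of mine, and the instance names in the usage line are placeholders.

```python
from typing import Dict


def hit_shares(hits: Dict[str, int]) -> Dict[str, float]:
    """Convert per-instance hit counts into fractions of total traffic."""
    total = sum(hits.values())
    if total == 0:
        return {}
    return {instance: count / total for instance, count in hits.items()}


def is_balanced(hits: Dict[str, int], expected_instances: int,
                tolerance: float = 0.10) -> bool:
    """True if every expected instance got its fair share, within tolerance.

    `tolerance` is relative: 0.10 allows each share to deviate by 10%
    of the ideal 1/N share.
    """
    shares = hit_shares(hits)
    if len(shares) != expected_instances:
        # Fewer rows than instances usually means traffic is being pinned.
        return False
    ideal = 1 / expected_instances
    return all(abs(share - ideal) <= tolerance * ideal for share in shares.values())


# Usage with the counts above (placeholder instance names):
# is_balanced({"instance-a": 614, "instance-b": 612, "instance-c": 616}, 3)
```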
What I'd do next
The sample is intentionally a starting point. Two extensions are the obvious next moves:
- Add Easy Auth. Turn on App Service authentication, pick Entra ID, require auth on /mcp. The token surfaces as headers; your tool handlers can use it to identify the calling agent without you owning any of the OAuth machinery.
- Autoscale on CPU. instanceCount: 3 is a starting point. Wire up Microsoft.Insights/autoscalesettings against the plan and let it scale 3 → 10 on the prime-counting tool. The architecture already supports it, and that's the whole point of stateless.
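For the Easy Auth extension, the identity arrives in request headers that App Service injects; the richest one is `X-MS-CLIENT-PRINCIPAL`, a base64-encoded JSON document with a `claims` array of `typ`/`val` pairs. Here is a minimal sketch of decoding it inside a handler. The helper names are mine, and which claim types you actually receive depends on your identity provider configuration.

```python
import base64
import json
from typing import Any, Dict, Mapping, Optional


def decode_client_principal(headers: Mapping[str, str]) -> Optional[Dict[str, Any]]:
    """Decode the Easy Auth principal header, or return None if absent or malformed."""
    # Normalize header names so a plain dict works as well as a framework's headers object.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("x-ms-client-principal")
    if not raw:
        return None
    try:
        return json.loads(base64.b64decode(raw))
    except ValueError:
        # Covers bad base64, bad UTF-8, and bad JSON (all ValueError subclasses).
        return None


def first_claim(principal: Dict[str, Any], claim_type: str) -> Optional[str]:
    """Return the first claim value whose 'typ' matches claim_type."""
    for claim in principal.get("claims", []):
        if claim.get("typ") == claim_type:
            return claim.get("val")
    return None
```

Only trust this header when the request really came through App Service authentication; with Easy Auth off, as in the sample's default deploy, a caller could set it themselves.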
Try it
If you ship something with it, I'd love to hear how it held up.