
Scaling Global Storytelling: Modernizing Localization Analytics at Netflix


Valentin Geffrier, Tanguy Cornuau

Each year, we bring the Analytics Engineering community together for an Analytics Summit — a multi-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. This post is one of several topics presented at the Summit highlighting the breadth and impact of Analytics work across different areas of the business.

At Netflix, our goal is to entertain the world, which means we must speak the world’s languages. With the company now serving 300 million+ members in more than 190 countries and 50+ languages, the Localization team has had to scale rapidly, creating more dubs and subtitle assets than ever before. However, this growth created technical debt within our systems: a fragmented landscape of analytics workflows, duplicated pipelines, and siloed dashboards that we are now actively modernizing.

The Challenge: “Who Made This Dub?”

Historically, business logic for localization metrics was replicated across isolated domains. A question as simple as “Who made this dub/subtitle?” is actually complex — it requires mapping multiple data sources through intricate and constantly changing logic, which varies depending on the specific language asset type and creation workflow.

When this logic is copied into isolated pipelines for different use cases, it creates two major risks: inconsistency in reporting and a massive maintenance burden whenever upstream logic changes. We realized we needed to move away from these vertical silos.

Our Modernization Strategy

To address this, we defined a vision centered on consolidation, standardization, and trust, executed through three strategic pillars:

1. The Audit and Consolidation Playbook

We initiated a comprehensive audit of over 40 dashboards and tools to assess usage and code quality. Our focus has shifted from patching frontend visualizations to consolidating backend pipelines. For example, we are currently merging three legacy dashboards related to dubbing partner KPIs (around operational performance, capacity, and finances), focusing first on a unified data and backend layer that can support a variety of future frontend iterations.

2. Reducing “Not-So-Tech” Debt

Technical debt isn’t just about code; it is also about the user experience. We define “Not-So-Tech Debt” as the friction stakeholders feel when tools are hard to interpret or could benefit from better storytelling. To fix this, we revamped our Language Asset Consumption tool: instead of reporting dub and subtitle metrics independently, we combine audio and text languages into a single consumption language. This helps differentiate Original Language versus Localized Consumption and measure member preferences between subtitles, dubs, or a combination of both for a given language, unlocking more intuitive insights based on actual recurring stakeholder use cases.

3. Investing in Core Building Blocks

We are shifting to a write once, read many architecture. By centralizing business logic into unified tables — such as a “Language Asset Producer” table — we solve the “Who made this dub?” problem once. This centralized source now feeds into multiple downstream domains, including our Dub Quality and Translation Quality metrics, ensuring that any logic update propagates instantly across the ecosystem.

The Future: Event-Level Analytics

Looking ahead, we are moving beyond asset-level metrics to event-level analytics. We are building a generic data model to capture granular timed-text events, such as individual subtitle lines. This data helps us understand how subtitle characteristics (e.g. reading speed) affect member engagement and, in turn, refine the style guidelines we provide to our subtitle linguists to improve the member experience with localized content.

Ultimately, this modernization effort is about scaling our ability to measure and enhance the joy and entertainment we deliver to our diverse global audience, ensuring that every member, regardless of their language, has the best possible Netflix experience.


Scaling Global Storytelling: Modernizing Localization Analytics at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.


Go 1.26.1-1 and 1.25.8-1 Microsoft builds now available


New releases of the Microsoft build of Go, including security fixes, are now available for download. For more information about these releases and the changes included, see the table below:

Microsoft Release    Upstream Tag
v1.26.1-1            go1.26.1 (release notes)
v1.25.8-1            go1.25.8 (release notes)

The post Go 1.26.1-1 and 1.25.8-1 Microsoft builds now available appeared first on Microsoft for Go Developers.


Vertical Slices doesn’t mean “Share Nothing”


How do you share code between vertical slices? Vertical slices are supposed to be share-nothing, right? Wrong. It is not about sharing nothing. It is about sharing the right things and avoiding sharing the wrong things. That is really the point.

YouTube

Check out my YouTube channel, where I post all kinds of content on Software Architecture & Design, including this video showing everything in this post.

Boundaries

If you have watched my videos before, you probably know I talk a lot about boundaries. A vertical slice is not that different from a logical boundary. What matters here is that a vertical slice defines a boundary around a use case.

That is the lens I want you to look through, because once you do that, the question of sharing becomes a lot clearer.

A Shipment Workflow Is a Good Example

A good example is a shipment. It is a workflow.

You have different actions that happen along the way that make up a life cycle from beginning to end. Think about ordering something online. It gets dispatched, which means the order is ready to be picked up at the warehouse. The carrier arrives at the shipper. Then the freight is loaded onto the truck. Then it departs. Then it arrives at the destination. Then it is delivered.

That is the workflow.

Now the natural question is this: is the vertical slice the whole workflow, or is each step its own slice?

Because if each step is a slice, how do you share between them?

A Vertical Slice Can Be One Step In the Workflow

Each step in that workflow can be a vertical slice.

You could model the entire workflow as one slice. Sometimes that might be fine. But often, each step can be its own slice because the workflow can change. It can deviate based on the actions that occur.

Take the same shipment example. The order gets dispatched, the vehicle is on the way to the warehouse, it arrives there, and then finds out the order was cancelled. There is nothing to pick up anymore.

That is a different use case.

In shipping, that might be called a dry run.

How do you implement that? It is just another vertical slice. It is part of the workflow, but it is also a deviation from that workflow.

That gets us back to the original question. What can you share between those vertical slices that are part of the same workflow?

There Are Two Different Kinds of Sharing

The first kind of sharing is technical infrastructure and plumbing.

Things like error and result types, logging, tracing, authorization helpers, messaging support, outbox primitives, event bus abstractions, and small utility code. That kind of stuff is normal to share. Some slices will use it. Some will not.

A slice gets to decide what dependencies it takes on and what tactical approach it uses. That is part of the slice owning its implementation.

But that leads to the more important question: what does the slice actually own?

A Vertical Slice Owns Its Data and Behavior

A vertical slice owns its data. That use case owns the data it needs and how that data is persisted.

It also owns the dependencies it takes on. It chooses the tactical patterns it wants to use for that specific use case.

That is important because people hear that and then assume everything must be completely isolated. But that is not really true, especially when several slices are part of the same workflow.

Shared State Is Not the Problem

In the shipment example, what you really have is a state machine. You have state transitions across the life cycle, from dispatched all the way to delivered.

So yes, there is shared state.

That does not mean there is shared ownership of everything.


Imagine a shipment with state like status, dispatched at, arrived at, pallets loaded, and emptied at. If that was mapped to a table, each piece of that state is owned by the slice responsible for that action. The dispatch slice changes the dispatch related state. The arrive slice changes the arrival related state. The loaded slice changes the loaded related state.

Each slice owns the behavior around its part of that workflow.
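The idea can be sketched in TypeScript (field and type names are my assumptions, not the code from the video):

```typescript
// One shared Shipment record; each field group is owned by exactly one slice.
interface Shipment {
  id: string;
  status: "Pending" | "Dispatched" | "Arrived" | "Loaded" | "Emptied" | "Delivered";
  dispatchedAt?: Date;    // owned by the Dispatch slice
  arrivedAt?: Date;       // owned by the Arrive slice
  palletsLoaded?: number; // owned by the Load slice
  emptiedAt?: Date;       // owned by the Empty slice
}

// The Dispatch slice owns the dispatch transition and nothing else.
function dispatch(shipment: Shipment, at: Date): Shipment {
  if (shipment.status !== "Pending") {
    throw new Error("shipment already dispatched");
  }
  return { ...shipment, status: "Dispatched", dispatchedAt: at };
}
```

The state is shared, but the Arrive slice never writes dispatchedAt, and the Dispatch slice never writes arrivedAt: ownership of behavior stays with the slice.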

This Still Applies With Event Sourcing

You can think about the exact same idea with event sourcing.


Instead of changing columns in a table, you are appending events to a stream. Dispatched. Arrived. Loaded. Emptied.

Same concept.

Each use case owns the behavior that produces those events. It owns the data tied to that behavior. It owns where that data lives, whether that is in a table or in an event stream.
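Sketched the same way with events (again, the names are illustrative): each slice appends only its own event type, while the shared aggregate guards the workflow invariants.

```typescript
// Same workflow, event-sourced: a stream of slice-owned events.
type ShipmentEvent =
  | { type: "Dispatched"; at: Date }
  | { type: "Arrived"; at: Date }
  | { type: "Loaded"; pallets: number }
  | { type: "Emptied"; at: Date };

// The Arrive slice decides its own event, but the invariant
// ("cannot arrive before dispatch") belongs to the shared workflow.
function decideArrive(stream: ShipmentEvent[], at: Date): ShipmentEvent {
  if (!stream.some((e) => e.type === "Dispatched")) {
    throw new Error("cannot arrive before dispatch");
  }
  return { type: "Arrived", at };
}
```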

That can still all live together. You are still sharing an aggregate. You are still sharing a concern because there are invariants around that workflow.

That is not bad sharing.

Sharing an Aggregate Is Not Bad Sharing

An aggregate is a consistency boundary. You need that consistency boundary around the state.

A slice is a use case.

So if you have several use cases related to the same underlying model, that can be shared. If two slices share validation because both operate on the same domain model, that can be shared too.

At the same time, you can have other slices that are not part of that workflow at all and use a completely different model. That is fine too.

The point is not that every slice must have its own isolated universe. The point is understanding what actually belongs together.

Different Slices Can Use the Same Domain Model

Another way to visualize this is by looking at what each slice does from top to bottom.

One slice might be invoked by an HTTP API. It has application code and a data model under it. Another slice might be invoked by a message or event. It has different infrastructure, different application logic, but still uses the same underlying domain model as the HTTP slice.

The entry point is different. The infrastructure is different. The application code is different.

That does not mean the domain model cannot be shared.
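A rough TypeScript sketch of that shape (handler and type names are my own, purely illustrative): two entry points with different infrastructure, both calling the same domain function.

```typescript
// The shared domain model: one transition, one set of invariants.
function markArrived(shipment: { status: string }, at: Date) {
  if (shipment.status !== "Dispatched") throw new Error("not dispatched");
  return { ...shipment, status: "Arrived", arrivedAt: at };
}

// Slice 1: invoked by an HTTP API; owns its own request shape.
function httpArriveHandler(body: { shipment: { status: string } }) {
  return markArrived(body.shipment, new Date());
}

// Slice 2: invoked by a message; different infrastructure, same model.
function onCarrierArrivedMessage(msg: { shipment: { status: string } }) {
  return markArrived(msg.shipment, new Date());
}
```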

And then you might have another use case that is not related at all, even if it lives in the same broader boundary. That one may have a completely separate model.

Again, the point is that slices define their own dependencies and their own behavior. But that does not mean they cannot share anything.

Where Sharing Becomes a Problem

The problem starts when you begin sharing things that have no business being shared.

In the shipment example, I am talking specifically about the workflow and the shipment life cycle from beginning to end. Nothing in that example has anything to do with compliance, customer support, customer tracking, or what was actually ordered.

Those are separate concerns.

The trap people fall into is that they start sharing and coupling things they should not. The model becomes unfocused. That is how you end up with one giant Shipment object for your whole system.

That is where you get into trouble.

Do not share one god object.

A Vertical Slice Is a Logical Boundary, Not a Physical One

This part is really important.

People often talk about vertical slices in the context of code organization, and that is useful. But a vertical slice is not a physical boundary. It is a logical boundary.

That means it does not have to live in one C#, Java, or TypeScript file. It does not have to live in one project.

If you have a mobile app deployed separately to iOS or Android, and it is dealing with specific actions as part of a use case, that can still be part of the slice. If the same use case is invoked through an HTTP API, that is also part of the slice.

The slice is the logical boundary around the use case. It is not just a folder structure.

Good Sharing Versus Bad Sharing

Good sharing is when vertical slices are operating on the same underlying thing as part of a workflow, a life cycle, or a set of common invariants.

Bad sharing is when a change unrelated to one vertical slice affects another slice unexpectedly.

That is when you are sharing things that have unrelated reasons to change.

That is the distinction.

Do Not Share Domain Meaning

Put another way, do not share domain meaning.

In the shipment workflow, dispatched, arrived, and loaded are use case specific. Dispatch is its own thing. No other feature should be doing something related to dispatch unless it actually owns that behavior.

If dispatch publishes events or changes dispatch related state, that should happen there. If there is state related to dispatching, that slice should own it.

You are not sharing that ownership.

Could you still share an underlying domain model that handles the workflow transitions? Absolutely.

But ownership of behavior still matters.

Focus on Actions, Not Just Data

Hopefully one thing stands out in this example. When I describe vertical slices and use cases, I am describing actions. I am not starting with data.

That matters.

And that is the real issue underneath all of this.

This Is Really About Coupling

When we talk about sharing, what are we really talking about?

Coupling.

That is what this usually comes down to.

If you understand what you are coupling to between vertical slices and use cases, you can manage it. If several slices depend on the same underlying domain model because they are part of the same workflow, that can be perfectly fine.

If every vertical slice can touch any piece of data and change state anywhere, then yes, you are going to have a problem.

At the end of the day, this is about managing coupling.

Share the Right Things

Vertical slices are not about sharing nothing.

They are about sharing the right things.

Technical concerns and plumbing? Sure.

Shared invariants as part of the same workflow? Sure.

A shared aggregate when several use cases are part of the same consistency boundary? Sure.

What you want to avoid is coupling things together that do not belong together.

That is the difference between good sharing and bad sharing.

Join CodeOpinion!
Developer-level members of my Patreon or YouTube channel get access to a private Discord server to chat with other developers about Software Architecture and Design and access to source code for any working demo application I post on my blog or YouTube. Check out my Patreon or YouTube Membership for more info.

The post Vertical Slices doesn’t mean “Share Nothing” appeared first on CodeOpinion.


How Does Kubernetes Self-Healing Work? Understand Self-Healing By Breaking a Real Cluster


I have noticed that many engineers who run Kubernetes in production have never actually watched it heal itself. They know it does. They have read the docs. But they have never seen a ReplicaSet controller fire, read an OOMKill in kubectl describe output, or watched pod endpoints go empty during a cascading failure. That's where 3 am incidents find you. This tutorial puts you on the other side of it.

You will clone one repo, spin up a real 3-node cluster, break it seven different ways, and watch it fix itself each time. No simulated output or fake clusters. Real Kubernetes, real failures, and real recovery. By the end, you will recognize these failure patterns when they show up in your production environment.


What is KubeLab?

KubeLab is an open-source Kubernetes failure simulation lab. It runs a real Node.js backend, a PostgreSQL database, Prometheus and Grafana, all inside a real cluster. When you click "Kill Pod", the backend calls the Kubernetes API and deletes an actual running pod. Nothing is fake.

Simulation                  What it teaches
Kill Random Pod             ReplicaSet self-healing, pod immutability
Drain Worker Node           Zero-downtime maintenance, PodDisruptionBudgets
CPU Stress                  Throttling vs crashing, invisible latency
Memory Stress               OOMKill, exit code 137, silent restart loops
Database Failure            StatefulSets, PVC persistence
Cascading Pod Failure       Why replicas: 2 isn't enough
Readiness Probe Failure     Liveness vs readiness, traffic control

Plan about 90 minutes for the full path. Or jump directly to any simulation if you have a specific production problem you want to reproduce.

KubeLab cluster map — pods grouped by node, color-coded by status. During simulations, chips change color and move between nodes in real time.

Prerequisites

You need basic familiarity with Docker and comfort with the command line, but no prior Kubernetes experience is required.

Hardware: 8GB RAM minimum, 16GB recommended. The lab runs on Mac, Linux, or Windows with WSL2. You'll need to install three tools: Multipass spins up Ubuntu VMs for the cluster, kubectl is the Kubernetes CLI you will use for every simulation, and Git clones the repo. If you cannot run three VMs, the repo includes a Docker Compose preview at setup/docker-compose-preview.md: the full UI with mock data, no real cluster needed.

How to Get the Lab Running

Full cluster setup lives at setup/k8s-cluster-setup.md in the repo. It walks through creating three VMs with Multipass, installing MicroK8s, joining the worker nodes, and deploying KubeLab. Follow it until all eleven pods show Running:

kubectl get pods -n kubelab
# All 11 pods should show STATUS: Running

Then open two port-forwards in separate terminal tabs and keep them running for the entire tutorial:

# Tab 1 — KubeLab UI at http://localhost:8080
kubectl port-forward -n kubelab svc/frontend 8080:80

# Tab 2 — Grafana at http://localhost:3000
kubectl port-forward -n kubelab svc/grafana 3000:3000

Grafana login: admin / kubelab-grafana-2026.

Position the KubeLab UI and Grafana side by side. Left half of the screen is the app. Right half is Grafana. You will watch both simultaneously from Simulation 3 onward.

Simulation 1: Kill Random Pod

This simulation deletes a running backend pod via the Kubernetes API. Without Kubernetes, you would SSH to the server, find the crashed process, and restart it manually, usually discovered by a user alert at 3am.

Before you click: Run kubectl get pods -n kubelab -w. Watch for a pod to go Terminating then a new one to appear.

Terminals running side by side before clicking Run: events streaming, the pod watch, and the frontend and Grafana port-forwards.
kubectl get pods -n kubelab -w
# backend-abc123  1/1   Terminating   0   2m
# backend-xyz789  1/1   Running       0   0s   ← ReplicaSet created a replacement

What happened: The ReplicaSet controller noticed actual(1) did not match desired(2) and created a replacement in parallel with the shutdown. The Endpoints controller removed the dying pod from the Service before SIGTERM fired, so zero traffic hit a dying pod.

The production trap: A missing readiness probe means the new pod receives traffic before it has opened a DB connection. You get 500s on every deployment for 2–3 seconds.

The fix: Set replicas: 2, add a readiness probe, and set terminationGracePeriodSeconds to match your longest request timeout.
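As a sketch, those three settings look like this together in a Deployment manifest (the image, port, and timing values are illustrative, not taken from the KubeLab repo):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 2                            # survive a single pod failure
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      terminationGracePeriodSeconds: 30  # >= your longest request timeout
      containers:
        - name: backend
          image: example/backend:1.0     # illustrative image
          readinessProbe:                # hold traffic until the app is ready
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 3
            periodSeconds: 5
```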

Simulation 2: Drain a Worker Node

This simulation cordons a worker node, then evicts all its pods to the remaining node.

To "cordon" a worker node means to mark it as unschedulable. When you run kubectl cordon <node-name>, the Kubernetes control plane adds the node.kubernetes.io/unschedulable:NoSchedule taint to the node. (A taint is a marker that tells the scheduler to avoid placing pods on that node unless they have a matching "toleration.") This tells the scheduler to stop placing any new pods onto that node. It does not affect the pods that are already running there.

Cordoning is the first, safe step in preparing a node for maintenance. It ensures that while you are draining the node, the scheduler isn't simultaneously trying to schedule new workloads onto it, which would defeat the purpose of the drain.

Without Kubernetes you would drain the server manually, guess when in-flight requests finish, patch it, and bring it back; the downtime window is unpredictable.

Before you click: Run kubectl get pods -n kubelab -o wide -w. Watch which node each pod runs on.

kubectl get pods -n kubelab -o wide -w
NAME                     NODE               STATUS
backend-abc123-xk2qp    kubelab-worker-1   Terminating   ← evicted
backend-abc123-n7mw3    kubelab-worker-2   Running       ← rescheduled

In kubectl get nodes the node shows Ready,SchedulingDisabled until you run kubectl uncordon.

What happened: The node spec got spec.unschedulable=true. The Eviction API ran per pod. That path goes through PodDisruptionBudget policy checks before proceeding; a raw kubectl delete pod bypasses those checks entirely, which is why draining with kubectl drain is always safer than deleting pods manually during maintenance.

The production trap: Two replicas with no pod anti-affinity often land on the same node. Drain that node and both pods evict at once. Complete downtime despite replicas: 2.

The fix: Use pod anti-affinity with topology key: kubernetes.io/hostname and a PodDisruptionBudget with minAvailable: 1.
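A minimal sketch of both pieces (label names are assumptions; the anti-affinity fragment belongs under the Deployment's spec.template.spec):

```yaml
# Under the Deployment's pod template spec: force replicas onto different nodes.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: backend
        topologyKey: kubernetes.io/hostname
---
# Separate object: block any voluntary eviction that would leave 0 replicas.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
```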

Node drain CLI output: cordoned node shows Ready,SchedulingDisabled; pods reschedule to the other node.

Simulation 3: CPU Stress and Throttling

This simulation burns CPU inside a backend pod for 60 seconds, hitting the 200m limit. Without Kubernetes, one runaway process can consume all CPU on the host and starve every other service.

Before you click: Run watch -n 2 kubectl top pods -n kubelab and open the Grafana CPU Usage panel.

kubectl top pods -n kubelab
# backend-abc123   200m   ← pegged at limit for 60s; the other pod stays ~15m

What happened: The Linux CFS scheduler enforces the cgroup limit by granting 20ms of CPU per 100ms period then freezing all processes in the cgroup for 80ms. The pod is not slow because it is broken. It is slow because it is frozen 80% of the time.

The production trap: kubectl top shows the pod using 95-150m, which looks normal. The metric shows usage at the ceiling, not the throttle rate. Teams spend hours checking application code for a latency bug that is actually a CPU limit set too low.

The fix: For latency-sensitive workloads, set CPU requests but remove CPU limits. Requests tell the scheduler where to place the pod without throttling at runtime. Confirm throttling with rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]).
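Sketched as a container resources fragment (values are illustrative):

```yaml
# Resources for a latency-sensitive workload: a CPU request for scheduling,
# but no CPU limit, so CFS never freezes the pod at runtime.
resources:
  requests:
    cpu: 200m
    memory: 128Mi
  limits:
    memory: 256Mi   # keep the memory limit; only the CPU limit is dropped
```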

One backend pod flatlined at 95-150m for 60 seconds. A healthy pod's CPU fluctuates; this flat ceiling is the throttle.

Simulation 4: Memory Stress and OOMKill

This simulation allocates memory in 50MB chunks inside a backend pod until the kernel kills it. Without Kubernetes the process dies, the server goes down, and someone gets paged.

Before you click: Run kubectl get pods -n kubelab -l app=backend -w and open the Grafana Memory Usage panel.

kubectl get pods -n kubelab -l app=backend -w
# backend-abc123   0/1   OOMKilled   3   5m   ← no Terminating phase; SIGKILL bypasses graceful shutdown

What happened: The container's memory usage crossed the 256Mi cgroup limit. The Linux kernel OOM killer scored processes in the container's cgroup and sent SIGKILL (exit code 137) to the top consumer. Not Kubernetes, the kernel. SIGKILL cannot be caught or handled, so no preStop hook runs and in-memory data or open transactions can be lost. Kubernetes only observed the exit, labeled it OOMKilled, and started a fresh container.

The production trap: The pod runs fine for 8 hours, OOMKills, and restarts. Memory resets to zero and everything looks healthy again. This repeats every 8 hours. The restart count climbs to 7, then 15, then 30, but no alert fires because the metrics look normal between crashes. You find out when a user emails saying the app has been "a bit glitchy lately."

The fix: Alert on increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) > 3 before users notice.
The Prometheus expression means: count how many times containers in the kubelab namespace have restarted over the last hour, and fire an alert when that count exceeds 3. A healthy pod rarely restarts. Several restarts in an hour usually means the container is hitting its memory limit, dying, and coming back in a loop, so this alert catches the silent OOMKill pattern before users do.
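As a sketch, the alert could live in a prometheus-operator PrometheusRule (names and thresholds illustrative; increase() counts restarts over the window, which matches the "3 restarts per hour" intent):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubelab-restart-loop
spec:
  groups:
    - name: kubelab
      rules:
        - alert: PodRestartLoop
          # increase() counts restarts in the last hour; > 3 catches loops
          expr: increase(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Container restarting repeatedly; possible OOMKill loop
```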

Confirm it happened:

kubectl describe pod -n kubelab <pod-name> | grep -A 5 "Last State:"
# Reason: OOMKilled
# Exit Code: 137

To see the last output before the kernel killed the process, run kubectl logs -n kubelab <pod-name> --previous. The log stream stops abruptly with no shutdown message; SIGKILL leaves no time for cleanup or final logs.

One backend pod's memory climbs, then the line drops at the OOMKill and reappears as the container restarts. The other pod's line stays flat the whole time.

Simulation 5: Database Failure

This simulation scales the PostgreSQL StatefulSet to 0 replicas. The pod terminates completely. Without Kubernetes, the database server crashes and data recovery depends on whether backups exist and when they ran.

Before you click: Run kubectl get pods,pvc -n kubelab. Note that the PVC exists before you start.

kubectl get pods,pvc -n kubelab
# postgres-0   (gone)
# postgres-data-postgres-0   Bound   ← PVC stays; data lives on the volume

A PVC, or PersistentVolumeClaim, is a request for storage by a user. Think of it as a pod's way of saying, "I need a certain amount of durable, persistent storage." In the context of a stateful application like PostgreSQL, the PVC is critical. When the database pod is deleted, the PVC (and the underlying PersistentVolume it is bound to) remains. This is where the actual database files are stored. When a new postgres-0 pod is created, the StatefulSet knows to re-attach the same PVC, ensuring the new pod has access to all the old data, preventing data loss.

What happened: The StatefulSet controller deleted the pod but left the PersistentVolumeClaim untouched. StatefulSets guarantee stable names and stable PVC binding. postgres-0 always mounts postgres-data-postgres-0. When you restore, the same pod name comes back and reattaches the same volume. PostgreSQL replays WAL to reach a consistent state.
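The stable binding comes from the StatefulSet's volumeClaimTemplates, which is what yields the postgres-data-postgres-0 PVC name. A minimal sketch (image and storage size are illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:        # yields PVC postgres-data-postgres-0
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```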

The production trap: Apps without connection retry logic return 500s and stay broken even after PostgreSQL restores. Connection pools that do not validate on acquire hold dead connections forever.

The fix: Add connection retry with exponential backoff in your app. Use network-attached storage (EBS, GCE PD) in production so the pod can reschedule to any node.
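The retry half of that fix can be sketched as a generic TypeScript helper (names and timings are illustrative, not from the KubeLab backend):

```typescript
// Retry an async operation with exponential backoff plus jitter.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // 100ms, 200ms, 400ms, ... plus jitter to avoid a thundering herd
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
  throw lastError;
}
```

Wrap the database connect call in withRetry so the app rides out the window while postgres-0 comes back, instead of returning 500s until someone restarts it.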

Simulation 6: Cascading Pod Failure

This simulation deletes both backend replicas at the same time. Without Kubernetes, when everything is down you'd have to restart every service manually and hope it all comes up in the right order.

Before you click: Run kubectl get endpoints -n kubelab backend-service -w. Watch the IP list.

kubectl get endpoints -n kubelab backend-service -w
# ENDPOINTS   <none>   ← every request in this window gets Connection refused

What happened: Both pods were deleted. The Service had zero endpoints. The ReplicaSet created two replacements in parallel, but traffic stayed broken until both passed their readiness probes. The endpoint list went empty and came back. You can see the exact downtime window in Grafana's HTTP Request Rate panel.

The 5xx spike during Cascading Failure, 5 to 15 seconds of real downtime with the exact window timestamped

The production trap: replicas: 2 protects you from one pod dying at a time, nothing more.
If both replicas land on the same node and that node goes down, you have zero replicas and full downtime.
Check right now with kubectl get pods -n kubelab -o wide | grep backend, and if both pods show the same NODE, you are one node failure away from an outage.

The fix: Use pod anti-affinity to force replicas onto different nodes and a PodDisruptionBudget with minAvailable: 1 to block any voluntary action that would leave zero replicas.

Simulation 7: Readiness Probe Failure

This simulation makes one backend pod fail its readiness probe for 120 seconds without restarting it. Without Kubernetes, you'd have no way to take a pod out of traffic rotation without killing it. This is what happens in production when your app connects to a database on startup but the DB is slow. The pod is alive, but it's not ready. Kubernetes holds it out of rotation until it is.

Before you click: Run kubectl get pods -n kubelab -w in one tab and kubectl get endpoints -n kubelab backend-service -w in another.

# Pods tab: STATUS Running, RESTARTS 0 — almost nothing changes
# Endpoints tab: one IP disappears — the pod is alive but not receiving traffic

What happened: /ready returned 503. The kubelet marked the pod Ready=False. The Endpoints controller removed its IP from the Service. The liveness probe (/health) still returned 200, so no restart. After 120 seconds /ready recovered and the pod rejoined. Run kubectl logs -n kubelab <failing-pod> -f to see the app log 503s for the readiness endpoint while the pod stays Running and receives no traffic.

The production trap: Readiness probes that check external dependencies (database, cache, downstream API) will remove all pods from rotation when that dependency goes down. Instead of degrading gracefully, your entire app goes offline.

The fix: Readiness probes should test only what the pod itself controls. Use a separate deep health endpoint for dependency checks and never tie readiness to external service availability.
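A sketch of that split, as container probe config (the paths match the tutorial's /ready and /health endpoints; the port is an assumption):

```yaml
# Readiness tests only what the pod itself controls; dependency health
# lives on a separate deep endpoint that no probe or router consumes.
readinessProbe:
  httpGet:
    path: /ready        # pod-local checks only (e.g. app initialized)
    port: 3000
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health       # shallow; never checks the database or downstream APIs
    port: 3000
  periodSeconds: 10
```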

How to Read the Signals in Grafana

A screenshot showing my grafana dashboards

kubectl shows current state. Grafana shows what happened over time. That history is essential when you are debugging something that started 4 hours ago.

The Four Panels that Matter

Pod Restarts: A flat line is good. A step up every few hours is a silent OOMKill loop — the most common invisible production failure.

CPU Usage: A healthy pod's CPU fluctuates. A throttled pod's CPU is unnaturally flat at its limit. That flat ceiling is the signal, not the number.

Memory Usage: Watch for a line that climbs steadily then disappears. That disappearance is an OOMKill. The line reappearing from zero is the restart.

HTTP Request Rate: During Cascading Failure you see a spike of 5xx for 5–15 seconds, the exact downtime window, timestamped.

How to Read the Terminal Signals

What you see in the terminal during and after each simulation tells you things Grafana cannot. Five commands matter.

The -w flag on kubectl get pods -n kubelab -w streams changes in real time. The columns that matter are READY, STATUS, and RESTARTS. READY shows containers ready vs total — 1/2 means one container is alive but not passing its readiness probe. STATUS shows the pod lifecycle phase: Running, Pending, Terminating, OOMKilled. RESTARTS is the most important column in production. A number climbing silently over days is a memory leak or a crash loop nobody has noticed yet.

kubectl get events -n kubelab --sort-by=.lastTimestamp is the control plane's diary. Every action the cluster took is here: Killing, SuccessfulCreate, Scheduled, Pulled, Started, OOMKilling, BackOff. When something breaks and you do not know why, read the events. The timestamp gap between a Killing event and the next Started event is your actual downtime window — not an estimate, the exact number.
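As a rough sketch, that gap can be computed directly from two timestamps you copy out of the events output. This assumes GNU date (Linux); the timestamps below are sample values, not real cluster output:

```shell
# Sketch: compute the Killing -> Started gap in seconds.
# Sample timestamps standing in for the LAST SEEN values of the two events.
killed="2024-05-01T10:00:02Z"
started="2024-05-01T10:00:07Z"
downtime=$(( $(date -u -d "$started" +%s) - $(date -u -d "$killed" +%s) ))
echo "downtime: ${downtime}s"   # prints: downtime: 5s
```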

kubectl describe pod -n kubelab <pod-name> is the deepest single-pod view. Three sections matter: Conditions (Ready: True/False tells you if the pod is in the Service endpoints), Last State (shows the previous container's exit reason — OOMKilled, exit code 137, or a crash), and Events at the bottom (the scheduler's reasoning for every placement decision). This is the first command to run when a pod is misbehaving.

kubectl get endpoints -n kubelab backend-service shows which pod IPs are actually receiving traffic right now. A pod can show Running in kubectl get pods and be completely absent from this list. That is a readiness probe failure. If this list is empty, no request to that Service will succeed regardless of how many pods show Running. Check this whenever users report errors but pods look healthy.

kubectl logs -n kubelab <pod-name> shows the container's stdout and stderr. Use -f to follow the stream. After a pod restarts, use --previous to see the logs from the container that just exited, essential when you need to know what the app was doing right before an OOMKill or crash. Logs are per container and are gone once the pod is replaced, so grab them before the ReplicaSet creates a new pod with a new name.

A full event sequence during Kill Pod recovery looks like this:

kubectl get events -n kubelab --sort-by=.lastTimestamp | tail -10
REASON            MESSAGE
Killing           Stopping container backend          ← SIGTERM sent
SuccessfulCreate  Created pod backend-xyz789          ← ReplicaSet fired
Scheduled         Successfully assigned to worker-2   ← Scheduler placed it
Pulled            Container image already present     ← no pull delay
Started           Started container backend           ← running

The line between Killing and Started is your actual recovery time. In a healthy cluster with a cached image it is 3–8 seconds. If it takes longer, check the Scheduled line; the scheduler may have struggled to find a node.

Two Prometheus Queries Worth Memorizing

First query: silent restart loop. rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) counts how many times containers in that namespace have restarted over the last hour and expresses it as a rate (restarts per second). A healthy workload rarely restarts. If this rate is high (for example more than 3 restarts per hour), something is killing the container repeatedly, often an OOMKill or a crash. Alert when it exceeds a threshold so you see the pattern before users report errors.

Second query: invisible CPU throttling. rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]) measures how much time, per second, the Linux scheduler spent throttling containers in that namespace over the last 5 minutes. A result of 0.25 means the container was frozen 25% of the time. High latency with no restarts and "normal" CPU usage in kubectl top often means the CPU limit is too low and the kernel is throttling the process. Alert when this rate exceeds about 0.25 (25% throttled).

# Silent restart loop — alert when this exceeds 3 per hour
rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h])

# Invisible throttling — alert when this exceeds 25%
rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m])

Run these against your own cluster, not just KubeLab. These are production queries.
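If you want these as standing alerts rather than ad-hoc queries, a minimal Prometheus rules file could look like this. This is a sketch: the alert names are my own invention, and the thresholds simply mirror the ones suggested above.

```yaml
groups:
  - name: kubelab-alerts
    rules:
      - alert: SilentRestartLoop
        # rate() is per second; multiply by 3600 to compare against restarts/hour
        expr: rate(kube_pod_container_status_restarts_total{namespace="kubelab"}[1h]) * 3600 > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container restarting repeatedly (possible OOMKill or crash loop)"

      - alert: InvisibleCpuThrottling
        # 0.25 = the container spent 25% of its time frozen by the CFS scheduler
        expr: rate(container_cpu_cfs_throttled_seconds_total{namespace="kubelab"}[5m]) > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU limit too low; kernel is throttling the process"
```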

6. How to Use This for Production Debugging

The repo includes docs/diagnose.md, a symptom-to-simulation map. Find the simulation that reproduces your issue, run it in KubeLab, and understand the mechanics before you touch production.

Exit code 137, pods restarting. Run the Memory Stress simulation. Confirm with kubectl describe pod | grep -A 5 "Last State:" and look for Reason: OOMKilled. Raise limits or find the leak. The simulation shows both.

High latency, pods look healthy, zero restarts. Run the CPU Stress simulation. Check container_cpu_cfs_throttled_seconds_total in Prometheus. If it climbs, your CPU limit is too low and the pod is frozen by CFS.

503 on some requests, pods show Running. Run the Readiness Probe Failure simulation. Check kubectl get endpoints — one pod IP is missing despite Running. The pod gets zero traffic.

Pods stuck Pending after a node went down. Run the Drain Node simulation. Run kubectl describe pod <pending-pod> and read Events. The scheduler will state why it cannot place the pod, often insufficient capacity or a PVC on the failed node.

Conclusion

You just broke a real Kubernetes cluster seven ways and watched it fix itself each time. You have seen the ReplicaSet controller fire, read an OOMKill from kubectl describe, watched endpoints go empty during a cascading failure, and understood why a pod can be Running and receiving zero traffic at the same time.

What you practiced here applies to other clusters, staging or production, that you can read but not safely break. That muscle memory (events, endpoints, restart counter) is what you reach for at 3 a.m. when something is wrong. KubeLab is the safe place to build that reflex.

The repo holds more than this article covered. Explore mode lets you run simulations without the guided flow. The full interview prep doc at docs/interview-prep.md has answers to the 13 most common Kubernetes interview questions. The observability guide at docs/observability.md covers Prometheus and Grafana setup in detail.

If this helped you, star the repo at https://github.com/Osomudeya/kube-lab and share it with someone who is learning Kubernetes the hard way.




How to Use Docker Compose for Production Workloads — with Profiles, Watch Mode, and GPU Support


There's a perception problem with Docker Compose. Ask a room full of platform engineers what they think of it, and you'll hear some version of: "It's great for local dev, but we use Kubernetes for real work."

I get it. I held that same opinion for years. Compose was the thing I used to spin up a Postgres database on my laptop, not something I'd trust with a staging environment, let alone a workload that needed GPU access.

Then 2024 and 2025 happened. Docker shipped a set of features that quietly transformed Compose from a developer convenience tool into something that can handle complex deployment scenarios. Profiles let you manage multiple environments from a single file. Watch mode killed the painful rebuild cycle that made container-based development feel sluggish. GPU support opened the door to ML inference workloads. And a bunch of smaller improvements (better health checks, Bake integration, structured logging) filled in the gaps that used to make Compose feel like a toy.

Here's what I'll cover: using Docker Compose profiles to manage multiple environments from one file, setting up watch mode for instant code syncing during development, configuring GPU passthrough for machine learning workloads, implementing proper health checks and startup ordering so your services stop crashing on cold starts, and using Bake to bridge the gap between your local Compose workflow and production image builds. I'll also tell you where Compose still falls short and where you should reach for something else.

Prerequisites

You should be comfortable with Docker basics and have written a compose.yaml file before. You'll need Docker Compose v2 installed. The minimum version depends on which features you want: service_healthy dependency conditions require v2.20.0+, watch mode requires v2.22.0+, and the gpus: shorthand requires v2.30.0+. Run docker compose version to check what you have.


The Modern Compose File: What's Changed

If you haven't looked at a Compose file recently, the first thing you'll notice is that the version field is gone. Docker Compose v2 ignores it entirely, and including it actually triggers a deprecation warning. A modern compose.yaml starts cleanly with your services, no preamble needed.

But the structural changes go deeper than that. Here's what a modern, production-aware Compose file looks like for a typical web application stack:

services:
  api:
    image: ghcr.io/myorg/api:${TAG:-latest}
    env_file: [configs/common.env]
    environment:
      - NODE_ENV=${NODE_ENV:-production}
    ports:
      - "8080:8080"
    depends_on:
      db:
        condition: service_healthy
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3

  db:
    image: postgres:16-alpine
    volumes:
      - db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 5

volumes:
  db-data:

Look at what's in there: resource limits, health checks with dependency conditions, proper volume management. These aren't nice-to-haves. They're the features that make Compose viable beyond your laptop.

Health checks in particular solve one of Compose's oldest and most annoying pain points: the race condition where your web server starts before the database is actually ready to accept connections. If you've ever added sleep 10 to a startup script and crossed your fingers, you know what I'm talking about.

How to Use Profiles to Manage Multiple Environments

This is the feature that changed my relationship with Compose. Before profiles, managing different environments meant choosing between two painful approaches. Either you maintained multiple Compose files (docker-compose.yml, docker-compose.dev.yml, docker-compose.test.yml, docker-compose.prod.yml) and dealt with the inevitable drift between them. Or you used one big bloated file where you commented out services depending on the context. Both approaches were fragile, and both led to those fun "works on my machine" conversations.

Profiles give you a much cleaner path. You assign services to named groups. Services without a profile always start. Services with a profile only start when you explicitly activate that profile. You can also activate profiles with the COMPOSE_PROFILES environment variable instead of the CLI flag, which is handy for CI (see the official profiles docs for the full syntax).

Here's what that looks like:

services:
  api:
    image: myapp:latest
    # No profiles = always starts

  db:
    image: postgres:16
    # No profiles = always starts

  debug-tools:
    image: busybox
    profiles: [debug]
    # Only starts with --profile debug

  prometheus:
    image: prom/prometheus
    profiles: [monitoring]
    # Only starts with --profile monitoring

  grafana:
    image: grafana/grafana
    profiles: [monitoring]
    depends_on: [prometheus]

Now your team operates with simple, memorable commands:

# Development: just the core stack
docker compose up -d

# Development with observability
docker compose --profile monitoring up -d

# CI: core stack only (no monitoring overhead)
docker compose up -d

# Full stack with debugging
docker compose --profile debug --profile monitoring up

One Compose file. No drift. No guesswork about which override file to pass.
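In CI, the COMPOSE_PROFILES environment variable keeps this declarative. A hypothetical GitHub Actions job might look like the sketch below; the script path and job name are illustrative, not part of any real repo:

```yaml
jobs:
  e2e-tests:
    runs-on: ubuntu-latest
    env:
      # Equivalent to: docker compose --profile monitoring up -d
      COMPOSE_PROFILES: monitoring
    steps:
      - uses: actions/checkout@v4
      - run: docker compose up -d --wait
      - run: ./scripts/run-e2e.sh   # hypothetical test entrypoint
      - run: docker compose down -v
        if: always()
```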

Real-World Profile Patterns I've Used

Four patterns I keep coming back to:

The "infra-only" pattern. This is for developers who run application code natively on their host machine but need infrastructure services like databases, message queues, and caches in containers. You leave infrastructure services without a profile and put application services behind one. Your backend developer runs docker compose up to get Postgres and Redis, then starts the API directly on their host with their favorite debugger attached.

The "mock vs. real" pattern. You put a payments-mock service in the dev profile and a real payments gateway service in the prod profile. Same Compose file, totally different behavior depending on context. This one saved my team from accidentally hitting a live payment API during development more than once.

The "CI optimization" pattern. Heavy services like Selenium browsers and monitoring stacks go behind profiles so your CI pipeline skips them. Your test suite runs faster without that overhead, and you only pull those services in when you actually need end-to-end integration tests.

The "AI/ML workloads" pattern. GPU-dependent services (inference servers, model training containers) go into a gpu profile. Developers without GPUs can still work on the rest of the stack without anything breaking.
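The mock-vs-real pattern above is only a few lines of YAML. A sketch, with illustrative service names and images, using a shared network alias so the application never has to know which payments backend is active:

```yaml
services:
  checkout:
    image: myorg/checkout:latest
    environment:
      # Resolves to whichever payments service the active profile started
      - PAYMENTS_URL=http://payments:8080

  payments-mock:
    image: myorg/payments-mock:latest
    profiles: [dev]
    networks:
      default:
        aliases: [payments]

  payments-gateway:
    image: myorg/payments-gateway:latest
    profiles: [prod]
    networks:
      default:
        aliases: [payments]
```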

One practical tip that's saved me a lot of headaches: document your profiles in the project's README. It sounds obvious, but when a new team member runs docker compose up and wonders why the monitoring dashboard isn't starting, they need a single place to find the answer. A quick table listing each profile and what it includes will save you from answering the same Slack question every onboarding cycle.
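Something along these lines works well; the profile names match the earlier example, and the contents are illustrative:

```markdown
| Profile      | Extra services       | When to use it                |
|--------------|----------------------|-------------------------------|
| (none)       | api, db              | Everyday development and CI   |
| monitoring   | prometheus, grafana  | Debugging performance locally |
| debug        | debug-tools          | Poking around inside the stack|
```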

How to Use Watch Mode to End the Rebuild Cycle

If profiles solved the environment management problem, watch mode solved the developer experience problem.

You probably know the old workflow for container-based development. It went like this: edit code, run docker compose build, run docker compose up, test your change, find a bug, edit again, rebuild, restart, test. Each iteration costs you thirty seconds to a minute of waiting. Over a full day of active development, you're losing an hour or more just sitting there watching build logs scroll by.

Watch mode (introduced in Compose v2.22.0 and significantly improved in later releases) monitors your local files and automatically takes action when something changes. It supports three synchronization strategies, and picking the right one for each situation is the key to making it work well. The official watch mode docs cover the full spec if you want to dig deeper.

sync copies changed files directly into the running container. This works best for interpreted languages like Python, JavaScript, and Ruby, and for frameworks with hot module reloading like React, Vue, or Next.js. The file lands in the container, the framework picks up the change, and your browser updates. No rebuild, no restart. If you're working with a compiled language like Go, Rust, or Java, sync won't help you since the code needs to be recompiled. Use rebuild for those instead.

rebuild triggers a full image rebuild and container replacement. You want this for dependency changes, like when you update package.json or requirements.txt, or when you modify the Dockerfile itself. In those cases, syncing files isn't enough. You need a fresh image.

sync+restart syncs files into the container, then restarts the main process. This is ideal for configuration file changes like nginx.conf or database configs, where the application needs to reload to pick up the new settings but the image itself is fine.

Here's what a real-world watch configuration looks like for a Node.js application:

services:
  api:
    build: .
    ports: ["3000:3000"]
    command: npx nodemon server.js
    develop:
      watch:
        - action: sync
          path: ./src
          target: /app/src
          ignore:
            - node_modules/
        - action: rebuild
          path: package.json
        - action: sync+restart
          path: ./config
          target: /app/config

You start it with docker compose up --watch, or you can run docker compose watch as a standalone command if you'd rather keep the file sync events separate from your application logs.

A few things to know before you set this up. Watch mode only works with services that have a local build: context. If you're pulling a prebuilt image from a registry, there's nothing for Compose to sync or rebuild, so watch will ignore that service. Your container also needs basic file utilities (stat, mkdir) installed, and the container USER must have write access to the target path. If you're using a minimal base image like scratch or distroless, the sync action won't work. And if you're on an older Compose version, check which actions are supported: sync+restart and sync+exec were added in later minor releases after the initial v2.22.0 launch.

It's a massive improvement. Edit a source file, save it, and the change is live in under a second for frameworks with hot reload. No context switching to run build commands. No waiting. Just code.

Watch Mode vs. Bind Mounts

A fair question you might be asking: bind mounts have provided a form of live-reload for years. Why does watch mode need to exist?

Bind mounts work, but they come with platform-specific issues that have plagued Docker Desktop for a long time. On macOS and Windows, bind mounts go through a filesystem sharing layer between the host OS and the Linux VM running Docker. This introduces permission quirks, performance problems on large directories (ever watched a node_modules folder choke a bind mount on macOS?), and inconsistent file notification behavior that makes hot reload unreliable.

Watch mode sidesteps these issues by explicitly syncing files at the application level. It's more predictable, works consistently across platforms, and gives you more control over what happens when a file changes.

That said, bind mounts still work well for many use cases, especially if you're on native Linux where the performance overhead doesn't exist. Watch mode is the better choice for teams that have run into cross-platform issues, or for anyone who wants the automatic rebuild and restart triggers that bind mounts can't provide.

How to Set Up GPU Support for Machine Learning Workloads

This is the feature that made me rethink what Compose can do.

Docker has supported GPU passthrough for individual containers for years through the NVIDIA Container Toolkit and the --gpus flag. But configuring GPU access in Compose files used to require clunky runtime declarations that were poorly documented and changed between Compose versions. It was the kind of thing where you'd find a Stack Overflow answer from 2021, try it, and discover it didn't work anymore.

The modern Compose spec handles it cleanly through the deploy.resources.reservations.devices block:

services:
  inference:
    image: myorg/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

If you're on Compose v2.30.0 or later, there's also a shorter syntax using the gpus: field:

services:
  inference:
    image: myorg/model-server:latest
    gpus:
      - driver: nvidia
        count: 1

Both approaches do the same thing. The deploy.resources syntax works on older Compose versions and gives you more control (like setting device_ids to pin specific GPUs). The gpus: shorthand is cleaner when you just need basic access.

One thing that will trip you up if you skip it: your host machine needs the right GPU drivers and nvidia-container-toolkit installed before any of this works. Run nvidia-smi on the host first. If that command doesn't show your GPUs, Compose won't see them either. For CUDA workloads, use official GPU base images like nvidia/cuda or the PyTorch/TensorFlow GPU images. The Compose GPU access docs walk through the full setup.

That's the whole thing. When you run docker compose up, the inference service gets access to one NVIDIA GPU. You can set count to "all" if you want every available GPU, or use device_ids to assign specific GPUs to specific services.
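For instance, here is the same block grabbing every GPU for one service and pinning another to specific devices. A sketch based on the Compose spec; the service names and images are made up:

```yaml
services:
  trainer:
    image: myorg/trainer:latest
    deploy:
      resources:
        reservations:
          devices:
            # Grab every available GPU on the host
            - driver: nvidia
              count: all
              capabilities: [gpu]

  inference:
    image: myorg/model-server:latest
    deploy:
      resources:
        reservations:
          devices:
            # Pin this service to GPUs 0 and 1 specifically.
            # count and device_ids are mutually exclusive.
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
```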

How to Combine Multi-GPU Workloads with Profiles

Here's where profiles and GPU support work really well together. Consider an ML workload where you need an LLM for text generation, an embedding model for vector search, and a vector database:

services:
  vectordb:
    image: milvus/milvus:latest
    # Runs on CPU, no profile needed

  llm-server:
    image: ollama/ollama:latest
    profiles: [gpu]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    volumes:
      - model-cache:/root/.ollama

  embedding-server:
    image: myorg/embeddings:latest
    profiles: [gpu]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

Developers without GPUs work on the application logic with just docker compose up. The vector database starts, they can write code against its API, and everything runs fine. When it's time to test the full ML pipeline, someone with a multi-GPU workstation runs docker compose --profile gpu up and gets the complete stack with specific GPU assignments.

This pattern has become central to our AIOps platform development. The team building alerting logic doesn't need GPUs. The team training anomaly detection models does. One Compose file serves both teams.

How to Configure Health Checks, Dependencies, and Startup Ordering

One of Compose's most underappreciated improvements is how it handles service dependencies. The depends_on directive now supports conditions that actually mean something (this requires Compose v2.20.0+, see the startup ordering docs for the full picture):

depends_on:
  db:
    condition: service_healthy
  redis:
    condition: service_started

When you combine this with proper health checks, you eliminate the "sleep 10 and hope" pattern that plagues so many Compose setups. Your API service actually waits until PostgreSQL is accepting connections before it tries to start. Not just until the container is running, but until the database process inside it has passed its health check.

One detail that catches people: tune your start_period. Databases like PostgreSQL need time to initialize on first boot, especially if they're running migrations. Without a start_period, the health check starts counting retries immediately and can declare the service unhealthy before it even had a chance to finish starting up. A config like this works well for most database services:

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 5s
  timeout: 2s
  retries: 10
  start_period: 30s

The start_period gives the container 30 seconds of grace time where failed health checks don't count against the retry limit.

This might seem like a small detail, but if you've ever worked on a stack with eight or ten interconnected services, you know how much time you can waste debugging cascading failures during cold starts. Proper startup ordering prevents all of that and makes your local environment behave much more like production.

How to Use Bake for Production Image Builds

I mentioned Bake integration earlier, and it's worth its own section because it solves a problem you'll hit as soon as you start using Compose for anything beyond local dev: your development Compose file and your production build process have different needs.

During development, you want fast builds, local caches, and single-platform images. For production, you want tagged images pushed to a registry, multi-platform builds, and build attestations. Trying to cram both into your compose.yaml gets messy fast.

Docker Bake (docker buildx bake) can read your compose.yaml and generate build targets from it, but you can override and extend those targets with a separate docker-bake.hcl file. This keeps your development workflow clean while giving CI the knobs it needs. The Bake documentation covers the full HCL syntax and Compose integration.

Here's a minimal docker-bake.hcl:

group "default" {
  targets = ["api", "worker"]
}

target "api" {
  context    = "api"
  dockerfile = "Dockerfile"
  tags       = ["registry.example.com/team/api:release"]
  platforms  = ["linux/amd64"]
}

target "worker" {
  context    = "worker"
  dockerfile = "Dockerfile"
  tags       = ["registry.example.com/team/worker:release"]
}

Then your CI pipeline runs docker buildx bake to produce release images, while developers keep using docker compose up --build locally. The two workflows share the same Dockerfiles but have separate build configurations where they need them.

The pattern I've landed on: use Compose for local development and CI test environments, use Bake in CI to produce the release images, and push those images into whatever deployment target your team uses (staging server, Kubernetes cluster, edge node). Compose gets you from code to running containers fast. Bake gets you from code to production-ready images with proper tags and attestations.

What Compose Is Not (An Honest Assessment)

I've spent this entire article making the case that Compose has grown up. But I should also tell you where it falls short. I'd rather you hear it from me now than discover it the hard way in production.

Compose is not a container orchestrator. It doesn't schedule work across multiple hosts. It doesn't do automatic failover. It won't give you rolling updates with zero downtime, and it has no concept of service mesh networking. If you need any of those things, you need Kubernetes, Nomad, or Docker Swarm (if you're still using it).

Compose doesn't replace Helm or Kustomize. If you're deploying to Kubernetes, Compose files don't translate directly. Docker offers Compose Bridge to convert Compose files into Kubernetes manifests, but it's still experimental and won't handle complex Kubernetes-specific configurations like custom resource definitions or ingress rules.

Compose doesn't handle secrets well in production. The secrets support exists, but it's limited compared to HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets. For anything beyond a staging environment, you'll want an external secrets management solution.

The sweet spot for modern Compose is clear: local development, CI/CD testing environments, single-node staging environments, and workloads where a single powerful machine (particularly for GPU work) is the right deployment target. Within that scope, Compose is excellent. Outside of it, you'll hit walls fast.

If you do run Compose in a staging or single-node production setup, a few more things are worth adding that I haven't covered here: restart: unless-stopped on every service so containers come back after a host reboot, a logging driver config so your logs go somewhere searchable instead of disappearing into docker logs, and a backup strategy for your named volumes. These aren't Compose-specific problems, but Compose won't solve them for you either.
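As a sketch, those additions look something like this; the json-file options and the backup command are examples of one reasonable setup, not requirements:

```yaml
services:
  api:
    image: ghcr.io/myorg/api:${TAG:-latest}
    restart: unless-stopped
    logging:
      driver: json-file        # or a remote driver like fluentd / syslog
      options:
        max-size: "10m"
        max-file: "3"

# Volume backup idea (run from cron on the host): tar the named volume
# through a throwaway container, e.g.
#   docker run --rm -v db-data:/data -v "$PWD":/backup alpine \
#     tar czf /backup/db-data.tar.gz -C /data .
```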

A Practical Adoption Path

If you're currently working with a basic Compose setup and want to start using these features, here's the order I'd recommend. Each step is incremental, backward-compatible, and valuable on its own. You don't have to do all of this at once.

Week 1: Add health checks and proper depends_on conditions. This alone will eliminate the most common frustration: services crashing on startup because their dependencies aren't ready yet. Start with your database and your main application service. Once those two are wired up with condition: service_healthy, you'll notice the difference immediately.

healthcheck:
  test: ["CMD-SHELL", "pg_isready -U postgres"]
  interval: 5s
  timeout: 2s
  retries: 10
  start_period: 30s

Week 2: Introduce profiles. Start by putting your monitoring stack behind a monitoring profile and your debug tools behind a debug profile. Then delete whatever extra Compose files you've been maintaining. Having one source of truth instead of four files that are almost-but-not-quite the same makes everything simpler.

Week 3: Set up watch mode for your most-edited service. Pick the service where your developers spend the most time iterating. Get watch mode working there first. Once the team sees the difference (saving a file and seeing the change reflected in under a second) they'll ask for it on everything else.

Week 4: Add resource limits. Define memory and CPU limits for every service. This prevents one runaway container from starving the rest and gives you a realistic preview of how your services behave under production constraints. It's also useful for catching memory leaks early.

deploy:
  resources:
    limits:
      memory: 512M
      cpus: "1.0"

Wrapping Up

Docker Compose in 2026 is not the same tool it was a few years ago. Profiles, watch mode, GPU support, proper dependency management, and Bake integration have turned it into something that can handle real, complex workloads, as long as those workloads fit on a single node.

It's not Kubernetes, and it shouldn't try to be. But for local development, CI pipelines, staging environments, and single-machine GPU workloads, it's become hard to argue against. If you've been dismissing Compose because of what it used to be, the current version deserves a second look.

If you found this useful, you can find me writing about DevOps, containers, and AIOps best practices on my blog.




Microsoft Copilot in Outlook: Streamline Your Email Workflow with AI

Master email management with AI—learn how to use Copilot in Outlook to summarize threads, draft replies, and take control of your inbox.