Sr. Content Developer at Microsoft, working remotely in PA, TechBash conference organizer, former Microsoft MVP, Husband, Dad and Geek.

Kubernetes v1.35: New level of efficiency with in-place Pod restart


The release of Kubernetes 1.35 introduces a powerful new feature that provides a much-requested capability: the ability to trigger a full, in-place restart of the Pod. This feature, Restart All Containers (alpha in 1.35), offers an efficient way to reset a Pod's state compared to the resource-intensive approach of deleting and recreating the entire Pod. It is especially useful for AI/ML workloads, allowing application developers to concentrate on their core training logic while offloading complex failure-handling and recovery mechanisms to sidecars and declarative Kubernetes configuration. With RestartAllContainers and other planned enhancements, Kubernetes continues to add building blocks for creating the most flexible, robust, and efficient platforms for AI/ML workloads.

This new functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.

The problem: when a single container restart isn't enough and recreating pods is too costly

Kubernetes has long supported restart policies at the Pod level (restartPolicy) and, more recently, at the individual container level. These policies are great for handling crashes in a single, isolated process. However, many modern applications have more complex inter-container dependencies. For instance:

  • An init container prepares the environment by mounting a volume or generating a configuration file. If the main application container corrupts this environment, simply restarting that one container is not enough. The entire initialization process needs to run again.
  • A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.
  • A sidecar that manages a remote resource fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.

In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and have a controller (like a Job or ReplicaSet) create a new one. This process is slow and expensive, involving the scheduler, node resource allocation and re-initialization of networking and storage.

This inefficiency becomes even worse when handling large-scale AI/ML workloads (>= 1,000 Nodes with one Pod per Node). A common requirement for these synchronous workloads is that when a failure occurs (such as a Node crash), all Pods in the fleet must be recreated to reset the state before training can resume, even if all the other Pods were not directly affected by the failure. Deleting, creating and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead of this failure could cost $100,000 per month in wasted resources.

Handling these failures for AI/ML training jobs requires complex integrations touching both the training framework and Kubernetes, and such integrations are often fragile and toilsome. This feature introduces a Kubernetes-native solution, improving system robustness and allowing application developers to concentrate on their core training logic.

Another major benefit of restarting Pods in place is that keeping Pods on their assigned Nodes allows for further optimizations. For example, one can implement node-level caching tied to a specific Pod identity, something that is impossible when Pods are unnecessarily being recreated on different Nodes.

Introducing the RestartAllContainers action

To address this, Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers. When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in-place restart of the Pod.

This in-place restart is highly efficient because it preserves the Pod's most important resources:

  • The Pod's UID, IP address and network namespace.
  • The Pod's sandbox and any attached devices.
  • All volumes, including emptyDir and mounted volumes from PVCs.

After terminating all running containers, the Pod's startup sequence is re-executed from the very beginning. This means all init containers are run again in order, followed by the sidecar and regular containers, ensuring a completely fresh start in a known-good environment. With the exception of ephemeral containers (which are terminated), all other containers—including those that previously succeeded or failed—will be restarted, regardless of their individual restart policies.

Use cases

1. Efficient restarts for ML/Batch jobs

For ML training jobs, rescheduling a worker Pod on failure is a costly operation that wastes valuable compute resources. On a 1,000-node training cluster, rescheduling overhead can waste over $100,000 in compute resources monthly.

The RestartAllContainers action addresses this by enabling a much faster, hybrid recovery strategy: recreate only the "bad" Pods (e.g., those on unhealthy Nodes) while triggering RestartAllContainers for the remaining healthy Pods. Benchmarks show this reduces the recovery overhead from minutes to a few seconds.

With in-place restarts, a watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher can exit with a designated code to trigger a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.

Read more details about future development and JobSet features at KEP-467 JobSet in-place restart.

apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
  # This init container will re-run on every in-place restart
  - name: setup-environment
    image: my-repo/setup-worker:1.0
  - name: watcher-sidecar
    image: my-repo/watcher:1.0
    restartPolicy: Always
    restartPolicyRules:
    - action: RestartAllContainers
      onExit:
        exitCodes:
          operator: In
          # A specific exit code from the watcher triggers a full pod restart
          values: [88]
  containers:
  - name: main-application
    image: my-repo/training-app:1.0
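
To make the rule in the manifest above a little more concrete, here is a minimal TypeScript sketch of how an exit code might be compared against a list of restart rules. This is not the kubelet's implementation: the `RestartRule` type and `matchRestartRule` function are hypothetical names invented for illustration, the field layout simply mirrors the manifest as shown, and both the `NotIn` operator and the "first matching rule wins" ordering are assumptions.

```typescript
// Hypothetical types that mirror the fields used in the manifest above.
type ExitCodeOperator = "In" | "NotIn"; // "NotIn" is an assumption, not shown in the post

interface RestartRule {
  action: string; // e.g. "RestartAllContainers"
  onExit: { exitCodes: { operator: ExitCodeOperator; values: number[] } };
}

// Return the action of the first rule whose exit-code requirement matches,
// or undefined when no rule applies. First-match ordering is an assumption.
function matchRestartRule(exitCode: number, rules: RestartRule[]): string | undefined {
  for (const rule of rules) {
    const { operator, values } = rule.onExit.exitCodes;
    const listed = values.indexOf(exitCode) !== -1;
    const matches = operator === "In" ? listed : !listed;
    if (matches) {
      return rule.action;
    }
  }
  return undefined;
}

// The watcher sidecar's rule from the manifest: exit code 88 restarts everything.
const watcherRules: RestartRule[] = [
  {
    action: "RestartAllContainers",
    onExit: { exitCodes: { operator: "In", values: [88] } },
  },
];

console.log(matchRestartRule(88, watcherRules));
console.log(matchRestartRule(1, watcherRules));
```

In this sketch, only exit code 88 maps to the RestartAllContainers action; any other exit code falls through, and the container's ordinary restart policy would decide what happens next.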

2. Re-running init containers for a clean state

Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume. If the main application fails in a way that corrupts this shared state, you need the init container to rerun.

By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.

3. Handling a high rate of similar task executions

There are cases where each task is best represented as a Pod execution, and each task requires a clean execution. The task may be a game session backend or some queue item processing. If the rate of tasks is high, running the whole cycle of Pod creation, scheduling and initialization is simply too expensive, especially when tasks can be short. The ability to restart all containers from scratch enables a Kubernetes-native way to handle this scenario without custom solutions or frameworks.

How to use it

To try this feature, you must enable the RestartAllContainersOnContainerExits feature gate on your Kubernetes cluster components (API server and kubelet) running Kubernetes v1.35+. This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.

Once enabled, you can add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.

The feature is designed to be easily usable on existing apps. However, if an application does not follow some best practices, it may cause issues for the application or for observability tooling. When enabling the feature, make sure that all containers are reentrant and that external tooling is prepared for init containers to re-run. Also, when restarting all containers, the kubelet does not run preStop hooks. This means containers must be designed to handle abrupt termination without relying on preStop hooks for graceful shutdown.

Observing the restart

To make this process observable, a new Pod condition, AllContainersRestarting, is added to the Pod's status. When a restart is triggered, this condition becomes True and it reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew. This provides a clear signal to users and other cluster components about the Pod's state.

All containers restarted by this action will have their restart count incremented in the container status.
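
As a rough illustration of how tooling might consume these signals, the TypeScript sketch below inspects a Pod status object for the AllContainersRestarting condition and collects per-container restart counts. The interfaces are trimmed-down, hand-written stand-ins for the Pod status fields (not types from any Kubernetes client library), and the function names are hypothetical.

```typescript
// Simplified, hand-written stand-ins for the relevant parts of a Pod's status.
interface PodCondition {
  type: string;
  status: "True" | "False" | "Unknown";
}

interface ContainerStatus {
  name: string;
  restartCount: number;
}

interface PodStatus {
  conditions?: PodCondition[];
  containerStatuses?: ContainerStatus[];
}

// True while the Pod reports the AllContainersRestarting condition as "True".
function isRestartingAllContainers(status: PodStatus): boolean {
  return (status.conditions ?? []).some(
    (c) => c.type === "AllContainersRestarting" && c.status === "True"
  );
}

// Map each container name to its restart count, e.g. to diff before and after a restart.
function restartCounts(status: PodStatus): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const cs of status.containerStatuses ?? []) {
    counts[cs.name] = cs.restartCount;
  }
  return counts;
}

// Made-up sample status for the ml-worker-pod example.
const sampleStatus: PodStatus = {
  conditions: [{ type: "AllContainersRestarting", status: "True" }],
  containerStatuses: [{ name: "main-application", restartCount: 2 }],
};

console.log(isRestartingAllContainers(sampleStatus));
console.log(restartCounts(sampleStatus));
```

Observability tooling that alerts on restart counts may want to check the condition first, so that an intentional in-place restart is not mistaken for a crash loop.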

Learn more

We want your feedback!

As an alpha feature, RestartAllContainers is ready for you to experiment with and any use cases and feedback are welcome. This feature is driven by the SIG Node community. If you are interested in getting involved, sharing your thoughts, or contributing, please join us!

You can reach SIG Node through:


An Astro site for my CSS Snippets


As I think I've mentioned a few times already, I'm learning Astro and attempting to build random stuff with it just as an excuse to help practice and learn. With that in mind, during the Christmas break and between marathon sessions of Baldur's Gate 3, I built a little site I thought I'd share here on the blog. To be clear, this is nothing special, and doesn't come close to using all of the possible Astro features of course, but it was a useful coding exercise for myself and fun to build.

The web platform as a whole has gotten dramatically better over the past decade, and CSS improvements are a big part of that. There are a huge number of new CSS features I'm "kinda" aware of but don't really have much experience with. One of the things I do to help me in that regard is keep notes of CSS snippets I find myself using again and again so I don't have to Google for them. I use Microsoft OneNote to track these and just write them down in quick and dirty blocks of text like so:

Center vertically in div: 
align-content: center 
 
Center iFrame: 
display:block;margin:auto 

CSS for Borders: 
fieldset {  
     border-style:solid; 
     border-width:thin; 
} 

Table CSS for borders: 
table { 
    border-collapse: collapse; 
    border: 1px solid black; 
    width: 100%; 
    max-width: 500px; 
} 

th, td { 
    border: 1px solid black; 
    padding: 5px;  
} 

Two Cols: 
.twocol { 
    display: grid; 
    grid-template-columns: 33% 66%; 
} 

There isn't any additional information here as I know what I'm typically searching for and just do a quick copy, paste, and modify to suit whatever I'm building.

I thought it might be interesting to take these tips and create a simple Astro site out of them. I'd use one Markdown source file per tip (which admittedly is perhaps overkill for some of these short snippets) and see if Astro could render the code and the output.

If you don't actually care about how I built it, you can go ahead and navigate to https://css-snippets.netlify.app/ and check it out. Here's a sample of one of the snippets:

[Image: screenshot from site]

Alright, so here's how I built it.

Source Markdown

As I mentioned, I wanted my content to be driven by simple Markdown files. For this, I made use of Astro's content collections feature. First, I created a directory for my snippets called... snippets. In there I placed one Markdown file per snippet. Each file has a title, a set of tags, and the CSS snippet along with HTML to demonstrate it. Here's one for zebra striping table rows:

---
title: Zebra stripe a table
tags: ["tables"]
---

<style>
/* just to make it easier to see */
table {
    width: 500px;
}

tbody tr {
  background-color: #0a5b20;
}

tbody tr:nth-child(even) {
  background-color: #000000; 
}
</style>

<table>
    <thead>
    <tr>
        <td>Name</td>
        <td>Age</td>
    </tr>
    </thead>
    <tbody>
    <tr>
        <td>Luna</td>
        <td>13</td>
    </tr>
    <tr>
        <td>Elise</td>
        <td>15</td>
    </tr>
    <tr>
        <td>Pig</td>
        <td>10</td>
    </tr>
    <tr>
        <td>Zelda</td>
        <td>2</td>
    </tr>
    </tbody>
</table>

To let Astro know about the collection, I then added content.config.ts to the root of my src folder and defined how it should find those Markdown files:

import { defineCollection } from 'astro:content';
import { glob } from 'astro/loaders';

const snippets = defineCollection({ 
    loader: glob({pattern: '*.md', base:'./snippets/'}),
});

export const collections = { snippets };

This was enough to make it available to my home page.

Rendering Snippets

First, I added a simple list to my home page. Right now this just lists everything, but I've only got a few snippets.

---
import { getCollection } from 'astro:content';
import BaseLayout from '../layouts/BaseLayout.astro';
const allSnippets = await getCollection('snippets');
---

<BaseLayout pageTitle="Welcome">

<ul>
{allSnippets.map(snippet => (
	<li><a href=`snippets/${snippet.id}`>{snippet.data.title}</a></li>
))}
</ul>

</BaseLayout>

I won't bother sharing the layout file, but will note I made use of a nice little CSS framework, Simple.css.

Next, I added a template that would render one file per snippet. For this, I made an Astro file named src/pages/snippets/[id].astro. The [id] portion makes it dynamic. This page both handles the logic of "how do I know what routes to support" and "how do I render each one":

---
import { Code } from 'astro:components';
import { getCollection } from 'astro:content';
import BaseLayout from '../../layouts/BaseLayout.astro';

export async function getStaticPaths() {
  const snippets = await getCollection('snippets');
  return snippets.map(snippet => ({
    params: { id: snippet.id },
    props: { snippet },
  }));
}

const { snippet } = Astro.props;
---

<BaseLayout pageTitle={snippet.data.title}>

<p>
  Tags:
  {snippet.data.tags.map(tag => (
    <a href=`/tags/${tag}` class="tag">{tag} </a>
  ))}
</p>

<h3>Code</h3>

<Code code={snippet.body} lang="html" />

<h3>Output</h3>
<div set:html={snippet.body}></div>

</BaseLayout>

On top you can see where I get my collection and define the paths. I just use the Markdown's id value (which comes from the filename) and pass the content in as well. That's picked up in the template as snippet and then rendered both using Astro's native Code component for source code rendering and as raw HTML in a div.

The last part of the site is similar - handling tag pages - but the logic is a bit more complex. First, I added src/pages/tags/[id].astro. Here's how I handled the logic:

---
import { getCollection } from 'astro:content';
import BaseLayout from '../../layouts/BaseLayout.astro';
import { toTitleCase } from '../../utils/formatter.js';

export async function getStaticPaths() {
    const snippets = await getCollection('snippets');
    let tagPages = {};

    snippets.forEach((snippet) => {
        snippet.data.tags.forEach((tag) => {
            if (!tagPages[tag]) {
                tagPages[tag] = [];
            }
            tagPages[tag].push(snippet);
        });
    });

    return Object.keys(tagPages).map((tag) => ({
        params: { id: tag },
        props: { tag, pages:tagPages[tag] },
    }));
}

const { tag, pages } = Astro.props;
---

<BaseLayout pageTitle={toTitleCase(tag)}>

    <ul>
{pages.map(snippet => (
	<li><a href=`/snippets/${snippet.id}`>{snippet.data.title}</a></li>
))}
</ul>

</BaseLayout>

Basically, loop over my content, get unique tags, and create an array of pages for each tag. You can see an example of this here: https://css-snippets.netlify.app/tags/tables/
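
For anyone who wants to try the grouping logic outside of Astro, here is a standalone TypeScript sketch of the same idea. The `Snippet` interface is a simplified assumption about what a collection entry looks like, `groupByTag` is just a name picked for this example, and `toTitleCase` is a guess at what a helper like the one imported from utils/formatter.js could do, since the real one isn't shown in this post.

```typescript
// Simplified assumption of a collection entry: an id plus the frontmatter data.
interface Snippet {
  id: string;
  data: { title: string; tags: string[] };
}

// Build a lookup of tag -> snippets carrying that tag (same loop as in getStaticPaths).
function groupByTag(snippets: Snippet[]): Record<string, Snippet[]> {
  const tagPages: Record<string, Snippet[]> = {};
  for (const snippet of snippets) {
    for (const tag of snippet.data.tags) {
      if (!tagPages[tag]) {
        tagPages[tag] = [];
      }
      tagPages[tag].push(snippet);
    }
  }
  return tagPages;
}

// Hypothetical stand-in for the toTitleCase helper: capitalize each space-separated word.
function toTitleCase(text: string): string {
  return text
    .split(" ")
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(" ");
}

// Made-up sample entries, not the real site content.
const sampleSnippets: Snippet[] = [
  { id: "zebra-table", data: { title: "Zebra stripe a table", tags: ["tables"] } },
  { id: "table-borders", data: { title: "Table CSS for borders", tags: ["tables", "borders"] } },
];

const grouped = groupByTag(sampleSnippets);
for (const tag of Object.keys(grouped)) {
  console.log(toTitleCase(tag), grouped[tag].map((s) => s.id));
}
```

Each key of the returned object corresponds to one tag page, and its array is the list of snippets that page would link to.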

Deployment

The last step was deploying it, and here I had multiple options. I chose Netlify as I host most of my sites there. Webflow supports Astro apps as well, but they are tied to existing web sites, not really standalone. Even though my GitHub repo has a bunch of Astro crap in it, I used the Netlify CLI to connect it and set it up. Netlify supports dynamic as well as static Astro apps, and you should check their docs for more information on that, but for me this was literally about a 5 minute process at most. "It just worked" which is the best thing a dev can say about something. As I shared above, you can browse the site here, https://css-snippets.netlify.app/, and if you want to see all of the source, you can find it here: https://github.com/cfjedimaster/astro-tests/tree/main/css-snippets

As always, let me know what you think, and I've got my next little Astro site planned already! :)


Rectangle-Pilling


Besides “slopsquatting”, another tech term I ran into recently, this time describing a phenomenon in the AR/VR world, is “rectangle-pilling”. My colleague Tony brought this gem back from SIGGRAPH. It identifies a tendency among designers to repeatedly recreate 2D design metaphors in spatial computing scenarios. For instance, you’ll notice that while Apple is exporting its glass visual metaphor from the AVP to its other devices (yay!) it still emphasizes window layouts on the AVP. Originally this seemed like a good way to help users of other Apple devices quickly learn how to use the AVP, but instead of being a bridge, it seems now to be sticking around as a core element of their spatial design language. As Android XR rolls out, it is rolling out with Google’s “Material Design 3” design language, which is also a flat-design metaphor.
The trend is commonly known as 2D-in-3D. The term “rectangle-pilling” describes the way in which designers who believe in promoting 2D-in-3D currently dominate AR/VR companies, shoving out those who point out that instead of using old media metaphors, it would be better to explore “true” 3D design.
How did the 2D-in-3D crowd come to dominate? To answer this, it is worth pointing out that “digital design” became a big thing in the 2000s as people with artistic leanings found they could earn exciting livings as web developers. Digital design got its curriculum from print design and created a culture in which all one’s knowledge and status came from understanding how fonts, borders, and images fit onto a rectangular space, and applying the golden ratio at every step. In 2007 there was a brief movement toward skeuomorphic design with the rise of smartphones, but this was quickly reversed and “flat design” ascended again.
Architects, on the other hand, are natural spatial designers. They think in 3D and are constantly converting 3D spaces into 2D instructions and back again.
But when AR/VR companies start hiring designers — the sorts of people who other designers feel have design chops — they go to the pool of digital design people rather than the pool of architects. And the design culture in these companies will end up reinforcing this notion that 2D-in-3D is the right way to do things.
Which leaves people who natively think in 3D feeling like they’ve just been “rectangle-pilled”.
There are, of course, good reasons to have 2D elements in spatial computing, like legibility when surfacing information. But for people who feel they’ve gotten “rectangle-pilled”, there is the sense that 2D-in-3D has become a crutch rather than a bridge to something new.
Potentially, constantly seeing 2D design elements reinforced in VR and AR makes people wonder what the point of VR and AR really is. Do I really need to read a 2D webpage in my VR headset? Can’t I just do that on my tablet?


124: Every Electric Vehicle Coming To The US In 2026 For The First Time

In this episode:
• We talk about every electric vehicle coming to the US in 2026 for the first time.




Download audio: https://dts.podtrac.com/redirect.mp3/audioboom.com/posts/8825764.mp3?modified=1767407053&sid=5141110&source=rss

IoT Coffee Talk: Episode 294 - Unlikely Predictions (The Merry New Year 26' Edition)

From: Iot Coffee Talk
Duration: 1:07:50
Views: 2

Welcome to IoT Coffee Talk, where hype comes to die a terrible death. We have a fireside chat about all things #IoT over a cup of coffee or two with some of the industry's leading business minds, thought leaders and technologists in a totally unscripted, organic format.

This week Rob, Dimitri, and Leonard jump on Web3 to host a discussion about:

🎶 🎙️ BAD KARAOKE! 🎸 🥁 "Auld Lang Syne" (not so great interpretation by Leonard Lee)
🐣 Ethics, conflicts of interest, and insider trading in the 80's
🐣 The number two greatest Wall Street movie in all of Hollywood's history!
🐣 How did IoT Coffee Talk outlive the Covid Pandemic and not become a victim of it?
🐣 Why are relational databases and mainframes still really fast and great for a lot of things?
🐣 We think digital twin will be big in 2026. How do we not hype it up and go sideways again?
🐣 Why are PL/SQL and Transact-SQL the agentic AI of the SQL era?
🐣 We figured out why 6G ISAC (Integrated Sensing and Communications) will be important, and it has to do with digital twin.
🐣 Rob gives us the history of Web software and development.
🐣 Why the financial services industry drove the lasting innovation out of the Dotcom collapse, not eCommerce.
🐣 Why crypto is not a store of value. It is a money laundering exchange.
🐣 Hollywood will green light "Her 2: The Reckoning" and "Terminator 7: Self-inflicted Human Extinction"
🐣 Rob, Dimitri, and Leonard give their unlikely predictions for 2026!

It's a great episode. Grab an extraordinarily expensive latte at your local coffee shop and check out the whole thing. You will get all you need to survive another week in the world of IoT and greater tech!

Tune in! Like! Share! Comment and share your thoughts on IoT Coffee Talk, the greatest weekly assembly of Onalytica and CBT tech and IoT influencers on the planet!!

If you are interested in sponsoring an episode, please contact Stephanie Atkinson at Elevate Communities. Just make a minimally required donation to www.elevatecommunities.org and you can jump on and hang with the gang and amplify your brand on one of the top IoT/Tech podcasts in the known metaverse!!!

Take IoT Coffee Talk on the road with you on your favorite podcast platform. Go to IoT Coffee Talk on Buzzsprout, like, subscribe, and share: https://lnkd.in/gyuhNZ62


PPP 488 | How to Be a Less Terrible Boss, with Joel Hilchey


Summary

In this episode, Andy talks with Joel Hilchey, speaker, facilitator, and author of The 6½ Habits of Highly Defective Bosses. Joel brings humor, honesty, and a refreshing amount of grace to a topic many leaders quietly struggle with: becoming a boss without training, preparation, or a clear roadmap.

Andy and Joel explore what it really means to be an "accidental boss" and why most bad bosses are not bad people. They unpack the four quadrants every leader must balance: tasks vs. people and short-term vs. long-term, and why focusing only on tasks can quietly erode trust and engagement. You'll hear practical ideas for avoiding mediocrity mongering, removing everyday hassles that drain teams, and providing clarity instead of whiplash leadership.

The conversation also touches on why aiming to be "less terrible" is a surprisingly powerful leadership goal, how recognition can become a force multiplier, and why lessons from leadership often show up at home as well.

If you're leading projects or people and want practical, human-centered ways to become a better boss one step at a time, this episode is for you!

Sound Bites

  • "Most bad bosses are actually good people with bad ideas."
  • "If you focus only on tasks, people will hate working for you."
  • "People don't expect perfection from their boss, but they do expect effort."
  • "Recognition is one of the highest leverage tools a leader has."
  • "The essence of strategy is saying no."
  • "Be a lighthouse for your team, not a disco ball."
  • "If you notice yourself getting frustrated that people are doing stuff that's off task or that feels off task to you, like why is this person taking time to do that? That's on you as the leader to say, oh, I must not have made this strategy clear."
  • "You can spend the money without asking, but you must tell me you spent it next time we meet."

Chapters

  • 00:00 Introduction
  • 02:08 Start of Interview
  • 02:20 Becoming an Accidental Boss
  • 07:10 The Four Leadership Quadrants
  • 12:10 Warning Signs You Are Neglecting People
  • 15:15 When Task Focus Goes Too Far
  • 21:24 Mediocrity Mongering and Good Enough Work
  • 25:47 The Value of a Crappy First Draft
  • 30:00 Removing Hassles from Team Work
  • 35:30 Lighthouse vs. Disco Ball Leadership
  • 39:40 Why Being 'Less Terrible' Matters
  • 45:40 Applying Leadership Lessons at Home
  • 48:31 End of Interview
  • 49:15 Andy Comments After the Interview
  • 52:38 Outtakes

Learn More

You can learn more about Joel and his work at JoelHilchey.com. Make sure to try the complimentary assessment Joel refers to in the interview.

For more learning on this topic, check out:

  • Episode 468 with James Turk. It's a practical discussion about what to do when you are suddenly in charge.
  • Episode 467 with Sabina Nawaz, former executive coach to Bill Gates, sharing insights on what no one usually tells you about becoming the boss.
  • Episode 419 with Molly McGrath. Her book focuses on fixing your boss, but it almost always inspires listeners to become better leaders themselves.

Level Up Your AI Skills

During the episode, Andy mentioned our AI Made Simple class. Join listeners from around the world who are learning how to prepare for an AI-infused future.

Just go to ai.PeopleAndProjectsPodcast.com. Thanks!

Pass the PMP Exam This Year

If you or someone you know is thinking about getting PMP certified, we've put together a helpful guide called The 5 Best Resources to Help You Pass the PMP Exam on Your First Try. We've helped thousands of people earn their certification, and we'd love to help you too. It's totally free, and it's a great way to get a head start.

Just go to 5BestResources.PeopleAndProjectsPodcast.com to grab your copy. I'd love to help you get your PMP this year!

Join Us for LEAD52

I know you want to be a more confident leader, that's why you listen to this podcast. LEAD52 is a global community of people like you who are committed to transforming their ability to lead and deliver. It's 52 weeks of leadership learning, delivered right to your inbox, taking less than 5 minutes a week. And it's all for free.

Learn more and sign up at GetLEAD52.com. Thanks!

Thank you for joining me for this episode of The People and Projects Podcast!

Talent Triangle: Power Skills

Topics: Leadership, People Management, Accidental Managers, Team Culture, Recognition, Project Leadership, Manager Development, Communication, Prioritization, Continuous Improvement

The following music was used for this episode:

Music: Brooklyn Nights by Tim Kulig
License (CC BY 4.0): https://filmmusic.io/standard-license

Music: Tuesday by Sascha Ende
License (CC BY 4.0): https://filmmusic.io/standard-license





Download audio: https://traffic.libsyn.com/secure/peopleandprojectspodcast/488-JoelHilchey.mp3?dest-id=107017