Abstract Link to heading
For machine learning workloads, a single crashed node can stall an entire distributed job. We run a very large AWS EKS cluster that hosts these distributed workloads. This post is about investigating an issue where nodes would seemingly go NotReady for several minutes without a clear cause, causing job failures.
My first thought was DNS, since I saw timeout errors in the logs. That turned out to be a red herring: all networking on the node was down, not just DNS. The root cause was the Linux OOM killer printing many thousands of process names to a slow serial console while holding a kernel spinlock with interrupts disabled. This blocked network interrupt processing on the same CPU for minutes, causing the network driver to time out and reset.
The Initial Problem Link to heading
We started getting alerts that nodes in our cluster were entering a NotReady state. The strange part was that the nodes would come back after a few minutes. They’d go NotReady, jobs would crash, and then 3-4 minutes later the same node would be Ready again. This pattern repeated across multiple nodes over several days.
Relative to total capacity the incidents were rare, but even rare failures have a large impact when a single one stalls a job spanning many nodes. Even more frustratingly, by the time I inspected any of the problematic nodes, they already appeared healthy again.
First Clue: DNS Timeouts Link to heading
My first thought was to look at the kubelet logs on one of the affected nodes. The logs showed DNS resolution timeouts:
kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
I’ve seen DNS issues before in large clusters. The VPC DNS resolver at 10.10.0.2 is served through the same local proxy path as link-local services, and traffic to it is subject to per-interface packet rate limits. My initial theory was that we were hitting those limits and DNS queries were getting dropped.
This seemed plausible. We run large workloads with lots of network activity, and I’d debugged similar issues before where excessive traffic to link-local addresses caused problems. UDP timeouts are a classic symptom of packet drops since there’s no retry mechanism.
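If I had wanted to pin that theory down, the ENA driver exposes per-interface allowance counters through ethtool (the counter names below assume a reasonably recent ENA driver; this check wasn’t part of the original investigation):
$ ethtool -S eth0 | grep allowance_exceeded
A climbing linklocal_allowance_exceeded counter means packets to link-local services (DNS, instance metadata, time sync) are being shaped; if it stays flat, the rate-limit theory doesn’t hold.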
I reduced some of our DNS traffic, told myself good job, and went to work on something else. Unfortunately, the problem persisted :(
Going Back to the Logs: Network is Completely Down Link to heading
After the issue happened again, I went back to the logs more carefully. I ran journalctl -u kubelet and looked at the timestamps of all the errors. They were all clustered in a tight window:
Sep 08 09:20:24 kernel: kubelet: Error updating node status, will retry
err="error getting node: Get https://abc123def456.gr7.us-region-1.eks.amazonaws.com:443:
read tcp 10.10.0.10:54321->10.10.1.12:443: connection timed out"
Sep 08 09:20:44 kernel: kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
I noticed the kubelet errors started around 09:20 and disappeared around 09:24.
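To line the kubelet errors up against kernel events, it helps to pull both logs for just that window. Something along these lines works (the exact flags are illustrative; fill in the date as needed):
$ journalctl -u kubelet --since "09:19:00" --until "09:25:00" --no-pager
$ journalctl -k --since "09:19:00" --until "09:25:00" --no-pager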
Finding the ENA Driver Timeouts Link to heading
I also checked dmesg -T to see if there were any kernel-level network errors during that window:
Sep 08 09:19:41 kernel: ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
14028 msecs since last interrupt, 14304 msecs since last napi execution, napi scheduled: 1
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 502.
23552 msecs since last interrupt, 23828 msecs since last napi execution, napi scheduled: 1
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: Lost TX completions are above the threshold (171 > 128).
Completion transmission timeout: 5000 (msec). 23552 msecs since last interrupt
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: Resetting the device
...
Sep 08 09:23:33 kernel: ena 0000:47:00.0: Device reset completed successfully
The ENA (Elastic Network Adapter) driver was complaining that it hadn’t received an interrupt, and eventually it gave up and reset the device. The reset itself should be quick, but as we’ll see, it couldn’t fully complete until the CPU handling this queue’s interrupts (CPU 19) was unblocked, more than three minutes later.
Network drivers rely on interrupts to know when packets have been transmitted or received. Without interrupts, the driver can’t acknowledge transmitted packets, the TX queue fills up, and eventually the network stops functioning. The ENA driver detects this condition and performs a reset, but the reset can’t fix the underlying problem: CPU 19 continues blocking interrupts for over 3 more minutes while dumping output to the serial console.
Two critical details come out of these logs:
- The driver reports about 14 seconds (14304 msecs since the last napi run) without interrupt processing. The dmesg -T wall-clock timestamps don’t line up exactly with the journal’s, but working backwards, the stall began around 09:19:54.
- The ENA device reset completes right before the kubelet recovers.
The Parallel Clue: Kernel Dumps and Stack Traces Link to heading
Now that I knew the issue started at 09:19:54, I looked more carefully at what was happening at that exact time. I found this:
Sep 08 09:19:54 kernel: exe invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=800
Sep 08 09:19:54 kernel: CPU: 19 PID: 219191 Comm: exe Tainted: G C OE 5.xx.xxx-123.123
Sep 08 09:19:54 kernel: Call Trace:
Sep 08 09:19:54 kernel: dump_stack+0x57/0x70
Sep 08 09:19:54 kernel: dump_header+0x4a/0x1f4
Sep 08 09:19:54 kernel: oom_kill_process.cold+0xb/0x10
The Linux OOM (Out-of-Memory) killer was invoked. A container hit its memory limit and the kernel needed to kill processes to free up memory. This happens occasionally and isn’t ideal, but it shouldn’t break the entire node.
But then I noticed something odd. Right after the OOM killer message, there were many thousands of lines like this:
Sep 08 09:19:54 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 08 09:19:54 kernel: [ 189699] 0 189699 318840 6209 167936 0 800 p1
Sep 08 09:19:54 kernel: [ 189704] 0 189704 96124552 12219715 231645184 0 800 p1
Sep 08 09:19:54 kernel: [ 189743] 0 189743 4114 6 20480 0 800 p1
...
(many thousands more lines)
The kernel was dumping information about every single process in the container, and there were many thousands of them.
It was clear the OOM dump was closely involved, but I couldn’t yet tie it to why the network driver would need to reset.
The CPU 19 Connection Link to heading
One clue was the CPU: 19 PID: 219191 line in the stack trace above: the OOM killer ran on CPU 19. Network drivers rely on CPU interrupts to handle packets, so I wondered whether there was a connection.
CPU 19 handling both serial console output (with spinlock) and network interrupts.
Looking back at the ENA driver logs, I noticed they mentioned qid 1 (queue ID 1):
ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
Network queues are named eth0-Tx-Rx-0, eth0-Tx-Rx-1, etc. So qid 1 corresponds to eth0-Tx-Rx-1. I checked which CPU was handling interrupts for all the network queues:
$ cat /proc/interrupts | head -1 && cat /proc/interrupts | grep 'eth0-Tx-Rx' | head -4
CPU0 CPU1 CPU2 ... CPU19 CPU20 ...
1794: 0 0 0 ... 0 23605613 ... eth0-Tx-Rx-0
1795: 0 0 0 ... 22511345 0 ... eth0-Tx-Rx-1
1796: 0 0 0 ... 0 0 ... eth0-Tx-Rx-2
1797: 0 0 0 ... 0 0 ... eth0-Tx-Rx-3
IRQ 1795 (eth0-Tx-Rx-1, the queue having problems) had over 22 million interrupts on CPU 19 since boot and zero on every other CPU, so its interrupts were effectively pinned to CPU 19. I verified the affinity:
$ cat /proc/irq/1795/smp_affinity_list
19
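To see the full queue-to-CPU mapping at a glance, a small loop over /proc/interrupts and /proc/irq does the job (a convenience check, not something from the original session):
$ for irq in $(grep eth0-Tx-Rx /proc/interrupts | awk '{print $1}' | tr -d ':'); do
>   echo "IRQ $irq -> CPU $(cat /proc/irq/$irq/smp_affinity_list)"
> done
IRQ 1794 -> CPU 20
IRQ 1795 -> CPU 19
...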
This was the connection. CPU 19 was responsible for both:
- Dumping many thousands of process lines to the kernel log during the OOM event
- Processing network interrupts for queue 1 of the ENA driver
If CPU 19 was busy with the OOM dump and couldn’t process interrupts, that would explain why the network froze.
Kernel Locks and Interrupt Processing Link to heading
Initially I didn’t think it mattered that CPU 19 was busy: the kernel uses preemptive multitasking, so it should share CPU time between tasks. Unfortunately, preemption doesn’t help here, because while the kernel is writing to the console it disables interrupts on that CPU.
The kernel’s printk() function (which writes log messages) has to work in any context, even during kernel panics or crashes. To guarantee this, it takes a spinlock with interrupts disabled while writing to all configured consoles:
// Simplified from kernel/printk/printk.c (v5.10)
// See: https://github.com/torvalds/linux/blob/v5.10/kernel/printk/printk.c
void console_unlock(void)
{
    // The real function is more involved, but the key behavior is:
    // it drains the log buffer one record at a time, and for each record
    // it disables interrupts on this CPU, writes the record to every
    // registered console (including slow serial ports), and only
    // re-enables interrupts afterwards. While a write is in progress the
    // CPU cannot be preempted and cannot service hardware interrupts.
    for (;;) {                                // repeats until the buffer is drained
        printk_safe_enter_irqsave(flags);     // disables interrupts on this CPU
        for_each_console(con)
            con->write(con, text, len);       // write the record to each console
        printk_safe_exit_irqrestore(flags);   // re-enables interrupts
    }
}
While interrupts are disabled, the CPU can’t be preempted. It can’t handle hardware interrupts. It can’t do anything else. It’s completely dedicated to writing that console output.
Normally this isn’t a problem because console output is intended to be small. Writing to an in-memory ring buffer (dmesg) takes microseconds. But writing to a serial console is different. Serial consoles can be slow.
If you’re writing many thousands of lines to the serial console, and each line is ~145 characters, you can hold that spinlock (with interrupts disabled) for minutes.
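To get a feel for just how slow, you can time a write to the serial port yourself on a scratch instance (a rough experiment, not from this investigation; it assumes /dev/ttyS0 is configured at 115200 baud and you’re running as root). Pushing 115,200 characters through the port should take roughly 10 seconds:
$ time head -c 115200 /dev/zero > /dev/ttyS0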
Normal interrupt processing vs blocked interrupt processing during OOM dump.
Finding the OOM Connection Link to heading
At this point I knew:
- CPU 19 was handling OOM dumps
- CPU 19 was handling network interrupts
- The kernel disables interrupts while writing to the console
But I still didn’t understand why it would take minutes. The OOM killer message itself is just a few lines. Then I looked more carefully at what the OOM killer was logging:
Sep 08 09:19:54 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 08 09:19:54 kernel: [ 189699] 0 189699 318840 6209 167936 0 800 p1
Sep 08 09:19:54 kernel: [ 189704] 0 189704 96124552 12219715 231645184 0 800 p1
Sep 08 09:19:54 kernel: [ 189743] 0 189743 4114 6 20480 0 800 p1
...
(many thousands more lines)
The kernel was dumping detailed information about every single process in the container. I counted:
$ journalctl 2>/dev/null | grep 'Sep 08 09:2' | grep -c '\[ *[0-9]'
20860
Over 20,000 processes. I checked what controls this behavior:
$ cat /proc/sys/vm/oom_dump_tasks
1
The oom_dump_tasks setting was enabled. When enabled, the kernel dumps information about every process during an OOM event. This is helpful for debugging, but on a system with many thousands of processes, it generates massive console output.
But where was this output going? I checked the console configuration:
$ cat /proc/cmdline | grep -o 'console=[^ ]*'
console=tty0
console=ttyS0,115200n8
And which consoles were registered to receive kernel messages:
$ cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
tty0 -WU (E p ) 4:1
Both consoles have the E flag in the parenthesized flags, meaning both receive printk messages. The ttyS0 serial console also has the C flag, making it the preferred console. This means all those 20,000 lines were being written to both consoles, including the slow serial console.
The Serial Console Bottleneck Link to heading
CPU 19 was dumping 20,000+ lines to the serial console while holding a spinlock with interrupts disabled. I was curious whether that was really enough output to matter, so I checked the actual baud rate:
$ stty -F /dev/ttyS0 -a | head -1
speed 115200 baud; rows 24; columns 80; line = 0;
115200 baud means 115,200 bits per second. With 8N1 serial encoding (8 data bits + 1 start bit + 1 stop bit = 10 bits per character), that’s:
- 115,200 bits/sec ÷ 10 bits/char = 11,520 characters per second
Now for the math:
- Each process line: ~145 characters
- Total processes: 20k
- Total characters: 20k × 145 = ~3,000,000 characters
- Time to transmit: 3,000,000 ÷ 11,520 ~= 260 seconds (over 4 minutes)
This lines up reasonably well with the roughly three and a half minutes of lockout we observed above.
Calculating the serial console bottleneck duration.
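The same estimate as quick shell arithmetic, using the exact line count from the journal (20,860) instead of the rounded 20k:
$ lines=20860; chars_per_line=145; chars_per_sec=$((115200 / 10))
$ echo "$(( lines * chars_per_line / chars_per_sec )) seconds"
262 seconds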
What Kubernetes Saw Link to heading
While CPU 19 was blocked writing to the serial console, the node had no network connectivity. From Kubernetes' perspective, this looked like a node failure.
Node state transitions from the Kubernetes control plane’s perspective.
The kubelet is responsible for maintaining the node’s presence in the cluster. It does this by:
- Sending heartbeats to update the node’s status
- Renewing its lease with the API server
Both of these require network connectivity. When the network froze, the kubelet couldn’t do either:
Sep 08 09:20:24 kernel: kubelet: Error updating node status, will retry
err="error getting node: Get https://abc123def456.gr7.us-region-1.eks.amazonaws.com:443:
read tcp 10.10.0.10:54321->10.10.1.12:443: connection timed out"
Sep 08 09:20:44 kernel: kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
After enough failed heartbeats, the Kubernetes control plane marks the node as NotReady. This is exactly what we saw in the original problem: nodes going NotReady for 3-4 minutes, then recovering.
The pods were still technically running on the node, but from the cluster’s perspective the node was unreachable. For distributed workloads that span many nodes, if any single node becomes unreachable, the entire job fails. The work can’t continue without all nodes being able to communicate with each other and the control plane.
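Both signals are visible from the control plane side. Commands along these lines (the node name is a placeholder) show the last heartbeat recorded on the Ready condition and the kubelet’s last lease renewal; during an incident like this, both timestamps simply stop advancing until the network comes back:
$ kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'
$ kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}'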
Putting It All Together Link to heading
Here’s what happened:
Timeline of events from OOM killer invocation to network recovery.
Complete system architecture showing the path from OOM event to network outage.
09:19:54 - A container hit its memory cgroup limit. The process that tried to allocate memory was running on CPU 19. The kernel invoked the OOM killer on that CPU.
Because oom_dump_tasks=1, the kernel started dumping information about all ~20,000 processes to the console.
The kernel’s printk() function acquired a spinlock and disabled interrupts on CPU 19. It began writing to both consoles, including the serial console at 115200 baud.
09:19:41 per the dmesg timestamp (roughly 14 seconds in; the dmesg -T clock doesn’t line up exactly with the journal’s) - The ENA driver noticed that network queue 1 hadn’t processed any interrupts in 14 seconds:
ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
14028 msecs since last interrupt
The OOM dump was still going, CPU 19 was still blocked, and network interrupts still weren’t being processed.
09:19:51 (roughly 24 seconds in) - The ENA driver gave up waiting and decided to reset the network interface:
ena 0000:47:00.0 eth0: Lost TX completions are above the threshold (171 > 128).
ena 0000:47:00.0 eth0: Resetting the device
At this point, the serial console output was still being written. The network was completely down and would remain down until the reset completed. During this time the kubelet couldn’t post node status updates, so the control plane saw the node as down.
09:23:33 (3 minutes and 39 seconds after the OOM event) - The ENA driver reset finally completed:
ena 0000:47:00.0: Device reset completed successfully
This timing is in the same ballpark as our estimate: about 262 seconds to write ~3 million characters at 11,520 characters per second.
During those 3+ minutes, the node couldn’t communicate with the Kubernetes API server. The kubelet couldn’t send heartbeats. The node was marked NotReady, any jobs running on it lost network access, and important pods crashed due to a lack of networking.
The Fix Link to heading
Once I understood the root cause, there were several options to consider:
Option 1: Don’t OOM in the First Place Link to heading
The ideal solution is to avoid hitting memory limits. This means properly sizing container memory limits and monitoring memory usage to catch leaks or unexpected growth before they trigger OOM events. This is always good practice but doesn’t help with unexpected spikes or bugs that cause excessive memory allocation.
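For a quick spot check of how close a container is running to its limit, the numbers are available directly from its cgroup (the paths below assume cgroup v2 and the in-container view; cgroup v1 uses different filenames):
$ cat /sys/fs/cgroup/memory.current
$ cat /sys/fs/cgroup/memory.max
Alerting when memory.current approaches memory.max catches slow leaks before the OOM killer does.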
Option 2: Disable OOM Process Dumps Link to heading
Set vm.oom_dump_tasks=0 to prevent the kernel from dumping process information during OOM events:
# Temporary
echo 0 > /proc/sys/vm/oom_dump_tasks
# Persistent
echo "vm.oom_dump_tasks = 0" >> /etc/sysctl.conf
Trade-offs: You lose debugging information during OOM events. For systems with thousands of processes, this information is rarely useful anyway—the dump is too large to analyze effectively. The OOM kill still happens and is logged, just without the full process list.
Option 3: Disable or Reduce Console Output Link to heading
Remove the serial console from the kernel command line or reduce the console log level:
# Remove console=ttyS0,115200n8 from kernel command line
# OR
# Set console log level to only show critical messages
loglevel=3
Trade-offs: Serial console access is crucial for debugging unresponsive instances. Disabling it entirely makes it harder to diagnose kernel panics and other system-level failures. Reducing the log level helps but doesn’t eliminate the problem if other kernel operations generate large amounts of console output.
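A softer variant is to lower the console log level at runtime instead of touching the kernel command line. The first value of kernel.printk is the console loglevel (the current values shown below are illustrative defaults). The per-process OOM dump lines are informational-level messages, so a console loglevel of 3 keeps them off the serial console while still recording them in the ring buffer and journal:
$ cat /proc/sys/kernel/printk
7       4       1       7
$ sysctl -w kernel.printk="3 4 1 3"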
Related Issues and References Link to heading
Linux Kernel Documentation Link to heading
- Linux Serial Console - Serial console configuration and behavior
- The kernel’s command-line parameters - Including console= parameter
- Documentation for /proc/sys/vm/ - Including oom_dump_tasks
- The /proc Filesystem - Including /proc/consoles
- Console driver interface - Linux kernel printk.c source code
- 8-N-1 Serial Encoding - Serial communication frame structure
Kernel Mailing List Discussions Link to heading
- printk: Avoid softlockups in console_unlock() - Jan Kara’s patch addressing console_unlock() blocking
- Interrupts disabled for too long in printk - Discussion of interrupt blocking during console output
AWS ENA Driver Link to heading
- Troubleshoot the ENA kernel driver on Linux - Official AWS ENA troubleshooting guide
- ENA Driver GitHub Repository - AWS ENA driver source code and issues
- ENA Linux Best Practices - Performance tuning and configuration
Kubernetes Documentation Link to heading
- Nodes - Node lifecycle and status
- Node Status - Node heartbeat mechanism
- Leases - Kubelet lease renewal for heartbeats
- Taints and Tolerations - Node taints and pod eviction behavior
Related Discussions Link to heading
- printk interrupt disabling and locking - Stack Overflow discussion on why printk disables interrupts
- The Impact of Non-Blocking Consoles on Linux Boot Time - Analysis of serial console performance impact