Abstract Link to heading
For machine learning workloads, a single crashed node can stall an entire distributed job. We run a very large AWS EKS cluster that hosts these distributed workloads. This post is about investigating an issue where nodes would seemingly go NotReady for several minutes without a clear cause, causing job failures.
My first thought was DNS, since I saw timeout errors in the logs. That turned out to be a red herring: all networking on the node was down, not just DNS. The root cause was the Linux OOM killer printing many thousands of process names to a slow serial console while holding a kernel spinlock with interrupts disabled. This blocked network interrupt processing on the same CPU for minutes, causing the network driver to time out and reset.
The Initial Problem Link to heading
We started getting alerts that nodes in our cluster were entering a NotReady state. The strange part was that the nodes would come back after a few minutes. They’d go NotReady, jobs would crash, and then 3-4 minutes later the same node would be Ready again. This pattern repeated across multiple nodes over several days.
Relative to total capacity the incidents were rare, but even rare failures have a large impact when a single one stalls a job spanning many nodes. Even more frustratingly, by the time I inspected any of the problematic nodes, they already appeared healthy again.
First Clue: DNS Timeouts Link to heading
My first thought was to look at the kubelet logs on one of the affected nodes. The logs showed DNS resolution timeouts:
kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
I’ve seen DNS issues before in large clusters. The VPC DNS resolver at 10.10.0.2 is served through the same local proxy path as link-local services, and traffic to it is subject to per-interface packet rate limits. My initial theory was that we were hitting those limits and DNS queries were getting dropped.
This seemed plausible. We run large workloads with lots of network activity, and I’d debugged similar issues before where excessive traffic to link-local addresses caused problems. UDP timeouts are a classic symptom of packet drops since there’s no retry mechanism.
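If I had wanted to pin that theory down, the ENA driver exposes per-interface allowance counters through ethtool (the counter names below assume a reasonably recent ENA driver; this check wasn’t part of the original investigation):
$ ethtool -S eth0 | grep allowance_exceeded
A climbing linklocal_allowance_exceeded counter means packets to link-local services (DNS, instance metadata, time sync) are being shaped; if it stays flat, the rate-limit theory doesn’t hold.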
I reduced some of our DNS traffic, told myself good job, and went to work on something else. Unfortunately, the problem persisted :(
Going Back to the Logs: Network is Completely Down Link to heading
After the issue happened again, I went back to the logs more carefully. I ran journalctl -u kubelet and looked at the timestamps of all the errors. They were all clustered in a tight window:
Sep 08 09:20:24 kernel: kubelet: Error updating node status, will retry
err="error getting node: Get https://abc123def456.gr7.us-region-1.eks.amazonaws.com:443:
read tcp 10.10.0.10:54321->10.10.1.12:443: connection timed out"
Sep 08 09:20:44 kernel: kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
I noticed the kubelet errors started around 09:20 and disappeared around 09:24.
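To line the kubelet errors up against kernel events, it helps to pull both logs for just that window. Something along these lines works (the exact flags are illustrative; fill in the date as needed):
$ journalctl -u kubelet --since "09:19:00" --until "09:25:00" --no-pager
$ journalctl -k --since "09:19:00" --until "09:25:00" --no-pager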
Finding the ENA Driver Timeouts Link to heading
I also checked dmesg -T to see if there were any kernel-level network errors during that window:
Sep 08 09:19:41 kernel: ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
14028 msecs since last interrupt, 14304 msecs since last napi execution, napi scheduled: 1
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 502.
23552 msecs since last interrupt, 23828 msecs since last napi execution, napi scheduled: 1
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: Lost TX completions are above the threshold (171 > 128).
Completion transmission timeout: 5000 (msec). 23552 msecs since last interrupt
...
Sep 08 09:19:51 kernel: ena 0000:47:00.0 eth0: Resetting the device
...
Sep 08 09:23:33 kernel: ena 0000:47:00.0: Device reset completed successfully
The ENA (Elastic Network Adapter) driver was complaining that it hadn’t received an interrupt, and eventually it gave up and reset the device. The reset itself should be quick, but as we’ll see, it couldn’t fully complete until the CPU handling this queue’s interrupts (CPU 19) was unblocked, more than three minutes later.
Network drivers rely on interrupts to know when packets have been transmitted or received. Without interrupts, the driver can’t acknowledge transmitted packets, the TX queue fills up, and eventually the network stops functioning. The ENA driver detects this condition and performs a reset, but the reset can’t fix the underlying problem: CPU 19 continues blocking interrupts for over 3 more minutes while dumping output to the serial console.
Two critical details come out of these logs:
- The driver reports about 14 seconds (14304 msecs since the last napi run) without interrupt processing. The dmesg -T wall-clock timestamps don’t line up exactly with the journal’s, but working backwards, the stall began around 09:19:54.
- The ENA device reset completes right before the kubelet recovers.
The Parallel Clue: Kernel Dumps and Stack Traces Link to heading
Now that I knew the issue started at 09:19:54, I looked more carefully at what was happening at that exact time. I found this:
Sep 08 09:19:54 kernel: exe invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=800
Sep 08 09:19:54 kernel: CPU: 19 PID: 219191 Comm: exe Tainted: G C OE 5.xx.xxx-123.123
Sep 08 09:19:54 kernel: Call Trace:
Sep 08 09:19:54 kernel: dump_stack+0x57/0x70
Sep 08 09:19:54 kernel: dump_header+0x4a/0x1f4
Sep 08 09:19:54 kernel: oom_kill_process.cold+0xb/0x10
The Linux OOM (Out-of-Memory) killer was invoked. A container hit its memory limit and the kernel needed to kill processes to free up memory. This happens occasionally and isn’t ideal, but it shouldn’t break the entire node.
But then I noticed something odd. Right after the OOM killer message, there were many thousands of lines like this:
Sep 08 09:19:54 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 08 09:19:54 kernel: [ 189699] 0 189699 318840 6209 167936 0 800 p1
Sep 08 09:19:54 kernel: [ 189704] 0 189704 96124552 12219715 231645184 0 800 p1
Sep 08 09:19:54 kernel: [ 189743] 0 189743 4114 6 20480 0 800 p1
...
(many thousands more lines)
The kernel was dumping information about every single process in the container, and there were many thousands of them.
It was clear the OOM dump was closely involved, but I couldn’t yet tie it to why the network driver would need to reset.
The CPU 19 Connection Link to heading
One clue was the CPU: 19 PID: 219191 line in the stack trace above: the OOM killer ran on CPU 19. Network drivers rely on CPU interrupts to handle packets, so I wondered whether there was a connection.
CPU 19 handling both serial console output (with spinlock) and network interrupts.
Looking back at the ENA driver logs, I noticed they mentioned qid 1 (queue ID 1):
ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
Network queues are named eth0-Tx-Rx-0, eth0-Tx-Rx-1, etc. So qid 1 corresponds to eth0-Tx-Rx-1. I checked which CPU was handling interrupts for all the network queues:
$ cat /proc/interrupts | head -1 && cat /proc/interrupts | grep 'eth0-Tx-Rx' | head -4
CPU0 CPU1 CPU2 ... CPU19 CPU20 ...
1794: 0 0 0 ... 0 23605613 ... eth0-Tx-Rx-0
1795: 0 0 0 ... 22511345 0 ... eth0-Tx-Rx-1
1796: 0 0 0 ... 0 0 ... eth0-Tx-Rx-2
1797: 0 0 0 ... 0 0 ... eth0-Tx-Rx-3
IRQ 1795 (eth0-Tx-Rx-1, the queue having problems) had over 22 million interrupts on CPU 19 since boot and zero on every other CPU, so its interrupts were effectively pinned to CPU 19. I verified the affinity:
$ cat /proc/irq/1795/smp_affinity_list
19
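To see the full queue-to-CPU mapping at a glance, a small loop over /proc/interrupts and /proc/irq does the job (a convenience check, not something from the original session):
$ for irq in $(grep eth0-Tx-Rx /proc/interrupts | awk '{print $1}' | tr -d ':'); do
>   echo "IRQ $irq -> CPU $(cat /proc/irq/$irq/smp_affinity_list)"
> done
IRQ 1794 -> CPU 20
IRQ 1795 -> CPU 19
...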
This was the connection. CPU 19 was responsible for both:
- Dumping many thousands of process lines to the kernel log during the OOM event
- Processing network interrupts for queue 1 of the ENA driver
If CPU 19 was busy with the OOM dump and couldn’t process interrupts, that would explain why the network froze.
Kernel Locks and Interrupt Processing Link to heading
Initially I didn’t think it mattered that CPU 19 was busy: the kernel uses preemptive multitasking, so it should share CPU time between tasks. Unfortunately, preemption doesn’t help here, because while the kernel is writing to the console it disables interrupts on that CPU.
The kernel’s printk() function (which writes log messages) has to work in any context, even during kernel panics or crashes. To guarantee this, it takes a spinlock with interrupts disabled while writing to all configured consoles:
// Simplified from kernel/printk/printk.c (v5.10)
// See: https://github.com/torvalds/linux/blob/v5.10/kernel/printk/printk.c
void console_unlock(void)
{
    // The real function is more involved, but the key behavior is:
    // it drains the log buffer one record at a time, and for each record
    // it disables interrupts on this CPU, writes the record to every
    // registered console (including slow serial ports), and only
    // re-enables interrupts afterwards. While a write is in progress the
    // CPU cannot be preempted and cannot service hardware interrupts.
    for (;;) {                                // repeats until the buffer is drained
        printk_safe_enter_irqsave(flags);     // disables interrupts on this CPU
        for_each_console(con)
            con->write(con, text, len);       // write the record to each console
        printk_safe_exit_irqrestore(flags);   // re-enables interrupts
    }
}
While interrupts are disabled, the CPU can’t be preempted. It can’t handle hardware interrupts. It can’t do anything else. It’s completely dedicated to writing that console output.
Normally this isn’t a problem because console output is intended to be small. Writing to an in-memory ring buffer (dmesg) takes microseconds. But writing to a serial console is different. Serial consoles can be slow.
If you’re writing many thousands of lines to the serial console, and each line is ~145 characters, you can hold that spinlock (with interrupts disabled) for minutes.
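To get a feel for just how slow, you can time a write to the serial port yourself on a scratch instance (a rough experiment, not from this investigation; it assumes /dev/ttyS0 is configured at 115200 baud and you’re running as root). Pushing 115,200 characters through the port should take roughly 10 seconds:
$ time head -c 115200 /dev/zero > /dev/ttyS0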
Normal interrupt processing vs blocked interrupt processing during OOM dump.
Finding the OOM Connection Link to heading
At this point I knew:
- CPU 19 was handling OOM dumps
- CPU 19 was handling network interrupts
- The kernel disables interrupts while writing to the console
But I still didn’t understand why it would take minutes. The OOM killer message itself is just a few lines. Then I looked more carefully at what the OOM killer was logging:
Sep 08 09:19:54 kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Sep 08 09:19:54 kernel: [ 189699] 0 189699 318840 6209 167936 0 800 p1
Sep 08 09:19:54 kernel: [ 189704] 0 189704 96124552 12219715 231645184 0 800 p1
Sep 08 09:19:54 kernel: [ 189743] 0 189743 4114 6 20480 0 800 p1
...
(many thousands more lines)
The kernel was dumping detailed information about every single process in the container. I counted:
$ journalctl 2>/dev/null | grep 'Sep 08 09:2' | grep -c '\[ *[0-9]'
20860
Over 20,000 processes. I checked what controls this behavior:
$ cat /proc/sys/vm/oom_dump_tasks
1
The oom_dump_tasks setting was enabled. When enabled, the kernel dumps information about every process during an OOM event. This is helpful for debugging, but on a system with many thousands of processes, it generates massive console output.
But where was this output going? I checked the console configuration:
$ cat /proc/cmdline | grep -o 'console=[^ ]*'
console=tty0
console=ttyS0,115200n8
And which consoles were registered to receive kernel messages:
$ cat /proc/consoles
ttyS0 -W- (EC p a) 4:64
tty0 -WU (E p ) 4:1
Both consoles have the E flag in the parenthesized flags, meaning both receive printk messages. The ttyS0 serial console also has the C flag, making it the preferred console. This means all those 20,000 lines were being written to both consoles, including the slow serial console.
The Serial Console Bottleneck Link to heading
CPU 19 was dumping 20,000+ lines to the serial console while holding a spinlock with interrupts disabled. I was curious whether that was really enough output to matter, so I checked the actual baud rate:
$ stty -F /dev/ttyS0 -a | head -1
speed 115200 baud; rows 24; columns 80; line = 0;
115200 baud means 115,200 bits per second. With 8N1 serial encoding (8 data bits + 1 start bit + 1 stop bit = 10 bits per character), that’s:
- 115,200 bits/sec ÷ 10 bits/char = 11,520 characters per second
Now for the math:
- Each process line: ~145 characters
- Total processes: 20k
- Total characters: 20k × 145 = ~3,000,000 characters
- Time to transmit: 3,000,000 ÷ 11,520 ~= 260 seconds (over 4 minutes)
This lines up reasonably well with the roughly three and a half minutes of lockout we observed above.
Calculating the serial console bottleneck duration.
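The same estimate as quick shell arithmetic, using the exact line count from the journal (20,860) instead of the rounded 20k:
$ lines=20860; chars_per_line=145; chars_per_sec=$((115200 / 10))
$ echo "$(( lines * chars_per_line / chars_per_sec )) seconds"
262 seconds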
What Kubernetes Saw Link to heading
While CPU 19 was blocked writing to the serial console, the node had no network connectivity. From Kubernetes' perspective, this looked like a node failure.
Node state transitions from the Kubernetes control plane’s perspective.
The kubelet is responsible for maintaining the node’s presence in the cluster. It does this by:
- Sending heartbeats to update the node’s status
- Renewing its lease with the API server
Both of these require network connectivity. When the network froze, the kubelet couldn’t do either:
Sep 08 09:20:24 kernel: kubelet: Error updating node status, will retry
err="error getting node: Get https://abc123def456.gr7.us-region-1.eks.amazonaws.com:443:
read tcp 10.10.0.10:54321->10.10.1.12:443: connection timed out"
Sep 08 09:20:44 kernel: kubelet: Failed to update lease
err="Put https://abc123def456.gr7.us-region-1.eks.amazonaws.com: dial tcp:
lookup abc123def456.gr7.us-region-1.eks.amazonaws.com on 10.10.0.2:53:
read udp 10.10.0.10:12345->10.10.0.2:53: i/o timeout"
After enough failed heartbeats, the Kubernetes control plane marks the node as NotReady. This is exactly what we saw in the original problem: nodes going NotReady for 3-4 minutes, then recovering.
The pods were still technically running on the node, but from the cluster’s perspective the node was unreachable. For distributed workloads that span many nodes, if any single node becomes unreachable, the entire job fails. The work can’t continue without all nodes being able to communicate with each other and the control plane.
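Both signals are visible from the control plane side. Commands along these lines (the node name is a placeholder) show the last heartbeat recorded on the Ready condition and the kubelet’s last lease renewal; during an incident like this, both timestamps simply stop advancing until the network comes back:
$ kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'
$ kubectl get lease <node-name> -n kube-node-lease -o jsonpath='{.spec.renewTime}'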
Putting It All Together Link to heading
Here’s what happened:
Timeline of events from OOM killer invocation to network recovery.
Complete system architecture showing the path from OOM event to network outage.
09:19:54 - A container hit its memory cgroup limit. The process that tried to allocate memory was running on CPU 19. The kernel invoked the OOM killer on that CPU.
Because oom_dump_tasks=1, the kernel started dumping information about all ~20,000 processes to the console.
The kernel’s printk() function acquired a spinlock and disabled interrupts on CPU 19. It began writing to both consoles, including the serial console at 115200 baud.
09:19:41 per the dmesg timestamp (roughly 14 seconds in; the dmesg -T clock doesn’t line up exactly with the journal’s) - The ENA driver noticed that network queue 1 hadn’t processed any interrupts in 14 seconds:
ena 0000:47:00.0 eth0: TX hasn't completed, qid 1, index 2.
14028 msecs since last interrupt
The OOM dump was still going, CPU 19 was still blocked, and network interrupts still weren’t being processed.
09:19:51 (roughly 24 seconds in) - The ENA driver gave up waiting and decided to reset the network interface:
ena 0000:47:00.0 eth0: Lost TX completions are above the threshold (171 > 128).
ena 0000:47:00.0 eth0: Resetting the device
At this point, the serial console output was still being written. The network was completely down and would remain down until the reset completed. During this time the kubelet couldn’t post node status updates, so the control plane saw the node as down.
09:23:33 (3 minutes and 39 seconds after the OOM event) - The ENA driver reset finally completed:
ena 0000:47:00.0: Device reset completed successfully
This timing is in the same ballpark as our estimate: about 262 seconds to write ~3 million characters at 11,520 characters per second.
During those 3+ minutes, the node couldn’t communicate with the Kubernetes API server. The kubelet couldn’t send heartbeats. The node was marked NotReady, any jobs running on it lost network access, and important pods crashed due to a lack of networking.
The Fix Link to heading
Once I understood the root cause, there were several options to consider:
Option 1: Don’t OOM in the First Place Link to heading
The ideal solution is to avoid hitting memory limits. This means properly sizing container memory limits and monitoring memory usage to catch leaks or unexpected growth before they trigger OOM events. This is always good practice but doesn’t help with unexpected spikes or bugs that cause excessive memory allocation.
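For a quick spot check of how close a container is running to its limit, the numbers are available directly from its cgroup (the paths below assume cgroup v2 and the in-container view; cgroup v1 uses different filenames):
$ cat /sys/fs/cgroup/memory.current
$ cat /sys/fs/cgroup/memory.max
Alerting when memory.current approaches memory.max catches slow leaks before the OOM killer does.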
Option 2: Disable OOM Process Dumps Link to heading
Set vm.oom_dump_tasks=0 to prevent the kernel from dumping process information during OOM events:
# Temporary
echo 0 > /proc/sys/vm/oom_dump_tasks
# Persistent
echo "vm.oom_dump_tasks = 0" >> /etc/sysctl.conf
Trade-offs: You lose debugging information during OOM events. For systems with thousands of processes, this information is rarely useful anyway—the dump is too large to analyze effectively. The OOM kill still happens and is logged, just without the full process list.
Option 3: Disable or Reduce Console Output Link to heading
Remove the serial console from the kernel command line or reduce the console log level:
# Remove console=ttyS0,115200n8 from kernel command line
# OR
# Set console log level to only show critical messages
loglevel=3
Trade-offs: Serial console access is crucial for debugging unresponsive instances. Disabling it entirely makes it harder to diagnose kernel panics and other system-level failures. Reducing the log level helps but doesn’t eliminate the problem if other kernel operations generate large amounts of console output.
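A softer variant is to lower the console log level at runtime instead of touching the kernel command line. The first value of kernel.printk is the console loglevel (the current values shown below are illustrative defaults). The per-process OOM dump lines are informational-level messages, so a console loglevel of 3 keeps them off the serial console while still recording them in the ring buffer and journal:
$ cat /proc/sys/kernel/printk
7       4       1       7
$ sysctl -w kernel.printk="3 4 1 3"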
Related Issues and References Link to heading
Linux Kernel Documentation Link to heading
- Linux Serial Console - Serial console configuration and behavior
- The kernel’s command-line parameters - Including console= parameter
- Documentation for /proc/sys/vm/ - Including oom_dump_tasks
- The /proc Filesystem - Including /proc/consoles
- Console driver interface - Linux kernel printk.c source code
- 8-N-1 Serial Encoding - Serial communication frame structure
Kernel Mailing List Discussions Link to heading
- printk: Avoid softlockups in console_unlock() - Jan Kara’s patch addressing console_unlock() blocking
- Interrupts disabled for too long in printk - Discussion of interrupt blocking during console output
AWS ENA Driver Link to heading
- Troubleshoot the ENA kernel driver on Linux - Official AWS ENA troubleshooting guide
- ENA Driver GitHub Repository - AWS ENA driver source code and issues
- ENA Linux Best Practices - Performance tuning and configuration
Kubernetes Documentation Link to heading
- Nodes - Node lifecycle and status
- Node Status - Node heartbeat mechanism
- Leases - Kubelet lease renewal for heartbeats
- Taints and Tolerations - Node taints and pod eviction behavior
Related Discussions Link to heading
- printk interrupt disabling and locking - Stack Overflow discussion on why printk disables interrupts
- The Impact of Non-Blocking Consoles on Linux Boot Time - Analysis of serial console performance impact