Abstract

We run a very large AWS EKS cluster with lots of interesting challenges. This post covers a recent investigation into DNS resolution failures that we root-caused to an Elastic Network Interface (ENI) packets-per-second (PPS) limit, which in turn was driven by the combination of sudo defaults and ndots in our cluster DNS.

This post talks about the investigation process and the resolution.

The Initial Bug Report

The initial report came in as CI flakes. The flakes originated inside tensorstore and failed with:

./tensorstore/util/result.h:420: Status not ok: status(): UNAVAILABLE: 
CURL error Could not resolve hostname: Could not resolve host: 
company-bucket.s3.amazonaws.com 

The flakes happened across multiple instances and were not consistent. The curl source code for CURLE_COULDNT_RESOLVE_HOST was pretty generic, so I needed to dig deeper into the actual DNS resolution logs.

I went to the CoreDNS pod logs to see if there was a clue there:

kubectl logs -n kube-system deployment/coredns --tail=1000 | grep -i error

And found this error:

[ERROR] plugin/errors: 2 company-bucket.s3.amazonaws.com. A: read udp 10.1.2.3:44211->10.0.0.2:53: i/o timeout

The first IP address, 10.1.2.3, is the CoreDNS pod trying to resolve a DNS request, and the second IP address, 10.0.0.2, is the Route 53 Resolver for the VPC. From the AWS docs here:

An Amazon VPC connects to a Route 53 Resolver at a VPC+2 IP address. This VPC+2 address connects to a Route 53 Resolver within an Availability Zone.
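To sanity-check that this VPC+2 address really is the resolver, you can query it directly from a node; a minimal check, assuming a VPC CIDR starting at 10.0.0.0 so the resolver sits at 10.0.0.2, and reusing the bucket hostname from the error above:

# Query the VPC+2 Route 53 Resolver directly
dig @10.0.0.2 company-bucket.s3.amazonaws.com +short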

Another set of interesting log lines looked like this:

[ERROR] plugin/errors: 2 ip-10-8-1-3.us-east-2.compute.internal. AAAA:  read udp 10.11.39.79:44211->10.10.0.2:53: i/o timeout

The majority of my failing lookups were for internal hostnames. Also of interest was that many of these were AAAA (IPv6) queries when I usually only need IPv4 answers, so my DNS traffic was effectively doubled (resolving both A and AAAA).
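To get a rough sense of the split, the timed-out A versus AAAA lookups can be counted straight from the CoreDNS error logs; a quick sketch, assuming the same log format as the errors above:

# Count timed-out AAAA vs. A lookups in the CoreDNS error logs (adjust --tail as needed)
kubectl logs -n kube-system deployment/coredns --tail=10000 | grep 'i/o timeout' | grep -c ' AAAA:'
kubectl logs -n kube-system deployment/coredns --tail=10000 | grep 'i/o timeout' | grep -c ' A:'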

AWS support case

At this point, I fed all the logs and information I had into Claude and used it to help open an AWS support case with all the details AWS would need to investigate. The AWS support UI is not very user friendly, and it can be difficult to aggregate all the information needed: if I don’t give the right information, then AWS won’t have what they need to diagnose the issue.

Claude Code is pretty good at this sort of thing and I have helper scripts that take as input:

  • All the logs I’ve collected
  • The bash history of the investigation
  • The slack threads where we talked about the issue

Claude will ingest all this information, create a support case description, inspect it for private or privileged information, and then open a Vim terminal with the support case description, priority, and service account details. From Vim, I can tweak the text before saving and submitting the case to AWS.

I filed the case and AWS responded. One key insight from the support case was:

Regarding the DNS query limits, please observe that today there is a hard limit of 1024 packet per second (PPS) limit to services that use link-local addresses.

AWS gave me a command to run to check if we were hitting the link-local allowance limit:

ethtool -S eth0 | grep linklocal_allowance_exceeded

which did return a non-zero value on one of our affected nodes.
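The ENA driver exposes sibling counters for the other instance-level allowances (bandwidth, PPS, conntrack) as well, so it is worth grabbing all of them while on the node; the same grep, widened:

# All ENA allowance counters; a growing linklocal_allowance_exceeded means packets
# to link-local services (such as the DNS resolver) are being dropped
ethtool -S eth0 | grep allowance_exceeded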

Packet limits

Packet limits are documented here.

There is a 1024 packet per second (PPS) limit to services that use link-local addresses.

And the link-local address is documented here.

Link-local addresses are well-known, non-routable IP addresses. Amazon EC2 uses addresses from the link-local address space to provide services that are accessible only from an EC2 instance.

It’s called link-local because these services “run on the underlying host”.

Putting it all together:

  1. CoreDNS is using UDP to resolve DNS queries
  2. These UDP packets are sent to the link-local address of the Route 53 resolver
  3. Due to the 1024 PPS limit, the packets are dropped
  4. As UDP is a connectionless protocol, the sender does not know that the packets were dropped and times out

Packet Exploration

First, I started capturing the DNS traffic on a node running CoreDNS. I didn’t fully remember the parameters for tcpdump, so I asked Claude Code to help craft the commands for me. I would review the commands and, if they looked reasonable, run them.

# Capture DNS traffic on the CoreDNS pod
tcpdump -i eth0 -w /var/tmp/dns-capture.pcap 'port 53'

Next, I sorted to find the largest offenders.

$ tcpdump -r dns-capture.pcap -nn dst port 53 -v 2>/dev/null | \
    grep -oE "A\? [^ ]+" | cut -d' ' -f2 | \
    sort | uniq -c | sort -rn | head -20

This was the result:

   1800 company-bucket.s3.us-east-2.amazonaws.com.
    586 company-bucket.s3.us-east-2.amazonaws.com.svc.cluster.local.
    573 company-bucket.s3.us-east-2.amazonaws.com.app-namespace.svc.cluster.local.
    548 company-bucket.s3.us-east-2.amazonaws.com.us-east-2.compute.internal.
    504 company-bucket.s3.us-east-2.amazonaws.com.cluster.local.
    # ... normal expected traffic ...
    # ... then the long tail of hostnames ...
     96 ip-10-0-1-10.us-east-2.compute.internal.us-east-2.compute.internal.
     96 ip-10-0-1-10.us-east-2.compute.internal.svc.cluster.local.
     96 ip-10-0-1-10.us-east-2.compute.internal.app-namespace.svc.cluster.local.
     96 ip-10-0-1-10.us-east-2.compute.internal.cluster.local.
     96 ip-10-0-1-10.us-east-2.compute.internal.

The first few lines were normal AWS or internal service traffic that I expected, but below that was a very long tail of hostname-like queries. This raised two questions:

  1. Why was there so much hostname-like traffic?
  2. Why was each hostname-like query appearing 5 times?

Kubernetes DNS and ndots

The hostname long tail was the largest offender, and given the size of our cluster it added up to a significant amount of traffic: a single hostname lookup turned into 5 queries to the VPC DNS resolver. (Actually 10 queries, since each name was resolved for both A and AAAA records.)

Inside an arbitrary pod’s container, I ran cat /etc/resolv.conf to see the resolver configuration:

$ kubectl exec -it deployment/some-app -- cat /etc/resolv.conf
search app-namespace.svc.cluster.local svc.cluster.local cluster.local us-east-2.compute.internal
nameserver 172.20.0.10
options ndots:5

Breaking this apart:

search app-namespace.svc.cluster.local svc.cluster.local cluster.local us-east-2.compute.internal

These are the search domains that Kubernetes sets up for each pod; they allow the pod to resolve services by short name, either in its own namespace or elsewhere in the cluster. The last domain (us-east-2.compute.internal) comes from the node’s own resolver configuration.

nameserver 172.20.0.10

This is the cluster’s service CIDR (172.20.0.0/16) + 10, which is the conventional cluster DNS (CoreDNS) service IP. From the Kubernetes docs:

As a soft convention, some Kubernetes installers assign the 10th IP address from the Service IP range to the DNS service.
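You can confirm which Service that IP belongs to; a quick check, assuming CoreDNS sits behind the conventional kube-dns Service in kube-system:

# The cluster DNS Service IP should match the nameserver line in /etc/resolv.conf
kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'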

The last section, options ndots:5, tells the resolver that any hostname with fewer than 5 dots should be treated as non-fully-qualified and should have the search domains appended. My query amplification occurred because applications weren’t using fully qualified domain names (FQDNs, with a trailing .). Without the trailing dot, the resolver tries each search domain from /etc/resolv.conf in order, appending it to the hostname:

  1. app-namespace.svc.cluster.local
  2. svc.cluster.local
  3. cluster.local
  4. us-east-2.compute.internal

This turns a single lookup of ip-10-0-1-10.us-east-2.compute.internal into 5 queries (4 search domains + 1 as-is):

  1. ip-10-0-1-10.us-east-2.compute.internal.app-namespace.svc.cluster.local
  2. ip-10-0-1-10.us-east-2.compute.internal.svc.cluster.local
  3. ip-10-0-1-10.us-east-2.compute.internal.cluster.local
  4. ip-10-0-1-10.us-east-2.compute.internal.us-east-2.compute.internal
  5. ip-10-0-1-10.us-east-2.compute.internal
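You can watch this amplification from inside an affected pod by resolving the same name with and without a trailing dot while capturing DNS traffic; a rough sketch, assuming getent and tcpdump are available in the image:

# In a second shell inside the pod, watch the fan-out:
#   tcpdump -i any -nn port 53

# Fewer than 5 dots and no trailing dot: every search domain is tried first
getent hosts company-bucket.s3.us-east-2.amazonaws.com

# Trailing dot: the name is treated as fully qualified and sent as-is
getent hosts company-bucket.s3.us-east-2.amazonaws.com.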

Validating cache hits

My initial tcpdump captured all port 53 traffic, but only the cache misses, the queries CoreDNS has to forward upstream, generate link-local traffic. To isolate those, I needed to filter for the source IP of the CoreDNS pod and the destination IP of the VPC DNS resolver.

# CoreDNS upstream queries to VPC DNS (10.10.0.2)
$ tcpdump -r dns-capture.pcap -nn \
    'src host 10.1.2.3 and dst host 10.10.0.2 and dst port 53' \
    -v 2>/dev/null | wc -l

This output was heavily dominated by the long tail of hostname-like queries, which were not cached by CoreDNS.

The source of the traffic

I printed out a tcpdump capture and grepped for some of the hostnames:

tcpdump -r /var/tmp/i-012a34e5678a32dcc.2025-08-01:02:47:53.pcap00 -nn -v 2>/dev/null | \
  grep "compute.internal" | head -10

02:47:53.589855 IP 10.1.0.1.51749 > 10.1.2.3.53: 
  7274+ A? ip-10-1-0-1.us-east-2.compute.internal.us-east-2.compute.internal. (88)
02:47:53.589988 IP 10.1.2.3.54708 > 10.10.0.2.53: 
  20127+ AAAA? ip-10-1-0-1.us-east-2.compute.internal.us-east-2.compute.internal. (88)

These are the two parts of a DNS cache miss:

  1. The first query, 10.1.0.1.51749 > 10.1.2.3.53, is some pod asking to resolve its own hostname (with a search domain appended): ip-10-1-0-1.us-east-2.compute.internal.us-east-2.compute.internal.
  2. The second query, 10.1.2.3.54708 > 10.10.0.2.53, is CoreDNS asking the VPC DNS resolver to resolve that hostname because it wasn’t cached.

Finding the pod itself

I had the host, but didn’t yet have the pod that was making the request. A major clue was in the pattern of queries I saw:

  1. ip-10-0-1-10.us-east-2.compute.internal.app-namespace.svc.cluster.local
  2. ip-10-0-1-10.us-east-2.compute.internal.svc.cluster.local
  3. ip-10-0-1-10.us-east-2.compute.internal.cluster.local
  4. ip-10-0-1-10.us-east-2.compute.internal.us-east-2.compute.internal
  5. ip-10-0-1-10.us-east-2.compute.internal

From this pattern, I could tell that the pod was in the app-namespace namespace. There was also a clue that the pod was using host networking, since it was resolving an EC2-style hostname.

# Filter at the kubectl level for the node
kubectl get pods -n app-namespace --field-selector spec.nodeName=ip-10-1-0-1.us-east-2.compute.internal -o json | \
jq -r '.items[] |
  select(.spec.hostNetwork == true) |
  .metadata.name + " " + .spec.nodeName'

Where in the pod?

Now that I had a candidate pod, I connected to it to verify that it was the source of the traffic.

# Exec into a node-service pod
kubectl exec -it -n app-namespace node-service-xxxxx -- bash

# Run tcpdump to capture DNS traffic
tcpdump -i any -nn port 53 -c 100

Every few seconds, we saw multiple DNS queries for the node’s own hostname:

10.11.240.139.54321 > 10.10.0.2.53: A? ip-10-11-240-139.us-east-2.compute.internal
10.11.240.139.54322 > 10.10.0.2.53: AAAA? ip-10-11-240-139.us-east-2.compute.internal

Unfortunately, the pod’s code was very complex and mostly written by someone else, and there wasn’t a great way to correlate the pod’s output with the tcpdump output. Since the traffic had a pattern (a burst of queries every few seconds), I deduced it was coming from a continuous loop of some kind. I binary searched the code of the pod’s primary loop, early-returning from half of it at a time to narrow down where the traffic was coming from, until I had a minimal change between two versions: one that included all the extra DNS calls and one that did not.

The code

I finally narrowed it down to a few sudo calls:

import subprocess
# ...
def some_function():
    cmd = f"sudo ..."
    result = subprocess.run(cmd, shell=True, capture_output=True)
    # ... 

The pod itself was a system-level administrative pod, so it makes sense that it would need to run commands with sudo, but I didn’t understand why this specific command was generating so much DNS traffic. I ran the same ... command myself on the pod and didn’t see any DNS traffic; however, if I ran sudo ..., I did see DNS traffic. From this, I knew it was sudo-related. My first thought was that something had snuck into whatever subshell sudo was using.
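That experiment is easy to sketch with any command, since the extra DNS traffic follows the sudo invocation rather than the command itself; here using true as a stand-in for the real (elided) command:

# Watch for DNS queries while running the commands
tcpdump -i any -nn port 53 &

true        # plain command: no DNS queries
sudo true   # under sudo: DNS queries appear

kill %1     # stop the capture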

I asked Claude Code what could be happening and it mentioned Defaults !fqdn, but I didn’t understand at the time what that meant, so I instead asked Claude to run the sudo ... command for me and figure out which library calls were making the DNS requests.

I knew ltrace/strace were tools I could use, but I couldn’t fully remember the syntax. Claude Code was helpful here, creating an ltrace command for me:

sudo ltrace -f -e 'getaddrinfo*+gethostby*+gethostname*' sudo ...

My thought going into the process was that the command itself (...) was somehow creating the DNS queries by doing something different when run with sudo, but the output said otherwise and included:

[pid 964637] libsudo_util.so.0->gethostname("ip-10-13-7-193.us-east-2.compute"..., 65) = 0
[pid 964637] sudoers.so->getaddrinfo("ip-10-13-7-193.us-east-2.compute"..., nil, 0x7ffecf8ce6d0, 0x7ffecf8ce6c8) = 0

The sudoers.so->getaddrinfo line is clear evidence that sudo itself, and not the command it was running, was making the DNS requests. From this, I went back to Claude’s original suggestion of Defaults !fqdn.

There are lots of threads on the web about people having issues with slow sudo due to this flag: here and here.

The fix

Turning off this flag was pretty easy. I added this to our container image:

echo 'Defaults !fqdn' | sudo tee /etc/sudoers.d/00-no-fqdn
sudo chmod 440 /etc/sudoers.d/00-no-fqdn
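To double-check the fix, I would validate the drop-in and re-run the earlier ltrace check; a quick sketch (sudo true is just a no-op stand-in for a real command):

# Validate that the sudoers drop-in parses cleanly
visudo -c -f /etc/sudoers.d/00-no-fqdn

# With !fqdn set, sudoers.so should no longer call getaddrinfo on the hostname
sudo ltrace -f -e 'getaddrinfo*+gethostby*' sudo true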
Complete DNS flow: How sudo triggers hostname lookups that cascade through Kubernetes DNS to hit link-local limits, and the solution

The result

sudo commands no longer performed hostname lookups, and uncached DNS queries almost entirely disappeared after this change.

DNS traffic drop after fix - Green line shows uncached DNS queries to VPC resolver, Yellow line shows cached hits served by CoreDNS