It’s always DNS part ∞: tracking down a use-after-free bug in Envoy’s DNS Resolver

TL;DR

  • We found a use-after-free bug in Envoy’s DNS resolver, c-ares (CVE-2025-62408, CVE-2025-67514). 

  • Impact: Remote Denial of Service via process crash. In certain situations, an attacker could exploit a specific sequence of DNS responses to trigger a heap use-after-free and crash the application.

  • Affected: c-ares versions <= 1.34.5.


it always happens in prod, at the worst time

Earlier this year, one of our customers reported a fun one: their Pomerium deployment would crash about 10 seconds after startup, every time, under heavy load. Adding to the fun, this was during a rollout of a production workload onto a new, business-critical cluster. Of course.

Initial logs and stack traces were frustratingly sparse, pointing only to a generic segfault deep inside Envoy.

Was it our code? Envoy? The DNS gods finally collecting on all the shit I’ve talked over the years? (Hint: see the title.)

ingredients of the soup: Pomerium, Envoy, and c-ares

If you’re new here (hi, thanks for stopping by our weblog), first a quick lay of the land:

  • Pomerium is an identity-aware reverse proxy.

  • Pomerium embeds the Envoy proxy as its data plane.

  • Envoy, in turn, uses c-ares for asynchronous DNS resolution.

So if c-ares crashes, Envoy crashes, and by extension Pomerium crashes. 
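
If c-ares is new to you, here’s a minimal, illustrative sketch of the callback-driven pattern it exposes: kick off a lookup, then drive the library’s file descriptors from an event loop until the callback fires. This is our toy example, not Envoy’s code (Envoy wires those descriptors into its own dispatcher rather than calling select() like this does), but the shape is the same.

C
#include <stdio.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <ares.h>

/* Called by c-ares when the lookup completes (successfully or not). */
static void on_result(void *arg, int status, int timeouts,
                      struct ares_addrinfo *result) {
  (void)arg; (void)timeouts;
  if (status != ARES_SUCCESS) {
    printf("lookup failed: %s\n", ares_strerror(status));
    return;
  }
  for (struct ares_addrinfo_node *n = result->nodes; n != NULL; n = n->ai_next)
    printf("got an address (family %d)\n", n->ai_family);
  ares_freeaddrinfo(result);
}

int main(void) {
  ares_channel channel;
  ares_library_init(ARES_LIB_INIT_ALL);
  ares_init(&channel);

  /* Start an asynchronous lookup; the call returns immediately. */
  struct ares_addrinfo_hints hints = {0};
  hints.ai_family = AF_INET;
  ares_getaddrinfo(channel, "example.com", NULL, &hints, on_result, NULL);

  /* Pump c-ares until it has no more work. Envoy does the equivalent by
   * registering the c-ares sockets with its own event loop. */
  for (;;) {
    fd_set readers, writers;
    FD_ZERO(&readers);
    FD_ZERO(&writers);
    int nfds = ares_fds(channel, &readers, &writers);
    if (nfds == 0)
      break;
    struct timeval tv;
    struct timeval *tvp = ares_timeout(channel, NULL, &tv);
    select(nfds, &readers, &writers, NULL, tvp);
    ares_process(channel, &readers, &writers);
  }

  ares_destroy(channel);
  ares_library_cleanup();
  return 0;
}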

deep diving envoy & c-ares

With only "segfault somewhere in Envoy" to go on, we provided the customer a special Envoy build compiled with AddressSanitizer (ASan). Sure enough, ASan immediately screamed about a heap use-after-free error deep in c-ares (ares_process.c), pinpointing the read_answers() / process_answer() code path. The backtrace looked suspiciously similar to CVE-2025-31498.

Interestingly, the crashes only occurred in one specific environment: a Kubernetes cluster using NodeLocal DNSCache as the local DNS resolver, under extremely high query load. Other customer clusters were fine. This hinted that a very particular timing or sequence of DNS events was needed to trigger the fault.

At this point we suspected we had found a new variant of that previously reported use-after-free bug in the DNS resolver.

perfect storm of events & swiss cheese failures

We eventually pinned down the exact sequence of DNS events needed to crash Envoy. It turned out to be a cascade of failures and retries that had to line up just so.

  1. Non-FQDN Query: An application triggers a DNS lookup for a non-fully-qualified domain name (e.g., example.com without the final dot). This is common in Kubernetes clusters where apps rely on search domains.

  2. NXDOMAIN Response: The DNS server returns an NXDOMAIN (non-existent domain) response.

  3. Search Domain Retry: Upon receiving the NXDOMAIN, c-ares automatically attempts to append a search domain and re-run the query (e.g., trying example.com.default.svc.cluster.local). This is normal resolver behavior.

  4. Connection Error: Here’s where things go sideways. The retried query never even gets an answer: it fails with a connection error. In our customer’s case, this likely happened because the NodeLocal DNS cache had an issue or restarted at exactly that moment, refusing the UDP request. The connection error triggers a callback that deletes the server connection; this is the same connection whose reply is still actively being processed.

  5. Response & Crash: c-ares finishes processing the initial answer from step 2 and moves into its cleanup logic. But a critical assumption has been broken: that code assumes the connection stays valid between receiving the answer and finishing processing it. It reaches into the connection object, and that connection was already destroyed in step 4 (see the sketch just below).

C’mon you guys, we just dereferenced freed memory.
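
Boiled down, the failure pattern looks like the toy program below. This is our deliberately simplified illustration of the bug class, not c-ares’s actual code; build it with -fsanitize=address and ASan reports the same kind of heap-use-after-free we saw in production.

C
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for the per-connection state a resolver keeps for a DNS server. */
struct connection {
  int queries_in_flight;
};

/* Step 4: the search-domain retry hits a connection error, and the error
 * handler tears the connection down. */
static void handle_send_error(struct connection *conn) {
  free(conn);
}

/* Steps 2-5: processing the NXDOMAIN answer triggers the retry, the retry
 * fails and frees the connection, yet the caller keeps using it. */
static void process_answer(struct connection *conn) {
  handle_send_error(conn);   /* connection destroyed here */
  conn->queries_in_flight--; /* heap use-after-free: conn is already gone */
}

int main(void) {
  struct connection *conn = calloc(1, sizeof(*conn));
  conn->queries_in_flight = 1;
  process_answer(conn);
  printf("done\n");
  return 0;
}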

PoC || GTFO

To prove to ourselves (and upstream maintainers) what was happening, we wrote a minimal unit test using the c-ares test framework to simulate this exact scenario. It forces the NXDOMAIN + search-domain + connection-refused sequence and checks if we can trigger the bug:

C/C++
TEST_P(MockUDPChannelTestAI, ConnectionRefusedOnSearchDomainRetry) {
  // 1 & 2: First query for "www.google.com" gets NXDOMAIN
  DNSPacket badrsp4;
  badrsp4.set_response().set_aa()  // authoritative answer bit
        .add_question(new DNSQuestion("www.google.com", T_A))
        .set_rcode(NXDOMAIN);      // NXDOMAIN response
  EXPECT_CALL(server_, OnRequest("www.google.com", T_A))
        .WillOnce(SetReplyAndFailSend(&server_, &badrsp4));
        // ^^^ Simulate sending the NXDOMAIN response, then fail to send (ECONNREFUSED)

  // 3 & 5: Second query for "www.google.com.first.com" (search domain appended) will succeed
  DNSPacket goodrsp4;
  goodrsp4.set_response().set_aa()
        .add_question(new DNSQuestion("www.google.com.first.com", T_A))
        .add_answer(new DNSARR("www.google.com.first.com", 0x0100, {0x01,0x02,0x03,0x04}));
  EXPECT_CALL(server_, OnRequest("www.google.com.first.com", T_A))
        .WillOnce(SetReply(&server_, &goodrsp4));
        // ^^^ Simulate a normal successful DNS response for the search-domain query

  // 4: Simulate a connection send failure on the first retry (ECONNREFUSED)
  ares_socket_functions sock_funcs = {0};
  sock_funcs.asendv = ares_sendv_fail;
  ares_set_socket_functions(channel_, &sock_funcs, NULL);

  // Kick off the getaddrinfo lookup (step 1), which triggers the sequence above
  AddrInfoResult result;
  struct ares_addrinfo_hints hints{};
  hints.ai_family = AF_INET;
  ares_getaddrinfo(channel_, "www.google.com", /*servname=*/NULL, &hints,
                   AddrInfoCallback, &result);

  Process();  // Drive the c-ares event loop to process the query/response
}

When we ran this test under ASan, it consistently reproduced the same use-after-free we saw in production.

fix, disclosure, patch, and open-source community

Once we had a reproducible test, writing an interim patch was straightforward.1 We built Pomerium/Envoy with the patched c-ares and gave our customer a custom hotfix build. The crashes stopped immediately, and our patch passed the full c-ares test suite.
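
We won’t reproduce the upstream patch here (just move to a fixed release), but the general defensive idea, sketched against our toy illustration from earlier rather than against c-ares itself, is to defer destruction so an error path can’t free the connection while an answer is still being processed. A reference count is one common way to express that:

C
#include <stdlib.h>

struct connection {
  int refcount;
  int queries_in_flight;
};

static void conn_ref(struct connection *conn) { conn->refcount++; }

static void conn_unref(struct connection *conn) {
  if (--conn->refcount == 0)
    free(conn); /* destruction happens only when the last user lets go */
}

/* The error path now drops its reference instead of freeing unconditionally. */
static void handle_send_error(struct connection *conn) {
  conn_unref(conn);
}

/* The answer-processing path pins the connection for the duration of its
 * work, so the error path can no longer destroy it mid-flight. */
static void process_answer(struct connection *conn) {
  conn_ref(conn);
  handle_send_error(conn);   /* may drop the owner's reference */
  conn->queries_in_flight--; /* still safe: we hold our own reference */
  conn_unref(conn);          /* freed here if the error path ran */
}

int main(void) {
  struct connection *conn = calloc(1, sizeof(*conn));
  conn->refcount = 1; /* owner reference */
  conn->queries_in_flight = 1;
  process_answer(conn);
  return 0;
}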

We reported our findings (+ repro code with patch) to the c-ares maintainers through their security channels. They acted quickly to validate the report. Given that c-ares is a foundational library for Envoy and many other projects, a use-after-free bug is potentially impactful. Fortunately, the practical exploit scenario is narrow—you’d typically need control over a downstream DNS server (or be in a position to manipulate network responses) to orchestrate the specific NXDOMAIN+refused pattern. We haven’t seen evidence of malicious exploitation in the wild, but a memory safety bug in such a widely-used library is definitely something to address promptly.

Our customer was great to work with throughout this process. Huge kudos to them: they were patient, ran instrumented builds for us, and generally collaborated well with us under pressure. After we delivered the one-off patched build that stabilized their cluster, we decided this fix would remain an internal stop-gap until an official upstream release was published.

timeline

  • 2025-08-22: DNS instability observed in customer’s prod env. 

  • 2025-08-26: Hotfix supplied to customer. 

  • 2025-08-26: Reported to c-ares and Envoy with patch & tests.

  • 2025-12-10: Public disclosure.

reflections

This isn’t a fiery "we found a vuln" post, but a case study in how things should work: weird bug appears, vendor/customer collaborate, root cause is found in a deep dependency, and the open source ecosystem gets a patch. Yay open source!

That said, the adventure reinforced a few things:

  • We own our dependencies. If a library we ship crashes, it's our problem. Even if the bug is "in someone else’s code," the blast radius is ours.

  • Difficult bugs can still be isolated. Even crashes requiring "perfect storm" timing can be isolated with the right tools (ASan) and persistence.

  • It’s always DNS. The meme exists for a reason.

A haiku to live by

1 Actually, no: writing the interim patch was subtle, hard, made me regret letting people know I know C pretty well, and resulted in a week of pain and suffering over a patch I wasn’t sure conclusively fixed an edge case of a previously found edge case.
