We found a use-after-free bug in Envoy’s DNS resolver, c-ares (CVE-2025-62408, CVE-2025-67514).
Impact: Remote Denial of Service via process crash. In certain situations, an attacker could exploit a specific sequence of DNS responses to trigger a heap use-after-free and crash the application.
Affected: c-ares versions <= 1.34.5.
Earlier this year, one of our customers reported a fun one: their Pomerium deployment would crash about 10 seconds after startup, every time, under heavy load. Adding to the fun, this was during a rollout of a production workload onto a new, business-critical cluster. Of course.
Initial logs and stack traces were frustratingly sparse, pointing only to a generic segfault deep inside Envoy.
Was it our code? Envoy? The DNS gods finally collecting on all the shit I’ve talked over the years? (Hint: see the title.)
If you’re new here (hi, thanks for stopping by our weblog), first a quick lay of the land:
Pomerium is an identity-aware reverse proxy.
Pomerium embeds the Envoy proxy as its data plane.
Envoy, in turn, uses c-ares for asynchronous DNS resolution.
So if c-ares crashes, Envoy crashes, and by extension Pomerium crashes.
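If you haven’t touched c-ares directly: it’s a callback-driven C library, and the caller is responsible for driving its sockets through an event loop. The sketch below shows roughly what an asynchronous lookup looks like in isolation. The c-ares calls are the real API; the classic select()-based loop and names like on_result are just our illustration, not Envoy’s actual integration:

#include <ares.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <cstdio>
#include <cstring>

// Invoked by c-ares when the lookup finishes (successfully or not).
static void on_result(void* arg, int status, int timeouts,
                      struct ares_addrinfo* result) {
  (void)arg; (void)timeouts;
  if (status != ARES_SUCCESS) {
    std::printf("lookup failed: %s\n", ares_strerror(status));
    return;
  }
  for (auto* n = result->nodes; n != nullptr; n = n->ai_next) {
    std::printf("got an address (family %d)\n", n->ai_family);
  }
  ares_freeaddrinfo(result);
}

int main() {
  ares_library_init(ARES_LIB_INIT_ALL);
  ares_channel channel;
  struct ares_options options;
  std::memset(&options, 0, sizeof(options));
  ares_init_options(&channel, &options, 0);

  struct ares_addrinfo_hints hints{};
  hints.ai_family = AF_INET;
  // Kick off the query; on_result runs later from ares_process().
  ares_getaddrinfo(channel, "example.com", nullptr, &hints, on_result, nullptr);

  // Minimal event loop: wait on c-ares' sockets and let it handle I/O and timeouts.
  for (;;) {
    fd_set readers, writers;
    FD_ZERO(&readers);
    FD_ZERO(&writers);
    int nfds = ares_fds(channel, &readers, &writers);
    if (nfds == 0) break;  // nothing outstanding
    struct timeval tv, *tvp = ares_timeout(channel, nullptr, &tv);
    select(nfds, &readers, &writers, nullptr, tvp);
    ares_process(channel, &readers, &writers);
  }

  ares_destroy(channel);
  ares_library_cleanup();
  return 0;
}

Envoy drives the same machinery from its own event loop; the point is simply that replies, retries, and error callbacks all fire asynchronously, which is where the trouble starts.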
With only "segfault somewhere in Envoy" to go on, we provided the customer a special Envoy build compiled with AddressSanitizer (ASan). Sure enough, ASan immediately screamed about a heap use-after-free error deep in c-ares (ares_process.c), pinpointing the read_answers() / process_answer() code path. The backtrace looked suspiciously similar to CVE-2025-31498.
Interestingly, the crashes only occurred in one specific environment: a Kubernetes cluster using NodeLocal DNSCache as the local DNS resolver, under extremely high query load. Other customer clusters were fine. This hinted that a very particular timing or sequence of DNS events was needed to trigger the fault.
At this point we suspected we had found a new variant of the previously reported use-after-free bug in the DNS resolver.
We eventually pinned down the exact sequence of DNS events needed to crash Envoy. It turned out to be a cascade of failures and retries that had to line up just so.
1. Non-FQDN Query: An application triggers a DNS lookup for a non-fully-qualified domain name (e.g., example.com without the final dot). This is common in Kubernetes clusters, where apps rely on search domains.
2. NXDOMAIN Response: The DNS server returns an NXDOMAIN (non-existent domain) response.
3. Search Domain Retry: Upon receiving the NXDOMAIN, c-ares automatically appends a search domain and re-runs the query (e.g., trying example.com.default.svc.cluster.local). This is normal resolver behavior.
4. Connection Error: Here’s where things start to go sideways. The retried query never even gets a DNS answer because of a connection error. In our customer’s case, this likely happened because the NodeLocal DNS cache had an issue or restarted at exactly that moment, refusing the UDP request. The connection error triggers a callback that deletes the server connection; this is the same connection whose reply is still actively being processed.
5. Response & Crash: c-ares finishes processing the initial answer from step 2 and moves into cleanup logic. But a critical assumption has been broken: the code assumes the connection remains valid between receiving the answer and finishing processing. It tries to access data inside the connection object, but that connection was already destroyed in step 4.
C’mon you guys, we just dereferenced freed memory.
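To make that concrete, here’s a minimal, self-contained sketch of the pattern. This is illustrative C++, not c-ares source; the struct and function names are ours. Compile it with -fsanitize=address and ASan flags the final read as a heap-use-after-free, which is exactly the class of report the instrumented Envoy build gave us:

#include <cstdio>
#include <cstdlib>

// Illustrative only -- a stand-in for a per-server connection object.
struct Connection {
  int fd;
};

// Error path: a send failure tears the connection down immediately.
static void on_connection_error(Connection* conn) {
  std::free(conn);  // the object backing the in-flight reply is now gone
}

// Reply-processing path, analogous in shape to read_answers()/process_answer():
// the NXDOMAIN answer triggers a search-domain retry, the retry hits
// ECONNREFUSED, and the error path frees the very connection we're still using.
static void process_answer(Connection* conn) {
  // ... parse the NXDOMAIN reply, queue the search-domain retry ...
  on_connection_error(conn);                    // retry send fails -> connection freed
  std::printf("cleanup on fd %d\n", conn->fd);  // heap use-after-free
}

int main() {
  auto* conn = static_cast<Connection*>(std::calloc(1, sizeof(Connection)));
  conn->fd = 42;
  process_answer(conn);
  return 0;
}

The real code path is far more involved (multiple queries in flight, callbacks, buffered answers), but the shape is the same: an error path destroys an object that the reply-processing path still holds a pointer to.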
To prove to ourselves (and upstream maintainers) what was happening, we wrote a minimal unit test using the c-ares test framework to simulate this exact scenario. It forces the NXDOMAIN + search-domain + connection-refused sequence and checks if we can trigger the bug:
TEST_P(MockUDPChannelTestAI, ConnectionRefusedOnSearchDomainRetry) {
  // 1 & 2: First query for "www.google.com" gets NXDOMAIN.
  DNSPacket badrsp4;
  badrsp4.set_response().set_aa()  // authoritative answer bit
      .add_question(new DNSQuestion("www.google.com", T_A))
      .set_rcode(NXDOMAIN);  // NXDOMAIN response
  EXPECT_CALL(server_, OnRequest("www.google.com", T_A))
      .WillOnce(SetReplyAndFailSend(&server_, &badrsp4));
  // ^^^ Simulate sending the NXDOMAIN response, then fail the next send (ECONNREFUSED).

  // 3 & 5: Second query for "www.google.com.first.com" (search domain appended) will succeed.
  DNSPacket goodrsp4;
  goodrsp4.set_response().set_aa()
      .add_question(new DNSQuestion("www.google.com.first.com", T_A))
      .add_answer(new DNSARR("www.google.com.first.com", 0x0100,
                             {0x01, 0x02, 0x03, 0x04}));
  EXPECT_CALL(server_, OnRequest("www.google.com.first.com", T_A))
      .WillOnce(SetReply(&server_, &goodrsp4));
  // ^^^ Simulate a normal successful DNS response for the search-domain query.

  // 4: Simulate a connection send failure on the first retry (ECONNREFUSED).
  ares_socket_functions sock_funcs = {0};
  sock_funcs.asendv = ares_sendv_fail;
  ares_set_socket_functions(channel_, &sock_funcs, NULL);

  // 5: Perform the getaddrinfo lookup which triggers the above sequence.
  AddrInfoResult result;  // result holder from the c-ares test framework
  struct ares_addrinfo_hints hints{};
  hints.ai_family = AF_INET;
  ares_getaddrinfo(channel_, "www.google.com", /*servname=*/NULL, &hints,
                   AddrInfoCallback, &result);
  Process();  // Drive the c-ares event loop to process the query/response.
}
When we ran this test under ASan, it consistently reproduced the same use-after-free we saw in production.
Once we had a reproducible test, writing an interim patch was straightforward.1 We built Pomerium/Envoy with the patched c-ares and gave our customer a custom hotfix build. The crashes stopped immediately, and our patch passed the full c-ares test suite.
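We won’t reproduce the upstream patch here (see the GHSA below for the real fix), but the general shape of a fix for this class of bug is to stop the error path from freeing the connection out from under an in-flight reply, for example by deferring destruction until processing has unwound. Continuing the illustrative sketch from earlier, and again not the actual c-ares change:

#include <cstdio>

// Illustrative continuation of the earlier sketch, not the upstream patch:
// the error path only marks the connection, and the owner destroys it after
// reply processing has fully unwound.
struct Connection {
  int fd = 0;
  bool pending_destroy = false;
};

static void on_connection_error(Connection* conn) {
  conn->pending_destroy = true;  // defer: never free mid-processing
}

static void process_answer(Connection* conn) {
  // ... parse the NXDOMAIN reply, queue the search-domain retry ...
  on_connection_error(conn);                    // retry send fails
  std::printf("cleanup on fd %d\n", conn->fd);  // object is still valid here
}

int main() {
  auto* conn = new Connection{};
  conn->fd = 42;
  process_answer(conn);
  if (conn->pending_destroy) {
    delete conn;  // destroyed only once nothing is still using it
  }
  return 0;
}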
We reported our findings (plus repro code and patch) to the c-ares maintainers through their security channels. They acted quickly to validate the report. Given that c-ares is a foundational library for Envoy and many other projects, a use-after-free bug is potentially impactful. Fortunately, the practical exploit scenario is narrow: you’d typically need control over a downstream DNS server (or be in a position to manipulate network responses) to orchestrate the specific NXDOMAIN + connection-refused pattern. We haven’t seen evidence of malicious exploitation in the wild, but a memory safety bug in such a widely used library is definitely something to address promptly.
Our customer was great to work with throughout this process. Huge kudos to them: they were patient, ran instrumented builds for us, and generally collaborated well with us under pressure. After we delivered the one-off patched build that stabilized their cluster, we decided this fix would remain an internal stop-gap until an official upstream release was published.
2025-08-22: DNS instability observed in customer’s prod env.
2025-08-26: Hotfix supplied to customer.
2025-08-26: Reported to c-ares and Envoy with patch & tests.
2025-12-10: Public disclosure.
c-ares GHSA-jq53-42q6-pqr5 published; fix released in v1.34.6.
Envoy GHSA-fg9g-pvc4-776f published; fixed in 1.33.14, 1.34.12, 1.35.8, and 1.36.4.
Pomerium v0.31.3 released, bundling Envoy 1.35.8 built against c-ares 1.34.6.
This isn’t a fiery "we found a vuln" post, but a case study in how things should work: weird bug appears, vendor/customer collaborate, root cause is found in a deep dependency, and the open source ecosystem gets a patch. Yay open source!
That said, the adventure reinforced a few things:
We own our dependencies. If a library we ship crashes, it's our problem. Even if the bug is "in someone else’s code," the blast radius is ours.
Difficult bugs can still be isolated. Even crashes that require "perfect storm" timing can be pinned down with the right tools (ASan) and persistence.
It’s always DNS. The meme exists for a reason.
1 Actually, no: writing an interim patch was subtle and hard, made me regret letting people know I know C pretty well, and resulted in a week of pain and suffering over a patch I wasn’t sure conclusively fixed an edge case of a previously found edge case.