Systems and Tools Notes

When the Manufacturing Certificates Quietly Expired: An ACI EOL Postmortem

I sat down to add one access port. Eight hours later I had three nested problems, one controller that could not rejoin its cluster, and a much clearer view of why some fixes are not worth attempting.

PublishedApril 23, 2026

Reading time8 min read

Tagscisco-aci • apic • pki • tls • certificates • eol-hardware • incident-postmortem • troubleshooting • network-operations

Overview

A routine port-add on a Cisco ACI fabric exposed nine months of silent policy distribution failures, a seven-hour cluster time drift, and a fleet of expired manufacturer certificates on hardware past its support life. The right answer turned out to be the unsatisfying one.

The five-minute task that wasn't

The work item was small. Bind a host-facing port to an existing endpoint group, on both leaves, using a policy group that already existed in the fabric. The same shape of change had been done many times on the same EPG, on the same leaves, by the same team. The change request gave it 30 minutes; I expected it to take five.

I created the access port selector through the GUI on both leaves. Submitted. Refreshed. The selector did not appear in the right-hand pane. I chalked that up to a stale view, looked it up directly through the API, and found that the object existed in the controller's management information tree exactly as expected. State formed, no faults, replicated identically to all three controllers in the cluster. Audit log showed me as the creator with the right timestamp. Everything looked correct except for the one thing that mattered: the leaf had no idea this configuration existed.

Silent policy distribution failure

When you query the leaf-side physical interface object directly, it tells you what configuration the leaf is actually running, not what the controller intended. The interface for my new port came back with usage discovery, switching state disabled, and a last-modified timestamp from nine months ago. That timestamp was the moment that stopped feeling like a configuration error and started feeling like an infrastructure problem. The leaf had not received any policy update for that port, not from this change, not from anything since the previous summer.

I checked a sibling port on the same leaf, in the same EPG, configured by someone else years earlier. It came back with usage epg, switching state enabled, modification timestamp from the last successful policy push window. The historical configuration was working. The new configuration was not even reaching the leaf. There were no faults on either side, and no error in the audit chain. The change had failed in the most expensive way: silently.

Things that looked fine and were genuinely fine: cluster fitness, fabric node states, MIT consistency across all three controllers, audit log, REST submission paths, GUI submission paths, and every faultInst I could query.
Things that looked fine but had a subtle quirk: the tree-view query against parent profiles returns only the parent object, not its children, even when children exist. Direct DN queries against the children are required to actually verify them.
Things that were not fine but did not generate faults: the leaf simply was not receiving policy updates.

A seven-hour drift that wasn't a clock issue

I went to look at cluster health. The output that stopped me was the cluster-wide read of the controller appliance state. Every newly created managed object was being stamped with a logical timestamp that was about seven hours behind the wall clock. Same offset on every controller. Static, not drifting forward or backward over the next few minutes. Wall-clock time on the OS, verified with the standard date utility, was correct on all three controllers and matched my workstation. Time sync to the upstream source was healthy and within tolerance. The thing that was off was not the operating system clock. It was the cluster's internal logical time.

My first instinct was to dismiss it. The OS clocks agreed, the time-sync service was happy, and a logical-time abstraction inside the cluster sounded like an internal accounting detail that was unlikely to break leaf-side policy programming. That was the wrong instinct, and it cost me about an hour.

The certificate audit that explained everything

Once I stopped trying to make the time drift the root cause and started treating it as another symptom, I went looking for a layer below it. The fabric's PKI seemed worth checking. There is a single managed-object class on the controller that records the per-node SSL certificate state, including a boolean for whether the certificate is currently inside its validity window. I queried it. Every node in the fabric, all three controllers, both spines, both leaves, came back with that boolean set to true for expired.

The issuer on the certificates was the Cisco manufacturer-installed CA. These were the certificates that get burned in at the factory and are intended to live for the documented hardware lifetime. The hardware in question was first-generation controllers from 2016, well past its Last Date of Support announcement, and the certificate validity windows had ended within the previous few weeks. There is no admin-facing renewal path for these certificates. Replacement on EOL hardware is a TAC root-shell procedure, and the fabric had no active support contract because the hardware was already scheduled for replacement.

The certificate flag was clear, the issuer was clear, and the validity dates were clear, this was not a parsing edge case or a transient PKI fault, it was simple time-window expiration.
The seven-hour cluster time offset suddenly looked very different in this light: a static backward shift to a moment when the certificates were still inside their validity window would explain why the cluster could keep talking to itself but could not push fresh signed policy to leaves whose own clocks now disagreed.
I cannot prove the cert expiry caused the policy distribution failure that started nine months earlier, only that they share a layer, and that the simplest hypothesis is consistent with both.

The reboot that almost broke quorum

Before I had pivoted to the certificate hypothesis, I had been entertaining a sequential controller reboot as a way to clear the time drift. The first controller went down cleanly. It came back up with its internal logical clock now matching wall time, which was the expected effect. It also could not rejoin its peers.

The DME logs on the rebooted controller showed the inter-controller TLS handshake failing on every retry, roughly every twelve seconds, in both directions. The OpenSSL diagnostics named the reason: certificate has expired, error 10 at depth 0, server certificate verify failed, TLS alert certificate expired. Both peers were rejecting each other on the same condition. The two surviving controllers were still happily talking to each other, and the data plane was completely unaffected, but the cluster was now a hard 2-of-3 with no path back to 3-of-3.

What kept the cluster alive was that the surviving controllers had established their TLS sessions to each other before the certificates lapsed. Long-lived sessions tolerate cert expiry on the existing connection in a way that new handshakes do not. That single property is now the entire margin of safety on this fabric. Any reboot, NIC flap, DME service restart, fabric upgrade, or leaf reload that forces a fresh handshake puts the cluster at risk of falling to 1-of-3, which is the point at which configuration changes start being rejected and recovery becomes a much harder conversation.

The platform hardening that protected me from myself

With the rebooted controller stuck and rejected on cert expiry, the next idea was the obvious one: roll its clock back to a time when the certificates were still valid, see if the handshake succeeds, and figure out a path forward from there. Stop the time-sync service. Set the date. Watch the next reconnect attempt.

Every step of that plan was blocked. The admin account on the controller has no privilege to set the system clock. It cannot stop the time-sync daemon. It cannot list its own sudo privileges. The platform is intentionally hardened so that the highest-privilege account exposed through the documented support interface cannot make system-level changes that would compromise certificate validation, audit integrity, or cluster timekeeping. In the moment, that hardening was deeply frustrating. In hindsight, it was correct.

What I had to walk back from earlier in the day

Two things I had said earlier in the same investigation needed correcting. The first was a Cisco field notice I had cited as describing this situation. When I went back and read it carefully, it described a different defect entirely, a parsing failure in a specific switch image release where new hardware revisions were emitting an unparseable certificate subject line, producing the same expired-flag value through a completely different mechanism. The query I was running to detect my problem was the same one Cisco's bulletin used to detect theirs, which is what made the misattribution easy. The remediations had nothing in common. Citing the wrong field notice in a postmortem would have sent the next engineer down the wrong path.

The second was a controller diagnostic subcommand I had floated as a potential remediation. The name fit the symptom; it sounded like the kind of thing that would renew an internal CA chain and re-issue per-node certificates. I had labeled it as speculation at the time, but that label gets thinned out as a conversation continues, and by the time the suggestion was being repeated it was reading like a candidate fix rather than a guess. When I went and verified, the subcommand was not in any of the published controller documentation, the managed objects it would presumably write to are flagged non-configurable in the schema, and the cert that the inter-controller TLS handshake actually presents on the wire is the manufacturer-installed one, not the kind that an internal CA renewal would touch. Even if the subcommand existed, it would not have helped.

Freeze, migrate, and the lessons that remain

The decision was to do nothing further to the cluster. The data plane was unaffected. The two surviving controllers had a quorum that would hold as long as nothing forced a fresh TLS handshake between them. The original work item that started the day could be either landed on the replacement fabric, already being built and already on a calendar, or implemented by repurposing an in-service port that carried a working pre-expiry policy chain. The migration timeline got moved up the priority list, and the existing cluster got a written do-not-touch policy covering reboots, service restarts, NIC events, fabric upgrades, and any controller diagnostic command that performs a write.

There is no satisfying remediation here. The remediation is the new hardware. Everything else is just keeping the existing system stable enough to bridge to it. That is a reasonable place to land, but it took most of a day, two controller restarts, and several wrong guesses to get there, and the operational lessons are worth more than the technical ones.

On any clustered controller platform with PKI-secured intra-cluster communication, the certificates have an expiration date and that date is on the hardware lifecycle, not the software lifecycle. It needs to be on the same calendar that tracks support contracts and refresh budgets.
When a fabric is past its Last Date of Support and out of TAC contract, any operation that forces a fresh TLS handshake between cluster members becomes a high-risk operation. That includes operations the documentation describes as routine.
A static, non-progressive drift in any timestamp is a clue that something is being held constant on purpose. Find the thing holding it before you try to correct it.
The single most useful diagnostic for fabric-wide PKI health on this platform is a one-line query against the per-node certificate object. It should be part of the standard health check, not something you run only after something else has gone wrong.
Platform hardening that blocks creative recovery is doing its job. If you find yourself frustrated by it during an incident, that is usually a sign the recovery you were about to attempt was the wrong one.