DMIT Operations Log - 2025.12.07

laumaomao 2025-12-7

DMIT LAX Network Failure Analysis

At approximately 19:35:00 Pacific Time, DMIT deployed a change within the LAX metro to introduce IPv6 over MPLS and IS-IS for the access switches.

  1. DMIT uses loopback addresses for IGP routing on all devices.
  2. However, in the IPv6 route reflector (RR) configuration, we did not normalize the next-hop of IPv6 routes received from access switches: the Next-Hop was not rewritten to the Peer-Address and remained the final interface address.
  3. Due to iBGP behavior, next-hop addresses are not automatically rewritten to the Peer-Address.
  4. To prevent address conflicts caused by certain customers using reserved IPv4/IPv6 addresses as Point-to-Point (PtP) interface addresses, DMIT's internal network does not propagate specific port addresses. (This is why we have to rewrite the next-hop on all iBGP routes.)
  5. As a result, the edge router could not resolve the actual next-hop, which was the final interface address.
  6. When DMIT's border router fails to resolve a specific next-hop, it falls back to a Transit table.
  7. In the Transit table, the route in the FIB was programmed to point to the customer table.
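
The resolution failure described in the list above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not DMIT's actual configuration: the addresses (taken from the 2001:db8::/32 documentation range), the table contents, and the function names are all hypothetical. It shows why a next-hop left as an interface address cannot be resolved when the IGP carries only loopbacks, and why rewriting it to the peer address fixes that.

```python
import ipaddress

# Hypothetical IGP table: only loopbacks are advertised, interface (port)
# addresses are deliberately not propagated (points 1 and 4).
IGP_ROUTES = [
    ipaddress.ip_network("2001:db8:ffff::1/128"),  # access switch loopback (assumed)
    ipaddress.ip_network("2001:db8:ffff::2/128"),  # route reflector loopback (assumed)
]


def resolve_next_hop(next_hop: str) -> bool:
    """Return True if the BGP next-hop is covered by a route in the IGP table."""
    addr = ipaddress.ip_address(next_hop)
    return any(addr in net for net in IGP_ROUTES)


def install_route(next_hop: str, peer_address: str, rewrite_to_peer: bool):
    """Model the edge router's choice for one iBGP route.

    If the (possibly rewritten) next-hop resolves via the IGP, forward toward
    it; otherwise fall back to the Transit table (point 6).
    """
    effective = peer_address if rewrite_to_peer else next_hop
    if resolve_next_hop(effective):
        return ("igp", effective)
    return ("transit-fallback", None)


# Next-hop left as the customer-facing interface address: unresolvable.
print(install_route("2001:db8:aaaa::1", "2001:db8:ffff::1", rewrite_to_peer=False))
# Next-hop rewritten to the peer's loopback: resolvable.
print(install_route("2001:db8:aaaa::1", "2001:db8:ffff::1", rewrite_to_peer=True))
```

In this model the only difference between the failing and the working case is the `rewrite_to_peer` flag, which stands in for the missing next-hop rewrite on the IPv6 RR.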

These factors collectively caused IPv6 traffic originating from customers to loop continuously through multiple VRFs on a single router until its TTL of 128 expired.
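
To make the loop concrete, the following Python sketch follows a packet through table-to-table lookups inside one router. It is a simplified model, not the real forwarding plane: the table names, the `vrf:` action syntax, and the assumption that every hand-off between tables decrements the TTL by one are hypothetical choices made for illustration.

```python
def forward_in_router(start_table: str, fib: dict, ttl: int = 128):
    """Follow lookups between tables on a single router.

    `fib` maps a table name to an action: either "vrf:<table>" (hand the
    packet to another table) or any other string (the packet leaves the loop).
    Returns the final action and the number of internal passes consumed.
    """
    table = start_table
    passes = 0
    while ttl > 0:
        action = fib[table]
        if not action.startswith("vrf:"):
            return action, passes
        table = action.split(":", 1)[1]
        ttl -= 1      # assumed: each table-to-table hand-off costs one hop
        passes += 1
    return "dropped: TTL expired", passes


# Faulty state: customer table falls back to Transit, Transit points back.
LOOPING_FIB = {"customer": "vrf:transit", "transit": "vrf:customer"}
# Healthy state (assumed): the customer table has a resolvable egress.
FIXED_FIB = {"customer": "egress:access-switch"}

print(forward_in_router("customer", LOOPING_FIB))
print(forward_in_router("customer", FIXED_FIB))
```

Under these assumptions a single looping packet consumes 128 internal passes before it is dropped, compared with none for a correctly resolved route, which gives a sense of how a modest amount of customer traffic can be multiplied inside the router.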

This ultimately exhausted backplane bandwidth, resulting in RR disconnections. When the RR disconnected, customer routing was interrupted, the looping traffic dropped, and the network recovered briefly before the looping failure recurred.

This configuration fault caused 3 minutes of downtime and a cumulative 13 minutes of degraded service. DMIT sincerely apologizes for this incident.


