2025.12.12 23:48
LAX Pro:
CN2 (AS4809) DRT single point of failure. Waiting for CT to respond.
CN2 misconfigured the interface policer on the CSLA failover path, leading to congestion.
2025.12.13 02:44
HKG:
Observed packet loss on some host nodes.
NOC is working on it.
DMIT has already escalated this issue to the highest level of support.
Per current information, the CTG NOC did not deliver the service as contracted, including the specific BGP configuration and interface rates.
The CTG NOC responded that they had resolved the issue, but it turned out nothing had changed.
The BGP session stays down and the policer is still wrong.
A CTG HKG BGP session misconfiguration led to session resets.
DMIT is escalating this issue with the CT Group.
Please wait for further updates.
Both CN2 LAX and HKG:
China Telecom did not configure the BGP sessions and interfaces correctly.
The service auto-failover triggered a cascade of failures.
In addition, the CTG NOC has not been helpful at all; the escalated NOC team is not responding.
The HKG CTG CN2 session has recovered.
The routing will be restored soon.
We observed an incoming DDoS attack targeting 3 of our HKG.Pro subnets, which may have caused packet loss for some customers.
The longer-than-expected downtime was caused by the lack of response from the CTGnet NOC.
DMIT has been working with the CTG NOC since 9 AM EST today, but the issue has still not been fully resolved.
The service failure notice will be published once we have secured all services.
DMIT Network Incident Report: LAX & HKG
This will be the last update unless another major event occurs.
Here is the combined technical postmortem regarding the recent network instability.
🇺🇸 LAX CN2 GIA Incident
Current Status: All immediate mitigations applied. Final correction from CTG is pending due to the China-wide "Network Freeze" (ending Dec 15).
- Root Cause: Prefix Limit Exceeded
The Mismatch: DMIT ordered a 1k prefix limit, but the provider (CTG) left it at the default of 300. This parameter cannot be verified after service delivery, so we trusted that it had been configured as ordered.
The Trigger: Two clients increased route announcements + multiple DDoS RTBH routes pushed the count over 300.
The Result: AS4809 (CN2) immediately idled the BGP session upon exceeding the limit.
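The failure chain above can be sketched as a toy model. This is a hypothetical illustration, not router code: only the default limit of 300, the ordered limit of 1k, and the fact that the count crossed 300 come from the report — the individual prefix counts below are assumed.

```python
# Toy model of BGP maximum-prefix enforcement (illustrative only).
# When announced prefixes exceed the configured limit, the router
# tears the session down and holds it in the Idle state.

def session_state(announced_prefixes: int, max_prefix_limit: int) -> str:
    """Return the resulting BGP session state for a given prefix count."""
    if announced_prefixes > max_prefix_limit:
        return "Idle"          # limit exceeded: session torn down
    return "Established"       # within limit: session stays up

# DMIT ordered a 1000-prefix limit; the provider left the default 300.
ordered_limit, configured_limit = 1000, 300

# Assumed breakdown: baseline routes + new client announcements + RTBH /32s.
count = 280 + 15 + 20  # 315 total, just over the default limit

print(session_state(count, ordered_limit))     # ordered limit: stays up
print(session_state(count, configured_limit))  # default limit: goes Idle
```

With the ordered 1k limit the same 315 prefixes would have been harmless; the default of 300 turned a routine announcement increase into a session teardown.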
- Why did failover result in packet loss?
Design: The backup session (CoreSite) remained UP as designed (filtering DDoS routes to save prefix space).
The Critical Failure: Provider LACP misconfiguration. CTG configured our link aggregation at the capacity of a single interface, ignoring our multiple physical 10G connections.
Impact: When traffic shifted to CoreSite, it exceeded the logical 10G cap, causing severe congestion and packet loss despite physical capacity being available.
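The capacity mismatch can be illustrated with a toy model. The 10G logical cap is from the report; the member-link count, per-link rate, and the failover traffic figure are assumptions for illustration only.

```python
# Toy model of the LACP misconfiguration: a link-aggregation group (LAG)
# only delivers the sum of its member links if the logical interface is
# provisioned (policed/shaped) for that sum.

def usable_capacity_gbps(member_links: int, link_rate_gbps: int,
                         logical_cap_gbps: int) -> int:
    physical = member_links * link_rate_gbps
    # A policer on the logical interface caps the whole bundle.
    return min(physical, logical_cap_gbps)

links, rate = 2, 10  # assumed: 2 x 10G member links

correct = usable_capacity_gbps(links, rate, logical_cap_gbps=20)  # full LAG
miscfg  = usable_capacity_gbps(links, rate, logical_cap_gbps=10)  # capped

failover_traffic = 14  # Gbps shifted onto CoreSite (illustrative figure)
print(failover_traffic > miscfg)   # True: traffic exceeds the capped bundle
print(failover_traffic > correct)  # False: physical capacity would suffice
```

The same failover load that the physical links could absorb overflowed the misconfigured logical cap, which is exactly the "congestion despite available capacity" described above.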
- Why the long recovery?
Administration: Due to the "Network Freeze," router CLI access is suspended.
Approval: CTA/CTG required emergency access approval from the Group level. Since it was after-hours in China, getting this authorization took significant time.
🇭🇰 HKG Incident
Current Status: 99.9% of traffic is successfully filtered. Active monitoring is in place. The attack is ongoing at 10 Mpps.
- Root Cause: "Carpet Bombing"
Attack Type: A massive carpet-bombing attack, sweeping across every address in the targeted ranges, hit 3 specific subnets.
Vectors: A mixed volume of TCP SYN, TCP ACK (zero/empty), SYN-ACK, TCP Null, FIN, and RST packets.
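As a rough sketch of how such flag-based vectors can be told apart, a filter might key on TCP flag bits. This is a simplified illustration — the labels and function are hypothetical, not DMIT's actual scrubbing rules; production mitigation also uses rates, connection state, and payload heuristics.

```python
# Simplified classification of the attack vectors listed above by TCP
# flag bits (values per the TCP header definition).
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10

def classify(flags: int) -> str:
    if flags == 0:
        return "tcp-null"      # no flags set: never legitimate traffic
    if flags == SYN:
        return "syn"           # SYN flood: fake connection attempts
    if flags == SYN | ACK:
        return "syn-ack"       # reflected handshake replies
    if flags == ACK:
        return "bare-ack"      # empty ACK flood
    if flags & (FIN | RST):
        return "fin-rst"       # FIN/RST floods against idle ports
    return "other"

print(classify(0))          # tcp-null
print(classify(SYN | ACK))  # syn-ack
```

TCP Null and bare-flag packets can be dropped outright; SYN and ACK floods are harder because they mimic legitimate traffic, which is part of why mixed-vector attacks like this one are difficult to filter.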
- Why did mitigation fail initially?
The Leak: A combination of misconfigured detour rules and a hardware fault caused traffic to bypass the local scrubbers. Malicious traffic entered the DMIT backbone directly via LAX IP transit and reached the HKG PoP unscrubbed.
The "Red Herring": We initially focused on refining rules, not realizing the mitigation equipment itself had a hardware/software fault. This misled our diagnosis and delayed the fix.
- Resource Contention: The concurrent critical failure in LAX required non-stop coordination, splitting our engineering resources and inevitably slowing down the HKG diagnosis.
🛡️ Future Prevention & Commitment
Stricter Auditing: We will add an extra layer of manual review for every text field on vendor orders, to ensure that delivered configurations (such as prefix limits and LACP speeds) exactly match our requirements.
The Reality: DDoS vectors evolve rapidly. While we cannot guarantee zero incidents, DMIT commits to using every resource to maintain stability and protect your business at reasonable costs.
Reimbursement: All services, regardless of location and network profile, will have their traffic reset today, and every existing service will receive one additional free traffic reset before May 2026. (To be delivered later via a website feature.)

Unless otherwise noted, all articles on this blog are original.
If reposting, please credit the source: https://www.dmithost.net/2025/12/dmit%e8%bf%90%e7%bb%b4%e6%97%a5%e5%bf%97-2025-12-12/