Kimi-K2's Design Philosophy and Performance Insights

Author: Shaowei Liu
Original Link: https://www.zhihu.com/question/1927140506573435010/answer/1927892108636849910
Disclosure: I helped “deliver” Kimi-K2. Every parent thinks their own baby is the cutest.
Since Kimi-K2 shipped, the open-source community has showered us with attention. Many folks have dug into the tiny differences between K2 and DeepSeek-V3 (DSv3). As a low-key infra engineer at Moonshot, I’d like to share, purely from an inference perspective, why the K2 config looks the way it does.
⚠️ Caveat: I’ll sprinkle in some training-side intuition—if anything is off, my pre-train colleagues will roast me later.

1. Design Philosophy of K2’s Architecture

Before we even started training K2, we ran a mountain of scaling-law experiments on architecture variants. Every single variant that differed from DSv3 failed to beat it—at best they tied.
So the question became: “Should we pick an inferior but different architecture just to be different?” The answer was no. DSv3’s structure is battle-tested at scale; our “new ideas” weren’t. With two huge variables already in play (Muon optimizer + larger param count), we didn’t want to add an unverified third variable.
Constraint #1: Inherit DSv3’s architecture wholesale, then tune only the structural hyper-params.
Constraint #2: Cost ceiling. We’re a small company—training and inference budgets are tight. After evaluating DSv3, we judged its costs to be at the upper limit of what we could stomach. K2 had to stay within the same ballpark.
Hence the design task boiled down to:
Within the DSv3 skeleton, find parameters that keep train/inference cost flat while pushing loss significantly lower.
(I’ll skip training details—look for our upcoming tech report 😉.)

2. Concrete Changes & Rationale

If you diff the two config files, you'll see four deviations (a quick side-by-side sketch follows the list):
  1. num_experts = 384
  2. num_attention_heads = 64
  3. first_k_dense = 1 (only the first layer is dense)
  4. n_group = 1 (router has no expert grouping)
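For readers who haven't diffed the files themselves, here is a rough side-by-side of those four knobs. The K2 values are the ones listed above; the DSv3 values are from the public DeepSeek-V3 config as I recall it, and the key names follow this post's shorthand rather than the exact field names in the released files, so treat it as a sketch.

```python
# Side-by-side of the four structural knobs (post's shorthand key names, not
# the exact fields in the released config files). K2 values are from this
# post; DSv3 values reflect the public DeepSeek-V3 config as I recall it.
CONFIG_DIFF = {
    # key                   DSv3   K2
    "num_experts":          (256, 384),
    "num_attention_heads":  (128,  64),
    "first_k_dense":        (  3,   1),   # leading dense (non-MoE) layers
    "n_group":              (  8,   1),   # router expert groups
}

for key, (dsv3, k2) in CONFIG_DIFF.items():
    print(f"{key:22s} {dsv3:>4} -> {k2:>4}")
```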
Let’s walk through each from an inference lens.
We inherit the EP+DP scheme from DeepSeek’s tech report [1] and OpenDay talk [2], and assume communication can be perfectly overlapped for now.

2.1 num_experts = 384

  • Origin: Pre-train team’s sparsity scaling law.
    With fixed activated params, simply increasing total MoE params still obeys the scaling law—no overfitting observed. So 384 experts is the loss-killer.
  • Inference impact:
    • Prefill:
      If we keep the same node count as DSv3’s 256 experts and sequences are long enough, prefill stays compute-bound. Activated params & FLOPs unchanged → no extra time.
    • Decode:
      GEMM becomes memory-bound at production batch sizes.
      1.5× params ⇒ 1.5× memory traffic.
      Under EP=128 (the GCD of 256 and 384, for an apples-to-apples comparison):
      | Model | EP rank load        | MLP weight |
      |-------|---------------------|------------|
      | DSv3  | 2 routed + 1 shared | ~7.5 GB    |
      | K2    | 3 routed + 1 shared | ~10 GB     |
      +2.5 GB per rank.
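A back-of-the-envelope check on where those per-rank numbers come from. The dimensions and layer counts below are my assumptions (hidden size 7168, MoE intermediate size 2048, three projection matrices per expert, FP8 weights at one byte per parameter, roughly 58-60 MoE layers), so expect the output to land near, not exactly on, the table's figures.

```python
# Rough per-EP-rank MoE weight footprint during decode (the memory-bound
# regime, so weight bytes ~ memory traffic). Dimensions and layer counts are
# assumptions: hidden=7168, moe_intermediate=2048, FP8 (1 byte/param).
HIDDEN, MOE_INTER, BYTES_PER_PARAM = 7168, 2048, 1
PARAMS_PER_EXPERT = 3 * HIDDEN * MOE_INTER           # gate + up + down matrices

def mlp_gb_per_rank(total_experts, moe_layers, n_shared=1, ep=128):
    routed_per_rank = total_experts // ep             # routed experts hosted here
    experts_per_layer = routed_per_rank + n_shared    # shared expert is replicated
    return experts_per_layer * PARAMS_PER_EXPERT * moe_layers * BYTES_PER_PARAM / 1e9

print(f"DSv3-like: {mlp_gb_per_rank(256, moe_layers=58):.1f} GB per rank")  # ~7.7
print(f"K2-like:   {mlp_gb_per_rank(384, moe_layers=60):.1f} GB per rank")  # ~10.6
```

The point is the delta of roughly 2.5-3 GB per rank, which is exactly what the next knob claws back.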

2.2 num_attention_heads = 64

MoE just got 50% more expensive; can we claw it back elsewhere?
DeepSeek doubled the head count relative to classic MHA to maximize bandwidth utilization, but that hurts latency in both prefill and decode. We rolled it back to 64.
  • Prefill:
    • Attention FLOPs ∝ h s². Cutting heads halves the quadratic term—huge win for long sequences (our bread & butter: agents, vibe coding).
    • QKVO projection params drop from 10 B → 5 B, shaving FLOPs again.
  • Decode:
    • KV-cache size unchanged (tie).
    • QKVO memory traffic drops by ~5 GB, and the saving lands on every DP rank (no TP dilution).
      The 5 GB savings dwarf the 2.5 GB MoE penalty—net win.
Ablation showed the negative impact on loss is tiny compared to MoE’s gain. Heads=64 locked in.
🧠 Side quest: Fewer heads leave more room for speculative decoding.
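A tiny sketch of the prefill argument in the first bullet above. The per-head dimensions here (a 192-dim QK path and a 128-dim V path, loosely MLA-shaped) are my assumptions purely for illustration; the only thing that matters is that the s² term scales linearly with the head count h.

```python
# Attention-core FLOPs per layer for one sequence of length s:
# the QK^T score matmul plus the AV matmul, each ~2 * s^2 * dim per head.
# Head dims are illustrative assumptions; the h * s^2 scaling is the point.
def attn_core_flops(num_heads, seq_len, qk_dim=192, v_dim=128):
    return num_heads * (2 * seq_len**2 * qk_dim + 2 * seq_len**2 * v_dim)

for s in (8_192, 32_768, 131_072):
    f128, f64 = attn_core_flops(128, s), attn_core_flops(64, s)
    print(f"s={s:>7}: 128 heads {f128:.2e} FLOPs, 64 heads {f64:.2e} FLOPs "
          f"(saving {1 - f64 / f128:.0%})")
```

The QKVO projections contribute a term that is only linear in s, so the longer the context, the more the quadratic term above dominates and the bigger the end-to-end prefill win.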

2.3 first_k_dense = 1

Like DeepSeek, we saw router load imbalance in the first layer; in our case it was confined to layer 1. To avoid wasting experts there, we made layer 1 dense; every other layer is MoE.
  • Prefill impact: none.
  • Decode: a few hundred MB extra traffic—negligible.

2.4 n_group = 1 (no expert grouping)

Grouping helps when multiple experts sit on one GPU, balancing work at device level.
At our scale we must use large EP, so each device holds ≤1 expert. Balancing moves to the node level, yet worst-case imbalance inside a node still kills latency.
Thus dynamic expert re-allocation + redundancy (EPLB) outweigh grouping.
A freer router also enlarges the combinatorial space → better model quality.
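To make the n_group knob concrete, here is a simplified sketch of group-limited top-k routing. This is not the exact production router, and the topk_group=4 value is just illustrative; with n_group=1 the group stage becomes a no-op and the router is free to pick any top-k combination over all 384 experts, which is the larger combinatorial space mentioned above.

```python
import numpy as np

# Simplified group-limited top-k routing (not the exact production router):
# with n_group > 1, experts are split into equal groups, the top `topk_group`
# groups are kept (scored here by their max logit), and the final top_k
# experts are chosen only inside those groups. With n_group = 1 this
# degenerates to a plain global top-k over all experts.
def route(logits, top_k=8, n_group=1, topk_group=1):
    n_experts = logits.shape[0]
    group_size = n_experts // n_group
    group_scores = logits.reshape(n_group, group_size).max(axis=1)
    keep_groups = np.argsort(group_scores)[-topk_group:]
    mask = np.full(n_experts, -np.inf)
    for g in keep_groups:
        mask[g * group_size:(g + 1) * group_size] = 0.0   # unmask kept groups
    return np.sort(np.argsort(logits + mask)[-top_k:])

rng = np.random.default_rng(0)
logits = rng.standard_normal(384)
print("n_group=1 :", route(logits, n_group=1, topk_group=1))   # free global top-8
print("n_group=8 :", route(logits, n_group=8, topk_group=4))   # group-constrained top-8
```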

3. Wrap-Up

Those four tiny knobs yield an inference recipe that:
  • Keeps EP count identical to DSv3.
  • Grows total params 1.5× yet, ignoring comms, shows lower theoretical prefill & decode time.
  • Even after real-world comm overlap, cost stays no worse than DSv3.
Each change was backed by solid theory + ablation. Once K2 is fully open-sourced, we hope the wider inference community will stress-test these claims.
Thanks for the love you’ve shown Kimi-K2!

Postscript

All roofline estimates assume our specific GPU model.
Your batch-size-to-compute sweet spot will differ—don’t copy-paste blindly.
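For what it's worth, here is the kind of minimal roofline check the postscript refers to, with placeholder GPU numbers (the peak FLOP/s and HBM bandwidth below are hypothetical); swap in your own hardware's figures to find where your decode batch crosses from memory-bound to compute-bound.

```python
# Minimal roofline check for a (batch x hidden) @ (hidden x out) GEMM:
# compare arithmetic intensity (FLOPs per byte moved) with the ridge point
# (peak FLOP/s divided by peak bandwidth). GPU numbers are placeholders.
PEAK_FLOPS = 1.0e15        # hypothetical low-precision peak, FLOP/s
PEAK_BW = 3.3e12           # hypothetical HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW

def gemm_intensity(batch, hidden, out, w_bytes=1, act_bytes=2):
    flops = 2 * batch * hidden * out
    traffic = (hidden * out * w_bytes                         # weights read once
               + (batch * hidden + batch * out) * act_bytes)  # activations in/out
    return flops / traffic

for b in (16, 64, 256, 1024):
    ai = gemm_intensity(b, hidden=7168, out=2048)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"batch={b:>5}: {ai:6.1f} FLOP/byte vs ridge {RIDGE:.0f} -> {regime}")
```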

References

[1] DeepSeek-V3 Technical Report
[2] One More Thing, DeepSeek-V3/R1 Inference System Overview