Question about K2/V2 cache computation in prefill vs generation

by kernelpool

I'm trying to understand the caching behavior in modeling_iquestloopcoder.py and noticed that the K2/V2 cache is computed from different inputs in the prefill and generation paths:

Prefill (_forward_loop, lines 1072-1074):

hidden_states, gate_mean = decoder_layer.forward_loop2_mixed(...)
if use_cache and loop_idx == 2:
    hidden_states_normed = decoder_layer.input_layernorm(hidden_states)
    _, k2, v2 = decoder_layer.self_attn.get_qkv(hidden_states_normed, position_ids)

Here hidden_states is the layer OUTPUT (after attention + MLP).

Generation (forward_decode_loop2, line 630):

q2, k2, v2 = self.get_qkv(hidden_states, position_ids)

Here hidden_states is the layer INPUT (before attention).

Is the prefill behavior intentional, or should K2/V2 be computed from the same source in both paths? I ran some tests comparing both approaches against full recomputation: the INPUT-based K2 matches exactly, while the OUTPUT-based one differs slightly. The practical impact seems minimal, since the gates strongly favor global attention (~87%), but I'd like to know whether the discrepancy is deliberate or an oversight.
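
For concreteness, here is roughly the check I ran (a minimal sketch, not my exact harness). `layer`, `layer_input`, `layer_output`, and `position_ids` are placeholders for a decoder layer and activations captured with forward hooks during prefill; the input_layernorm + get_qkv calls mirror the prefill snippet above, and the layernorm placement in the actual decode path may differ:

import torch

@torch.no_grad()
def compare_k2_sources(layer, layer_input, layer_output, position_ids):
    # K2/V2 as the generation path would see them: from the layer INPUT.
    _, k2_in, v2_in = layer.self_attn.get_qkv(
        layer.input_layernorm(layer_input), position_ids
    )
    # K2/V2 as prefill currently caches them: from the layer OUTPUT
    # (after attention + MLP).
    _, k2_out, v2_out = layer.self_attn.get_qkv(
        layer.input_layernorm(layer_output), position_ids
    )
    # Report the largest elementwise discrepancy between the two sources.
    print("max |k2 diff| =", (k2_in - k2_out).abs().max().item())
    print("max |v2 diff| =", (v2_in - v2_out).abs().max().item())

In my runs the INPUT-based values were the ones that matched a cache-free recomputation exactly, while the OUTPUT-based values showed small but nonzero differences.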
