Abstract
Offline selective refusal in large language models is achieved through circuit-restricted weight updates that eliminate runtime intervention costs while maintaining performance.
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ (Circuit-Restricted Weight Arithmetic), which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθ_C supported only on that circuit (typically <5% of parameters). Applying Δθ_C yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
Community
C-Δθ (Circuit-Restricted Weight Arithmetic) shifts selective refusal from inference-time steering to an offline, checkpoint-level edit. It first identifies the refusal-causal circuit via EAP-IG, then applies a circuit-restricted weight update that typically touches <5% of parameters, yielding a drop-in “safe-by-default” model with no runtime hooks or latency overhead. Across 30 model-category settings, C-Δθ sharply improves harmful refusal while keeping benign over-refusal controlled, preserving capability on standard benchmarks and generalizing to OOD attacks.
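The core idea of a circuit-restricted update can be sketched in a few lines: compute some weight delta, then zero it out everywhere except on the sparse parameter set identified as the circuit. The snippet below is a minimal illustrative sketch, not the paper's implementation; the parameter names, the toy shapes, and the random mask standing in for an EAP-IG-derived circuit are all assumptions for illustration.

```python
import numpy as np

def apply_circuit_restricted_update(theta, delta, circuit_mask):
    """Return edited weights theta + mask * delta.

    Parameters outside the circuit (mask entry 0, or tensors with no
    mask at all) are left untouched, so the result deploys as a
    standard checkpoint with no runtime hooks.
    """
    edited = {}
    for name, w in theta.items():
        mask = circuit_mask.get(name, np.zeros_like(w))
        edited[name] = w + mask * delta[name]
    return edited

# Toy example: two "layers"; the circuit covers a few entries of one.
rng = np.random.default_rng(0)
theta = {"mlp.w": rng.normal(size=(4, 4)), "attn.w": rng.normal(size=(4, 4))}
delta = {k: rng.normal(size=v.shape) for k, v in theta.items()}
# Sparse 0/1 mask standing in for the EAP-IG circuit (hypothetical).
circuit_mask = {"mlp.w": (np.abs(rng.normal(size=(4, 4))) > 1.5).astype(float)}

edited = apply_circuit_restricted_update(theta, delta, circuit_mask)
# Weights outside the circuit are bit-identical to the original.
assert np.allclose(edited["attn.w"], theta["attn.w"])
```

In a real setting the delta would come from fine-tuning or task arithmetic and the mask from the attribution step; the key property shown here is that the edit's support is restricted to the circuit, so everything else in the checkpoint is unchanged.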
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering (2026)
- Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path (2026)
- Per-Axis Weight Deltas for Frequent Model Updates (2025)
- Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach (2026)
- When Benchmarks Leak: Inference-Time Decontamination for LLMs (2026)
- Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs (2026)
- Steering to Say No: Configurable Refusal via Activation Steering in Vision Language Models (2026)