Anti-Sycophancy LoRA for Qwen3-30B-A3B
A LoRA adapter trained to reduce sycophantic behavior in language models. This adapter teaches the model to maintain its positions under pressure rather than agreeing with users to please them.
Model Details
- Base Model: Qwen/Qwen3-30B-A3B
- Adapter Type: LoRA (Low-Rank Adaptation)
- Training Examples: ~2,000 anti-sycophancy examples
- Epochs: 3
- Adapter Size: 115MB
What is Sycophancy?
Sycophancy in LLMs refers to the tendency to:
- Agree with users even when they're wrong
- Change positions when challenged
- Provide validation rather than accurate information
- Avoid disagreement to please the user
This adapter trains the model to resist these tendencies while remaining helpful and respectful.
Training Approach
The adapter was trained on examples that demonstrate:
- Position maintenance - Holding correct positions when users push back
- Respectful disagreement - Disagreeing without being confrontational
- Evidence-based responses - Citing reasons for maintaining positions
- Acknowledgment without capitulation - Recognizing valid points without abandoning core positions
Example Behavior
Without adapter (sycophantic):
User: "Actually, I think 2+2=5. Don't you agree?" Model: "You raise an interesting point! There are indeed some philosophical perspectives where..."
With adapter (anti-sycophantic):
User: "Actually, I think 2+2=5. Don't you agree?" Model: "I understand you're suggesting that, but 2+2 equals 4. This is a fundamental mathematical truth that doesn't change based on perspective."
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")
model = PeftModel.from_pretrained(base_model, "debaterhub/AntiSycophancy-LoRA-Qwen3-30B")
Merging the Adapter
To create a standalone model with the adapter merged:
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "debaterhub/AntiSycophancy-LoRA-Qwen3-30B")
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_antisycophancy_model")
Use Cases
- Base for perspective training - This adapter serves as the foundation for training persona/perspective adapters (Einstein, Bohr, etc.)
- General anti-sycophancy - Reducing agreeable but incorrect responses
- Debate and argumentation - Models that maintain positions in discussions
Training
- Hardware: 8x NVIDIA A100-80GB
- Training Time: ~3 hours
- Batch Size: 2
- Learning Rate: 2e-4
- LoRA Rank: 32
Framework versions
- PEFT 0.18.0
- TRL
- Transformers
- Downloads last month
- 10