🔒 Zero-Day Exploit Scanner & Fixer

A fine-tuned code security model that detects vulnerabilities and generates fixes across multiple programming languages.

Built on Qwen2.5-Coder-7B-Instruct with QLoRA fine-tuning on 90K+ real-world vulnerability-fix pairs from CVE/CWE databases.

🎯 What It Does

Given any code snippet, this model will:

  1. SCAN — Determine if the code contains a security vulnerability (VULNERABLE / SAFE)
  2. IDENTIFY — Classify the vulnerability type (CWE ID) and link to known CVEs
  3. EXPLAIN — Describe the attack vector, impact, and exploitation mechanism
  4. FIX — Generate corrected code that patches the vulnerability
  5. DOCUMENT — Explain what was changed and why
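
As an illustration, one scan result touching all five steps could be represented like this (the model emits free-form text; the field names below are purely hypothetical, not a guaranteed output schema):

```python
# Hypothetical structure mirroring the five steps above (illustrative only).
report = {
    "verdict": "VULNERABLE",   # step 1: SCAN
    "cwe": "CWE-120",          # step 2: IDENTIFY
    "cve_refs": [],            # related CVEs, if any are known
    "explanation": (           # step 3: EXPLAIN
        "strcpy copies attacker-controlled input into a fixed 64-byte "
        "stack buffer with no bounds check, allowing a stack overflow."
    ),
    "fixed_code": (            # step 4: FIX
        "strncpy(buf, input, sizeof(buf) - 1); buf[sizeof(buf) - 1] = '\\0';"
    ),
    "change_summary": (        # step 5: DOCUMENT
        "Replaced unbounded strcpy with a bounded copy plus explicit NUL termination."
    ),
}
print(report["verdict"])
```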

πŸ—οΈ Architecture

| Component | Details |
| --- | --- |
| Base Model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Method | QLoRA (4-bit NF4 quantization) |
| LoRA Config | r=16, α=32, dropout=0.05 |
| Target Modules | q, k, v, o, gate, up, down projections |
| Training | SFT with assistant-only loss |
| Max Length | 2048 tokens |
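
As a rough sanity check on adapter size, the trainable-parameter count implied by this config can be computed from the model shapes. The dimensions below (hidden size 3584, intermediate size 18944, 28 layers, 4 KV heads of dimension 128) are assumptions taken from the Qwen2.5-7B config; each adapted projection adds r × (d_in + d_out) weights:

```python
# Sketch: LoRA trainable-parameter count for r=16 on all seven projections.
# Model dimensions are assumptions based on the Qwen2.5-7B configuration.
r = 16
hidden, intermediate, layers = 3584, 18944, 28
kv_dim = 4 * 128  # 4 KV heads x head_dim 128

def lora_params(d_in, d_out, rank=r):
    # A rank-r adapter on a d_in x d_out linear layer adds rank*(d_in + d_out) weights.
    return rank * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)          # q_proj
    + lora_params(hidden, kv_dim)        # k_proj
    + lora_params(hidden, kv_dim)        # v_proj
    + lora_params(hidden, hidden)        # o_proj
    + lora_params(hidden, intermediate)  # gate_proj
    + lora_params(hidden, intermediate)  # up_proj
    + lora_params(intermediate, hidden)  # down_proj
)
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable parameters")  # ~40M, well under 1% of 7B
```

This is why QLoRA fits on a single 24GB GPU: only the ~40M adapter weights are trained, while the 4-bit-quantized base stays frozen.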

📊 Training Data

Combined from three curated vulnerability datasets, plus mined safe examples, totaling ~90K samples:

| Dataset | Samples | Languages | Source |
| --- | --- | --- | --- |
| MegaVul | ~17K | C/C++ | 992 repos, 169 CWE types, 2006–2023 |
| TitanVul | ~38K | C, C++, Java, Python, JS | Aggregated from 7 sources, deduplicated |
| CleanVul | ~26K | Multi-language | LLM-filtered, vulnerability_score ≥ 1 |
| Safe samples | ~12K | Multi-language | Fixed code from TitanVul (negative examples) |
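
The per-dataset counts above sum to the headline figure:

```python
# Approximate per-source sample counts from the table above.
samples = {"MegaVul": 17_000, "TitanVul": 38_000, "CleanVul": 26_000, "safe": 12_000}
total = sum(samples.values())
print(total)  # 93000 -- i.e. "90K+" once the per-source rounding is accounted for
```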

Data Quality Controls

  • CleanVul filtered by vulnerability_score >= 1 (removes ~27% noise)
  • TitanVul aggregates and deduplicates BigVul + DiverseVul + CVEFixes + PrimeVul + more
  • Safe code examples from patched functions reduce false positive rate
  • Each sample includes CVE ID, CWE type, vulnerability description, and commit message

🚀 Quick Start

Installation

pip install transformers peft torch bitsandbytes accelerate

Python API

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "jacobmahon/zero-day-exploit-scanner-fixer")
tokenizer = AutoTokenizer.from_pretrained("jacobmahon/zero-day-exploit-scanner-fixer")

# Scan code
messages = [
    {"role": "system", "content": "You are a security expert. Analyze code for vulnerabilities and provide fixes."},
    {"role": "user", "content": "Analyze this C code for vulnerabilities:\n```c\nvoid process(char *input) {\n    char buf[64];\n    strcpy(buf, input);\n}\n```"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.3, top_p=0.9)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

CLI Usage

# Scan a code string
python inference.py --code "char buf[10]; gets(buf);"

# Scan a file
python inference.py --file vulnerable.c

# Interactive mode
python inference.py --interactive
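
`inference.py` ships with the repository; a minimal argparse skeleton consistent with the three flags above (a sketch, not the actual script) would be:

```python
import argparse

def build_parser():
    # Mirrors the CLI shown above: three mutually exclusive input modes.
    p = argparse.ArgumentParser(
        description="Scan code for vulnerabilities and suggest fixes."
    )
    mode = p.add_mutually_exclusive_group(required=True)
    mode.add_argument("--code", help="code snippet to scan")
    mode.add_argument("--file", help="path to a source file to scan")
    mode.add_argument("--interactive", action="store_true",
                      help="REPL-style scanning session")
    return p

# Example: parsing the file-scan invocation.
args = build_parser().parse_args(["--file", "vulnerable.c"])
print(args.file)
```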

📋 Supported Vulnerability Types

The model has been trained on 169+ CWE types including:

| Category | CWE Examples |
| --- | --- |
| Memory Safety | CWE-119 (Buffer Overflow), CWE-120 (Buffer Copy), CWE-416 (Use After Free), CWE-476 (NULL Pointer Deref) |
| Injection | CWE-79 (XSS), CWE-89 (SQL Injection), CWE-78 (OS Command Injection) |
| Authentication | CWE-287 (Improper Auth), CWE-306 (Missing Auth), CWE-798 (Hardcoded Credentials) |
| Cryptography | CWE-327 (Broken Crypto), CWE-330 (Insufficient Randomness) |
| Race Conditions | CWE-362 (Race Condition), CWE-367 (TOCTOU) |
| Input Validation | CWE-20 (Improper Input Validation), CWE-190 (Integer Overflow) |
| Access Control | CWE-862 (Missing Authorization), CWE-863 (Incorrect Authorization) |
| Information Disclosure | CWE-200 (Info Exposure), CWE-209 (Error Message Info Leak) |
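
For instance, the CWE-89 (SQL Injection) pattern from the table is the classic string-built query, and the standard fix is parameterization. A minimal Python illustration (using the stdlib sqlite3 module, not the model itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "alice' OR '1'='1"

# VULNERABLE (CWE-89): attacker input is spliced directly into the SQL text.
# query = f"SELECT * FROM users WHERE name = '{user_input}'"

# FIXED: a parameterized query treats the input strictly as data.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(len(rows))  # 0 -- the injection string matches no real user name
```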

🔬 Training Recipe

Based on research from:

  • R2Vul (arXiv:2504.04699) β€” Structured reasoning for vulnerability detection (81.47% F1)
  • MSIVD (arXiv:2406.05892) β€” Multi-task instruction tuning (0.92 F1 on BigVul)
  • SecRepair (arXiv:2401.03374) β€” Combined detection + repair with RL
  • SecureCode β€” QLoRA recipe: r=16, Ξ±=32, lr=2e-4, 3 epochs
  • TitanVul (arXiv:2507.21817) β€” 0.881 OOD accuracy on BenchVul benchmark

Hyperparameters

learning_rate = 2e-4        # LoRA-optimized (10x a typical full fine-tuning rate)
num_train_epochs = 3
per_device_train_batch_size = 2
gradient_accumulation_steps = 8  # Effective batch = 16
max_length = 2048
lr_scheduler = "cosine"
warmup_steps = 100
optimizer = "adamw_torch"
quantization = "4-bit NF4 (double quant)"
lora_rank = 16
lora_alpha = 32
lora_dropout = 0.05
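
Assuming a single GPU and the ~90K-sample training set, the schedule above implies roughly the following step counts (a back-of-envelope sketch, not logged numbers):

```python
# Back-of-envelope training schedule from the hyperparameters above.
samples = 90_000
per_device_batch = 2
grad_accum = 8
epochs = 3

effective_batch = per_device_batch * grad_accum  # 16, as noted in the config
steps_per_epoch = samples // effective_batch     # optimizer steps per pass
total_steps = steps_per_epoch * epochs
print(effective_batch, steps_per_epoch, total_steps)  # 16 5625 16875
```

The 100 warmup steps therefore cover well under 1% of the run before the cosine decay takes over.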

⚠️ Limitations & Ethical Use

  • Not a replacement for professional security audits β€” Use as a screening tool alongside manual review
  • May produce false positives/negatives β€” Always verify findings with static analysis tools (CodeQL, Semgrep)
  • Training data bias β€” Primarily C/C++ and Java; coverage for newer languages (Rust, Go, Kotlin) is limited
  • Zero-day detection β€” The model generalizes from known vulnerability patterns; truly novel attack vectors may not be detected
  • Do not use for malicious purposes β€” This tool is designed for defensive security only

📚 Evaluation

Recommended evaluation benchmarks:

  • BenchVul β€” MITRE Top 25 CWEs, balanced real-world + synthetic
  • SVEN β€” Curated CWE-typed pairs with character-level diffs

πŸƒ Training

To reproduce or fine-tune further:

# Install dependencies
pip install transformers trl torch datasets trackio accelerate peft bitsandbytes

# Run training (requires 24GB+ GPU)
python train.py

See train.py in this repository for the full training script.

📄 License

Apache 2.0

πŸ™ Acknowledgments

Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for jacobmahon/zero-day-exploit-scanner-fixer

Base model

Qwen/Qwen2.5-7B
Adapter
(593)
this model

Datasets used to train jacobmahon/zero-day-exploit-scanner-fixer

Papers for jacobmahon/zero-day-exploit-scanner-fixer