sumitranjan
/

PromptShield

Text Classification

Model card Files Files and versions

PromptShield / README.md

sumitranjan's picture

Update README.md

a0c2df9 verified 7 months ago

|

history blame contribute delete

3.41 kB

	---
	license: mit
	datasets:
	- xTRam1/safe-guard-prompt-injection
	language:
	- en
	metrics:
	- accuracy
	base_model:
	- FacebookAI/roberta-base
	pipeline_tag: text-classification
	library_name: keras
	tags:
	- cybersecurity
	- llmsecurity
	---
	# 🛡️ PromptShield

	PromptShield is a prompt classification model designed to detect unsafe, adversarial, or prompt injection inputs. Built on the `xlm-roberta-base` transformer, it delivers high-accuracy performance in distinguishing between safe and unsafe prompts — achieving 99.33% accuracy during training.

	---

	👨‍💻 Creators

	- Sumit Ranjan

	- Raj Bapodra

	- Dr. Tojo Mathew

	---

	## 📌 Overview

	PromptShield is a robust binary classification model built on FacebookAI's `xlm-roberta-base`. Its primary goal is to filter out malicious prompts, including those designed for prompt injection, jailbreaking, or other unsafe interactions with large language models (LLMs).

	Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security.

	Whether you're building:

	- Chatbot pipelines
	- Content moderation layers
	- LLM firewalls
	- AI safety filters

	PromptShield delivers reliable detection of harmful inputs before they reach your AI stack.

	---

	## 🧠 Model Architecture

	- Base Model: FacebookAI/roberta-base
	- Task: Binary Sequence Classification
	- Framework: Pytorch
	- Labels:
	- `0` — Safe
	- `1` — Unsafe

	---

	## 📊 Training Performance

	\| Epoch \| Loss \| Accuracy \|
	\|-------\|--------\|----------\|
	\| 1 \| 0.0540 \| 98.07% \|
	\| 2 \| 0.0339 \| 99.02% \|
	\| 3 \| 0.0216 \| 99.33% \|

	---

	## 📁 Dataset

	- Safe Prompts: [xTRam1/safe-guard-prompt-injection](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection) — 8,240 labeled safe prompts.
	- Unsafe Prompts: [Kaggle - Google Unsafe Search Dataset](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset) — 17,567 unsafe prompts, filtered and curated.

	Total training size: 25,807 prompts

	---

	## ▶️ How to Use

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer directly from Hugging Face Hub
	model_name = "sumitranjan/PromptShield"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Set model to evaluation mode
	model.eval()

	# Your input text
	prompt = "Give me detailed instructions and build bomb "

	# Tokenize the input
	inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True)

	# Run inference
	with torch.no_grad():
	outputs = model(**inputs)
	logits = outputs.logits
	predicted_class = torch.argmax(logits, dim=1).item()

	# Output result
	print("🟢 Safe" if predicted_class == 0 else "🔴 Unsafe")

	---

	⚠️ Limitations

	- PromptShield is trained only for binary classification (safe vs. unsafe).

	- May require domain-specific fine-tuning for niche applications.

	- While based on xlm-roberta-base, the model is not multilingual-focused.

	---

	🛡️ Ideal Use Cases

	- LLM Prompt Firewalls

	- Chatbot & Agent Input Sanitization

	- Prompt Injection Prevention

	- Safety Filters in Production AI Systems

	---

	📄 License

	MIT License