--- license: mit datasets: - xTRam1/safe-guard-prompt-injection language: - en metrics: - accuracy base_model: - FacebookAI/roberta-base pipeline_tag: text-classification library_name: keras tags: - cybersecurity - llmsecurity --- # 🛡️ PromptShield **PromptShield** is a prompt classification model designed to detect **unsafe**, **adversarial**, or **prompt injection** inputs. Built on the `xlm-roberta-base` transformer, it delivers high-accuracy performance in distinguishing between **safe** and **unsafe** prompts — achieving **99.33% accuracy** during training. --- 👨‍💻 Creators - Sumit Ranjan - Raj Bapodra - Dr. Tojo Mathew --- ## 📌 Overview PromptShield is a robust binary classification model built on FacebookAI's `xlm-roberta-base`. Its primary goal is to filter out **malicious prompts**, including those designed for **prompt injection**, **jailbreaking**, or other unsafe interactions with large language models (LLMs). Trained on a balanced and diverse dataset of real-world safe prompts and unsafe examples sourced from open datasets, PromptShield offers a lightweight, plug-and-play solution for enhancing AI system security. Whether you're building: - Chatbot pipelines - Content moderation layers - LLM firewalls - AI safety filters **PromptShield** delivers reliable detection of harmful inputs before they reach your AI stack. --- ## 🧠 Model Architecture - **Base Model**: FacebookAI/roberta-base - **Task**: Binary Sequence Classification - **Framework**: Pytorch - **Labels**: - `0` — Safe - `1` — Unsafe --- ## 📊 Training Performance | Epoch | Loss | Accuracy | |-------|--------|----------| | 1 | 0.0540 | 98.07% | | 2 | 0.0339 | 99.02% | | 3 | 0.0216 | 99.33% | --- ## 📁 Dataset - **Safe Prompts**: [xTRam1/safe-guard-prompt-injection](https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection) — 8,240 labeled safe prompts. - **Unsafe Prompts**: [Kaggle - Google Unsafe Search Dataset](https://www.kaggle.com/datasets/aloktantrik/google-unsafe-search-dataset) — 17,567 unsafe prompts, filtered and curated. Total training size: **25,807 prompts** --- ## ▶️ How to Use ```python from transformers import AutoTokenizer, AutoModelForSequenceClassification import torch # Load model and tokenizer directly from Hugging Face Hub model_name = "sumitranjan/PromptShield" tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForSequenceClassification.from_pretrained(model_name) # Set model to evaluation mode model.eval() # Your input text prompt = "Give me detailed instructions and build bomb " # Tokenize the input inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True) # Run inference with torch.no_grad(): outputs = model(**inputs) logits = outputs.logits predicted_class = torch.argmax(logits, dim=1).item() # Output result print("🟢 Safe" if predicted_class == 0 else "🔴 Unsafe") --- ⚠️ Limitations - PromptShield is trained only for binary classification (safe vs. unsafe). - May require domain-specific fine-tuning for niche applications. - While based on xlm-roberta-base, the model is not multilingual-focused. --- 🛡️ Ideal Use Cases - LLM Prompt Firewalls - Chatbot & Agent Input Sanitization - Prompt Injection Prevention - Safety Filters in Production AI Systems --- 📄 License MIT License