Cybersecurity NER Model

Named entity recognition (NER) model for the cybersecurity domain. Overall F1: 98.31%.

Model Details

Version: v5
Framework: spaCy 3.8+
Training Date: 2025-12-29
Examples: 1922 (stratified 80/10/10 split)
Backbone: Domain-adapted RoBERTa

Entities (13)

Entity	F1	Examples
CERTIFICATION	100%	CISSP, OSCP, CEH
SECURITY_ROLE	100%	CISO, SOC Analyst
SECURITY_TOOL	100%	Splunk, Metasploit
ATTACK_TECHNIQUE	100%	SQL Injection, XSS
FRAMEWORK	100%	NIST CSF, ISO 27001
THREAT_TYPE	100%	APT, ransomware
AUDIT_TERM	100%	Compliance, Audit
CVE	100%	CVE-2021-44228
SECURITY_DOMAIN	99.10%	Cloud Security
TECHNICAL_SKILL	95.30%	Incident Response
REGULATION	94.44%	GDPR, HIPAA
ACRONYM	88.89%	SIEM, EDR
CONTROL_ID	0%	See hybrid approach

Performance

Metrics:

F1: 98.31%
Precision: 97.92%
Recall: 98.69%
Inference: ~60ms/doc
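
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the three reported figures can be compared against each other. The sketch below uses only the numbers listed above; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, written as fractions.
f1 = f1_score(0.9792, 0.9869)
print(f"F1 recomputed from reported P/R: {f1:.2%}")
```

This gives roughly 98.30%. The 0.01pp gap to the reported 98.31% is within what rounding precision and recall to two decimals can produce.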

v5 changes from v4:

Tuned hyperparameters (dropout 0.25, L2 0.02)
Improved REGULATION by 6.64pp (87.80% to 94.44%) and ACRONYM by 22.22pp (66.67% to 88.89%)
Overall F1 up 0.25pp (98.06% to 98.31%)

CONTROL_ID Handling

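The model scores 0% F1 on CONTROL_ID because the training data contains only 25 examples of this entity.

Solution: a hybrid approach. In production, CONTROL_ID entities are extracted with regular expressions rather than by the model.

Patterns cover control identifiers from ISO 27001, NIST CSF, CIS Controls, SOC 2, and PCI-DSS.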
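
The actual patterns live in the service and are not reproduced here. As a rough illustration of regex-based CONTROL_ID extraction, here is a minimal sketch: every pattern, the `CONTROL_ID_PATTERNS` name, and the `extract_control_ids` helper are assumptions based on how these frameworks commonly write control identifiers, not the production implementation.

```python
import re

# Hypothetical patterns (assumptions, not the service's real ones).
CONTROL_ID_PATTERNS = {
    # Annex A style references such as A.9.2.3 or A.5.15
    "ISO 27001": re.compile(r"\bA\.\d{1,2}(?:\.\d{1,2}){1,2}\b"),
    # Function.Category-Number, such as PR.AC-1 or GV.OC-01
    "NIST CSF": re.compile(r"\b(?:ID|PR|DE|RS|RC|GV)\.[A-Z]{2}-\d{1,2}\b"),
    # "CIS Control 5" or "CIS 5.1"
    "CIS Controls": re.compile(r"\bCIS\s+(?:Control\s+)?\d{1,2}(?:\.\d{1,2})?\b"),
    # Common Criteria references such as CC6.1
    "SOC 2": re.compile(r"\bCC\d\.\d{1,2}\b"),
    # "PCI-DSS Requirement 8.3" or "PCI DSS 3.4.1"
    "PCI-DSS": re.compile(
        r"\bPCI[- ]DSS\s+(?:Req(?:uirement)?\.?\s+)?\d{1,2}(?:\.\d{1,2}){0,2}\b"
    ),
}


def extract_control_ids(text):
    """Return (start, end, matched_text, framework) tuples sorted by position."""
    hits = []
    for framework, pattern in CONTROL_ID_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), match.group(), framework))
    return sorted(hits)


sample = "Access is governed by A.9.2.3 and PR.AC-1; see also CC6.1 and PCI-DSS Requirement 8.3."
for start, end, matched, framework in extract_control_ids(sample):
    print(f"{matched:25} | CONTROL_ID ({framework})")
```

Patterns like these trade recall for precision: identifiers written in an unusual style will be missed, so the list would need tuning against real documents.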

See service implementation for details.
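
One way a hybrid pipeline can combine the two sources is to let regex hits take precedence over any model entity they overlap. The sketch below works on plain `(start_char, end_char, label)` tuples; the `merge_entities` helper and the precedence rule are assumptions for illustration, not a description of the service.

```python
def merge_entities(model_ents, regex_ents):
    """Combine model and regex entities, dropping model entities that overlap a regex hit.

    Both arguments are lists of (start_char, end_char, label) tuples.
    """

    def overlaps(a, b):
        # Half-open character ranges overlap if each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    kept = [e for e in model_ents if not any(overlaps(e, r) for r in regex_ents)]
    return sorted(kept + list(regex_ents))


text = "CISO mapped A.9.2.3 to ISO 27001"
# Hypothetical model output that mislabels part of the control ID.
model_ents = [(0, 4, "SECURITY_ROLE"), (12, 15, "ACRONYM"), (23, 32, "FRAMEWORK")]
regex_ents = [(12, 19, "CONTROL_ID")]

for start, end, label in merge_entities(model_ents, regex_ents):
    print(f"{text[start:end]:20} | {label}")
```

Inside a spaCy pipeline, the merged offsets could be converted back to spans with `doc.char_span(start, end, label=label)` before assigning `doc.ents`.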

Usage

pip install "spacy>=3.7.0" "spacy-transformers>=1.3.0"

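import spacy

# spacy.load accepts an installed package name or a path to a local model directory
nlp = spacy.load("pki/ner-cybersecurity")
doc = nlp("CISO with CISSP, expert in Splunk and ISO 27001")

# Print each entity's text (padded to 20 characters) next to its label
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_}")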

Output:

CISO                 | SECURITY_ROLE
CISSP                | CERTIFICATION
Splunk               | SECURITY_TOOL
ISO 27001            | FRAMEWORK

Use Cases

Job/CV matching
Threat intelligence extraction
Compliance documentation parsing
Security policy analysis
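
To illustrate the job/CV matching use case, one simple option is to compare the sets of entities extracted from each document. The sketch below assumes entities have already been collected as `(text, label)` pairs, for example from `doc.ents`; the `entity_overlap` helper and the choice of Jaccard similarity are illustrative, not part of the model.

```python
def entity_overlap(cv_entities, job_entities):
    """Jaccard similarity between two collections of (text, label) entity pairs."""
    # Lowercase the text so "Splunk" and "splunk" count as the same entity.
    cv = {(text.lower(), label) for text, label in cv_entities}
    job = {(text.lower(), label) for text, label in job_entities}
    if not cv and not job:
        return 0.0
    return len(cv & job) / len(cv | job)


cv_ents = [("CISSP", "CERTIFICATION"), ("Splunk", "SECURITY_TOOL"), ("ISO 27001", "FRAMEWORK")]
job_ents = [("CISSP", "CERTIFICATION"), ("splunk", "SECURITY_TOOL"), ("Metasploit", "SECURITY_TOOL")]

# 2 shared entities out of 4 distinct ones
print(entity_overlap(cv_ents, job_ents))
```

A real matcher would likely weight labels differently (a required CERTIFICATION matters more than an ACRONYM), but set overlap is a reasonable baseline.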

Training Config

max_steps = 8000
dropout = 0.25
L2 = 0.02
learning_rate = 0.00003
hidden_width = 128
maxout_pieces = 3
batch_size = 128
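
For anyone reproducing the run, spaCy's `train` command accepts config overrides as `--section.key value` arguments. The sketch below turns the values above into such arguments. The dotted paths are assumptions about where each setting lives in a spaCy 3 config (the card does not include the full file), and `to_cli_args` is a hypothetical helper.

```python
# Values from the training config; the dotted config paths are assumptions
# and should be checked against the real config file.
overrides = {
    "training.max_steps": 8000,
    "training.dropout": 0.25,
    "training.optimizer.L2": 0.02,
    # Applies only if the config stores a constant rate here rather than a schedule.
    "training.optimizer.learn_rate": 0.00003,
    "components.ner.model.hidden_width": 128,
    "components.ner.model.maxout_pieces": 3,
    # Assumed to be the pipeline batch size; it could also refer to the training batcher.
    "nlp.batch_size": 128,
}


def to_cli_args(settings):
    """Flatten {dotted.path: value} into ['--dotted.path', 'value', ...]."""
    args = []
    for key, value in settings.items():
        args.extend([f"--{key}", str(value)])
    return args


print("python -m spacy train config.cfg " + " ".join(to_cli_args(overrides)))
```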

Limitations

ACRONYM: Lower F1 (88.89%), attributed to limited training examples (46)
CONTROL_ID: 0% model F1; requires the hybrid regex approach
Domain-specific: Optimized for cybersecurity text
Context-dependent ambiguity on some terms

License

MIT

Version History

Version	Date	F1	Examples	Notes
v5	2025-12-29	98.31%	1922	Hyperparameter tuning
v4	2025-12-29	98.06%	1922	Stratified split, domain RoBERTa
v3	2025-01	69.4%	1000	spaCy 3.x migration
v2	2024-12	99.5%*	1805	spaCy 2.x (*train accuracy)

Contact

Issues: Model repository

Evaluation results (self-reported)

F1: 0.983
Precision: 0.979
Recall: 0.987