NERPA - Fine-Tuned GLiNER2 for PII Anonymisation

A fine-tuned GLiNER2 Large (340M params) model trained to detect Personally Identifiable Information (PII) in text. Built as a flexible, self-hosted replacement for AWS Comprehend at Overmind.

Why NERPA?

AWS Comprehend is a solid NER service, but it's a black box. The specific problem we hit was date granularity: Comprehend labels both a Date of Birth and an Appointment Date as DATE, but for PII anonymisation these require very different treatment. A DOB must be redacted; an appointment date is often essential debugging context.

GLiNER2 is a bi-encoder model that takes both text and entity label descriptions as input, enabling zero-shot entity detection for arbitrary types. We fine-tuned GLiNER2 Large to:

  1. Distinguish fine-grained date types (DATE_OF_BIRTH vs DATE_TIME)
  2. Exceed AWS Comprehend accuracy on our PII benchmark

Model                           Micro-Precision   Micro-Recall
AWS Comprehend                  0.90              0.94
GLiNER2 Large (off-the-shelf)   0.84              0.89
NERPA (this model)              0.93              0.90
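
For reference, micro-averaged precision and recall pool true positives, false positives, and false negatives across all entity types before dividing. The sketch below assumes exact-match scoring on (start, end, type) triples; that matching rule is an assumption and not necessarily the one used for the benchmark above. The function name `micro_precision_recall` and the sample spans are hypothetical.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, entity type)


def micro_precision_recall(
    gold_docs: Iterable[Set[Span]],
    pred_docs: Iterable[Set[Span]],
) -> Tuple[float, float]:
    """Pool counts over all documents and entity types, then divide once."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        tp += len(gold & pred)  # predicted spans that exactly match a gold span
        fp += len(pred - gold)  # predicted spans with no gold match
        fn += len(gold - pred)  # gold spans the model missed or mislabelled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: one date labelled with the wrong type, one missed email.
gold = [{(5, 15, "PERSON_NAME"), (72, 82, "DATE_OF_BIRTH")}, {(0, 13, "EMAIL")}]
pred = [{(5, 15, "PERSON_NAME"), (72, 82, "DATE_TIME")}, set()]
precision, recall = micro_precision_recall(gold, pred)
```

Under exact matching, a DATE_OF_BIRTH span predicted as DATE_TIME counts as both a false positive and a false negative, which is why label confusion between date types hurts both metrics.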

Fine-Tuning Details

  • Base model: fastino/gliner2-large-v1 (DeBERTa v3 Large backbone, 340M params)
  • Training data: 1,210 synthetic snippets generated with Gemini 3 Pro + Python Faker, each containing 2–4 PII entities
  • Eval data: 300 held-out snippets (no template overlap with training)
  • Strategy: Full weight fine-tuning with differential learning rates:
    • Encoder (DeBERTa v3): 1e-7
    • GLiNER-specific layers: 1e-6
  • Batch size: 64
  • Convergence: 175 steps
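
Differential learning rates are usually expressed as optimizer parameter groups. The sketch below shows one way to build such groups in plain Python; it is not the training script used for NERPA. The helper `build_param_groups` and the `encoder.` name prefix are hypothetical, and the real module names depend on the model implementation.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    encoder_prefix: str = "encoder.",  # assumption: real prefix depends on the model
    encoder_lr: float = 1e-7,
    head_lr: float = 1e-6,
) -> List[Dict[str, Any]]:
    """Split parameters into an encoder group and a task-layer group."""
    encoder: List[Any] = []
    head: List[Any] = []
    for name, param in named_params:
        (encoder if name.startswith(encoder_prefix) else head).append(param)
    return [
        {"params": encoder, "lr": encoder_lr},
        {"params": head, "lr": head_lr},
    ]


# Stand-in (name, parameter) pairs. With PyTorch these would come from
# model.named_parameters(), and the groups could be passed to an optimizer
# such as torch.optim.AdamW(groups).
demo = [("encoder.layer.0.weight", "w0"), ("span_head.weight", "w1")]
groups = build_param_groups(demo)
```

Keeping the encoder rate an order of magnitude below the task layers limits drift in the pretrained backbone while the smaller GLiNER-specific layers adapt.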

The synthetic data approach effectively distils the "knowledge" of a large language model into a small, fast specialist model, which we call indirect distillation.

Supported Entity Types

Entity                 Description
PERSON_NAME            Person name
DATE_OF_BIRTH          Date of birth
DATE_TIME              Generic date and time
EMAIL                  Email address
PHONE                  Phone numbers
LOCATION               Address, city, country, postcode, street
AGE                    Age of a person
BUSINESS_NAME          Business name
USERNAME               Username
URL                    Any URL
BANK_ACCOUNT_DETAILS   IBAN, SWIFT, routing numbers, etc.
CARD_DETAILS           Card number, CVV, expiration
DIGITAL_KEYS           Passwords, PINs, API keys
PERSONAL_ID_NUMBERS    Passport, driving licence, tax IDs
TECHNICAL_ID_NUMBERS   IP/MAC addresses, serial numbers
VEHICLE_ID_NUMBERS     License plates, VINs

Since NERPA is built on GLiNER2 (a zero-shot bi-encoder), it is not limited to the entities above. You can pass any custom entity types alongside the built-in ones; the fine-tuning does not reduce the model's ability to detect arbitrary categories. See Custom entities below.

Quick Start

Install dependencies

pip install gliner2 torch

Anonymise text (CLI)

# Inline text
python anonymise.py "Dear John Smith, born 15/03/1990. Contact: john@acme.com"

# From file
python anonymise.py --file input.txt --output anonymised.txt

# Show detected entities
python anonymise.py --show-entities "Call me at 020-7946-0958, my IBAN is GB29NWBK60161331926819."

Use in Python

from anonymise import load_model, detect_entities, anonymise

model = load_model(".")  # path to this repo

text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified. "
    "Please contact support at help@acme.com or call 020-7946-0958. "
)

entities = detect_entities(model, text)
print(anonymise(text, entities))

Output:

Dear [PERSON_NAME], your appointment is on [DATE_TIME].
Your date of birth ([DATE_OF_BIRTH]) has been verified.
Please contact support at [EMAIL] or call [PHONE].

Entity detection only

If you just need the raw entity offsets (e.g. for your own replacement logic):

entities = detect_entities(model, text)
for e in entities:
    print(f'{e["type"]:25s} [{e["start"]}:{e["end"]}] score={e["score"]:.2f}  "{text[e["start"]:e["end"]]}"')

Output:

PERSON_NAME               [5:15] score=1.00  "John Smith"
DATE_TIME                 [40:50] score=1.00  "2025-03-15"
DATE_OF_BIRTH             [72:82] score=1.00  "15/03/1990"
EMAIL                     [129:142] score=1.00  "help@acme.com"
PHONE                     [151:164] score=1.00  "020-7946-0958"
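
As an illustration of custom replacement logic, the sketch below redacts every detected span except those of selected types, for example keeping appointment dates while masking dates of birth. It relies only on the "type", "start", and "end" keys shown above; the helper `redact_selected` is hypothetical and not part of anonymise.py, and the entity list is hard-coded so the snippet runs without the model.

```python
from typing import Any, Dict, List, Set

Entity = Dict[str, Any]  # keys as shown above: "type", "start", "end", "score"


def redact_selected(text: str, entities: List[Entity], keep: Set[str]) -> str:
    """Replace every detected span except those whose type is in `keep`."""
    # Work right-to-left so earlier offsets stay valid after each substitution.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        if e["type"] in keep:
            continue
        text = text[: e["start"]] + f'[{e["type"]}]' + text[e["end"] :]
    return text


text = (
    "Dear John Smith, your appointment is on 2025-03-15. "
    "Your date of birth (15/03/1990) has been verified."
)
entities = [
    {"type": "PERSON_NAME", "start": 5, "end": 15, "score": 1.0},
    {"type": "DATE_TIME", "start": 40, "end": 50, "score": 1.0},
    {"type": "DATE_OF_BIRTH", "start": 72, "end": 82, "score": 1.0},
]
result = redact_selected(text, entities, keep={"DATE_TIME"})
```

Here `result` keeps "2025-03-15" in place while the name and date of birth become placeholders.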

Detect a subset of entities

entities = detect_entities(model, text, entities={
    "PERSON_NAME": "Person name",
    "EMAIL": "Email",
})

Custom entities

You can detect additional entity types beyond the built-in PII set. Because the model is zero-shot, any label and description pair can be supplied; custom entities are detected and anonymised alongside the fine-tuned ones.

CLI (use --extra-entities / -e):

python anonymise.py -e PRODUCT="Product name" -e SKILL="Professional skill" \
    "John Smith is a senior Python developer who bought a MacBook Pro."

Output:

[PERSON_NAME] is a senior [SKILL] developer who bought a [PRODUCT].

Python:

from anonymise import load_model, detect_entities, anonymise, PII_ENTITIES

model = load_model(".")

custom_entities = {
    **PII_ENTITIES,
    "PRODUCT": "Product name",
    "SKILL": "Professional skill",
}

text = "John Smith is a senior Python developer who bought a MacBook Pro."
entities = detect_entities(model, text, entities=custom_entities)
print(anonymise(text, entities))

How It Works

The inference pipeline in anonymise.py:

  1. Chunking: Long texts are split into 3000-character chunks with a 100-character overlap to stay within the model's context window. The chunk size is adjustable because DeBERTa-v3 (the underlying encoder) uses relative position encoding; we found that 3000 characters works as well as smaller sizes.
  2. Batch prediction: Chunks are fed through GLiNER2.batch_extract_entities() with include_spans=True to get character-level offsets.
  3. Date disambiguation: Both DATE_TIME and DATE_OF_BIRTH are always detected together so the model can choose the best label per span.
  4. De-duplication: Overlapping detections from chunk boundaries are merged, keeping the highest-confidence label for each position.
  5. Replacement: Detected spans are replaced right-to-left with [ENTITY_TYPE] placeholders.
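
The chunking and de-duplication steps above can be sketched in plain Python. This is a minimal illustration, not the code in anonymise.py: the helpers `chunk_text`, `to_global`, and `deduplicate` are hypothetical, and the greedy "highest score wins among overlapping spans" rule is one possible reading of step 4.

```python
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]  # keys: "type", "start", "end", "score"


def chunk_text(text: str, size: int = 3000, overlap: int = 100) -> List[Tuple[int, str]]:
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks: List[Tuple[int, str]] = []
    start = 0
    while True:
        chunks.append((start, text[start : start + size]))
        if start + size >= len(text):
            return chunks
        start += size - overlap


def to_global(offset: int, entities: List[Entity]) -> List[Entity]:
    """Shift chunk-local offsets back into whole-document coordinates."""
    return [{**e, "start": e["start"] + offset, "end": e["end"] + offset} for e in entities]


def deduplicate(entities: List[Entity]) -> List[Entity]:
    """Greedy merge: among overlapping spans, keep the highest-scoring one."""
    kept: List[Entity] = []
    for e in sorted(entities, key=lambda e: e["score"], reverse=True):
        if all(e["end"] <= k["start"] or e["start"] >= k["end"] for k in kept):
            kept.append(e)
    return sorted(kept, key=lambda e: e["start"])


# Small sizes so the behaviour is easy to follow by hand.
chunks = chunk_text("a" * 25, size=10, overlap=2)
offsets = [off for off, _ in chunks]

# The same date seen in two overlapping chunks with different labels.
merged = deduplicate(
    to_global(0, [{"type": "DATE_TIME", "start": 8, "end": 10, "score": 0.6}])
    + to_global(8, [{"type": "DATE_OF_BIRTH", "start": 0, "end": 10, "score": 0.9}])
)
```

The overlap exists so that an entity cut off at the end of one chunk appears whole in the next; de-duplication then discards the truncated, lower-confidence copy.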

Notes

  • Confidence threshold: Default is 0.25. The model tends to be conservative, so a lower threshold works well for high recall.
  • GLiNER2 version: Requires gliner2>=1.2.4. Earlier versions had a bug where entity character offsets mapped to token positions instead of character positions; this is fixed in 1.2.4+.
  • Device: Automatically uses the first available of CUDA, MPS, or CPU, in that order.
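
The threshold and device notes can be illustrated with a short sketch. The helpers `filter_by_score` and `pick_device` are hypothetical and are not the functions in anonymise.py; whether the library treats the threshold as inclusive is an assumption here. `pick_device` falls back to "cpu" when PyTorch is not installed so the snippet runs anywhere.

```python
from typing import Any, Dict, List


def filter_by_score(entities: List[Dict[str, Any]], threshold: float = 0.25) -> List[Dict[str, Any]]:
    """Drop detections whose confidence is below `threshold` (inclusive bound assumed)."""
    return [e for e in entities if e["score"] >= threshold]


def pick_device() -> str:
    """Prefer CUDA, then Apple MPS, then CPU; fall back to CPU without PyTorch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


# Hypothetical detections: one confident, one below the default threshold.
detections = [{"type": "EMAIL", "score": 0.91}, {"type": "AGE", "score": 0.18}]
high_recall = filter_by_score(detections)
device = pick_device()
```

Raising the threshold trades recall for precision, which may suit use cases where over-redaction is more costly than a missed entity.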

Acknowledgements

This model is a fine-tuned version of GLiNER2 Large by Fastino AI. We thank the GLiNER2 authors for making their model and library openly available.

Citation

If you use NERPA, please cite both this model and the original GLiNER2 paper:

@misc{nerpa2025,
  title={NERPA: Fine-Tuned GLiNER2 for PII Anonymisation},
  author={Akhat Rakishev},
  year={2025},
  url={https://huggingface.co/OvermindLab/nerpa},
}

@misc{zaratiana2025gliner2efficientmultitaskinformation,
  title={GLiNER2: An Efficient Multi-Task Information Extraction System with Schema-Driven Interface},
  author={Urchade Zaratiana and Gil Pasternak and Oliver Boyd and George Hurn-Maloney and Ash Lewis},
  year={2025},
  eprint={2507.18546},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.18546},
}

Built by Akhat Rakishev at Overmind.

Overmind is infrastructure for end-to-end agent optimisation. Learn more at overmindlab.ai.
