Paper: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (arXiv:1910.01108)
This model is a fine-tuned DistilBERT (distilbert-base-uncased) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) in English text.
It was trained on a custom dataset of 4138 samples with 18 entity classes relevant to compliance, finance, and legal text redaction.
The model identifies PII entities such as names, email addresses, phone numbers, financial amounts, dates, and credentials, making it suitable for document redaction and compliance automation (GDPR, HIPAA, PCI-DSS).
The model supports 18 entity classes (plus O for non-entity tokens):
| Entity | Description |
|---|---|
| AMOUNT | Monetary values, amounts, percentages |
| COUNTRY | Country names |
| CREDENTIALS | Passwords, access keys, or secret tokens |
| DATE | Calendar dates |
| EMAIL | Email addresses |
| EXPIRYDATE | Expiry dates (e.g., card expiry) |
| FIRSTNAME | First names |
| IPADDRESS | IPv4 or IPv6 addresses |
| LASTNAME | Last names |
| LOCATION | General locations (cities, regions, etc.) |
| MACADDRESS | MAC addresses |
| NUMBER | Generic numeric identifiers |
| ORGANIZATION | Company or institution names |
| PERCENT | Percentages |
| PHONE | Phone numbers |
| TIME | Time expressions (HH:MM, AM/PM, etc.) |
| UID | Unique IDs (customer IDs, transaction IDs, etc.) |
| ZIPCODE | Postal/ZIP codes |
Usage with the transformers pipeline:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned PII checkpoint and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("narayan214/distilbert-pii-before-v2")
model = AutoModelForTokenClassification.from_pretrained("narayan214/distilbert-pii-before-v2")

# "simple" aggregation merges word pieces into whole entity spans
pii_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "John Doe's email is john.doe@example.com and his phone number is +1-202-555-0173."
print(pii_pipeline(text))
```
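Because the aggregated pipeline output includes character offsets (`start`, `end`) and an `entity_group` for each detected span, the predictions can be applied directly to redaction. A minimal sketch continuing from the example above; the `redact` helper and the `[ENTITY]` placeholder format are illustrative, not part of the model:

```python
def redact(text: str, entities: list[dict]) -> str:
    """Replace each detected PII span with a placeholder like [EMAIL]."""
    # Work from the end of the string so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

print(redact(text, pii_pipeline(text)))
# e.g. "[FIRSTNAME] [LASTNAME]'s email is [EMAIL] and his phone number is [PHONE]."
```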
If you use this model, please cite the original DistilBERT paper:
BibTeX:
```bibtex
@article{sanh2019distilbert,
  title={DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter},
  author={Sanh, Victor and Debut, Lysandre and Chaumond, Julien and Wolf, Thomas},
  journal={arXiv preprint arXiv:1910.01108},
  year={2019}
}
```
Base model: distilbert/distilbert-base-uncased