ento-label-deberta

DeBERTa-v3 models fine-tuned for named entity recognition (NER) on insect collection labels. Given a raw label string, the model extracts semantic fields as verbatim character spans.

Three sizes are included in this repo: small, base, and large (subdirectories of the same name). ONNX exports are in onnx/small, onnx/base, and onnx/large.

Entity types

| Label | Description |
| --- | --- |
| country | Country name |
| state | State, province, or region |
| verbatim_locality | Locality description |
| verbatim_date | Collection date as written |
| verbatim_elevation | Elevation as written |
| verbatim_collectors | Collector name(s) |
| verbatim_habitat | Habitat description |
| verbatim_method | Collection method |
| verbatim_latitude | Latitude as written |
| verbatim_longitude | Longitude as written |

Evaluation results (macro F1 per entity)

| Entity | small | base | large |
| --- | --- | --- | --- |
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| macro avg | 0.8360 | 0.8493 | 0.8377 |

Usage (PyTorch)

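from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Each size lives in a subdirectory of the repo, so pass it as `subfolder`
# instead of appending it to the repo id (a three-part id is not a valid Hub id).
repo = "SpeciesFileGroup/ento-label-deberta"
size = "base"  # or "small" / "large"

tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=size)
model = AutoModelForTokenClassification.from_pretrained(repo, subfolder=size)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word tokens into whole spans
)

results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    print(r["entity_group"], repr(r["word"]))
# country      'Sudan'
# state        'Blue Nile'
# verbatim_locality  'Abu Hashim'
# verbatim_date      '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'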
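
Because the fields are meant to be verbatim character spans, it can be safer to slice the original string with the start and end offsets that the pipeline returns than to rely on the detokenized word. The following is a minimal sketch, assuming each result dict carries the usual entity_group, start, and end keys. spans_to_record is a hypothetical helper, and the offsets in the usage lines are hand-written stand-ins, not real model output.

```python
from collections import defaultdict


def spans_to_record(text, results):
    """Group pipeline results into {entity_type: [verbatim substrings]}.

    `results` is a list of dicts with `entity_group`, `start`, and `end`,
    as returned by a token-classification pipeline with aggregation enabled.
    """
    record = defaultdict(list)
    for r in sorted(results, key=lambda r: r["start"]):
        # Slice the source string so the value is exactly what the label says;
        # strip whitespace that some tokenizers include in the offsets.
        span = text[r["start"]:r["end"]].strip()
        if span:
            record[r["entity_group"]].append(span)
    return dict(record)


label = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"
# Hand-written stand-in for `ner(label)`; offsets are illustrative only.
fake_results = [
    {"entity_group": "country", "start": 0, "end": 5},
    {"entity_group": "state", "start": 7, "end": 16},
    {"entity_group": "verbatim_locality", "start": 18, "end": 28},
]
record = spans_to_record(label, fake_results)
print(record)
```

Values are kept as lists because a label can contain more than one span of the same type, for example two collectors.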

Usage (ONNX / hugot)

ONNX models are compatible with hugot and ONNX Runtime. Load from onnx/small, onnx/base, or onnx/large.
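
Outside the transformers pipeline, a token-classification ONNX graph typically returns raw per-token logits, so the caller has to turn them into character spans. The sketch below shows one way to do that in plain Python. It assumes the tokenizer supplies a (start, end) character offset per token and that labels use a B-/I- prefix scheme; both are assumptions, and the real label mapping should be read from the id2label field of the checkpoint's config.json. The names decode_spans and ID2LABEL are hypothetical.

```python
# Hypothetical label map for illustration; load the real one from config.json.
ID2LABEL = {0: "O", 1: "B-country", 2: "I-country", 3: "B-state", 4: "I-state"}


def decode_spans(text, logits, offsets, id2label=ID2LABEL):
    """Merge per-token predictions into (entity_type, start, end, substring).

    logits:  one list of scores per token (a row of the model output).
    offsets: one (start, end) character pair per token; special tokens
             are expected to carry (0, 0) and are skipped.
    """
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for scores, (start, end) in zip(logits, offsets):
        if start == end:  # special token such as [CLS] or [SEP]
            continue
        # Arg-max over the label scores for this token.
        label = id2label[max(range(len(scores)), key=scores.__getitem__)]
        entity = label.split("-", 1)[1] if "-" in label else None
        if entity is None:
            # "O" closes any open span.
            if current:
                spans.append(current)
            current = None
        elif current and current[0] == entity and not label.startswith("B-"):
            current[2] = end  # extend the open span
        else:
            if current:
                spans.append(current)
            current = [entity, start, end]
    if current:
        spans.append(current)
    return [(e, s, t, text[s:t].strip()) for e, s, t in spans]


text = "Sudan, Blue Nile"
offsets = [(0, 0), (0, 5), (5, 6), (6, 11), (11, 16), (0, 0)]
logits = [
    [9, 0, 0, 0, 0],  # [CLS]
    [0, 9, 0, 0, 0],  # "Sudan" -> B-country
    [9, 0, 0, 0, 0],  # ","     -> O
    [0, 0, 0, 9, 0],  # " Blue" -> B-state
    [0, 0, 0, 0, 9],  # " Nile" -> I-state
    [9, 0, 0, 0, 0],  # [SEP]
]
spans = decode_spans(text, logits, offsets)
print(spans)
```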

Training

Fine-tuned for 5 epochs with the Hugging Face Trainer. Hyperparameters:

| Parameter | small / base | large |
| --- | --- | --- |
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |
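
For anyone reproducing the run, the table maps fairly directly onto keyword arguments of transformers.TrainingArguments. The sketch below only builds the argument dict so it runs without any dependencies. training_kwargs is a hypothetical helper, and treating the batch size as per-device is an assumption, since the table does not say whether 16 is per device or global.

```python
# Values from the hyperparameter table, keyed by TrainingArguments names.
SHARED = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,  # assumption: table value is per device
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.06,
    "weight_decay": 0.01,
}
LEARNING_RATE = {"small": 5e-6, "base": 5e-6, "large": 2e-6}
MAX_SEQ_LENGTH = 128  # applied at tokenization time, not by the Trainer


def training_kwargs(size):
    """Return keyword arguments for transformers.TrainingArguments."""
    if size not in LEARNING_RATE:
        raise ValueError(f"unknown size: {size!r}")
    return {**SHARED, "learning_rate": LEARNING_RATE[size]}


kwargs = training_kwargs("large")
print(kwargs)
# Typical use (requires transformers; output_dir is a placeholder):
# args = TrainingArguments(output_dir="out", **training_kwargs("large"))
```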

Training data: ~22 000 insect collection label strings with character-span annotations for the 10 entity types above.
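
To get a feel for what the linear scheduler and the 0.06 warmup ratio mean in practice, the sketch below works through the step counts and the learning-rate curve. It assumes that all ~22,000 strings are training examples and that training used a single device without gradient accumulation; none of this is stated above, so the numbers are rough estimates. linear_schedule is a hypothetical helper that reproduces the usual shape of a linear warmup followed by linear decay.

```python
import math


def linear_schedule(step, peak_lr, total_steps, warmup_ratio=0.06):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to the peak over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from the peak to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumptions: ~22,000 training examples, batch size 16, 5 epochs.
steps_per_epoch = math.ceil(22_000 / 16)
total_steps = steps_per_epoch * 5
warmup_steps = math.ceil(total_steps * 0.06)
print(steps_per_epoch, total_steps, warmup_steps)

# Learning rate for the small/base setting at a few points in training.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule(step, 5e-6, total_steps))
```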
