ento-label-deberta
DeBERTa-v3 models fine-tuned for named-entity recognition (NER) on insect collection labels. Given a raw label string, the model extracts semantic fields as verbatim character spans.
Three sizes are included in this repo: small, base, and large
(subdirectories of the same name). ONNX exports are in onnx/small,
onnx/base, and onnx/large.
Entity types
| Label | Description |
|---|---|
| country | Country name |
| state | State, province, or region |
| verbatim_locality | Locality description |
| verbatim_date | Collection date as written |
| verbatim_elevation | Elevation as written |
| verbatim_collectors | Collector name(s) |
| verbatim_habitat | Habitat description |
| verbatim_method | Collection method |
| verbatim_latitude | Latitude as written |
| verbatim_longitude | Longitude as written |
Evaluation results (F1 per entity, with macro average)
| Entity | small | base | large |
|---|---|---|---|
| country | 0.9695 | 0.9749 | 0.9751 |
| state | 0.9046 | 0.9220 | 0.9212 |
| verbatim_locality | 0.8282 | 0.8499 | 0.8573 |
| verbatim_date | 0.9673 | 0.9700 | 0.9693 |
| verbatim_elevation | 0.9722 | 0.9742 | 0.9739 |
| verbatim_collectors | 0.4867 | 0.5393 | 0.5311 |
| verbatim_habitat | 0.7485 | 0.7751 | 0.7930 |
| verbatim_method | 0.9123 | 0.9205 | 0.9080 |
| verbatim_latitude | 0.7154 | 0.7145 | 0.6512 |
| verbatim_longitude | 0.8552 | 0.8528 | 0.7969 |
| macro avg | 0.8360 | 0.8493 | 0.8377 |
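The `macro avg` row is consistent with an unweighted mean of the ten per-entity F1 scores in each column. A minimal check in Python, with the values copied from the table above (the variable and function names are illustrative only):

```python
# Per-entity F1 scores copied from the table above, in row order
# (country, state, ..., verbatim_longitude).
f1 = {
    "small": [0.9695, 0.9046, 0.8282, 0.9673, 0.9722,
              0.4867, 0.7485, 0.9123, 0.7154, 0.8552],
    "base": [0.9749, 0.9220, 0.8499, 0.9700, 0.9742,
             0.5393, 0.7751, 0.9205, 0.7145, 0.8528],
    "large": [0.9751, 0.9212, 0.8573, 0.9693, 0.9739,
              0.5311, 0.7930, 0.9080, 0.6512, 0.7969],
}


def macro_average(scores):
    """Unweighted mean: every entity type counts equally, however often it occurs."""
    return sum(scores) / len(scores)


macro = {size: macro_average(scores) for size, scores in f1.items()}
for size, value in macro.items():
    print(f"{size}: {value:.4f}")
# small: 0.8360
# base: 0.8493
# large: 0.8377
```

Because the average is unweighted, the low verbatim_collectors score lowers it as much as any other entity would. By this measure base scores slightly higher than large, mostly because large is weaker on latitude and longitude.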
Usage (PyTorch)
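from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Each size lives in a subdirectory of the repo, so it is selected with
# `subfolder`; a Hub repo id cannot itself contain a second slash.
# This assumes the tokenizer files are stored alongside the weights.
repo = "SpeciesFileGroup/ento-label-deberta"
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="base")
model = AutoModelForTokenClassification.from_pretrained(repo, subfolder="base")

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)
results = ner("Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori")
for r in results:
    # entity_group is the field label; word is the extracted text
    print(r["entity_group"], repr(r["word"]))
# country 'Sudan'
# state 'Blue Nile'
# verbatim_locality 'Abu Hashim'
# verbatim_date '23-24.XI.1962'
# verbatim_collectors 'Linnavuori'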
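Because the fields are defined as verbatim character spans, it may be safer to slice the original string with the `start` and `end` offsets that the pipeline returns for each aggregated entity than to rely on `word`, which is decoded from tokens and can differ in whitespace. A minimal sketch follows. `spans_to_record` is a hypothetical helper, not part of this repo, and the sample entities are reconstructed from the example output above rather than produced by running the model:

```python
from collections import defaultdict


def spans_to_record(text, entities):
    """Group pipeline-style entity dicts into {label: [verbatim substrings]}."""
    record = defaultdict(list)
    for ent in sorted(entities, key=lambda e: e["start"]):
        # Slice the raw label so the value is exactly what was written.
        record[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(record)


label = "Sudan, Blue Nile: Abu Hashim, 23-24.XI.1962, coll. Linnavuori"

# Stand-in for `ner(label)`: offsets are derived from the documented output
# so that this sketch runs without downloading the model.
fields = [
    ("country", "Sudan"),
    ("state", "Blue Nile"),
    ("verbatim_locality", "Abu Hashim"),
    ("verbatim_date", "23-24.XI.1962"),
    ("verbatim_collectors", "Linnavuori"),
]
sample = []
for group, value in fields:
    start = label.index(value)
    sample.append({"entity_group": group, "start": start, "end": start + len(value)})

record = spans_to_record(label, sample)
print(record["verbatim_date"])
```

Values are kept as lists because a label can contain more than one span of the same type, for example two collectors.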
Usage (ONNX / hugot)
The ONNX exports are compatible with hugot and ONNX Runtime. Load the
model from onnx/small, onnx/base, or onnx/large.
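As a rough illustration, one way to run an export with ONNX Runtime in Python is sketched below. Several details are assumptions that this card does not confirm: the export file name (`model.onnx`), that the tokenizer and config can be loaded from the matching PyTorch subdirectory, and that the first model output holds the token logits. `onnx_path_in_repo` and `tag_label` are hypothetical helper names.

```python
SIZES = ("small", "base", "large")


def onnx_path_in_repo(size, filename="model.onnx"):
    """Repo-relative path of an export; `filename` is an assumed default."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    return f"onnx/{size}/{filename}"


def tag_label(text, size="base", repo="SpeciesFileGroup/ento-label-deberta"):
    """Return (token, predicted label) pairs for one label string."""
    # Imported lazily so the path helper above works without these packages.
    import numpy as np
    import onnxruntime as ort
    from huggingface_hub import hf_hub_download
    from transformers import AutoConfig, AutoTokenizer

    # Assumption: tokenizer and config files sit in the PyTorch subdirectory.
    tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=size)
    id2label = AutoConfig.from_pretrained(repo, subfolder=size).id2label

    model_file = hf_hub_download(repo_id=repo, filename=onnx_path_in_repo(size))
    session = ort.InferenceSession(model_file, providers=["CPUExecutionProvider"])

    # 128 matches the maximum sequence length used in training.
    enc = tokenizer(text, return_tensors="np", truncation=True, max_length=128)
    wanted = {inp.name for inp in session.get_inputs()}
    feeds = {k: v.astype(np.int64) for k, v in enc.items() if k in wanted}

    # Assumption: output 0 is the logits, shaped (1, seq_len, num_labels).
    logits = session.run(None, feeds)[0]
    pred_ids = logits[0].argmax(axis=-1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return [(tok, id2label[int(i)]) for tok, i in zip(tokens, pred_ids)]
```

Calling `tag_label("Sudan, Blue Nile: Abu Hashim")` downloads the files on first use and returns per-token labels. Merging them into character spans is left out of this sketch.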
Training
Each model was fine-tuned for 5 epochs with the Hugging Face Trainer, using the following hyperparameters:
| Parameter | small / base | large |
|---|---|---|
| Learning rate | 5e-6 | 2e-6 |
| Batch size | 16 | 16 |
| LR scheduler | linear | linear |
| Warmup ratio | 0.06 | 0.06 |
| Weight decay | 0.01 | 0.01 |
| Max seq length | 128 | 128 |
Training data: ~22 000 insect collection label strings with character-span annotations for the 10 entity types above.
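For readers who want to reproduce a similar setup, the table above maps onto `transformers.TrainingArguments` keyword arguments roughly as follows. This is a sketch, not the authors' training script: `training_kwargs` is a hypothetical helper, and treating the batch size as per-device is an assumption.

```python
def training_kwargs(size):
    """Keyword arguments for transformers.TrainingArguments, per the table above."""
    if size not in ("small", "base", "large"):
        raise ValueError(f"unknown model size: {size!r}")
    return {
        "num_train_epochs": 5,
        # The large model uses a lower learning rate than small and base.
        "learning_rate": 2e-6 if size == "large" else 5e-6,
        # Assumption: the table's batch size is per device, not a global total.
        "per_device_train_batch_size": 16,
        "lr_scheduler_type": "linear",
        "warmup_ratio": 0.06,
        "weight_decay": 0.01,
    }


# The maximum sequence length is applied at tokenization time,
# not through TrainingArguments.
MAX_SEQ_LENGTH = 128

print(training_kwargs("large"))
```

These would be passed as `TrainingArguments(output_dir="out", **training_kwargs("base"))`, where `output_dir` is a placeholder, and the length limit as `tokenizer(text, truncation=True, max_length=MAX_SEQ_LENGTH)`.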
Model tree for SpeciesFileGroup/ento-label-deberta
Base model
microsoft/deberta-v3-base