tanaos-NER-v1: A small but performant Named Entity Recognition model
This model was created by Tanaos with the Artifex Python library.
This is a multilingual Named Entity Recognition model (16+ languages) based on FacebookAI/roberta-base and fine-tuned on a synthetic dataset. It recognizes entities in text and classifies them into the following 14 categories:
| Entity | Description |
|---|---|
| PERSON | Individual people, fictional characters |
| ORG | Companies, institutions, agencies |
| LOCATION | Geographical areas |
| DATE | Absolute or relative dates, including years, months and/or days |
| TIME | Specific time of the day |
| PERCENT | Percentage expressions |
| NUMBER | Numeric measurements or expressions |
| FACILITY | Buildings, airports, highways, etc. |
| PRODUCT | Objects, vehicles, food, etc. bearing a specific name |
| WORK_OF_ART | Titles of creative works |
| LANGUAGE | Natural or programming languages |
| NORP | National, religious or political groups |
| ADDRESS | Full addresses |
| PHONE_NUMBER | Telephone numbers |
These categories were chosen to cover common named entity types that are useful regardless of application domain, so that the model can serve as a general-purpose Named Entity Recognition model across industries and use cases.
How to Use
Via the Artifex library (pip install artifex)
from artifex import Artifex
ner = Artifex().named_entity_recognition
print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]
Via the Transformers library
from transformers import pipeline
ner = pipeline(
task="token-classification",
model="tanaos/tanaos-NER-v1",
aggregation_strategy="first"
)
print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]
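The sample output above shows that the `word` field can carry a leading space (`' Barcelona'`) or trailing punctuation (`' 15:45.'`), and that scores are `np.float32` values. The following is a minimal post-processing sketch, not part of the model or of either library: the helper name `clean_entities` and the set of trimmed punctuation characters are assumptions made for illustration. It re-slices the input text with the `start` and `end` offsets, trims the span, and converts scores to plain floats. The entity list is copied from the sample output, with scores written as plain floats so the snippet runs without loading the model.

```python
import string

# Hypothetical helper (not part of Artifex or Transformers).
# The characters trimmed from the right edge are an assumption; adjust as needed.
TRAILING_CHARS = string.whitespace + ".,;:!?"


def clean_entities(text, entities):
    """Trim whitespace and trailing punctuation from predicted spans."""
    cleaned = []
    for ent in entities:
        start = ent["start"]
        span = text[start:ent["end"]]
        # Drop leading whitespace and shift the start offset accordingly.
        left_trimmed = span.lstrip()
        start += len(span) - len(left_trimmed)
        # Drop trailing whitespace and punctuation. Caveat: this would also
        # trim the final period of entities such as "Inc.".
        trimmed = left_trimmed.rstrip(TRAILING_CHARS)
        if not trimmed:
            continue  # nothing left after trimming
        cleaned.append({
            "label": ent["entity_group"],
            "text": trimmed,
            "start": start,
            "end": start + len(trimmed),
            "score": float(ent["score"]),  # plain float, e.g. for JSON output
        })
    return cleaned


text = "John landed in Barcelona at 15:45."
# Copied from the sample output above, with scores as plain floats.
raw_entities = [
    {"entity_group": "PERSON", "score": 0.92174554, "word": "John", "start": 0, "end": 4},
    {"entity_group": "LOCATION", "score": 0.9853817, "word": " Barcelona", "start": 15, "end": 24},
    {"entity_group": "TIME", "score": 0.98645407, "word": " 15:45.", "start": 28, "end": 34},
]

for entity in clean_entities(text, raw_entities):
    print(entity["label"], repr(entity["text"]), entity["start"], entity["end"])
```

Slicing the original text by offset, rather than cleaning the `word` field, keeps the reported `start` and `end` consistent with the trimmed text.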
Model Description
- Base model: FacebookAI/roberta-base
- Task: Token classification (Named Entity Recognition)
- Languages: Multilingual (16+ languages)
- Fine-tuning data: A synthetic, custom dataset of around 10,000 passages, each containing multiple named entities across 14 categories.
Training Details
This model was trained using the Artifex Python library
pip install artifex
by providing the following instructions and generating 10,000 synthetic training samples:
from artifex import Artifex
ner = Artifex().named_entity_recognition
ner.train(
named_entities={
"PERSON": "Individual people, fictional characters",
"ORG": "Companies, institutions, agencies",
"LOCATION": "Geographical areas",
"DATE": "Absolute or relative dates, including years, months and/or days",
"TIME": "Specific time of the day",
"PERCENT": "Percentage expressions",
"NUMBER": "Numeric measurements or expressions",
"FACILITY": "Buildings, airports, highways, etc.",
"PRODUCT": "Objects, vehicles, food, etc. bearing a specific name",
"WORK_OF_ART": "Titles of creative works",
"LANGUAGE": "Natural or programming languages",
"NORP": "National, religious or political groups",
"ADDRESS": "full addresses",
"PHONE_NUMBER": "telephone numbers",
},
domain="general",
num_samples=10000
)
Intended Uses
This model is intended to:
- Extract and classify named entities from text in a variety of applications, such as chatbots, information extraction systems, and data analysis tools.
- Be used in multilingual contexts, supporting 16+ languages.
- Serve as a general-purpose NER model applicable across various industries and use cases.
Not intended for:
- Highly specialized domains requiring custom entity types not covered by the 14 categories in this model.
- Idioms, slang, or very informal text where entity recognition may be less reliable.
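Where recognition may be less reliable, one possible mitigation is to discard low-confidence predictions before using them. The sketch below assumes the output format shown in the usage examples (`entity_group`, `score`, `word`). The helper name `group_confident_entities` and the threshold values are hypothetical and chosen only for illustration; they are not recommended settings for this model.

```python
from collections import defaultdict


def group_confident_entities(entities, min_score=0.9):
    """Keep entities scoring at least min_score, grouped by entity type."""
    grouped = defaultdict(list)
    for ent in entities:
        # float() also accepts the np.float32 scores returned by the pipeline.
        if float(ent["score"]) >= min_score:
            grouped[ent["entity_group"]].append(ent["word"].strip())
    return dict(grouped)


# Sample predictions in the same format as the usage examples,
# with scores written as plain floats.
sample_entities = [
    {"entity_group": "PERSON", "score": 0.92174554, "word": "John", "start": 0, "end": 4},
    {"entity_group": "LOCATION", "score": 0.9853817, "word": " Barcelona", "start": 15, "end": 24},
    {"entity_group": "TIME", "score": 0.98645407, "word": " 15:45.", "start": 28, "end": 34},
]

# With an illustrative threshold of 0.95, the PERSON prediction (0.92) is dropped.
print(group_confident_entities(sample_entities, min_score=0.95))
```

A suitable threshold depends on the application and is best chosen by checking predictions against a small labeled sample of the target text.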