πŸ“πŸ”πŸ·οΈ tanaos-NER-v1: A small but performant Named Entity Recognition model

This model was created by Tanaos with the Artifex Python library.

This is a multilingual Named Entity Recognition model (supporting 16+ languages), based on FacebookAI/roberta-base and fine-tuned on a synthetic dataset to recognize and classify entities in text into the following 14 categories:

| Entity | Description |
| --- | --- |
| PERSON | Individual people, fictional characters |
| ORG | Companies, institutions, agencies |
| LOCATION | Geographical areas |
| DATE | Absolute or relative dates, including years, months and/or days |
| TIME | Specific time of the day |
| PERCENT | Percentage expressions |
| NUMBER | Numeric measurements or expressions |
| FACILITY | Buildings, airports, highways, etc. |
| PRODUCT | Objects, vehicles, food, etc. bearing a specific name |
| WORK_OF_ART | Titles of creative works |
| LANGUAGE | Natural or programming languages |
| NORP | National, religious or political groups |
| ADDRESS | Full addresses |
| PHONE_NUMBER | Telephone numbers |

These categories were chosen to cover the most common named entity types used in NLP applications, regardless of the application domain, so that the model remains versatile and general-purpose across industries and use cases.
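The label set shipped with the checkpoint can also be read directly from its configuration. Below is a minimal sketch using the standard Transformers AutoConfig class; it assumes the checkpoint stores its labels in the usual id2label field, and the names may carry B-/I- prefixes rather than appearing exactly as listed above:

from transformers import AutoConfig

# Load only the configuration and print the label names the model predicts.
config = AutoConfig.from_pretrained("tanaos/tanaos-NER-v1")
print(sorted(config.id2label.values()))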

βš™οΈ How to Use

Via the Artifex library (pip install artifex)

from artifex import Artifex

ner = Artifex().named_entity_recognition

print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]

Via the Transformers library

from transformers import pipeline

ner = pipeline(
    task="token-classification",
    model="tanaos/tanaos-NER-v1",
    aggregation_strategy="first"
)

print(ner("John landed in Barcelona at 15:45."))
# >>> [{'entity_group': 'PERSON', 'score': np.float32(0.92174554), 'word': 'John', 'start': 0, 'end': 4}, {'entity_group': 'LOCATION', 'score': np.float32(0.9853817), 'word': ' Barcelona', 'start': 15, 'end': 24}, {'entity_group': 'TIME', 'score': np.float32(0.98645407), 'word': ' 15:45.', 'start': 28, 'end': 34}]
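Since the model is multilingual, the same pipeline can be reused on non-English input. A short sketch reusing the ner pipeline defined above; the example sentences and the entity types noted in the comments are illustrative assumptions, not verified outputs:

# Reuse the `ner` pipeline from the snippet above on non-English sentences.
for sentence in [
    "Marie a visité le Louvre le 3 mars.",   # French: expect PERSON, FACILITY, DATE
    "Juan llegó a Madrid a las 15:45.",      # Spanish: expect PERSON, LOCATION, TIME
]:
    print(ner(sentence))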

🧠 Model Description

  • Base model: FacebookAI/roberta-base
  • Task: Token classification (Named Entity Recognition)
  • Languages: Multilingual (16+ languages)
  • Fine-tuning data: A synthetic, custom dataset of around 10,000 passages, each containing multiple named entities across 14 categories.
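
If the pipeline abstraction is not convenient, the checkpoint can also be loaded with the standard auto classes and decoded manually. A minimal sketch of token-level inference, assuming the model exposes the usual id2label mapping in its config (no span aggregation is performed here):

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tanaos/tanaos-NER-v1")
model = AutoModelForTokenClassification.from_pretrained("tanaos/tanaos-NER-v1")

# Tokenize, run a forward pass, and map each token to its most likely label.
inputs = tokenizer("John landed in Barcelona at 15:45.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id.item()])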

🎓 Training Details

This model was trained using the Artifex Python library

pip install artifex

by providing the following instructions and generating 10,000 synthetic training samples:

from artifex import Artifex

ner = Artifex().named_entity_recognition

ner.train(
    named_entities={
        "PERSON": "Individual people, fictional characters",
        "ORG": "Companies, institutions, agencies",
        "LOCATION": "Geographical areas",
        "DATE": "Absolute or relative dates, including years, months and/or days",
        "TIME": "Specific time of the day",
        "PERCENT": "Percentage expressions",
        "NUMBER": "Numeric measurements or expressions",
        "FACILITY": "Buildings, airports, highways, etc.",
        "PRODUCT": "Objects, vehicles, food, etc. bearing a specific name",
        "WORK_OF_ART": "Titles of creative works",
        "LANGUAGE": "Natural or programming languages",
        "NORP": "National, religious or political groups",
        "ADDRESS": "full addresses",
        "PHONE_NUMBER": "telephone numbers",
    },
    domain="general",
    num_samples=10000
)

🧰 Intended Uses

This model is intended to:

  • Extract and classify named entities from text in a variety of applications, such as chatbots, information extraction systems, and data analysis tools (a post-processing sketch follows this list).
  • Be used in multilingual contexts, supporting over 16 languages.
  • Serve as a general-purpose NER model applicable across various industries and use cases.
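
As an example of the information-extraction use mentioned above, the spans returned by the pipeline can be grouped by entity type. A minimal sketch reusing the ner pipeline from the usage section; extract_entities is a hypothetical helper name, not part of this model or of the Transformers library:

from collections import defaultdict

def extract_entities(text, ner_pipeline):
    # Group the predicted spans by their entity type.
    grouped = defaultdict(list)
    for span in ner_pipeline(text):
        grouped[span["entity_group"]].append(span["word"].strip())
    return dict(grouped)

print(extract_entities("John landed in Barcelona at 15:45.", ner))
# e.g. {'PERSON': ['John'], 'LOCATION': ['Barcelona'], 'TIME': ['15:45.']}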

Not intended for:

  • Highly specialized domains requiring custom entity types not covered by the 14 categories in this model.
  • Idioms, slang, or very informal text where entity recognition may be less reliable.