TeslaXLM

Вишејезични модел, 561 милион параметара

Обучаван над корпусима српског и српскохрватског језика - 20 милијарди речи

Једнака подршка уноса на ћирилици и латиници!

Multilingual model, 561 million parameters

Trained on Serbian and Serbo-Croatian corpora - 20 billion words

Equal support for Cyrillic and Latin input!
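
One way to check the script claim for yourself is to send the same masked sentence through the fill-mask pipeline in both scripts and compare the top predictions. The sketch below is only an illustration: `top_predictions` is a hypothetical helper, not part of the model or of `transformers`, and it assumes the pipeline returns a list of dicts with `score` and `token_str` keys, as the fill-mask pipeline does for a single sentence with one mask.

```python
def top_predictions(unmasker, sentence, k=3):
    """Return the k highest-scoring (token, score) pairs for one masked sentence.

    `unmasker` is any callable that behaves like a fill-mask pipeline
    (hypothetical helper, written for illustration).
    """
    results = unmasker(sentence)
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in ranked[:k]]


# The same sentence written in both scripts
latin = "Kada bi čovek znao gde će pasti on bi<mask>."
cyrillic = "Када би човек знао где ће пасти он би<mask>."
```

With the `unmasker` pipeline from the usage example, calling `top_predictions(unmasker, latin)` and `top_predictions(unmasker, cyrillic)` gives two short lists that can be compared side by side.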

>>> # Fill-mask: predict the token hidden behind <mask>
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='te-sla/teslaXLM')
>>> unmasker("Kada bi čovek znao gde će pasti on bi<mask>.")

>>> # Word vectors: compare words by the cosine distance of their hidden states
>>> from transformers import AutoTokenizer, AutoModelForMaskedLM
>>> from torch import LongTensor, no_grad
>>> from scipy import spatial
>>> tokenizer = AutoTokenizer.from_pretrained('te-sla/teslaXLM')
>>> model = AutoModelForMaskedLM.from_pretrained('te-sla/teslaXLM', output_hidden_states=True)
>>> x = " pas"
>>> y = " mačka"
>>> z = " svemir"
>>> tensor_x = LongTensor(tokenizer.encode(x, add_special_tokens=False)).unsqueeze(0)
>>> tensor_y = LongTensor(tokenizer.encode(y, add_special_tokens=False)).unsqueeze(0)
>>> tensor_z = LongTensor(tokenizer.encode(z, add_special_tokens=False)).unsqueeze(0)
>>> model.eval()
>>> with no_grad():
...     # Average over the token axis so each word yields one 1-D vector,
...     # even if the tokenizer splits it into several subword tokens
...     vektor_x = model(input_ids=tensor_x).hidden_states[-1].mean(dim=1).squeeze()
...     vektor_y = model(input_ids=tensor_y).hidden_states[-1].mean(dim=1).squeeze()
...     vektor_z = model(input_ids=tensor_z).hidden_states[-1].mean(dim=1).squeeze()
...     # Cosine distance: a smaller value means the vectors are more similar
...     print(spatial.distance.cosine(vektor_x, vektor_y))
...     print(spatial.distance.cosine(vektor_x, vektor_z))
...
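
The two printed numbers are cosine distances, not similarities, so the smaller value marks the closer pair. As a minimal sketch of that quantity, the plain-Python function below computes one minus the cosine similarity, which is what `scipy.spatial.distance.cosine` returns for real-valued vectors; the function name and the toy vectors are invented for illustration and are not model outputs.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration only
toy_a = [1.0, 2.0, 0.0]
toy_b = [2.0, 4.0, 0.0]  # points in the same direction as toy_a
toy_c = [0.0, 0.0, 3.0]  # orthogonal to toy_a

print(cosine_distance(toy_a, toy_b))  # near 0: same direction
print(cosine_distance(toy_a, toy_c))  # 1.0: orthogonal vectors
```

Read this way, the model treats two words as related when their distance is well below the distance to an unrelated word.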
Евалуација XLMR модела за српски језик
Serbian XLMR models evaluation results
Author: Mihailo Škorić
Author: Saša Petalinkar
Computation: TESLA project

Citation

@incollection{skoric2025:juznoslovenskijezici,
  author       = {Škorić, Mihailo and Petalinkar, Saša},
  orcid        = {0000-0003-4811-8692 and 0009-0007-9664-3594},
  title        = {Quality Textual Corpora and New South Slavic Language Models},
  license      = {https://creativecommons.org/licenses/by/4.0/},
  booktitle    = {Proceedings of the International Conference South Slavic Languages in the Digital Environment JuDig : Thematic Collection of Papers},
  editor       = {Moskovljević Popović, Jasmina and Stanković, Ranka},
  isbn         = {978-86-6153-791-2},
  series       = {South Slavic Languages in the Digital Environment JuDig},
  publisher    = {University of Belgrade — Faculty of Philology},
  address      = {Belgrade},
  year         = {2025},
  volume       = {1},
  pages        = {337--348},
  note         = {19},
  doi          = {10.18485/judig.2025.1.ch19},
  doiurl       = {http://doi.fil.bg.ac.rs/volume.php?pt=eb_ser&issue=judig-2025-1&i=19},
  url          = {http://doi.fil.bg.ac.rs/pdf/eb_ser/judig/2025-1/judig-2025-1-ch19.pdf}
}

Истраживање је спроведено уз подршку Фонда за науку Републике Србије, #7276, Text Embeddings - Serbian Language Applications - TESLA

This research was supported by the Science Fund of the Republic of Serbia, #7276, Text Embeddings - Serbian Language Applications - TESLA
