John Locke

johnlockejrr

johnlockejrr@gmail.com

AI & ML interests

NLP, OCR, AI

Recent Activity

updated a model about 5 hours ago

johnlockejrr/Nordic_Multicentury

published a model about 5 hours ago

johnlockejrr/Nordic_Multicentury

reacted to MohamedRashad's post with ❤️ 5 days ago

I have updated my https://huggingface.co/collections/MohamedRashad/arabic-speech-datasets collection with new datasets, bringing the full audio data to more than 3000 hours of good Arabic speech. Feel free to use it in your new innovations, and happy new year!

upvoted an article 8 days ago

Article

New in llama.cpp: Model Management

24 days ago • 103

upvoted an article 14 days ago

Article

Efficient MultiModal Data Pipeline

Jul 8, 2025 • 69

upvoted a changelog about 1 month ago

Changelog

Duplicate Datasets

Dec 3, 2025 • 89

upvoted an article about 1 month ago

Article

Transformers v5: Simple model definitions powering the AI ecosystem

Dec 1, 2025 • 265

upvoted a collection about 2 months ago

SHAMIYAT: A Collection of Syrian Dialect Datasets & LLMs

A collection of datasets and language models focused on the Syrian dialect, supporting NLP research and applications for Syria • 4 items • Updated Nov 28, 2025 • 2

upvoted an article about 2 months ago

Article

How to train a new language model from scratch using Transformers and Tokenizers

Feb 14, 2020 • 56

upvoted a collection 2 months ago

Yiddish Whisper Training

Yiddish-based Whisper post-training - Crowdsourced Open Data • 10 items • Updated Oct 24, 2025 • 4

upvoted a collection 4 months ago

Scaling Low-Res MT via Synthetic Data Generation with LLMs

Synthetic baselines trained for our paper "Scaling Low-Resource MT via Synthetic Data Generation with LLMs", accepted as a main conference paper at EMNLP 2025. • 8 items • Updated Sep 16, 2025 • 1

upvoted a paper 4 months ago

Scaling Low-Resource MT via Synthetic Data Generation with LLMs

Paper • 2505.14423 • Published May 20, 2025 • 2

upvoted a collection 11 months ago

DictaBERT

Collection of state-of-the-art language models for Hebrew, finetuned for various tasks, as detailed in the article: https://arxiv.org/abs/2308.16687 • 17 items • Updated Apr 4, 2024 • 5

upvoted a paper about 1 year ago

Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Paper • 2411.17835 • Published Nov 19, 2024 • 4

upvoted an article about 1 year ago

Article

HTRflow - A tool for HTR and OCR

Oct 1, 2024 • 22

upvoted a collection over 1 year ago

ZeroGPU Spaces

ZeroGPU Spaces made by the community • 17 items • Updated Jun 6, 2024 • 245