LLM News

Every LLM release, update, and milestone.

Filtered by: tokenization
research

ByteFlow Net removes tokenizers, learns adaptive byte compression for language models

Researchers introduce ByteFlow Net, a tokenizer-free language model architecture that learns to segment raw byte streams into semantically meaningful units via a compression-driven objective. The method adapts its internal representation granularity per input and outperforms both BPE-based Transformers and previous byte-level approaches in the reported experiments.
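The core idea of compression-driven segmentation can be illustrated with a minimal sketch: split a byte stream wherever the next byte is "surprising" (expensive to encode) under a simple predictive model. This is a toy stand-in using a bigram byte model, not the authors' architecture; the function name, model, and threshold are all assumptions for illustration.

```python
# Toy sketch of compression-driven byte segmentation, loosely inspired by
# the ByteFlow Net description above -- NOT the authors' implementation.
from collections import Counter
import math


def segment_bytes(data: bytes, threshold: float = 1.0):
    """Split a byte stream at positions where the next byte carries high
    surprise (bits) under a bigram model estimated from the stream itself."""
    # Crude next-byte predictor: bigram counts over the input.
    pair_counts = Counter(zip(data, data[1:]))
    ctx_counts = Counter(data[:-1])

    segments, start = [], 0
    for i in range(1, len(data)):
        prev, cur = data[i - 1], data[i]
        p = pair_counts[(prev, cur)] / ctx_counts[prev]
        surprise = -math.log2(p)  # bits needed to encode this byte
        if surprise > threshold:  # high surprise => likely unit boundary
            segments.append(data[start:i])
            start = i
    segments.append(data[start:])
    return segments
```

A highly repetitive prefix compresses well and stays one unit, while an abrupt change in byte statistics triggers a boundary; a learned model would replace the bigram predictor and tune granularity per input.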

research

UniLID: New language identification method achieves 70% accuracy with just 5 samples per language

Researchers introduce UniLID, a language identification method that leverages tokenizer-based unigram distributions to identify languages and dialects with high sample efficiency. The approach achieves over 70% accuracy on low-resource languages with only five labeled examples per language, substantially outperforming existing systems like fastText, GlotLID, and CLD3 in low-resource settings.
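The unigram-distribution approach can be sketched in a few lines: estimate a token frequency profile per language from a handful of samples, then label new text with the language whose profile assigns it the lowest per-token cross-entropy. This is a hedged illustration of the general idea, not UniLID itself; character unigrams stand in for a real subword tokenizer, and all names, example texts, and the `eps` smoothing constant are assumptions.

```python
# Toy unigram-distribution language ID in the spirit of the UniLID
# description above -- NOT the authors' implementation.
from collections import Counter
import math


def unigram_dist(texts):
    """Normalized unigram distribution over the tokens of a few samples.
    Character unigrams stand in for a subword tokenizer here."""
    counts = Counter()
    for t in texts:
        counts.update(t)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def identify(sample, profiles, eps=1e-6):
    """Return the language whose unigram profile gives the sample the
    lowest per-token cross-entropy (eps smooths unseen tokens)."""
    best_lang, best_ce = None, float("inf")
    for lang, dist in profiles.items():
        ce = -sum(math.log(dist.get(tok, eps)) for tok in sample) / len(sample)
        if ce < best_ce:
            best_lang, best_ce = lang, ce
    return best_lang


# Profiles built from only five short examples per language, mirroring
# the few-shot setting described above (example texts are invented).
profiles = {
    "en": unigram_dist(["the cat", "a dog", "hello", "good day", "thanks"]),
    "de": unigram_dist(["die Katze", "ein Hund", "hallo", "guten Tag", "danke"]),
}
```

Even this crude profile separates the two languages on short inputs, which is why unigram statistics can be so sample-efficient; the reported system presumably gains further from tokenizer vocabularies shared across many languages.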