LLM News

Every LLM release, update, and milestone.

Filtered by: tokenization
research

ByteFlow Net removes tokenizers, learns adaptive byte compression for language models

Researchers introduce ByteFlow Net, a tokenizer-free language model architecture that learns to segment raw byte streams into semantically meaningful units via a compression-driven objective. The method adapts its internal representation granularity per input and outperforms both BPE-based Transformers and previous byte-level approaches in the reported experiments.
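The core idea of compression-driven segmentation can be illustrated with a minimal sketch: split a byte stream wherever the next byte is "surprising" (expensive to encode) under a simple predictive model. This is a toy stand-in using a bigram byte model, not the authors' architecture; the function name, model, and threshold are all assumptions for illustration.

```python
# Toy sketch of compression-driven byte segmentation, loosely inspired by
# the ByteFlow Net description above -- NOT the authors' implementation.
from collections import Counter
import math


def segment_bytes(data: bytes, threshold: float = 1.0):
    """Split a byte stream at positions where the next byte carries high
    surprise (bits) under a bigram model estimated from the stream itself."""
    # Crude next-byte predictor: bigram counts over the input.
    pair_counts = Counter(zip(data, data[1:]))
    ctx_counts = Counter(data[:-1])

    segments, start = [], 0
    for i in range(1, len(data)):
        prev, cur = data[i - 1], data[i]
        p = pair_counts[(prev, cur)] / ctx_counts[prev]
        surprise = -math.log2(p)  # bits needed to encode this byte
        if surprise > threshold:  # high surprise => likely unit boundary
            segments.append(data[start:i])
            start = i
    segments.append(data[start:])
    return segments
```

A highly repetitive prefix compresses well and stays one unit, while an abrupt change in byte statistics triggers a boundary; a learned model would replace the bigram predictor and tune granularity per input.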

research

UniLID: New language identification method achieves 70% accuracy with just 5 samples per language

Researchers introduce UniLID, a language identification method that leverages tokenizer-based unigram distributions to identify languages and dialects with high sample efficiency. The approach achieves over 70% accuracy on low-resource languages with only five labeled examples per language, substantially outperforming existing systems like fastText, GlotLID, and CLD3 in low-resource settings.
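The unigram-distribution approach can be sketched in a few lines: estimate a token frequency profile per language from a handful of samples, then label new text with the language whose profile assigns it the lowest per-token cross-entropy. This is a hedged illustration of the general idea, not UniLID itself; character unigrams stand in for a real subword tokenizer, and all names, example texts, and the `eps` smoothing constant are assumptions.

```python
# Toy unigram-distribution language ID in the spirit of the UniLID
# description above -- NOT the authors' implementation.
from collections import Counter
import math


def unigram_dist(texts):
    """Normalized unigram distribution over the tokens of a few samples.
    Character unigrams stand in for a subword tokenizer here."""
    counts = Counter()
    for t in texts:
        counts.update(t)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def identify(sample, profiles, eps=1e-6):
    """Return the language whose unigram profile gives the sample the
    lowest per-token cross-entropy (eps smooths unseen tokens)."""
    best_lang, best_ce = None, float("inf")
    for lang, dist in profiles.items():
        ce = -sum(math.log(dist.get(tok, eps)) for tok in sample) / len(sample)
        if ce < best_ce:
            best_lang, best_ce = lang, ce
    return best_lang


# Profiles built from only five short examples per language, mirroring
# the few-shot setting described above (example texts are invented).
profiles = {
    "en": unigram_dist(["the cat", "a dog", "hello", "good day", "thanks"]),
    "de": unigram_dist(["die Katze", "ein Hund", "hallo", "guten Tag", "danke"]),
}
```

Even this crude profile separates the two languages on short inputs, which is why unigram statistics can be so sample-efficient; the reported system presumably gains further from tokenizer vocabularies shared across many languages.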