training-data
3 articles tagged with training-data
Researchers release 13B-parameter language model trained exclusively on pre-1931 data
A team of researchers has released Talkie, a 13-billion-parameter language model trained exclusively on digitized English-language texts published before the end of 1930. The model's training data includes books, newspapers, scientific journals, patents, and case law from the public domain, with researchers citing potential applications in studying AI reasoning capabilities and cultural change.
GitHub will train Copilot models on user interaction data starting April 2026
GitHub will use Copilot interaction data from Free, Pro, and Pro+ plan users to train AI models starting April 24, 2026, unless users actively opt out. The policy does not affect Copilot Business and Enterprise customers. Data shared will include prompts, outputs, code snippets, filenames, and repository structures.
Meta research challenges multimodal training assumptions as text data scarcity looms
A research team from Meta FAIR and New York University trained a multimodal AI model from scratch and found that several widely held assumptions about multimodal model architecture and training do not match their empirical results. The work addresses growing concerns about text data exhaustion in LLM training.