
Researchers release 13B-parameter language model trained exclusively on pre-1931 data

TL;DR

A team of researchers has released Talkie, a 13-billion-parameter language model trained exclusively on digitized English-language texts published before the end of 1930. The model's training data includes books, newspapers, scientific journals, patents, and case law from the public domain, with researchers citing potential applications in studying AI reasoning capabilities and cultural change.


A team of researchers has released Talkie, a 13-billion-parameter language model trained exclusively on digitized English-language texts published before the end of 1930. The model uses only public domain materials including books, newspapers, periodicals, scientific journals, patents, and case law.

The training data cutoff was chosen because works published in 1930 are the most recent to have entered the public domain in the United States. According to the researchers, Talkie is the largest vintage language model they are aware of, though they note that other vintage models, trained on Victorian literature and pre-1900 scientific texts, already exist.
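To make the cutoff concrete, here is a minimal sketch of a date-based corpus filter. It is not the researchers' pipeline: the `Document` record, its `year` field, and the sample entries are hypothetical, and the sketch assumes each text carries a reliable publication year.

```python
from dataclasses import dataclass

# Last publication year admitted, per the article's "before the end of 1930".
CUTOFF_YEAR = 1930


@dataclass
class Document:
    """Hypothetical corpus record; field names are illustrative only."""
    title: str
    year: int  # publication year, assumed to be known and accurate
    text: str


def filter_corpus(docs, cutoff_year=CUTOFF_YEAR):
    """Keep only documents published on or before the cutoff year."""
    return [d for d in docs if d.year <= cutoff_year]


# Made-up sample entries, not items from the real training set.
corpus = [
    Document("Patent filing", 1912, "..."),
    Document("Newspaper issue", 1930, "..."),
    Document("Later journal article", 1931, "..."),
]

kept = filter_corpus(corpus)
```

In this sketch the 1912 and 1930 items are kept and the 1931 item is dropped. A real pipeline would also have to deal with misdated scans and later reprints, which the article does not describe.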

Research applications

David Duvenaud, associate professor in computer science and statistics at the University of Toronto and one of Talkie's three creators, outlined three primary research objectives. First, the team aims to test whether AI can make scientific discoveries using only historical knowledge. The researchers cite a test proposed by Google DeepMind CEO Demis Hassabis: whether an AI with a knowledge cutoff of 1911 could derive general relativity, as Einstein did in 1915.

Second, the model could help evaluate long-term forecasting methods, since its predictions about the period after its cutoff can be checked against events that have already occurred. Third, researchers hope to study cultural change and historical interpretation. "We can use these models to try to understand how a law would have been interpreted at the time it was written, based on the implicit assumptions and meaning of language at the time," Duvenaud told The Register.
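One way to picture the forecasting use case is to score a model's probability estimates against outcomes that are now known. The article does not say how the team would score forecasts; the sketch below uses the Brier score (mean squared error between predicted probabilities and 0/1 outcomes) as an illustrative choice, and the numbers are invented rather than taken from Talkie.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecast, and always answering 0.5
    scores 0.25.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical probabilities a vintage model might assign to three post-cutoff
# events, paired with whether each event actually happened (1) or not (0).
forecasts = [0.5, 1.0, 0.0]
outcomes = [1, 1, 0]

score = brier_score(forecasts, outcomes)
```

Because the outcomes are already on record, a score like this can be computed immediately rather than after waiting years for the forecasts to resolve.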

Performance limitations

In Python programming tests comparing Talkie to an identical-architecture model trained on modern data, the vintage model generated only simple one-line solutions or small modifications to in-context examples. "There is still a long way to go before this capability is notable," the research team stated.

Duvenaud acknowledged a significant capability gap between Talkie and modern AI models. "As an amateur research effort, we never expect to be able to fully close this gap, in data or compute," he said. The team plans to continue scaling the model significantly.

What this means

Talkie is an attempt to study AI capabilities by constraining training data to a specific historical period. The model's difficulty generating complex solutions highlights how much modern AI performance depends on contemporary training data. The research could also provide insight into how language models form their self-conception: Talkie doesn't even know what an LLM is, which may help reveal how a model's behavior is shaped by the assumptions about AI implicit in its training data.
