LPM 1.0 generates 45-minute real-time lip-synced video from single photo, no public release planned

TL;DR

Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing character from a single image, with lip-synced speech and facial expressions stable for up to 45 minutes. The system integrates directly with voice AI models like ChatGPT but remains a research project with no planned public release.


Researchers have introduced LPM 1.0, an AI model that generates real-time video of a speaking, listening, or singing character from a single image, complete with lip-synced speech and facial expressions. The researchers report that generation stays stable for videos up to 45 minutes and that the system integrates directly with voice AI systems including ChatGPT and Doubao.

Technical capabilities

LPM 1.0 processes text, audio, and reference images simultaneously to produce synchronized speech, subtle facial expressions including hesitation and gaze shifts, and emotional transitions. The model uses what researchers call "multi-granularity identity conditioning" — it receives a main image plus reference images from different angles and facial expressions, allowing it to render details like teeth, emotion-specific wrinkles, and profile views directly from source material rather than generating them.
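Since the paper's actual interface is not public, the conditioning idea can only be pictured schematically: a primary portrait plus labelled auxiliary views the model can copy details from rather than invent. The sketch below is purely illustrative; `IdentityBundle`, `ReferenceImage`, and the role labels are hypothetical names, not LPM 1.0 code.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: LPM 1.0 ships no code or API, so these names
# are hypothetical stand-ins for the multi-granularity identity
# conditioning the researchers describe.

@dataclass
class ReferenceImage:
    path: str
    role: str  # e.g. "profile_left", "smile", "teeth_visible"

@dataclass
class IdentityBundle:
    main_image: str                                # primary portrait defining identity
    references: list[ReferenceImage] = field(default_factory=list)

    def coverage(self) -> set[str]:
        """Views and expressions the model could copy from source material
        instead of generating them."""
        return {ref.role for ref in self.references}

bundle = IdentityBundle(
    main_image="portrait_front.png",
    references=[
        ReferenceImage("portrait_left.png", "profile_left"),
        ReferenceImage("portrait_smile.png", "smile"),
    ],
)
print(bundle.coverage())  # {'profile_left', 'smile'}
```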

The system operates as a streaming process rather than rendering complete videos at once. According to the researchers, videos up to 45 minutes remain stable during real-time generation.
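No implementation has been released, so the following is a minimal sketch of what chunked streaming generation looks like in principle: short bursts of frames are emitted as audio arrives, and the loop runs until a stability horizon (here the reported 45 minutes) is reached. `audio_stream` and `generate_chunk` are placeholder functions, not part of any released API.

```python
import itertools
from typing import Iterator

# Hypothetical sketch of streaming generation: frames are produced in small
# chunks as audio arrives, instead of rendering a finished video file.

FPS = 25
CHUNK_FRAMES = 5  # roughly 200 ms of video per step keeps latency low

def audio_stream() -> Iterator[bytes]:
    """Stand-in for live audio arriving from a voice model."""
    for i in itertools.count():
        yield f"audio-{i}".encode()

def generate_chunk(audio: bytes, n_frames: int) -> list[str]:
    """Placeholder for the model call that turns an audio chunk into frames."""
    return [f"frame({audio!r}, {k})" for k in range(n_frames)]

def stream_video(max_minutes: float = 45.0) -> Iterator[str]:
    max_frames = int(max_minutes * 60 * FPS)
    produced = 0
    for audio in audio_stream():
        for frame in generate_chunk(audio, CHUNK_FRAMES):
            yield frame
            produced += 1
            if produced >= max_frames:  # the reported stability horizon
                return

# Consume a few frames to show the loop never materialises a full video.
for frame in itertools.islice(stream_video(), 3):
    print(frame)
```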

The model works across visual styles including photorealistic faces, anime, and 3D game characters without additional training. It recognizes three conversational states: listening (generating reactive expressions like nodding based on incoming audio), speaking (driving lip movements and body language from response audio), and pausing (producing natural idle behavior from text instructions).
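A rough way to picture the three-state design is a dispatcher that maps the current conversational state to a behaviour driver. The `State` enum and `next_behaviour` function below are illustrative stand-ins based on the description above, not actual LPM 1.0 interfaces.

```python
from enum import Enum, auto

# Sketch of the three conversational states described in the article; the
# handler logic is hypothetical, not part of any released LPM 1.0 code.

class State(Enum):
    LISTENING = auto()  # reactive expressions (e.g. nodding) from incoming audio
    SPEAKING = auto()   # lip movement and body language from response audio
    PAUSING = auto()    # natural idle behaviour from text instructions

def next_behaviour(state: State, payload: str) -> str:
    if state is State.LISTENING:
        return f"react to incoming audio: {payload}"
    if state is State.SPEAKING:
        return f"lip-sync and gesture to response audio: {payload}"
    return f"idle behaviour from instruction: {payload}"

print(next_behaviour(State.LISTENING, "user question audio"))
print(next_behaviour(State.PAUSING, "look around calmly"))
```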

Integration and use cases

LPM 1.0 plugs directly into voice AI models to create visual conversation partners in real time. Beyond live conversation, the system supports offline video generation from existing audio files, which project manager Ailing Zeng says could be useful for podcasts or movie dialogue. Video-based input control is not included in the current version, though Zeng says the framework could support it in future iterations.
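The two usage modes can be sketched side by side: a live path that pipes a voice model's streaming audio straight into the animator, and an offline path that renders video from an existing audio file. Everything here (`speak`, `animate_stream`, `animate_file`) is hypothetical, since neither LPM 1.0 nor its integrations expose a public API.

```python
from typing import Iterable, Iterator

# Sketch of the two usage modes described above; all interfaces are
# hypothetical placeholders for a voice model and the animation model.

def speak(text: str) -> Iterator[bytes]:
    """Stand-in for a voice AI model streaming TTS audio chunks."""
    for word in text.split():
        yield word.encode()

def animate_stream(audio: Iterable[bytes]) -> Iterator[str]:
    """Live mode: frames are produced as audio chunks arrive."""
    for chunk in audio:
        yield f"frames for {chunk!r}"

def animate_file(path: str) -> str:
    """Offline mode: an existing audio file (podcast, dialogue) in,
    a rendered video file out."""
    return path.replace(".wav", ".mp4")

# Live conversation: voice model output piped straight into the animator.
for frames in animate_stream(speak("hello there")):
    print(frames)

# Offline generation from a pre-recorded audio file.
print(animate_file("podcast_episode.wav"))
```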

Research-only status

The development team emphasizes that LPM 1.0 is purely a research project with no plans to release model weights, code, or a public demo. All faces shown in demonstrations are AI-generated, not real people.

The researchers acknowledge that generated videos contain visible artifacts, and their quantitative analysis confirmed a noticeable gap compared to real video quality. The team states they would only consider opening access "if and when adequate safeguards and responsible-use frameworks are firmly in place."

What this means

LPM 1.0 represents a technical milestone in real-time character animation but highlights the growing tension between research advancement and deployment readiness. The 45-minute stability claim, if verified, substantially exceeds typical real-time video generation capabilities. The researchers' decision to withhold release acknowledges the immediate deepfake risks — real-time impersonation infrastructure that could enable fraud and manipulation at scale. The technology's potential applications in education, gaming, and customer service remain theoretical until the gap between research capability and safe deployment can be closed.

