Researchers develop controllable full-duplex speech model trainable on 2,000 hours of data

Researchers have developed F-Actor, an instruction-following full-duplex conversational speech model that can be trained efficiently on 2,000 hours of data without large-scale pretraining. The model enables explicit control over speaker voice, conversation topic, backchanneling, interruptions, and dialogue initiation, addressing naturalness limitations in current spoken conversational systems.

F-Actor: Controllable Full-Duplex Speech Model Released

Researchers have introduced F-Actor, the first open instruction-following full-duplex conversational speech model designed for natural, controllable dialogue. The system addresses a fundamental gap in spoken conversational AI: current systems can generate speech accurately but rarely allow customization of conversational behavior.

Key Technical Achievement

F-Actor requires only 2,000 hours of training data, substantially less than typical spoken dialogue systems, by freezing the audio encoder and finetuning only the language model component. This approach removes the need for large-scale pretraining or multi-stage optimization pipelines, making the system trainable under typical academic resource constraints.
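The core idea of the recipe, updating only the language model while the audio encoder stays fixed, can be sketched in plain Python. This is an illustrative toy, not the authors' code: the parameter names, the prefix convention, and the update rule are all assumptions made for the example.

```python
# Toy sketch of selective finetuning: parameters under a frozen prefix
# (here "audio_encoder.") receive no gradient updates, while everything
# else (the language model) is trained. Names are hypothetical.

def partition_params(named_params, frozen_prefix="audio_encoder."):
    """Split parameters into frozen and trainable groups by name prefix."""
    frozen, trainable = {}, {}
    for name, value in named_params.items():
        (frozen if name.startswith(frozen_prefix) else trainable)[name] = value
    return frozen, trainable

def sgd_step(trainable, grads, lr=0.1):
    """Apply a plain SGD update to trainable parameters only."""
    return {name: v - lr * grads[name] for name, v in trainable.items()}

params = {
    "audio_encoder.conv.w": 1.0,  # stays fixed throughout training
    "lm.block0.attn.w": 0.5,      # finetuned on the 2,000-hour corpus
    "lm.head.w": -0.3,            # finetuned on the 2,000-hour corpus
}
frozen, trainable = partition_params(params)
grads = {name: 1.0 for name in trainable}  # stand-in gradients
updated = sgd_step(trainable, grads)
```

In a real framework this corresponds to disabling gradients on the encoder (e.g. `requires_grad = False` in PyTorch) and passing only the language model's parameters to the optimizer.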

The researchers propose a single-stage training protocol and report a systematic analysis of the design choices behind it.

Controllable Conversational Behaviors

The model accepts explicit instructions to control:

  • Speaker voice characteristics: Customize vocal properties and identity
  • Conversation topic: Direct dialogue content and direction
  • Conversational behavior: Manage backchanneling (natural verbal affirmations like "mm-hmm"), interruptions, and turn-taking patterns
  • Dialogue initiation: Control conversation start conditions and speaker roles

These capabilities address a core limitation in existing spoken conversational systems, which typically operate with fixed conversational patterns that feel scripted or unnatural.
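To make the control surface concrete, the snippet below sketches how such behaviors might be packed into a single instruction string. The field names, values, and serialization format are purely hypothetical; F-Actor's actual instruction format has not been published.

```python
# Hypothetical behavior-control instruction builder. Every field name and
# value here is an assumption for illustration, not F-Actor's real API.

def build_instruction(voice, topic, backchannel=True,
                      allow_interrupt=True, model_initiates=False):
    """Serialize conversational controls into one instruction string."""
    parts = [
        f"voice: {voice}",
        f"topic: {topic}",
        "backchannel: " + ("on" if backchannel else "off"),
        "interruptions: " + ("allowed" if allow_interrupt else "suppressed"),
        "initiate: " + ("model" if model_initiates else "user"),
    ]
    return "; ".join(parts)

prompt = build_instruction("warm, low-pitched", "travel planning",
                           backchannel=True, allow_interrupt=False)
```

The point of the sketch is that each behavior listed above maps to an explicit, user-settable field rather than a fixed conversational pattern baked into the model.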

Full-Duplex Architecture

The full-duplex design allows simultaneous bidirectional audio processing, enabling more natural overlapping speech patterns that mirror human conversation. This contrasts with traditional turn-based systems that wait for one speaker to finish before responding.
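The difference can be illustrated with a toy loop, not the model's actual architecture: a full-duplex system consumes an input frame and emits an output frame at every time step, so it can produce a backchannel while the user is still mid-utterance. The frame labels and response rule below are invented for the example.

```python
# Toy full-duplex frame loop: one output frame per input frame, so the
# system can speak (e.g. backchannel) while still listening. The frame
# contents and the respond() rule are hypothetical.

def full_duplex_loop(user_frames, respond):
    """Consume one input frame and emit one output frame per step."""
    return [respond(frame) for frame in user_frames]

def respond(frame):
    # Emit a backchannel when the user pauses; otherwise stay silent.
    return "mm-hmm" if frame == "pause" else "<silence>"

timeline = full_duplex_loop(["hello", "pause", "so anyway"], respond)
```

A turn-based system, by contrast, would buffer all three input frames and only then generate a reply, making overlapping speech such as the "mm-hmm" above impossible.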

Open Release and Reproducibility

The researchers plan to release both the model and the complete training code to enable reproducible research on controllable full-duplex speech systems, positioning the work as a foundation for academic and commercial development in conversational AI.

What This Means

F-Actor demonstrates that controllable, natural-sounding full-duplex speech systems don't require massive computational budgets or proprietary datasets. The 2,000-hour data requirement and single-stage training protocol significantly lower the barrier to entry for researchers working on spoken dialogue systems. The explicit control mechanisms over conversational behavior, particularly interruptions and backchanneling, represent a step toward more human-like AI assistants that can adapt to context and user preferences. However, the paper is a research contribution from arXiv; real-world effectiveness will depend on independent evaluation and practical deployment results once the code and model are released.