Voice Cloning Plugin - Native DAW Plugin for Movie Dubbing

Audio · AI/ML · C++ · JUCE · RVC · Plugin · Film Dubbing
INTRODUCTION
RVC (Retrieval-based Voice Conversion) makes high-quality voice cloning possible, but the open-source ecosystem ships it as a Python web tool — fine for hobbyists, useless for film post-production, where dubbing artists work entirely inside DAWs and timing has to be frame-accurate. For a movie dubbing project, the dubbing team needed voice conversion as a native step in their existing pipeline, not a context switch to a browser.

I wrapped RVC inside a native VST3/AU plugin so the entire flow happens inside the DAW: drop the plugin on a track, pick a trained voice checkpoint, render straight to the timeline. I trained custom checkpoints per voice and packaged them with the plugin so the dubbing team could load any voice as a personal instrument.
MY ROLE
Designer & Developer — model training (RVC), plugin architecture (JUCE C++), checkpoint distribution, packaging for cross-DAW compatibility (VST3 / AU).
timeline
Late 2024
situation
Open-source RVC ships as a Python web app. For a film dubbing pipeline that's a workflow break: stop the DAW session, export audio, open a browser, upload, generate, download, re-import, re-sync. Frame accuracy, automation, undo history: all lost in the round-trip. Native plugins are how serious audio tools live: inside the session, on the track, rendered to the timeline. The gap between "powerful AI model" and "tool a film audio team can actually use" was a plugin.
task
  • Wrap RVC inference inside a native plugin format (VST3 / AU) that any DAW can load
  • Train custom voice checkpoints per voice cast member and distribute them with the plugin
  • Make the inference run inline — dubbing input on track, converted audio renders straight to timeline, frame-accurate
  • Cross-platform packaging — works in Logic, Ableton, FL Studio, anywhere with VST3 / AU support
  • Deliver a workflow the dubbing engineer can use without thinking about Python, models, or checkpoints — just track, plugin, render
action
  • Plugin shell: JUCE (C++). JUCE handles the cross-DAW abstraction, audio I/O, parameter automation, and GUI rendering. The plugin presents a clean track-loaded interface — pick a checkpoint, set conversion parameters, hit render (see the processor sketch after this list).
  • Inference layer: RVC. RVC's voice-conversion model is integrated for inline audio processing: input audio buffer → conversion → output buffer, handed back to the DAW pipeline.
  • Checkpoint distribution. Trained models are packaged alongside the plugin, so the dubbing team got each voice as an instrument the moment they installed it. The per-voice checkpoint approach turned the plugin into a personalized voice library rather than a generic tool (a discovery sketch also follows this list).
  • Multi-DAW packaging. The plugin is compiled for both VST3 and AU formats, so it loads in every major DAW on macOS and Windows.
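
The sketch below shows the shape of the shell and the inference handoff under stated assumptions: `RvcEngine`, its methods, and the plugin name are hypothetical stand-ins for the actual integration, not the shipped code; only the `juce::AudioProcessor` API is real.

```cpp
#include <juce_audio_processors/juce_audio_processors.h>
#include <memory>

// Hypothetical wrapper around the RVC runtime: the real engine loads a
// trained voice checkpoint and converts mono audio.
struct RvcEngine
{
    void process (float* samples, int numSamples)
    {
        juce::ignoreUnused (samples, numSamples); // model inference would run here
    }
    int latencySamples() const { return 0; }      // a real engine reports its model delay
};

class VoiceCloneProcessor : public juce::AudioProcessor
{
public:
    VoiceCloneProcessor()
        : juce::AudioProcessor (BusesProperties()
              .withInput  ("Input",  juce::AudioChannelSet::mono(), true)
              .withOutput ("Output", juce::AudioChannelSet::mono(), true)),
          engine (std::make_unique<RvcEngine>())
    {}

    void prepareToPlay (double, int) override
    {
        // Report model delay so the host compensates and the converted
        // take lands frame-accurate on the timeline.
        setLatencySamples (engine->latencySamples());
    }

    void processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer&) override
    {
        // Input buffer -> RVC conversion -> output buffer back to the host.
        engine->process (buffer.getWritePointer (0), buffer.getNumSamples());
    }

    // Remaining AudioProcessor boilerplate, stubbed for brevity.
    void releaseResources() override                            {}
    const juce::String getName() const override                 { return "VoiceClone"; }
    bool acceptsMidi() const override                           { return false; }
    bool producesMidi() const override                          { return false; }
    double getTailLengthSeconds() const override                { return 0.0; }
    juce::AudioProcessorEditor* createEditor() override         { return nullptr; }
    bool hasEditor() const override                             { return false; }
    int getNumPrograms() override                               { return 1; }
    int getCurrentProgram() override                            { return 0; }
    void setCurrentProgram (int) override                       {}
    const juce::String getProgramName (int) override            { return {}; }
    void changeProgramName (int, const juce::String&) override  {}
    void getStateInformation (juce::MemoryBlock&) override      {}
    void setStateInformation (const void*, int) override        {}

private:
    std::unique_ptr<RvcEngine> engine;
};
```

Whether inference runs in the real-time callback or an offline render pass is left open here; what the DAW integration depends on is the handoff shape (buffer in, converted buffer out) plus the reported latency.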
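For distribution, a minimal sketch of how shipped checkpoints could be discovered at load time, assuming the installer drops one file per cast voice into a shared application-data folder; the folder names and the .pth extension are assumptions, not the shipped layout:

```cpp
#include <juce_core/juce_core.h>

juce::Array<juce::File> findVoiceCheckpoints()
{
    // Assumed install location; the real installer path may differ.
    auto dir = juce::File::getSpecialLocation (juce::File::commonApplicationDataDirectory)
                   .getChildFile ("VoiceClonePlugin")
                   .getChildFile ("checkpoints");

    // One checkpoint per voice; the resulting list populates the plugin's
    // voice menu, so engineers never touch the files directly.
    return dir.findChildFiles (juce::File::findFiles, false, "*.pth");
}
```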
result
  • Cross-DAW native plugin (VST3 + AU)
  • Custom voice checkpoints trained and shipped per voice in the dubbing cast
  • Output renders directly to the DAW timeline — no export/import cycle, frame-accurate
  • Dubbing engineers worked inside their session without touching Python, the model, or the checkpoint files directly
  • [need from you — film name (if shareable) or generic "feature film", number of voices trained, any quote from dubbing supervisor]
Related: open-source audio model training

The plugin is one shipped output of a wider audio-model practice. The same discipline — pick an open-source model, train or fine-tune it for a specific use, package it for the team that will actually use it — applies across voice and speech models:

  • VoxCPM LoRAs — custom voice/style LoRAs trained on ~120 hours of audio per character. The voices are under NDA; checkpoints live in a private Hugging Face repo. Same dataset → checkpoint → ship pipeline as the RVC work, just on the newer open speech model.
  • F5-TTS — Hindi fine-tuning. Open speech models are structurally under-resourced for Indic languages, so I fine-tuned F5-TTS on a Hindi corpus for a separate (also NDA-bound) project. The methodology is being open-sourced where the data permits; the fine-tuned weights stay private.

Both feed back into the dubbing pipeline and any future ComfyUI audio workflows. I treat image generation, voice conversion, and speech synthesis as one continuous practice — same training discipline, different model families.