Voice Cloning Plugin - Native DAW Plugin for Movie Dubbing

Audio · AI/ML · C++ · JUCE · RVC · Plugin · Film Dubbing
INTRODUCTION
RVC (Retrieval-based Voice Conversion) makes high-quality voice cloning possible, but the open-source ecosystem ships it as a Python web tool — fine for hobbyists, useless for film post-production, where dubbing artists work entirely inside DAWs and timing has to be frame-accurate. For a movie dubbing project, the dubbing team needed voice conversion as a native step in their existing pipeline, not a context switch to a browser.

I wrapped RVC inside a native VST3/AU plugin so the entire flow happens inside the DAW: drop the plugin on a track, pick a trained voice checkpoint, render straight to the timeline. I trained custom checkpoints per voice and packaged them with the plugin so the dubbing team could load any voice as a personal instrument.
MY ROLE
Designer & Developer — model training (RVC), plugin architecture (JUCE C++), checkpoint distribution, packaging for cross-DAW compatibility (VST3 / AU).
timeline
Late 2024
situation
Open-source RVC ships as a Python web app. For a film dubbing pipeline that's a workflow break: stop the DAW session, export audio, open a browser, upload, generate, download, re-import, re-sync. Frame accuracy, automation, undo history: all lost in the round-trip. Native plugins are how serious audio tools live: inside the session, on the track, rendered to the timeline. The gap between "powerful AI model" and "tool a film audio team can actually use" was a plugin.
task
  • Wrap RVC inference inside a native plugin format (VST3 / AU) that any DAW can load
  • Train custom voice checkpoints per voice cast member and distribute them with the plugin
  • Make the inference run inline — dubbing input on track, converted audio renders straight to timeline, frame-accurate
  • Cross-platform packaging — works in Logic, Ableton, FL Studio, anywhere with VST3 / AU support
  • Deliver a workflow the dubbing engineer can use without thinking about Python, models, or checkpoints — just track, plugin, render
action
  • Plugin shell: JUCE (C++). JUCE handles the cross-DAW abstraction, audio I/O, parameter automation, and GUI rendering. The plugin presents a clean track-loaded interface — pick a checkpoint, set conversion parameters, hit render (see the processor sketch after this list).
  • Inference layer: RVC. RVC's voice-conversion model is integrated for inline audio processing: input audio buffer → conversion → output buffer, handed back to the DAW pipeline.
  • Checkpoint distribution. Trained models are packaged alongside the plugin, so the dubbing team got each voice as an instrument the moment they installed it. The per-voice checkpoint approach turned the plugin into a personalized voice library rather than a generic tool (a discovery sketch also follows this list).
  • Multi-DAW packaging. The plugin is compiled for both VST3 and AU formats, so it loads in every major DAW on macOS and Windows.
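
The sketch below shows the shape of the shell and the inference handoff under stated assumptions: `RvcEngine`, its methods, and the plugin name are hypothetical stand-ins for the actual integration, not the shipped code; only the `juce::AudioProcessor` API is real.

```cpp
#include <juce_audio_processors/juce_audio_processors.h>
#include <memory>

// Hypothetical wrapper around the RVC runtime: the real engine loads a
// trained voice checkpoint and converts mono audio.
struct RvcEngine
{
    void process (float* samples, int numSamples)
    {
        juce::ignoreUnused (samples, numSamples); // model inference would run here
    }
    int latencySamples() const { return 0; }      // a real engine reports its model delay
};

class VoiceCloneProcessor : public juce::AudioProcessor
{
public:
    VoiceCloneProcessor()
        : juce::AudioProcessor (BusesProperties()
              .withInput  ("Input",  juce::AudioChannelSet::mono(), true)
              .withOutput ("Output", juce::AudioChannelSet::mono(), true)),
          engine (std::make_unique<RvcEngine>())
    {}

    void prepareToPlay (double, int) override
    {
        // Report model delay so the host compensates and the converted
        // take lands frame-accurate on the timeline.
        setLatencySamples (engine->latencySamples());
    }

    void processBlock (juce::AudioBuffer<float>& buffer, juce::MidiBuffer&) override
    {
        // Input buffer -> RVC conversion -> output buffer back to the host.
        engine->process (buffer.getWritePointer (0), buffer.getNumSamples());
    }

    // Remaining AudioProcessor boilerplate, stubbed for brevity.
    void releaseResources() override                            {}
    const juce::String getName() const override                 { return "VoiceClone"; }
    bool acceptsMidi() const override                           { return false; }
    bool producesMidi() const override                          { return false; }
    double getTailLengthSeconds() const override                { return 0.0; }
    juce::AudioProcessorEditor* createEditor() override         { return nullptr; }
    bool hasEditor() const override                             { return false; }
    int getNumPrograms() override                               { return 1; }
    int getCurrentProgram() override                            { return 0; }
    void setCurrentProgram (int) override                       {}
    const juce::String getProgramName (int) override            { return {}; }
    void changeProgramName (int, const juce::String&) override  {}
    void getStateInformation (juce::MemoryBlock&) override      {}
    void setStateInformation (const void*, int) override        {}

private:
    std::unique_ptr<RvcEngine> engine;
};
```

Whether inference runs in the real-time callback or an offline render pass is left open here; what the DAW integration depends on is the handoff shape (buffer in, converted buffer out) plus the reported latency.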
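For distribution, a minimal sketch of how shipped checkpoints could be discovered at load time, assuming the installer drops one file per cast voice into a shared application-data folder; the folder names and the .pth extension are assumptions, not the shipped layout:

```cpp
#include <juce_core/juce_core.h>

juce::Array<juce::File> findVoiceCheckpoints()
{
    // Assumed install location; the real installer path may differ.
    auto dir = juce::File::getSpecialLocation (juce::File::commonApplicationDataDirectory)
                   .getChildFile ("VoiceClonePlugin")
                   .getChildFile ("checkpoints");

    // One checkpoint per voice; the resulting list populates the plugin's
    // voice menu, so engineers never touch the files directly.
    return dir.findChildFiles (juce::File::findFiles, false, "*.pth");
}
```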
result
  • Cross-DAW native plugin (VST3 + AU)
  • Custom voice checkpoints trained and shipped per voice in the dubbing cast
  • Output renders directly to the DAW timeline — no export/import cycle, frame-accurate
  • Dubbing engineers worked inside their session without touching Python, the model, or the checkpoint files directly
  • [need from you — film name (if shareable) or generic "feature film", number of voices trained, any quote from dubbing supervisor]
Related: open-source audio model training

The plugin is one shipped output of a wider audio-model practice. The same discipline — pick an open-source model, train or fine-tune it for a specific use, package it for the team that will actually use it — applies across voice and speech models:

  • VoxCPM LoRAs — custom voice/style LoRAs trained on ~120 hours of audio per character. The voices are under NDA; checkpoints live in a private Hugging Face repo. Same dataset → checkpoint → ship pipeline as the RVC work, just on the newer open speech model.
  • F5-TTS — Hindi fine-tuning. Open speech models are structurally under-resourced for Indic languages, so I fine-tuned F5-TTS on a Hindi corpus for a separate (also NDA-bound) project. The methodology is being open-sourced where the data permits; the fine-tuned weights stay private.

Both feed back into the dubbing pipeline and any future ComfyUI audio workflows. I treat image generation, voice conversion, and speech synthesis as one continuous practice — same training discipline, different model families.