I came across Retrieval-based Voice Conversion (RVC) over the weekend. It achieves pretty surprising results, transferring artists' voices on pop songs more or less arbitrarily. One would assume it requires a tremendous amount of data, but it looks like an hour or so of data already achieves good enough results. I can testify that with 1h of data and 10min of training, inference takes roughly 3s to replace the vocal part of a typical 4-min pop song.
One popular software choice is RVC WebUI, which is claimed to be very easy to use. Well, at least for me, it wasn't quite so. But after overcoming all the hurdles, it felt quite fulfilling to see the final output. I'm jotting down some notes on using the software, so that I can quickly pick it up in the future, and potentially help others interested in the technology play with it.
Goal: Replace artists' vocals with Huichao's voice :D
Right. This goal kept me motivated. To demonstrate how the result turns out, here is one resulting song (one of my son's favorites):
Environment
I have Windows 11 + WSL2 (Ubuntu 22.04), with an RTX 3090 video card which I recently bought :D. I have Docker installed, with proper NVIDIA and CUDA drivers in both Windows 11 and WSL. Native Linux should be fine too, although I didn't test it directly, as I'm only tinkering within the WSL2 system.
Installing the Software
This is the part where the documentation looks super easy, but in reality it takes some back-and-forth debugging to make sure the installation actually works. The process is intertwined with running the software, so I'll just sketch the important decisions I made.
Major decision 1: use Docker. I initially tried a virtual environment, installing the packages as documented in the readme page. There were many hiccups:
- Python version mismatch with my system. My system has python3.11 by default, but this package mostly assumes python3.8. It also requires installing certain extras such as `distutils`, `aria2c`, and `Python.h` (as in `python3.8-devel`).
- `numpy` version being too advanced for `librosa`. This error especially bothered me when I tried to split vocals from instruments within the software. You will see complaints that something like `float` is deprecated in `numpy`.
- Many useful models are not downloaded by default. The main readme says we just run all the `pip install` steps and we are done, but we really need some Hugging Face models to kickstart.
The Docker version, on the other hand, is much friendlier. Just run `docker compose up`, and it will build from the `Dockerfile`, which resolves most of the issues. I did, however, need to step in and roll back numpy's version: either add `numpy<1.24` to `requirements.txt` before running `docker compose up`, or, if it's already too late, `docker exec -it rvc /bin/bash`, then `pip install --upgrade "numpy<1.24"` (quoting the constraint so the shell doesn't treat `<` as a redirect). Though this doesn't resolve the `numpy` versioning issue everywhere, it did unblock me from at least running the vocal separation.
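As a quick sanity check of why the pin matters: `np.float` was deprecated in numpy 1.20 and removed in 1.24, which is exactly the attribute older audio libraries still reference. This snippet (plain numpy, no other assumptions) shows whether your installed version still has it:

```python
import numpy as np

# np.float was deprecated in numpy 1.20 and removed in 1.24; libraries
# that still reference it fail with an AttributeError on newer numpy.
major, minor = (int(x) for x in np.__version__.split(".")[:2])
removed = (major, minor) >= (1, 24)
print(np.__version__, "still has np.float:", hasattr(np, "float"))
```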
Major decision 2: train models on my own, instead of searching for other models online. I was hoping I could get started with some existing models, but it seems there are quite a few different implementations of the same RVC idea, and the model formats may differ. In the end I decided to just train my own, which actually isn't very hard.
Major decision 3: use the WebUI and RTFC for the file paths, configs, etc. Using the WebUI is kind of a personal taste – I didn't even try the Windows UI version. But the WebUI is cryptic – for example, where shall we place the dataset? Where shall we place the models? As of Apr 2024, the software has many configs hard-wired into the code, so some code searches are necessary to figure this out. Major findings are:
- `/app/datasets` should contain datasets.
- `/app/weights` contains models. It's not `/app/assets/weights` (I got stuck on this for a while).
- `/app/opt` is generally the working directory during a project.
Running the Software
Once the Web UI is running, it's then quite fun. Below are the high-level steps, illustrated by replacing a pop mp3's vocal track with my own voice.
- Train my own voice model. I put 1h of my own voice files into the `datasets` directory. In the `train` tab, enter the "experiment name", which will later become your model's name, enter the data folder in step 2a, then click the "one-click training" button. The GPU then ran at full speed, boosting my room temperature by 1 degree after 10 minutes (20 epochs). After switching back to the "model inference" tab and "refreshing", the inference voice becomes available.
- Split the pop song into instrument and vocal tracks. This happens in the second tab. I used `HP3_all_vocals` to do the job, and uploaded the song via the drag-and-drop box. Specifying the input directory somehow doesn't work for me, due to the missing `numpy.float` issue.
- Run inference. This is the most exciting part. Load the vocal track, load the model, and run inference; after a matter of seconds, the transferred sound is available.
- Mix things together to form the final piece of art:
  - Lower the BGM volume of the instrument track: `sox -v 0.2 instrument.flac bgm_quieter.flac`
  - Add reverb to my transferred vocals: `sox vocal.wav reverb.wav reverb`
  - Then merge the two tracks: `ffmpeg -y -i reverb.wav -i bgm_quieter.flac -filter_complex amix=inputs=2:duration=longest mixed.mp3`
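The attenuate-and-mix step can also be sketched in plain numpy, if you'd rather stay in Python. To be clear about assumptions: the arrays below are toy stand-ins for the decoded vocal and BGM tracks (in practice you'd load and save real audio with a library such as soundfile), and the gain plus zero-padding mirrors `sox -v 0.2` followed by amix's `duration=longest`:

```python
import numpy as np

def mix(vocal, bgm, bgm_gain=0.2):
    """Mix a vocal track with an attenuated backing track.

    Mirrors `sox -v 0.2` (apply gain) followed by mixing with
    duration=longest: pad the shorter signal with silence, then sum.
    """
    n = max(len(vocal), len(bgm))
    out = np.zeros(n, dtype=np.float32)
    out[: len(vocal)] += vocal
    out[: len(bgm)] += bgm_gain * bgm
    peak = np.abs(out).max()  # simple safety against clipping
    return out / peak if peak > 1.0 else out

# Toy signals at 8 kHz: 1 s of a 440 Hz sine "vocal", 1.5 s of noise "BGM".
sr = 8000
vocal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
bgm = np.random.default_rng(0).uniform(-1, 1, int(1.5 * sr)).astype(np.float32)
mixed = mix(vocal, bgm)
print(len(mixed))  # 12000 samples, the length of the longer input
```

The final peak normalization is my own addition; ffmpeg's amix does its own input scaling instead.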
Some example pieces of work:

- Yesterday was my son's birthday. Showing some love for him: https://www.youtube.com/watch?v=LbBV9rVIe9Q
Some tips:
- When training my voice model, using singing voice is much better than using reading voice. This should be pretty obvious, but it's definitely easier to gather reading voice. Luckily, I had gathered some of my own "art pieces" during the pandemic while trying to learn some piano :D I split the vocal part out using a similar technique as above, and used all the vocals as training data.
- Roughly ~10min of training time is probably needed to get a good voice model. If the dataset is too small, or the number of epochs too low, I often get popping sounds in the result.
Why does it work?
I dug around a little bit on why the software works. Below is my high-level understanding:
- The majority of the model weights are not produced during our 10min of training, but beforehand, with VITS. We are mainly leveraging an auto-encoder that can translate back and forth between a "semantic space" and the actual waveform of the voice signal. I understand the "semantic space" to be more or less the volume, the phonemes, and the characteristics of the speaker (e.g. a stronger voice produced by certain muscles, or resonance).
- The main "training" part of the software fine-tunes the auto-encoder's decoder, so that we can reliably translate from the semantic space to Huichao's waveform.
- We also build an "index" of Huichao's phonemes that appear in the training data, so that semantically closer phonemes are placed together. This leverages FAISS, so that retrieving the nearest neighbors can be done reasonably fast.
- Then, during transfer, we first map the input audio into the semantic space, phoneme by phoneme; for each phoneme, we find its nearest neighbors among Huichao's phonemes using the FAISS index; finally, we pass all these phonemes of Huichao's through the decoder, which produces the waveform.
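The retrieval step above can be sketched in a few lines of numpy. To be clear about assumptions: the random vectors below are a toy stand-in for Huichao's semantic features, a brute-force L2 search replaces the actual FAISS index (`IndexFlatL2` computes the same thing, just faster), and the 0.75 blending ratio is only an illustration of how a retrieved feature might be combined with the input:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy feature bank: 1000 frames of 256-dim "semantic" features, standing
# in for the features extracted from Huichao's training data.
bank = rng.normal(size=(1000, 256)).astype(np.float32)

# An input frame mapped into the same space: frame 123 plus a little noise.
query = bank[123] + 0.01 * rng.normal(size=256).astype(np.float32)

# Brute-force nearest-neighbor search over the bank.
dists = np.sum((bank - query) ** 2, axis=1)
nearest = int(np.argmin(dists))
print(nearest)  # 123: retrieval recovers the frame we perturbed

# Blend the retrieved feature with the input feature before decoding.
index_rate = 0.75
blended = index_rate * bank[nearest] + (1 - index_rate) * query
```

In the real software, the decoder then turns each frame's `blended` feature back into a waveform.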
Man, audio processing is super cool, with so many signal-processing terms that I don't quite understand, so I'm just hoping my overall understanding is headed in the right direction. I'm still amazed by how people leverage large models working in a semantic space (stable diffusion does this too) to reduce the need for heavy computation. The philosophy behind the retrieval-based approach is also quite cool: we don't let the model memorize everything, but instead leverage an external data-storage mechanism.