Trying out RVC for Voice Transfer

2024/04/07

I came across Retrieval-based Voice Conversion (RVC) over the weekend. It achieves pretty surprising results, transferring an arbitrary artist's voice onto pop songs. One would assume it requires a tremendous amount of data, but it looks like an hour or so of data already achieves good enough results. I can testify that with 1h of data and 10min of training, inference takes roughly 3s to replace the vocal part of a typical 4-min pop song.

One popular software choice is RVC WebUI, which is claimed to be very easy to use. Well, at least for me, it wasn't quite so. But after overcoming all the hurdles, it was quite fulfilling to see the final output. I'm jotting down some notes on using the software so that I can quickly pick it up again in the future, and to potentially help others who are interested in the technology play with it.

Goal: Replace artists' vocals with Huichao's voice :D

Right. This goal kept me motivated. To demonstrate how the result turns out, here is one resulting song (one of my son's favorites):

Environment

I have Windows 11 + WSL2 (Ubuntu 22.04), with an RTX 3090 video card which I recently bought :D. I have Docker installed, with proper Nvidia and CUDA drivers in both Windows 11 and WSL. Native Linux should be fine too, although I didn't test it directly, as I'm only tinkering within the WSL2 system.

Installing the Software

This is the part where the documentation looks super easy, but in reality it takes some back-and-forth debugging to make sure the installation actually works. The actual process is intertwined with running the software, so I'll just sketch the important decisions I made.

Major decision 1: use Docker. I initially tried setting up a virtual environment and installing the packages as documented in the readme page. There were many hiccups:

  1. Python version mismatch with my system. My system has python3.11 by default, but this package mostly assumes python3.8. It also requires installing extra packages such as distutils, aria2c, and the Python.h headers (i.e. python3.8-devel).

  2. numpy version being too new for librosa. This one especially bothered me when I tried to split vocals from instruments in the software. You will see complaints like "float is deprecated in numpy".

  3. Many needed models are not downloaded by default. The main readme file makes it sound like we just pip install everything and we are done, but we really need some pretrained Hugging Face models to kickstart the process.

The Docker version, on the other hand, is much more friendly. Just docker compose up, and it builds from the Dockerfile, which resolves most of the issues. I did, however, need to step in and roll back numpy's version, either by adding numpy<1.24 to requirements.txt before running docker compose up, or, if it's already too late, by running docker exec -it rvc /bin/bash and then pip install "numpy<1.24" inside the container. Though this doesn't resolve the numpy versioning issue everywhere, it did unblock me enough to at least run the vocal separation.
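
For context, the numpy issue boils down to numpy 1.24 removing the long-deprecated np.float alias, which older librosa releases still reference internally. Here is a minimal illustration (my own snippet, not RVC code):

```python
# numpy >= 1.24 removed the deprecated aliases np.float, np.int, np.bool.
# Older librosa releases still call np.float internally, which is why
# pinning numpy below 1.24 unblocks the vocal separation step.
import numpy as np

print(np.__version__)
value = np.float(1.0)  # AttributeError on numpy >= 1.24; works (with a warning) on older versions
```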

Major decision 2: train my own model instead of searching for models online. I was hoping I could get started with some pretrained models, but there seem to be quite a few different implementations of the same RVC idea, and the model formats may differ. In the end I decided to just train my own, which actually isn't very hard.

Major decision 3: use the WebUI and RTFC for file paths, configs, etc. Using the WebUI is kind of a personal taste; I didn't even try the Windows UI version. But the WebUI is cryptic. For example, where should we place the dataset? Where should we place the models? As of Apr 2024, the software has many configs hard-wired into the code, and some code searching is necessary to figure this out. Major findings:

  1. /app/datasets should contain datasets

  2. /app/weights contains models. It's not /app/assets/weights (I got a little stuck with this for a while)

  3. /app/opt is generally the work directory during a project.

Running the Software

Once the Web UI is running, it's quite fun. Below are the high-level steps.

(Flow diagram: /voice/rvc/flow.svg, illustrating how to replace a pop mp3's vocals with my own voice.)

  1. Train my own voice model. I put 1h of my own voice files into the datasets directory. In the train tab, enter the "experiment name", which will later become your model's name. Enter the data folder in step 2a, then click the "one-click training" button. The GPU then ran at full speed, boosting my room temperature by 1 degree after running for 10 minutes (20 epochs). After that, switching back to the "model inference" tab and clicking "refresh" makes the new voice available for inference.

  2. Split the pop song into instrumental and vocal tracks. This happens in the second tab. I used HP3_all_vocals to do the job and uploaded the song via the drag-and-drop box. Specifying an input directory somehow didn't work for me, due to the missing numpy.float issue mentioned above.

  3. Run inference. This is the most exciting part. Load the vocal track, load the model, run inference, and after a matter of seconds the transferred voice is available.

  4. Mix things together to form the final piece of art (a Python sketch of the same steps follows this list):

    1. Lower the volume of the instrumental BGM: sox -v 0.2 instrument.flac bgm_quieter.flac

    2. Add reverb to my transferred vocal: sox vocal.wav reverb.wav reverb

    3. Merge the two tracks: ffmpeg -y -i reverb.wav -i bgm_quieter.flac -filter_complex amix=inputs=2:duration=longest mixed.mp3
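
For reference, here is a rough Python sketch of the gain-and-mix steps above using pydub (pip install pydub, which needs ffmpeg). This is my own alternative to the sox/ffmpeg commands, not something the RVC tooling provides; the reverb step is still left to sox, since pydub has no built-in reverb effect.

```python
# Rough equivalent of the sox gain + ffmpeg amix steps above, using pydub.
from pydub import AudioSegment

vocal = AudioSegment.from_file("reverb.wav")      # transferred vocal, reverb already applied
bgm = AudioSegment.from_file("instrument.flac")   # separated instrumental track

# sox -v 0.2 applies a linear gain of 0.2, which is roughly -14 dB.
bgm_quieter = bgm.apply_gain(-14)

# Overlay the quieter BGM under the vocal; note the result keeps the
# length of the base segment (the vocal), unlike amix with duration=longest.
mixed = vocal.overlay(bgm_quieter)
mixed.export("mixed.mp3", format="mp3")
```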

Some example pieces of work

  1. Melody https://www.youtube.com/watch?v=yEtU5tzZMkY

  2. Yesterday was my son's birthday. Showing some love for him: https://www.youtube.com/watch?v=LbBV9rVIe9Q

Some tips:

  1. When training my voice model, using singing voice is much better than using reading voice. This should be pretty obvious, but it's definitely easier to gather reading voice. Luckily, I had collected some of my own "art pieces" during the pandemic while trying to learn some piano :D I split out the vocal parts using the same separation technique as above and used all the vocals as training data.

  2. Roughly 10min of training is probably needed to get a good voice model. If there is too little data, or the number of epochs is too low, I often get popping sounds in the result.

Why does it work?

I dug around a little bit to understand why the software works. Below is my high-level understanding:

  1. The majority of the model weights are not produced during our 10min training, but pretrained beforehand, based on VITS. We are mainly leveraging an auto-encoder that can translate back and forth between a "semantic space" and the actual waveform of the voice signal. I understand the "semantic space" to roughly encode the volume, the phonemes, and the characteristics of the person speaking (e.g. stronger voices produced by certain muscles, or resonance).

  2. The main "training" part of the software is fine-tuning the auto-encoder's decoder part, so that we can reliably translate from semantic space to Huichao's waveform.

  3. We also build an "index" of Huichao's phonemes that appear in the training data, so that semantically close phonemes are placed together. This is done with FAISS, so that retrieving nearest neighbors can be done reasonably fast.

  4. Then during transfer, we first map the input audio into the semantic space, phoneme by phoneme; for each one, we find the nearest neighbors among Huichao's phonemes using the FAISS index; finally, we pass these retrieved phonemes through the decoder, which produces the waveform. A rough sketch of this retrieval step follows this list.
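
To make the retrieval step concrete, here is a minimal FAISS sketch of how I picture it. The feature dimension, the number of neighbors, and the index_rate blending are my assumptions for illustration, not the exact RVC implementation:

```python
# Minimal sketch of the "retrieval" in Retrieval-based Voice Conversion.
# train_feats: semantic features extracted from Huichao's training audio
# query_feats: semantic features extracted from the input vocal track
import faiss
import numpy as np

d = 256                                                   # assumed feature dimension
train_feats = np.random.rand(10000, d).astype("float32")  # stand-in for real features
query_feats = np.random.rand(500, d).astype("float32")

# Built once at training time: an index over the target speaker's features.
index = faiss.IndexFlatL2(d)
index.add(train_feats)

# At inference time, pull each query frame towards its nearest neighbors
# from the training data; index_rate controls how strong the pull is.
k, index_rate = 8, 0.75
dists, ids = index.search(query_feats, k)        # (M, k) distances and ids
weights = 1.0 / np.maximum(dists, 1e-8)          # closer neighbors get more weight
weights /= weights.sum(axis=1, keepdims=True)
retrieved = (train_feats[ids] * weights[:, :, None]).sum(axis=1)
blended = index_rate * retrieved + (1.0 - index_rate) * query_feats
# "blended" is what gets fed to the fine-tuned decoder to produce the waveform.
```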

Man, audio processing is super cool, and it comes with so many signal processing terms I don't quite understand, so I'm just hoping my overall understanding is in the right direction. I'm still amazed by how people leverage large models working in a semantic space (stable diffusion does this as well) to reduce the need for heavy computation. The philosophy behind the retrieval-based approach is also quite cool: instead of letting the model memorize everything, we lean on an external data storage mechanism.