Generative model to improve audio quality

Will we be touching on techniques to improve audio quality, similar to creating super-resolution images? (Reducing unintended noise, increasing bitrate in a useful way, and other things people smarter than me would come up with.)

I imagine spoken word (audiobooks, podcasts, speeches, etc.) would be a pretty good application: making something that is 32 kbps and/or flooded with white noise actually worth listening to.

Has anyone attempted or read any studies on this?

This would be incredibly interesting to me!


It would be. We won’t cover this specifically, but I think the techniques we discuss would work pretty well in this application.


Could you use a high-quality audio file and introduce your own noise for this type of application? The thing I’m not sure about is how to generate the noise to add to the audio. Maybe just randomly distort it, or just delete some of the file. What I’m thinking is:

  1. Start out with high-quality audio
  2. Try to predict the next value to build up the dictionary (similar to word prediction)
  3. Your loss function could be weighted by how far out the piece of audio is.
    a. For example: the first 100 data points would be very important and then as you got farther from that point, it would lessen the penalty for being wrong.
  4. Optimize this function and hope it works.
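
The position-weighted loss in step 3 could be sketched roughly as below. This is only an illustration under assumptions that are not in the post: the function names, the exponential shape of the decay, and the `horizon` and `decay` defaults are all made up for the example (the post only says the first 100 points matter most and the penalty lessens after that).

```python
import math


def position_weights(n, horizon=100, decay=0.01):
    """Weight 1.0 for the first `horizon` samples, then exponential decay.

    `horizon` and `decay` are hypothetical knobs chosen for illustration.
    """
    return [1.0 if i < horizon else math.exp(-decay * (i - horizon))
            for i in range(n)]


def weighted_mse(predicted, target, horizon=100, decay=0.01):
    """Squared error averaged with weights that shrink for later samples."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    if not target:
        raise ValueError("sequences must not be empty")
    weights = position_weights(len(target), horizon, decay)
    total = sum(w * (p - t) ** 2
                for w, p, t in zip(weights, predicted, target))
    # Normalise by the weight sum so the value is comparable across lengths.
    return total / sum(weights)
```

With this, the same mistake costs more at sample 10 than at sample 290, which is the behaviour step 3a describes. Whether exponential decay is the right shape is an open choice; a linear ramp would fit the description equally well.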

I would think you would want domain-specific training here: if you are doing podcasts, train only on podcasts, or at least only on speech. For music, you may have to handle different musical styles separately (hard rock and folk will probably do better with separate training sets), unless you can get a big enough example set that your model can determine which type of audio it is being asked to optimize.

Yeah that’s what I was imagining. It’s all about creating the right kind of noise!
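
As a starting point for "creating noise", here is a minimal sketch that adds white Gaussian noise to a clean signal at a chosen signal-to-noise ratio, giving a (noisy, clean) training pair. The function name, the `snr_db` parameter, and the 440 Hz test tone are assumptions for the example; real recordings have much less uniform noise than this.

```python
import math
import random


def add_white_noise(clean, snr_db, seed=None):
    """Return a copy of `clean` with Gaussian noise at roughly `snr_db` dB SNR."""
    if not clean:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(x * x for x in clean) / len(clean)
    # SNR(dB) = 10 * log10(signal_power / noise_power), solved for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in clean]


# Stand-in for real audio: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * n / sample_rate)
         for n in range(sample_rate)]
noisy = add_white_noise(clean, snr_db=10, seed=0)
```

Varying `snr_db` per example would give the model a range of corruption levels rather than a single one.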


I’m quite interested in this topic too. I just read this article: it’s an easy read, and the structure of the network is well documented, including the optimizer and hyperparameters, so it could be a nice exercise to start from. However, this one is literally super-resolution: it upsamples a 4 kHz signal to 16 kHz, adding more points to the signal.
I personally am more interested in reconstructing a good signal from a 44 kHz low-bitrate MP3, so I don’t know if the architectures can be used as-is.
The good thing is that we can easily generate a big dataset by encoding a lot of audio in bad quality.
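
One way to produce the bad-quality side of such a dataset is to shell out to ffmpeg. The sketch below assumes ffmpeg is installed and was built with the `libmp3lame` encoder; the function names and the 32 kbps default are made up for the example.

```python
import subprocess


def build_mp3_command(src_path, dst_path, bitrate_kbps=32):
    """Build the ffmpeg argument list for a low-bitrate MP3 encode."""
    return [
        "ffmpeg",
        "-y",                        # overwrite dst_path if it exists
        "-i", src_path,              # high-quality source file
        "-codec:a", "libmp3lame",    # assumes this encoder is available
        "-b:a", f"{bitrate_kbps}k",  # target audio bitrate
        dst_path,
    ]


def encode_low_bitrate(src_path, dst_path, bitrate_kbps=32):
    """Run ffmpeg; raises CalledProcessError if the encode fails."""
    subprocess.run(build_mp3_command(src_path, dst_path, bitrate_kbps),
                   check=True)
```

Calling `encode_low_bitrate("clean.wav", "degraded.mp3")` on each file in a collection would give (degraded, clean) pairs. MP3 encoding typically adds a short delay or padding, so the decoded audio may need to be aligned with the original before it is used as a training pair.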

I wonder if it would be a good idea to make it an RNN by adding a loop over the whole network, so that it keeps track of the previously enhanced audio. The hidden state could maybe keep track of the voice/instrument tones and make things smoother.

I’m still very much a beginner, so I hope what I said makes sense :smiley: but it’s something I’d like to have a look at.


So would you have multiple recording devices of different quality to generate your data, or what would be your plan for generating the bad-quality audio data?

Yeah, that’s definitely the question.

Just downsampling or adding Gaussian noise or both would, I assume, not be enough.

You bring up an interesting idea though: playing the audio in its highest-quality format (preferably live) through a number of mics of varying quality, and generating audio in different encodings (both lossless and lossy) and bitrates, might do the trick?

But I barely know what I’m talking about.

I am also interested in this topic.
Any suggestions or references for approaching the problem? @jsonm, @gdc, sharing any progress on this would be helpful.