I saw this competition as well, interested to go for it?
I think WaveNet may be a little overkill for the general audio classification task, but would definitely be worth looking into for generative models. Spectrograms are still being used quite a bit, along with MFCCs which attempt to map frequencies to something more representative of human perception.
I’m really interested in multi-resolution spectral features in addition to the raw waveform itself, somehow combining RNNs and CNNs.
I really appreciate the input. Honestly, after I posted this I had kind of resolved to make a post about my project, come what may. Finishing it is much more important than profiting off it. But I think you’re definitely right. I think somebody stealing my idea is much more remote a possibility than I’m probably worried about. Thanks for the inspiration!
Do you all have
torchaudio installed? If so, what’s the best way to do it? I’m using Paperspace, and it was not included as part of the Fastai environment. For
torchvision, you can do a conda-install as described here by Jeremy. No such luck for
Am interested in music generation models. I’m a musician myself and would love to follow-up along this line.
I’m definitely interested in music generation models. I think it could be really cool to try the progressive GAN approach for music (on either audio, or MIDI)
This should work for you https://github.com/pytorch/audio
But it doesn’t support windows… not sure what to do here.
Thanks. I gave that a shot, but when I tried importing
torchaudio into a notebook, I got the following error:
Anywhere near Atlanta, If I may ask?
No, I’m in Savannah.
@andrewyip what’s your email? We should keep in touch about music generation models.
has anyone taken a crack at the freesound competition? or processing audio at all using the fast.ai library? It seems most people have generated STFT/FTT/MFCC spectrographs and run a CNN on the resulting jpgs, but I wanted to try to run a model on the resulting arrays themselves. Has anyone tried this yet?
Yes. 5 of us have a team doing it right now. I’m not quite sure what you mean by “run on the resulting arrays themselves” vs. a jpg. Everything is just tensor arrays by the time it gets to the model. If you want to join our team, I can invite you to the slack channel we have. We’re currently getting a little above the baseline. But we have loads of ideas, and I feel confident we can create a solid submission.
Sorry I should have been clearer. I’ve seen approaches that involve generating a spectrogram plot (of whatever length windows), converting that an image file, and then training a CNN on those images, the idea being the spectrograms will represent the content the same way a picture of a dog or cat would. The other way to use that data would be to use the numpy array generated by a call to, say, librosa.features.mfcc and using that as the input for a neural net, which sounds like your approach.
Given the interest in deep learning applied to music I thought I’d share this paper as it seems likely to be of interest and relevant:
A Universal Music Translation Network
Noam Mor, Lior Wolf, Adam Polyak, Yaniv Taigman
(Submitted on 21 May 2018 (v1), last revised 23 May 2018 (this version, v2))
We present a method for translating music across musical instruments, genres, and styles. This method is based on a multi-domain wavenet autoencoder, with a shared encoder and a disentangled latent space that is trained end-to-end on waveforms. Employing a diverse training dataset and large net capacity, the domain-independent encoder allows us to translate even from musical domains that were not seen during training. The method is unsupervised and does not rely on supervision in the form of matched samples between domains or musical transcriptions. We evaluate our method on NSynth, as well as on a dataset collected from professional musicians, and achieve convincing translations, even when translating from whistling, potentially enabling the creation of instrumental music by untrained humans.
Yeah we’re doing mfcc’s, and mel filterbanks, and just raw audio. And we’re trying them on various architectures, including CNN’s, RNN’s, and soon I’m hoping to do an all attention model like a Transformer network.