I’ve published two notebooks:
- SINGLE SPEC: one focused on tuning your transformations to be sure that they won’t mess up your original data
- MULTI SPEC: an example of generating multiple spectrograms from augmented data
It seems the kernel shows a 3d spectrogram, but that is really just a convenient visualization: x-y is time-frequency, and the z axis is amplitude, which is the same information as the colour in our 2d spectrograms… So no extra info is added by using 3d…
I can imagine a genuinely useful 3d spectrogram, though: multiple channels of sound along the z axis. Stereo, say, or more interestingly, spectrograms from a microphone array concatenated along the z axis for sound localization… I have thought about this before…
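For anyone who wants to convince themselves of that, here is a small sketch (librosa and matplotlib assumed; 'clip.wav' is a placeholder path) that renders the same spectrogram data both ways:

import numpy as np
import librosa
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

y, sr = librosa.load('clip.wav', sr=None)
S = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=512)), ref=np.max)

fig = plt.figure(figsize=(12, 4))

# 2d: amplitude encoded as colour
ax1 = fig.add_subplot(1, 2, 1)
ax1.imshow(S, origin='lower', aspect='auto', cmap='magma')
ax1.set(title='2d spectrogram', xlabel='time frame', ylabel='freq bin')

# 3d: exactly the same values, lifted onto the z axis
t, f = np.arange(S.shape[1]), np.arange(S.shape[0])
T, F = np.meshgrid(t, f)
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
ax2.plot_surface(T, F, S, cmap='magma')
ax2.set_title('3d spectrogram (same data)')
ax2.set_xlabel('time frame'); ax2.set_ylabel('freq bin'); ax2.set_zlabel('dB')
plt.show()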
I’d suggest using the tools/run-after-git-clone stuff in the fastai repo to avoid this. See the fastai docs for details. Also try the ReviewNB service - it’s great.
Totally agree, the 3d spectrogram is very much a “human” vis and not a different representation of the data.
Check out @ste’s branch of the fastai-audio repo for an example of using not quite a different representation, but a kind of multi-scale representation, (ab)using the fact that we’re really only ever training on tensors, whether we puny humans can see them or not.
I was thinking of another way of aggregating - and then visualising - the information in an audio clip besides a spectrogram, but it really does capture most of what’s there.
It would be interesting to experiment with different kinds of spectrogram (“raw” vs. mel, power vs. amp vs. db) and different values for the params (number of FFTs, number of bins, bin size…). Honestly we’re just trying to find what looks “good” to our (puny) human eyes; there’s no guarantee that the prettiest image does the best job of helping a NN discriminate. For my next experiment I want to try making really “high def” spectros to train on.
And also interesting to play with the effects of audio signal transforms vs spectrogram params on the accuracy of the network. For example, if you augment your data by downsampling, but keep your spectrogram #ffts & #num_mels constant, will the “image” presented to the network actually be substantially different?
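One quick way to poke at that question (a sketch assuming a recent torchaudio; 'clip.wav' is a placeholder path) is to build mel spectrograms of the original and a downsampled copy with identical n_fft / n_mels and compare what comes out:

import torchaudio
import torchaudio.transforms as T

sig, sr = torchaudio.load('clip.wav')            # shape [channels, samples]
sig = sig.mean(dim=0, keepdim=True)              # downmix to mono for simplicity

new_sr = 8000
sig_lo = T.Resample(orig_freq=sr, new_freq=new_sr)(sig)

n_fft, n_mels = 1024, 64
spec_hi = T.MelSpectrogram(sample_rate=sr, n_fft=n_fft, n_mels=n_mels)(sig)
spec_lo = T.MelSpectrogram(sample_rate=new_sr, n_fft=n_fft, n_mels=n_mels)(sig_lo)

# Same number of mel bins either way, but different time resolution and
# frequency coverage - so the resulting "image" is not identical.
print(spec_hi.shape, spec_lo.shape)              # [1, 64, frames_hi] vs [1, 64, frames_lo]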
We haven’t tackled normalisation yet, either; and that could cancel out some assumed audio transforms, eg if you add white noise to every sample, and then normalise, you’re basically removing the noise you added…
There’s a lot to be learned here - and now a Kaggle comp to learn it on
Awesome, I’ve been meaning to ask how you handle this workflow internally. We’ve just been using liberal branching + splatting + cleanup.
Hey Baz, if you check out the “doc_notebooks” branch on the main git repo you’ll find a few changes. I changed all the notebooks to use the public dataset, changed the exception handling, made a few other cleanups, optimised some of the transforms, fixed a pretty critical bug with the spectrogram transform step, etc.
We’ll merge this in with the cool stuff Zac and Stefano have been doing on Monday, but thought I’d let you know in case you’re playing with it over the weekend.
And the “baseline” demo workbook now gets 98.4% accuracy; maybe it will go even higher with your model layer modification!
We’re still actively working on this; you’ll be able to see Stefano has been testing ideas “manually” in his notebooks when we merge them (or it might already be there in a branch!), and the DataAugmentation notebook in the doc_notebooks branch has a slightly improved comparison helper. Ideally I think we’d want a pretty rich display widget that lets you hear the original & transformed audio, see the original & transformed waveforms, and see the final post-transform spectrogram that the network is actually seeing. It’s a little tricky, because the way we’ve handled the transform (using the wav-to-spectrogram as the final step) makes it hard to access the “transformed -1” state. We’re debating whether it’s best to change the way the AudioItem handles being transformed (e.g. adding a concept like “transform groups”) or change the way it __repr__s itself. I’m wondering whether it’s best to dig into Jupyter’s custom display() handling (particularly _repr_html_) to make a richer show. We’ve even thought of a sub-project to make an ipython-widget-based tool to help you test and select transforms on the fly! I think it’s definitely something to focus on: there is so much to experiment with that we’ll get high leverage from making tools that ease experimentation. Feel free to help!
FWIW, I did this last night, and it wasn’t good. It took ages to train because I had to use a tiny batch size, and was overall less accurate than using the relatively lo-res ones. So, not recommended.
There are many other variables to consider, as well; for example, the naive “pad to max” we use in the current demo notebook adds a LOT of zeros to the vast majority of samples, so doing a smarter uniformity selection would probably be advantageous (something like “pad from end by average length”). (I suspect the reason the higher-res spectros were worse is that the relative amount of 0 bins was higher.)
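Something along those lines might look like this (a hypothetical helper, not code from the repo):

import torch

def fix_length(sig: torch.Tensor, target_len: int) -> torch.Tensor:
    "Pad short signals with zeros at the end, trim long ones, to target_len samples."
    n = sig.shape[-1]
    if n >= target_len:
        return sig[..., :target_len]
    pad = torch.zeros(*sig.shape[:-1], target_len - n, dtype=sig.dtype)
    return torch.cat([sig, pad], dim=-1)

# e.g. pad/trim everything to the *average* length instead of the max:
# lengths = [s.shape[-1] for s in signals]
# target = int(sum(lengths) / len(lengths))
# signals = [fix_length(s, target) for s in signals]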
I’d also be interested to try progressive resizing - train the model on low-res spectros first, then generate higher and higher res ones to see if it made a difference. It would be interesting to do this at the audio level too (i.e. downsample to 8KHz first).
hey @kmo, of course! Please PR away.
Keep in mind though that we are trying to adhere as closely as possible to the recommended fastai workflow. There are probably things we’re doing, such as writing the code in notebooks, and aspects of our code style, that may break with typical software engineering practices and PEP 8.
But this has been a conscious decision – because the hope is that this work can be harmoniously brought into the course somehow. Hopefully that doesn’t discourage you from PR’ing and collaborating
Some ideas for possible audio augmentation from a new paper:
Adversarial data augmentation and Stratified data augmentation
import pyaudio
import wave

# recording parameters
FORMAT = pyaudio.paInt16           # 16-bit samples
CHANNELS = 2                       # stereo
RATE = 44100                       # sample rate (Hz)
CHUNK = 1024                       # frames per buffer
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"

audio = pyaudio.PyAudio()

# start recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE, input=True,
                    frames_per_buffer=CHUNK)
print("recording...")
frames = []
for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
    data = stream.read(CHUNK)
    frames.append(data)
print("finished recording")

# stop recording
stream.stop_stream()
stream.close()
audio.terminate()

# write the captured frames out as a WAV file
waveFile = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
waveFile.setnchannels(CHANNELS)
waveFile.setsampwidth(audio.get_sample_size(FORMAT))
waveFile.setframerate(RATE)
waveFile.writeframes(b''.join(frames))
waveFile.close()
How does your current audio work handle stereo wav files?
I was exploring https://www.kaggle.com/c/freesound-audio-tagging-2019 which seems like a good dataset to learn about multilabel audio classification.
Last weekend I found the older fastai-audio package someone was working on and tried to build a simple pipeline to classify this dataset.
I kept getting tensor size mismatches and suspect this is because the library didn’t handle stereo audio.
Actually we’re not supporting stereo audio - I’ll take a look at it later…
Btw there are multiple ways to cope with a “pair” of sounds instead of one:
What do you think is the best one?
Please also note the fastai style suggestions: style – fastai
These are all awesome questions.
Any audio experts have suggestions?
I understand that we’re using spectrograms because resnet is already trained on images, and we want to leverage that pretraining.
I personally wonder if we could get better results by working with raw waveforms. I know there’s been a lot of work with stuff like WaveNet, and recently I saw a neural vocoder [0] that seems to deal with audio in a fundamentally different way.
Also, when first approaching this problem I read [1], which suggested in its abstract that feeding in raw waveforms worked better than spectrograms; that was from 2009 and seems to have some… pretty important authors.
[0] https://gfx.cs.princeton.edu/pubs/Jin_2018_FAR/
[1] https://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks.pdf
Thread forming here discussing just that. There are a few good links to papers that model raw audio.
Re. stereo, I’m currently working on modifying the paradigm to use a “preprocessor” concept, like the fastai.text library, in order to ensure your model can take arbitrary audio inputs and have them normalised before you start applying your transforms. Supporting stereo data would be part of that - in the first instance I assume I would just downmix to mono if it detects a stereo input. Everything after that should “just work” once the LoadAudioData preprocessor has had its way with the inputs. We could later enable this option once the rest of the processes can handle stereo.
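For what it’s worth, the “downmix to mono if stereo is detected” step might look something like this (a sketch using torchaudio, not the actual preprocessor code):

import torchaudio

def load_as_mono(path):
    sig, sr = torchaudio.load(path)          # sig: [channels, samples]
    if sig.shape[0] > 1:                     # stereo (or more channels) detected
        sig = sig.mean(dim=0, keepdim=True)  # average the channels down to mono
    return sig, sr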
As for @ste’s question about how to “deal” with stereo data, I think the simplest way would be to store them as a rank 2 tensor where the stereo channels are handled akin to RGB channels in images, both in the signal space and the spectrogram space. So your input signal would be of shape [2, <length of file / sample rate>] and your spectrogram would have 2 channels instead of 1 (or 3 like an RGB image). This would take some tweaking to the display of spectrograms but that’s simple enough, just by cat’ing them together either in array-space or in plt.add space. Here’s what SOX does, for example.
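A sketch of what that rank 2 representation could look like (torchaudio assumed; 'stereo.wav' is a placeholder path):

import torchaudio
import torchaudio.transforms as T

sig, sr = torchaudio.load('stereo.wav')      # rank 2 tensor: [2, samples]
spec = T.MelSpectrogram(sample_rate=sr, n_fft=1024, n_mels=64)(sig)
print(sig.shape, spec.shape)                 # [2, n_samples] -> [2, 64, n_frames]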
I’d be very keen to explore working with raw waveforms, I think a lot of us want to try it out. For example you could “just” do convolutions with a kernel shape of [3,1] instead of [3,3] to start with… or have your first layer be [16000, 1] to “convolve” across a second at a time of 16KHz audio. I’d be keen to try this… as soon as the audio framework is “built enough”
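As a toy illustration of that idea (shapes only, nothing from the repo):

import torch
import torch.nn as nn

raw = torch.randn(8, 1, 16000)    # a batch of 8 one-second mono clips at 16kHz

small_kernel = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
whole_second = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=16000, stride=8000)

print(small_kernel(raw).shape)    # [8, 16, 16000]
print(whole_second(raw).shape)    # [8, 16, 1] - the kernel spans the entire second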
The twitter thread @MadeUpMasters pointed out will no doubt bring up some great material!
For discussion, here’s the “problem spec” I’m working with for the preprocessors. Let me know what you think!
- Training data on disk can be audio files of arbitrary & heterogeneous length, format, bitrate, bitdepth & channels.
- When creating an AudioDataBunch, the constituent AudioLists will contain AudioItems of user-declared, homogeneous length, bitrate & channels, and have a WAV-compatible signal.
- When performing inference, input can be of arbitrary length, format, bitrate, bitdepth & channels, and will be converted to the same length, format, bitrate, bitdepth & channels as the training data.

This will mean the AudioLists in the final AudioDataBunch could (& probably will) have >1 AudioItem per input file, and sum(len(AudioDataBunch.x) + len(AudioDataBunch.y)) > len(AudioList.from_folder()). Is this OK in principle? I think so. It probably means you’ll have to think about validation sets carefully - e.g. you probably wouldn’t want segments of the same audio file in both train & val sets.
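To make the “>1 AudioItem per input file” point concrete, here is a rough sketch of the kind of fixed-length splitting such a preprocessor might do (names here are illustrative, not the actual fastai-audio API):

import torch

def split_fixed_length(sig: torch.Tensor, sr: int, seconds: float = 1.0):
    "Yield consecutive fixed-length segments of a [channels, samples] signal."
    seg_len = int(sr * seconds)
    for start in range(0, sig.shape[-1], seg_len):
        seg = sig[..., start:start + seg_len]
        if seg.shape[-1] == seg_len:         # drop (or pad) the ragged tail
            yield seg

# A 30s file at 16kHz becomes 30 one-second items, so the databunch can easily
# end up with many more items than there were files on disk.
clip = torch.randn(1, 30 * 16000)
print(len(list(split_fixed_length(clip, sr=16000))))   # 30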
What are the implications of this inference model when you’re applying it in practice? Say the model is trained on 1sec snips, and you upload a 30sec recording to be classified: how does it classify that recording? It would have to break it up into 1sec snips to match the shape of the model’s inputs & training data, but there are many of those snips. What would that look like? I guess you could return “this clip is 96% Category A” depending on what % of snips got classified as A. I suspect the best representation of that will depend on the application, and as such we can probably leave that to require some custom implementation by whoever’s implementing the inferrer.
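For illustration, one way that per-snip aggregation could look (a sketch only; the model interface, n_classes, and the snip tensor layout are assumptions):

import torch

def aggregate_snips(model, snips: torch.Tensor, n_classes: int) -> torch.Tensor:
    "snips: [n_snips, channels, samples] -> fraction of snips predicted as each class."
    with torch.no_grad():
        preds = model(snips).argmax(dim=1)              # one label per 1sec snip
    counts = torch.bincount(preds, minlength=n_classes)
    return counts.float() / len(snips)                  # e.g. tensor([0.96, 0.04]) for 2 classes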
Some of those Preprocessors look a lot like things we’d also like to use as transforms e.g. padding, cutting, resampling; what’s the most “fastai-idiomatic” way to handle that, I wonder? Does it imply that preprocessors are the wrong way to think about this?
Blog comparing the resampling algorithms in scipy to resampy
I don’t think we use spectrograms just because we want to use them with pretrained resnets… Even long before DL, the spectrum was widely used for audio processing… That’s because audio is nothing but a mixture of sine waves at different frequencies… And what makes one sound different from another (piano vs guitar, saying “p” vs “v”, etc.) is nothing but a different mixture of audio frequencies, each with a different amplitude… So it is natural to compute the spectrum and classify according to the spectrum… It is the best feature extraction: when you look at it you can say, aha, this is a guitar…
Even the Princeton paper that you’ve cited uses a 1d conv in order to do the FFT, which means it generates a spectrum… The FFT can be expressed as a convolution in the time domain, so it is essentially the same as doing a 1d conv on raw audio… So I don’t think there is any better approach than putting your prism on the mixture of audio waves to see what its constituents are…
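To make that last point concrete, here is a small sketch (plain PyTorch, nothing from the repo) showing that an unwindowed STFT power spectrogram can indeed be computed as a strided 1d convolution with fixed sine/cosine kernels:

import math
import torch
import torch.nn.functional as F

n_fft, hop = 256, 128
n = torch.arange(n_fft).float()
k = torch.arange(n_fft // 2 + 1).float().unsqueeze(1)
cos_kernels = torch.cos(2 * math.pi * k * n / n_fft).unsqueeze(1)   # [freq, 1, n_fft]
sin_kernels = -torch.sin(2 * math.pi * k * n / n_fft).unsqueeze(1)

sig = torch.randn(1, 1, 16000)                          # [batch, 1, samples]
real = F.conv1d(sig, cos_kernels, stride=hop)           # real part of each DFT bin
imag = F.conv1d(sig, sin_kernels, stride=hop)           # imaginary part
power = real ** 2 + imag ** 2                           # power spectrogram, [1, freq, frames]

# This matches torch.stft with a rectangular window, up to framing/centering details.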
> As for @ste’s question about how to “deal” with stereo data, I think the simplest way would be to store them as a rank 2 tensor where the stereo channels are handled akin to RGB channels in images. Both in the signal space and the spectrogram space. So your input signal would be of shape [2, <length of file / sample rate>] and your spectrogram would have 2 channels instead of 1 (or 3 like an RGB image).
I’m curious about that.
If a stereo signal just contains audio position info via a timing shift or a volume difference, the 2 channels contain essentially the same info whenever the chunk of audio is longer than the time shift between the two channels (usually exactly the same info, I suppose?).
That makes me wonder whether that could be used as a way of doing data augmentation even on mono signals: apply a synthetic position shift and volume change to the mono channel to create a second one, and thus more data?
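Something like this, perhaps (a toy sketch; the shift and gain values are arbitrary placeholders):

import torch

def mono_to_fake_stereo(sig: torch.Tensor, shift_samples: int = 40, gain: float = 0.8) -> torch.Tensor:
    "sig: [1, samples] -> [2, samples], with a delayed, attenuated copy as the second channel."
    delayed = torch.roll(sig, shifts=shift_samples, dims=-1)
    delayed[..., :shift_samples] = 0.0        # zero out the wrapped-around samples
    return torch.cat([sig, gain * delayed], dim=0)

stereo = mono_to_fake_stereo(torch.randn(1, 16000))
print(stereo.shape)                           # torch.Size([2, 16000])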