And DCASE 2020 now. http://dcase.community/challenge2020/index
I am trying to run inference in a Kaggle kernel and I am stuck; there seems to be some incompatibility which I couldn’t find. In Colab it works, so I installed all the same dependencies just to be sure, but to no avail.
Code
learn = load_learner("/kaggle/input/deepfake")
test = AudioList.from_folder(audio_path, config=config)
learn.predict(test[-1])
/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
652 else: x,y = self.item ,0
653 if self.tfms or self.tfmargs:
--> 654 x = x.apply_tfms(self.tfms, **self.tfmargs)
655 if hasattr(self, 'tfms_y') and self.tfm_y and self.item is None:
656 y = y.apply_tfms(self.tfms_y, **{**self.tfmargs_y, 'do_resolve':False})
TypeError: apply_tfms() got an unexpected keyword argument 'size'
@scart97, any idea what would be causing this discrepancy?
I have no idea, we need more info to understand what is happening there. @much_learner could you please run the following code on both colab and kaggle to check if the packages installed are the same?
import fastai.utils; fastai.utils.show_install(1)
Also, a full copy of the stack trace would be useful.
I manually installed exact packages on Kaggle. Here they are. Also the full trace.
I thought it could be something with PyTorch, since after I hacked around size it complained that a tensor doesn’t have a pixel property, or something like that.
torchaudio and torchvision are the same too (0.4 and 0.5)
@scart97 Any idea? Does the inference part work at all? As @shruti_01 pointed out, audio_predict and audio_predict_all are broken in master too.
I am happy to help if you could hint where to dig.
Hey, great work with the library!
I started using it for phonemic classification of separate Fourier time bins. When studying the lib I noticed that you shadow the LabelList class in order to be able to preprocess audio files together with their labels. The preprocessing here involves downmixing, silence removal, etc.
I was wondering whether that couldn’t be achieved using PreProcessors instead? I figured this would be more in compliance with the fastai workflow. Alternatively, the config file could be processed in the .transform method. Are there reasons why it was not possible to do it this way?
Best regards
Yes, I totally agree, it would be more in line with the fastai workflow, and we did try to implement it with a PreProcessor initially, but there was some limitation that meant we couldn’t save state, I believe. Bear in mind that the v1 API is likely to be deprecated in the next few months in favour of the new fastai v2 based API.
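For anyone curious what that approach would have looked like, here is a rough sketch of a fastai v1 PreProcessor for audio. The class name and config handling are hypothetical, it is not the library’s actual implementation, and it glosses over the state-saving issue mentioned above.

from fastai.data_block import PreProcessor

# Hypothetical sketch of an audio PreProcessor, for illustration only.
class AudioPreProcessor(PreProcessor):
    def __init__(self, ds=None, config=None):
        self.config = config  # e.g. target sample rate, silence threshold

    def process_one(self, item):
        # Here one would downmix to mono, remove silence, resample, etc.
        return item

    def process(self, ds):
        # Called once on the whole ItemList; the processed items replace ds.items.
        ds.items = [self.process_one(item) for item in ds.items]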
Hello all, thank you for the excellent work on this module.
I have had success doing basic classification. I’m hoping for some technical help tackling a new type of problem.
I would like to feed slices of spectrograms through an LSTM and predict a corresponding float value. So if I had a 480-wide spectrogram and 240 float values, for the first 2-wide slice of the spectrogram I would like to predict the first float value. Then use that predicted float and the next slice to predict the next float, and so on.
Do I need to divide my labels into chunks which correspond to the chunks in my audio sequences?
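For illustration (not an answer from the thread), here is a minimal sketch in plain PyTorch of one way to pair 2-frame spectrogram slices with one float target per slice; the autoregressive variant described above would additionally concatenate the previous prediction to each slice. All names and shapes are hypothetical.

import torch
import torch.nn as nn

n_mels, width, slice_w = 64, 480, 2
spec = torch.randn(n_mels, width)          # one (n_mels, time) spectrogram
targets = torch.randn(width // slice_w)    # 240 float values, one per slice

# Split the time axis into 240 non-overlapping slices of width 2,
# then flatten each slice into a feature vector for the LSTM.
slices = spec.unfold(1, slice_w, slice_w)              # (n_mels, 240, slice_w)
slices = slices.permute(1, 0, 2).reshape(240, -1)      # (240, n_mels * slice_w)

class SliceLSTM(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):                  # x: (batch, seq_len, in_dim)
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)  # one float per time step

model = SliceLSTM(n_mels * slice_w)
preds = model(slices.unsqueeze(0))         # (1, 240)
loss = nn.functional.mse_loss(preds, targets.unsqueeze(0))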
I’m really interested in DL for audio too.
I’m trying to figure out why librosa and torchaudio give different spectrograms even though I use the same parameters for n_fft, hop_length, rate, and power.
I can’t seem to find the reason for this difference. Is it some sort of normalisation that’s different?
@MadeUpMasters, any hints?
Oh, and is the Telegram group alive?
I have been overwhelmed with a few projects and haven’t been working on audio the past 8 weeks, so sorry I haven’t been as active in the thread. Both the library and audio Telegram groups are still up, PM me and I’ll get you added. @kdorichev is the one most actively working on the library at the moment. He is doing a great job of organizing people to contribute.
About the spectrograms: are you using the exact same signal (read in by one library, either librosa or torchaudio, and then fed to the two distinct spectrogram functions)? They both have their own quirks with regards to how they read in audio and also how they normalize. I believe they mostly use the same underlying window functions and algorithms because (if it hasn’t changed in the past 2 months) torchaudio mostly delegates to librosa.
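For example, a quick way to compare how the two libraries load the same file (assuming a local file clip.wav; the exact defaults depend on the library versions):

import librosa
import torchaudio

# librosa.load resamples to 22050 Hz by default; sr=None keeps the native rate.
# It returns float32 samples scaled to [-1, 1].
y_lr, sr_lr = librosa.load("clip.wav", sr=None)

# torchaudio.load normalizes to [-1, 1] by default (torchaudio.load_wav does not),
# and returns a (channels, samples) tensor plus the native sample rate.
y_ta, sr_ta = torchaudio.load("clip.wav")

print(sr_lr, y_lr.min(), y_lr.max())
print(sr_ta, y_ta.min().item(), y_ta.max().item())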
Yes, I have used torchaudio to read in the same signal and then fed that to the melspectrogram functions followed by the amplitude-to-db functions of each library. The output looks very different!
torchaudio:
librosa:
If I just look at the outputs after the melspectrograms alone, they look very similar (identical, I think). So it has to be the amplitude-to-db conversion functions that are working differently.
Obviously, the torchaudio output performs way better for the deep learning model.
Can you include code used to generate both?
Sure @MadeUpMasters, see below. Note: the code defining the function get_x is the same in both cases up to the dashed line.
au2spec = torchaudio.transforms.MelSpectrogram(sample_rate=target_rate, n_fft=n_fft, hop_length=hop_length, n_mels=64)
ampli2db = torchaudio.transforms.AmplitudeToDB()

def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767                      # scale int16 range to [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, num_samples)
    #------------------------------------------------------#
    torch_x = torch.tensor(x)
    spec = au2spec(torch_x)               # mel spectrogram (torchaudio)
    spec = ampli2db(spec)                 # convert to dB (torchaudio)
    spec = spec.data.squeeze(0).numpy()
    spec = spec - spec.min()              # rescale to [0, 255] for an image
    spec = spec / spec.max() * 255
    return spec

Image.fromarray(get_x(wav_files[0]).astype(np.uint8))
yields
and
def get_x(path, target_rate=target_rate, num_samples=num_samples):
    x, rate = torchaudio.load_wav(path)
    if rate != target_rate:
        x = torchaudio.transforms.Resample(orig_freq=rate, new_freq=target_rate, resampling_method='sinc_interpolation')(x)
    x = x[0] / 32767                      # scale int16 range to [-1, 1]
    x = x.numpy()
    x = librosa.util.fix_length(x, num_samples)
    #------------------------------------------------------#
    spec = librosa.feature.melspectrogram(x, sr=target_rate, n_fft=n_fft, hop_length=140, n_mels=64)
    spec = librosa.amplitude_to_db(spec)  # convert to dB (librosa)
    spec = spec - spec.min()              # rescale to [0, 255] for an image
    spec = spec / spec.max() * 255
    return spec

Image.fromarray(get_x(wav_files[0]).astype(np.uint8))
yields
Any insights are welcome.
I don’t see why the results should be different, since torchaudio claims to be consistent with librosa’s algorithms everywhere in its source code.
I find this post really amazing! I recently wanted to implement an audio detector for badminton shots (i.e. detecting smashes, serves, etc. in a badminton match), but I found my problem differs slightly from the general tutorials available in the sense that:
- unlike usual audio samples, which last for a few seconds or minutes, badminton shots usually last for less than 1 second. Could deep learning models work well on samples with such short duration?
- I would like to detect those badminton shots in a match around 1 hour long, so it’s more like an audio detection problem than a classification problem. To make a simple first step, is it reasonable to work around this by chopping the 1-hour match audio into a bunch of short clips and doing classification on each clip?
So I figured it out. There are two different magnitude scales for spectrograms, amplitude (sometimes called magnitude) and power, and these are controlled by the power parameter. The power spectrogram is just the elementwise square of the amplitude spectrogram: if you take the square of every value in your amplitude spectrogram, you then have a “power” spectrogram.
How to properly convert to dB depends on which scale you’re using. Librosa uses two distinct functions, librosa.power_to_db and librosa.amplitude_to_db (a convenience function that squares the spectrogram and then delegates to librosa.power_to_db). Torchaudio only has one, the poorly named amplitude_to_db, which takes a string argument stype that tells it whether the incoming spectrogram is a power or magnitude spectrogram. Its default value is “power”, so unfortunately torchaudio.amplitude_to_db != librosa.amplitude_to_db.
torchaudio.amplitude_to_db by default is actually the same as librosa.power_to_db. Change librosa.amplitude_to_db(spec) to librosa.power_to_db(spec) and I think you’ll have roughly the same spectrograms.
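A small sanity check of that claim, as a sketch; exact agreement may depend on the library versions and on the ref/top_db defaults:

import numpy as np
import torch
import librosa
import torchaudio

# A toy mel *power* spectrogram standing in for a real one.
power_spec = np.random.rand(64, 100).astype(np.float32) + 1e-6

# torchaudio's AmplitudeToDB defaults to stype="power" ...
ta_db = torchaudio.transforms.AmplitudeToDB(stype="power")(torch.from_numpy(power_spec))

# ... so the matching librosa call is power_to_db, not amplitude_to_db.
lr_db = librosa.power_to_db(power_spec, ref=1.0)

# These should agree to within floating point error for this toy input.
print(np.allclose(ta_db.numpy(), lr_db, atol=1e-4))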
Absolutely. I’m not sure just how short a badminton sound is, but I’m confident there will be a way to do it. Try the normal spectrogram way first, but if that doesn’t work you may need to use something called wavelets, which are more useful than the discrete Fourier transform for detecting abrupt changes in sound (I’ve never actually used wavelet transforms myself on a project; a spectrogram will likely work, so don’t go down that rabbit hole until you have to).
Yes, that is a great idea to start with and see if classifying badminton sounds is feasible in the first place. If it is, then you can worry about identifying which sounds are badminton sounds. If your long clips have no noise other than the shots (unlikely), you can use regular silence removal (built into fastai v2 audio). If they do have other noise (cheering, human grunts), you will need another model that determines whether something is a badminton sound, but for now just manually make clips and start there.
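As a starting point, a rough sketch of that chopping step (the file name, window length, and overlap below are placeholders, not values from the thread):

import torchaudio

# Load the long recording and cut it into overlapping fixed-length windows.
waveform, sr = torchaudio.load("match.wav")      # placeholder path
win_sec, hop_sec = 1.0, 0.5                      # 1 s windows with 50% overlap
win, hop = int(win_sec * sr), int(hop_sec * sr)

# unfold gives a (channels, n_windows, win) view over the last dimension.
windows = waveform.unfold(-1, win, hop)
for i in range(windows.shape[1]):
    clip = windows[:, i, :]                      # one short clip to classify
    # spectrogram + classifier would go here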
Thanks @MadeUpMasters for checking this out and letting us know. However, I did try this out earlier but didn’t mention it, because I doubt that this is the reason.
Yes, the outputs are closer to one another than before, but still too different from each other for there to be no other difference, in my opinion (I can plot them next time I start up my machine and run these experiments).
Moreover, the results of training the same model on these different datasets (torchaudio.amplitude_to_db vs librosa.amplitude_to_db or librosa.power_to_db) are also vastly different (an order of magnitude difference). That is, there’s no real difference whether one uses librosa.amplitude_to_db(spec) or librosa.power_to_db(spec). I think this should be expected too, since just squaring the values of pixels should be a task that’s easy enough for the model to learn, right?
FYI, I’m using resnet and xresnet, which are sufficiently capable, I would think.
BTW, it occurred to me that my previous conclusion that the melspectrogram functions are not responsible / yield the same result might be false. Perhaps I just don’t see the difference in the image because the melspectrograms have all the interesting stuff happening in a very small region of the image, and the amplitude_to_db functions only expose this difference but may not be the reason for it.
That’s really interesting, I’ve never seen such a large difference in training results between the two. Do you have the full code on github? Are you using fastai, fastai2, pytorch or something else? Are you using either of our audio libraries? What’s the dataset?
Also, is there a reason you chose to normalize manually? There are already several points where the data gets normalized (torchaudio normalizes on audio load by default, and amplitude_to_db has a degree of normalization via the ref and top_db parameters), although I still don’t fully understand the impact this has on training audio models. Finally, why did you convert to uint8 at the end, just to fit more on the GPU?
For those still working with v1: I’ve created a notebook in the v1 repo that demonstrates how to do SpecAugment on the GPU, which increases training speed. There is potential to do a few of the other random transforms this way too.
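If you just want the flavor of it, torchaudio also ships masking transforms that can run on GPU tensors; a minimal sketch of SpecAugment-style masking (this is an illustration, not the notebook itself):

import torch
import torchaudio

# SpecAugment-style masking: zero out a random frequency band and a random time band.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)

device = "cuda" if torch.cuda.is_available() else "cpu"
# A stack of (n_mels, time) spectrograms; the same random mask is applied across the stack.
specs = torch.rand(8, 64, 400, device=device)
augmented = time_mask(freq_mask(specs))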