Deep Learning with Audio Thread

Hey! Nice job with this library!

I’m trying to use load_learner with a test set to make predictions, but I’m not able to.

I would like to do something like:

some_audio_item_list = AudioItemList.from_XXX()
learn = load_learner(path, model_name, test=some_audio_item_list)

Let me add that

some_audio_item_list.items

returns

array(['../input/test/000ccb97.wav', '../input/test/0012633b.wav', '../input/test/001ed5f1.wav',
       '../input/test/00294be0.wav', ..., '../input/test/41f86bc4.wav', '../input/test/4215309a.wav',
       '../input/test/4248d196.wav', '../input/test/42542036.wav'], dtype=object)

which is what I expect.

The error I have is:


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-270-e3b4c8a83f0f> in <module>()
      8     #learn = load_learner(DATA, 'models-sound/' + model, test=test)
      9     test_audios=AudioItemList.from_df(df=test_df, path=WORK, cols=2, using_librosa=True)
---> 10     learn = load_learner("../working", model, test=test_audios)
     11 
     12     preds, _ = learn.TTA(ds_type=DatasetType.Test)

/opt/conda/lib/python3.6/site-packages/fastai/basic_train.py in load_learner(path, file, test, **db_kwargs)
    593     model = state.pop('model')
    594     src = LabelLists.load_state(path, state.pop('data'))
--> 595     if test is not None: src.add_test(test)
    596     data = src.databunch(**db_kwargs)
    597     cb_state = state.pop('cb_state')

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in add_test(self, items, label)
    550         if isinstance(items, ItemList): items = self.valid.x.new(items.items, inner_df=items.inner_df).process()
    551         else: items = self.valid.x.new(items).process()
--> 552         self.test = self.valid.new(items, labels)
    553         return self
    554 

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in new(self, x, y, **kwargs)
    616     def new(self, x, y, **kwargs)->'LabelList':
    617         if isinstance(x, ItemList):
--> 618             return self.__class__(x, y, tfms=self.tfms, tfm_y=self.tfm_y, **self.tfmargs)
    619         else:
    620             return self.new(self.x.new(x, **kwargs), self.y.new(y, **kwargs)).process()

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in __init__(self, x, y, tfms, tfm_y, **kwargs)
    589         self.y.x = x
    590         self.item=None
--> 591         self.transform(tfms, **kwargs)
    592 
    593     def __len__(self)->int: return len(self.x) if self.item is None else 1

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in transform(self, tfms, tfm_y, **kwargs)
    707     def transform(self, tfms:TfmList, tfm_y:bool=None, **kwargs):
    708         "Set the `tfms` and `tfm_y` value to be applied to the inputs and targets."
--> 709         _check_kwargs(self.x, tfms, **kwargs)
    710         if tfm_y is None: tfm_y = self.tfm_y
    711         if tfm_y: _check_kwargs(self.y, tfms, **kwargs)

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in _check_kwargs(ds, tfms, **kwargs)
    578     if (tfms is None or len(tfms) == 0) and len(kwargs) == 0: return
    579     if len(ds.items) >= 1:
--> 580         x = ds[0]
    581         try: x.apply_tfms(tfms, **kwargs)
    582         except Exception as e:

/opt/conda/lib/python3.6/site-packages/fastai/data_block.py in __getitem__(self, idxs)
    107     def __getitem__(self,idxs:int)->Any:
    108         idxs = try_int(idxs)
--> 109         if isinstance(idxs, Integral): return self.get(idxs)
    110         else: return self.new(self.items[idxs], inner_df=index_row(self.inner_df, idxs))
    111 

<ipython-input-234-334a3dbe8fad> in get(self, i)
     81     def get(self, i):
     82         fn = super().get(i)
---> 83         return open_audio(fn, using_librosa=self.using_librosa, downsampling=self.downsampling)
     84 
     85 

<ipython-input-233-ef0065fe2480> in open_audio(fn, using_librosa, downsampling)
     73 def open_audio(fn, using_librosa:bool=True, downsampling=8000):
     74     if using_librosa:
---> 75         x, sr = librosa.core.load(fn, sr=None, mono=False)
     76 
     77     else:

/opt/conda/lib/python3.6/site-packages/librosa/core/audio.py in load(path, sr, mono, offset, duration, dtype, res_type)
    117 
    118     y = []
--> 119     with audioread.audio_open(os.path.realpath(path)) as input_file:
    120         sr_native = input_file.samplerate
    121         n_channels = input_file.channels

/opt/conda/lib/python3.6/posixpath.py in realpath(filename)
    385     """Return the canonical path of the specified filename, eliminating any
    386 symbolic links encountered in the path."""
--> 387     filename = os.fspath(filename)
    388     path, ok = _joinrealpath(filename[:0], filename, {})
    389     return abspath(path)

TypeError: expected str, bytes or os.PathLike object, not numpy.int64

Any help is appreciated!

It looks like it’s expecting a list of filenames but it’s getting a numpy array of int64. Perhaps pass it some_audio_item_list.items instead of some_audio_item_list (see the sketch below), or just bypass the creation of some_audio_item_list entirely.
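Something like this, perhaps (untested; it just applies the suggestion above to the call from your traceback):

learn = load_learner("../working", model, test=test_audios.items)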

Looks like a cool tools roundup here; can we add it to the wiki? https://github.com/faroit/awesome-python-scientific-audio

3 Likes

Hello everyone, I have this kind of ECG data now (numpy format instead of mp3).
[image: IMG_1354, multi-channel ECG line plots]
I want to convert it to spectrograms for training. What should I do?
Thanks

1 Like

Roughly: subclass AudioList and override the “open” method to support the numpy file format.
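Something along these lines, perhaps (a rough sketch; NpyAudioList, AudioItem and the sample_rate attribute are assumptions about the module’s API, not its confirmed interface):

import numpy as np

class NpyAudioList(AudioList):
    def open(self, fn):
        sig = np.load(fn).astype(np.float32)      # load the raw signal from a .npy file
        return AudioItem(sig, self.sample_rate)   # wrap it the way a decoded clip would be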

2 Likes

You can also pass numpy arrays into many librosa functions; you just need to provide the sample rate as well. So you could compute spectrograms on your array, then use those with the other code used for audio. You most likely don’t want a mel spectrogram, though. The mel scaling is based on the properties of human hearing, which there is no reason to think are useful in your case. But you should be able to apply many of the techniques here to non-mel-weighted spectrograms, and librosa can produce those. Though then you may have a much larger amount of frequency data to deal with, so you’ll probably want to set the minimum and maximum frequencies of your spectrograms to something appropriate to ECG.
Another issue you are likely to find is the problem of dealing with the various channels of ECG data. That’s quite different from audio, where generally only 1 channel is used, and at most the 2 of stereo audio files.
So you may find that much of the stuff here doesn’t necessarily help. I’d imagine you can still use the basic idea of converting the time-series data to the frequency domain (spectrograms), then creating an image from that and using existing image models. Perhaps create a spectrogram per channel, then concatenate the resulting arrays along the height axis to create a single image (sketched below). Something like your sample image, but replacing those time-domain line plots with spectrograms. You could also use each ECG channel as a separate channel in the image, but pre-trained image models want 3 image channels, so you’d have to deal with that. I think Jeremy talked about this in one of the lectures in Part 1 when considering how to deal with 4 channels in some dataset, maybe satellite data. He suggested some things you might apply to extending pre-trained 3-channel image models to the 12 you have (or however many; there are 12 in that image).
I’d also be aware that the multi-channel nature might make a lot of the processing here not applicable. I’d imagine you would, for instance, want to take the multiple channels into account when normalising your data. Similarly, applying augmentations without taking the multi-channel nature into account may not work well. Though on the other hand it might be fine.
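For example, here is a minimal sketch of the per-channel spectrogram idea. The file name, the (n_channels, n_samples) layout and the 500 Hz sample rate are assumptions about the ECG array, and n_fft/hop_length would need tuning:

import numpy as np
import librosa

ecg = np.load("ecg_record.npy")   # hypothetical file, shape (12, n_samples)
sr = 500                          # assumed ECG sampling rate in Hz

channel_specs = []
for ch in ecg:
    # plain (non-mel) spectrogram per channel, converted to dB
    stft = librosa.stft(ch.astype(np.float32), n_fft=256, hop_length=64)
    channel_specs.append(librosa.amplitude_to_db(np.abs(stft)))

# stack channels along the frequency (height) axis to form one 2D "image"
image = np.concatenate(channel_specs, axis=0)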

3 Likes

Please check out https://github.com/StatisticDean/Crystal_Clear, thanks to StatisticDean. I think this is an implementation of the idea from lesson 7.

1 Like

Wiki overhaul completed!

1 Like

Hey, sorry for the delay in replying to this. I find your article really fascinating. I think the best topics to write on are the ones you’re interested in and working on, so I’d say just write up whatever you think is coolest and share it! But one thing that is currently missing is a thorough intro to audio. It’s a bit overwhelming coming in and seeing spectrograms, mel spectrograms and log mel spectrograms, not to mention all the various feature extractions like MFCC, and all the params they take, like min/max freq (different for speech vs. music/sound), hop_length, n_fft, number of mel bins, etc. Fine-tuning those parameters for your application and clip length does appear to matter, and it’s a possibly intractable problem to just provide all “audio” users with good defaults. One option is instead to have defaults organized around application: passing in “speech” would use a log mel spectrogram with the frequency range of human speech, and so on (a rough sketch of this idea is at the end of this post).

Anyways I’m getting off track, but I think a very comprehensive intro to audio could be really useful. Something in the fastai style, top-down and ready to use on several problems, with lots of detail and stuff a new user won’t understand on the first pass, but will gradually come to understand with time. This is something I’d like to work on but could definitely see it being a collaborative effort between various regular posters in the thread. You seem to have more domain expertise so let me know if it’s something you’d be interested in helping with (even if that means just reading over it and correcting mistakes/misconceptions we have). Cheers.
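To make the “defaults by application” idea concrete, here is a rough sketch; the parameter values are illustrative guesses, not tuned recommendations:

import librosa

SPEC_DEFAULTS = {
    "speech": dict(n_fft=512,  hop_length=128, n_mels=64,  fmin=50,  fmax=8000),
    "music":  dict(n_fft=2048, hop_length=512, n_mels=128, fmin=20,  fmax=16000),
}

def app_melspectrogram(y, sr, application="speech"):
    # log mel spectrogram computed with per-application defaults
    mel = librosa.feature.melspectrogram(y=y, sr=sr, **SPEC_DEFAULTS[application])
    return librosa.power_to_db(mel)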

4 Likes

Love the idea of “defaults by application” like speech, music, etc.

Btw: Fast.ai audio should be part of version 1.2!

https://forums.fast.ai/t/fastai-v1-2-plans-based-on-part2-v3-notebooks/45367/4?u=ste

3 Likes

Hi all,

I’ve been using fast.ai for language classification from audio signals using mel-spectrograms as my inputs. It works well as long as all of my audio data from different languages is from the same dataset (therefore in the same format). When I train on language A from dataset 1, and validate on language A from dataset 2, the accuracy for that language drops significantly, and instead the network guesses that a lot of the signals from language A are a different language from dataset 2.

This leads me to believe the network is first picking up on features that determine which dataset the audio came from, and then classifying the language as a secondary feature.

I tried reformatting the data from each dataset so that it would contain the same amount of information. Dataset 1 (Voxforge) consisted of wav files and dataset 2 (Mozilla Common Voice) consisted of mp3 data, so I converted dataset 1 to mp3 data that had the same sampling rate and bits per sample as dataset 2.

I’m thinking there must be some other encoding artifact from the datasets that is throwing my network off. Has anyone else had issues with “normalizing” data across datasets, or any advice about converting encodings so that the encoded information is of equivalent quality? (One way to force a uniform decode path is sketched below.)
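A minimal sketch of that uniform-decode idea (the 16 kHz target rate and per-clip peak normalization are my assumptions, not what the post above describes):

import numpy as np
import librosa

def load_uniform(fn, sr=16000):
    # decode every clip (wav or mp3) with the same loader and resample to one rate
    y, _ = librosa.load(fn, sr=sr, mono=True)
    # peak-normalize each clip so the amplitude scale is comparable across datasets
    return y / (np.max(np.abs(y)) + 1e-9)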

2 Likes

Bioacoustics defaults too, please!

P.S. What does the forum post say, I cannot see it :slightly_frowning_face:

2 Likes

New conference presentation on using CNNs for bioacoustic audio classification.
The authors reported on the augmentations they used; good info to consider for fast.ai audio (a rough sketch of the first two is below):

  1. "Changed the amplitude of the spectrograms in range −6 dB to +3 dB"

Result: weak calls detected better.

  2. The pitch was changed by a factor in the range of [0.5, 1.5] and the length of the signal was stretched by a factor between [0.5, 2].

Result: better generalization with new data.

  3. Added characteristic noise.

Result: ???

Overall result on the call classification task:

mean test accuracy of 87% after training for 72 epochs
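A rough sketch of the first two augmentations (my own illustration, not the authors’ code; y is a mono waveform, spec_db a dB-scaled spectrogram, and the ranges come from the summary above):

import numpy as np
import librosa

def random_gain_db(spec_db, low=-6.0, high=3.0):
    # shift the amplitude of a dB-scaled spectrogram by a random offset
    return spec_db + np.random.uniform(low, high)

def random_pitch_and_stretch(y, sr):
    pitch_factor = np.random.uniform(0.5, 1.5)    # pitch change factor
    n_steps = 12 * np.log2(pitch_factor)          # convert the factor to semitones
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    rate = np.random.uniform(0.5, 2.0)            # time-stretch factor
    return librosa.effects.time_stretch(y, rate=rate)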

1 Like

The post just says that Jeremy wants to include the audio library in the next fastai library release :slight_smile: In the next month or two.

5 Likes

Sorry if it’s a dumb question, but has anybody faced issues with _torch_sox?

I am not able to import _torch_sox.

Are you using the fastai audio module or trying to import it directly? Can you post the full error message?

Trying to use the fastai audio module; maybe the error below is the problem.

Running on RHEL 3.10, torch version is 1.0.

sh-4.2$ python setup.py install
running install
running bdist_egg
running egg_info
writing torchaudio.egg-info/PKG-INFO
writing dependency_links to torchaudio.egg-info/dependency_links.txt
writing requirements to torchaudio.egg-info/requires.txt
writing top-level names to torchaudio.egg-info/top_level.txt
reading manifest file 'torchaudio.egg-info/SOURCES.txt'
writing manifest file 'torchaudio.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
error: [Errno 2] No such file or directory: 'which': 'which'

Please suggest if you have seen this before.

I’m wondering if the behavior related to the unknown class is a function of using softmax.

During Part 2 it was mentioned that softmax always wants to elevate one feature. In a dataset that doesn’t always conform to the labels, or where more than one object might be present, a binomial loss function might be better (a small illustration is below).

Just a guess.
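A small illustration of the softmax point (my own example, not from the thread): softmax makes the scores compete and sum to 1, so even a clip with weak evidence for every class gets one confident-looking winner, while per-class sigmoids (as used with a binomial / binary cross-entropy loss) can all stay low.

import torch
import torch.nn.functional as F

logits = torch.tensor([[-2.0, -1.5, -1.8]])   # weak evidence for every class

print(F.softmax(logits, dim=1))   # roughly [0.26, 0.43, 0.32]: one class still "wins"
print(torch.sigmoid(logits))      # roughly [0.12, 0.18, 0.14]: all classes can look absent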

I’ve made a post related to Siamese networks, but I’m using it for audio-related stuff.

3 Likes

Are you running the install.sh file?