Weird behavior with fastai.core.parallel and 7z on large dataset?

MadeUpMasters · March 15, 2019, 10:28pm

I’ve had some weird stuff happening while running my model on the test set for the TensorFlow Speech Recognition Challenge on Kaggle. The test set is ~158,000 one-second WAV files. I ran into two issues that I was hoping someone could shed some light on.

Extracting with 7z in the terminal of my Paperspace Gradient notebook, using the command 7za x test.7z, takes too long to be feasible. It was headed for 18 hours, with the extraction slowing down as it went. I ended up downloading the dataset, extracting it locally, recompressing it as tar.gz, uploading it, and extracting it in around an hour. Is 7z just horrible on Linux? I’ve seen Jeremy use it in the planet notebook from lesson 3. What could be causing this?
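For anyone who wants to try the same workaround, here is a minimal sketch of the tar.gz round trip using Python's standard tarfile module. It is not the exact commands I ran; the helper names pack_dir and unpack and the example paths are made up for illustration.

```python
import tarfile
from pathlib import Path


def pack_dir(src_dir, archive_path):
    """Write src_dir and everything under it into a gzip-compressed tar archive."""
    src_dir = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the absolute path
        tar.add(src_dir, arcname=src_dir.name)


def unpack(archive_path, dst_dir):
    """Extract a gzip-compressed tar archive into dst_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dst_dir)
```

Locally this would be something like pack_dir("test/audio", "test.tar.gz"), and on the remote machine unpack("test.tar.gz", "data"), with both paths being placeholders.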
Secondly, I generated the spectrograms in parallel using fastai’s awesome built-in parallel function. According to the progress bar it worked and ran over all 158,000 files (in 3 hours or so), but when I tried len(os.listdir(spectrogram_path)) I found only around 50,000 files. About ten minutes later I ran it again and there were a few thousand more. The call had iterated over every file and returned, and the notebook where I executed the code was responsive, but somehow the writing of the spectrograms seems to have bottlenecked and is still unfolding: every time I check, a few more images appear. Is this a side effect of parallel? Is it possible I did something wrong? Could it be Paperspace/Linux?
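A more precise check than counting files is to list which audio files still have no spectrogram. This is a minimal sketch, assuming the output naming used in my code (the original file name plus ".png", e.g. clip.wav.png); the function name missing_spectrograms is made up for illustration.

```python
from pathlib import Path


def missing_spectrograms(audio_dir, spec_dir):
    """Return the sorted names of audio files with no '<name>.png' in spec_dir."""
    audio_names = {p.name for p in Path(audio_dir).iterdir() if p.is_file()}
    # Strip the trailing '.png' to recover the audio file name each image came from
    done = {
        p.name[: -len(".png")]
        for p in Path(spec_dir).iterdir()
        if p.name.endswith(".png")
    }
    return sorted(audio_names - done)
```

Running this a few minutes apart shows whether the missing set is actually shrinking, and the returned list can be fed back into the generation step if some files never appear.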

Some code is included below. I had to make a special version of gen_spec to work with parallel. Most of the gen_spec code comes from @kmo in our FastAI Forums: Deep Learning with Audio Thread.

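from functools import partial
from pathlib import Path
import os

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from fastai.core import parallel

def gen_spec_parallel(fname: str, index: int=0, src_path: str="", dst_path: str=""):
    # index is the second positional argument because fastai v1's parallel calls func(item, index)
    y, sr = librosa.load(Path(src_path)/fname)
    
    n_fft = 1024
    hop_length = 512
    n_mels = 128
    fmin = 20
    fmax = 8000

    # power=1.0 gives a magnitude (not power) mel spectrogram
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length,
                                       n_mels=n_mels, power=1.0,
                                       fmin=fmin, fmax=fmax)

    # 2.24 in at 100 dpi gives a 224x224 px image with no axes or padding
    plt.figure(figsize=(2.24, 2.24))
    plt.axis('off')
    plt.axes([0., 0., 1., 1.], frameon=False, xticks=[], yticks=[])

    librosa.display.specshow(librosa.power_to_db(S, ref=np.max), y_axis='mel', x_axis='time')

    # Output is named after the input file, e.g. clip.wav -> clip.wav.png
    save_path = f'{Path(dst_path)/fname}.png'
    plt.savefig(save_path, bbox_inches=None, pad_inches=0, dpi=100)
    plt.close()

# path_test_audio and path_test_spectrogram are Path objects defined earlier in the notebook
gen_spec_partial = partial(gen_spec_parallel, src_path=path_test_audio,
                           dst_path=path_test_spectrogram)

parallel(gen_spec_partial, os.listdir(path_test_audio)[27013:])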
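One way to rule out workers failing silently is to run the same function through the standard library's concurrent.futures and keep every result and exception. This is only a diagnostic sketch, not how fastai's parallel is implemented; run_and_collect is a made-up name, and it assumes the worker takes (item, index) like gen_spec_parallel does.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_and_collect(func, items, executor=None):
    """Run func(item, index) for every item; return (results, failures) keyed by item."""
    own_executor = executor is None
    ex = ProcessPoolExecutor() if own_executor else executor
    results, failures = {}, {}
    try:
        # Map each future back to its item so failures can be reported by file name
        futures = {ex.submit(func, item, i): item for i, item in enumerate(items)}
        for fut in as_completed(futures):
            item = futures[fut]
            try:
                results[item] = fut.result()
            except Exception as exc:
                # Keep the exception instead of letting it vanish with the worker
                failures[item] = exc
    finally:
        if own_executor:
            ex.shutdown()
    return results, failures
```

Calling run_and_collect(gen_spec_partial, file_names) on a small slice and inspecting failures should show whether any files raise during generation, as opposed to the images simply showing up late on disk.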