Single CPU `dataloaders`

Hi All,

I'm following the example 10_nlp notebook and running into a significant bottleneck when creating the DataBlock. Looking at top from the command line, only one CPU is being utilized after the progress bar completes, and I would appreciate recommendations for parallelizing. Processing 100k rows takes ~1min, and 500k rows takes ~5min. A row extract is below, formatted as CSV with “label” as the category and “text” as the input text:

Source Data
label, text
…, …

For context, below is the instance configuration:

VM Resources (I acknowledge this is overkill)
CPUs: 64
Memory: 240GB
GPUs: 4 Tesla P4s

With the dataloaders code below:

Dataloaders Code Block (note this is using TextBlock.from_df)

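dls_lm = DataBlock(
        blocks=TextBlock.from_df('text', is_lm=True),
        get_x=ColReader('text'),        # assumed: reads the tokenized 'text' column
        splitter=RandomSplitter(0.1)    # assumed: the split is not shown in the original snippet
        ).dataloaders(df, bs=128, seq_len=72)  # assumed: df is the DataFrame loaded from the CSV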

Any recommendations to accelerate performance would be appreciated.

[Update] - resolved
Running over the full 22M records takes 3.5hrs to load the DataBlock. I then ran into a CUDA out of memory error when training the model and reduced the batch size to 16 to mitigate it.
Resolution - reducing the batch size worked
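
As a rough sanity check on the timings in this thread: the two small runs (~1min for 100k rows, ~5min for 500k rows) suggest roughly linear scaling. Below is a minimal sketch, assuming that linear trend simply continues up to 22M rows. The helper name estimate_minutes is made up for illustration and is not part of fastai.

```python
def estimate_minutes(rows, ref_rows=100_000, ref_minutes=1.0):
    """Linearly extrapolate processing time from one reference measurement."""
    return rows / ref_rows * ref_minutes

# Approximate timings reported in this thread.
medium = estimate_minutes(500_000)    # reported: ~5 min
full = estimate_minutes(22_000_000)   # reported: ~3.5 hrs

print(f"500k rows: ~{medium:.0f} min")
print(f"22M rows:  ~{full:.0f} min (~{full / 60:.1f} hrs)")
```

The estimate for 22M rows comes out at about 220 minutes (about 3.7 hours), which is close to the 3.5hrs observed. That suggests the step scales about linearly with row count rather than degrading at larger sizes.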

[Update] - resolved
While the progress bar is visible, learn.fit_one_cycle() is using only a single thread, with an estimated completion time of 11 hours for 22M records on the resources listed above. During the progress bar phase for the DataBlock, multiple threads were visible in top. Note that torch.cuda.current_device() returns 0; I am connected to the GPU. Code snippet below:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

learn.fit_one_cycle(1, 2e-2)

Resolution - running watch -n 1 nvidia-smi shows one of the four GPUs being used (see the Working with GPU documentation for more information)

Only one GPU is being used; recommendations for activating all four would be appreciated.
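
For anyone wanting to repeat the nvidia-smi check from a script rather than watching the terminal, here is a minimal sketch. It assumes nvidia-smi is on the PATH and that the driver reports integer utilization values; the function names and the 5% threshold are made up for illustration. It only reports which GPUs are busy and does not change how training is distributed.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, utilization (%), memory used (MiB).
QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def _to_int(field):
    """Convert a CSV field to int; treat unsupported values (e.g. '[N/A]') as 0."""
    try:
        return int(field.strip())
    except ValueError:
        return 0

def parse_gpu_usage(csv_text):
    """Turn nvidia-smi CSV rows into (index, util_percent, mem_used_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = line.split(",")
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        rows.append(tuple(_to_int(f) for f in fields))
    return rows

def busy_gpus(rows, util_threshold=5):
    """Return the indices of GPUs whose utilization exceeds the threshold."""
    return [index for index, util, _ in rows if util > util_threshold]

if __name__ == "__main__":
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        out = ""  # nvidia-smi is missing or failed; report nothing
    rows = parse_gpu_usage(out)
    print(f"{len(busy_gpus(rows))} of {len(rows)} GPUs busy: {busy_gpus(rows)}")
```

Run while learn.fit_one_cycle() is training, this would list a single busy index if the situation described above holds.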