Slow dataloader with character tokenizer

Hello,
I am building a binary text classifier on amino acid sequences of about 70 characters using fastai 2.2.7.

I have built the data block this way:

from fastai.text.all import *

# One token per amino acid; the vocab is built from the fixed 20-letter alphabet
AMINO_ALPHABET = list('ARNDCQEGHILKMFPSTWYV')
VOCAB = make_vocab(Counter(AMINO_ALPHABET), min_freq=0)
data_block = DataBlock(
    blocks=(TextBlock.from_df('text', rules=[], tok=BaseTokenizer(), vocab=VOCAB), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.1))

Then the data loader this way:
dl = data_block.dataloaders(df_text, bs=64, seq_len=72)

df_text is a pandas dataframe (50k rows) with a text column holding an amino acid sequence in which each amino acid is separated by a space (e.g. “A N D C Q E G H”), and a target column which is either 0 or 1.

When running the above line, which generates the dataloader, I see the progress bar for about 2-3 seconds and then the code runs silently for about 50 seconds.

That is way too slow for a 50k-row dataset.

What am I missing?

Thanks a lot!


This is what I get from running the profiler as follows:

import cProfile, pstats
from pstats import SortKey

cProfile.run('data_block.dataloaders(df_text, bs=64, seq_len=72)', 'restats')
p = pstats.Stats('restats')
p.sort_stats(SortKey.CUMULATIVE).print_stats()

The result is:
119497097 function calls (115645684 primitive calls) in 79.873 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   79.873   79.873 {built-in method builtins.exec}
        1    0.000    0.000   79.873   79.873 <string>:1(<module>)
        1    0.003    0.003   79.873   79.873 /opt/conda/lib/python3.7/site-packages/fastai/data/block.py:112(dataloaders)
        1    0.000    0.000   74.913   74.913 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:207(dataloaders)
        2    0.000    0.000   74.863   37.432 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:185(__init__)
        2    0.294    0.147   74.857   37.429 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:189(<listcomp>)
    50001    0.073    0.000   73.055    0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:132(do_item)
    50001    0.067    0.000   62.392    0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:139(create_item)
    50001    0.118    0.000   62.325    0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:318(__getitem__)
    50001    0.401    0.000   62.124    0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:319(<listcomp>)
   100002    0.287    0.000   61.723    0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:282(__getitem__)
   150005    0.222    0.000   45.975    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:198(__call__)
   150005    0.496    0.000   45.753    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:145(compose_tfms)
   200008    0.211    0.000   41.481    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:73(__call__)
   200008    0.373    0.000   41.270    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:81(_call)
300013/200008    0.851    0.000   40.661    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:85(_do_call)
   100002    0.142    0.000   35.525    0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:247(_after_item)
350021/250017    1.599    0.000   34.448    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:111(__call__)
  1750123    1.651    0.000   31.782    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:111(__getitem__)
  1750131    1.729    0.000   28.130    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:114(_get)
   100019    0.217    0.000   25.327    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py:882(__getitem__)
   100019    0.407    0.000   25.054    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py:1479(_getitem_axis)
   100007    0.517    0.000   23.131    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:2934(_ixs)
2000259/1950116    2.083    0.000   16.391    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:95(__call__)
900050/600031    3.165    0.000   15.760    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:125(__getitem__)
   100010    0.943    0.000   14.038    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:236(__init__)
1650221/1650189    2.219    0.000   12.796    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:103(__init__)
   100004    0.263    0.000   10.421    0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:306(__new__)
150008/150005    0.136    0.000    9.907    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:90(<genexpr>)
1750249/1750205    1.734    0.000    9.773    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:49(listify)
   250010    0.286    0.000    9.337    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:100(returns)
   100004    0.901    0.000    8.065    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:940(fast_xs)
        1    0.000    0.000    7.445    7.445 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:220(<listcomp>)
        1    0.000    0.000    7.430    7.430 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:212(new)
        1    0.000    0.000    7.430    7.430 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:61(new)
        1    0.000    0.000    7.427    7.427 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:120(new)
21162055/20161955    3.766    0.000    7.051    0.000 {built-in method builtins.isinstance}
    50002    0.194    0.000    6.854    0.000 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:48(encodes)
350078/350072    0.736    0.000    6.499    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:154(map)
   100004    0.189    0.000    6.077    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:1887(_interleaved_dtype)
   100006    0.566    0.000    6.016    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/construction.py:423(sanitize_array)
    50002    0.142    0.000    5.844    0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:244(encodes)
   100004    0.775    0.000    5.693    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1517(find_common_type)
     10/6    0.000    0.000    4.967    0.828 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:230(__init__)
        1    0.000    0.000    4.958    4.958 /opt/conda/lib/python3.7/site-packages/fastai/data/block.py:105(datasets)
        3    0.000    0.000    4.947    1.649 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:313(__init__)
        1    0.000    0.000    4.947    4.947 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:315(<listcomp>)
        2    0.001    0.000    4.932    2.466 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:256(setup)
        5    0.000    0.000    4.908    0.982 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:189(setup)
        7    0.000    0.000    4.908    0.701 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:194(add)
        7    0.000    0.000    4.908    0.701 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:77(setup)
        1    0.001    0.001    4.840    4.840 /opt/conda/lib/python3.7/site-packages/fastai/text/core.py:283(setups)
        1    0.008    0.008    4.839    4.839 /opt/conda/lib/python3.7/site-packages/fastai/text/core.py:211(tokenize_df)
9609280/9508994    1.857    0.000    4.601    0.000 {built-in method builtins.getattr}
   100006    0.200    0.000    4.275    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:1577(from_array)
8552820/6402555    1.643    0.000    4.083    0.000 {built-in method builtins.len}
350078/350072    1.073    0.000    3.905    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:659(map_ex)
   100012    0.261    0.000    3.849    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py:2711(make_block)
   100006    0.456    0.000    3.790    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/construction.py:554(_try_cast)
    50001    0.017    0.000    3.669    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:123(parallel_gen)
    50001    0.017    0.000    3.650    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:109(run_procs)
   100004    0.208    0.000    3.547    0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:202(__call__)
    50001    0.044    0.000    3.489    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:120(<genexpr>)
   800074    0.505    0.000    3.234    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/imports.py:20(is_iter)
    50000    0.119    0.000    3.201    0.000 /opt/conda/lib/python3.7/multiprocessing/queues.py:91(get)
   250010    0.619    0.000    3.126    0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:124(tensor)
   100004    0.252    0.000    2.928    0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:196(_do_one)
   100012    0.545    0.000    2.844    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py:2662(get_block_type)
   200014    0.363    0.000    2.602    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:5446(__getattr__)
   100004    0.283    0.000    2.553    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:170(cast)
   100007    0.235    0.000    2.450    0.000 /opt/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py:569(find_common_type)
  1000100    0.565    0.000    2.450    0.000 /opt/conda/lib/python3.7/typing.py:715(__instancecheck__)
    50000    0.075    0.000    2.161    0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:208(recv_bytes)
   200014    1.523    0.000    2.097    0.000 /opt/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py:545(_can_coerce_all)
800086/800085    0.736    0.000    2.092    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:385(__getattr__)
    50000    0.078    0.000    2.057    0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:406(_recv_bytes)
   450094    0.338    0.000    2.024    0.000 {built-in method builtins.any}
350107/350084    0.423    0.000    1.974    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:110(_new)
   100000    0.140    0.000    1.948    0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:374(_recv)
  1000100    0.645    0.000    1.885    0.000 /opt/conda/lib/python3.7/typing.py:718(__subclasscheck__)
  1600247    1.342    0.000    1.836    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:135(__iter__)
   250010    0.132    0.000    1.798    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:253(anno_ret)
   100000    1.764    0.000    1.764    0.000 {built-in method posix.read}
   100005    0.172    0.000    1.646    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:243(annotations)
   700197    0.639    0.000    1.619    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/common.py:1470(is_extension_array_dtype)
  3650257    1.096    0.000    1.558    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:78(is_indexer)
    50000    0.029    0.000    1.509    0.000 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:180(_default_sort)
   100004    0.269    0.000    1.501    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:837(__getitem__)
   300013    0.253    0.000    1.448    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:238(type_hints)
    50002    1.444    0.000    1.444    0.000 {built-in method tensor}
100000/50000    0.330    0.000    1.437    0.000 /opt/conda/lib/python3.7/site-packages/torch/tensor.py:567(__len__)
  3350900    0.750    0.000    1.431    0.000 {built-in method builtins.issubclass}
550143/550129    1.153    0.000    1.390    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:645(__call__)
   500158    0.347    0.000    1.367    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/base.py:254(is_dtype)
  6552421    1.335    0.000    1.335    0.000 {built-in method builtins.hasattr}
   100006    0.429    0.000    1.324    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1379(maybe_cast_to_datetime)
   300012    0.120    0.000    1.303    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1563(<genexpr>)
   200046    0.417    0.000    1.243    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:5464(__setattr__)
   200029    0.333    0.000    1.183    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/common.py:1341(is_bool_dtype)
  1650219    0.679    0.000    1.172    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:44(is_array)
800086/800085    0.627    0.000    1.152    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:380(_component_attr_filter)
   200009    0.603    0.000    1.109    0.000 /opt/conda/lib/python3.7/typing.py:934(get_type_hints)
   100004    0.182    0.000    1.103    0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:942(_get_value)
2000144/2000143    0.853    0.000    1.079    0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:86(__len__)
   100013    0.118    0.000    1.017    0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:291(as_subclass)
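
The full listing above is long. As a side note, here is a minimal sketch of how the same kind of report can be trimmed to the top few rows, or to rows matching a pattern, using only the standard library. `workload` and `profile_top` are made-up names for illustration; `workload` is just a stand-in for the real `data_block.dataloaders(...)` call.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def workload():
    # Hypothetical stand-in for the call being profiled.
    return sum(len(str(i)) for i in range(20_000))


def profile_top(func, limit=15, pattern=None):
    """Profile func() and return (result, report), keeping at most `limit` rows."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(SortKey.CUMULATIVE)
    # Restrictions are applied in order: first the regex, then the row count.
    restrictions = [r for r in (pattern, limit) if r is not None]
    stats.print_stats(*restrictions)
    return result, buffer.getvalue()


result, report = profile_top(workload, limit=10)
print(report)
```

Passing something like `pattern='pandas'` should keep only the rows whose file path or function name matches, which makes it easier to see how much of the cumulative time sits in one library.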

Just to help compare performance: I am running the code on a Google Cloud notebook instance with 4 vCPUs, 15 GB RAM and one Tesla V100.

The code below runs in 23 seconds (which seems slow to me for 10,000 IMDB reviews):

import time
import pandas as pd
from fastai.text.all import *

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df = pd.concat([df]*10) #10000 IMDB reviews

data_block = DataBlock(
    blocks=(TextBlock.from_df('text'), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'), splitter=RandomSplitter(0.1))

start = time.time()
dl = data_block.dataloaders(df)
end = time.time()
print(end - start)

Output is: 23.152453305921462
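
Since the same start/end timing pattern comes up several times in this thread, here is a small sketch of a reusable timer. `timed` is a hypothetical helper, not part of fastai, and the toy workload only stands in for the real calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Optionally keep the measurement for later comparison.
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))
```

With this, the measurement above could be written as `with timed("dataloaders"): dl = data_block.dataloaders(df)`.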

I want to add that if I split this into two phases:

ds = data_block.datasets(df)
and then
dl = ds.dataloaders()

Then most of the time (around 90%) is spent on the second line, which is weird, as the first one should do most of the work (tokenization and numericalization).

This doesn’t seem to happen with the IMDB dataset. So it seems to be a problem related mainly to pandas dataframes?

From the profiler it seems that a big chunk of time is spent on compose_tfms. Maybe the transformations do not happen in a vectorized way when using from_df?
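
One way to probe this outside fastai: the profile shows roughly 25 of the 80 seconds under pandas row indexing (`indexing.py __getitem__`, `frame.py _ixs`, `series.py __init__`). The sketch below compares reading a mixed-dtype frame one row at a time with reading each column once. It is a toy comparison on synthetic data, not fastai's actual code path, and `make_df`, `read_rows_one_by_one` and `read_columns_once` are names invented for the example.

```python
import time

import pandas as pd


def make_df(n_rows=2_000):
    # Mixed dtypes (str + int), like the text/target frame in this thread.
    return pd.DataFrame({
        "text": ["A N D C"] * n_rows,
        "target": [i % 2 for i in range(n_rows)],
    })


def read_rows_one_by_one(df):
    # Each df.iloc[i] builds a new Series for a single row.
    return [(df.iloc[i]["text"], df.iloc[i]["target"]) for i in range(len(df))]


def read_columns_once(df):
    # Pull each column out once, then pair the values up in plain Python.
    return list(zip(df["text"].tolist(), df["target"].tolist()))


df = make_df()

start = time.perf_counter()
slow = read_rows_one_by_one(df)
row_time = time.perf_counter() - start

start = time.perf_counter()
fast = read_columns_once(df)
col_time = time.perf_counter() - start

print(f"row by row: {row_time:.4f} s, by column: {col_time:.4f} s")
```

If the row-by-row number dominates when scaled to 50k rows, that would be consistent with the pandas entries in the profile, though it does not by itself show where fastai triggers the row access.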

I know, I am annoying, but I want to share with you a way to completely reproduce this from scratch:

import fastai
import pandas as pd
import random
import time
from fastai.text.all import *

AMINO_ALPHABET = [i for i in 'ARNDCQEGHILKMFPSTWYV']
DF_LEN = 50000
STR_LEN = 10

df = pd.DataFrame({
    'text': [' '.join(random.choice(AMINO_ALPHABET) for i in range(STR_LEN)) for i in range(DF_LEN)],
    'target': [random.randint(0, 1) for i in range(DF_LEN)]
})

VOCAB = make_vocab(Counter(AMINO_ALPHABET), min_freq=0)
data_block = DataBlock(
    blocks=(TextBlock.from_df('text', tok=BaseTokenizer(), vocab=VOCAB), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.1))

start = time.time()
ds = data_block.datasets(df)
end = time.time()
ds_time = end - start

start = time.time()
dl = ds.dataloaders()
end = time.time()
dl_time = end - start

print(ds_time)
print(dl_time)

And I get as output:

8.407098531723022
34.95427393913269

That is too much for a 50k-row dataframe. The reason I think so is that if I do the following:

AMINO_ALPHABET_DICT = {
    k:v for k,v in zip(AMINO_ALPHABET, range(len(AMINO_ALPHABET)))
}
def text_to_tensor(s):
    s = s.replace(' ', '')
    rtn = list(s)
    return TensorText([AMINO_ALPHABET_DICT[i] for i in rtn])

start = time.time()
df['tensor'] = df['text'].map(text_to_tensor)
end = time.time()
print(end - start)

I get as output:

2.6132564544677734

Also, if I use torch.tensor instead of TensorText, the output is:

0.30541300773620605
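
For a rough lower bound, the lookup itself can be sketched without any tensor library. This is only a baseline under the same assumptions as the reproduction above (20-letter alphabet, space-separated sequences of length 10, 50k rows); `AMINO_INDEX` and `encode` are illustrative names, and the index assignment does not match the ids fastai's vocab would produce.

```python
import random
import time

AMINO_ALPHABET = list("ARNDCQEGHILKMFPSTWYV")
# Position in the alphabet serves as the integer id (illustrative choice).
AMINO_INDEX = {aa: i for i, aa in enumerate(AMINO_ALPHABET)}


def encode(text):
    """Map a space-separated sequence such as 'A N D' to a list of integer ids."""
    return [AMINO_INDEX[token] for token in text.split()]


random.seed(0)
texts = [" ".join(random.choice(AMINO_ALPHABET) for _ in range(10)) for _ in range(50_000)]

start = time.perf_counter()
encoded = [encode(t) for t in texts]
elapsed = time.perf_counter() - start
print(f"encoded {len(encoded)} sequences in {elapsed:.3f} s")
```

Whatever time this takes on a given machine is the cost of the pure dictionary lookup; anything beyond it in the timings above comes from tensor construction and per-item wrapping.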

I’m not sure, but that might be related to https://github.com/fastai/fastai/issues/2812

Thanks, it might be related, but it is a different issue, as in this case the slowdown appears at dataloader creation (while in that issue it happens during training).

I have submitted an issue: https://github.com/fastai/fastai/issues/3276