Hello,
I am building a binary text classifier on amino acid sequences of about 70 characters using fastai 2.2.7.
I have built the data block this way:
AMMINO_ALPHABET = [i for i in 'ARNDCQEGHILKMFPSTWYV']
VOCAB = make_vocab(Counter(AMMINO_ALPHABET), min_freq=0)
data_block = DataBlock(
    blocks=(TextBlock.from_df('text', rules=[], tok=BaseTokenizer(), vocab=VOCAB), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.1))
Then the data loader this way:
dl = data_block.dataloaders(df_text, bs=64, seq_len=72)
df_text is a pandas dataframe (50k rows) with a text column containing an amino acid sequence in which each amino acid is separated by a space (e.g. “A N D C Q E G H”), and a target column that is either 0 or 1.
When I run the line above, which creates the dataloaders, I see the progress bar for about 2-3 seconds and then the code runs silently for about 50 seconds.
That is way too slow for a 50k-row dataset.
What am I missing?
Thanks a lot!
This is what I get from running the profiler as follows:
import cProfile, pstats
from pstats import SortKey
cProfile.run('data_block.dataloaders(df_text, bs=64, seq_len=72)', 'restats')
p = pstats.Stats('restats')
p.sort_stats(SortKey.CUMULATIVE).print_stats()
The result is:
119497097 function calls (115645684 primitive calls) in 79.873 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 79.873 79.873 {built-in method builtins.exec}
1 0.000 0.000 79.873 79.873 <string>:1(<module>)
1 0.003 0.003 79.873 79.873 /opt/conda/lib/python3.7/site-packages/fastai/data/block.py:112(dataloaders)
1 0.000 0.000 74.913 74.913 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:207(dataloaders)
2 0.000 0.000 74.863 37.432 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:185(__init__)
2 0.294 0.147 74.857 37.429 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:189(<listcomp>)
50001 0.073 0.000 73.055 0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:132(do_item)
50001 0.067 0.000 62.392 0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:139(create_item)
50001 0.118 0.000 62.325 0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:318(__getitem__)
50001 0.401 0.000 62.124 0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:319(<listcomp>)
100002 0.287 0.000 61.723 0.001 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:282(__getitem__)
150005 0.222 0.000 45.975 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:198(__call__)
150005 0.496 0.000 45.753 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:145(compose_tfms)
200008 0.211 0.000 41.481 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:73(__call__)
200008 0.373 0.000 41.270 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:81(_call)
300013/200008 0.851 0.000 40.661 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:85(_do_call)
100002 0.142 0.000 35.525 0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:247(_after_item)
350021/250017 1.599 0.000 34.448 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:111(__call__)
1750123 1.651 0.000 31.782 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:111(__getitem__)
1750131 1.729 0.000 28.130 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:114(_get)
100019 0.217 0.000 25.327 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py:882(__getitem__)
100019 0.407 0.000 25.054 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/indexing.py:1479(_getitem_axis)
100007 0.517 0.000 23.131 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/frame.py:2934(_ixs)
2000259/1950116 2.083 0.000 16.391 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:95(__call__)
900050/600031 3.165 0.000 15.760 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:125(__getitem__)
100010 0.943 0.000 14.038 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:236(__init__)
1650221/1650189 2.219 0.000 12.796 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:103(__init__)
100004 0.263 0.000 10.421 0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:306(__new__)
150008/150005 0.136 0.000 9.907 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:90(<genexpr>)
1750249/1750205 1.734 0.000 9.773 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:49(listify)
250010 0.286 0.000 9.337 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:100(returns)
100004 0.901 0.000 8.065 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:940(fast_xs)
1 0.000 0.000 7.445 7.445 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:220(<listcomp>)
1 0.000 0.000 7.430 7.430 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:212(new)
1 0.000 0.000 7.430 7.430 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:61(new)
1 0.000 0.000 7.427 7.427 /opt/conda/lib/python3.7/site-packages/fastai/data/load.py:120(new)
21162055/20161955 3.766 0.000 7.051 0.000 {built-in method builtins.isinstance}
50002 0.194 0.000 6.854 0.000 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:48(encodes)
350078/350072 0.736 0.000 6.499 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:154(map)
100004 0.189 0.000 6.077 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:1887(_interleaved_dtype)
100006 0.566 0.000 6.016 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/construction.py:423(sanitize_array)
50002 0.142 0.000 5.844 0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:244(encodes)
100004 0.775 0.000 5.693 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1517(find_common_type)
10/6 0.000 0.000 4.967 0.828 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:230(__init__)
1 0.000 0.000 4.958 4.958 /opt/conda/lib/python3.7/site-packages/fastai/data/block.py:105(datasets)
3 0.000 0.000 4.947 1.649 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:313(__init__)
1 0.000 0.000 4.947 4.947 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:315(<listcomp>)
2 0.001 0.000 4.932 2.466 /opt/conda/lib/python3.7/site-packages/fastai/data/core.py:256(setup)
5 0.000 0.000 4.908 0.982 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:189(setup)
7 0.000 0.000 4.908 0.701 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:194(add)
7 0.000 0.000 4.908 0.701 /opt/conda/lib/python3.7/site-packages/fastcore/transform.py:77(setup)
1 0.001 0.001 4.840 4.840 /opt/conda/lib/python3.7/site-packages/fastai/text/core.py:283(setups)
1 0.008 0.008 4.839 4.839 /opt/conda/lib/python3.7/site-packages/fastai/text/core.py:211(tokenize_df)
9609280/9508994 1.857 0.000 4.601 0.000 {built-in method builtins.getattr}
100006 0.200 0.000 4.275 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py:1577(from_array)
8552820/6402555 1.643 0.000 4.083 0.000 {built-in method builtins.len}
350078/350072 1.073 0.000 3.905 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:659(map_ex)
100012 0.261 0.000 3.849 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py:2711(make_block)
100006 0.456 0.000 3.790 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/construction.py:554(_try_cast)
50001 0.017 0.000 3.669 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:123(parallel_gen)
50001 0.017 0.000 3.650 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:109(run_procs)
100004 0.208 0.000 3.547 0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:202(__call__)
50001 0.044 0.000 3.489 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/parallel.py:120(<genexpr>)
800074 0.505 0.000 3.234 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/imports.py:20(is_iter)
50000 0.119 0.000 3.201 0.000 /opt/conda/lib/python3.7/multiprocessing/queues.py:91(get)
250010 0.619 0.000 3.126 0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:124(tensor)
100004 0.252 0.000 2.928 0.000 /opt/conda/lib/python3.7/site-packages/fastai/data/transforms.py:196(_do_one)
100012 0.545 0.000 2.844 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py:2662(get_block_type)
200014 0.363 0.000 2.602 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:5446(__getattr__)
100004 0.283 0.000 2.553 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/dispatch.py:170(cast)
100007 0.235 0.000 2.450 0.000 /opt/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py:569(find_common_type)
1000100 0.565 0.000 2.450 0.000 /opt/conda/lib/python3.7/typing.py:715(__instancecheck__)
50000 0.075 0.000 2.161 0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:208(recv_bytes)
200014 1.523 0.000 2.097 0.000 /opt/conda/lib/python3.7/site-packages/numpy/core/numerictypes.py:545(_can_coerce_all)
800086/800085 0.736 0.000 2.092 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:385(__getattr__)
50000 0.078 0.000 2.057 0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:406(_recv_bytes)
450094 0.338 0.000 2.024 0.000 {built-in method builtins.any}
350107/350084 0.423 0.000 1.974 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:110(_new)
100000 0.140 0.000 1.948 0.000 /opt/conda/lib/python3.7/multiprocessing/connection.py:374(_recv)
1000100 0.645 0.000 1.885 0.000 /opt/conda/lib/python3.7/typing.py:718(__subclasscheck__)
1600247 1.342 0.000 1.836 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:135(__iter__)
250010 0.132 0.000 1.798 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:253(anno_ret)
100000 1.764 0.000 1.764 0.000 {built-in method posix.read}
100005 0.172 0.000 1.646 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:243(annotations)
700197 0.639 0.000 1.619 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/common.py:1470(is_extension_array_dtype)
3650257 1.096 0.000 1.558 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:78(is_indexer)
50000 0.029 0.000 1.509 0.000 /opt/conda/lib/python3.7/site-packages/fastai/text/data.py:180(_default_sort)
100004 0.269 0.000 1.501 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:837(__getitem__)
300013 0.253 0.000 1.448 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:238(type_hints)
50002 1.444 0.000 1.444 0.000 {built-in method tensor}
100000/50000 0.330 0.000 1.437 0.000 /opt/conda/lib/python3.7/site-packages/torch/tensor.py:567(__len__)
3350900 0.750 0.000 1.431 0.000 {built-in method builtins.issubclass}
550143/550129 1.153 0.000 1.390 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:645(__call__)
500158 0.347 0.000 1.367 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/base.py:254(is_dtype)
6552421 1.335 0.000 1.335 0.000 {built-in method builtins.hasattr}
100006 0.429 0.000 1.324 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1379(maybe_cast_to_datetime)
300012 0.120 0.000 1.303 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py:1563(<genexpr>)
200046 0.417 0.000 1.243 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/generic.py:5464(__setattr__)
200029 0.333 0.000 1.183 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/common.py:1341(is_bool_dtype)
1650219 0.679 0.000 1.172 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:44(is_array)
800086/800085 0.627 0.000 1.152 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/basics.py:380(_component_attr_filter)
200009 0.603 0.000 1.109 0.000 /opt/conda/lib/python3.7/typing.py:934(get_type_hints)
100004 0.182 0.000 1.103 0.000 /opt/conda/lib/python3.7/site-packages/pandas/core/series.py:942(_get_value)
2000144/2000143 0.853 0.000 1.079 0.000 /opt/conda/lib/python3.7/site-packages/fastcore/foundation.py:86(__len__)
100013 0.118 0.000 1.017 0.000 /opt/conda/lib/python3.7/site-packages/fastai/torch_core.py:291(as_subclass)
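The full listing is long, so here is a minimal sketch of how a profile like this could be trimmed to the top rows, or to the frames of one package, using only the standard library. `slow_lookup` and `profile_top` are hypothetical names made up for illustration; the real call to profile would be `data_block.dataloaders(...)`.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def slow_lookup(n):
    # Hypothetical stand-in workload; replace with the real call being profiled.
    table = {i: str(i) for i in range(n)}
    return [table[i] for i in range(n)]


def profile_top(func, *args, limit=15, pattern=None):
    """Profile func(*args) and return the top `limit` rows by cumulative time.

    `pattern` is an optional regex matched against 'filename:lineno(function)',
    e.g. 'pandas' to keep only frames that come from pandas.
    """
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats(SortKey.CUMULATIVE)
    # Restrictions are applied in order: filter by regex first, then cut to `limit` rows.
    restrictions = [pattern, limit] if pattern else [limit]
    stats.print_stats(*restrictions)
    return buffer.getvalue()


report = profile_top(slow_lookup, 10000, limit=5)
print(report)
```

Calling `profile_top(..., pattern='pandas')` on the real workload would keep only the pandas frames, which makes it easier to see how much of the cumulative time sits in row indexing.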
Just to help compare performance: I am running the code on a Google Cloud notebook instance with 4 vCPUs, 15 GB RAM and one Tesla V100.
The code below runs in 23 seconds (which seems slow to me for 10,000 IMDB reviews):
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df = pd.concat([df]*10) #10000 IMDB reviews
data_block = DataBlock(
    blocks=(TextBlock.from_df('text'), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('label'), splitter=RandomSplitter(0.1))
start = time.time()
dl = data_block.dataloaders(df)
end = time.time()
print(end - start)
Output is: 23.152453305921462
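The start/end timing pattern used here could be wrapped in a small helper so that each phase is measured the same way. This is only a sketch: `timed` is a hypothetical helper, not part of fastai, and `time.sleep` stands in for the real call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block with a monotonic clock and print the duration."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Keep the measurement so phases can be compared afterwards.
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for data_block.dataloaders(df)
```

`time.perf_counter()` is used instead of `time.time()` because it is monotonic and meant for measuring intervals.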
I want to add that if I split this in two phases:
ds = data_block.datasets(df)
and then
dl = ds.dataloaders()
Then most of the time (around 90%) is spent on the second line, which is weird, as the first one should do most of the work (tokenization and numericalization).
This doesn’t seem to happen with the IMDB dataset, so it seems to be a problem related mainly to pandas dataframes?
From the profiler it seems that a big chunk of time is spent on compose_tfms. Maybe the transformations do not happen in a vectorized way when using from_df?
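The profile above also shows a large share of cumulative time under pandas `indexing.py` `__getitem__`, `frame.py` `_ixs` and `series.py` `__init__`, which would be consistent with each item being read from the dataframe as a separate row. A rough sketch to check how expensive row-by-row access is in isolation, using plain pandas and no fastai (the sizes and variable names are illustrative assumptions):

```python
import random
import time

import pandas as pd

ALPHABET = "ARNDCQEGHILKMFPSTWYV"
N_ROWS = 5000  # smaller than the 50k rows in the post so the sketch runs quickly

random.seed(0)
df = pd.DataFrame({
    "text": [" ".join(random.choice(ALPHABET) for _ in range(10)) for _ in range(N_ROWS)],
    "target": [random.randint(0, 1) for _ in range(N_ROWS)],
})

# Per-row access: every df.iloc[i] builds a new Series out of a mixed-dtype row.
start = time.perf_counter()
rows_text = [df.iloc[i]["text"] for i in range(len(df))]
per_row_time = time.perf_counter() - start

# Column-wise access: pull the column out once as a plain Python list.
start = time.perf_counter()
col_text = df["text"].tolist()
column_time = time.perf_counter() - start

print(f"per-row iloc:  {per_row_time:.3f} s")
print(f"column tolist: {column_time:.3f} s")
```

Both approaches return the same strings; only the access pattern differs, so any gap between the two timings comes from building one Series per row.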
I know I am being annoying, but I want to share a way to reproduce this completely from scratch:
import fastai
import pandas as pd
import random
from fastai.text.all import *
AMINO_ALPHABET = [i for i in 'ARNDCQEGHILKMFPSTWYV']
DF_LEN = 50000
STR_LEN = 10
df = pd.DataFrame({
    'text': [' '.join(random.choice(AMINO_ALPHABET) for i in range(STR_LEN)) for i in range(DF_LEN)],
    'target': [random.randint(0, 1) for i in range(DF_LEN)]
})
VOCAB = make_vocab(Counter(AMINO_ALPHABET), min_freq=0)
data_block = DataBlock(
    blocks=(TextBlock.from_df('text', tok=BaseTokenizer(), vocab=VOCAB), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.1))
start = time.time()
ds = data_block.datasets(df)
end = time.time()
ds_time = end - start
start = time.time()
dl = ds.dataloaders()
end = time.time()
dl_time = end - start
print(ds_time)
print(dl_time)
And I get as output
8.407098531723022
34.95427393913269
That is too much for a 50k-row dataframe. The reason I think so is that if I do the following:
AMINO_ALPHABET_DICT = {
    k:v for k,v in zip(AMINO_ALPHABET, range(len(AMINO_ALPHABET)))
}
def text_to_tensor(s):
    s = s.replace(' ', '')
    rtn = list(s)
    return TensorText([AMINO_ALPHABET_DICT[i] for i in rtn])
start = time.time()
df['tensor'] = df['text'].map(text_to_tensor)
end = time.time()
print(end - start)
I get as output
2.6132564544677734
Also, if instead of using TensorText I use torch.tensor, then the output is
0.30541300773620605
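Since the timings above suggest that wrapping each row in its own tensor object is the expensive part, one option worth testing is to keep the encoded ids as plain Python lists and convert only once at the end. A dependency-free sketch of the encoding step, where `encode_sequences` and `TOKEN_TO_ID` are illustrative names and not fastai API:

```python
ALPHABET = "ARNDCQEGHILKMFPSTWYV"
# Map each residue letter to an integer id (0..19), like AMINO_ALPHABET_DICT above.
TOKEN_TO_ID = {letter: idx for idx, letter in enumerate(ALPHABET)}


def encode_sequences(texts):
    """Encode space-separated residue strings into lists of integer ids."""
    return [[TOKEN_TO_ID[token] for token in text.split()] for text in texts]


encoded = encode_sequences(["A R N D", "V Y W T"])
print(encoded)  # [[0, 1, 2, 3], [19, 18, 17, 16]]
```

If all sequences have the same length, `torch.tensor(encoded)` would turn the nested list into a single 2D tensor in one call. That step is left out here so the sketch runs without torch, and I have not measured whether it helps inside fastai.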
florianl (Florian), March 18, 2021, 9:32pm, post 6
Thanks, it might be related, but it is a different issue, as in this case the slowdown appears at dataloader creation (while in that issue it happens during training).
I have submitted an issue: https://github.com/fastai/fastai/issues/3276