Dataset creation VERY slow

Hi there,

I wonder if anyone can help me with a weird problem. I’m working on a computer vision problem (CT classification) with 1.7M images. I built a working pipeline on a subset of the data, but when I try to scale to the whole dataset, creating the dataset with the low-level API takes basically forever. What’s more, as I increase the size of the dataset, the creation time grows non-linearly. See code below. Any thoughts, anyone?

def fix_pxrepr(dcm):
    if dcm.PixelRepresentation != 0 or dcm.RescaleIntercept<-100: return
    x = dcm.pixel_array + 1000
    px_mode = 4096
    x[x>=px_mode] = x[x>=px_mode] - px_mode
    dcm.PixelData = x.tobytes()
    dcm.RescaleIntercept = -1000

def dcm_tfm(fn):
    fn = Path(fn)
    try:
        x = fn.dcmread()
    except Exception:
        raise SkipItemException
    if x.Rows != 512 or x.Columns != 512: x.zoom_to((512,512))
    px = x.scaled_px
    return TensorImage(px.to_3chan(dicom_windows.lungs,dicom_windows.mediastinum, bins=bins))

def fn2label2(fn): return df_comb[df_comb.fname == fn][htypes[0]].values[0]

tfms = [[dcm_tfm], [fn2label2,Categorize()]]
dsrc = Datasets(fns, tfms, splits=splits)
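One detail that might matter here: `fn2label2` filters the entire DataFrame on every call, which is O(rows) per item, so labelling n items costs O(n²) overall. Below is a minimal, self-contained sketch of that pattern versus a dict built once up front (the toy `df_comb` and its `label` column are hypothetical stand-ins for the real frame and `htypes[0]`):

```python
import pandas as pd

# Toy stand-in for df_comb: one label per filename.
df_comb = pd.DataFrame({
    "fname": [f"img_{i}.dcm" for i in range(5)],
    "label": ["healthy", "any", "healthy", "any", "healthy"],
})

# Pattern from the post: every call scans the whole DataFrame,
# so labelling n items is O(n * rows) -- quadratic when n == rows.
def fn2label_scan(fn):
    return df_comb[df_comb.fname == fn]["label"].values[0]

# Build the mapping once; each lookup afterwards is O(1).
fname2label = dict(zip(df_comb.fname, df_comb.label))
def fn2label_dict(fn):
    return fname2label[fn]

# Both give the same answers; only the cost per call differs.
assert all(fn2label_scan(f) == fn2label_dict(f) for f in df_comb.fname)
```

This is just a sketch of the lookup pattern, not a claim about what `Datasets` does internally, but the growth in your timings (8 s → 28 s → 226 s for 1x → 2x → 5x the rows) is roughly quadratic, which is consistent with a per-item scan.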

See loop below for what happens when I try to increase the size of the dataset…

for i in [10000, 20000, 50000]:
    df_comb = pd.merge(df_trn, df_lbls).sample(i)
    print(f'for {df_comb.shape[0]} rows:')
    fns = [f for f in df_comb.fname]
    splits = split(df_comb)
    tfms = [[dcm_tfm], [fn2label2,Categorize()]]
    %time dsrc = Datasets(fns, tfms, splits=splits)

And the output…

for 10000 rows:
Wall time: 8.38 s

for 20000 rows:
Wall time: 27.9 s

for 50000 rows:
Wall time: 3min 46s

Really strange effect. Hope someone can help me figure it out. Tagging @muellerzr since he seems to know everything about this and is very active here.

I would try profiling it with the `profile` module.
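To make that suggestion concrete, here is a minimal sketch using the stdlib `cProfile`/`pstats` (the faster C variant of `profile`); `build_dataset` is a hypothetical stand-in for the slow `Datasets(fns, tfms, splits=splits)` call:

```python
import cProfile
import io
import pstats

def build_dataset():
    # Hypothetical stand-in for: dsrc = Datasets(fns, tfms, splits=splits)
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
build_dataset()
pr.disable()

# Report the top 10 functions by cumulative time.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(10)
report = s.getvalue()
print(report)
```

The functions at the top of the cumulative-time column should show whether the time is going into the label lookup, the DICOM reads, or somewhere else entirely.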