Hi there,
I wonder if anyone can help me with a weird problem. I'm working on a computer vision problem (CT scan classification) with 1.7M images. I have a working pipeline for a subset of the data, but when I try to scale to the full dataset, building the Datasets object with the low-level API takes basically forever. What's more, as I increase the dataset size, the time it takes to create grows non-linearly. See the code below. Any thoughts?
def fix_pxrepr(dcm):
    # Nothing to fix if the data is signed or the intercept already looks sensible
    if dcm.PixelRepresentation != 0 or dcm.RescaleIntercept < -100: return
    # Shift the raw pixels, wrap any overflow, and store a corrected intercept
    x = dcm.pixel_array + 1000
    px_mode = 4096
    x[x >= px_mode] = x[x >= px_mode] - px_mode
    dcm.PixelData = x.tobytes()
    dcm.RescaleIntercept = -1000
def dcm_tfm(fn):
    # Read the DICOM, fix its pixel representation, and skip unreadable files
    fn = Path(fn)
    try:
        x = fn.dcmread()
        fix_pxrepr(x)
    except Exception as e:
        print(fn, e)
        raise SkipItemException
    if x.Rows != 512 or x.Columns != 512: x.zoom_to((512, 512))
    px = x.scaled_px
    # Window the scaled pixels into a 3-channel image (lungs, mediastinum, binned)
    return TensorImage(px.to_3chan(dicom_windows.lungs, dicom_windows.mediastinum, bins=bins))
def fn2label2(fn): return df_comb[df_comb.fname == fn][htypes[0]].values[0]  # filters the whole DataFrame for each filename
tfms = [[dcm_tfm], [fn2label2,Categorize()]]
dsrc = Datasets(fns, tfms, splits=splits)
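One thing I noticed while writing this up: fn2label2 filters the entire DataFrame for every single filename. I don't know whether that's the culprit, but a dict-based lookup is something I could try instead; here's a rough, untested sketch (it assumes the fname values in df_comb are unique):

fname2lbl = dict(zip(df_comb.fname, df_comb[htypes[0]]))  # hypothetical: build the filename -> label mapping once
def fn2label2(fn): return fname2lbl[fn]  # O(1) dict lookup instead of scanning the DataFrame per item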
See the loop below for what happens when I increase the size of the dataset…
for i in [10000, 20000, 50000]:
    df_comb = pd.merge(df_trn, df_lbls).sample(i)
    print(f'for {df_comb.shape[0]} rows:')
    fns = [f for f in df_comb.fname]
    splits = split(df_comb)
    tfms = [[dcm_tfm], [fn2label2,Categorize()]]
    %time dsrc = Datasets(fns, tfms, splits=splits)
And the output…
for 10000 rows:
Wall time: 8.38 s
for 20000 rows:
Wall time: 27.9 s
for 50000 rows:
Wall time: 3min 46s
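To narrow things down, I'm also thinking of timing the image and label pipelines separately, roughly like this (just a sketch of what I'd try, using the same fns and splits as above):

%time dsrc_img = Datasets(fns, [[dcm_tfm]], splits=splits)
%time dsrc_lbl = Datasets(fns, [[fn2label2, Categorize()]], splits=splits)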
Really strange effect: doubling from 10k to 20k rows roughly triples the creation time, and going from 20k to 50k (2.5x the data) takes about 8x as long, so the growth is clearly worse than linear. Hope someone can help me figure it out. Tagging @muellerzr since he seems to know everything about fast.ai and is very active here.