Hey, I’m currently following this excellent notebook from Jeremy. It uses a very early version of fastai2, so some things are no longer up to date. I’m stuck at the part where he uses a `DataSource` object, as this doesn’t seem to exist in the current version of fastai anymore.
The part I’m stuck with is:
dsrc = DataSource(fns, [[dcm_tfm],[os.path.basename]])
dl = TfmdDL(dsrc, bs=bs, num_workers=2)
A `DataSource` is still referenced at some points in the docs, but there doesn’t seem to be an implementation of `DataSource` itself anymore.
I would like to achieve what Jeremy describes just above that part, where he writes:
If we’re not careful, processing 300GB of input data could take a really long time! To make it super fast, we’ll do it in 3 steps:
- Create a multiprocess `DataLoader` that reads, fixes, and rescales the DICOMs, returning batches of size `bs`
- Loop through each batch, moving it to the GPU, and then using fastai’s GPU-optimized masking, cropping, and resizing functions on the whole batch at once
- For each batch, save each image in it in parallel, using fastai’s `parallel` function.

This is the first time that I’ve shown how to combine parallel and GPU processing for preprocessing and saving medical images (and might be the first time it’s been shown anywhere, at least using something this flexible and concise!)
Let’s start by setting our params, creating our DataSource, and then wrapping that with a multiprocess transformed data loader.
How would that part be done with the current fastai version?