Medical imaging from_dicoms not finishing

Hi, I’m working through the medical imaging tutorial using a new-ish Kaggle competition, VinBigData chest x-rays.

In my notebook, I’m getting stuck on this:
dicom_dataframe = pd.DataFrame.from_dicoms(items)
dicom_dataframe[:5]

It starts running fine, but it never finishes loading the DICOM information into a dataframe.

In addition to the tutorial, Jeremy posted a similar Kaggle notebook about a year ago, and I can’t figure out why his works and mine doesn’t…

Thanks in advance for any help!

There are about 15000 images in the train set, so it does take a while! You could break it down into chunks and then merge the dataframes. Also, by default from_dicoms uses the brain window, so you may want to change that as well.

dicom_dataframe = pd.DataFrame.from_dicoms(items[:5000], window=dicom_windows.lungs)
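A rough sketch of the chunk-and-merge idea, in case it helps. The helper names (iter_chunks, build_in_chunks) and the default chunk size are made up for illustration. load_chunk stands for whatever call turns a list of files into a dataframe, and pandas is only imported when the merge actually runs.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def build_in_chunks(items, load_chunk, chunk_size=1000):
    """Run `load_chunk` on each slice and merge the resulting dataframes.

    `load_chunk` is an assumption: any callable that takes a list of
    files and returns a pandas DataFrame.
    """
    import pandas as pd  # imported here so iter_chunks works without pandas

    frames = [load_chunk(chunk) for chunk in iter_chunks(items, chunk_size)]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame()
    # ignore_index gives the merged frame one continuous 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

With the call from this thread, that could look like build_in_chunks(items, lambda chunk: pd.DataFrame.from_dicoms(chunk, window=dicom_windows.lungs), chunk_size=500). Smaller chunks also make it easier to see how far a run got before a kernel timeout.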

Thanks, ok, I guess I’ll give that a shot… I’m surprised that it takes so long, as I’m just trying to pull text from images into a dataframe… I just came back to the computer and 3 hours didn’t cut it…

Also, I didn’t know about the different windows; it wasn’t in the tutorial… I should probably go read the docs.

Anyway, just started running the notebook on just 5k, will report back in a couple hours.

thanks!

To access that metadata, the function needs to read the DICOM file, which is usually large because it holds the pixel array along with all the metadata. So it’s not surprising if it takes some time.


Yeah, I’ve never worked with DICOM files before… I guess I’m just surprised that there’s not a fast way to extract the metadata from the image. Seems like a pretty inefficient standard…

Anyway, the run on 5k images didn’t finish before the Kaggle kernel timed out. I’m running it now on just 500 images.

Ok, 500 took a bit over 11 minutes… I’ll go through Jeremy’s older notebook tomorrow and see if I can figure out what the difference is… that competition had 74k images and he was able to load up the metadata in under 15 minutes.

By default, from_dicoms generates a pixel summary (img_min, img_max, img_mean, img_std, img_pct_window). This uses the pixel_array, and that is the time-consuming part.

If you don’t need this info, you can manually turn it off, and it is a lot faster (7 mins).

I am submitting a PR so that it is easy to toggle this feature on or off as required.
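If only the header fields are needed, one option outside fastai is to read the files with pydicom directly and stop before the pixel data. This is a minimal sketch, not what from_dicoms does internally: header_row and read_headers are hypothetical helper names, and the default keywords are just examples of standard DICOM attributes.

```python
def header_row(ds, keywords):
    """Collect the requested attributes from a dataset-like object.

    Attributes missing from the file come back as None instead of raising.
    """
    return {kw: getattr(ds, kw, None) for kw in keywords}


def read_headers(paths, keywords=("PatientID", "Rows", "Columns")):
    """Return one dict per file, reading only the DICOM header."""
    import pydicom  # imported here so header_row works without pydicom

    rows = []
    for path in paths:
        # stop_before_pixels=True stops parsing once the pixel data
        # element is reached, so the image itself is never loaded
        ds = pydicom.dcmread(str(path), stop_before_pixels=True)
        row = header_row(ds, keywords)
        row["fname"] = str(path)
        rows.append(row)
    return rows
```

The list returned by read_headers(items) can be passed straight to pd.DataFrame(...). The pixel statistics (img_min, img_mean and so on) are not available this way, since they need the pixel array.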

Thanks! that worked!

Is there anywhere to find the list of parameters for from_dicoms? I can’t seem to find any documentation on it.

Have a look at this notebook

or here

As an aside, you may be interested in working with a downsampled version of the data (such as this one, with JPEG images) for fast experimenting.

I’ve done my pre-processing on the DICOM images and then saved the results as 16-bit .tiff files.
The problem I’m having now is that loading the images through a DataBlock seems to automatically convert them to 8-bit.
Does anyone have any ideas on how to load the data as 16-bit?
I could work off the DICOM files directly, but the dataset is just too large to use on my machine, so I have already spent quite some time building this 16-bit TIFF dataset.

You could look at this which allows for easy integration of a custom PILDicom block.

fmi