Can a pandas DataFrame have NumPy arrays as cell values?

I want to store NumPy arrays as values for cells in my DataFrame. Is there any way to do this? Basically, I have pixel data, a (512, 512) NumPy array, that I want to save as the value of the pixel_data column corresponding to its particular id in the ID column of my DataFrame. How can I do this?

Here's what I tried:

    for f in train_files[:10]:
        id_tmp = f.split('/')[4].split('.')[0]
        first_dcm = pydicom.read_file(f)
        img = first_dcm.pixel_array
        window = get_windowing(first_dcm)
        image = window_image(img, *window)
        train.loc[train.Image == id_tmp, 'img_before_w'] = img
        train.loc[train.Image == id_tmp, 'img_after_w'] = image

The error I got:

    ValueError                                Traceback (most recent call last)
    <ipython-input-47-32236f8c9ccc> in <module>
          5     window = get_windowing(first_dcm)
          6     image = window_image(img, *window)
    ----> 7     train.loc[train.Image == id_tmp, 'img_before_w'] = img
          8     train.loc[train.Image == id_tmp, 'img_after_w'] = image
          9 

    /opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
        203             key = com.apply_if_callable(key, self.obj)
        204         indexer = self._get_setitem_indexer(key)
    --> 205         self._setitem_with_indexer(indexer, value)
        206 
        207     def _validate_key(self, key, axis: int):

    /opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
        525                     if len(labels) != value.shape[1]:
        526                         raise ValueError(
    --> 527                             "Must have equal len keys and value "
        528                             "when setting with an ndarray"
        529                         )

    ValueError: Must have equal len keys and value when setting with an ndarray

You can, but it's tricky: many pandas operations handle NumPy arrays specially, trying to broadcast them across the selection rather than storing them as a single value, hence your error. So you can only update in certain ways:
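For example, one way that does work is to make the column object dtype and assign one cell at a time with `.at` (a minimal sketch with made-up IDs, not your actual data):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'Image': ['id_0', 'id_1']})

# Assigning None gives the column object dtype, which can hold
# arbitrary Python objects, including NumPy arrays
train['img_before_w'] = None

arr = np.zeros((512, 512))

# .at sets a single cell, so pandas stores the array itself
# instead of trying to broadcast it across the selection
idx = train.index[train.Image == 'id_0'][0]
train.at[idx, 'img_before_w'] = arr

print(train.at[idx, 'img_before_w'].shape)  # (512, 512)
```

The key point is that `.loc` with a boolean mask treats a 2-D array as many values to distribute over the selected rows, while `.at` always targets exactly one cell.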

I couldn't find a way that avoids that warning, though the assignment does update correctly in this case.

Though unless you have a pretty small dataset, loading all your images into RAM like that is probably not a good idea.


@tomB

What I'm trying to do is extract pixel arrays from a bunch of DICOM files. The folder is extremely big and I'm running into RAM issues with other methods too. I thought I'd save all the np arrays using np.save, but again I ran into RAM issues. Is there any way I can extract the pixel arrays to be used later? I just need to extract all the pixel arrays from each DICOM file.

You can load the DICOMs directly into your dataloader. It might need adapting for your needs (especially div, which should be True if the pixel data is integers and False if float; I used the default True, but it might depend on the DICOMs), but I've used:

    import pydicom
    import torch
    from fastai.vision import Image, ImageList

    class DicomImageList(ImageList):
        def __init__(self, *args, div:bool=True, **kwargs):
            super().__init__(*args, **kwargs)
            self.div = div
            self.copy_new += ['div']  # propagate div when the list is copied

        def open(self, fn):
            dcm = pydicom.dcmread(fn)
            # add a channel dimension: (H, W) -> (1, H, W)
            px = torch.from_numpy(dcm.pixel_array)[None, ...]
            if self.div: px = px.float() / 255.
            return Image(px)

You just use it like a normal ImageList, with DicomImageList.from_folder (or from_df, etc.).

If pre-processing, you need to save the arrays one image at a time (you might also be better off converting to tensors and saving in PyTorch format, as it might be a bit faster to load, though likely not by much). You also need to be careful not to leave any variables pointing to images after you process and save them. You can do del some_var to explicitly delete a variable.
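A minimal sketch of that per-image approach (the output folder and ID are placeholders, and the random array stands in for pydicom's dcm.pixel_array):

```python
import os
import tempfile
import numpy as np

out_dir = tempfile.mkdtemp()  # stand-in for your output folder

def save_pixel_array(arr, file_id, out_dir):
    """Save a single pixel array to disk as .npy and return its path."""
    path = os.path.join(out_dir, f"{file_id}.npy")
    np.save(path, arr)
    return path

# In the real loop this would be dcm.pixel_array for each DICOM file
arr = np.random.randint(0, 255, (512, 512), dtype=np.int16)
path = save_pixel_array(arr, "example_id", out_dir)
del arr  # drop the reference so the memory can be reclaimed

loaded = np.load(path)
print(loaded.shape)  # (512, 512)
```

Because each array is written and released inside the loop, peak memory stays at roughly one image regardless of how many files you process.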


Hey, this is really useful, thanks.
I was wondering if using the numpy arrays directly in my conv net by converting them to torch tensors is better than converting them to images (which I assume are byte arrays that get converted to tensors). Do you have any idea about this?

What do you mean by converting to images? Do you mean writing them out as, say, JPEG? Or converting to a fastai Image as above? The fastai Image is just a wrapper around a tensor with various operations, so it's very fast.
Compared to writing to image files, storing the tensors should generally be faster (or numpy arrays; torch.from_numpy() shares memory with the numpy array so it's very fast, while other functions that convert numpy arrays copy the data). That avoids the work needed to decode the image files. Though with uncompressed formats that won't be much work, saved tensors will be optimised for loading. But then you can't use standard tools on them, so it's a tradeoff.
I'm not sure how fast the above is. The pydicom library could be slow, so it might be better to extract the images from the DICOMs first. I didn't test that. Though that will all happen in worker processes anyway, so unless it's really slow it shouldn't affect training speed too much.
