Is using raw (image) data always better for training?

Archaeologist · July 3, 2020, 9:38am

I was told that using raw data as input for deep learning is often or always (?) better than using pre-processed (or feature engineered) data.

For example: Use RAW images rather than JPEGs, use original size images rather than downscaled ones (if feasible memory-wise).

Is that correct?

Does anyone have a scientific source to cite from?

machinethink · July 3, 2020, 10:38am

The whole point of a deep learning model is that it can do feature engineering better than humans. That’s what it learns when you train it.

However, neural nets are also pretty dumb, so sometimes it makes sense to pass in additional features (for example, CoordConv, where you also pass in the coordinates of every pixel in the image as extra input channels).
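To make the coordinate idea concrete, here is a minimal NumPy sketch that appends two coordinate channels to an image. The function name `add_coord_channels`, the (height, width, channels) layout, and the choice to scale coordinates to [-1, 1] are assumptions made for illustration, not something prescribed in this thread.

```python
import numpy as np

def add_coord_channels(image: np.ndarray) -> np.ndarray:
    """Append normalized y and x coordinate channels to an (H, W, C) image."""
    height, width = image.shape[:2]
    # Scale coordinates to [-1, 1] so they do not depend on the image size
    # (an assumption; other normalizations would work as well).
    ys = np.linspace(-1.0, 1.0, height, dtype=np.float32)
    xs = np.linspace(-1.0, 1.0, width, dtype=np.float32)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([yy, xx], axis=-1)  # shape (H, W, 2)
    return np.concatenate([image.astype(np.float32), coords], axis=-1)

image = np.zeros((4, 6, 3), dtype=np.float32)  # dummy RGB image
augmented = add_coord_channels(image)
print(augmented.shape)  # 3 colour channels plus 2 coordinate channels
```

The network then sees where each pixel sits in the image, which a plain convolution cannot know on its own.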

RAW vs JPEGs only matters if the JPEG compression artifacts confuse your model. Loading JPEG files is typically faster than loading RAW images.

Original image size vs downscaled has an impact on speed and memory. Larger images may also require a deeper network. In general, larger images give better results, but they can also cause the model to pick up on unnecessary details. In that sense, downscaling actually is a type of feature engineering (low-pass filtering).
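The low-pass effect of downscaling can be seen in a small NumPy sketch. It uses plain block averaging, which is only one possible way to downscale; the function name `downscale_by_averaging` and the toy images are made up for illustration.

```python
import numpy as np

def downscale_by_averaging(image: np.ndarray, factor: int) -> np.ndarray:
    """Downscale a 2D image by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    # Drop trailing rows/columns that do not fill a complete block.
    h, w = h - h % factor, w - w % factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Fine detail: a 1-pixel checkerboard of zeros and ones.
checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(np.float32)
# Coarse structure: left half dark, right half bright.
coarse = np.zeros((8, 8), dtype=np.float32)
coarse[:, 4:] = 1.0

checker_small = downscale_by_averaging(checker, 2)
coarse_small = downscale_by_averaging(coarse, 2)

print(checker_small)  # the checkerboard is averaged away to uniform gray
print(coarse_small)   # the large-scale edge is still there
```

The pixel-level pattern disappears while the coarse structure survives, which is the sense in which downscaling acts as a low-pass filter.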

sophia · July 4, 2020, 8:01pm

I have a slightly different question. I want to deploy deep learning for digital pathology. (I hope it is ok to ask this question here.)

I am working with tiles that I generate within a digital pathology program from a tissue slide. I can create tiles (from tumor tissue) that always have the same physical size, e.g. 150 µm. When I compare these generated tiles between different origin slides, the pixel dimensions of the files differ. From one slide I generate tiles with 306x306 pixels, from another slide I generate tiles with 330x330 pixels, and so on. The file format is TIFF (".tif").
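A quick calculation shows what the differing pixel counts imply. This sketch assumes that each tile edge really covers exactly 150 µm on both slides, so the slides differ only in scan resolution; that is an inference from the description above, not something confirmed by the software.

```python
TILE_EDGE_UM = 150.0  # physical tile edge in micrometers (taken from the post)

def microns_per_pixel(edge_px: int, edge_um: float = TILE_EDGE_UM) -> float:
    """Physical size of one pixel, assuming the tile edge spans edge_um."""
    return edge_um / edge_px

for edge_px in (306, 330):
    print(f"{edge_px} px -> {microns_per_pixel(edge_px):.3f} micrometers per pixel")

# Relative difference in pixel count (and hence resolution) between the slides.
relative_difference = 330 / 306 - 1
print(f"relative difference: {relative_difference:.1%}")
```

Under that assumption the two slides differ in resolution by roughly 8 percent.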

On the “surface” around the tumor I can also generate tiles that are “cut off”, with various sizes, where the tumor “ends”.

Like this (yellow shows the various possible shapes):

[Image: Screenshot from 2020-07-04 21-49-40]

What would you recommend as a good way to deal with these different dimensions? What happens if I just load all these files into the same databunch in the fastai library, for example?

I could probably get rid of the tiles with the various shapes (if necessary with some Python programming), but at the moment I don’t know what to do about the different dimensions of the tiles between different slides. Will these different sizes decrease a deep learning model’s accuracy compared to a set of tiles that all have the same original dimensions?
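One possible way to handle both issues is to drop the non-square (cut-off) tiles and resize the rest to a common edge length before batching. The sketch below uses NumPy with a simple nearest-neighbour resize; the helper names, the target size of 256, and the dummy tiles are all assumptions for illustration. In practice an image library or the training framework's own resize transform would normally do this step with better interpolation.

```python
import numpy as np

TARGET_SIZE = 256  # assumed common edge length in pixels

def is_square(tile: np.ndarray) -> bool:
    """True for full tiles; cut-off tiles at the tumor border are not square."""
    height, width = tile.shape[:2]
    return height == width

def resize_nearest(tile: np.ndarray, size: int) -> np.ndarray:
    """Resize an (H, W, C) tile to (size, size, C) by nearest-neighbour sampling."""
    height, width = tile.shape[:2]
    rows = np.arange(size) * height // size  # source row for each output row
    cols = np.arange(size) * width // size   # source column for each output column
    return tile[rows[:, None], cols]

# Dummy tiles standing in for the .tif files: two full tiles, one cut-off tile.
tiles = [
    np.zeros((306, 306, 3), dtype=np.uint8),
    np.zeros((330, 330, 3), dtype=np.uint8),
    np.zeros((306, 120, 3), dtype=np.uint8),
]

batch = np.stack([resize_nearest(t, TARGET_SIZE) for t in tiles if is_square(t)])
print(batch.shape)  # two tiles of identical size, ready to be batched
```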

Archaeologist · July 5, 2020, 7:08am

In your example the size varies by only around 10 percent, so I assume you can safely use the tiles for training. Adding data augmentation (scaling) may even improve results.
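As a rough illustration of scale augmentation, the sketch below takes a randomly sized, randomly placed crop and resizes it to a fixed output size, so the model sees the tissue at slightly different magnifications. The function name `random_scale_crop`, the crop range of 80 to 100 percent, the output size of 224, and the nearest-neighbour resize are all assumptions; deep learning libraries ship their own, better augmentation transforms.

```python
import numpy as np

def random_scale_crop(tile, out_size, rng, min_fraction=0.8):
    """Crop a random fraction of the tile and resize it to out_size x out_size."""
    height, width = tile.shape[:2]
    # A smaller crop resized to the same output size acts like zooming in.
    fraction = rng.uniform(min_fraction, 1.0)
    crop_h = max(1, int(round(height * fraction)))
    crop_w = max(1, int(round(width * fraction)))
    top = rng.integers(0, height - crop_h + 1)
    left = rng.integers(0, width - crop_w + 1)
    crop = tile[top:top + crop_h, left:left + crop_w]
    # Nearest-neighbour resize to the fixed output size.
    rows = np.arange(out_size) * crop_h // out_size
    cols = np.arange(out_size) * crop_w // out_size
    return crop[rows[:, None], cols]

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
tile = np.zeros((306, 306, 3), dtype=np.uint8)  # dummy tile
augmented = random_scale_crop(tile, 224, rng)
print(augmented.shape)
```

Calling the function again with the same tile gives a different crop each time, which is the point of the augmentation.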

sophia · July 5, 2020, 10:15pm

OK, I will try with the files I have now. Thank you!