With this, IIUC, item_tfms will be executed twice. I couldn’t figure out a feasible way to share pre-processing between blocks. The only place I see for it in the DataBlock API would be the get_items function, but that would require loading all data into memory (which is infeasible for large data sets). I could also create my own blocks that share cached data behind the scenes, but that seems clunky.
I concluded that to achieve shared pre-processing, I’ll have to use the Mid-Level API (TfmdLists and friends). Is this correct or am I missing something in the DataBlock API?
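One workaround that stays inside the DataBlock API, in case it helps: memoize the expensive step so that the input and target pipelines both call it, but the heavy work only runs once per item. This is a plain-Python sketch of the caching idea, not fastai code; `load_and_preprocess`, `get_x`, and `get_y` are hypothetical names standing in for the real getters.

```python
from functools import lru_cache

CALLS = []  # tracks how often the heavy work actually runs

@lru_cache(maxsize=None)
def load_and_preprocess(path):
    """Hypothetical expensive step shared by input and target pipelines."""
    CALLS.append(path)
    return f"decoded:{path}"  # stand-in for the real decoded item

def get_x(path):
    # input pipeline: reuses the cached shared step
    return load_and_preprocess(path) + "/x-view"

def get_y(path):
    # target pipeline: hits the cache, no second decode
    return load_and_preprocess(path) + "/y-view"

item = "img_001"
x, y = get_x(item), get_y(item)
# the shared step ran once, not twice: len(CALLS) == 1
```

The obvious caveat is memory: the cache holds every decoded item, so for very large data sets you would need a bounded `maxsize` or a disk-backed cache instead.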
This resulted in one_batch() returning [(x, x)] instead of [x, x] as your version does. I need to read up on ItemTransform… Any hint on the best starting point?
I tried that as well when I was playing with it. I’m pretty sure the key is the order here.
Regarding the transforms, not really. I should probably write an article at some point, but the key thing that helps me is knowing that a transform is not the same as data augmentation; it’s just applying some function to an input or output.
For future reference: I think the key here is not the order but the use of ItemTransform as opposed to a normal Transform. (A normal Transform is used when a plain function is passed into batch_tfms, because it ends up in a Pipeline.)
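To make the dispatch difference concrete, here is a plain-Python sketch (stand-in functions, not the actual fastcore classes): a normal Transform maps its function over each element of a tuple, while an ItemTransform hands the whole tuple to the function at once. This is why a tuple-returning function ends up producing a nested [(x, x)] when it goes through the Transform path.

```python
def apply_like_transform(f, item):
    # Transform-style dispatch: f is mapped over each tuple element
    if isinstance(item, tuple):
        return tuple(f(el) for el in item)
    return f(item)

def apply_like_item_transform(f, item):
    # ItemTransform-style dispatch: f receives the whole tuple at once
    return f(item)

double = lambda x: x * 2
pair_up = lambda t: (t, t)  # duplicates whatever it is given

print(apply_like_transform(double, (3, 4)))        # (6, 8)
print(apply_like_item_transform(pair_up, (3, 4)))  # ((3, 4), (3, 4))
# whereas pair_up through the Transform path would nest per element:
print(apply_like_transform(pair_up, (3, 4)))       # ((3, 3), (4, 4))
```

In other words, which base class the transform inherits from decides whether your function sees one element at a time or the full (input, target) tuple.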