I have a scenario where my dataset is massive and I only want to call get_x and get_y when the items are needed for a batch. get_items returns a list of files that get_x/get_y will operate on to get the actual inputs/targets.
It should be lazy by default. You can check via dblock.summary() and pass in a path. If you simply do something like:
get_x = lambda x: print(x)
You’ll notice that it won’t actually print out an entire list of things but instead one item (be sure to pair this with get_items).
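To make the "lazy by default" idea concrete, here is a minimal plain-Python sketch. It is not fastai's implementation; LazyDataset, the file names, and the stand-in getters are all made up for illustration. It only shows the pattern being described: the raw items are stored up front, and get_x/get_y run when a single index is requested.

```python
# Plain-Python illustration of lazy getters; not fastai's implementation.
class LazyDataset:
    """Stores raw items and applies get_x/get_y only when an index is requested."""
    def __init__(self, items, get_x, get_y):
        self.items, self.get_x, self.get_y = list(items), get_x, get_y

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        item = self.items[i]
        return self.get_x(item), self.get_y(item)

calls = []  # records every item the getters actually touch

def get_x(item):
    calls.append(("x", item))
    return item.upper()      # stand-in for loading the input from a file

def get_y(item):
    calls.append(("y", item))
    return len(item)         # stand-in for deriving the target

files = ["a.txt", "bb.txt", "ccc.txt"]   # what get_items would return (hypothetical names)
ds = LazyDataset(files, get_x, get_y)
calls_after_build = list(calls)          # still empty: nothing has been loaded yet
first = ds[0]                            # only now do get_x/get_y run, for one item
```

Building the dataset touches no files; indexing one element calls each getter exactly once, which matches the single printed item you see with the lambda above.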
You are correct … unless you are using dl_type=SortedDL like I was. I forgot that this type of DataLoader has to pull all the items in order to figure out how to sort them.
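A rough sketch of why a sorting loader defeats the laziness, again in plain Python rather than fastai's actual SortedDL code. The sorted_indices helper and the integer items are hypothetical; the point is only that sorting by a property of the loaded input means every item has to be loaded once before the first batch.

```python
loaded = []  # records which items were actually loaded

def get_x(item):
    loaded.append(item)
    return "x" * item          # stand-in for an expensive load; item encodes the length

items = [5, 2, 9, 1]           # hypothetical raw items, as get_items might return

def sorted_indices(items, get_x):
    # Sorting by the length of the loaded input needs get_x for every item.
    lengths = [len(get_x(it)) for it in items]
    return sorted(range(len(items)), key=lambda i: lengths[i], reverse=True)

order = sorted_indices(items, get_x)   # longest first
```

After this runs, loaded contains every item even though no batch has been requested yet, which is the behaviour described above.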
Btw, do you know if there is a callback we can use that gets run right before get_x or get_y? Something that operates on what get_items returns, but not until you need an actual x and y?
Thanks!
What about passing in a function after your get_items so it runs afterwards? (No idea if it would work, just a thought.) Something similar to how we can pass in multiple functions to get_y to be run in a row.
Just saw the second bit: either the tail end of get_items or the front end of get_y, maybe?
Hmmmm … I’ll have to play with this one.
Do you know when the blocks get called? Haven’t looked yet … wondering if they get called lazily before get_x and get_y, or if they get called right after get_items.
If the former, that may be where I want to do things. If you know, lmk.
dblock.summary() should show it.
Note that get_x and get_y can be a list of functions.
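As a sketch of what "a list of functions" buys you, the snippet below chains two small functions so the first one acts as a hook on the raw item before the real getter runs. This is plain Python with a hypothetical compose helper and made-up function names; how fastai chains the list internally may differ.

```python
from functools import reduce

def compose(*funcs):
    """Return a function that applies funcs left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)

def strip_ext(name):
    return name.rsplit(".", 1)[0]   # "cat_001.jpg" -> "cat_001"

def label_of(stem):
    return stem.split("_")[0]       # "cat_001" -> "cat"

# Hypothetical equivalent of passing [strip_ext, label_of] as get_y.
get_y = compose(strip_ext, label_of)
label = get_y("cat_001.jpg")
```

Because the chain only runs when get_y is called on an item, a preprocessing step placed at the front of the list stays as lazy as the getter itself.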