Is there a way to make the DataBlock's "get_x" and "get_y" lazy?

I have a scenario where my dataset is massive and I only want get_x and get_y to run when the items are needed for a batch.

get_items returns a list of files, which get_x/get_y then operate on to produce the actual inputs/targets.

It should be lazy by default. You can check by calling dblock.summary() and passing in a path. If you simply do something like:

get_x = lambda x: print(x)

You’ll notice that it won’t print out the entire list of items but just one item (be sure to pair this with get_items).
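To make the lazy behaviour concrete without depending on fastai, here is a minimal pure-Python sketch. `LazyItems`, `fake_get_x`, `fake_get_y`, and the file names are all hypothetical names made up for illustration; this only mimics the idea of applying get_x/get_y when an item is indexed and is not fastai's actual implementation.

```python
class LazyItems:
    """Hypothetical stand-in for a dataset that applies get_x/get_y on access."""

    def __init__(self, items, get_x, get_y):
        # Only the raw items (e.g. file names) are stored up front.
        self.items, self.get_x, self.get_y = items, get_x, get_y

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        # get_x and get_y run here, for this one item only.
        item = self.items[i]
        return self.get_x(item), self.get_y(item)


calls = []  # records every item that get_x was actually applied to


def fake_get_x(path):
    calls.append(path)
    return path.upper()  # placeholder for "load the input"


def fake_get_y(path):
    return path.split("_")[0]  # placeholder for "derive the label"


files = ["cat_1.jpg", "dog_2.jpg", "cat_3.jpg"]  # what get_items might return
ds = LazyItems(files, fake_get_x, fake_get_y)
calls_before = len(calls)  # nothing has been processed yet
sample = ds[1]             # only now is get_x applied, and only to one item
```

Building `ds` does no work per item; indexing is what triggers `fake_get_x`, which is the same effect the print-based check above is meant to reveal.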

You are correct … unless you are using dl_type=SortedDL like I was :slight_smile:

I forgot that this type of DataLoader has to pull all the items in order to figure out how to sort them.
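A rough sketch of why a sorted loader ends up touching everything, in plain Python rather than fastai code. `load_length` and `sorted_indices` are hypothetical helpers, and using the length of the file name as the sort key is just an assumption for the example.

```python
def load_length(path, log):
    # Placeholder for "process the item and measure it"; logs each access.
    log.append(path)
    return len(path)


def sorted_indices(items, key_fn):
    # A sort key is needed for every item before any ordering can be decided,
    # so key_fn is evaluated on the whole collection up front.
    keys = [key_fn(item) for item in items]
    return sorted(range(len(items)), key=keys.__getitem__, reverse=True)


touched = []
files = ["a.txt", "longer_name.txt", "mid.txt"]
order = sorted_indices(files, lambda p: load_length(p, touched))
```

After this runs, `touched` contains every file even though no batch has been requested yet. If the sort keys are cheap to obtain some other way (for example, lengths stored in a metadata file), precomputing them avoids the full pass. I believe fastai's SortedDL accepts precomputed keys through a `res` argument, but treat that as something to verify against the docs for your version.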

Btw, do you know if there is a callback we can utilize that gets run right before get_x or get_y? Something that operates on what get_items returns but not until you need an actual x and y?

Thanks!

What about passing in a function after your get_items so it runs afterwards? (No idea if it would work, just a thought.) Something similar to how we can pass in multiple functions to get_y to be run in a row.

Just saw the second bit: either the tail end of get_items or the front end of get_y, maybe?

Hmmmm … I’ll have to play with this one.

Do you know when the blocks get called? Haven’t looked yet … wondering if they get called lazily before get_x and get_y or if they get called right after get_items.

If the former, that may be where I want to do things. If you know, lmk :wink:

dblock.summary() should show it :wink:

Note that get_x and get_y can be a list of functions.
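For the "run something right before get_x" question, chaining functions is one way to think about it. The sketch below is plain Python, not fastai: `compose`, `resolve_record`, and `read_text_stub` are hypothetical names, and the dict-shaped record is just an assumed output of get_items. With a DataBlock you would pass the list of functions directly instead of composing them yourself.

```python
from functools import reduce


def compose(*funcs):
    """Return a function that applies funcs left to right to a single item."""
    def run(item):
        return reduce(lambda value, f: f(value), funcs, item)
    return run


def resolve_record(record):
    # First step: pick out the piece of the raw item that the next step needs.
    return record["path"]


def read_text_stub(path):
    # Second step: placeholder for actually loading the input.
    return f"contents of {path}"


# Both steps run only when get_x is called on one item, not over the whole list.
get_x = compose(resolve_record, read_text_stub)
x = get_x({"path": "data/0001.txt", "label": "pos"})
```

Because the first function in the chain sees exactly what get_items returned, it can act as the "hook" that runs just before the real loading step, and it stays as lazy as get_x itself.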
