SOURCE CODE: Mid-Level API

You can also decode a batch of data directly, e.g.:

batch = dls.one_batch()
dec = dls.decode_batch(*tuplify(batch[0]), *tuplify(batch[1]))

@florianl

Thank you. Using dls.dataset was easier than I thought :smiley:

Regarding decode_batch: I came across that code snippet when debugging show_batch, but I couldn’t find a way to decode a specific image (by id / path) using the batch methods.

Yeah, it assumes you’ve already got it transformed in some way (I also think you can pass it a batch of size 1).

I decided to follow @philchu’s suggestion and I’m building a fastai cookbook: fastcook

The quotes above describe exactly what I’m trying to do: a collection of recipes (snippets of code) on how to use some standard and not so standard functionality of fastai.

While this is a bit different from the work that is being done here, I think both ideas complement each other very nicely! I don’t want to go very far “behind the scenes” in the cookbook. Instead, I’ll be adding a lot of links to the blogs produced here, so readers who want to go more in depth can do so :smile:

There are two ways of exploring the cookbook: you can check the nbs directly or you can browse the generated docs. The docs have a sidebar and a search function (not working yet), so they should be easier to use. You can also link to specific parts of the documentation, like: How to use callbacks? This makes it really easy to share specific recipes.

I hope this will be helpful to us all :grin: :grin:


Please add it to the blog thread @lgvaz :slight_smile:


Hi everyone, as some of you know, I shared an implementation of DeViSE in fastai here.
I used

dls=ImageDataLoaders.from_lists(path="./",fnames=images,labels=img_vecs,y_block=RegressionBlock,bs=256,seed=123)

to create the dataloaders. Here, images is a list of image paths and img_vecs is the list of word vector arrays for the labels.
I wanted to use the DataBlock API here but couldn’t figure out how at first. After digging through the source code I came up with

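from fastai.vision.all import *

def _zip(x):
    # Turn (images, img_vecs) into a list of (image, vector) pairs, one per item
    return L(x).zip()

dblock = DataBlock(blocks=(ImageBlock, RegressionBlock), get_items=_zip, getters=[ItemGetter(0), ItemGetter(1)], splitter=RandomSplitter(seed=123))
dls = dblock.dataloaders((images, img_vecs))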

which is exactly what happens when you use ImageDataLoaders.from_lists. But I still don’t know why get_items=_zip is needed. I’m guessing the dataloader expects each (x, y) item separately, rather than a batch or list of them, so that it can transform the items properly, stack them and return them as a batch? Can anyone explain to me what is happening here?
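A minimal, plain-Python sketch of what the _zip step does to the source, assuming L(x).zip() behaves like the built-in zip(*x). The file names and vectors below are made up for illustration. The point is that the source passed to dataloaders is a pair of parallel lists (a collection of length 2), while each getter is applied to one sample at a time, so the pair has to be turned into one (image, vector) tuple per sample first.

```python
# Hypothetical stand-ins for the real inputs.
images = ["cat.jpg", "dog.jpg", "car.jpg"]        # made-up image paths
img_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # made-up word vectors

# What gets passed to dblock.dataloaders(...): two parallel lists.
source = (images, img_vecs)

# Plain-Python equivalent of the _zip helper (assumption: L(x).zip() ~ zip(*x)).
items = list(zip(*source))

# Each item is now one sample; index 0 is the input and index 1 the target,
# which is what ItemGetter(0) and ItemGetter(1) pick out.
first_x = items[0][0]
first_y = items[0][1]
print(len(source), len(items), first_x, first_y)
```

Without the zip step, the collection would have only two "items" (the whole list of images and the whole list of vectors), which is not what the getters and the splitter are meant to work on.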


I took the liberty of adding a debugging tutorial I made to the wiki: “Five ways to debug fastai”


I was going through 05_data_transform, and the get_files function is defined as follows:

Can someone explain this line of code:

if len(folders) !=0 and i==0 and '.' not in folders: continue

and why is it used?

Also, why is the os library preferred over pathlib?

I’m training a model on a GCP instance which has 16 GB of GPU RAM. Even after I restart my kernel, I see that around 9 GB of RAM is in use when nothing is running, as shown below:

And when I actually start training a model, a parallel process starts up like this:

Is there a way to clear up GPU RAM? I was able to run an xresnet50 model initially with a bs of 64, but now I need to scale down to 16 because of this blocked memory…

os is used because of os.scandir, which is the fastest way to iterate through folders. pathlib is still used, though: Path comes from pathlib.


This looks like a second Jupyter kernel running. Check whether you have another notebook open and shut down its kernel.
If that fails, you can restart Jupyter or manually kill the Python process.
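To find out which process is holding the memory before killing anything, one option is to ask nvidia-smi for its per-process table. The sketch below is only an illustration: parse_gpu_processes and gpu_processes are hypothetical helper names, the sample text is made up, and the parser assumes the simple "pid, name, MiB" CSV layout with numeric memory values.

```python
import subprocess

# nvidia-smi query for compute processes, as CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]

def parse_gpu_processes(csv_text):
    """Parse 'pid, name, used MiB' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, name, used_mib = [field.strip() for field in line.split(",")]
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used_mib)})
    return rows

def gpu_processes():
    """Run nvidia-smi (must be on PATH) and return the parsed process list."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_processes(out)

# Made-up sample output, so the parser can be tried without a GPU.
sample = "1234, /opt/conda/bin/python, 9012\n5678, /opt/conda/bin/python, 3050\n"
procs = parse_gpu_processes(sample)
print(procs)
```

On a machine with a GPU, gpu_processes() would give the PIDs to look for in the list of running Jupyter kernels.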

if len(folders) !=0 and i==0 and '.' not in folders: continue

The folders variable here represents the folders we want to recurse into. The line says: if we are at the top level (the first step of os.walk()), we have specified folders to recurse into, and ‘.’ is not in that list, then do not include files from the current directory. So this works as you would expect: get_files(path, folders=['A','B']) will get you all the files from path/'A' and path/'B'. But if you call get_files(path, folders=['A','B', '.']), it will also include files directly under path.

As a side note, I actually found the previous three lines more mind-bending: we are modifying a local variable d that is never used again in the code. Only after some debugging did I find this in the Python docs:
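The behaviour described above can be reproduced with a small stand-alone sketch. list_files below is a hypothetical, simplified stand-in for get_files (no extension filtering, not the library's actual code), run against a throwaway directory tree created just for the demo.

```python
import os
import tempfile
from pathlib import Path

def list_files(path, folders=None):
    """Simplified, hypothetical stand-in for get_files (no extension filter)."""
    path = Path(path)
    folders = list(folders or [])
    found = []
    for i, (dirpath, dirnames, filenames) in enumerate(os.walk(path)):
        if folders and i == 0:
            # Top level: only descend into the requested folders.
            dirnames[:] = [d for d in dirnames if d in folders]
        else:
            # Everywhere else: skip hidden directories.
            dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        # Top level again: skip files directly under `path`
        # unless '.' was explicitly requested.
        if folders and i == 0 and "." not in folders:
            continue
        found += [Path(dirpath) / name for name in filenames]
    return sorted(p.relative_to(path).as_posix() for p in found)

def demo():
    # Build a throwaway tree: top.txt, A/a.txt, B/b.txt, C/c.txt
    with tempfile.TemporaryDirectory() as root:
        root = Path(root)
        for rel in ["top.txt", "A/a.txt", "B/b.txt", "C/c.txt"]:
            (root / rel).parent.mkdir(parents=True, exist_ok=True)
            (root / rel).touch()
        return list_files(root, ["A", "B"]), list_files(root, ["A", "B", "."])

without_dot, with_dot = demo()
print(without_dot)  # files under A and B only
print(with_dot)     # same, plus top.txt from the root itself
```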

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again.

So you are basically altering a value yielded from a function to change the future yields :open_mouth:
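A small sketch of why the slice assignment matters. It compares d[:] = ... (which mutates the list that os.walk is holding) with d = ... (which only rebinds the local name and prunes nothing). The helper name visited_dirs and the directory layout are made up for the demo.

```python
import os
import tempfile
from pathlib import Path

def visited_dirs(root, prune_in_place):
    """Return the sub-directories os.walk visits below `root`."""
    seen = []
    for dirpath, dirnames, _ in os.walk(root, topdown=True):
        rel = os.path.relpath(dirpath, root)
        if rel != ".":
            seen.append(Path(rel).as_posix())
        keep = [d for d in dirnames if d != "skip"]
        if prune_in_place:
            dirnames[:] = keep   # mutates the list os.walk will recurse into
        else:
            dirnames = keep      # only rebinds the local name; no pruning
    return sorted(seen)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "keep").mkdir()
    (root / "skip" / "inner").mkdir(parents=True)
    pruned = visited_dirs(root, prune_in_place=True)
    unpruned = visited_dirs(root, prune_in_place=False)

print(pruned)
print(unpruned)
```

With in-place assignment the walk never enters skip; with plain rebinding it visits skip and skip/inner as if nothing had happened.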


Wow, I did not look at the d[:] very closely because I was confused about the line I asked about.

I propose we have a study group meeting tomorrow, about 12 hours from now, and start digging into the DataBlocks API for a few hours.

Trust me when I say that the DataBlocks API is a rabbit hole, and we will be sucked into dataloaders, datasets, TfmdLists, Transforms and whatnot.

But believe me, this will be worth it, as we can then spread out as a group and help the fastai library with documentation and more examples.

I have already started writing basic blogs about the high-level DataBlocks API, but it would be great to replicate everything with the mid-level API, without DataBlocks.

What do you guys think?


Can it be done at 6:30 PM San Francisco time, which is 7 AM India time? :slightly_smiling_face:

Also, how much understanding of PyTorch is required to understand the details?

Thanks
Ganesh


That sounds great! I’ve been inactive for 2-3 days because I had a paper (sort of) submission deadline this morning. I can contribute now.


Excellent, I’ve been inactive too, but let’s get right back into it.

A basic understanding of datasets/dataloaders would do. Personally, I get stuck on fastai’s many Python tricks rather than on PyTorch or the code as such. For me it’s mostly the decorators and class methods that get magically patched and make the whole library work seamlessly.

What is the meeting link and is the time decided?

We’re just in the process of deciding the meeting time here. Do you have a preference?

I am in India, so I would say any time after 7 AM IST.
