Alright, I managed to adjust the “get_files” function a little so it will also check my condition before it adds the desired photo to the list.
def get_files2(path, extensions=image_extensions, recurse=True, folders=None,
               followlinks=True, modulus=10, x_m=2, y_m=3):
    "Get all the files in `path` with optional `extensions`, optionally with `recurse`, only in `folders`, if specified."
    # Variant of fastai's `get_files`: in the NON-recursive branch, keep only
    # files whose first digit-run in the name, taken modulo `modulus`, equals
    # `x_m` or `y_m`.  Pass x_m=False and y_m=False (the boolean, not 0) to
    # disable the filter entirely.
    path = Path(path)
    folders = L(folders)
    extensions = setify(extensions)
    extensions = {e.lower() for e in extensions}

    def _passes(name):
        # True when `name` contains digits and its first digit-run mod
        # `modulus` is x_m or y_m.  Using re.search instead of
        # re.findall(...)[0] avoids an IndexError on digit-less names:
        # such files are simply filtered out.
        m = re.search(r'\d+', name)
        return m is not None and int(m.group()) % modulus in (x_m, y_m)

    if recurse:
        res = []
        # os.walk yields (dirpath, dirnames, filenames) per directory.
        for i, (p, d, f) in enumerate(os.walk(path, followlinks=followlinks)):
            if len(folders) != 0 and i == 0: d[:] = [o for o in d if o in folders]
            else: d[:] = [o for o in d if not o.startswith('.')]
            if len(folders) != 0 and i == 0 and '.' not in folders: continue
            # NOTE(review): the modulus filter is NOT applied while recursing —
            # this mirrors the original code, but it means a recursive call
            # returns all 59 files while a flat call returns the filtered 24.
            # Apply `_passes` to `f` here too if both paths should agree.
            res += _get_files(p, f, extensions)
    elif x_m is False and y_m is False:
        # `is False`, not `== False`: 0 == False in Python, so the old test
        # would wrongly disable filtering when a remainder of 0 is requested.
        f = [o.name for o in os.scandir(path) if o.is_file()]
        res = _get_files(path, f, extensions)
    else:
        f = [o.name for o in os.scandir(path) if o.is_file() and _passes(o.name)]
        res = _get_files(path, f, extensions)
    return L(res)
I mostly added this: (with import re, as well)
(o.is_file() and (int(re.findall(r'\d+', o.name)[0])) % modulus in (x_m, y_m))]
So right afterward, in the main folder, the items that were appended into the list were only the ones that followed the condition that I added with the modulus.
But when I run the `summary()` method, it doesn't respect the filter: it still picks up the other photos in the folder and adds them to the list of items, marking each of them as `None`.
This is what I get by running “one_batch”
/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in one_batch(self)
146 def one_batch(self):
147 if self.n is not None and len(self)==0: raise ValueError(f'This DataLoader does not contain any batches')
--> 148 with self.fake_l.no_multiproc(): res = first(self)
149 if hasattr(self, 'it'): delattr(self, 'it')
150 return res
/usr/local/lib/python3.7/dist-packages/fastcore/basics.py in first(x, f, negate, **kwargs)
553 x = iter(x)
554 if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
--> 555 return next(x, None)
556
557 # Cell
/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in __iter__(self)
107 self.before_iter()
108 self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 109 for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
110 if self.device is not None: b = to_device(b, self.device)
111 yield self.after_batch(b)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
519 if self._sampler_iter is None:
520 self._reset()
--> 521 data = self._next_data()
522 self._num_yielded += 1
523 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
559 def _next_data(self):
560 index = self._next_index() # may raise StopIteration
--> 561 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
562 if self._pin_memory:
563 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
37 raise StopIteration
38 else:
---> 39 data = next(self.dataset_iter)
40 return self.collate_fn(data)
41
/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in create_batches(self, samps)
116 if self.dataset is not None: self.it = iter(self.dataset)
117 res = filter(lambda o:o is not None, map(self.do_item, samps))
--> 118 yield from map(self.do_batch, self.chunkify(res))
119
120 def new(self, dataset=None, cls=None, **kwargs):
/usr/local/lib/python3.7/dist-packages/fastcore/basics.py in chunked(it, chunk_sz, drop_last, n_chunks)
215 if not isinstance(it, Iterator): it = iter(it)
216 while True:
--> 217 res = list(itertools.islice(it, chunk_sz))
218 if res and (len(res)==chunk_sz or not drop_last): yield res
219 if len(res)<chunk_sz: return
/usr/local/lib/python3.7/dist-packages/fastai/data/load.py in do_item(self, s)
131 def prebatched(self): return self.bs is None
132 def do_item(self, s):
--> 133 try: return self.after_item(self.create_item(s))
134 except SkipItemException: return None
135 def chunkify(self, b): return b if self.prebatched else chunked(b, self.bs, self.drop_last)
/usr/local/lib/python3.7/dist-packages/fastcore/transform.py in __call__(self, o)
198 self.fs = self.fs.sorted(key='order')
199
--> 200 def __call__(self, o): return compose_tfms(o, tfms=self.fs, split_idx=self.split_idx)
201 def __repr__(self): return f"Pipeline: {' -> '.join([f.name for f in self.fs if f.name != 'noop'])}"
202 def __getitem__(self,i): return self.fs[i]
/usr/local/lib/python3.7/dist-packages/fastcore/transform.py in compose_tfms(x, tfms, is_enc, reverse, **kwargs)
148 for f in tfms:
149 if not is_enc: f = f.decode
--> 150 x = f(x, **kwargs)
151 return x
152
/usr/local/lib/python3.7/dist-packages/fastai/vision/augment.py in __call__(self, b, split_idx, **kwargs)
32
33 def __call__(self, b, split_idx=None, **kwargs):
---> 34 self.before_call(b, split_idx=split_idx)
35 return super().__call__(b, split_idx=split_idx, **kwargs) if self.do else b
36
/usr/local/lib/python3.7/dist-packages/fastai/vision/augment.py in before_call(self, b, split_idx)
241
242 def before_call(self, b, split_idx):
--> 243 w,h = self.orig_sz = _get_sz(b)
244 if split_idx:
245 xtra = math.ceil(max(*self.size[:2])*self.val_xtra/8)*8
/usr/local/lib/python3.7/dist-packages/fastai/vision/augment.py in _get_sz(x)
142 def _get_sz(x):
143 if isinstance(x, tuple): x = x[0]
--> 144 if not isinstance(x, Tensor): return fastuple(x.size)
145 return fastuple(getattr(x, 'img_size', getattr(x, 'sz', (x.shape[-1], x.shape[-2]))))
146
AttributeError: 'NoneType' object has no attribute 'size'
Out of 59 photos in a folder, the get_files successfully reads only the relevant 24 files.
But when it goes through the `summary()` method, it tries once again to read all 59 photos, which leaves the 35 filtered-out ones as `NoneType` objects.
What shall I change then?