Filter_by_func returns a single repeated item

I’m going through the new NLP course and I’ve run into an issue with the filter_by_func function on the translation notebook using fastai version 1.0.51. This is the section I’m trying to run:

When I run this section, I end up with a single repeated item in my training dataset and a single repeated item in the validation dataset:

This appears to be due to how the fastai array function determines datatypes for numpy arrays.

This is the filter_by_func function for the LabelList class:

def filter_by_func(self, func:Callable):
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self

We expect the list comprehension to produce a boolean array, which is then used to index into self.x and self.y. When array is called on the list of booleans, the output is an int64 array in which the boolean values have been converted to 0 and 1. This means that instead of boolean indexing, self.x[~filt] performs integer indexing and picks the same item over and over again.
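To illustrate the difference between the two kinds of indexing, here is a small sketch in plain numpy. It does not use fastai at all; the names items, bool_mask and int_mask are made up for the illustration, and the int64 array is created by hand to mimic the situation described above.

```python
import numpy as np

# Hypothetical stand-in for the items of a LabelList
items = np.array(["a", "b", "c", "d"])

# Boolean mask: ~ is a logical NOT, and indexing keeps positions where the mask is True
bool_mask = np.array([True, False, True, False])
kept_bool = items[~bool_mask]            # keeps "b" and "d"

# Same flags stored as int64: ~ is a bitwise NOT, so 1 becomes -2 and 0 becomes -1
int_mask = bool_mask.astype(np.int64)
inverted = ~int_mask                     # [-2, -1, -2, -1]

# Integer indexing: each value is treated as a position, so only the last
# and second-to-last items can ever be selected
kept_int = items[inverted]               # "c", "d", "c", "d"
```

With an integer mask the result has the same length as the input but contains only the last one or two items, repeated, which matches the symptom of a dataset made of a repeated item.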

Here is an example trying to filter the LabelList from the notebook using the same method to illustrate:

Here is a more minimal example

The quick solution is to change filter_by_func to use np.array instead of array:

def filter_by_func(self, func:Callable):
    filt = np.array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self

It does look strange; it should at least call array([func(x,y) for x,y in zip(self.x.items, self.y.items)], dtype=bool) to allow boolean indexing. Besides, I don’t really understand why this filter works in reverse (it only keeps items where the filter yields False).

For now I think you should just do something like:

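from fastai.data_block import LabelList

def filter_by_func(self, func:Callable):
    # Force a boolean dtype so that ~filt is a logical NOT and the result is used as a mask
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)], dtype=bool)
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self

# Replace the library method with the patched version
LabelList.filter_by_func = filter_by_func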

You should then be able to run the notebook normally.

I don’t have any issue with the notebook. Note that the NLP course requires v1.0.54 or later, as bugs were fixed and functionality was added for it.

I take back what I said: the code should indeed work fine, since array implicitly finds the right dtype, even with version 1.0.51 (the implementations of both filter_by_func and array seem to be the same there). Upgrading is always a good idea though; the problem probably comes from elsewhere.

I also had a problem with filter_by_func running on 1.0.57 and just hacked my way out of it in the 7-seq2seq-translation notebook by directly excluding from the df any questions containing more than 25 words.
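In case it helps someone, here is a rough sketch of that kind of workaround in pandas. The column names en and fr, the toy rows and the max_words variable are assumptions for the illustration, not copied from the notebook, so adapt them to your own dataframe.

```python
import pandas as pd

# Toy dataframe; the "en"/"fr" column names are hypothetical
df = pd.DataFrame({
    "en": ["what is light ?", " ".join(["word"] * 30)],
    "fr": ["qu'est-ce que la lumière ?", " ".join(["mot"] * 10)],
})

max_words = 25  # length limit mentioned in the post above

def n_words(col: pd.Series) -> pd.Series:
    """Count whitespace-separated words in each row of a string column."""
    return col.str.split().str.len()

# Keep only rows where both sentences have at most max_words words
keep = (n_words(df["en"]) <= max_words) & (n_words(df["fr"]) <= max_words)
short_df = df[keep].reset_index(drop=True)
```

Filtering the dataframe before building the data bunch sidesteps filter_by_func entirely, at the cost of counting words on raw text rather than on tokens.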