I’m going through the new NLP course and I’ve run into an issue with the filter_by_func function on the translation notebook using fastai version 1.0.51. This is the section I’m trying to run:
When I run this section, I end up with a single repeated item in my training dataset and a single repeated item in the validation dataset:
This appears to be due to how the fastai array function determines datatypes for numpy arrays.
This is the filter_by_func function for the LabelList class:
def filter_by_func(self, func:Callable):
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self
We expect the function to return a boolean array, which is then used to index into self.x and self.y. When array is called on the list of booleans, the output is an int64 array in which the boolean values have been converted to 0 and 1. Applying ~ to an integer array is a bitwise NOT rather than a logical one, so 0 becomes -1 and 1 becomes -2. This means that instead of using boolean indexing, self.x[~filt] uses integer indexing and returns the same item (the last one, when every value is 0) over and over again.
To illustrate, here is an example that filters the LabelList from the notebook using the same method:
Here is a more minimal example:
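The same behaviour can be sketched with plain NumPy, no fastai required. The names `items` and `flags` are made up for illustration, and the int64 array is built explicitly here to stand in for what array returned in my case:

```python
import numpy as np

# Illustrative data: four items, of which only the third should be filtered out.
items = np.array(["a", "b", "c", "d"])
flags = [False, False, True, False]

# Boolean dtype: ~ is a logical NOT, so indexing acts as a mask.
bool_filt = np.array(flags)
kept_bool = items[~bool_filt]

# int64 dtype (assumed to mimic the problematic conversion): ~ is a bitwise NOT,
# so 0 -> -1 and 1 -> -2, and indexing picks items from the end of the array.
int_filt = np.array(flags, dtype=np.int64)
kept_int = items[~int_filt]

print(kept_bool)  # ['a' 'b' 'd']
print(kept_int)   # ['d' 'd' 'c' 'd']
```

With the integer version the result has the same length as the input, but almost every entry is the last item, which matches the single repeated item I see in the datasets.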
The quick solution is to change filter_by_func to use np.array instead of array:
def filter_by_func(self, func:Callable):
    filt = np.array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self
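Another option, which I have not tried inside fastai itself, would be to request a boolean dtype explicitly so the mask cannot end up as integers. Below is a standalone sketch of that idea on plain NumPy arrays; `filter_pairs` is a hypothetical helper, not part of the fastai API:

```python
import numpy as np

def filter_pairs(xs, ys, func):
    """Drop every (x, y) pair for which func(x, y) is truthy."""
    # dtype=bool forces a boolean mask, so ~mask is a logical NOT.
    mask = np.array([func(x, y) for x, y in zip(xs, ys)], dtype=bool)
    return xs[~mask], ys[~mask]

# Illustrative data: remove the pairs whose x value is greater than 2.
xs = np.array([1, 2, 3, 4])
ys = np.array([10, 20, 30, 40])
kept_x, kept_y = filter_pairs(xs, ys, lambda x, y: x > 2)

print(kept_x)  # [1 2]
print(kept_y)  # [10 20]
```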



