I’m going through the new NLP course and I’ve run into an issue with the filter_by_func function in the translation notebook, using fastai version 1.0.51. This is the section I’m trying to run:
When I run this section, I end up with a single repeated item in my training dataset and a single repeated item in the validation dataset:
This appears to be due to how the fastai array function determines datatypes for numpy arrays.
This is the filter_by_func function for the LabelList class:
def filter_by_func(self, func:Callable):
    filt = array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self
We expect the list comprehension to produce a boolean array, which is then used as a mask to index into self.x and self.y. When array is called on the list of booleans, the output is an int64 array, with the boolean values converted to 0 and 1. This means that instead of boolean indexing, self.x[~filt] performs integer indexing: on integers ~ is a bitwise NOT, so every 0 becomes -1, and the same item is selected over and over again.
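The indexing behavior described above can be reproduced with plain NumPy. This is only a sketch: it does not call fastai at all, it forces the int64 dtype explicitly to mimic what array appears to return, and the items and flags values are made up for illustration.

```python
import numpy as np

items = np.array(["a", "b", "c", "d"])   # made-up stand-in for self.x.items
flags = [False, True, False, False]       # what func(x, y) might return per item

bool_filt = np.array(flags)                  # dtype is bool
int_filt = np.array(flags, dtype=np.int64)   # same values stored as 0 and 1

# Boolean mask: ~ is a logical NOT, so items flagged True are dropped.
kept_with_bool = items[~bool_filt]

# Integer array: ~ is a bitwise NOT (0 -> -1, 1 -> -2), and the result
# is treated as positions counted from the end of the array.
inverted_ints = ~int_filt
kept_with_int = items[inverted_ints]

print(kept_with_bool)   # ['a' 'c' 'd']
print(inverted_ints)    # [-1 -2 -1 -1]
print(kept_with_int)    # ['d' 'c' 'd' 'd']
```

When nothing is flagged for removal, every entry becomes -1, which would explain ending up with one item repeated for the whole dataset.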
Here is an example trying to filter the LabelList from the notebook using the same method to illustrate:
Here is a more minimal example
The quick solution is to change filter_by_func to use np.array instead of array:
def filter_by_func(self, func:Callable):
    filt = np.array([func(x,y) for x,y in zip(self.x.items, self.y.items)])
    self.x,self.y = self.x[~filt],self.y[~filt]
    return self
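A slightly more defensive variation, in case it is useful, is to request a boolean dtype explicitly so the mask stays a mask even if func returns 0/1 integers. The sketch below is not fastai code: filter_pairs is a hypothetical standalone helper that works on plain arrays rather than on a LabelList.

```python
import numpy as np

def filter_pairs(xs, ys, func):
    """Hypothetical helper: drop every (x, y) pair for which func(x, y) is truthy."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    # dtype=bool guarantees mask indexing even if func returns 0/1 ints.
    filt = np.array([func(x, y) for x, y in zip(xs, ys)], dtype=bool)
    return xs[~filt], ys[~filt]

# func returns the ints 1 and 0 here, yet the odd values are still removed.
xs, ys = filter_pairs([1, 2, 3, 4], [10, 20, 30, 40], lambda x, y: x % 2)
print(xs, ys)   # [2 4] [20 40]
```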