Lesson 11 official topic

@Pierrot here is a comparison of APL with http://matrixmultiplication.xyz

APL Hadamard product (element-wise):
A ← 3 3 ⍴1 2 3 4 5 6 7 8 9
A
1 2 3
4 5 6
7 8 9

B ← 3 3 ⍴ 9 8 7 6 5 4 3 2 1
B
9 8 7
6 5 4
3 2 1
A × B
9 16 21
24 25 24
21 16 9
B × A
9 16 21
24 25 24
21 16 9

Maths matrix product (APL inner product, A +.× B) :slight_smile:

A x B:

30 24 18
84 69 54
138 114 90

B x A:

90 114 138
54 69 84
18 24 30
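The same comparison can be sketched in plain Python (no APL or NumPy needed): the element-wise (Hadamard) product commutes, while the matrix product does not.

```python
# Element-wise (Hadamard) product vs. matrix product,
# on the same 3x3 matrices as the APL example above.

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

def hadamard(X, Y):
    """Element-wise product: multiplies matching entries."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matmul(X, Y):
    """Matrix product: each entry is row-of-X dot column-of-Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

print(hadamard(A, B))  # [[9, 16, 21], [24, 25, 24], [21, 16, 9]] - same as hadamard(B, A)
print(matmul(A, B))    # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
print(matmul(B, A))    # [[90, 114, 138], [54, 69, 84], [18, 24, 30]]
```

This reproduces the numbers above: swapping the operands leaves the Hadamard product unchanged but gives a different matrix product.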

Hi folks, I recently started learning fastai and it is very interesting. In chapter 11 (mid-level API), the book has an example of creating a Datasets object with custom transforms using the IMDB dataset, where the data is stored as a list of file paths. According to the definition of Datasets in the fastai library, it only accepts a list as input, on which the x and y transforms work to output a tuple of (input, output). So, by definition, does this mean that we cannot create a Datasets object with custom transforms from a DataFrame?

Anyway, I tried to create a Datasets object myself but ran into multiple problems. Here is my code:

from fastai.text.all import *
import pandas as pd

imdb_sample_path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(imdb_sample_path/'texts.csv')

Transforms

imdb_tok = Tokenizer.from_df('text')
imdb_tokens = df['text'].map(imdb_tok)

imdb_num = Numericalize()
imdb_num.setup(imdb_tokens)

cat = Categorize()
lbls = list(df.label)
cat.setup(lbls)

These transforms work perfectly well like this, but when I put them in a Datasets they don't. The fastai documentation shows Datasets accepting a df as input, so I thought of passing the df directly. However, this code just doesn't work:

x_tfms = [Tokenizer.from_df('text'), Numericalize()]
y_tfms = [Categorize]

splits = [df.index[~df.is_valid].tolist(), df.index[df.is_valid].tolist()]
dsets = Datasets(df, [x_tfms, y_tfms], splits = splits)

Problems with this code:

1. Following is the error when I pass Numericalize() in x_tfms:

TypeError Traceback (most recent call last)
in <cell line: 0>()
6
7 # create a dataset object
----> 8 dsets = Datasets(df, [x_tfms, y_tfms], splits = splits)

10 frames
/usr/local/lib/python3.11/dist-packages/fastai/text/data.py in <listcomp>(.0)
     50 self.o2i = defaultdict(int, {v:k for k,v in enumerate(self.vocab) if v != 'xxfake'})
     51
---> 52 def encodes(self, o): return TensorText(tensor([self.o2i[o_] for o_ in o]))
     53 def decodes(self, o): return L(self.vocab[o_] for o_ in o)
     54

TypeError: unhashable type: 'L'

2. Following is the error when I remove Numericalize() from x_tfms:

TypeError Traceback (most recent call last)
in <cell line: 0>()
6
7 # create a dataset object
----> 8 dsets = Datasets(df, [x_tfms, y_tfms], splits = splits)

12 frames
/usr/local/lib/python3.11/dist-packages/fastcore/basics.py in uniqueify(x, sort, bidir, start)
    806 def uniqueify(x, sort=False, bidir=False, start=None):
    807     "Unique elements in x, optional sort, optional return reverse correspondence, optional prepend with elements."
--> 808     res = list(dict.fromkeys(x))
    809     if start is not None: res = listify(start)+res
    810     if sort: res.sort()

TypeError: unhashable type: 'Series'
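My reading of this traceback (I may be wrong): uniqueify calls dict.fromkeys(x), and dict keys must be hashable. Hashable scalars like strings work; list-like items (here I use a plain Python list as a stand-in for fastai's L or a pandas Series) raise the same TypeError. A minimal reproduction of the failure mode:

```python
# dict.fromkeys on hashable scalars works fine...
labels = ["negative", "positive", "negative"]
print(list(dict.fromkeys(labels)))  # ['negative', 'positive']

# ...but list-like items (stand-in for an L or a Series) are unhashable.
try:
    dict.fromkeys([["negative"], ["positive"]])
except TypeError as e:
    print(e)  # unhashable type: 'list'
```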

3. Removing or keeping y_tfms makes no difference.

Can anyone please help me? The basic idea is to be able to create a Datasets object with custom transforms even when the data is in DataFrame format. As of now, it seems to work only when the data is a list of file paths.
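For context, here is my mental model of what Datasets does, as a rough pure-Python sketch (NOT fastai's actual implementation): each item is fed through every transform pipeline, and the results are zipped into an (x, y) tuple. If the items are whole rows, I assume the first transform in each pipeline has to extract the column it cares about (like fastai's ColReader does), which would explain why a bare Categorize sees a whole row/Series instead of a label.

```python
# Rough sketch of the Datasets idea (not fastai's real code):
# every item goes through each transform pipeline; the outputs form a tuple.

rows = [
    {"text": "great movie", "label": "positive"},
    {"text": "awful movie", "label": "negative"},
]

def apply_pipeline(item, tfms):
    for t in tfms:
        item = t(item)
    return item

# The first transform in each pipeline pulls out a column (ColReader-style);
# the rest operate on that value.
x_tfms = [lambda r: r["text"], str.split]   # extract text, then "tokenise"
y_tfms = [lambda r: r["label"]]             # extract label

dsets = [tuple(apply_pipeline(r, tfms) for tfms in (x_tfms, y_tfms))
         for r in rows]
print(dsets[0])  # (['great', 'movie'], 'positive')
```

If this model is right, the question becomes: what is the correct first transform to put in each pipeline so fastai's built-in transforms receive a column value rather than a row?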