Lesson 4 In-Class Discussion

I’m trying to create a language model with a custom dataset.

I’ve loaded a .csv dataset in and saved it to a .txt file after performing operations on it as a pandas data frame.

Now LanguageModelData.from_text_files gives this error:
'ascii' codec can't decode byte 0xcc in position 6836: ordinal not in range(128)

The .txt file's encoding shows as UTF-8 according to Sublime Text.

Also, I'm saving the dataset to a single concatenated .txt file rather than a number of them, since I'm reading from a CSV. Will this work, or do I have to do something differently?
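For context, the save step is roughly this (a simplified sketch; the file and column names are placeholders, not my actual ones):

import pandas as pd

# Read the CSV, do the pandas operations, then dump one text column to a single
# concatenated .txt file -- written explicitly as UTF-8.
df = pd.read_csv('data.csv')                    # placeholder filename
text = '\n'.join(df['text'].astype(str))        # 'text' is a hypothetical column name
with open('all.txt', 'w', encoding='utf-8') as f:
    f.write(text)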

Please help!

Regards,
Sanyam Bhutani.

Can you post a screenshot of the line that gives that error? And can you put your sample data somewhere for us to take a look and test?

It would be great if you could upload your notebook and sample data to a gist via gist.github.com so that we can replicate the issue and fix it.

You can find other threads on this in the forums. It’s most likely because your environment’s locale isn’t set up properly.

I believe some people using the Amazon AMI had that problem.

I didn’t have the issue so I’m sorry I can’t share the right solution, but I know it’s been discussed.
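A quick way to check whether the locale is the culprit is to look at Python's own encoding defaults (a generic sketch, not fastai-specific); on a misconfigured locale the preferred encoding falls back to ASCII, which would explain the 'ascii' codec error:

import locale
import sys

# On a healthy setup both of these report UTF-8; on a broken locale the second
# one typically reports 'ANSI_X3.4-1968', i.e. plain ASCII.
print(sys.getdefaultencoding())
print(locale.getpreferredencoding())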

@ramesh

Link to the Gist

Screenshot:

I tried to search for similar posts on the forum; I’ll search again.

I’ve tried updating the conda env and running git pull,
both after activating the fastai env and navigating to the fastai directory.

See this comment and the three after it, as well as the links mentioned.

edit: @guthl may know a solution

I tried setting the locale and rebooting. Still no luck.

This is what the locale command returns:

locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I think I’ll have to wait for @guthl 's tutorial :slight_smile:

Can you also upload the notebook where you see the error? You only have the pre-processing notebook in the gist, but that’s not where the error occurs.

Apologies! :sweat_smile:
I just uploaded the other file as well.

Clarifications


In this screenshot for lesson 4, Layer (1) takes 1024 activations and halves them to 512. Is this a MaxPool being applied as the last “hidden layer” of layer 1?

Is Layer 0 (BatchNorm1d) the layer that takes the pretrained model’s final output (a vector) and outputs a layer of 1024 activations?

When applying Dropout with a rate of 0.5, are we essentially halving the activations (therefore halving the layer, similar to MaxPool), or are we ignoring half of the activations (keeping the layer at its original size), or neither?

There’s no MaxPool here. Layer 1 simply takes 1024 input nodes (features) and produces 512 nodes as output. You typically wouldn’t use MaxPooling except on image features, where it’s OK to take the max of neighbouring pixels to compress the H x W dimensions.

Without knowing which notebook this is from, it’s hard to tell whether it’s using a pre-trained network’s output, but the BatchNorm layer doesn’t change the dimensions. It only re-centers the data to mean 0, although the scale and shift it applies are themselves learned through backprop.

It’s the latter. We just set the dropped activations to zero; the network architecture and the number of inputs/outputs do not change.
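As a rough PyTorch sketch of those three answers (not the exact fastai head; only the 1024 and 512 sizes are taken from the screenshot):

import torch
import torch.nn as nn

# BatchNorm re-centers/rescales but keeps all 1024 features; Dropout zeroes ~50%
# of the activations (rescaling the rest at train time) without changing the
# shape; the Linear layer is what maps 1024 inputs to 512 outputs -- no MaxPool.
head = nn.Sequential(
    nn.BatchNorm1d(1024),
    nn.Dropout(0.5),
    nn.Linear(1024, 512),
)

x = torch.randn(8, 1024)      # batch of 8, 1024 features each
print(head(x).shape)          # torch.Size([8, 512])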


What is the default activation function for the fully connected layers in the ColumnarModelData.from_data_frame model?

The model summary output indicates these are just linear layers. Is this correct?

@ramesh
For now, I found a workaround by manually converting the .txt file into an ASCII-encoded .txt file.
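Something along these lines (a rough sketch; the filenames are placeholders):

# Re-encode the UTF-8 text file as plain ASCII, dropping characters that can't
# be represented -- lossy, but a quick workaround.
with open('all.txt', encoding='utf-8') as f:
    text = f.read()
with open('all_ascii.txt', 'w', encoding='ascii', errors='ignore') as f:
    f.write(text)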

I wanted to check different configurations before submitting, but as we say in French: “better is the enemy of good” :slight_smile:

In my setting, this works:

export PYTHONIOENCODING=UTF-8
apt-get -qq update && apt-get -qqy install locales
sed -i -e 's/# ru_RU.UTF-8 UTF-8/ru_RU.UTF-8 UTF-8/' /etc/locale.gen && \
sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
locale-gen && \
update-locale LANG=ru_RU.UTF-8 && \
echo "LANGUAGE=ru_RU.UTF-8" >> /etc/default/locale && \
echo "LC_ALL=ru_RU.UTF-8" >> /etc/default/locale


lesson4 IMDB
Hello, in learner.fit
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')
I am getting the following error:
A Jupyter Widget
0%| | 0/4603 [00:00<?, ?it/s]

AttributeError Traceback (most recent call last)
in ()
----> 1 learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

~/workspace/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
190 self.sched = None
191 layer_opt = self.get_layer_opt(lrs, wds)
--> 192 self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
193
194 def lr_find(self, start_lr=1e-5, end_lr=10, wds=None):

~/workspace/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, metrics, callbacks, use_wd_sched, **kwargs)
137 n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
138 fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 139 metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
140
141 def get_layer_groups(self): return self.models.get_layer_groups()

~/workspace/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
82 for (*x,y) in t:
83 batch_num += 1
--> 84 for cb in callbacks: cb.on_batch_begin()
85 loss = stepper.step(V(x),V(y))
86 avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)

AttributeError: 'CosAnneal' object has no attribute 'on_batch_begin'

That’s odd. Can you git pull, restart jupyter, and try again?

Since lecture 4 I’ve struggled with the never-ending training of the IMDB notebook. Fortunately, I got some pre-trained weights (thanks to @Moody and @wgpubs), so I created this post to let fellow students explore the notebook, since many of us have skipped it because of the training time.

You can access the post at: Running IMDB notebook under 10 minutes


@Elfayoumi on_batch_begin() is part of the new code. You might have done a git pull while your notebook was still loaded in memory. The notebook executes fine at my end. As Jeremy mentioned, do a git pull and restart the notebook.

I’ve updated the IMDB file.

The updated file has information about object types, model structure, calculation/logic (for those less familiar with PyTorch and/or NumPy), the data split, etc.

@jeremy/all,

How is the data split into train, validation and test sets for IMDB?

torchtext by default gives 2 splits (of 25k items each) for the IMDB dataset. My assumption is that one is the train set and the other is the test set, is that right? If so, can I say the validation set is part of the train set itself? If so, what’s the ratio?

Test set - 25k used here?
[screenshot]

We aren’t explicitly specifying validation items anywhere.
[screenshot]


Looks like there is no validation in this split:
@yinterian - would you be able to share insight on why torchtext doesn’t include a validation split?
[screenshot]
Source - https://github.com/pytorch/text/blob/master/torchtext/datasets/imdb.py
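For reference, a hedged torchtext sketch of that split, and of how a validation set could be carved out of the train split yourself (the 90/10 ratio is just an example, not what the course uses):

from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

# IMDB.splits downloads the data and returns only train/test, 25k reviews each;
# there is no separate validation split in the dataset itself.
train, test = datasets.IMDB.splits(TEXT, LABEL)

# One option: carve a validation set out of the train split.
train, valid = train.split(split_ratio=0.9)
print(len(train), len(valid), len(test))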