I'm trying to re-create lesson 8 with a Kaggle dataset of natural-disaster tweets, using NLP.

My Dataframe

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |

I'm only interested in the "text" and "target" columns.

This is my DataBlock

dls_lm = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=15, is_lm=True), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('target'), splitter=RandomSplitter(0.1))

This is my dataloaders

dls = dls_lm.dataloaders(df2, bs=24)

This is my learner

learn = language_model_learner(
    dls, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]
)

learn.fine_tune(1, 2e-2)

| epoch | train_loss | valid_loss | accuracy | perplexity | time |
|---|---|---|---|---|---|
| 0 | 3.496560 | 3.430030 | 0.419661 | 30.877579 | 17:33 |
| 0 | 2.915300 | 3.042652 | 0.471342 | 20.960751 | 25:20 |
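(As a quick sanity check on the metric: fastai's `Perplexity` is just the exponential of the cross-entropy validation loss, assuming the natural log; the numbers in the table line up:)

```python
import math

# fastai's Perplexity metric is exp(cross-entropy loss), using the natural log.
# The validation losses from the table above reproduce the reported perplexities:
for valid_loss, reported in [(3.430030, 30.877579), (3.042652, 20.960751)]:
    print(f"exp({valid_loss}) = {math.exp(valid_loss):.6f}  (reported: {reported})")
```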

When I want to make a prediction:
TEXT = "I was afraid that"
N_WORDS = 20
N_SENTENCES = 4
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

I got this error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)
in ()
      2 N_WORDS = 20
      3 N_SENTENCES = 4
----> 4 preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)]

1 frames
/usr/local/lib/python3.7/dist-packages/fastai/text/learner.py in predict(self, text, n_words, no_unk, temperature, min_p, no_bar, decoder, only_last_word)
    157         self.model.reset()
    158         idxs = idxs_all = self.dls.test_dl([text]).items[0].to(self.dls.device)
--> 159         if no_unk: unk_idx = self.dls.vocab.index(UNK)
    160         for _ in (range(n_words) if no_bar else progress_bar(range(n_words), leave=False)):
    161             with self.no_bar(): preds,_ = self.get_preds(dl=[(idxs[None],)])

ValueError: 'xxunk' is not in list

When creating dataloaders for language modeling, you don't need to include the CategoryBlock or get_y in your DataBlock definition; something like this:

dblock_lm = DataBlock(blocks=[TextBlock.from_df('text', seq_len=15, is_lm=True)],
                      get_x=ColReader('text'), splitter=RandomSplitter(0.1))
dls = dblock_lm.dataloaders(...)

or, alternatively, you can use TextDataLoaders.from_df; see the 3rd example here: Text data | fastai

1 Like

Thank you, it makes total sense not to include the target and the get_y parameter.

How would you decide which numbers to use for seq_len and bs when creating a language model and then using it for classification?

These depend on the hardware you're using: you have to choose seq_len and bs so that the model fits into memory and training time is acceptable. Increasing seq_len increases both memory consumption and computation time, since it is also the bptt (see chapter 12 of the book for details). fastai's default for seq_len is 72, and I guess that's something that usually works well in practice, so you have to adjust your bs accordingly to fit into memory.
Higher values of seq_len might improve the quality of your LM (SHA-RNN was trained with bptt=1024). On the other hand, if your data naturally consists of short texts (like Twitter data), you can try decreasing seq_len for a speed-up, and arguably this shouldn't degrade your LM's performance: backpropagating gradients through unrelated tweets doesn't make much sense, and the model should learn to stop the information flow between them anyway.
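To make the trade-off concrete, here's a rough sketch (plain Python, not fastai's internals) of how bs and seq_len shape LM training: the corpus is concatenated, split into bs parallel streams, and each batch covers seq_len tokens per stream, so halving seq_len roughly doubles the number of (smaller) batches per epoch. The token count below is a made-up example:

```python
# Rough sketch of how bs and seq_len determine language-model batches:
# the corpus is concatenated, split into `bs` parallel streams, and each
# batch advances `seq_len` tokens per stream (one extra token is needed
# because targets are the inputs shifted by one).
def lm_batch_count(n_tokens, bs, seq_len):
    per_stream = n_tokens // bs          # tokens available in each stream
    return (per_stream - 1) // seq_len   # full batches per epoch

# e.g. a hypothetical corpus of ~220k tweet tokens:
print(lm_batch_count(220_000, bs=24, seq_len=72))  # fastai's default seq_len
print(lm_batch_count(220_000, bs=24, seq_len=36))  # shorter bptt -> more, cheaper batches
```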
So those are my thoughts on it. Maybe other folks can add to that? :wink:

1 Like

Thank you! I will check out chapter 12 for more info about this subject.