Part 2 Lesson 10 wiki

I understand why this is much better than word embeddings, but with this model is there a way to use the words (in the context of the corpus you fed the model) the way you would with word embeddings?
For example: the king - man + woman = queen equation?
What exactly does deploying the transfer-learned, trained model look like? What can it do besides classification?
I know chatbots use LSTMs. Maybe someone can point me toward how this model would work in a chatbot? How does one extract meaning from new text run through the trained model? How would it work for translation? With a GAN? For search? For Q&A? This would help me understand what exactly is happening in the whole process. Please?

(If someone just wants to give an answer to one of these that’d be appreciated. I know no one person will answer all of this.)

It's a bit handwavy, but I have thought about this a bit (the second half, actually).
One thing the language model keeps, compared to word vectors, is quite a bit more context, since we are looking at LSTM states.
As such, I would expect the equivalent of king - man + woman = queen to be a relatively poor use of such a model.
For chatbots, QA, and machine translation, I think using the encoder (or, for the latter, encoder + decoder) will be beneficial, as the history is natural for those tasks.
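For reference, the arithmetic in the question lives in a plain embedding matrix. A minimal numpy sketch (made-up vocabulary, random vectors standing in for trained ones):

    import numpy as np

    # hypothetical tiny vocabulary and embedding matrix (one row per word)
    vocab = ['king', 'man', 'woman', 'queen']
    stoi = {w: i for i, w in enumerate(vocab)}
    emb = np.random.randn(len(vocab), 50)   # trained vectors would go here

    # king - man + woman, then find the nearest word by cosine similarity
    target = emb[stoi['king']] - emb[stoi['man']] + emb[stoi['woman']]
    sims = emb @ target / (np.linalg.norm(emb, axis=1) * np.linalg.norm(target))
    print(vocab[int(sims.argmax())])   # with real trained vectors this lands near 'queen'

With an LSTM language model there is no single static vector per word to do this arithmetic on, which is part of why the analogy trick suits word embeddings better.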

1 Like

Thanks for the answer. I'd like to make a Q&A "bot" with this, but I don't know which direction to go to learn how to build one.
Maybe I can just take a chatbot that uses word embeddings and modify it to use this lesson's model instead? I'm too new at this; sometimes the steps to progress are too high to climb.

Hey guys, I have written a blog post on generating your own music using RNNs. Hope you enjoy it.

1 Like

Oh, now I see lesson 11 is a translator.
Lesson 11 used to be about a CNN with pictures of fish at the beginning.

(I ripped the videos to my hard drive to play on my other devices and hadn’t seen the switch-a-roo)

The robots.txt file is a file that lives at the root (or in any sub-directory of the root) of a site. Its purpose is to tell web crawlers which parts of the connected site they should or should not index.

I don't think it's used much anymore, but it's still a holdover from an earlier era of the web.
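For example, a minimal robots.txt (the paths here are made up) looks like:

    User-agent: *
    Disallow: /private/
    Allow: /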

Hello everyone!

I have been reading the paper and investigating the ULMFiT model. Does anybody know what exactly the test set is? In the paper, some tables refer to test error and others to validation error. Are they the same?

As I understand it, in the IMDb model the validation set is the only test set used. Am I wrong, and is there another one? Thanks!

Going through sentiment analysis on the Twitter dataset, I found that it contains lots of URLs and text emoji. What is the best way to handle these, remove them or leave them in? I am also seeing lots of repeated exclamation marks like !!!; is there a good way to deal with those?
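One option before tokenizing (just a sketch; the regexes and the choice to collapse rather than delete the punctuation are my own assumptions) is to strip URLs and squash repeated punctuation:

    import re

    def clean_tweet(t):
        t = re.sub(r'https?://\S+', ' ', t)    # drop URLs
        t = re.sub(r'([!?.])\1+', r'\1', t)    # '!!!' -> '!'
        t = re.sub(r'\s+', ' ', t)             # squeeze whitespace
        return t.strip()

    print(clean_tweet('so good!!! http://t.co/abc :)'))   # 'so good! :)'

Whether to keep or drop emoji probably depends on whether they carry sentiment in your dataset.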

I'm getting an error when trying to tokenize the data. os.fork() is returning "OSError: [Errno 22] Invalid argument", which seems pretty bizarre. I'm running on Windows Subsystem for Linux. One solution would be for someone to send me their saved file of the tokens for the model (I would love you so, so much). Otherwise, any help is appreciated.


OSError                                   Traceback (most recent call last)
in <module>()
----> 1 tok_trn, trn_labels = get_all(df_trn, 1)
      2 tok_val, val_labels = get_all(df_val, 1)

in get_all(df, n_lbls)
      3     for i, r in enumerate(df):
      4         print(i)
----> 5         tok_, labels_ = get_texts(r, n_lbls)
      6         tok += tok_;
      7         labels += labels_

in get_texts(df, n_lbls)
      5     texts = texts.apply(fixup).values.astype(str)
      6
----> 7     tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
      8     return tok, list(labels)

~/fastai/courses/dl2/fastai/text.py in proc_all_mp(ss, lang, ncpus)
     99         ncpus = ncpus or num_cpus()//2
    100         with ProcessPoolExecutor(ncpus) as e:
--> 101             return sum(e.map(Tokenizer.proc_all, ss, [lang]*len(ss)), [])
    102
    103

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/process.py in map(self, fn, timeout, chunksize, *iterables)
    494         results = super().map(partial(_process_chunk, fn),
    495                               _get_chunks(*iterables, chunksize=chunksize),
--> 496                               timeout=timeout)
    497         return _chain_from_iterable_of_lists(results)
    498

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in map(self, fn, timeout, chunksize, *iterables)
    573             end_time = timeout + time.time()
    574
--> 575         fs = [self.submit(fn, *args) for args in zip(*iterables)]
    576
    577         # Yield must be hidden in closure so that the futures are submitted

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in <listcomp>(.0)
    573             end_time = timeout + time.time()
    574
--> 575         fs = [self.submit(fn, *args) for args in zip(*iterables)]
    576
    577         # Yield must be hidden in closure so that the futures are submitted

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/process.py in submit(self, fn, *args, **kwargs)
    464             self._result_queue.put(None)
    465
--> 466         self._start_queue_management_thread()
    467         return f
    468     submit.__doc__ = _base.Executor.submit.__doc__

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/process.py in _start_queue_management_thread(self)
    425         if self._queue_management_thread is None:
    426             # Start the processes so that their sentinels are known.
--> 427             self._adjust_process_count()
    428             self._queue_management_thread = threading.Thread(
    429                 target=_queue_management_worker,

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/process.py in _adjust_process_count(self)
    444                     args=(self._call_queue,
    445                           self._result_queue))
--> 446             p.start()
    447             self._processes[p.pid] = p
    448

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/process.py in start(self)
    103                'daemonic processes are not allowed to have children'
    104         _cleanup()
--> 105         self._popen = self._Popen(self)
    106         self._sentinel = self._popen.sentinel
    107         # Avoid a refcycle if the target function holds an indirect

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224
    225 class DefaultContext(BaseContext):

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/context.py in _Popen(process_obj)
    275     def _Popen(process_obj):
    276         from .popen_fork import Popen
--> 277         return Popen(process_obj)
    278
    279 class SpawnProcess(process.BaseProcess):

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/popen_fork.py in __init__(self, process_obj)
     17         util._flush_std_streams()
     18         self.returncode = None
---> 19         self._launch(process_obj)
     20
     21     def duplicate_for_child(self, fd):

~/anaconda3/envs/fastai/lib/python3.6/multiprocessing/popen_fork.py in _launch(self, process_obj)
     64         code = 1
     65         parent_r, child_w = os.pipe()
---> 66         self.pid = os.fork()
     67         if self.pid == 0:
     68             try:

OSError: [Errno 22] Invalid argument
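If multiprocessing is the problem on WSL, a possible workaround (a sketch, assuming the fastai v0.7 Tokenizer API that the traceback shows) is to tokenize in a single process instead of calling proc_all_mp:

    from fastai.text import Tokenizer

    # single-process variant of the tokenization step: avoids ProcessPoolExecutor/os.fork()
    # `texts` is the list of cleaned strings the notebook builds inside get_texts
    tok = Tokenizer.proc_all(texts, 'en')

It will be slower, but it sidesteps os.fork() entirely.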

1 Like

I am using a Paperspace Gradient notebook for the course and would appreciate some help downloading the IMDb data into the environment. It appears that the Gradient notebook is a somewhat closed environment which only contains certain fast.ai datasets (https://paperspace.zendesk.com/hc/en-us/articles/360003092514-Public-Datasets), but not IMDb!

Asked another way: can we use a Gradient notebook to run the Lesson 10 ULMFiT notebook?

Thanks
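If outbound downloads are allowed from the notebook, one option (a sketch; the Stanford URL below is the original IMDb source rather than a fast.ai mirror, and the target path is an assumption) is to fetch and unpack the archive directly from Python:

    import tarfile
    import urllib.request
    from pathlib import Path

    url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    dest = Path('data'); dest.mkdir(exist_ok=True)
    archive = dest/'aclImdb_v1.tar.gz'

    urllib.request.urlretrieve(url, archive)   # ~80 MB download
    with tarfile.open(archive) as tar:
        tar.extractall(dest)                   # creates data/aclImdb/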

Can someone help me with the error in this code? The notebook contents and the full traceback are below.

import matplotlib.pyplot as plt
#import nltk
import numpy as np
import pandas as pd
import seaborn as sns
#from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv('…/…/…/data/datasets/women_reviews.csv')
print(df.head())

#print df.shape

for column in ["Division Name", "Department Name", "Class Name", "Review Text"]:
    df = df[df[column].notnull()]
df.drop(df.columns[0], inplace=True, axis=1)

#print df.shape

df['Label'] = 0
df.loc[df.Rating >= 3, ['Label']] = 1

#print df.head()

cat_dtypes = ['Rating', 'Label']

increment = 0
f, axes = plt.subplots(1, len(cat_dtypes), figsize=(16, 6), sharex=False)

for i in range(len(cat_dtypes)):
    sns.countplot(x=cat_dtypes[increment], data=df, order=df[cat_dtypes[increment]].value_counts().index, ax=axes[i])
    axes[i].set_title('Frequency Distribution for\n{}'.format(cat_dtypes[increment]))
    axes[i].set_ylabel('Occurrence')
    axes[i].set_xlabel('{}'.format(cat_dtypes[increment]))
    increment += 1
axes[1].set_ylabel('')
#axes[2].set_ylabel('')
plt.savefig('freqdist-rating-recommended-label.png', format='png', dpi=300)
#plt.show()

'''huevar = 'Rating'
f, axes = plt.subplots(1, 2, figsize=(16, 7))
sns.countplot(x='Rating', hue='Recommended IND', data=df, ax=axes[0])
axes[0].set_title('Occurrence of {}\nby {}'.format(huevar, 'Recommended IND'))
axes[0].set_ylabel('Count')
percentstandardize_barplot(x='Rating', y='Percentage', hue='Recommended IND', data=df, ax=axes[1])
#axes[1].set_title('Percentage Normalized Occurrence of {}\nby {}'.format(huevar, 'Recommended IND'))
#axes[1].set_ylabel('% Percentage by Rating')
plt.savefig('rating-recommended.png', format='png', dpi=300)
plt.show()'''

pd.set_option('max_colwidth', 300)
#print df[["Title", "Review Text", "Rating", "Label"]].sample(10)

import os, sys
import re
import string
import pathlib
import random
from collections import Counter, OrderedDict
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
from tqdm import tqdm, tqdm_notebook, tnrange
tqdm.pandas(desc='Progress')
import torch.cuda
if torch.cuda.is_available():
    import torch.cuda as t
else:
    import torch as t

import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

import torchtext
from torchtext import data
from torchtext import vocab

from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

#device = torch.device("cuda:0")

datapath = pathlib.Path('./datasets')
print(datapath)

df = df.rename(columns={'Review Text': 'ReviewText'})

#print df.head()

df['ReviewText'] = df.ReviewText.progress_apply(lambda x: re.sub('\n', ' ', x))

#split datasets
def split_dataset(df, test_size=0.2):
    train, val = train_test_split(df, test_size=test_size, random_state=42)
    return train.reset_index(drop=True), val.reset_index(drop=True)

traindf, valdf = split_dataset(df, test_size=0.2)

#shape of traindf, valdf
'''print 'train-shape'
print traindf.shape

print traindf.Label.value_counts()
print ('val-shape')
print valdf.Label.value_counts()'''

#save csv files for training and validation
traindf.to_csv('traindf.csv', index=False)
valdf.to_csv('valdf.csv', index=False)

#preprocessing

#print traindf.head()

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

def tokenizer(s):
    return [w.text.lower() for w in nlp(tweet_clean(s))]

def tweet_clean(txt):
    txt = re.sub(r'[^A-Za-z0-9]+', ' ', txt)
    txt = re.sub(r'https?://\S+', ' ', txt)
    return txt.strip()

'''For text columns or fields, the parameters below are used.
'sequential=True'
Tells torchtext that the data is a sequence rather than discrete values.
'tokenize=tokenizer'
This attribute takes a function that will tokenize a given text. In this case the function will tokenize a single tweet. You can also pass the string 'spacy' in this attribute if spacy is installed.
'include_lengths=True'
Apart from the tokenized text we will also need the lengths of the tweets for the RNN.
'use_vocab=True'
Since this is used to process the text data, we need to create the vocabulary of unique words. This attribute tells torchtext to create the vocabulary.
'''

'''For label columns or fields, the parameters below are used.
'sequential=False'
Now we are defining the blueprint of label columns. Labels are not sequential data, they are discrete, so this attribute is False.

'use_vocab=False'
Since it is a binary classification problem and labels are already numericalized, we set this to False.
'pad_token=None'
'unk_token=None'
We don't need padding and out-of-vocabulary tokens for labels.'''

#define fields
txt_field = data.Field(sequential=True, tokenize=tokenizer, include_lengths=True, use_vocab=True, postprocessing=lambda x: float(x))
label_field = data.Field(sequential=False, use_vocab=False, pad_token=None, unk_token=None, postprocessing=data.Pipeline(lambda x: float(x)))

train_val_fields = [
    ('Clothing ID', None),
    ('Age', None),
    ('Title', None),
    ('ReviewText', txt_field),
    ('Rating', None),
    ('Recommended IND', None),
    ('Positive Feedback Count', None),
    ('Division Name', None),
    ('Department Name', None),
    ('Class Name', None),
    ('Label', label_field)]

'''path='./data'
Path where the csv or tsv files are stored
format='csv'
Format of the files that will be loaded and processed
train='traindf.csv'
Name of the train file. The final path will become ./data/traindf.csv
validation='valdf.csv'
Name of the validation file. The final path will become ./data/valdf.csv
fields=train_val_fields
Tells torchtext how the incoming data will be processed
skip_header=True
Skip the first line in the csv, if it contains a header'''

trainds, valds = data.TabularDataset.splits(path='', format='csv', train='traindf.csv', validation='valdf.csv', fields=train_val_fields, skip_header=True)

print(type(trainds))

print((len(trainds), len(valds)))
print(trainds.fields.items())

example = trainds[0]
print(type(example))
print(type(example.ReviewText))
print(type(example.Label))

#load pretrained word vectors
from torchtext import vocab
#vec = vocab.Vectors('glove.42B.300d.txt', '…/…/…/data/')
vec = vocab.GloVe(name='twitter.27B', dim=100)
print(vec)

txt_field.build_vocab(trainds, valds, max_size=100000, vectors=vec)

#build vocab for labels
#label_field.build_vocab(trainds)

print(txt_field.vocab.vectors.shape)

#print (txt_field.vocab.vectors[txt_field.vocab.stoi['awesome']])

#loading data in batches
#traindl, valdl = data.BucketIterator.splits(datasets=(trainds, valds), batch_sizes=(3, 3), sort_key=lambda x: len(x.ReviewText), device=None, sort_within_batch=True, repeat=False)

#print len(traindl), len(valdl)

#batch = next(iter(traindl))

#generate batch

'''class BatchGenerator:

    def __init__(self, dl, x_field, y_field):
        self.dl, self.x_field, self.y_field = dl, x_field, y_field

    def __len__(self):
        return len(self.dl)

    def __iter__(self):
        for batch in self.dl:
            X = getattr(batch, self.x_field)
            y = getattr(batch, self.y_field)
            yield (X, y)
'''

Unnamed: 0 Clothing ID Age Title
0 0 767 33 NaN
1 1 1080 34 NaN
2 2 1077 60 Some major design flaws
3 3 1049 50 My favorite buy!
4 4 847 47 Flattering shirt

                                                                                                                                                                                                                                                                                               Review Text  \

0 Absolutely wonderful - silky and sexy and comfortable
1 Love this dress! it’s sooo pretty. i happened to find it in a store, and i’m glad i did bc i never would have ordered it online bc it’s petite. i bought a petite and am 5’8". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly …
2 I had such high hopes for this dress and really wanted it to work for me. i initially ordered the petite small (my usual size) but i found this to be outrageously small. so small in fact that i could not zip it up! i reordered it in petite medium, which was just ok. overall, the top half was com…
3 I love, love, love this jumpsuit. it’s fun, flirty, and fabulous! every time i wear it, i get nothing but great compliments!
4 This shirt is very flattering to all due to the adjustable front tie. it is the perfect length to wear with leggings and it is sleeveless so it pairs well with any cardigan. love this shirt!!!

Rating Recommended IND Positive Feedback Count Division Name
0 4 1 0 Initmates
1 5 1 4 General
2 3 0 0 General
3 5 1 0 General Petite
4 5 1 6 General

Department Name Class Name
0 Intimate Intimates
1 Dresses Dresses
2 Dresses Dresses
3 Bottoms Pants
4 Tops Blouses
Progress: 100%|██████████| 22628/22628 [00:00<00:00, 336453.11it/s]
datasets

<class 'torchtext.data.dataset.TabularDataset'>
(18392, 4608)
dict_items([('Clothing ID', None), ('Age', None), ('Title', None), ('ReviewText', <torchtext.data.field.Field object at 0x7fe1931da390>), ('Rating', None), ('Recommended IND', None), ('Positive Feedback Count', None), ('Division Name', None), ('Department Name', None), ('Class Name', None), ('Label', <torchtext.data.field.Field object at 0x7fe15199b160>)])
<class 'torchtext.data.example.Example'>
<class 'list'>
<class 'str'>
<torchtext.vocab.GloVe object at 0x7fe10c781630>

AttributeError                            Traceback (most recent call last)
in <module>()
    228 print(vec)
    229
--> 230 txt_field.build_vocab(trainds, valds, max_size=100000, vectors=vec)
    231
    232 #build vocab for labels

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torchtext/data/field.py in build_vocab(self, *args, **kwargs)
    247                 sources.append(arg)
    248         for data in sources:
--> 249             for x in data:
    250                 if not self.sequential:
    251                     x = [x]

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torchtext/data/dataset.py in __getattr__(self, attr)
    145         if attr in self.fields:
    146             for x in self.examples:
--> 147                 yield getattr(x, attr)
    148
    149     @classmethod

AttributeError: 'Example' object has no attribute 'ReviewText'

torch.cuda.is_available()
True
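If it helps narrow this down, a quick check (a debugging sketch; it assumes the trainds/valds built above) is to see which attributes torchtext actually attached to the examples:

    # which fields did torchtext attach to the examples?
    print(vars(trainds[0]).keys())
    print(vars(valds[0]).keys())

    # does every example (in both splits) carry the ReviewText attribute?
    print(all(hasattr(ex, 'ReviewText') for ex in trainds.examples))
    print(all(hasattr(ex, 'ReviewText') for ex in valds.examples))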

What does the predict method actually output in terms of the language model? Is that the correct way to get predictions from the model? What I did was load the model.


Then I called the predict method and applied a softmax to the predictions. Is this a correct approach to get predictions from the language model?
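Assuming the predictions come back as unnormalized scores over the vocabulary (that is an assumption; check what your version returns), a softmax over the last time step is a minimal way to turn them into next-word probabilities:

    import numpy as np

    def softmax(z):
        z = z - z.max()            # subtract the max for numerical stability
        e = np.exp(z)
        return e / e.sum()

    # hypothetical scores for each word in the vocab at the last position
    preds = np.array([2.0, 0.5, -1.0])
    probs = softmax(preds)
    print(probs, probs.sum())      # probabilities summing to 1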

Does anyone have a link to Yann LeCun's paper that Jeremy mentions in the lesson for setting a standard for NLP datasets?

Could someone please help me understand this bit of code:

def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = list(texts.apply(fixup).values)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

In particular, I’m trying to understand the concept of fields.

  1. On all calls of this function, as far as I can tell, n_lbls is always one. Consequently, the for loop for i in range(n_lbls+1, len(df.columns)): texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str) never gets executed, as both n_lbls+1 and len(df.columns) are 2. Am I understanding that correctly?
  2. Are there any examples where there would be multiple fields? Jeremy mentions that documents have structure such as title, abstract, etc. which would constitute different fields. But how are they detected in this piece of code? I don't see how there could be xfld <value> where value is not equal to 1 (which is set at the beginning of the stream). (See the sketch below.)
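For what it's worth, here is a minimal sketch of how a second text column would trigger the loop (the three-column DataFrame is made up; BOS and FLD are the notebook's markers):

    import pandas as pd

    BOS, FLD = 'xbos', 'xfld'   # same markers as the notebook
    n_lbls = 1

    # hypothetical layout: column 0 = label, column 1 = title, column 2 = body
    df = pd.DataFrame([[0, 'A title', 'The body text.']])

    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls + 1, len(df.columns)):
        texts += f' {FLD} {i - n_lbls} ' + df[i].astype(str)

    print(texts[0])
    # -> '\nxbos xfld 1 A title xfld 2 The body text.'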

Thanks.

When I run pretrain_lm.py:

dir_path data/en_data; cuda_id 0; cl 12; bs 64; backwards False; lr 0.001; sampled True; pretrain_id
Traceback (most recent call last):
  File "pretrain_lm.py", line 53, in <module>
    if __name__ == '__main__': fire.Fire(train_lm)
  File "/home/yhl/anaconda3/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/yhl/anaconda3/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/yhl/anaconda3/envs/fastai/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "pretrain_lm.py", line 42, in train_lm
    learner, crit = get_learner(drops, 15000, sampled, md, em_sz, nh, nl, opt_fn, tprs)
  File "/home/yhl/fastai/courses/dl2/imdb_scripts/sampled_sm.py", line 85, in get_learner
    m = to_gpu(get_language_model(md.n_tok, em_sz, nhid, nl, md.pad_idx, decode_train=False, dropouts=drops))
  File "/home/yhl/fastai/courses/dl2/imdb_scripts/sampled_sm.py", line 46, in get_language_model
    rnn_enc = RNN_Encoder(n_tok, em_sz, n_hid=nhid, n_layers=nlayers, pad_token=pad_token, dropouti=dropouts[0], wdrop=dropouts[2], dropoute=dropouts[3], dropouth=dropouts[4])
TypeError: __init__() got an unexpected keyword argument 'n_hid'

Please lend me a hand. Thank you!

I found this works on Google Colab to install spacy:
!pip install spacy && python -m spacy download en
Thanks to Emil for pointing this out.

I'm trying to extend this to predict a single element, but every time I try it I get an error message.

trn_lm looks like this:

trn_lm[1] looks like this:

preds_one = learn.predict_array(np.array(trn_lm[1]))

When I run predict_array(trn_lm[1]) I get this error:
ValueError: not enough values to unpack (expected 2, got 1)

This is for the IMDb example in Lesson 10.

I am running the notebook, but the fine-tuning process is extremely slow.
I am sure the GPU is visible to PyTorch, but I don't know how to force fastai.text to use the GPU.
I feel the learner only uses the CPU.
Does anyone have a similar issue?
Here is the performance: 0%| | 5/5029 [00:53<14:52:55, 10.66s/it, loss=5.54]

Hi,
has anyone managed to export a confusion matrix for the predictions by ULMFiT?
Would very much appreciate some help :wink:
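One way to get there (a sketch; the fastai v0.7 call in the comment is from memory, so treat it as an assumption) is to argmax the classifier's per-class outputs and hand them to sklearn:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # e.g. preds, y = learn.predict_with_targs()   # classifier outputs + true labels
    preds = np.array([[2.3, -1.1], [0.2, 0.9], [-0.5, 1.7]])   # dummy per-class scores
    y = np.array([0, 1, 0])                                    # dummy true labels

    cm = confusion_matrix(y, preds.argmax(axis=1))
    print(cm)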

@zzzz you can run nvidia-smi to check GPU usage
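From Python, a quick sanity check (a sketch; `learn` is whatever fastai v0.7 learner you built) is to confirm CUDA is on and that the model's weights actually live on the GPU:

    import torch

    print(torch.cuda.is_available())       # should be True
    print(torch.backends.cudnn.enabled)    # cuDNN matters a lot for LSTM speed

    # assuming `learn` is your fastai v0.7 learner:
    # print(next(learn.model.parameters()).is_cuda)   # True if the weights are on the GPU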