PyTorch: error when trying to set the max_size argument for a torchtext.vocab.Vocab object in Colab

ninja16180 · May 12, 2020, 12:12pm

Hi,

I am building a transformer model in PyTorch with a custom embedding layer made from two pre-trained embeddings, fastText and GloVe.
Both pre-trained embedding matrices have dimension 300:
crawl-300d-2M.vec
glove.6B.300d.txt

However, I want to limit the dimension of my custom embedding to 256.

I wrote this function to truncate each vector to 256 dimensions:

import numpy as np

def load_embedding(embedding_file):

    def get_coefs(word, *arr):
        return word, np.asarray(arr, dtype='float32')[:256]  # keeping the embedding size 256

    # lines of 100 characters or fewer (e.g. the fastText header line) are skipped
    with open(embedding_file, encoding="utf8", errors='ignore') as f:
        embeddings_index = dict(get_coefs(*o.rstrip().split(" ")) for o in f if len(o) > 100)

    return embeddings_index

glove_file = '/glove.6B.300d.txt'
fasttext_file = '/crawl-300d-2M.vec'

glove_embeddings_index = load_embedding(glove_file)

fasttext_embeddings_index = load_embedding(fasttext_file)

# custom embedding
# creating a placeholder embedding matrix first

all_embs = np.stack(list(fasttext_embeddings_index.values()))  # using the fastText embedding as base
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

nb_words = len(word_index)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

print(embedding_matrix.shape[1])  # output: 256

# creating the custom embedding: a dictionary mapping each word to its word vector

cust_embedding = {}

for word, indx in word_index.items():
    if indx < nb_words:
        # prefer fastText, fall back to GloVe, then to the random placeholder row
        embedding_vector = fasttext_embeddings_index.get(word)
        if embedding_vector is None:
            embedding_vector = glove_embeddings_index.get(word)
        if embedding_vector is None:
            embedding_vector = embedding_matrix[indx]

        cust_embedding[word] = embedding_vector

# saving the custom embedding to a .txt file, which is used later during preprocessing with torchtext

with open('/custom_embeddings.txt', 'w+') as f:
    for token, vector in cust_embedding.items():
        vector_str = ' '.join([str(v) for v in vector])
        f.write(f'{token} {vector_str}\n')
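
Since the mismatch reported further down is 300 versus 256, it may help to confirm how many values each line of the saved file actually holds. The following is only a minimal sketch, assuming the file format written above (a token followed by space-separated floats); the helper name `vector_dims` is hypothetical and not part of torchtext.

```python
from collections import Counter

def vector_dims(path):
    """Count how many lines of an embedding text file have each vector length."""
    dims = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            dims[len(parts) - 1] += 1  # the first field is the token itself
    return dims

# Usage (path as in the post): print(vector_dims('/custom_embeddings.txt'))
```

If this reports 300 rather than 256, the truncation never reached the file; if it reports 256, the 300 is coming from somewhere after the file is written.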

Next, I try to create a torchtext.vocab.Vectors object:

import torchtext.vocab as vocab

custom_embeddings = vocab.Vectors(name='/custom_embeddings.txt', max_size=256)

Using the max_size argument here throws an error:

TypeError: __init__() got an unexpected keyword argument 'max_size'

But I checked the torchtext.vocab documentation, where I can see that a max_size argument is present:
class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)
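
As far as that signature goes, `max_size` belongs to `Vocab` and caps the number of tokens kept, not the length of each vector. The toy function below is not torchtext code; `build_toy_vocab` is a hypothetical stand-in that only illustrates this reading of the parameter.

```python
from collections import Counter

def build_toy_vocab(counter, max_size=None, min_freq=1):
    """Keep the most frequent tokens, at most max_size of them."""
    kept = [(tok, n) for tok, n in counter.most_common() if n >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    return [tok for tok, _ in kept]

counts = Counter("the cat sat on the mat the cat".split())
vocab = build_toy_vocab(counts, max_size=2)
# Only the two most frequent tokens survive; the width of any vectors
# attached to those tokens is not affected by max_size.
print(vocab)
```

Under this reading, passing 256 as `max_size` would limit the vocabulary to 256 tokens rather than shrink the embedding dimension.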

I need the size of my custom embedding to be 256; otherwise I get a runtime error later, when training the model.

Code snippet that builds the vocabulary for the encoder (ENC) and decoder (DEC) inputs using the custom embedding:

ENC_TEXT.build_vocab(train_data, vectors = custom_embeddings)
DEC_TEXT.build_vocab(train_data, vectors = custom_embeddings)

model.embedding.weight.data.copy_(ENC_TEXT.vocab.vectors)

model.embedding.weight.data.copy_(DEC_TEXT.vocab.vectors)

Below are the parameter settings and the runtime error I receive if I do not change the custom embedding dimension to 256:

INPUT_DIM = len(ENC_TEXT.vocab)
OUTPUT_DIM = len(DEC_TEXT.vocab)
HIDDEN_DIM = 256  # size of each pretrained word vector, i.e. size[1] of the embedding matrix
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 10
DEC_HEADS = 10
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc = Encoder(INPUT_DIM,
              HIDDEN_DIM,
              ENC_LAYERS,
              ENC_HEADS,
              ENC_PF_DIM,
              ENC_DROPOUT,
              device)

dec = Decoder(OUTPUT_DIM,
              HIDDEN_DIM,
              DEC_LAYERS,
              DEC_HEADS,
              DEC_PF_DIM,
              DEC_DROPOUT,
              device)

RuntimeError                              Traceback (most recent call last)
in <module>()
     20               ENC_PF_DIM,
     21               ENC_DROPOUT,
---> 22               device)
     23
     24 dec = Decoder(OUTPUT_DIM,

in __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim, dropout, device, max_length)
     18
     19         # step added for custom embedding
---> 20         self.tok_embedding.weight.data.copy_(SRC.vocab.vectors)
     21
     22         self.pos_embedding = nn.Embedding(max_length, hid_dim)

RuntimeError: The size of tensor a (256) must match the size of tensor b (300) at non-singleton dimension 1

The main issue I am facing is this runtime error about a size mismatch between tensors.
Tracing its cause, I found that the size of the custom embedding ends up fixed at 300 (instead of 256).
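
One possible cause that may be worth ruling out: to my understanding, torchtext's `Vectors` saves a serialized `.pt` copy of the vectors the first time a file is loaded and reuses that copy on later loads, so a copy made from an earlier 300-dimensional version of the file could still be picked up. The sketch below removes such copies so they get rebuilt. It assumes the cache lives in a `.vector_cache` directory and that cached files are named after the text file plus a `.pt` suffix; both should be checked against the installed torchtext version, and `clear_cached_vectors` is a hypothetical helper.

```python
import glob
import os

def clear_cached_vectors(cache_dir, basename):
    """Delete cached '<basename>*.pt' files in cache_dir; return the removed paths."""
    pattern = os.path.join(glob.escape(cache_dir), glob.escape(basename) + "*.pt")
    removed = []
    for path in sorted(glob.glob(pattern)):
        os.remove(path)
        removed.append(path)
    return removed

# Usage (the directory name is an assumption):
# clear_cached_vectors('.vector_cache', 'custom_embeddings.txt')
```

After clearing, recreating the `Vectors` object should force the text file to be parsed again.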

I would really appreciate any help in resolving this issue.