Hi,
I am building a transformer model in PyTorch with a custom embedding layer built from two pre-trained embeddings: fastText and GloVe.
Both pre-trained embedding matrices have dimension 300:
crawl-300d-2M.vec
glove.6B.300d.txt
However, I want to limit the dimension of my custom embedding to 256.
I wrote this function to truncate each vector to 256 dimensions:
import numpy as np

def load_embedding(embedding_file):
    def get_coefs(word, *arr):
        # keep only the first 256 components of each vector
        return word, np.asarray(arr, dtype='float32')[:256]
    # lines of 100 characters or fewer (e.g. a short header line) are skipped
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(embedding_file, encoding="utf8", errors='ignore') if len(o) > 100)
    return embeddings_index
glove_file = '/glove.6B.300d.txt'
fasttext_file = '/crawl-300d-2M.vec'
glove_embeddings_index = load_embedding(glove_file)
fasttext_embeddings_index = load_embedding(fasttext_file)
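To make the truncation step concrete, here is a minimal sketch on a tiny in-memory sample rather than the real GloVe or fastText files. The helper name `parse_vector_line`, its `dim` parameter, and the sample lines are assumptions made for illustration, and it uses plain Python lists instead of NumPy so it runs without extra dependencies.

```python
def parse_vector_line(line, dim):
    """Split one 'word v1 v2 ...' line and keep only the first `dim` values."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values[:dim]]


# Hypothetical sample: two words with 5-dimensional vectors.
sample_lines = [
    "hello 0.1 0.2 0.3 0.4 0.5\n",
    "world 1.0 2.0 3.0 4.0 5.0\n",
]

# Truncate every vector to its first 3 components.
index = dict(parse_vector_line(line, dim=3) for line in sample_lines)

for word, vector in index.items():
    print(word, len(vector), vector)
```

Slicing this way simply drops the trailing components of each vector; it does not compress or project the original 300 dimensions.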
Custom embedding
Creating a placeholder embedding matrix first:
all_embs = np.stack(list(fasttext_embeddings_index.values()))  # using fastText embedding as base
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]
nb_words = len(word_index)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
print(embedding_matrix.shape[1])  # output: 256
Creating the custom embedding, which is a dictionary mapping each word to its word vector:
cust_embedding = {}
for word, indx in word_index.items():
    if indx < nb_words:
        embedding_vector = fasttext_embeddings_index.get(word)
        if embedding_vector is None:
            embedding_vector = glove_embeddings_index.get(word)
        if embedding_vector is None:
            embedding_vector = embedding_matrix[indx]
        cust_embedding[word] = embedding_vector
Saving the custom embedding to a .txt file, which is used later during preprocessing with torchtext:
with open('/custom_embeddings.txt', 'w+') as f:
    for token, vector in cust_embedding.items():
        vector_str = ' '.join([str(v) for v in vector])
        f.write(f'{token} {vector_str}\n')
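Since the runtime error further down reports a width of 300, one thing worth ruling out at this point is that the saved file has more than 256 values per line. A minimal sketch of such a check follows; the function name `count_vector_dims` and the temporary sample file are assumptions for illustration, and the same function could be pointed at the real custom_embeddings.txt.

```python
import os
import tempfile
from collections import Counter


def count_vector_dims(path):
    """Return a Counter mapping vector length -> number of lines with that length."""
    dims = Counter()
    with open(path, encoding="utf8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) > 1:  # skip empty or token-only lines
                dims[len(parts) - 1] += 1  # the first field is the token
    return dims


# Hypothetical stand-in for the saved embedding file: 4-dimensional vectors.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_embeddings.txt")
    with open(sample_path, "w", encoding="utf8") as f:
        f.write("hello 0.1 0.2 0.3 0.4\n")
        f.write("world 1.0 2.0 3.0 4.0\n")
    dims = count_vector_dims(sample_path)

print(dims)  # Counter({4: 2})
```

If every line of the real file reports 256, the file itself is not where the 300 comes from.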
Next, I am trying to create a torchtext.vocab.Vectors object:
import torchtext.vocab as vocab
custom_embeddings = vocab.Vectors(name='/custom_embeddings.txt', max_size=256)
Here, passing the max_size argument throws an error:
TypeError: __init__() got an unexpected keyword argument 'max_size'
But I checked the torchtext.vocab.Vectors documentation, where I could see that the max_size argument is present:
class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=['<pad>'], vectors=None, unk_init=None, vectors_cache=None, specials_first=True)
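One way to double-check which class actually accepts a given keyword is to inspect the constructor signature at runtime instead of relying on the documentation page. The sketch below uses two stand-in classes so that it runs without torchtext: `VocabStandIn` is an abbreviated copy of the signature quoted above, `VectorsStandIn` is purely hypothetical, and `accepts_keyword` is an illustrative helper name.

```python
import inspect


class VocabStandIn:
    """Abbreviated stand-in for the quoted Vocab signature (not the real class)."""

    def __init__(self, counter, max_size=None, min_freq=1, vectors=None):
        pass


class VectorsStandIn:
    """Hypothetical stand-in for a vectors loader without a max_size parameter."""

    def __init__(self, name, cache=None):
        pass


def accepts_keyword(cls, keyword):
    """Return True if cls.__init__ declares a parameter called `keyword`."""
    return keyword in inspect.signature(cls.__init__).parameters


print(accepts_keyword(VocabStandIn, "max_size"))    # True
print(accepts_keyword(VectorsStandIn, "max_size"))  # False
```

With torchtext installed, the same helper can be called on `vocab.Vectors` and `vocab.Vocab` to see what the installed version really accepts.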
I need to set the size of my custom embedding to 256, or else I get a runtime error later when training my model.
Here is the code that builds the vocabulary for the encoder (ENC) and decoder (DEC) inputs using the custom embedding:
ENC_TEXT.build_vocab(train_data, vectors = custom_embeddings)
DEC_TEXT.build_vocab(train_data, vectors = custom_embeddings)
model.embedding.weight.data.copy_(ENC_TEXT.vocab.vectors)
model.embedding.weight.data.copy_(DEC_TEXT.vocab.vectors)
Below are the parameter settings and the runtime error I get if I do not change the custom embedding dimension to 256:
INPUT_DIM = len(ENC_TEXT.vocab)
OUTPUT_DIM = len(DEC_TEXT.vocab)
HIDDEN_DIM = 256 # size of each pretrained word vector in the embedding matrix ie size[1] of the embedding matrix
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 10
DEC_HEADS = 10
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1
enc = Encoder(INPUT_DIM,
              HIDDEN_DIM,
              ENC_LAYERS,
              ENC_HEADS,
              ENC_PF_DIM,
              ENC_DROPOUT,
              device)
dec = Decoder(OUTPUT_DIM,
              HIDDEN_DIM,
              DEC_LAYERS,
              DEC_HEADS,
              DEC_PF_DIM,
              DEC_DROPOUT,
              device)
RuntimeError                              Traceback (most recent call last)
in ()
     20               ENC_PF_DIM,
     21               ENC_DROPOUT,
---> 22               device)
     23
     24 dec = Decoder(OUTPUT_DIM,
in __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim, dropout, device, max_length)
     18
     19         # step added for custom embedding
---> 20         self.tok_embedding.weight.data.copy_(SRC.vocab.vectors)
     21
     22         self.pos_embedding = nn.Embedding(max_length, hid_dim)
RuntimeError: The size of tensor a (256) must match the size of tensor b (300) at non-singleton dimension 1
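The message says that the destination of `copy_` is 256 wide while the source vectors are 300 wide. A minimal, library-free sketch of a pre-flight check that would surface this before the copy is shown below. The function name `check_copy_shapes` and the example shapes are assumptions; in PyTorch the two shapes would come from `self.tok_embedding.weight.shape` and `SRC.vocab.vectors.shape`. The check uses strict equality, which is what an embedding copy needs, even though `copy_` itself also accepts broadcastable shapes.

```python
def check_copy_shapes(weight_shape, vectors_shape):
    """Raise a descriptive error if pretrained vectors do not fit the layer."""
    if tuple(weight_shape) != tuple(vectors_shape):
        raise ValueError(
            f"embedding weight has shape {tuple(weight_shape)} "
            f"but pretrained vectors have shape {tuple(vectors_shape)}"
        )


# Hypothetical shapes: 1000-word vocabulary, hid_dim of 256, 300-wide vectors.
try:
    check_copy_shapes((1000, 256), (1000, 300))
except ValueError as err:
    message = str(err)
    print(message)
```

Printing both shapes right before the copy makes it clear whether the mismatch comes from the layer definition or from the loaded vectors.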
The main issue I am facing is this runtime error caused by a size mismatch between tensors.
I am looking for the reason why the size of the custom embedding ends up fixed at 300 (instead of 256).
I would really appreciate it if you could help resolve the issue.