TensorBoard Projector for Word Embeddings

neuradai · October 2, 2020, 3:07am

I’m trying to use the TensorBoardProjectorCallback to visualize word embeddings for NLP and I’m struggling to figure out what to provide as the layer input to the callback. I’m using the text_classifier_learner() with the AWD_LSTM architecture shown below.

>>> learn_cls.model
SequentialRNN(
  (0): SentenceEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(1192, 400, padding_idx=1)
      (encoder_dp): EmbeddingDropout(
        (emb): Embedding(1192, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): WeightDropout(
          (module): LSTM(400, 1152, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(1152, 1152, batch_first=True)
        )
        (2): WeightDropout(
          (module): LSTM(1152, 400, batch_first=True)
        )
      )
      (input_dp): RNNDropout()
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): RNNDropout()
        (2): RNNDropout()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): LinBnDrop(
        (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.2, inplace=False)
        (2): Linear(in_features=1200, out_features=50, bias=False)
        (3): ReLU(inplace=True)
      )
      (1): LinBnDrop(
        (0): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.1, inplace=False)
        (2): Linear(in_features=50, out_features=2, bias=False)
      )
    )
  )
)

Does anyone know how to call the appropriate layer for hook_output() - which is called by the _setup_projector() method in the TensorBoardBaseCallback parent class - to get the output of the Embedding layer?

None of the following seem to work:

learn_cls.model[0]
learn_cls.model[0].module
learn_cls.model[0].module.encoder

florianl · October 2, 2020, 6:14am

Hi Walter,

I did an image similarity project recently and added projector support to the callback. Unfortunately, as of now it only works for images. With the following code you can still visualize the word embeddings in the TensorBoard projector, though it needs improvement (images and metadata).

from fastai.text.all import *
from torch.utils.tensorboard import SummaryWriter

from torch.nn import functional as F

# found this in the fastai forum
def get_normalized_embeddings():
  return F.normalize(learn.model[0].encoder.weight)

def get_embeddings():
  # raw (un-normalized) embedding weights
  return learn.model[0].encoder.weight

def most_similar(token, embs):
  idx = learn.dls.o2i[token]
  sims = (embs[idx] @ embs.t()).cpu().detach().numpy()

  print(f'Similar to: {token}')
  for sim_idx in np.argsort(sims)[::-1][1:11]:
    print(f'{learn.dls.vocab[sim_idx]:<30}{sims[sim_idx]:.02f}')


path = untar_data(URLs.IMDB)

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()

e = get_normalized_embeddings()

writer = SummaryWriter()

imgs = torch.rand(len(e),3,16,16)

vocab = learn.dls.vocab

for i,v in enumerate(vocab):
    vocab[i] = f'{v}_'

writer.add_embedding(e, metadata=vocab,label_img=imgs)

I’ll try to add support for AWD_LSTM to the callback too :).
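As a side note on why most_similar() works on the normalized embeddings: once every row has unit length, the dot product of two rows is their cosine similarity. Here is a minimal dependency-free sketch of the same lookup on toy data. The names normalize, toy_vocab and toy_embs are made up for illustration and are not part of fastai.

```python
import math

def normalize(vec):
    # Scale a vector to unit length; an all-zero vector is returned unchanged
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def most_similar(token, vocab, embs, k=3):
    # For unit-length rows the dot product equals the cosine similarity
    idx = vocab.index(token)
    query = embs[idx]
    sims = [sum(a * b for a, b in zip(query, row)) for row in embs]
    ranked = sorted(range(len(vocab)), key=lambda i: sims[i], reverse=True)
    # Skip the query token itself, then keep the top k
    return [(vocab[i], sims[i]) for i in ranked if i != idx][:k]

# Toy 2-d "embeddings", invented for this example
toy_vocab = ["good", "great", "bad", "awful"]
toy_embs = [normalize(v) for v in [[1.0, 0.1], [0.9, 0.2], [-1.0, 0.1], [-0.9, -0.1]]]
neighbours = most_similar("good", toy_vocab, toy_embs, k=2)
for word, sim in neighbours:
    print(f"{word:<10}{sim:.2f}")
```

With real embeddings you would keep the tensor version from the post; this only shows the arithmetic.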

neuradai · October 4, 2020, 4:32am

Thanks for the tips, @florianl. Your suggestions got me started on the right track.

Here’s the code for the streamlined solution I was able to hack together. It relies a little more on TensorFlow than on the PyTorch TensorBoard integration due to some incompatibilities with the latest version of TensorFlow.

Note: The following code was successful in Google Colab. The solution may differ for locally hosted runtime environments.

%load_ext tensorboard
import os
import tensorflow as tf
from fastai.basics import *
from fastai.text.all import *
from tensorboard.plugins import projector

log_dir = '/content/runs'

os.makedirs(log_dir, exist_ok=True)

learn_cls = load_learner('full-text') # trained & exported text_classifier_learner() with AWD_LSTM arch

e = F.normalize(learn_cls.model[0].module.encoder.weight)
# .cpu() is a no-op on CPU and avoids an error if the model sits on the GPU
e = tf.Variable(e.detach().cpu().numpy())
vocab = learn_cls.dls.vocab[0]

with open(os.path.join(log_dir, 'metadata.tsv'), "w") as f:
    for i, word in enumerate(vocab):
        f.write(f'{word}\n')

checkpoint = tf.train.Checkpoint(embedding=e)
checkpoint.save(os.path.join(log_dir,'embedding.ckpt'))

config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"

projector.visualize_embeddings(log_dir, config)

%tensorboard --logdir {log_dir}
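One thing to watch with the metadata file: the projector matches line i of metadata.tsv to row i of the embedding, and a single-column metadata file has no header row. A token that itself contains a tab or newline would shift every label after it. Below is a rough, standard-library-only sketch that escapes such tokens and also writes the vectors as TSV. The helper names clean_label and write_projector_tsv and the file name vectors.tsv are my own choices, not part of TensorBoard or fastai.

```python
import os
import tempfile

def clean_label(token):
    # A raw tab or newline inside a token would misalign all later rows
    return token.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r")

def write_projector_tsv(out_dir, vectors, labels):
    # One label per vector, in the same order
    if len(vectors) != len(labels):
        raise ValueError("need exactly one label per vector")
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8", newline="") as f:
        for row in vectors:
            f.write("\t".join(f"{x:.6f}" for x in row) + "\n")
    with open(meta_path, "w", encoding="utf-8", newline="") as f:
        for label in labels:
            f.write(clean_label(label) + "\n")
    return vec_path, meta_path

# Toy data; the second "token" is a newline character on purpose
demo_dir = tempfile.mkdtemp()
vec_path, meta_path = write_projector_tsv(
    demo_dir, [[0.1, 0.2], [0.3, 0.4]], ["xxbos", "\n"])
```

In the code above, the same escaping could be applied to each word before it is written in the metadata loop.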

florianl · October 5, 2020, 10:19pm

Hi Walter,

Did you have trouble with PyTorch's TensorBoard integration? I didn’t have any trouble with the PyTorch implementation.

I am using this code for word embeddings now.

import torch
from fastai.text.all import *
from torch.utils.tensorboard import SummaryWriter

def projector_word_embeddings(learn=None, layer=None, vocab=None, limit=-1, start=0, log_dir=None):
    "Extracts and exports word embeddings from language model embedding layers"
    if layer is None:
        # LMLearner subclasses TextLearner, so it has to be checked first
        if   isinstance(learn, LMLearner):   layer = learn.model[0].encoder
        elif isinstance(learn, TextLearner): layer = learn.model[0].module.encoder
    emb = layer.weight
    img = torch.full((len(emb),3,8,8), 0.7)
    if vocab is None:
        vocab = learn.dls.vocab
        # classifier DataLoaders hold [tokens, classes]; take the token list
        if not isinstance(vocab[0], str): vocab = vocab[0]
    vocab = [f'{x}_' for x in vocab]
    writer = SummaryWriter(log_dir=log_dir)
    # None (not -1) as the end index keeps the last row when no limit is set
    end = start + limit if limit >= 0 else None
    writer.add_embedding(emb[start:end], metadata=vocab[start:end], label_img=img[start:end])
    writer.close()

Example:

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
projector_word_embeddings(learn=learn, limit=2000, start=2000)
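A note on the start and limit arguments: they select a window of rows, and the end index is worth double-checking. In Python, seq[start:-1] leaves out the last element, while seq[start:None] runs to the end. A tiny sketch of the windowing on a plain list follows; the helper name window is hypothetical and only mirrors the slicing idea.

```python
def window(seq, start=0, limit=-1):
    # A negative limit means "no limit": slice through to the end
    end = start + limit if limit >= 0 else None
    return seq[start:end]

tokens = ["a", "b", "c", "d", "e"]
everything = window(tokens)                # all five tokens
middle = window(tokens, start=1, limit=2)  # two tokens starting at index 1
dropped_last = tokens[0:-1]                # an end index of -1 loses "e"
```

The same slice has to be applied to the embeddings, the metadata and the label images so that the three stay aligned.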

For non-fastai models (e.g. Hugging Face):

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# get the word embedding layer
layer = model.embeddings.word_embeddings

# get and sort vocab
vocab_dict = tokenizer.get_vocab()
vocab = [k for k, v in sorted(vocab_dict.items(), key=lambda x: x[1])]

# write the embeddings for tb projector
projector_word_embeddings(layer=layer, vocab=vocab, limit=2000, start=2000)
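The reason for sorting the vocab by id: row i of the embedding matrix belongs to the token with id i, so the metadata list has to be in id order. Here is a small sketch with a toy token-to-id dict, which also checks that the ids are contiguous from 0. The helper name vocab_in_row_order is made up, and whether a given tokenizer's vocab is contiguous is something to verify for your model rather than assume.

```python
def vocab_in_row_order(token_to_id):
    # Sort tokens by id so that position i in the list labels embedding row i
    ordered = sorted(token_to_id.items(), key=lambda item: item[1])
    ids = [i for _, i in ordered]
    if ids != list(range(len(ids))):
        raise ValueError("token ids are not contiguous from 0; labels would misalign")
    return [tok for tok, _ in ordered]

# Toy mapping in arbitrary order, invented for this example
toy = {"[PAD]": 0, "world": 3, "hello": 2, "[UNK]": 1}
labels = vocab_in_row_order(toy)
```

It is also worth comparing len(labels) with the number of rows in layer.weight before writing anything.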