AllenNLP TokenEmbedder implementation of ULMFiT

Hi there!

I know there are plans to implement sequence labeling with ULMFiT, but since it’s still pending work (I’m aware of @sebastianruder’s branch for this), I was thinking it might be worth implementing an AllenNLP TokenEmbedder based on ULMFiT (similar to the one they have for the OpenAI Transformer) and using their seq2seq components to train sequence labeling models, such as NER and POS.

Has anyone thought about doing this? Let me know in case you’d like to collaborate with me.

Hey Pedro,

I think that’s a great idea! I don’t have the bandwidth to do this at the moment (I’m currently writing my thesis and have a couple of other projects to deal with), so I’d love it if someone took this on. I’d be happy to review the code or help out if possible.

Great, thanks Sebastian! I’ll let you know when I have any progress. Your review would be most welcome!

In case anyone is interested in collaborating on this, feel free to reach me.

Hi @pvcastro,

I’d be interested in collaborating!

Any progress on that, @pvcastro? I would also be interested in collaborating.

Not yet, Anna! I was focusing on training the ELMo embeddings, but it’s taking some time. When it’s done I’ll move on to this. It would be great if you could join in :slight_smile:

Have you taken a look at the OpenAI Transformer embedder in the allennlp repo?

Not yet. I’m also focused on implementing ELMo at the moment :slight_smile: Ideally, I’d like to run several experiments comparing ULMFiT, ELMo, the OpenAI Transformer, and BERT applied to my case of multilabel classification. I think AllenNLP would be a good choice for this due to its modularity. Let’s stay in contact then! I will keep you posted @pvcastro

OK great! AFAIK the OpenAI Transformer isn’t available for doing the pre-training yourself, is it? I think they shared only the pretrained model for English.

They pretrained a language model that can be used to initialize the parameters of a Transformer network, and this model is available for download. So it’s a similar approach to ULMFiT, but trained with multi-headed self-attention instead of a recurrent network, and pretrained on roughly 7,000 books from various genres, in contrast to ULMFiT’s WikiText-103, which comes from Wikipedia. I think it’s worth trying both and comparing the results.

Hi @sebastianruder! I have a first version ready:

Can you please take a look and let me know what you think?

I’m running a NER training on CoNLL 2003 to evaluate the wt103 pre-trained model on the NER task, without fine-tuning on the training corpus.

@jeremy, @piotr.czapla, anyone? Can anyone give me a hint as to who could review this code?

Thanks!

Hi @pvcastro, I thought we wanted to do this the other way around and get NER implemented in fastai. I’m not sure I’m a good person to review contributions to allennlp; I don’t know their coding standards.
But I’m interested in the results. Have you run the NER task using ULMFiT, and can you share the results?

Hi @piotr.czapla, I thought it would be easier for me to evaluate the ULMFiT embeddings by creating an embedder in allennlp and plugging the ULMFiT model into it. They are currently developing embedders for other language models as well, such as the OpenAI Transformer, Flair, and BERT. So I thought I’d give it a try for ULMFiT, but I must have done something wrong, because the results on CoNLL 2003 NER are worse than the baseline NER model (89.50% F1 for ULMFiT embeddings + baseline vs. 90.18% F1 for the baseline alone).

I’m asking for help reviewing the code not with regard to coding standards, but to try to figure out what I’m doing wrong that could be causing such poor results.

Here are the results for the baseline evaluation:

/home/pedro/anaconda3/envs/allennlp/bin/python /home/pedro/repositorios/allennlp/allennlp/run.py evaluate /media/discoD/models/elmo/ner_no_elmo/model.tar.gz /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
2018-12-26 07:47:52,387 - WARNING - allennlp.common.file_utils - Deprecated cache directory found (/home/pedro/.allennlp/datasets).  Please remove this directory from your system to free up space.
2018-12-26 07:47:53,709 - INFO - allennlp.models.archival - loading archive file /media/discoD/models/elmo/ner_no_elmo/model.tar.gz
2018-12-26 07:47:53,710 - INFO - allennlp.models.archival - extracting archive file /media/discoD/models/elmo/ner_no_elmo/model.tar.gz to temp dir /tmp/tmpyudippkx
2018-12-26 07:47:53,822 - INFO - allennlp.data.vocabulary - Loading token dictionary from /tmp/tmpyudippkx/vocabulary.
2018-12-26 07:47:53,840 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.model.Model'> from params {'calculate_span_f1': True, 'constrain_crf_decoding': True, 'dropout': 0.5, 'encoder': {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 178, 'num_layers': 2, 'type': 'lstm'}, 'include_start_end_transitions': False, 'label_encoding': 'BIOUL', 'text_field_embedder': {'token_embedders': {'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}}, 'type': 'crf_tagger'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,841 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.crf_tagger.CrfTagger'> from params {'calculate_span_f1': True, 'constrain_crf_decoding': True, 'dropout': 0.5, 'encoder': {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 178, 'num_layers': 2, 'type': 'lstm'}, 'include_start_end_transitions': False, 'label_encoding': 'BIOUL', 'text_field_embedder': {'token_embedders': {'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,841 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder'> from params {'token_embedders': {'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,841 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.token_embedder.TokenEmbedder'> from params {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,842 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2vec_encoders.seq2vec_encoder.Seq2VecEncoder'> from params {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'} and extras {}
2018-12-26 07:47:53,842 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2vec_encoders.cnn_encoder.CnnEncoder'> from params {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128} and extras {}
2018-12-26 07:47:53,843 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.token_embedder.TokenEmbedder'> from params {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,854 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder'> from params {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 178, 'num_layers': 2, 'type': 'lstm'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f9adac3c518>}
2018-12-26 07:47:53,879 - INFO - allennlp.common.checks - Pytorch version: 0.4.1
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'coding_scheme': 'BIOUL', 'tag_label': 'ner', 'token_indexers': {'token_characters': {'min_padding_length': 3, 'type': 'characters'}, 'tokens': {'lowercase_tokens': True, 'type': 'single_id'}}, 'type': 'conll2003'} and extras {}
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.conll2003.Conll2003DatasetReader'> from params {'coding_scheme': 'BIOUL', 'tag_label': 'ner', 'token_indexers': {'token_characters': {'min_padding_length': 3, 'type': 'characters'}, 'tokens': {'lowercase_tokens': True, 'type': 'single_id'}}} and extras {}
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'min_padding_length': 3, 'type': 'characters'} and extras {}
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer from params {'min_padding_length': 3} and extras {}
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'lowercase_tokens': True, 'type': 'single_id'} and extras {}
2018-12-26 07:47:53,880 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer from params {'lowercase_tokens': True} and extras {}
2018-12-26 07:47:53,881 - INFO - allennlp.commands.evaluate - Reading evaluation data from /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
0it [00:00, ?it/s]2018-12-26 07:47:53,881 - INFO - allennlp.data.dataset_readers.conll2003 - Reading instances from lines in file at: /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
3453it [00:00, 11729.94it/s]
2018-12-26 07:47:54,175 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.data_iterator.DataIterator'> from params {'batch_size': 64, 'type': 'basic'} and extras {}
2018-12-26 07:47:54,175 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.basic_iterator.BasicIterator'> from params {'batch_size': 64} and extras {}
2018-12-26 07:47:54,176 - INFO - allennlp.commands.evaluate - Iterating over dataset
accuracy: 0.98, accuracy3: 0.98, precision-overall: 0.90, recall-overall: 0.90, f1-measure-overall: 0.90, loss: 101.57 ||: 100%|██████████| 54/54 [00:08<00:00,  8.44it/s]
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - Finished evaluating.
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - Metrics:
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - accuracy: 0.9778184559061053
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - accuracy3: 0.9803165715516313
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - precision-overall: 0.8995948564382596
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - recall-overall: 0.90421388101983
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - f1-measure-overall: 0.9018984547460868
2018-12-26 07:48:03,003 - INFO - allennlp.commands.evaluate - loss: 113.81811673552902
2018-12-26 07:48:03,014 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmpyudippkx

Process finished with exit code 0

And here are the results for baseline + ULMFiT evaluation:

/home/pedro/anaconda3/envs/allennlp/bin/python /home/pedro/repositorios/allennlp/allennlp/run.py evaluate /media/discoD/models/elmo/ner_ulmfit/model.tar.gz /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
2018-12-26 07:48:46,164 - WARNING - allennlp.common.file_utils - Deprecated cache directory found (/home/pedro/.allennlp/datasets).  Please remove this directory from your system to free up space.
2018-12-26 07:48:47,035 - INFO - allennlp.models.archival - loading archive file /media/discoD/models/elmo/ner_ulmfit/model.tar.gz
2018-12-26 07:48:47,036 - INFO - allennlp.models.archival - extracting archive file /media/discoD/models/elmo/ner_ulmfit/model.tar.gz to temp dir /tmp/tmpg7gpm8_q
2018-12-26 07:48:50,973 - INFO - allennlp.data.vocabulary - Loading token dictionary from /tmp/tmpg7gpm8_q/vocabulary.
2018-12-26 07:48:50,991 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.model.Model'> from params {'dropout': 0.5, 'encoder': {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 578, 'num_layers': 2, 'type': 'lstm'}, 'include_start_end_transitions': False, 'label_encoding': 'BIOUL', 'regularizer': [['scalar_parameters', {'alpha': 0.1, 'type': 'l2'}]], 'text_field_embedder': {'allow_unmatched_keys': True, 'embedder_to_indexer_map': {'fastai_ulmfit': ['fastai_ulmfit'], 'token_characters': ['token_characters'], 'tokens': ['tokens']}, 'token_embedders': {'fastai_ulmfit': {'language_model': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462}, 'type': 'fastai_ulmfit_embedder'}, 'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}}, 'type': 'crf_tagger'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:50,992 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.models.crf_tagger.CrfTagger'> from params {'dropout': 0.5, 'encoder': {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 578, 'num_layers': 2, 'type': 'lstm'}, 'include_start_end_transitions': False, 'label_encoding': 'BIOUL', 'regularizer': [['scalar_parameters', {'alpha': 0.1, 'type': 'l2'}]], 'text_field_embedder': {'allow_unmatched_keys': True, 'embedder_to_indexer_map': {'fastai_ulmfit': ['fastai_ulmfit'], 'token_characters': ['token_characters'], 'tokens': ['tokens']}, 'token_embedders': {'fastai_ulmfit': {'language_model': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462}, 'type': 'fastai_ulmfit_embedder'}, 'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:50,992 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder'> from params {'allow_unmatched_keys': True, 'embedder_to_indexer_map': {'fastai_ulmfit': ['fastai_ulmfit'], 'token_characters': ['token_characters'], 'tokens': ['tokens']}, 'token_embedders': {'fastai_ulmfit': {'language_model': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462}, 'type': 'fastai_ulmfit_embedder'}, 'token_characters': {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'}, 'tokens': {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'}}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:50,992 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.token_embedder.TokenEmbedder'> from params {'language_model': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462}, 'type': 'fastai_ulmfit_embedder'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:50,992 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.fastai_ulmfit_embedder.FastaiUlmfitEmbedder'> from params {'language_model': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462}} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:50,992 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.fastai_ulmfit.FastaiLanguageModel'> from params {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'vocab_size': 238462} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:48:52,816 - INFO - allennlp.modules.fastai_ulmfit - loading weights from /media/discoD/models/fastai/wt103/wt103.tar.gz
2018-12-26 07:49:00,436 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.token_embedder.TokenEmbedder'> from params {'embedding': {'embedding_dim': 16}, 'encoder': {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'}, 'type': 'character_encoding'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:49:00,436 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2vec_encoders.seq2vec_encoder.Seq2VecEncoder'> from params {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128, 'type': 'cnn'} and extras {}
2018-12-26 07:49:00,436 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2vec_encoders.cnn_encoder.CnnEncoder'> from params {'conv_layer_activation': 'relu', 'embedding_dim': 16, 'ngram_filter_sizes': [3], 'num_filters': 128} and extras {}
2018-12-26 07:49:00,438 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.token_embedders.token_embedder.TokenEmbedder'> from params {'embedding_dim': 50, 'trainable': True, 'type': 'embedding'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:49:00,447 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder'> from params {'bidirectional': True, 'dropout': 0.5, 'hidden_size': 200, 'input_size': 578, 'num_layers': 2, 'type': 'lstm'} and extras {'vocab': <allennlp.data.vocabulary.Vocabulary object at 0x7f7853a20668>}
2018-12-26 07:49:01,606 - INFO - allennlp.common.checks - Pytorch version: 0.4.1
2018-12-26 07:49:01,606 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.dataset_reader.DatasetReader'> from params {'coding_scheme': 'BIOUL', 'tag_label': 'ner', 'token_indexers': {'fastai_ulmfit': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'type': 'fastai_ulmfit_indexer'}, 'token_characters': {'min_padding_length': 3, 'type': 'characters'}, 'tokens': {'lowercase_tokens': True, 'type': 'single_id'}}, 'type': 'conll2003'} and extras {}
2018-12-26 07:49:01,606 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.dataset_readers.conll2003.Conll2003DatasetReader'> from params {'coding_scheme': 'BIOUL', 'tag_label': 'ner', 'token_indexers': {'fastai_ulmfit': {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'type': 'fastai_ulmfit_indexer'}, 'token_characters': {'min_padding_length': 3, 'type': 'characters'}, 'tokens': {'lowercase_tokens': True, 'type': 'single_id'}}} and extras {}
2018-12-26 07:49:01,606 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz', 'type': 'fastai_ulmfit_indexer'} and extras {}
2018-12-26 07:49:01,606 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.fastai_ulmfit_indexer.FastaiUlmfitIndexer from params {'model_path': '/media/discoD/models/fastai/wt103/wt103.tar.gz'} and extras {}
2018-12-26 07:49:04,703 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'min_padding_length': 3, 'type': 'characters'} and extras {}
2018-12-26 07:49:04,704 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer from params {'min_padding_length': 3} and extras {}
2018-12-26 07:49:04,704 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.token_indexer.TokenIndexer from params {'lowercase_tokens': True, 'type': 'single_id'} and extras {}
2018-12-26 07:49:04,704 - INFO - allennlp.common.from_params - instantiating class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer from params {'lowercase_tokens': True} and extras {}
2018-12-26 07:49:04,704 - INFO - allennlp.commands.evaluate - Reading evaluation data from /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
0it [00:00, ?it/s]2018-12-26 07:49:04,747 - INFO - allennlp.data.dataset_readers.conll2003 - Reading instances from lines in file at: /home/pedro/repositorios/portuguese-tagger/dataset/eng.testb
3453it [00:00, 9581.97it/s]
2018-12-26 07:49:05,065 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.data_iterator.DataIterator'> from params {'batch_size': 64, 'type': 'basic'} and extras {}
2018-12-26 07:49:05,065 - INFO - allennlp.common.from_params - instantiating class <class 'allennlp.data.iterators.basic_iterator.BasicIterator'> from params {'batch_size': 64} and extras {}
2018-12-26 07:49:05,066 - INFO - allennlp.commands.evaluate - Iterating over dataset
accuracy: 0.98, accuracy3: 0.98, precision-overall: 0.89, recall-overall: 0.90, f1-measure-overall: 0.90, loss: 72.90 ||: 100%|██████████| 54/54 [01:02<00:00,  1.03it/s] 
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - Finished evaluating.
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - Metrics:
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - accuracy: 0.9760310110907721
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - accuracy3: 0.9788952298912458
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - precision-overall: 0.8865728678130971
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - recall-overall: 0.9036827195467422
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - f1-measure-overall: 0.8950460324418615
2018-12-26 07:50:07,331 - INFO - allennlp.commands.evaluate - loss: 108.48684803644817
2018-12-26 07:50:07,371 - INFO - allennlp.models.archival - removing temporary unarchived model dir at /tmp/tmpg7gpm8_q

Process finished with exit code 0

What I’m basically doing in the implementation of the embedder is:

  1. Load a pre-trained ULMFiT model (not fine-tuned on the CoNLL 2003 corpus, but I’d expect it to at least improve on the baseline anyway): the wt103.tar.gz model used in the imdb example;
  2. Before running the forward on the batch, I add the ULMFiT tags xbos xfld 1, as is done in the classification example. When indexing the tokens (converting words to ids), I didn’t run ULMFiT’s full pre-processing (the one that uses the Tokenizer), only the fixup from the imdb scripts plus lowercasing. I skipped the full pre-processing because it changes the tensor dimensions: it inserts additional tokens to flag things like tk_rep, t_up, etc.;
  3. The forward is run, and I take only the last layer (the outputs from RNN_Encoder) as the embedding features (the same as in the classifier example);
  4. After the forward, I remove the embeddings for the ULMFiT tags (xbos xfld 1) and return embeddings only for the provided tokens, keeping one embedding per token in the batch (a rough sketch of this forward pass follows below).
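In code, the forward pass is roughly the following. This is a minimal sketch of the idea, not the actual FastaiUlmfitEmbedder from my logs; the `encoder` and `tag_ids` arguments, the 400-dim output size, and the fastai 0.7 RNN_Encoder interface are my assumptions:

```python
import torch
from allennlp.modules.token_embedders import TokenEmbedder


class UlmfitEmbedderSketch(TokenEmbedder):
    """Sketch of wrapping a fastai RNN_Encoder as an AllenNLP TokenEmbedder.

    `encoder` is assumed to be a fastai 0.7 RNN_Encoder loaded from
    wt103.tar.gz; `tag_ids` are the wt103 vocabulary ids of the ULMFiT
    markers "xbos", "xfld", "1".
    """

    def __init__(self, encoder: torch.nn.Module, tag_ids: torch.Tensor) -> None:
        super().__init__()
        self._encoder = encoder
        self.register_buffer('_tag_ids', tag_ids)  # shape (3,)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch_size, num_tokens) of wt103 token ids.
        num_tags = self._tag_ids.size(0)
        tags = self._tag_ids.unsqueeze(0).expand(inputs.size(0), -1)
        inputs = torch.cat([tags, inputs], dim=1)

        # RNN_Encoder keeps hidden state between calls; reset it so each
        # batch is encoded independently, then feed (seq_len, batch_size).
        self._encoder.reset()
        raw_outputs, _ = self._encoder(inputs.t())

        # raw_outputs[-1] is the last LSTM layer's output,
        # shape (seq_len, batch_size, 400) for wt103.
        embeddings = raw_outputs[-1].transpose(0, 1)

        # Drop the positions belonging to the xbos/xfld markers, returning
        # exactly one 400-dim embedding per original token.
        return embeddings[:, num_tags:, :]

    def get_output_dim(self) -> int:
        return 400  # last-layer size of the wt103 encoder
```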

So…any thoughts? :thinking:

Hey, I’ve actually just been doing some work with AllenNLP and seq2seq with pretrained embeddings like ELMo, ULMFiT, etc. (though for semantic parsing, not sequence labeling). I wrote up a post on it here.

Hi @jbkjr. Did you get to do what we’re discussing here, i.e. evaluating the ULMFiT embeddings in the AllenNLP library?

The post_processing_rules lowercase the input sentence; if you don’t do that, you’ll have plenty of words the pre-trained model doesn’t recognize.
Try comparing your dictionaries; most likely you’ll see quite a few words missing from the dictionary. What’s worse, you’ll miss most of the names, as they are written with uppercase in English.
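Something like this quick check would show the overlap. It is just a sketch; the itos_wt103.pkl filename and the file paths are assumptions:

```python
import pickle

# Assumed paths: the wt103 vocabulary pickle shipped with the pre-trained
# model, and the CoNLL 2003 training file.
with open('itos_wt103.pkl', 'rb') as f:
    wt103_vocab = set(pickle.load(f))

conll_tokens = set()
with open('eng.train') as f:
    for line in f:
        if line.strip() and not line.startswith('-DOCSTART-'):
            conll_tokens.add(line.split()[0])  # first column is the word

missing = {t for t in conll_tokens if t not in wt103_vocab}
recovered = {t for t in missing if t.lower() in wt103_vocab}

print(f'{len(missing)} of {len(conll_tokens)} CoNLL types missing from wt103')
print(f'{len(recovered)} of those are found after lowercasing (mostly names)')
```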

The classifier does pooling over the hidden state outputs; you might want to do something similar, as the raw output of the RNN might not be enough.
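For reference, fastai’s PoolingLinearClassifier concatenates the last time step with max- and mean-pooled hidden states before the linear head. A sketch of that idea (not the library code):

```python
import torch

def concat_pooling(outputs: torch.Tensor) -> torch.Tensor:
    # outputs: (seq_len, batch_size, hidden_size), the last-layer
    # activations coming out of the RNN encoder.
    last = outputs[-1]                # hidden state at the final time step
    max_pool = outputs.max(dim=0)[0]  # element-wise max over time
    avg_pool = outputs.mean(dim=0)    # element-wise mean over time
    # Result: (batch_size, 3 * hidden_size)
    return torch.cat([last, max_pool, avg_pool], dim=1)
```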

I think the best way is to reimplement NER in fastai and do proper fine-tuning. If you want to skip the preprocessing, you might want to at least fine-tune a model on a modified dictionary that contains mixed-case words.

From my understanding, I did the same thing as the classifier; I just didn’t run the output of the RNN through a softmax.

I’ll do that :+1:

Are you saying that what I did for the embedder should be enough? :thinking:

Are you saying that the kind of preprocessing that ULMFiT does prevents the model from being plugged in AllenNLP components?

I’ll work on that, but I suspect there’s too big a difference there.