Lesson 4 official topic

It varies based on the architecture.

If you look up the various language models on Hugging Face, often they’ll include a model card with those details. For example, check out the “Training data” section of GPT-2 here: gpt2 · Hugging Face

2 Likes

How good is state of the art NLP at information extraction these days?

2 Likes

If you want to make a submission to the Kaggle patent competition, the notebook needs to run offline. I have slightly modified Jeremy’s great notebook so that a submission to the competition can be made from it, and you can start competing on the leaderboard!

The changes made are very minor: I downloaded the datasets package and uploaded it to Kaggle as a dataset so it can be accessed offline, and likewise downloaded the deberta-v3-small model from the Hugging Face Hub and uploaded it to Kaggle Datasets, so it can also be accessed offline.
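For anyone replicating this, here’s a minimal sketch of what the offline loading ends up looking like. The dataset paths below are hypothetical — they depend on what you named your uploaded Kaggle datasets:

```python
# Install the datasets package from a local wheel, with no internet access.
# "../input/hf-datasets" is a hypothetical path to your uploaded dataset.
# !pip install --no-index --find-links=../input/hf-datasets datasets

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer from the local Kaggle dataset path
# instead of the Hugging Face Hub:
model_path = '../input/deberta-v3-small'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
```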

19 Likes

You don’t tend to train on raw data (consider that your non-natural language may sometimes have hundreds of blank spaces in a row) - rather, you need an appropriate tokeniser to preprocess the raw text into an efficient representation.
A second issue would be ensuring you had enough quality data for your dataset.
However, you should not need to do anything exotic with the model architecture - the challenges are mostly with the data.
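As a minimal sketch of what that preprocessing step looks like with a Hugging Face tokeniser (the model name here is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-small')

raw = "raw text      with    long runs of    whitespace"
print(tokenizer.tokenize(raw))        # the subword pieces
print(tokenizer(raw)['input_ids'])    # the integer IDs the model actually sees
```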

Both if possible

Your best bet is likely a sequence-to-sequence (seq2seq) model. These are used most often in tasks like summarization and translation, the latter of which is most similar to what it sounds like you want to do.
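For a quick feel of what a seq2seq model does, here’s a minimal sketch using the transformers pipeline API (the summarization model is just one example):

```python
from transformers import pipeline

# Load a pretrained seq2seq model for summarization
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

text = "A long passage of text that you want condensed into a few sentences..."
result = summarizer(text, max_length=60, min_length=10)
print(result[0]['summary_text'])
```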

3 Likes

The underlying concepts (tokenization etc.) are the same. You’re learning sequences. When a machine learns “language”, it is first converted into a mathematical format that is capable of being learned. The chains of these tokens form sequences (at the word, sentence, and paragraph level), and those sequences are then learned. Progress accelerated over the last ~7 years with recurrent neural networks, evolving into Transformer architectures.

For those that are asking about the underlying architectures, you can think about the evolution as LSTM / GRU → ULMFiT → BERT → Transformers / “Attention is All You Need”

To learn more detail about the key insights in the development of transformers, see the foundational paper, “Attention is All You Need”

To understand older architectures like LSTM / GRU, see this great post by Chris Olah:

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

9 Likes

HuggingFace Transformers has a whole hub of hundreds of models from papers, community-contributed, etc…

Maybe for performance comparisons you can check www.paperswithcode.com for NLP-related tasks…

9 Likes

If you start with a pre-trained model, it will converge faster than starting from scratch - so it is likely more efficient to begin with a pre-trained model. It also depends on the scale of your final dataset: if it is large enough, you could train from scratch, but for a smaller final dataset, fine-tuning is really your only option.
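A minimal sketch of the two starting points with Hugging Face Transformers (the model name and label count are just examples):

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

name = 'microsoft/deberta-v3-small'

# Option 1: fine-tune from pre-trained weights (usually converges faster)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Option 2: same architecture, but randomly initialised weights - training
# from scratch, which only pays off with a large enough dataset
config = AutoConfig.from_pretrained(name, num_labels=2)
model_scratch = AutoModelForSequenceClassification.from_config(config)
```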

1 Like

If TEXT2 in this example “eliminating process” were actually 10000 words long, would you repeat the TEXT2 label multiple times, say every 64 words? Or is it enough just having it at the beginning for these models?

1 Like

FYI for the mods: cups.fast.ai is down.

2 Likes

When using ULMFiT (and now subsequently Transformers), how do you handle questions around the explainability/transparency of the predicted outcome? Take for example “Legal discovery (which documents are in scope for a trial)”: the prediction may be binary (1 - in scope, 0 - out of scope), but the stakeholders may want to know why. This is also very important when you have a highly biased dataset, which may well be the case with legal documents.

11 Likes

Yep, we are aware - Jeremy mentioned at the beginning that Radek is working on it…

3 Likes

They’re brilliant. Have a play around with GPT-3 on OpenAI’s website.

Alternatively, you can play around with GPT-2, BART and others on Hugging Face via their API:

4 Likes

Question from the Getting started with NLP for absolute beginners notebook:

Regarding this line of code:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

How come the sentence above is created with the columns in right-to-left order from the data table - so context, then target, then anchor?

Are there any general rules about the order of creating the sentence like that for NLP models?

2 Likes

You can ask the model to explain itself as part of its answer.
You can finetune the model to answer questions in a certain way - ‘My answer is in 3 parts…’.
You can show the model the type of answer you want with demonstration examples within the question itself (‘few-shot learning’) - see the sketch below.
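For instance, a hypothetical few-shot prompt for the legal-discovery example might look like this (the documents and format are made up):

```python
# A made-up few-shot prompt: the demonstration examples show the model both
# the desired label and the desired explanation style.
prompt = """Classify each document for legal discovery and explain why.

Document: "Email discussing the merger timeline."
Answer: In scope - it directly concerns the disputed merger.

Document: "Invitation to the office holiday party."
Answer: Out of scope - unrelated to the litigation.

Document: "Memo on the patent licensing terms."
Answer:"""

# Send `prompt` to the model of your choice (e.g. the OpenAI API or a
# hosted model on the Hugging Face Hub).
```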

Help: I’m getting FileNotFoundError: [Errno 2] No such file or directory: ‘…/input/us-patent-phrase-to-phrase-matching/train.csv’

I’m running on my local machine so I have added ‘LocalHost’

Thank you.

Great question.

I’d start with a consideration of the training data and standard approaches to evaluating the performance of any classification model (e.g., using a confusion matrix, reporting F1/recall/precision/AUC/etc.). You could also create document embeddings, reduce their dimensionality down to 2, and plot them to see if, and how well, they create clusters around your classes (sketched below).

Just a few ideas.
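A minimal sketch of that embedding-plot idea, assuming the sentence-transformers library and PCA for the dimensionality reduction (both choices are just examples):

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

docs = ["first document ...", "second document ...", "third document ..."]
labels = [0, 1, 0]  # your class labels, e.g. in scope / out of scope

# Embed each document, then project the embeddings down to 2D
embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(docs)
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.title('Document embeddings in 2D - do the classes cluster?')
plt.show()
```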

3 Likes

ULMFiT had its own interpretability methods, which were available in an older version of fastai but were not ported over to the current version:

Transformer models have their own interpretability methods, but basically transformer models inherently provide “attention weights” showing which tokens in the text they focus on. If you search for “transformers interpretability”, lots of different open-source libraries and approaches will pop up. Here is an example of one:
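A minimal sketch of pulling the raw attention weights out of a model (the checkpoint is just an example; the dedicated interpretability libraries build much richer views on top of this):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_attentions=True)

inputs = tokenizer("The patent covers a novel battery design.",
                   return_tensors='pt')
with torch.no_grad():
    out = model(**inputs)

# One attention tensor per layer, each shaped
# (batch, num_heads, seq_len, seq_len)
print(len(out.attentions), out.attentions[0].shape)
```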

8 Likes

This is an error with references to the path where you’re downloading the data.

First of all, check you’ve set your API key correctly. Then check your path references.
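For reference, a sketch of the download-and-check pattern, similar to the notebook’s own setup cell (the local branch assumes your Kaggle API key is in ~/.kaggle/kaggle.json):

```python
from pathlib import Path
import pandas as pd

path = Path('us-patent-phrase-to-phrase-matching')
if not path.exists():
    # Running locally: fetch the competition data with the Kaggle API
    import zipfile, kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

df = pd.read_csv(path/'train.csv')
```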

2 Likes