Part 2 Lesson 10 wiki

I sure am interested…

2 Likes

Outlook has a nice API… I’ve used Python to pull the email body, subject, etc.
Images/PDFs: OCR software, or your own models.
I’d love to hear more about this as well - we need better ways to gather more labeled data for NLP!
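To make the email-pulling idea concrete, here’s a minimal stdlib-only sketch (my own illustration, not the poster’s actual code): parse a raw message into subject and body text, which you could then feed into an NLP dataset. The raw text could come from IMAP, the Outlook API, or an .eml file.

```python
from email import message_from_string

# A toy raw message; in practice this string would come from your mail API.
raw = """\
Subject: Quarterly report
From: someone@example.com

Please find the numbers attached.
"""

msg = message_from_string(raw)
subject = msg["Subject"]            # header lookup by name
body = msg.get_payload().strip()    # plain-text body (single-part message)
print(subject)  # Quarterly report
print(body)     # Please find the numbers attached.
```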

1 Like

Does the hinge loss accomplish something like this?

Solid notes on hinge loss vs. cross-entropy loss from CS231n: http://cs231n.github.io/linear-classify/#svmvssoftmax

TL;DR: For a given example, as long as the correct class score beats each incorrect class score by a fixed margin (such as 1), the loss is 0. If the correct class score is not at least the margin higher than every incorrect class score, the loss equals the total shortfall across those pairwise comparisons. So, per your comment, unlike cross-entropy you’re not rewarded for being super-duper confidently correct (the loss is 0 whether your correct class score beats the others by 1.1 or by 100), and conversely, you’re not penalized for being only a bit correct (so long as you’ve cleared the margin).
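A quick numeric sketch of that behaviour (a toy implementation of the CS231n formulation, not code from the course):

```python
def multiclass_hinge_loss(scores, correct_idx, margin=1.0):
    """Multi-class SVM (hinge) loss for one example:
    sum over incorrect classes of max(0, s_j - s_correct + margin)."""
    correct = scores[correct_idx]
    return sum(max(0.0, s - correct + margin)
               for i, s in enumerate(scores) if i != correct_idx)

# Margin cleared -> loss 0, with no extra reward for a bigger gap:
print(multiclass_hinge_loss([5.0, 1.0, 2.0], 0))    # 0.0
print(multiclass_hinge_loss([100.0, 1.0, 2.0], 0))  # 0.0
# Margin not met -> loss is the total shortfall (0.5 + 0.9 here):
print(multiclass_hinge_loss([3.0, 2.5, 2.9], 0))    # ~1.4
```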

3 Likes

Tmux should also be good, if you run it in the server.

1 Like

I like Tabula: http://tabula.technology/
And PDFxStream: https://www.snowtide.com/

1 Like

Thanks, that looks interesting! Right now I’m trying to implement last week’s focal loss on the imdb training, but if this doesn’t work I’ll try out hinge.

2 Likes

From a first reading of the FitLaM paper (I hope we’re allowed to discuss it in the thread), I tried to work out what distinguishes FitLaM from other approaches, and came up with the following understanding:

Hypercolumns: A hypercolumns approach pretrains word embeddings to capture word representations (e.g. syntax and semantics). From the paper, it sounds like this method was used in computer vision first; however, once end-to-end fine-tuning became commonly used in computer vision, the method fell out of use (and a FitLaM-style approach may bring a similar shift to NLP).

Multi-task learning (MTL): MTL trains a language model jointly with a task-specific model. The main issue with MTL is that you have to train from scratch every time the target task changes.

Fine-tuned Language Models:

  1. Pretrain the language model on a large general-domain corpus (e.g. in lesson 10, Wikitext-103 for English), in the language you need for your target task.
  2. Fine-tune the pretrained model from step 1 on the target task’s data, since that data is likely to have different traits. This second step takes much less time, because the model only has to adapt to the words and usage that the large general corpus didn’t cover. It also retains the advantages of transfer learning when the dataset for the target task is small.
  3. In FitLaM, target-task classifier fine-tuning is the last stage; only the parameters of these final layers are trained from scratch for the specific task. For a new task in the same language, you can restart from step 2 without redoing step 1.
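The three stages could be sketched schematically like this (all names are illustrative placeholders, not the fastai API, and “training” is stubbed out; the point is only which weights are reused at each stage):

```python
def pretrain_lm(corpus):
    """Stage 1: pretrain a language model on a large general-domain corpus."""
    return {"encoder": f"weights_from_{corpus}", "lm_head": "generic_lm_head"}

def finetune_lm(lm, task_corpus):
    """Stage 2: adapt the pretrained LM to the target task's text (cheap)."""
    return {**lm, "encoder": lm["encoder"] + f"+tuned_on_{task_corpus}"}

def finetune_classifier(lm, n_classes):
    """Stage 3: only the new classifier head is trained from scratch."""
    return {"encoder": lm["encoder"], "clf_head": f"fresh_{n_classes}way_head"}

wiki_lm = pretrain_lm("wikitext103")   # done once per language
imdb_clf = finetune_classifier(finetune_lm(wiki_lm, "imdb"), 2)
# A new task in the same language restarts at stage 2, not stage 1:
yelp_clf = finetune_classifier(finetune_lm(wiki_lm, "yelp"), 5)
print(imdb_clf["encoder"])  # weights_from_wikitext103+tuned_on_imdb
```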

In summary, the main difference between FitLaM and hypercolumns seems to be whether you use an adaptable model that gets adjusted through further fine-tuning for a specific task, versus a ‘frozen’ representation.

FitLaM could be seen as similar to MTL, in that MTL also combines a language model with a task-specific approach. The difference is that FitLaM maximises the advantages of transfer learning, improving both efficiency (the second step is fast, and the first step never needs repeating) and effectiveness (the model is pretrained on a large corpus).

I’d appreciate your comments on my understanding. Also, I’m not sure what the statement below means:

“MTL could potentially be combined with FitLaM.” (p.2)
Could anyone help me understand the context? Thank you in advance.

6 Likes

Hi @jeremy, could you please share the weights of the language model you trained on the yelp dataset? That would be really helpful. Thanks!

2 Likes

Does anyone know where @jeremy’s paper with the ablation studies is? The one from January (https://arxiv.org/pdf/1801.06146.pdf) doesn’t have that information.

1 Like

What is perplexity?

1 Like

It’s a very common metric for NLP models, expressed as an exponent. Did you see how Jeremy called math.exp in class last night, got 45 or so, and then talked about how this score has been falling rapidly over the past 12-18 months?

If you think about it, the general idea is this:
if you’re less perplexed by something, it could be an indicator that you have a greater understanding of it.

3 Likes

Thanks for the answer.
So is perplexity inversely proportional to accuracy?

Sure.

Yes, somewhat… you want low perplexity in your knowledge representation, which could lead to more accurate results when it’s used for prediction.

NLP, and language in general, is all about compressing knowledge into low-dimensional forms. An image (of a cat) is a very high-dimensional item (lots of pixels). By contrast, a word like “cat” is the most compact representation of that item (well, a bit is the most compact). So when you go from image to label you throw away a ton of information and encode only the “essential” information. The amount of knowledge is usually calculated using Shannon entropy, and traditionally perplexity has been the exponential of the entropy. If you’re interested in the field of knowledge representation, I encourage you to watch Naftali Tishby’s lecture on deep learning models.

But in LM terms, there is a slight variance in the definition of perplexity - encourage you to read the Wikipedia article on perplexity.

5 Likes

@narvind2003 Any resources to get deeper into embeddings? Why are LMs superior to word2vec/GloVe? I understand that those only contain one layer, whereas here the model has a dedicated network to explore the latent space and better represent the words, but I’d like to go into further detail.
Also, Jeremy used a word-based LM here and talked about sub-word, phrase-based and sentence-based LMs. Would a character-based LM be something like what we did with Nietzsche’s writings in Lesson 6? What would be its advantages or disadvantages here (for classification)?

1 Like

Since you got me talking about my favorite topic (knowledge representation), I’ll say one more thing:

Words don’t always correspond to bits. Some do, like proper nouns.
But with natural language you have to deal with word-sense disambiguation, and this is a huge problem with verbs, adjectives, etc.

When we consciously throw away the ambiguity, we declare that a word means one thing only. Then it’s easy to one-hot encode words and treat them as labels.
With nouns you don’t have to worry much, because they are the least ambiguous set of words.

Jeremy talked about how he’ll bring CV and NLP together in a future lesson. In last year’s course (search for “fish in nets”) this was done by carefully picking nouns from WordNet (low-dimensional representations) and matching them to corresponding images from ImageNet (high-dimensional). Then you can jump back and forth between the representations and do fancy stuff with fast approximate nearest neighbors.
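That “jump between representations” idea can be sketched as a nearest-neighbour lookup in a shared embedding space. The vectors below are made up purely for illustration; a real system would use learned embeddings and an approximate-NN library rather than this brute-force cosine search:

```python
import math

# Toy 2-d "shared space": a few word embeddings on one side, and an
# image embedding (from some vision model) on the other.
word_vecs = {
    "cat":  (1.0, 0.1),
    "dog":  (0.9, 0.9),
    "fish": (0.0, 1.0),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest_word(image_vec):
    """Which word embedding is closest to this image's embedding?"""
    return max(word_vecs, key=lambda w: cosine(word_vecs[w], image_vec))

print(nearest_word((0.95, 0.15)))  # cat
```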

I hope this year we step beyond just WordNet nouns, try dealing with verbs, and face the ambiguity/context challenge head-on.

5 Likes

Perplexity is a way of evaluating language models.
Watch this youtube video for more info… :slight_smile:

3 Likes

Any chance we could have a look at the last presentation? https://github.com/fastai/fastai/tree/master/courses/dl2/ppt hasn’t been updated for a while.

Just added link to slides to the top post.

3 Likes

Just to have a consistent format. It makes things a little simpler when all the code can assume the 1st column is labels and the rest are text. You can certainly remove the labels column and adjust the code to handle that special case.
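For instance, with the labels-first layout the loading code stays trivial (a generic sketch, not the actual course code):

```python
import csv, io

# Two-column layout: first column is the label, the rest is the text.
raw = "0,This movie was terrible\n1,Loved every minute of it\n"
rows = list(csv.reader(io.StringIO(raw)))
labels = [int(r[0]) for r in rows]   # 1st col -> labels
texts = [r[1] for r in rows]         # remaining col -> text
print(labels)    # [0, 1]
print(texts[1])  # Loved every minute of it
```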

3 Likes

I haven’t found them finicky - but larger batch sizes are generally a good idea, to a point.

1 Like