Language models for multiple languages

(Arvind Nagaraj) #41


We’ll be doing something known as “fish in nets” soon, in upcoming lessons. That will blow your mind for sure.

(Kaitlin Duck Sherwood) #42

If this is tranch+net, I saw it in part2v1 a few months ago and it did in fact completely blow my mind. :exploding_head: :exploding_head: :exploding_head:

(Kaitlin Duck Sherwood) #43

(I made an edit to the top, repeating it here in case people have already read past the top post)

This thread has, uh, kind of diverged (mostly through my fault) into a discussion of the philosophy of NLP and word embeddings. Fortunately, @lesscomfortable made a thread

Language Model Zoo :gorilla:,

which has a place to sign up to make models for different languages. You should probably go there and let this thread die a peaceful death.

(Arvind Nagaraj) #44

Before we kill -9 this thread, I want to share the first paper I know of that introduced word embeddings.
There’s an interview with Geoff Hinton where he explains that the first word embeddings demo was what helped him get the original backprop paper published! Watch from 4:32 to 6:00 -

(Kaitlin Duck Sherwood) #45

Interesting: in the interview, he talks about how there was a tension between psychologists who looked at concepts as a bag of features and the AI people who looked at concepts as being nodes in a network. If you squint, that kind of seems analogous to the argument errr civil discussion we had today here, where Jeremy was talking about (paraphrasing heavily) the important thing being the network, the full model, and me talking about how we shouldn’t count out the bag of features (the word embedding). :slight_smile:

(And whoa, 1986!)

(Jeremy Howard) #46

Sorry I’m not willing to squint that much. :stuck_out_tongue: I’m all about distributed representations - I just distribute them over multiple layers!

(Kaitlin Duck Sherwood) #47

What I got out of reading CoVe and ELMo papers: they’re like word embeddings++. Instead of using one layer’s outputs as your vector for the input word, you concatenate all the layers’ outputs together and use THAT as your input word’s vector. Then use that kind of like you would use an “ordinary” embedding vector.

This seems kind of breathtaking to me: seems like your dimensionality would get really big pretty fast.
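The concatenation scheme described above can be sketched in a few lines of numpy; every shape and name here is a made-up stand-in for illustration, not the actual CoVe/ELMo code:

```python
import numpy as np

# Pretend outputs for one token position from a 3-layer bidirectional LM,
# each layer producing a 512-dim vector (shapes are illustrative only).
n_layers, dim = 3, 512
layer_outputs = [np.random.randn(dim) for _ in range(n_layers)]

# Concatenating every layer's output gives one long contextual vector,
# which is why the dimensionality grows so fast: 3 * 512 = 1536 here.
contextual_vec = np.concatenate(layer_outputs)

# ELMo actually tames this by learning softmax-normalized scalar weights
# and summing the layers, which keeps the dimension fixed at 512.
raw = np.array([0.1, 0.5, 0.4])
weights = np.exp(raw) / np.exp(raw).sum()
weighted_vec = sum(w * h for w, h in zip(weights, layer_outputs))
```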

(Jeremy Howard) #48

That’s pretty much right. And the reason it’s obsolete is that fine-tuning the pre-trained model end-to-end (as we did in the last lesson) is better than treating it as fixed.

(Jeremy Howard) #49

Well… I say “pretty much right”. But note that it is not an input word’s vector. It’s a document’s vector, generally speaking.

(Kaitlin Duck Sherwood) #50

Oops, my bad! Yes, that was emphasized as being important: that you don’t even have the option of looking at a word by itself, you have to look at the word in the context of a document.

It got me thinking at first, “well, but that would mean that you couldn’t make a dictionary!” and then got further to thinking, “well, but words don’t mean anything in isolation; dictionaries are a crude way of dealing with the fact that it’s impractical with paper to give the context of the word you are curious about.”

(Bharadwaj Srigiriraju) #51

@jeremy I am a native speaker of Telugu, and I was wondering how useful training the language model is for languages like Telugu, Malayalam etc., which are classified as Agglutinative. Other languages that are agglutinative include Japanese, Korean, Turkic languages, and even Klingon :smiley: Apologies if these questions were asked before…

In a discussion about tokenization on Stack Overflow, it was mentioned that regular tokenization isn’t sufficient and a “full-blown morphological analysis” is needed to split sentences into their tokens in agglutinative languages, since they don’t have proper word boundaries (unlike languages such as English)… I couldn’t find a decent morphological analyzer for Telugu though… heck, not even a spaCy tokenizer (yet)… so I have to go ahead with regular English-like tokenization and cleaning for now.

  • Is training a regular language model as described in the lesson (using an English-like tokenizer) the best way to model such a language? How useful is the resulting language model when the tokens themselves might not be “correct”? Has anyone tried this and got good results? How good are the embeddings when a good percentage of the words might be “duplicates” of each other? I bet we lose some relationships this way :thinking:

  • Can existing fastText word embeddings be used in some way to make the model better? Maybe just to identify tokens and to initialize the weights?

  • Experiment: train a language model the traditional fastai way using wiki dumps and a regular English-like tokenizer, then compare word-vector similarity in the resulting embeddings to create a new morphological analyzer for the language. I have a strong feeling this should give a decent analyzer, but I am not sure how to measure token similarity and set the “similarity threshold” from the embeddings alone…
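On the second bullet, one low-tech way to reuse fastText vectors is simply to copy them into the embedding layer’s initial weights wherever a vocab token matches. This is a hedged sketch (the .vec file, its path, and the tiny dimensions are made up for illustration), not fastai’s actual loading code:

```python
import numpy as np

# Write a hypothetical .vec file in fastText's text format: a header line
# "count dim", then one "word v1 v2 ..." line per word.
with open("toy.vec", "w", encoding="utf-8") as f:
    f.write("2 4\n")
    f.write("illu 0.1 0.2 0.3 0.4\n")
    f.write("pustakam 0.5 0.6 0.7 0.8\n")

def fasttext_init(path, vocab):
    """Build an embedding-init matrix for `vocab`: copy fastText vectors
    where available, fall back to small random values elsewhere."""
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        known = {}
        for line in f:
            word, *nums = line.rstrip().split(" ")
            if word in vocab:
                known[word] = np.asarray(nums, dtype=np.float32)
    mat = np.random.normal(0.0, 0.1, (len(vocab), dim)).astype(np.float32)
    for i, w in enumerate(vocab):
        if w in known:
            mat[i] = known[w]
    return mat

init_weights = fasttext_init("toy.vec", ["illu", "pustakam", "kotha"])
```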

Any suggestions/ readings/ comments that would give me more perspective on this problem are welcome!
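For the experiment in the last bullet, a starting point might be plain cosine similarity over the learned embedding matrix. Everything below (the Telugu-ish words, the random stand-in vectors, the 0.6 threshold) is an assumption for illustration; in the real experiment the vectors would come from the trained language model:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in embeddings: a base vector, a noisy copy of it to mimic a
# morphological variant, and an unrelated random vector.
rng = np.random.default_rng(0)
base = rng.standard_normal(100)
emb = {
    "pustakam": base,                                    # e.g. "book"
    "pustakalu": base + 0.1 * rng.standard_normal(100),  # plural surface form
    "illu": rng.standard_normal(100),                    # unrelated word
}

def same_stem(w1, w2, threshold=0.6):
    """Guess that two surface forms are morphological variants when their
    vectors are close; how to set `threshold` is the open question."""
    return cosine(emb[w1], emb[w2]) >= threshold
```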

(Jeremy Howard) #52

It works great, but first you need to use sentencepiece to segment the corpus. This is an area that I’ve briefly experimented with but haven’t written up yet, and results looked really great.

If you or anyone else takes a serious look at this please let me know - I’d be happy to cooperate on a paper assuming you get good results. I know that @shoof is looking at Chinese, which has the same issue.

(Although I can’t speak for Klingon from experience…)

(Bharadwaj Srigiriraju) #53

SentencePiece looks promising, will check it out and keep you guys posted. Hoping to see some interesting results!

(Igor Kasianenko) #55

This might be useful: Plenty of parallel data for translation:

Does anyone deal with Hebrew or other languages where prepositions are written as one word with the word they refer to?