Sure you can, “this movie is shit” and “this movie is the shit” mean very different things
I did remove stop words with the IMDb data before, and it does help
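A minimal sketch of what that stop-word removal can look like (the stop list here is hand-rolled for illustration; NLTK's stopwords corpus is a common alternative):

```python
# Hand-rolled stop list for illustration only; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "this", "of", "and", "to", "in"}

def remove_stop_words(text):
    """Lowercase, tokenize on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("This movie is a masterpiece of the genre"))
# → ['movie', 'masterpiece', 'genre']
```

Note the caveat from the post above: dropping "the" would collapse "the shit" into "shit", so whether this helps depends on the task.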
Bag of words is a method that throws away word order and keeps only the counts of each token (so more frequent words stand out), whereas the vocab is the unique set of all words we want to keep for modeling purposes.
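To make the distinction concrete, here is a tiny sketch of both ideas on two made-up "reviews" (pure stdlib; `CountVectorizer` from scikit-learn does the same thing at scale):

```python
from collections import Counter

docs = ["this movie is great great fun", "this movie is terrible"]
tokenized = [d.split() for d in docs]

# Vocab: the unique set of all words we keep for modeling.
vocab = sorted({tok for toks in tokenized for tok in toks})

# Bag of words: per-document token counts, with word order discarded.
bows = [Counter(toks) for toks in tokenized]

print(vocab)             # → ['fun', 'great', 'is', 'movie', 'terrible', 'this']
print(bows[0]["great"])  # → 2
```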
It sounds like that break doesn’t happen with review boundaries in mind? Does that matter? And in particular if all the reviews tend to be short does it matter if multiple reviews get appended together?
Does anyone know which papers Jeremy is talking about? Papers that are a breakthrough in NLP?
Jeremy came up with this idea he is discussing. He then found some papers that talk about it.
< lecture thoughts – BPTT I think >
I wonder how you could apply this to vision – say a self-driving car or a plane; context is very important… (maybe esp. if you could learn a weighting since some states of a car or plane have a large effect on what’s possible later on) … is there a way to encode images / state the way Jeremy just showed with words?
– around 1:50:00 in the lecture, when Jeremy showed the array of text after someone asked a question.
Hmm… how to do backprop-through-time for images… maybe a multi-input with perceived state --> fed into a ‘decision-maker’ NN…
Are the sentences just concatenated together? And what about something like movie genre - would that have an impact on the type of language / reviewer etc. Is it just the biggest number that wins?
Batching and BPTT are explained well in this example: https://github.com/pytorch/examples/blob/master/word_language_model/main.py
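A condensed version of the batching logic from that example, assuming the whole corpus has already been concatenated into one 1-D tensor of token ids:

```python
import torch

def batchify(data, bsz):
    """Trim the concatenated token stream so it divides evenly by bsz,
    then reshape so each COLUMN is one continuous slice of the text."""
    nbatch = data.size(0) // bsz
    data = data[:nbatch * bsz]
    return data.view(bsz, -1).t().contiguous()  # shape: (nbatch, bsz)

def get_batch(source, i, bptt):
    """One BPTT chunk: inputs, plus targets shifted forward by one token."""
    seq_len = min(bptt, len(source) - 1 - i)
    return source[i:i + seq_len], source[i + 1:i + 1 + seq_len].reshape(-1)

stream = torch.arange(26)           # stand-in for a stream of token ids
batched = batchify(stream, bsz=4)   # 6 rows x 4 columns; column 0 is tokens 0..5
x, y = get_batch(batched, 0, bptt=2)
print(batched.shape, x.shape, y.shape)
```

Note that the columns cut the stream at arbitrary points, not at review boundaries, which is exactly the concern raised above.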
What’s the size of the embedding for this language model?
— edit —
answer = 200
Yes, the sentences are concatenated; each column of a batch represents one continuous stream of that concatenated text.
Movie genre may impact the sentiment, given the bias (in the human sense, not the ML sense) that reviewers have toward particular genres.
It's not simply the biggest number winning. You can think of it as a complex combination of the sentiment signal associated with each word. I expect Jeremy will talk about it.
@yinterian Why aren't we using pre-trained embeddings like word2vec or GloVe?
How can we use CNNs for NLP tasks?
What kind of hardware do you need to do this in a timely fashion?
Some details for Machine Translation in here: https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
Would you need to use this approach to solving other NLP problems like topic modelling?
Because word2vec and GloVe aren't IMDb-content-specific, and they're pretty large in embedding dimension. By training our own IMDb embeddings we create a small, meaningful embedding for our language model.
You can still use word2vec or GloVe on our current dataset.
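If you did want to start from pre-trained vectors, one common approach is to copy them into an `nn.Embedding` and fine-tune from there. A sketch, assuming a hypothetical tiny `pretrained` dict standing in for a parsed GloVe/word2vec file:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a parsed GloVe file (word -> vector).
pretrained = {"movie": [0.1, 0.2, 0.3], "great": [0.4, 0.5, 0.6]}
vocab = ["<unk>", "movie", "great", "terrible"]
emb_dim = 3

weights = torch.zeros(len(vocab), emb_dim)
for i, word in enumerate(vocab):
    if word in pretrained:                         # copy known vectors...
        weights[i] = torch.tensor(pretrained[word])
    else:                                          # ...randomly init the rest
        weights[i] = torch.randn(emb_dim) * 0.1

# freeze=False keeps the vectors trainable so they adapt to IMDb language.
emb = nn.Embedding.from_pretrained(weights, freeze=False)
print(emb(torch.tensor([1])))  # embedding for "movie"
```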
Topic modeling is often not framed as a classification problem. For an NLP classification problem, this approach should work.