Lesson 11 wiki thread

This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post. Here’s a pic of what to look for:

<<< Wiki: Lesson 10 | Wiki: Lesson 12 >>>

Lesson resources

Reflections and Observations

Learning from Kaggle Competitions after the competition closes

Some of the best learning opportunities with Kaggle competitions come after the competition closes.

Firstly, you only know what worked in your own data processing and analysis pipeline after the fact. You can then iron out the inefficiencies you ran into.

Secondly, you can learn what the winners tried. Kaggle competition winners often write blog posts, host meetups, and release their code on GitHub. You can compare your pipeline to a winner's pipeline: their blog posts, kernels, and GitHub repositories reveal their thinking process as well as solid code examples with evidence of success.

Looking at Kaggle winners' blogs and presentations lets you reflect on whether your code was buggy in a critical place. Perhaps you had a specific strategy you wanted to apply but didn't have time to follow through on; if the winning team tried that strategy, your idea was validated in a way, and next time you can try it out.

Lastly, you're able to review different perspectives on the same problem. By the end of the competition, the experience is almost like working with a team of data scientists: you all share an understanding of the problem space and are working with the same vocabulary and the same libraries. However, not all approaches perform equally well. Winning solutions often tackle the problem in a way you may not have imagined, so learning about the winners' approach expands your own problem-solving abilities.

By working on a Kaggle problem, you've joined a community with similar interests and experience. You can probably reach out to the winners to learn more, and most are more than happy to share their experience and expertise with you.

Review of Logistic Regression model-building

Introduction to Naive Bayes

Introduction to Natural Language Processing (NLP)

Working with Word Vectors

NLP with PyTorch and fastai library

Introduction to Word Embeddings

Feature Engineering with an Embedding Matrix

As long as you have enough data, keeping data as categorical where possible is a good idea.
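A minimal sketch of what this looks like with an embedding matrix (the column name, category count, and embedding size below are made-up illustrations, not values from the lesson):

```python
import numpy as np

# Hypothetical sketch: instead of one-hot encoding a categorical column
# (e.g. day-of-week), look each category up in a learned embedding matrix,
# so every category gets a dense feature vector the model can refine.
rng = np.random.default_rng(42)
n_categories, emb_dim = 7, 4                 # assumed sizes
emb_matrix = rng.normal(size=(n_categories, emb_dim))

day_codes = np.array([0, 3, 3, 6])           # categorical codes for four rows
features = emb_matrix[day_codes]             # row lookup = embedding lookup
print(features.shape)                        # (4, 4)
```

In a neural net the embedding matrix is a trained layer rather than random numbers, but the lookup works the same way: each categorical code selects one row.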

Hi @jeremy, I have a quick question about the NLP text analysis. Going off of the example that was shown in class Thursday/Friday, what would happen if a sentence had a duplicate word, such as “The movie is good, really good.”? Would the probability of “good” change? Or is this solely based on unique words? Thank you!


Duplicate words result in a 2 in the doc-term vectors, or simply a 1 in the binarized version. For the embeddings, I believe it was binarized as well. As a compression, the doc is just a list of words encoded as numbers, one number per unique word, I think.


Thank you for the clarification @parrt. Just to make sure I understand correctly: if there is a duplicate word, it would affect the Naive Bayes approach, but not the binarized (or embedded) version, since a duplicate word would just be a 1 in that case? And the compressed matrix just has the indices of the words that appear, not accounting for duplicates?

Check out chapter 13 on Naive Bayes text classification in this book: https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf. The document vectors are often word-occurrence counts, but I’ve seen these vectors contain TF-IDF scores or binarized “this word is present” values, etc., all with Naive Bayes. For the embeddings, I believe it’s always a binary bag-of-words document vector. Mostly it’s the presence of words, not their counts, that matters.


Best is to look at the code in the notebook: when we used the sign() method we were passing in binarized features, otherwise we were passing in counts. Naive Bayes can use either (we tried both in the notebook and showed the comparison). We only used binarized features with the embeddings.
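A minimal sketch of the difference, using the sentence from the question above (the vocabulary and its indices are made up for illustration, not taken from the notebook):

```python
import numpy as np

# "The movie is good, really good." as a tokenized doc
vocab = {"the": 0, "movie": 1, "is": 2, "good": 3, "really": 4}
doc = ["the", "movie", "is", "good", "really", "good"]

# Count vector: the duplicate "good" shows up as a 2
counts = np.zeros(len(vocab), dtype=int)
for w in doc:
    counts[vocab[w]] += 1
print(counts)            # [1 1 1 2 1]

# Binarized vector, analogous to applying sign() to the counts:
# the duplicate collapses to a 1
binarized = np.sign(counts)
print(binarized)         # [1 1 1 1 1]

# "Compressed" form: the doc as a list of word indices, one per unique word
compressed = sorted({vocab[w] for w in doc})
print(compressed)        # [0, 1, 2, 3, 4]
```

So a duplicate word only changes the count representation; the binarized and compressed forms are unaffected.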


Maybe it escaped my attention, but why don’t we normalise our dataset, which in this case would be the vocabulary? If I remember correctly, neural nets, linear regression, and logistic regression need mean and variance normalization. Binarized Naive Bayes does look a bit like normalization to me, but it’s not the same, is it?
