Wiki: Lesson 4

<<< Wiki: Lesson 3 | Wiki: Lesson 5 >>>

Lesson links


Video timeline

  • 00:00:04 More cool guides & posts made by classmates
    “Improving the way we work with learning rate”, “Cyclical Learning Rate technique”,
    “Exploring Stochastic Gradient Descent with Restarts (SGDR)”, “Transfer Learning using differential learning rates”, “Getting Computers to see better than Humans”

  • 00:03:04 Where we go from here: Lesson 3 -> 4 -> 5
    Structured Data Deep Learning, Natural Language Processing (NLP), Recommendation Systems

  • 00:05:04 Dropout discussion with “Dog_Breeds”,
    looking at a sequential model’s layers with ‘learn’, Linear activation, ReLU, LogSoftmax

  • 00:18:04 Question: “What kind of ‘p’ to use for Dropout as default”, overfitting, underfitting, ‘xtra_fc=’

  • 00:23:45 Question: “Why monitor the Loss / LogLoss vs Accuracy”

  • 00:25:04 Looking at Structured and Time Series data with Rossmann Kaggle competition, categorical & continuous variables, ‘.astype('category')’

  • 00:35:50 fastai library ‘proc_df()’, ‘yl = np.log(y)’, missing values, ‘train_ratio’, ‘val_idx’. “How (and why) to create a good validation set” post by Rachel

  • 00:39:45 RMSPE: Root Mean Square Percentage Error,
    create ModelData object, ‘md = ColumnarModelData.from_data_frame()’

  • 00:45:30 ‘md.get_learner(emb_szs,…)’, embeddings

  • 00:50:40 Dealing with categorical variables
    like ‘day-of-week’ (Rossmann cont.), embedding matrices, ‘cat_sz’, ‘emb_szs’, Pinterest, Instacart

  • 01:07:10 Improving Date fields with ‘add_datepart’, and final results & questions on Rossmann, step-by-step summary of Jeremy’s approach


  • 01:20:10 More discussion on using library for Structured Data.

  • 01:23:30 Intro to Natural Language Processing (NLP)
    notebook ‘lang_model-arxiv.ipynb’

  • 01:31:15 Creating a Language Model with IMDB dataset
    notebook ‘lesson4-imdb.ipynb’

  • 01:31:34 Question: “So why don’t you think that doing just directly what you want to do doesn’t work better?” (referring to the pre-training of a language model before predicting whether a review is positive or negative)

  • 01:33:09 Question: “Is this similar to the char-rnn by karpathy?”

  • 01:39:30 Tokenize: splitting a sentence into an array of tokens

  • 01:43:45 Build a vocabulary ‘TEXT.vocab’ with ‘dill/pickle’; ‘next(iter(md.trn_dl))’

  • The rest of the video covers the ins and outs of the notebook ‘lesson4-imdb’, don’t forget to use ‘J’ and ‘L’ for 10 sec backward/forward on YouTube videos.

  • 02:11:30 Intro to Lesson 5: Collaborative Filtering with Movielens


Embeddings vs One-Hot Encoding: Embeddings are better than one-hot encodings because they allow relationships between values to be learned (e.g. Saturday and Sunday are both weekend days). A one-hot encoding makes every value equally distant from every other: Wednesday and Saturday look exactly as similar as Saturday and Sunday. In other words, embeddings give a neural network a chance to learn “rich representations”.
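A minimal numpy sketch of the contrast above (the 7×4 size and the random initialisation are just illustrative; in fastai the embedding matrix is a learned layer):

```python
import numpy as np

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# One-hot: every day is a unit vector, equidistant from every other day.
one_hot = np.eye(len(days))

# Embedding: a 7x4 matrix of learnable weights (randomly initialised here);
# training can move the 'Sat' and 'Sun' rows close together in this 4-d space.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(days), 4))

# Looking up a day is just indexing a row of the matrix.
sat_vec = emb[days.index('Sat')]   # a dense 4-d vector, not a length-7 indicator
```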

Overfitting vs. Underfitting, an example

training loss, validation loss, accuracy
0.3, 0.2, 0.92 = underfitting
0.2, 0.3, 0.92 = overfitting
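The rule of thumb in the example can be written as a tiny (hypothetical) helper: when training loss is higher than validation loss you are underfitting; when validation loss is higher you are overfitting.

```python
def diagnose(train_loss, val_loss):
    """Rough fitting diagnosis from the train/validation loss gap."""
    if train_loss > val_loss:
        return 'underfitting'   # model hasn't even fit the training data
    if val_loss > train_loss:
        return 'overfitting'    # model fits training data better than new data
    return 'balanced'

diagnose(0.3, 0.2)  # 'underfitting'
diagnose(0.2, 0.3)  # 'overfitting'
```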

thanks @grez911 giving this a shot now…


@anurag any chance that could be added to the crestle template?


Thanks - fixed now. FYI the z flag to tar is now redundant AFAIK - it figures it out for itself. (Although I’m still glad to have this fixed since it was using unnecessary space!)


Done. spacy.load('en') works as expected.


Thank you. Could you please also include this IMDb data under /datasets/ in crestle? I don’t know why it takes so long to unpack, but it took more than 2 hours.

Now available under /datasets/


I have a question regarding the RMSPE (Root Mean Square Percentage Error) calculation. In the video, Jeremy makes the point that ln(a/b) = ln(a) - ln(b). However, I don’t see how this relates to the calculation of RMSPE using exp_rmspe.

RMSPE is defined as sqrt(mean(((targ - y_pred)/targ)^2))

We can express this in two lines as:
pct_var = (targ - y_pred)/targ
RMSPE = sqrt(mean(pct_var^2))

Since we took the ln of the data previously, we now need to take the exponent. So, in three lines:
targ = exp(targ); y_pred = exp(y_pred)
pct_var = (targ - y_pred)/targ
RMSPE = sqrt(mean(pct_var^2))

It looks like that’s exactly what the function exp_rmspe does:

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

This all makes sense, but I don’t see how any of it relates to ln(a/b) = ln(a)-ln(b).
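A quick numeric check of the function quoted above, assuming inv_y is simply np.exp (undoing the earlier yl = np.log(y) transform from the notebook):

```python
import math
import numpy as np

def inv_y(a):
    # undo the earlier log transform of the sales target
    return np.exp(a)

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred)) / targ
    return math.sqrt((pct_var ** 2).mean())

# Actual sales of 110 and 190 predicted as 100 and 200 (inputs in log space):
err = exp_rmspe(np.log([100.0, 200.0]), np.log([110.0, 190.0]))
# err is about 0.074, i.e. roughly a 7% typical percentage error
```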



What would you do in a situation where you have missing data? For example, imagine the Rossmann data, but you only had weather data for the two most recent years. You would still like to include the years for which you don’t have weather data, because you have other features for them.

One idea would be to turn a continuous variable into a categorical variable with bins, so that the years for which you don’t have temperature data can be their own bin and go into the embedding layers?

Just sharing my notes for this lesson:

Training, validation, test sets, and notes on dropout:

Encoding, structured data predictions, including some extra notes on one hot encoding:

Natural language processing (I kept running into errors running the code here):


See the ML course here - we show how to handle missing data in some detail. (TL;DR - fastai can do it for you)


Thanks for sharing your notes!


Wrote my first blogpost on Entity Embedding of categorical variables for structured data, hope you find it useful. Any suggestions are most welcome.


I’m not clear on embedding matrices. We start with a rank-1 tensor (one row × n columns). Then we create a 7×4 matrix for days of the week. We pick ‘Sun’ in the rank-1 tensor and replace it with the 4-column value. Now I’m not clear how those 4 columns will fit in our rank-1 tensor.


Is there somewhere we can access the arXiv dataset used for the language modeling notebook? The path in the notebook is /data2/datasets/part1/arxiv/, and it’s not on the fastai Paperspace machine. Is it available to us anywhere?


If anybody else gets a stack trace on the first cell of lesson-4-imdb.ipynb, you probably don’t have the ‘en’ spacy model installed (I didn’t, using the paperspace machine image). You can check by inserting a cell with just

import spacy

If that fails, then (in another terminal) run python -m spacy download en and after a few minutes you’ll have that model and it’ll work. :slight_smile:

Similarly, I didn’t have a data/aclImdb/models directory in which to save the TEXT object.


Before, this would have been one-hot encoded, as it is categorical and not continuous. The number of embeddings will also affect the shape of the weights.
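To make the shape change concrete, here is a numpy sketch (sizes are illustrative, following the 7×4 day-of-week example from the lecture): the category index is replaced by its embedding row and concatenated with the continuous columns, so the input row the network sees gets wider.

```python
import numpy as np

rng = np.random.default_rng(0)

# 7 day-of-week categories, each mapped to a learned 4-d vector
emb_dow = rng.normal(size=(7, 4))

# One raw input row: a day index plus two continuous features
day_idx = 5                       # e.g. Saturday
conts = np.array([0.3, 1.7])      # e.g. scaled distance and temperature

# The single index becomes 4 embedding values, concatenated with the
# 2 continuous columns: the network's input row is now 4 + 2 = 6 wide.
net_input = np.concatenate([emb_dow[day_idx], conts])
```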

Hi @jeremy,
I tried using an embedding on a different dataset using Keras.

The loss graph is very weird.
What am I doing wrong?

Can you share your whole notebook?

Here is the link to my notebook