Lesson 4 Advanced Discussion ✅

jeremy · November 23, 2018, 12:17pm

If you make it a separate field in your CSV, then it’ll automatically be tagged for you when you import it.

cwerner · November 23, 2018, 12:32pm

Thanks, but I don’t quite follow. I was thinking along the lines that the title content should probably hold more weight than the body of the abstract as it’s the abstract of the abstract if you will…

If my goal later is to do topic modeling, would it help to mark the title content (that I currently have in a separate column and merge with the body text col for a joined col) as more important for the model?

If I do, would that be at the language model learning step, or rather at the classification training step…

Sorry if I ask the wrong questions - pretty new to NLP…

jeremy · November 23, 2018, 11:46pm

Put your title and doc in different fields in a CSV. Then it’ll be tagged by the dataset automatically. If the model finds that tag is more important, it’ll use it automatically.
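
A rough sketch of what that tagging amounts to, assuming (as I understand fastai v1) that each text column gets a field-marker token and its 1-based index in front of it before tokenization. The helper `join_with_field_markers` and the sample row are hypothetical; `xxbos` and `xxfld` are meant to mirror fastai's special tokens, but this is an illustration, not the library's code. In some fastai v1 releases this behaviour is controlled by a `mark_fields` argument, so check your version.

```python
# Illustrative only: mimics joining several text columns with field markers.
BOS, FLD = "xxbos", "xxfld"  # assumed to match fastai v1's special tokens

def join_with_field_markers(fields):
    """Join text fields into one string, tagging each with its 1-based index."""
    parts = [BOS]
    for i, text in enumerate(fields, start=1):
        parts.append(f"{FLD} {i} {text}")
    return " ".join(parts)

# Hypothetical CSV row with the title and body kept in separate columns.
row = {"title": "Deep learning for abstracts", "body": "We study topic models."}
joined = join_with_field_markers([row["title"], row["body"]])
print(joined)
```

Since the title always follows the field 1 marker and the body follows the field 2 marker, the model can learn to weight them differently without any manual weighting.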

cwerner · November 23, 2018, 11:59pm

Oh wow. That’s cool. Looks like I have to read up on the docs a bit more.

gerardo · November 25, 2018, 3:15am

github.com

fastai/fastai/blob/master/examples/tabular.ipynb

Looks like there’s something wrong with the example

On the section with the prediction of the tabular data I’m getting an error.

martijnd · November 26, 2018, 10:09am

You can set the batch size as follows (it’s in the updated course-v3)

bs = 24
data_lm = TextLMDataBunch.load(path, 'tmp_lm', bs=bs)

You can specify the settings of max_vocab and the language as follows.

txt_proc = [
    TokenizeProcessor(tokenizer=Tokenizer(lang='nl') ),
    NumericalizeProcessor(min_freq=1, max_vocab=10000 )
]

data_lm = (TextList.from_df(df, cols='text', processor=txt_proc)
           .random_split_by_pct(0.1)
           .label_for_lm()
           .databunch())

martijnd · November 26, 2018, 10:17am

When we fine-tune the pretrained wiki103 language model on our own dataset (IMDB), why don’t we have to align the vocab of the new dataset with the Wiki103 vocab? For example, like this:
data_lm = (TextList.from_folder(path, vocab=data_lm.vocab)

That is what we do when we build the data for the classifier:
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
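
One possible reason the two steps differ: pretrained embedding rows can be remapped onto whatever vocab the new corpus produces, whereas the classifier has to reuse the exact token ids its fine-tuned encoder was trained with. Below is a minimal sketch of such a remapping in plain Python, assuming tokens missing from the pretrained vocab start from the mean of the pretrained rows. The function `remap_embeddings` and the toy vocabularies are hypothetical, not fastai's API.

```python
def remap_embeddings(old_itos, old_emb, new_itos):
    """Build an embedding table for new_itos from a pretrained table."""
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    dim = len(old_emb[0])
    # Mean row, used for tokens absent from the pretrained vocab (assumption).
    mean_row = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    new_emb = []
    for tok in new_itos:
        idx = old_stoi.get(tok)
        new_emb.append(list(old_emb[idx]) if idx is not None else list(mean_row))
    return new_emb

# Toy pretrained vocab with 2-dimensional embeddings.
old_itos = ["the", "movie", "river"]
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Vocab built from the new corpus: different order, one unseen token.
new_itos = ["movie", "the", "blockbuster"]
new_emb = remap_embeddings(old_itos, old_emb, new_itos)
print(new_emb)
```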

fredguth · November 26, 2018, 10:34am

Thanks! This is great.

sparalic · November 26, 2018, 3:45pm

@mb4310 I’m curious. Do you ever run these as an ensemble?

Andreas_Daiminger · December 10, 2018, 7:24pm

Hi,
Is there a way to use the collaborative filtering approach with additional metadata?
Let’s say I have the exact same dataset as discussed in the class, but additionally I have metadata for both movies (e.g. genre, release date, …) and users (e.g. age, occupation, …). How can I create a NN that uses a collaborative filtering approach but also takes the metadata into account?

seb0 · December 12, 2018, 11:13am

Of course this works. You probably have to write your own Items and ItemLists as defined in tutorial 3 (see docs) and then maybe define forward / collate functions. It should be pretty straightforward to extend.

Andreas_Daiminger · December 12, 2018, 2:25pm

I am having trouble understanding how this would work though. What would the matrix factorization process look like?

seb0 · December 12, 2018, 2:35pm

I assume you have metadata that is only based on the interactions? Like, let’s say, the distance between the main site of the movie’s plot and the home address of the user? Then you would need to hook into the model after the factorization has taken place and concatenate your additional dimensions, I would assume. I haven’t done this before, so maybe someone more experienced could help here. But since you can basically add layers wherever you want at the pytorch level (by overriding some fastai things), I assume this shouldn’t be too hard. If your metadata is only related to one of the items, then you can just append it before factorization and run everything like you normally would.

Andreas_Daiminger · December 12, 2018, 2:52pm

I do not have metadata for the interaction. Sorry if I use the term metadata incorrectly here.
What I mean is information like age, occupation, nationality for the user and information like genre, budget, oscars won for movies.
The idea of concatenating these features with the embedding vector sounds interesting. But I cannot wrap my head around how exactly this could be done. Thanks for the response!

avinregmi · December 12, 2018, 7:26pm

Hey guys, how do I use ULMFit for text similarity?

KarlH · December 14, 2018, 9:20pm

Is there a straightforward way to create a TextDataBunch for a paired text corpus, where the model inputs are two different text strings?

seb0 · December 18, 2018, 11:46am

yes, see 3rd tutorial on custom items and itemlists

seb0 · December 18, 2018, 12:01pm

Okay, so what you need to do (imho) is have a look at the code of the collab data bunch and see how items are encoded. If each item is only encoded as a number, you need to create custom items and item lists (see 3rd tutorial) that accept vectors. These vectors describe your item in vector space. (If I understood things correctly, as a by-product you sort of get latent factors that you could use to compute item similarity, if you’re at all interested in that.) So then it wouldn’t be "User 1 likes item number 4", but rather "User(x1, x2, x3) likes item(z1, z2, z3)", where x and z are features of users / items. Is this clear at all?

Andreas_Daiminger · December 18, 2018, 2:13pm

Thanks for following along my thought process and pointing out some interesting ideas.
But I cannot see how something like this could work in a neural net with only one layer. All we do is multiply 2 vectors of latent factors (embeddings) to get a single number (the rating) as a result. The values of those 2 vectors are what the NN learns.

So is your idea to concatenate the embeddings with another vector containing weights for the metadata features?

seb0 · December 18, 2018, 2:57pm

Yes, exactly.
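
The idea agreed on above can be sketched as a forward pass: look up the learned user and movie embeddings, append each side's metadata features, and score the pair. This is a minimal sketch in plain Python, not fastai's collab model. It assumes one small hidden layer on top of the concatenation, which goes beyond the single dot product discussed in the thread. The names `dot` and `predict_rating`, the feature values, and the weights are all made up for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def predict_rating(user_emb, movie_emb, user_meta, movie_meta, W, b, w_out, b_out):
    """Score one (user, movie) pair from embeddings plus metadata features."""
    # Concatenate learned latent factors with the fixed metadata features.
    x = user_emb + movie_emb + user_meta + movie_meta
    # One hidden layer with ReLU lets the model mix embeddings and metadata.
    hidden = [max(0.0, dot(row, x) + bias) for row, bias in zip(W, b)]
    return dot(w_out, hidden) + b_out

# Toy values; in a real model the embeddings and weights would be learned.
user_emb, movie_emb = [0.5, -1.0], [1.0, 0.5]
user_meta = [0.3]        # e.g. scaled age (hypothetical feature)
movie_meta = [1.0, 0.0]  # e.g. one-hot genre (hypothetical feature)
W = [[1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0]]
b = [0.0, 0.0]
w_out, b_out = [2.0, 1.0], 0.5

rating = predict_rating(user_emb, movie_emb, user_meta, movie_meta, W, b, w_out, b_out)
print(rating)
```

With a hidden layer the two sides no longer need the same length, which a plain dot product would require, so users and movies can carry different numbers of metadata features.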