Lesson 7 - Nietzsche train / val?


(Ralph Brooks) #1

Hi everyone,

I had a quick question. I am looking at the video for Part 1 of the fast.ai course. In short, I am trying to understand the setup for the trn and the val directories associated with training the Nietzsche data set with the “Stateful” model of https://github.com/fastai/fastai/blob/master/courses/dl1/lesson6-rnn.ipynb (which actually seems to be covered in “Lesson 7”).

The code that I see is

from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH='data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

and I see:

models/ nietzsche.txt trn/ val

Unfortunately, when I go into the train directory, I don’t see anything.

%ls {PATH}trn

[ No result ]

Just curious what everyone else is doing. Is everyone just splitting the nietzsche data set by hand into two parts and placing that in trn and val?


(Malcolm McLean) #2

Hi Ralph,

In case you have not already figured this out, splitting Nietzsche.txt manually is mentioned in the Lesson 7 video. See the video timeline for the exact spot. HTH.


(Ralph Brooks) #3

Pomo,

Thanks for the advice. I saw the same about a couple of hours after I made the post. I was unable to delete the post or close out the issue.

Thanks again for the follow up.

Ralph


(Vlad) #4

I was confused too. I normally don’t have time to complete the class in one day, and when I came back to the video on a different day I had forgotten what happened earlier.
In fact, Jeremy explains that he created the train and validation parts by hand.
But since I was too lazy to copy-paste the text and make the folders manually, I wrote a few lines of code to prepare the data:

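import os

# create the trn/ and val/ folders if they do not exist yet
os.makedirs(TRN, exist_ok=True)
os.makedirs(VAL, exist_ok=True)

train_perc = .8  # fraction of lines that go to the training part

# PATH already ends with '/', so no extra separator is needed
with open(f'{PATH}nietzsche.txt', 'r') as fp:
    lines = fp.readlines()
text_len = len(lines)

# the first 80% of the lines go to trn/, the rest to val/
with open(f'{TRN}nietzsche1.txt', 'w') as part_train, \
     open(f'{VAL}nietzsche2.txt', 'w') as part_val:
    for ix, l in enumerate(lines):
        if ix / text_len < train_perc:
            part_train.write(l)
        else:
            part_val.write(l)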

You need to run these lines once TRN and VAL are defined, i.e. after:

from torchtext import vocab, data

from fastai.nlp import *
from fastai.lm_rnn import *

PATH = 'data/nietzsche/'

TRN_PATH = 'trn/'
VAL_PATH = 'val/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'
%ls {PATH}
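
In case it helps anyone checking the result: below is a minimal sketch for confirming that the split landed roughly where expected. It only uses the standard library. The helper names count_lines and train_fraction are my own (hypothetical, not part of fastai), and it assumes the trn/ and val/ folders contain plain text files as produced by the split above.

```python
import os

def count_lines(directory):
    """Return the total number of lines across all regular files in `directory`."""
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, 'r', encoding='utf-8') as fp:
                total += sum(1 for _ in fp)
    return total

def train_fraction(trn_dir, val_dir):
    """Return the share of all lines that ended up in the training folder."""
    n_trn = count_lines(trn_dir)
    n_val = count_lines(val_dir)
    total = n_trn + n_val
    return n_trn / total if total else 0.0

if __name__ == '__main__':
    # Paths assumed to match the notebook setup (PATH, TRN, VAL).
    TRN = 'data/nietzsche/trn/'
    VAL = 'data/nietzsche/val/'
    if os.path.isdir(TRN) and os.path.isdir(VAL):
        print(f'train fraction: {train_fraction(TRN, VAL):.3f}')
    else:
        print('trn/ or val/ not found; run the split first')
```

With train_perc = .8 the printed fraction should come out close to 0.8. If it prints 0.0, the folders are probably still empty.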