Having a bear of a time with the Tabular course vs. the docs

Hi,
consider me the newb of the day…
For the sake of not being told to follow along on Jupyter etc., I did follow along on Google Colaboratory :slight_smile:
And for a bit more background, I have been writing applications professionally since the 1980s, when dinosaurs roamed the earth and Kevin Mitnick was a criminal…

I built my own system that I use for all manner of development and play (others call it masochism).
It's an i7 system: Asus board, water cooled, 32 GB of Ripjaws RAM, and a Titan X card…
(I'm currently out of work after 15 years at my last job, and looking, and this is how I use my time between searches… ultimately I want to do Kaggle, among other things.)

I am fine following the courses on Colaboratory,
but now I want to start writing my own nets, and I have spent two weeks prepping data with the idea that after I run things several ways in tabular I will try other ideas.
The data is huge… however, I am starting with a subset of 90,000 rows…

My Jupyter setup seems to be running right, but I prefer to run natively in Spyder…
Both say my GPU is available, even telling me it's a GTX Titan X,
and I even enjoyed the code on the course page where you run the 500,500,500 example and see the time difference:
on my machine the CPU did it in milliseconds, while the GPU did it in microseconds.
And when I was young and foolish I thought a 300 baud modem was exciting…
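
For anyone curious, here is a minimal sketch of the kind of timing comparison I mean; the tensor size and timing calls are my own reconstruction, not the course's exact code:

import time
import torch

# My guess at the course's timing test: build two 500x500x500 tensors and
# time an element-wise multiply on the CPU, then on the GPU.
a = torch.rand(500, 500, 500)
b = torch.rand(500, 500, 500)

start = time.time()
a * b
print(f'CPU: {time.time() - start:.6f}s')

if torch.cuda.is_available():
    a, b = a.cuda(), b.cuda()
    torch.cuda.synchronize()   # make sure the copies have finished
    start = time.time()
    a * b
    torch.cuda.synchronize()   # wait for the kernel before reading the clock
    print(f'GPU: {time.time() - start:.6f}s')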

Love the courses; they have been the most productive route so far toward getting working code up!!!
A stupendous feat, if you have spent as much time at this as I have…

Now I am trying to do my first tabular data run and having problems (which I hope someone here will find simple) -
I am trying to run it natively; I also tried it in Jupyter, since the native run is adding more issues than removing them…

One issue is that the tutorial code from git does not match the code at docs.fast.ai/tabular.html… whee!!! Such fun…

1st problem:
path = untar_data(URLs.ADULT_SAMPLE)

Instead I am using the line below, with the file in the local directory I am running from:
df = pd.read_csv('filename.CSV')

This wouldn't be much of a problem, except that TabularList.from_df throws an error without path,
and I even tried putting one in with path = 'text'.

2nd problem:
The tutorial gives an example of cat and cont names… and a list of procs…
[the video gives >50, the tutorial gives salary when the column is the target, and the docs HTML gets it right, using salary with a column actually named salary]
The docs page only gives cat names…
and my project has no cat names, only cont names… fun fun fun…
(the idea of all-numeric tabular data seems to have escaped everyone's mind)
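
In case it helps, this is what I am passing for an all-numeric table; my assumption is that an explicitly empty cat_names list is legal (the column names are placeholders, and path, df, dep_var, valid_idx, and procs are as in the full listing further down):

# Assumption: an all-continuous table works with an empty cat_names list.
cat_names = []
cont_names = ['H1', 'M1', 'OP1', 'HP1']   # placeholder names; my real list is longer

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx,
                                procs=procs,
                                cat_names=cat_names,   # explicitly empty
                                cont_names=cont_names)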

3rd problem:
The tutorial page, from git and online, creates a test df using iloc:
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cat_names=cat_names, cont_names=cont_names)
The docs page doesn't make a test set, and instead uses valid_idx = range(len(df)-2000, len(df));
it also leaves out cont names…
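
As far as I can tell, the two examples are doing different jobs, so here is my attempt to write both down side by side (the row ranges are theirs, not mine):

# Tutorial style: hold rows 800-999 out as an unlabeled test set.
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path,
                           cat_names=cat_names, cont_names=cont_names)

# Docs style: mark the last 2000 rows as the labeled validation set.
valid_idx = range(len(df) - 2000, len(df))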

So far that is a lot of different combinations, which I was already trying before I got here…
that is, I have been trying to make this work with each of these varying examples!!!

And it does run without error up to this point…
but given the differences, who knows if it's working right? Not I.

4th problem…
The tutorial says data.show_batch(rows=10), which so far seems to work for me…
but the docs page says…
(cat_x,cont_x),y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

which throws a BrokenPipeError (errno 32) that I have not been able to resolve…
Do note that cat_x appears nowhere on the page other than there, while elsewhere they use cat_names,
so I have no idea what that code is supposed to refer to… does anyone? (My best guess is that cat_x and cont_x are the categorical and continuous halves of a training batch, but the page never says so.)

Here is the error:

File "C:\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)

BrokenPipeError: [Errno 32] Broken pipe

So now I am quite stuck!
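
Since the traceback goes through multiprocessing, my working theory (an assumption on my part) is that this is the usual Windows issue with DataLoader worker processes. Two things I plan to try, sketched below; I am assuming from_df forwards num_workers through to the underlying DataLoader:

# Guess 1: force single-process data loading so Windows never forks workers
# (assuming from_df passes num_workers through to the DataLoader).
data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx,
                                procs=procs, cont_names=cont_names,
                                num_workers=0)

# Guess 2: alternatively, keep the workers but guard the entry point, which
# Windows multiprocessing requires when a script that spawns processes is
# run directly (e.g. from Spyder):
#
# if __name__ == '__main__':
#     (cat_x, cont_x), y = next(iter(data.train_dl))
#     for o in (cat_x, cont_x, y):
#         print(to_np(o[:5]))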

But at least I know my CUDA card works…
print(torch.cuda.is_available())
prints True in both Spyder and Jupyter…

I was so hoping to see my first net running after two and a half weeks of full-time effort…
:frowning:

Here is all my code… (I shortened the cont_names list for space)

import torch
import pandas as pd
from fastai.tabular import *

path = r'M:\tabular'   # raw string, so \t is not read as a tab escape

df = pd.read_csv('filename.CSV')   # dtype=np.float64

# hold the last 8000 rows out as the validation set
valid_idx = range(len(df)-8000, len(df))
print(valid_idx)

dep_var = 'Target'
cat_names = []   # all my columns are continuous
cont_names = ['H1', 'M1', 'OP1', 'HP1']   # list shortened for space
procs = [FillMissing, Categorify, Normalize]

# test list per the git tutorial (not used further down yet)
test = TabularList.from_df(df.iloc[800:1000].copy(), path=path, cont_names=cont_names)

data = TabularDataBunch.from_df(path, df, dep_var, valid_idx=valid_idx, procs=procs, cont_names=cont_names)

# inspect one training batch, per the docs page - this is where the pipe breaks
(cat_x, cont_x), y = next(iter(data.train_dl))
for o in (cat_x, cont_x, y): print(to_np(o[:5]))

Forgot to add that the course code ran fine on my Jupyter page…
It downloaded the data, and I assume it ran on my GPU…

I think I fixed the path problem:

from pathlib import Path
path = Path("a/")

The above seems to work (putting the target file in a folder under the running directory).
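
Spelled out, the loading part that now works for me looks roughly like this (folder and file names are placeholders):

from pathlib import Path
import pandas as pd

path = Path('a/')                       # a folder under the running directory
df = pd.read_csv(path/'filename.CSV')   # the data file moved into that folder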

This let me replace the file input, and I was able to start training under Jupyter.
But how do I get this working as a native solution?

Not to mention I have a lot more work to do, given an accuracy of zero…

Given the model's lack of success, I am going to ground and will simplify it,
while in the background working to get it running without Jupyter…

By going to ground I mean: simplify the model data, as I think it's getting confused by too many similar points;
change the way it encodes the dates (let procs handle it);
and change the target to be less specific (harder) and more general (easier).

The above seemed to help…
Now at least it gives an accuracy of .37,
but it doesn't learn…
It's going to take a while to regenerate the large pool of data for it to learn from…
One thing that made things easier: the final percentage (the target) was turned into a limited set of values representing groups of answers, with each group's width growing larger the rarer the answer was in the data - something like the sketch below.
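
A toy stand-in for that bucketing, using pandas quantile bins (my real group widths were hand-built, not qcut, so this is just the shape of the idea):

import pandas as pd

# Quantile bins put roughly the same number of rows in each group, so rarer
# extreme values automatically land in wider groups.
raw = pd.Series([0.01, 0.02, 0.02, 0.03, 0.05, 0.10, 0.40, 0.95])
groups = pd.qcut(raw, q=4, labels=False, duplicates='drop')
print(groups.tolist())   # integer group index per row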

After a much larger pool is created (overnight, as it takes many hours to process the volume of data), I will try balancing the target output.

And yet… the native version still gives the pipe error… hmmm…

Onwards!!!

The more I improve the dataset, the better it seems to get…
but I also have to understand what I am seeing…
[480,000 rows processed]
Anyone?