TabularDataBunch from_df parsing goes wrong

luis_luett · April 20, 2020, 8:45pm

Hi everyone,

I've had some pleasant experiences with fast.ai so far, but now I've run into my first major issue. I'm using the following piece of code to read a DataFrame into a DataBunch:

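import numpy as np
import pandas as pd
from fastai.tabular import *  # fastai v1: TabularDataBunch, FillMissing, Categorify, Normalize

def get_valid_idx(data: pd.DataFrame, valid_percentage: float = 0.2):
    # Use the last `valid_percentage` of the rows as the validation set.
    split_idx = int(np.floor(data.shape[0] * (1 - valid_percentage)))
    return range(split_idx, data.shape[0])

dep_var = 'Transaction Label'
cat_names = ['Buchungstext', 'Auftraggeber / Beguenstigter', 'Verwendungszweck',
             'Kontonummer', 'BLZ', 'Glaeubiger- ID', 'Mandatsreferenz',
             'Kundenreferenz', 'Balance']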

cont_names = ['Betrag (EUR)']
procs = [FillMissing, Categorify, Normalize]

path = './tmp'
df = ba.get_data()
df_dropped = df.copy()
del df_dropped['Buchungstag']
del df_dropped['Wertstellung']

valid_idx = get_valid_idx(df)

data = TabularDataBunch.from_df(path, df_dropped, dep_var, valid_idx=valid_idx,
                                procs=procs, cat_names=cat_names)

However, after parsing, many of the entries are replaced with #na#. This happens especially in ‘Auftraggeber / Beguenstigter’ (German for ‘client / beneficiary’). The values in this column consist of multiple words. Is it a problem for a categorical column to contain entries with more than one word, or with special characters like ‘+’, ‘/’, ‘&’?
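To narrow this down, a check like the following might help. It is only a sketch, and it rests on an assumption that has not been verified here: that Categorify builds its categories from the training rows only, so a value that first appears in the validation rows would be shown as #na#. The helper name `summarize_na_sources` and the toy data are made up for illustration. For each categorical column it counts the missing values and the validation rows whose value never occurs in the training rows.

```python
import numpy as np
import pandas as pd


def summarize_na_sources(df, cat_names, valid_idx):
    """Per categorical column: count missing values and validation-only values."""
    # valid_idx holds positional row indices, as returned by get_valid_idx.
    valid_mask = np.zeros(len(df), dtype=bool)
    valid_mask[list(valid_idx)] = True

    rows = []
    for col in cat_names:
        train_values = set(df.loc[~valid_mask, col].dropna())
        valid_values = df.loc[valid_mask, col].dropna()
        rows.append({
            "column": col,
            # Values that are already NaN before any processing.
            "missing": int(df[col].isna().sum()),
            # Validation rows whose value never appears in the training rows.
            "valid_rows_unseen_in_train": int((~valid_values.isin(train_values)).sum()),
        })
    return pd.DataFrame(rows).set_index("column")


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Auftraggeber / Beguenstigter": ["Shop A", "Shop B", None, "Shop A", "Shop C & Co"],
    "Buchungstext": ["Lastschrift", "Gutschrift", "Lastschrift", "Lastschrift", "Gutschrift"],
})
report = summarize_na_sources(toy, list(toy.columns), range(3, 5))
print(report)
```

On the real data the call would be `summarize_na_sources(df_dropped, cat_names, valid_idx)`. If the second count is high for a column with many distinct values, the #na# entries would more likely come from the train/validation split than from spaces or special characters in the strings.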

I'm using the fastai notebook on paperspace.com, by the way. If you need more information to help me, I'm happy to provide it.

Thanks to all of you in advance!