['AfterStateHoliday']: Input contains NaN, infinity or a value too large for dtype('float32')

While reviewing lessons 4 and 5, I’m testing out the lesson3-rossman notebook, with no modifications, and hit this error when running the following code:

df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True)
yl = np.log(y)

Is there a change to the fastai/structured.py library that makes this not work any more? I just ran git pull.

3 Likes

Somewhere during the processing in the notebook, the values in the dataframe df get replaced with the same very large negative integer. They should be either zero or one. Cells 40-44 have some unexpected behaviour.
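
For anyone who wants to confirm which columns are affected, here is a quick diagnostic sketch (assuming df and the two ...StateHoliday columns from later in the thread):

import numpy as np

# The bad entries are int64's minimum value, which is what NaT becomes
# when it is cast to an integer.
bad_val = np.iinfo(np.int64).min  # -9223372036854775808
for c in ['AfterStateHoliday', 'BeforeStateHoliday']:
    print(c, (df[c] == bad_val).sum(), 'bad rows')
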

I added the following

for o in ['Before', 'After']:
    for p in columns:
        a = o + p
        df.loc[df[a] < 0, a] = 0

It seems to have set unknowns to the minimum value instead of NaN. Since this field should have no negative values, I figured we could just set those to 0.

1 Like

Thanks! I’ll test this out.

I also saw this error.

Before the failing line (proc_df), I inserted a cell and ran the following code to fix it:

joined_samp['AfterStateHoliday']=joined_samp['AfterStateHoliday'].fillna(0)
joined_samp['BeforeStateHoliday']=joined_samp['BeforeStateHoliday'].fillna(0)

10 Likes

The issue with this piece of code is that the “Before” values are negative. I have built upon your idea with the following:

for o in ['After']:
    for p in columns:
        a = o + p
        df_train.loc[df_train[a] < 0, a] = 0
        df_test.loc[df_test[a] < 0, a] = 0

for o in ['Before']:
    for p in columns:
        a = o + p
        df_train.loc[df_train[a] < -500, a] = -df_train['After' + p].max()
        df_test.loc[df_test[a] < -500, a] = -df_train['After' + p].max()

This seems to work well. It basically removes all negative values from the 'After' fields and clips the 'Before' values to the maximum distance to a future event. I am surprised more people are not commenting about issues with this; the notebook in its current state does not work for me.

4 Likes

Hi Dooley,
Thank you for this response. I converted the large negative values using df.replace(-9223372036854775808, 0).
Did you see a drop in performance when you submitted to Kaggle? In my approach, I did worse than taking the median for day of week.

Looking further into the issue, I think the code is more broken…

cat_sz = [(c, len(joined_samp[c].cat.categories)+1) for c in cat_vars]

 [('Store', 1116),
 ('DayOfWeek', 8),
 ('Year', 4),
 ('Month', 13),
 ('Day', 32),
 ('StateHoliday', 3),
 ('CompetitionMonthsOpen', 26),
 ('Promo2Weeks', 27),
 ('StoreType', 5),
 ('Assortment', 4),
 ('PromoInterval', 4),
 ('CompetitionOpenSinceYear', 24),
 ('Promo2SinceYear', 9),
 ('State', 13),
 ('Week', 53),
 ('Events', 22),
 ('Promo_fw', 1),
 ('Promo_bw', 1),
 ('StateHoliday_fw', 1),
 ('StateHoliday_bw', 1),
 ('SchoolHoliday_fw', 1),
 ('SchoolHoliday_bw', 1)]

Thus, I think get_elapsed is just broken, particularly the line:

res.append(((d-last_date).astype('timedelta64[D]') / day1).astype(int))

Thanks!

Same here. For the DL solution my validation set suggests 0.098765, although the public score is 0.11238 and the private score 0.12887.
If I use a random forest with the same features, the validation score is 0.11557, the public score 0.12166 and the private score 0.13661.
So the DL solution, even though it is over-fitting (as we can observe from the training results), is still better. Any comments on how to achieve a better DL result would help me, thanks!

Sorry, I wasn't clear: my score was over 0.24, which is the benchmark if you predict using the average for each day of the week. Also, 'overfitting' usually refers to the pathological case where your test set error starts to diverge from your training set error. I think Jeremy explained it wrong; here's a better description: https://elitedatascience.com/overfitting-in-machine-learning

I think the huge numbers come from the get_elapsed function, which casts NaN to integer; NaN only exists for floats, so NumPy probably assigns the largest negative number to it.
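
A minimal sketch of that behaviour (the date value is made up, just to show the cast):

import numpy as np

day1 = np.timedelta64(1, 'D')
last_date = np.datetime64()          # NaT, as in get_elapsed before a date has been seen
d = np.datetime64('2015-07-31')      # any valid date from df.Date.values

delta = (d - last_date).astype('timedelta64[D]')
print(delta)                         # NaT
print(delta / day1)                  # nan
print((delta / day1).astype(int))    # -9223372036854775808 on most platforms (int64 minimum)
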

In addition, make sure you watch the lecture and don't just run the notebook top to bottom; you have to run some of the cells two times, once with each of:
df = train[columns]
df = test[columns]

See Jeremy’s ML course.

2 Likes

Looking at the get_elapsed function: where s != last_store and last_date has not been set yet, last_date is set to NumPy NaT; then any operation on this value, i.e. calculating np_delta, blows it out to a huge number.

In the code below I check all dates first, then use them if available; if not, I set delta_day to an invalid number. I haven't completed the full notebook/submitted yet, but this may help:

def get_elapsed(fld, pre, df):
    '''adds a column pre+fld to the dataframe,
    where fld is for example SchoolHoliday'''
    day1 = np.timedelta64(1, 'D')
    last_date = np.datetime64()

    last_store = 0
    res = []
    special_dates = {}
    #forward pass to get starting dates
    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        if v:
            last_date = d
            special_dates[s] = d

    for s, v, d in zip(df.Store.values, df[fld].values, df.Date.values):
        #d is type datetime64 eg 2015-07-31T00:00:00.00000000
        if s != last_store:
            if s in special_dates:
                last_date = special_dates[s]
            else:
                last_date = np.datetime64()
            last_store = s
        if v: last_date = d
        if np.isnat(last_date):
            #np_delta means nothing, as we don't have a date to reference against
            delta_day = -9999
        else:
            np_delta = (d-last_date).astype('timedelta64[D]')
            #by dividing by day1 we should get values between -n*10^2 and n*10^2
            delta_day = ( np_delta/ day1).astype(int)
        res.append(delta_day)
    invalids = res.count(-9999)
    valids = len(res) - invalids
    print(f'invalid date deltas {invalids}, valid date deltas {valids}')
    df[pre+fld] = res
    return df
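
If it helps, this is how I would expect to call the modified version, following the original notebook's Durations cells (untested sketch; columns is the same list the notebook defines):

columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

# ascending date order for the 'After' columns
df = train[columns].sort_values(['Store', 'Date'])
for fld in ['SchoolHoliday', 'StateHoliday', 'Promo']:
    df = get_elapsed(fld, 'After', df)

# descending date order for the 'Before' columns
df = df.sort_values(['Store', 'Date'], ascending=[True, False])
for fld in ['SchoolHoliday', 'StateHoliday', 'Promo']:
    df = get_elapsed(fld, 'Before', df)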

The problem with bad data occurs only with StateHoliday.

For StateHoliday you can add the following just after the Durations line, before calling get_elapsed:

## Durations

logger.debug('states and holidays:')
states_df = get_store_states(joined)
logger.debug(states_df)
state_holidays = get_state_holiday_dates(joined)
logger.debug(state_holidays)

def get_state_holiday_dates(train):
    state_columns = ["Date", "StateHoliday", "State"]
    state_df = train[state_columns]
    state_df = state_df.sort_values(['State', 'Date'])
    #drop non state holidays
    state_df.drop(state_df[state_df['StateHoliday'] == False].index, inplace=True)
    state_df.drop_duplicates(inplace=True)
    return state_df

def get_store_states(train):
    store_states = ["Store", "State"]
    store_states_df = train[store_states]
    store_states_df.drop_duplicates(inplace=True)
    return store_states_df

Then generate a lookup for each store instead of the

#forward pass to get starting dates

code block in get_elapsed (which I haven't done yet as my head hurts and I'm done for the day).

So then you would have all the State Holidays for each store and you could just look them up.
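
One possible shape for that lookup (a rough, untested sketch built from the states_df / state_holidays frames above; the variable names are mine):

# Merge each store with its state's holiday dates, then group so each
# store id maps to a sorted list of its state-holiday dates.
store_holidays = states_df.merge(state_holidays, on='State', how='left')
store_holiday_lookup = (store_holidays
                        .groupby('Store')['Date']
                        .apply(lambda dates: sorted(dates.dropna()))
                        .to_dict())
# store_holiday_lookup[store_id] -> list of that store's state-holiday dates
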

1 Like

Hi folks,
It looks like the notebook has been updated as well to address the get_elapsed function.

I’m still seeing this error after pulling the latest update, but my data is showing that all these columns contain NaN rather than negative numbers. It seems to occur in the merge step of join_df.

Any idea how to resolve this?

We are using two types of columns.

AfterStateHoliday is a continuous variable. To treat the NaN values of a continuous column we can use fillna:

for v in contin_vars:
    joined[v] = joined[v].fillna(0).astype('float32')
    joined_test[v] = joined_test[v].fillna(0).astype('float32')

For categorical columns, set their dtype to category:

for v in cat_vars: joined[v] = joined[v].astype('category').cat.as_ordered()

Then apply the same categories to the test set:

apply_cats(joined_test, joined)

Then calling proc_df should fix_missing the remaining NaN values in these columns:

df, y, nas, mapper = proc_df(joined, 'Sales', do_scale=True)
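
For the test set, I believe the same nas and mapper need to be passed back in so the missing-value fills and scaling match the training data, along the lines of (assuming the old proc_df signature with na_dict and mapper arguments):

df_test, _, nas, mapper = proc_df(joined_test, 'Sales', do_scale=True,
                                  mapper=mapper, na_dict=nas)
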
1 Like

I have applied this fix to the get_elapsed problem and it can be seen to work, as it reports how many valid/invalid entries there are in the field. Whether the values in the fields are correct is not known. On inspection, After/BeforeStateHoliday seem to take the same values and the same sign, where you would expect all values to be positive and the before and after values to differ in every case.

Working further through the notebook, we seem to have dropped the After/Before columns; this becomes apparent at the join after the definition of the cat and contin vars, where After/BeforeStateHoliday is missing.

Unfortunately this fix does not seem to work for me. More later.

This is very helpful! Thank you!

I have been debugging this for a while…

Yes, you need to re-write the get_elapsed function, so it works! I know that’s not much help, but maybe if I have time this weekend that’s what I’ll do!

Hi Adam,
If you could find the time I would really appreciate it. I am in way over my head with the Rossmann data pre-processing but would like to go through the script.

1 Like

The changes I made to fix this:

1 Did not execute the cell where StateHoliday is set to boolean; the values here are ['0', 'a', 'b', 'c'].

2 Changed the latest version of get_elapsed as follows, as the returned value is a type other than float32. These changes are just an attempt to force the issue and may not address the real problem, but it worked and enabled me to submit an entry to the competition. Not the best one submitted.

    def get_elapsed(fld, pre):
        day1 = np.timedelta64(1, 'D')
        last_date = np.datetime64()
        last_store = 0
        res = []
    
        for s,v,d in zip(df.Store.values,df[fld].values, df.Date.values):
            if s != last_store:
                last_date = np.datetime64()
                last_store = s
            if v or v in ['a', 'b', 'c']: last_date = d  # this is because I did not do No 1 above
            
            timd = np.timedelta64(d-last_date,'D')
            timd = timd/day1
            timd = timd.astype(np.int32) # I think this part makes the type saved
            res.append(timd)
        df[pre+fld] = res

You can use the break statement and a counter to stop after 10 passes through the for loop and print out the type of timd, which was int32; however, the dtype in the dataframe printed out as int64, which I guess is acceptable.

3 Where proc_df is run on joined_test, create a cell before it and enter:
joined_test['AfterStateHoliday'] = joined_test['AfterStateHoliday'].fillna(0)
joined_test['BeforeStateHoliday'] = joined_test['BeforeStateHoliday'].fillna(0)

4 There were some other issues with the variable columns; it did not seem to like Date and Store, so I removed them to run proc_df.

Not sure why I had to make these changes or what the theory is behind getting this modelled; I guess I have to watch the lesson 4 video.
Hope this helps someone. It's taken the best part of dawn to dusk to get an upload to Kaggle. Now off to watch the European Champions League on the box.

CAUTION: the git-pulled version failed in the Test section after the mods above.
It failed because running_mean should contain 20 elements, not 18.
Something to do with BatchNorm.

I worked on a previously saved version of the Rossmann nb and then applied the code above to the latest git-pulled release. The saved version executed with no problem, giving about position 2000 in the competition, but it was too late to find the new bug in the git-pulled version today.

Oh, my score was 0.14236 public and 0.14119 private. The No 1 spot is 0.08932 and 0.10021. This score was with the sample.

REMOVE CAUTION: I ran the git-pulled version with get_elapsed modified as above today and it worked fine. I was able to submit another attempt, this time running on the full training set, but got a worse score of 0.16557.
You're welcome.

Make sure you did not run the notebook top to bottom directly. Please see my reply earlier in the thread.

You need to do the preprocessing for df = train[columns] first, then do df = test[columns].
If you are getting all NaN in the column, it is because you are joining the train set with the test set. Check the Date column to confirm this.

2 Likes