Lesson 6: `one_batch()` returns 3 tensors on custom data set

nlearner · February 13, 2024, 4:55am

I’m trying to apply the Rossmann approach to my own problem: predicting the total profit of a business on any given day.

I started with a very simple dataframe, but whenever I check the first batch using dls.one_batch() I get 3 tensors instead of 2. The x tensor holds my dataframe’s columns, the y tensor is empty, and the z tensor is the total profit on the given day.

My initial dataframe has just 2 columns: “date” and “total” (total profit for the day).

I call add_datepart and then include all the newly added columns in the list of categorical variables. I end up with no continuous variables, since the only continuous column is total, which is the dependent variable. When I call one_batch() I get the following:

# This looks right
x tensor([[ 4,  4,  2,  1, 25, 18,  1,  1,  1,  1,  1,  1],
        [ 4,  7,  1,  2,  7, 27,  1,  1,  1,  1,  1,  1],
        [ 4,  8,  1,  2, 14, 32,  1,  1,  1,  1,  1,  1],
        [ 5,  9,  1,  2, 22, 38,  1,  1,  1,  1,  1,  1],
        [ 3,  7,  1,  2,  6, 26,  1,  1,  1,  1,  1,  1],
....

# I can't figure out why this is happening
y tensor([], size=(20, 0))

# This should be the y tensor
z tensor([[22437.4707],
        [22872.96],
        [51414.55],
        [ 6833.65],
        [898375],
        [343409.35],

As far as I know I should only get 2 tensors: x for the input and y for the output. I can’t figure out why I’m getting 3, and I think it is causing my weird training results (my predictions are negative and the loss doesn’t decrease; in fact it almost overflows).

My code looks like this…

...
add_datepart(df, "date")
df.drop(["Elapsed"], axis=1, inplace=True)  # drop Elapsed to simplify things, we already have a date column
cat_names = [
    "Dayofweek",
    "Week",
    "Year",
    "Month",
    "Day",
    "Dayofyear",
    "Is_month_end",
    "Is_month_start",
    "Is_quarter_end",
    "Is_quarter_start",
    "Is_year_end",
    "Is_year_start",
]
cont_names = [
    col for col in df.columns if col not in ["date", "total"] + cat_names
]  # cont_names is effectively empty

# At this point cat_names has 12 columns; df has 13 columns including "total"

valid = list(range(len(train), len(df)))  # train is a subset of the original df


dls = TabularDataLoaders.from_df(
    df,
    path=".",
    procs=[Categorify, FillMissing],
    cat_names=cat_names,
    cont_names=cont_names,
    y_names="total",
    bs=20,
    valid_idx=valid,
)

x, y, z = dls.one_batch()
print("x", x)
print("y", y)
print("z", z)

I’d appreciate any help, thanks!

vbakshi · February 13, 2024, 5:50am

Can you show the code you used to create your DataLoaders?

Following the approach in chapter 9 (I was just reviewing this chapter so that’s why I’m referencing it) I would assume it’s something like the following:

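from fastai.tabular.all import *  # also provides np

dep_var = "total"  # assumed: the target column from the original post
procs = [Categorify, FillMissing]

cond = ...  # placeholder: boolean mask which splits early dates from later dates
train_idx = np.where( cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)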

to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
dls = to.dataloaders(1024)

(or using TabularDataLoaders.from_df)
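
To make the `cond` placeholder above concrete, here is a minimal sketch of a date-based split. The toy dataframe and the `cutoff` date are made up for illustration, and it assumes the "date" column is still present when the mask is computed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real "date"/"total" dataframe
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "total": np.arange(10, dtype=float),
})

cutoff = pd.Timestamp("2023-01-08")      # hypothetical cutoff date
cond = (df["date"] < cutoff).to_numpy()  # True for the earlier (training) rows

train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))

print(len(splits[0]), len(splits[1]))
```

The resulting `splits` tuple is what would be passed to TabularPandas, so the validation rows are always the most recent ones.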

Also, have you tried creating a DataLoaders object without applying add_datepart? And if so, were you able to get a sensible result with x and y instead of x, y, and z when calling dls.one_batch()?

Lastly, I’m assuming you can’t share the notebook you’re working with, but can you give a few rows of example fake data that populates your DataFrame?