I’m trying to apply the Rossmann approach to my own problem: predicting the total profit of a business on any given day.
I started with a very simple dataframe, but whenever I check the first batch using dls.one_batch() I get 3 tensors instead of 2. The x tensor holds my dataframe's columns, the y tensor is empty, and the z tensor is the total profit on the given day.
My initial dataframe has just 2 columns: “date” and “total” (total profit for the day).
I call add_datepart and then include all the newly added columns in the list of categorical variables. I end up with no continuous variables, since the only continuous column is total, which is the dependent variable. When I call one_batch() I get the following:
# This looks right
x tensor([[ 4, 4, 2, 1, 25, 18, 1, 1, 1, 1, 1, 1],
[ 4, 7, 1, 2, 7, 27, 1, 1, 1, 1, 1, 1],
[ 4, 8, 1, 2, 14, 32, 1, 1, 1, 1, 1, 1],
[ 5, 9, 1, 2, 22, 38, 1, 1, 1, 1, 1, 1],
[ 3, 7, 1, 2, 6, 26, 1, 1, 1, 1, 1, 1],
....
# I can't figure out why this is happening
y tensor([], size=(20, 0))
# This should be the y tensor
z tensor([[22437.4707],
[22872.96],
[51414.55],
[ 6833.65],
[898375],
[343409.35],
As far as I know, I should only get 2 tensors: x for the input and y for the output. I can’t figure out why I’m getting three, and I suspect it is behind my weird training results (my predictions are negative and the loss doesn’t decrease; in fact, it almost overflows).
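One possible explanation, stated as an assumption rather than a confirmed diagnosis: if the tabular loader yields each batch as (categorical, continuous, target) instead of (input, output), then an empty cont_names would leave the middle element with zero columns. The sketch below mimics that grouping in plain Python. `split_row` is a hypothetical helper, not a fastai function, and the two rows are illustrative values, not real data.

```python
# Hypothetical helper, NOT part of fastai: splits one row into the three
# groups a tabular batch is assumed to carry (categorical, continuous, target).
def split_row(row, cat_names, cont_names, y_name):
    cats = [row[name] for name in cat_names]
    conts = [row[name] for name in cont_names]  # empty when cont_names is empty
    target = [row[y_name]]
    return cats, conts, target

cat_names = ["Dayofweek", "Week", "Year"]  # shortened for illustration
cont_names = []                            # same situation as in the post
rows = [                                   # illustrative values only
    {"Dayofweek": 4, "Week": 4, "Year": 2, "total": 22437.47},
    {"Dayofweek": 4, "Week": 7, "Year": 1, "total": 22872.96},
]

batch = [split_row(r, cat_names, cont_names, "total") for r in rows]
cats, conts, targets = (list(group) for group in zip(*batch))

print("cats   ", len(cats), "rows x", len(cats[0]), "columns")
print("conts  ", len(conts), "rows x", len(conts[0]), "columns")
print("targets", targets)
```

Under that assumption, the second element has one entry per row but no columns, which would match a tensor of size (20, 0) for a batch size of 20.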
My code looks like this…
...
add_datepart(df, "date")
df.drop(["Elapsed"], axis=1, inplace=True)  # drop Elapsed to simplify things, we already have a date column
cat_names = [
    "Dayofweek",
    "Week",
    "Year",
    "Month",
    "Day",
    "Dayofyear",
    "Is_month_end",
    "Is_month_start",
    "Is_quarter_end",
    "Is_quarter_start",
    "Is_year_end",
    "Is_year_start",
]
cont_names = [
    col for col in df.columns if col not in ["date", "total"] + cat_names
]  # cont_names is effectively empty
# At this point cat_names has 12 columns; df has 13 columns including "total"
valid = list(range(len(train), len(df)))  # train is a subset of the original df
dls = TabularDataLoaders.from_df(
    df,
    path=".",
    procs=[Categorify, FillMissing],
    cat_names=cat_names,
    cont_names=cont_names,
    y_names="total",
    bs=20,
    valid_idx=valid,
)
x, y, z = dls.one_batch()
print("x", x)
print("y", y)
print("z", z)
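For reference, the twelve date-derived columns listed in cat_names can be approximated with the standard library. This is only a sketch of what such features mean, assuming ISO week numbers and Monday as day 0. `date_parts` is a hypothetical helper and not fastai's add_datepart implementation, and the example date is arbitrary.

```python
import calendar
from datetime import date

def date_parts(d):
    """Illustrative only: hypothetical helper, not fastai's add_datepart."""
    last_day = calendar.monthrange(d.year, d.month)[1]  # days in this month
    is_month_end = d.day == last_day
    is_month_start = d.day == 1
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],          # ISO week number (assumption)
        "Day": d.day,
        "Dayofweek": d.weekday(),            # Monday is 0 (assumption)
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": is_month_end,
        "Is_month_start": is_month_start,
        "Is_quarter_end": is_month_end and d.month in (3, 6, 9, 12),
        "Is_quarter_start": is_month_start and d.month in (1, 4, 7, 10),
        "Is_year_end": d.month == 12 and d.day == 31,
        "Is_year_start": d.month == 1 and d.day == 1,
    }

parts = date_parts(date(2020, 3, 31))  # arbitrary example date
for name, value in parts.items():
    print(name, value)
```

Every one of these takes a small, discrete set of values, so none of them ends up in cont_names once they are all listed as categorical.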
I’d appreciate any help, thanks!