Tabular Learner: normalizing continuous variables with a custom mean and std

I am working with a large tabular dataset that consists of 12 CSV files (one for each month).

The dataset has both categorical and continuous variables; the continuous ones are normalized.

I have two questions about working with a large dataset split across multiple files.

How can I make sure that the embedding layers (for the categorical variables) are aware of every single unique level each categorical feature has? E.g. I have Store ID as a categorical variable, but it may be the case that the last 3 CSV files contain new ID values that have not appeared before. Can I somehow make sure that tabular_learner is aware of that?
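The only idea I have so far (not verified) is to scan the categorical columns of all the files once, collect every level, and cast the columns to a pandas categorical dtype that already lists the full set of levels before the first Categorify runs. As far as I understand, Categorify then keeps all of the dtype's categories even if some levels are missing from the current data frame, and the embedding sizes are fixed when the learner is first created. A rough sketch (all_levels and set_full_categories are names I made up, and I assume cat_vars holds the raw categorical columns of the CSVs):

import pandas as pd

# Collect the full set of levels for each categorical variable across all files
all_levels = {col: set() for col in cat_vars}
for csv_file in data_samples:
    df_cats = pd.read_csv(csv_file, usecols=cat_vars, dtype=str)
    for col in cat_vars:
        all_levels[col] |= set(df_cats[col].dropna().unique())

def set_full_categories(df):
    "Cast categorical columns so every known level is present in the dtype."
    for col in cat_vars:
        df[col] = pd.Categorical(df[col].astype(str),
                                 categories=sorted(all_levels[col]))
    return df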

What I did was randomly take 10 percent of each of the 12 CSV files, merge the samples into one data frame, and create a tabular learner on that. Then I start from this learner object and iterate through every single CSV file.
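In code, the initial step looks roughly like this (sketch only: the sample fraction, seed, layers and metric are placeholders, and set_full_categories is the helper from the sketch above):

import numpy as np
import pandas as pd
from fastai.tabular.all import *

# Build the initial learner on a 10 percent sample of every monthly file, so the
# model (including the embedding sizes) is created once up front.
sample_dfs = [pd.read_csv(f).sample(frac=0.10, random_state=42) for f in data_samples]
df_sample = (pd.concat(sample_dfs, ignore_index=True)
             .pipe(set_dtypes)
             .pipe(add_datepart, 'date', drop=False)
             .pipe(set_full_categories))  # from the sketch above
df_sample["Y"] = np.log(df_sample["my_Y"] + 1)

splits = RandomSplitter(valid_pct=0.15)(range_of(df_sample))
to_sample = TabularPandas(df_sample, procs=[Categorify],
                          cat_names=cat_vars, cont_names=cont_vars,
                          y_names="Y", splits=splits,
                          y_block=RegressionBlock(n_out=1))
dls = to_sample.dataloaders(bs=2048)
learner = tabular_learner(dls, layers=[200, 100], metrics=rmse)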

How do I specify the means and stds to be used when normalizing the continuous variables? I computed the means and stds on a dataset built from 10 percent of every single CSV file (expecting that distribution to be roughly representative) and then, in every iteration, I overwrite the values in learner.dls.means (and stds). This, however, does not work, so I do the normalization in pandas instead and only use procs = [Categorify] rather than procs = [Normalize, Categorify].
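Concretely, the pandas workaround looks roughly like this (normalize_with_stats is just a helper name I made up; means and stds are the dictionaries computed on the 10 percent sample):

def normalize_with_stats(df, means, stds):
    "Normalize continuous columns in pandas with the precomputed statistics."
    for col in means:
        df[col] = (df[col] - means[col]) / stds[col]
    return df

# Applied to every chunk before it goes into TabularPandas, together with
# procs = [Categorify] (i.e. without Normalize).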

This is the code of what I am doing:

import numpy as np
import pandas as pd
from tqdm import tqdm
from fastai.tabular.all import *

# set_dtypes, cat_vars, cont_vars, data_samples and the initial learner
# are defined earlier.

# Statistics computed once on the 10 percent sample of all 12 files
means = {'Dayofyear': 176.86231615789472,
         'A': 12.441854905628665,
         'B': 11.217872411941011,
         'C': 10.34373058169954,
         'D': 14.558722989157017}

stds = {'Dayofyear': 91.90435008741207,
        'A': 8.592523554122467,
        'B': 9.849249912247844,
        'C': 8.683435356329042,
        'D': 8.551718797306815}

procs = [Categorify]  # Normalize left out; normalization is done in pandas (see above)
y_name = "my_Y"

for idx, csv_file in enumerate(tqdm(data_samples)):

    file = csv_file.name
    name = file.split("_")[0]
    print(f"processing file {file}")

    # Read each monthly file in chunks to keep memory under control
    df_chunked = pd.read_csv(csv_file, chunksize=4_000_000,
                             dtype={"id": "category", "time": str})

    for df_i in tqdm(df_chunked):

        df_clean = df_i.pipe(set_dtypes).pipe(add_datepart, 'date', drop=False)
        df_clean["Y"] = np.log(df_clean[y_name] + 1)

        print(f" MIN: {df_clean['Y'].min()}")
        print(f" MAX: {df_clean['Y'].max()}")

        splits = RandomSplitter(valid_pct=0.15)(range_of(df_clean))

        to = TabularPandas(
            df_clean,
            procs,
            cat_names=cat_vars,
            cont_names=cont_vars,
            y_names="Y",
            splits=splits,
            y_block=RegressionBlock(n_out=1),
        )

        dls = to.dataloaders(bs=2048, shuffle_train=True)

        # Swap the new chunk's dataloaders into the existing learner
        learner.dls = dls

        # This is the part that does not seem to have any effect
        learner.dls.means = means
        learner.dls.stds = stds

        learner.fit_one_cycle(12, lr_max=0.001, wd=0.2)
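
An alternative I have been wondering about for (2), instead of overwriting learner.dls.means, is a small custom TabularProc that takes the precomputed statistics in its constructor. This is only a sketch (NormalizeFixed is a name I made up and I have not verified it across fastai versions):

import pandas as pd
from fastai.tabular.all import *

class NormalizeFixed(TabularProc):
    "Normalize continuous variables with externally supplied means/stds."
    order = 2  # run after Categorify
    def __init__(self, means, stds):
        # assumes means/stds contain an entry for every column in cont_names
        self.means, self.stds = pd.Series(means), pd.Series(stds)
    def encodes(self, to):
        to.conts = (to.conts - self.means) / self.stds
        return to
    def decodes(self, to):
        to.conts = to.conts * self.stds + self.means
        return to

# Hypothetical usage: pass the precomputed statistics instead of Normalize
# procs = [Categorify, NormalizeFixed(means, stds)]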

This will likely be useful for you; I wrote an article on that a while back (for question 2):

Many thanks for that. This is very helpful.

As an extra question: when incrementally training on multiple CSV files (one file per month, though not a pure time series and with no LSTM components), have you found it more useful to train on them chronologically or in random order?

Thanks