df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn, 20000)
y_train, _ = split_vals(y_trn, 20000)
In the above snippet, we get 30k random rows from df_raw, which has 401,125 rows.
When I looked into how it fetches those 30k rows, I found that the line below picks 30k random indexes:
idxs = sorted(np.random.permutation(len(df))[:30000]) # in get_sample
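To make that step concrete, here is a minimal sketch of the sampling on a small stand-in DataFrame. The helper name sample_rows and the toy data are my own assumptions for illustration, not the library's code; only the idxs line mirrors what is quoted above.

```python
import numpy as np
import pandas as pd

def sample_rows(df, n):
    """Hypothetical helper: pick n random rows, kept in their original order."""
    # Shuffle all row positions, keep the first n, then sort them
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

# Toy stand-in for df_raw (assumption: 100 rows instead of 401,125)
toy = pd.DataFrame({"SalePrice": np.arange(100)})
sample = sample_rows(toy, 30)
print(sample.index[:5])
```

Because the positions are sorted after shuffling, the sampled rows come back in increasing index order, with no duplicates.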
Now I'm confused: why do the indexes that end up in X_valid, created by the code below, never get picked up in the idxs above?
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape
As a result, X_train has completely different indexes from X_valid.
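One way to check this is a rough sketch that works on row positions only, using the sizes from the post rather than the real data. The fixed seed and the variable names overlap_all and overlap_first_20k are my own assumptions. It counts how many sampled positions fall in the validation range (the last 12,000 rows), both for the full 30k sample and for the first 20k that split_vals keeps.

```python
import numpy as np

def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

np.random.seed(0)  # assumption: fixed seed so reruns give the same counts

n_rows = 401125             # length of df_raw, as stated in the post
n_valid = 12000             # validation rows are the last 12,000
n_trn = n_rows - n_valid    # first validation position

# Same sampling expression as in get_sample, applied to positions only
idxs = np.array(sorted(np.random.permutation(n_rows)[:30000]))

# split_vals keeps the first 20,000 of the sorted sample for training
train_idxs, _ = split_vals(idxs, 20000)

# Count sampled positions that land inside the validation range
overlap_all = int((idxs >= n_trn).sum())
overlap_first_20k = int((train_idxs >= n_trn).sum())
print(overlap_all, overlap_first_20k)
```

Since idxs is sorted, the 20,000 positions kept by split_vals are the smallest ones in the sample, so comparing the two counts shows whether the truncation is what keeps validation rows out of X_train.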
Please correct me.