Data from dls.test_dl appears to be shuffled

Erick · July 18, 2024, 5:26am

I have a trained model that achieved quite good results. I exported the learner and the validation set: nlp_learn.dls.valid.items.to_csv("../data/interim/cta_nn/valid.csv"). In a new notebook, I loaded the learner and read the CSV into a dataframe. I then created a dataloader with dl = learn.dls.test_dl(df.tfm_text). However, when I run dl.show_batch(), the items in the batch are in a different order than in the dataframe. This also happens if I pass shuffle=False to test_dl(). This obviously makes it impossible to match the batch back up to the dataframe for downstream processing. I’m fairly certain I’m just missing something (i.e., doing something incorrectly). Alternatively (perhaps the better solution), how can I get the dataframe index passed back with the preds? My current load -> predict pipeline is below.

import pandas as pd
from fastai.text.all import *
from fastai.callback.wandb import *

df = pd.read_csv("../data/interim/cta_nn/valid.csv")
learn = load_learner("../models/learner_best_dutiful-sponge-33", cpu=False)
learn.remove_cbs([WandbCallback])

df["tfm_text"] = df.CHIEF_COMPLAINT.str.replace("'", "").str.replace('"', "").str.strip().str.capitalize()
df = df.drop(columns=["text", "Unnamed: 0.1", "Unnamed: 0"])
df = df.dropna(subset="CHIEF_COMPLAINT")

dl = learn.dls.test_dl(df.tfm_text, shuffle=False)

learn.get_preds(dl=dl)

Thank you in advance.

Erick · July 18, 2024, 6:09am

It does appear that, although the batches show the items out of order, the predictions come back in the correct order. I just can’t figure out why I’m getting different results when I run the data through test_dl first. Same model, same dataset. I’ll post another question if I can narrow it down.
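
A possible explanation for the observation above, offered as an assumption rather than something confirmed in this thread: fastai text dataloaders for validation and test data appear to sort items by length (so that batches need less padding), which would change the order that show_batch() displays even with shuffle=False, while get_preds() seems to restore the original order by default through its reorder argument. The sketch below is plain Python, not fastai code. It only illustrates the idea of visiting items in length-sorted order and then undoing that permutation so each prediction lines up with its dataframe index again. The names sorted_order, restore_order, index, texts, and fake_preds are hypothetical stand-ins.

```python
def sorted_order(texts):
    """Return positions of texts ordered longest first,
    as a length-sorting loader might visit them (assumed behavior)."""
    return sorted(range(len(texts)), key=lambda i: len(texts[i]), reverse=True)


def restore_order(values, visit_order):
    """Undo the permutation: values[k] was computed for the item
    at original position visit_order[k]."""
    restored = [None] * len(values)
    for k, original_pos in enumerate(visit_order):
        restored[original_pos] = values[k]
    return restored


# Hypothetical stand-ins for df.index and df.tfm_text
index = [10, 11, 12, 13]
texts = ["Chest pain", "Fever", "Shortness of breath", "Fall"]

visit_order = sorted_order(texts)

# What a displayed batch might look like: not in dataframe order
shown = [texts[i] for i in visit_order]

# Stand-in "prediction" per visited item (here just the text length)
fake_preds = [len(t) for t in shown]

# Put the predictions back in dataframe row order, then key them by index
preds_in_row_order = restore_order(fake_preds, visit_order)
preds_by_index = dict(zip(index, preds_in_row_order))
```

If get_preds() does reorder its output as assumed, the predictions it returns should already be in dataframe row order, so they could be attached positionally as a new column on the same df that was passed to test_dl(), and the dataframe index would come along for free. Comparing a few rows by hand is a cheap way to check that assumption.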