Wondering if there’s an obvious explanation for this – I have a simple model:
nn.Sequential(
    nn.Embedding(50000, 128, padding_idx=0),
    nn.ReLU(),
    nn.Linear(128, 10000)
)
Think of this as predicting a hashtag from a tweet: the inputs are BOW featurizations of the tweets, and the output classes are the (large) set of hashtags.
If I train with the fastai.text.SortishSampler, I get significantly worse results than if I train with a random ordering:
# SortishSampler
{"epoch": 0, "p_at_01": 0.17496875, "p_at_05": 0.13944375, "p_at_10": 0.11604374999999999, "elapsed": 12.517449378967285}
{"epoch": 1, "p_at_01": 0.1878125, "p_at_05": 0.14675, "p_at_10": 0.11996250000000001, "elapsed": 25.40839433670044}
{"epoch": 2, "p_at_01": 0.10959375, "p_at_05": 0.070675, "p_at_10": 0.05009374999999999, "elapsed": 38.185012102127075}
{"epoch": 3, "p_at_01": 0.25040625, "p_at_05": 0.19470625000000005, "p_at_10": 0.16205937499999998, "elapsed": 50.31228280067444}
{"epoch": 4, "p_at_01": 0.2653125, "p_at_05": 0.20280000000000004, "p_at_10": 0.168853125, "elapsed": 61.01092886924744}
{"epoch": 5, "p_at_01": 0.29490625, "p_at_05": 0.22359375, "p_at_10": 0.18587500000000004, "elapsed": 72.7346978187561}
{"epoch": 6, "p_at_01": 0.27859375, "p_at_05": 0.21149375, "p_at_10": 0.17360625000000005, "elapsed": 84.20354294776917}
{"epoch": 7, "p_at_01": 0.31578125, "p_at_05": 0.23923125, "p_at_10": 0.196478125, "elapsed": 96.55597829818726}
{"epoch": 8, "p_at_01": 0.3333125, "p_at_05": 0.2518625, "p_at_10": 0.20785312500000003, "elapsed": 108.56037592887878}
{"epoch": 9, "p_at_01": 0.32671875, "p_at_05": 0.24431875, "p_at_10": 0.19895, "elapsed": 120.693279504776}
# random order
{"epoch": 0, "p_at_01": 0.22846875, "p_at_05": 0.18122500000000002, "p_at_10": 0.151646875, "elapsed": 13.475401163101196}
{"epoch": 1, "p_at_01": 0.28275, "p_at_05": 0.2161375, "p_at_10": 0.17963125, "elapsed": 25.572567224502563}
{"epoch": 2, "p_at_01": 0.32896875, "p_at_05": 0.25131875, "p_at_10": 0.20738750000000006, "elapsed": 37.94573616981506}
{"epoch": 3, "p_at_01": 0.34846875, "p_at_05": 0.26486875, "p_at_10": 0.215721875, "elapsed": 50.27561402320862}
{"epoch": 4, "p_at_01": 0.3824375, "p_at_05": 0.28755625, "p_at_10": 0.23284375, "elapsed": 63.51650404930115}
{"epoch": 5, "p_at_01": 0.39121875, "p_at_05": 0.29014375000000003, "p_at_10": 0.2353, "elapsed": 77.00169587135315}
{"epoch": 6, "p_at_01": 0.41134375, "p_at_05": 0.30732500000000007, "p_at_10": 0.247771875, "elapsed": 89.96660947799683}
{"epoch": 7, "p_at_01": 0.4099375, "p_at_05": 0.30201874999999995, "p_at_10": 0.24224062500000001, "elapsed": 102.85762786865234}
{"epoch": 8, "p_at_01": 0.4213125, "p_at_05": 0.31561875, "p_at_10": 0.25602500000000006, "elapsed": 115.78191900253296}
{"epoch": 9, "p_at_01": 0.43459375, "p_at_05": 0.32563125, "p_at_10": 0.263815625, "elapsed": 128.54980063438416}
where p_at_k is the precision of the top k predictions. (Top-1 accuracy isn't super useful because the number of classes is so large.) Convergence with the random sampler is:
a) much faster (3 epochs to reach p_at_01=0.3, vs. 7 for SortishSampler)
b) much smoother (monotonically increasing, whereas SortishSampler bounces around)
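For concreteness, this is roughly how I'm computing the metric per example (a minimal pure-Python sketch, not the exact code from my training loop):

```python
def precision_at_k(scores, labels, k):
    """scores: per-class scores for one example (list of floats);
    labels: set of true class indices.
    Returns the fraction of the top-k predictions that are true labels."""
    topk = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return sum(1 for i in topk if i in labels) / k
```

The reported numbers are this value averaged over the eval set.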
Anyone have any thoughts on why this would be? SortishSampler is "less random", but I'm surprised it actually makes this big of a difference.
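For reference, my understanding of what SortishSampler does is roughly this (a sketch of the idea, not fastai's actual implementation; the megabatch_mult parameter is my own name for the chunk-size multiplier):

```python
import random

def sortish_order(lengths, batch_size, megabatch_mult=5):
    """Sketch of "sortish" ordering: shuffle indices, sort each large
    "megabatch" by length so each minibatch sees similar-length examples,
    then shuffle the minibatches."""
    idxs = list(range(len(lengths)))
    random.shuffle(idxs)
    mb = batch_size * megabatch_mult
    # sort within each megabatch by descending length
    sorted_idxs = [i for start in range(0, len(idxs), mb)
                   for i in sorted(idxs[start:start + mb],
                                   key=lambda j: lengths[j], reverse=True)]
    # carve into minibatches and shuffle the batch order
    batches = [sorted_idxs[start:start + batch_size]
               for start in range(0, len(sorted_idxs), batch_size)]
    random.shuffle(batches)
    return [i for b in batches for i in b]
```

So each minibatch ends up correlated by length, which is exactly the "less random" part I'm wondering about.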
EDIT: Also… I wonder how much of a difference this makes for input into RNNs (e.g. in ULMFiT fine-tuning).