Network architecture for self-supervised learning

I was reading through Jeremy’s tutorial on self-supervised learning, which is very useful. One question arose regarding the actual implementation of the networks.
In short, what I’m wondering is:

to what extent is it necessary to use the same network architecture for the pretext task and the main (downstream) task?

My goal is to implement a self-supervised pretext task to pretrain a tabular model. I was thinking a good pretext task would be to predict masked (missing) attributes from the rest of the row. I am currently experimenting with basic attribute classification on a dataset with no time or space components (KDD’99), so predicting missing attributes seems like a natural fit. In the future, I would like to experiment with a pretext task of forecasting the next time steps or spatial steps.
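For concreteness, here is a minimal sketch of what I have in mind for the masked-attribute pretext. It assumes a pandas DataFrame df of KDD’99 rows and lists cat_cols / cont_cols of the categorical and continuous column names; protocol_type is just an arbitrary example of an attribute to hold out and predict:

from fastai.tabular.all import *

pretext_target = 'protocol_type'  # attribute to "mask" and predict from the rest of the row
cat_names = [c for c in cat_cols if c != pretext_target]
cont_names = cont_cols

dls_pretext = TabularDataLoaders.from_df(
    df, procs=[Categorify, FillMissing, Normalize],
    cat_names=cat_names, cont_names=cont_names,
    y_names=pretext_target, y_block=CategoryBlock(), bs=1024)

# same body as the main task: two hidden layers of 500 and 250 units
learn_pretext = tabular_learner(dls_pretext, layers=[500, 250])
learn_pretext.fit_one_cycle(5)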

My thought process is to draw an analogy with the approaches used in text and computer vision, which seem applicable here. But am I conceptualizing this correctly? My current network architecture is just from the latest 40_tabular.ipynb:

learn = tabular_learner(dls, layers=[500,250], n_out=len(y.unique()))
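To make the architecture-sharing part of the question concrete, here is roughly how I imagine carrying the pretext weights over to the main task, assuming learn_pretext from the sketch above and the learn defined on the line just above. The idea is that the embeddings and the 500/250 hidden layers form the shared body and only the final classification layer differs, so I copy every parameter whose name and shape match and leave the new head randomly initialised:

pretext_sd = learn_pretext.model.state_dict()
main_sd = learn.model.state_dict()
# copy only parameters present in both models with identical shapes;
# the final head (and the first linear layer, if the input columns differ) is skipped
# (caveat: if dropping the pretext target changes the column order, matching by name
#  may pair up different columns' embeddings)
transferred = {k: v for k, v in pretext_sd.items()
               if k in main_sd and v.shape == main_sd[k].shape}
main_sd.update(transferred)
learn.model.load_state_dict(main_sd)

learn.fit_one_cycle(10)

Is that the right way to think about it, or should the pretext and main tasks share literally the same Learner with just the head swapped?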

Note: part of the reason I want to look at a pretext task is that this network is not performing well at all compared to the sklearn CART implementation. My goal is to reach state-of-the-art (SOTA) performance with a neural network.