Use pre-trained resnet model on 2D float tensor

Hello everyone.
I’m a beginner; I started the course a couple of weeks ago. I’m trying to make an entry in the BirdCLEF 2023 competition on Kaggle to practice on a real problem.

The data is a set of audio files split into 264 categories. I’m converting them all into mel spectrograms of size 128×626. I’m creating DataLoaders through a TfmdLists, in which I’ve implemented a Transform that converts my spectrogram data into float tensors of that size.
When I type:
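Roughly, the conversion inside my Transform’s `encodes` looks like this (a simplified sketch; the actual file loading and any normalization are left out, and `spec_to_tensor` is just an illustrative name):

```python
import numpy as np
import torch

# Simplified sketch of what my Transform does to one item:
# take a mel spectrogram array and return a 2D float tensor.
def spec_to_tensor(spec: np.ndarray) -> torch.Tensor:
    return torch.from_numpy(spec).float()  # shape (128, 626), no channel dim

spec = np.random.randn(128, 626).astype(np.float32)
t = spec_to_tensor(spec)
print(t.shape)  # torch.Size([128, 626])
```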

tls.train[0][0].shape, the output is:

torch.Size([128, 626])

and when I type tls.train[0], the output is:

(tensor([[-0.1572, -0.1511, -0.1756, …, -0.1094, -0.0812, -0.0751],
[-0.1835, -0.1894, -0.1971, …, -0.1480, -0.1037, -0.0965],
[-0.2153, -0.2210, -0.2002, …, -0.1808, -0.1446, -0.1302],
…,
[-0.3137, -0.3137, -0.3137, …, -0.3137, -0.3137, -0.2982],
[-0.3137, -0.3137, -0.3137, …, -0.3137, -0.3137, -0.2983],
[-0.3137, -0.3137, -0.3137, …, -0.3137, -0.3137, -0.2984]]), 70)

Just so you have an idea of the shape of my data.

My problem is that I’m trying to fine-tune a pre-trained model on my data, but (in my understanding; I could be wrong here) each mel spectrogram is a 2D float tensor, while the resnet models were trained on 3D tensors (a 2D image plus one dimension for the RGB channels?). That would explain the error I get when I run this:

tls = TfmdLists(items, [npy_transform], splits=splits)
dls = tls.dataloaders(bs=64)
learn = vision_learner(dls, resnet18, opt_func=opt_func, metrics=error_rate, loss_func=CrossEntropyLossFlat())
learn.fine_tune(4)

I get this:

RuntimeError: Given groups=1, weight of size [64, 3, 7, 7], expected input[1, 64, 128, 626] to have 3 channels, but got 64 channels instead

It seems like my whole batch is being read as the channel dimension, which messes up how the input is interpreted. How could I change that, or set another input shape? Or maybe my interpretation is wrong?

Thanks for reading! Have a good day.