Thomas’s code gives you (and me) some great examples and starting points. At the same time, I think it’s important that you have a clear concept of what you are trying to do. Please excuse me if I make any wrong assumptions about what you already understand.
RNNs are typically applied to a time dimension. The essence is that what has gone before is the cause of what comes after. The RNN learns the causal relationship in its hidden state and gives you a conclusion (a prediction or a class) after seeing the whole sequence. (You could in theory apply an RNN to the frequency dimension, but there is not much of a cause-and-effect relationship there, and there are more effective ways to extract its features.)
It looks like you want to classify time sequences (length 398709) made of [1,128] elements, each of which you are calling an image. If this is right, then you are running a sequence of “images” through an RNN to classify the sequence, not treating the pixels of a single image as a sequence.
To get some clarity, stop calling these [1,128] things images. Images are inherently two dimensional, with structure along both dimensions. Instead, think of them as vectors of 128 features arranged in time sequence. Each single element of your training set is therefore 128 features by 398709 time steps. Channels and embeddings are irrelevant here.
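To make the shape change concrete, here is a minimal sketch (the batch size and time length are tiny placeholders; your real time length is 398709) showing how the [1,128] “images” collapse into plain 128-feature vectors:

```python
import torch

# Hypothetical batch of 2 sequences, 10 time steps each,
# stored the way you describe: [batch, time, 1, 128] "images".
x = torch.randn(2, 10, 1, 128)

# Drop the singleton dimension: each time step is just a 128-feature vector.
x = x.squeeze(2)   # -> [batch, time, features] = [2, 10, 128]
print(x.shape)
```

After the squeeze, the tensor is exactly what `nn.LSTM` with `batch_first=True` expects.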
As a roadmap, I suggest starting with the very simplest LSTM. It takes 128 raw features and processes the whole time sequence, in batches. It will probably give you a decent result.
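As a rough sketch of what that simplest model could look like (the hidden size and number of classes below are placeholders I made up; adjust them to your task):

```python
import torch
import torch.nn as nn

class SimpleLSTMClassifier(nn.Module):
    """Minimal sketch: raw 128-feature time steps -> LSTM -> class logits."""
    def __init__(self, num_features=128, hidden_size=64, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):            # x: [batch, time, 128]
        _, (h_n, _) = self.lstm(x)   # h_n: [num_layers, batch, hidden]
        return self.fc(h_n[-1])      # one prediction per whole sequence

model = SimpleLSTMClassifier()
logits = model(torch.randn(2, 10, 128))   # tiny dummy batch
print(logits.shape)                        # [2, num_classes]
```

Classifying from the final hidden state `h_n[-1]` is the “conclusion after seeing the whole sequence” idea described above.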
But if this simple model does not train well enough (and I think this is what you are trying to do), then you could pre-process the elements of the sequences with CNN. That is, apply a trainable CNN to each element (128 features) of the time sequence to extract features. Then feed these extracted features (a time series of feature vectors) to the LSTM. I’m pretty sure you can do all this in batches without loops.
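One loop-free way to do that (a sketch under my own assumptions; the CNN shape, kernel size, and pooling are placeholders) is to fold the time dimension into the batch dimension, run the CNN once, and unfold before the LSTM:

```python
import torch
import torch.nn as nn

class CNNThenLSTM(nn.Module):
    """Sketch: a small 1-D CNN extracts features from each 128-feature
    time step, then an LSTM reads the resulting feature sequence."""
    def __init__(self, num_classes=4, cnn_out=32, hidden_size=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5, padding=2),  # [*, 1, 128] -> [*, 8, 128]
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(cnn_out // 8),         # -> [*, 8, 4]
        )
        self.lstm = nn.LSTM(cnn_out, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                 # x: [batch, time, 128]
        b, t, f = x.shape
        z = x.reshape(b * t, 1, f)        # fold time into batch: no Python loop
        z = self.cnn(z).reshape(b, t, -1) # back to [batch, time, cnn_out]
        _, (h_n, _) = self.lstm(z)
        return self.fc(h_n[-1])

model = CNNThenLSTM()
print(model(torch.randn(2, 10, 128)).shape)   # [2, num_classes]
```

The `reshape` trick is what lets the whole thing run in batches without looping over time steps.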
There are many, many approaches to the problem you describe. Lots have been discussed in these forums: give the entire spectrogram (truly an image!) to a resnet classifier; various combinations of CNNs and RNNs; CNNs along both the frequency and time dimensions; ROCKET, self-attention, and transformers. You will have to experiment to discover what works. But I recommend starting with the simplest model and gradually adding complexity based on its performance and your understanding.
Hope this helps you get started. Good luck!