I’m currently at week 4, so maybe my question will be answered later on, but I’m a bit too curious to wait.
Let’s say we’re doing classification on audio data that has been transformed into chromagrams and Mel-frequency cepstral coefficients (MFCCs). These are basically 2D arrays with one dimension fixed and the other dependent on the length of the original audio signal.
The question is: how do we properly feed that into a neural network? One option would be to just clip all audio signals to some fixed length, but that seems rather wasteful for longer signals. What would be a better way of dealing with this?
For practical reasons you will probably still end up clipping your data, but both convolutional and recurrent layers can handle arbitrary-length input.
You would either need to bin your data by length (every example in a batch has to share the same shape) or use a batch size of 1, though.
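Here’s a rough sketch of what binning by length could look like, with hypothetical names (`bucket_by_length`, `spectrograms`) and assuming each example is a NumPy array of shape `(n_frames, n_features)`: group examples with the same number of frames so each batch stacks into a single tensor.

```python
# Hypothetical sketch: group variable-length spectrograms into batches of equal
# length so each batch stacks into one array (all tensors in a batch must share a shape).
from collections import defaultdict
import numpy as np

def bucket_by_length(spectrograms, labels, batch_size=16):
    """Yield (batch_x, batch_y) where every item in a batch has the same number of frames."""
    buckets = defaultdict(list)
    for x, y in zip(spectrograms, labels):
        buckets[x.shape[0]].append((x, y))          # key on the time dimension
    for items in buckets.values():
        for i in range(0, len(items), batch_size):
            chunk = items[i:i + batch_size]
            xs = np.stack([x for x, _ in chunk])    # shape: (batch, n_frames, n_features)
            ys = np.array([y for _, y in chunk])
            yield xs, ys
```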
Uhm, I’m not sure I understand what you mean here. Let’s say the network is trained on data obtained from 30 seconds of audio. Then I guess I could do predictions one item at a time, feeding the net 30-second-long fragments (and combining the predictions using some ad hoc aggregate function)… but it leaves me with the feeling there must be a more elegant solution.
Both LSTM layers and convolutional layers with global pooling at the end will take an arbitrary-length input and give you a fixed-length result.
So you could train on 30-second inputs and then use those trained weights on arbitrary-length test cases. It would just require changing your input shape and reusing the same weights.
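A minimal sketch of that idea, assuming TensorFlow/Keras and made-up values for `n_features` and `n_classes`: the time axis is declared as `None`, so the same weights trained on 30-second clips can be applied to a longer clip at prediction time.

```python
# Hedged sketch: 1D convolutions + global pooling accept any number of time frames,
# so weights trained on fixed-length clips can be reused on arbitrary-length inputs.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

n_features = 128   # e.g. Mel bins or chroma bins per frame (assumed value)
n_classes = 10     # assumed number of target classes

model = tf.keras.Sequential([
    layers.Input(shape=(None, n_features)),         # None = variable number of frames
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),                    # collapses the time axis to a fixed-length vector
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Train on fixed-length batches (roughly 1292 frames for 30 s at a 512-sample hop, 22.05 kHz)...
x_train = np.random.rand(32, 1292, n_features).astype("float32")
y_train = np.random.randint(0, n_classes, size=32)
model.fit(x_train, y_train, epochs=1, verbose=0)

# ...then predict on a clip of a different length with the same weights.
longer_clip = np.random.rand(1, 4000, n_features).astype("float32")
print(model.predict(longer_clip).shape)             # (1, n_classes)
```

Swapping the conv stack for an LSTM that returns only its final state would work the same way; the key point is that the layer after the variable-length part produces a fixed-size vector regardless of how many frames went in.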