I was wondering why there are no autoencoders that use, say, the appropriate padding in the conv layers so that the number of units never decreases. Is there any evidence that such a network always finds a trivial solution like the identity function, or overfits, or something like that?
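For reference, here is a minimal sketch of what "appropriate padding" means, using the standard output-size formula for a convolution without dilation. The helper names `conv_output_size` and `same_padding` are made up for illustration and do not come from any library.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output length of a convolution over n input positions (no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def same_padding(kernel):
    """Padding that preserves the length for an odd kernel at stride 1."""
    return (kernel - 1) // 2


# Stack of stride-1 convolutions with size-preserving padding:
# the number of spatial positions never decreases.
size = 32
sizes = [size]
for kernel in (3, 5, 3):
    size = conv_output_size(size, kernel, stride=1, padding=same_padding(kernel))
    sizes.append(size)
print(sizes)  # [32, 32, 32, 32]

# For contrast, the usual downsampling step: stride 2 halves the size.
halved = conv_output_size(32, kernel=3, stride=2, padding=1)
print(halved)  # 16
```

Note that this only tracks spatial size. The total number of units also depends on the channel count, which is left out here.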

The idea of an autoencoder is to learn a compressed representation of the data, hence the downsampling.

I have been wondering the same thing. The most satisfying answer to me is the identity function, as you mentioned. It seems reasonable that if we do not decrease the number of units, the network is more likely to learn the identity function than one that has a bottleneck. But how likely, actually? I don't think it is easy for a network to learn the identity function, since failing to learn the identity is a known problem in networks. For example, one problem that residual blocks aim to solve is making the identity function between layers easier to learn. So, to my understanding, it is not that easy for a network to learn the identity function. Then why is downsampling the common practice?

U-Nets use this logic too. If someone can explain this intuitively, that would be awesome.

(I suspect that learning the identity function between layers might be harder than learning an identity function that maps the input to the same output. In the first case the goal is to reduce the loss function, and there are lots of parts interacting with each other, so learning the identity between layers is at best a secondary goal of the network. In the second case we are talking about the network as a whole being an identity function. Since reproducing the input at the output is the primary goal there, the network could be much more likely to learn the identity function as a whole. If that is the case, downsampling is really logical, but I am very curious whether this is the only reason we do it that way.)

(It is also reasonable to treat a low-dimensional representation of the data as meaningful information. For example, if you watch a football match for 90 minutes, at the end you can distill that so-called higher-dimensional video input into one piece of information and say the score is "2-0". We have reduced the dimension, but the result is new information about the system. I am not very familiar with the various network architectures. Is it common practice to lower the dimension step by step when the output has a lower dimension than the input data, or are there cases where you first upsample and then downsample to the size of the output? This last question is not specific to autoencoders; it is a general question about architecture.)

I haven't personally worked much with autoencoders.

But what I have read is this:

One way to obtain useful features from the autoencoder is to constrain h (representation layer) to have a smaller dimension than x (input). An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.

This is from the Deep Learning book from the chapter about AutoEncoders.

I think what @chatuur quoted from the Deep Learning Book very elegantly captures the idea.

If a neural network is forced to give correct outputs with a reduced representation of the data, it will eventually learn to ignore the unimportant parts of the data and focus only on the parts that help it reduce the loss (i.e. give correct predictions).
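To make the contrast concrete, here is a small sketch with a purely linear "autoencoder" written in plain Python. It is an illustration under simplifying assumptions, not a trained model: the weights are set by hand, and the helper names (`matvec`, `identity`, `reconstruction_error`) are hypothetical. It only shows that the identity is an available zero-loss solution when the code is as wide as the input, and that it is no longer available with a bottleneck. It says nothing about how likely training is to find it.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def reconstruction_error(encoder, decoder, x):
    """Squared error between x and decoder(encoder(x))."""
    code = matvec(encoder, x)
    x_hat = matvec(decoder, code)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


x = [0.5, -1.0, 2.0, 3.0]

# Code dimension equals input dimension: identity weights copy the input,
# so the reconstruction loss is zero without capturing any structure.
full_error = reconstruction_error(identity(4), identity(4), x)

# Undercomplete code (2 of 4 dimensions): this hand-picked encoder keeps
# only the first two coordinates, so the rest cannot be copied through.
encoder = identity(4)[:2]                   # 2 x 4
decoder = [row[:2] for row in identity(4)]  # 4 x 2
bottleneck_error = reconstruction_error(encoder, decoder, x)

print(full_error)        # 0.0
print(bottleneck_error)  # 13.0
```

With the bottleneck, the only way to lower the error is to choose which directions of the data to keep, which is the sense in which the network has to focus on the parts that matter.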

That is reasonable, and I think it is not specific to autoencoders.

So can we say that it is common practice to downsample step by step when the output has a lower dimension? Is there any counterexample where we upsample first and then downsample to the size of the output?