One Hundred Layers Tiramisu

(Brendan Fortuner) #1

Thread to implement the FC-DenseNet model introduced in One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. This post is a wiki. Please make changes as your understanding improves.

TODO list

  • Understand and implement densenet (Jeremy)
    • Keras
    • Pytorch
  • Benchmark it (Brendan - pytorch)
    • CIFAR10
    • Run existing implementations, to find accuracy at epoch
  • Densenet versions (Kent)
    • Regular
    • B/C (not mentioned in tiramisu)
  • Benchmark tiramisu (Kelvin)
    • Get theano with libgpuarray working
    • Install lasagne
    • Run on camvid
  • Understand and implement tiramisu (pytorch)
    • minimal architecture & dataset
    • add skip connections & increase dataset
    • keras? (yad)

Key Links


I created a github repo for us to collaborate. I initialized it with boilerplate code from the PyTorch DenseNet implementation. It should be straightforward from here to modify this code to recreate Tiramisu.

Related Papers

Key Contributions

  • Extended DenseNet architecture to image segmentation
  • Introduced clever up-sampling technique to improve trainability/performance


These links are old, and the data there is either missing or still in video format. There are newer alternatives like PASCAL VOC and MS COCO, but the authors don’t provide benchmarks for these.



FirstConvLayer

  • 3x3 Conv2D (pad=, stride=, in_chans=3, out_chans=48)


DenseLayer

  • BatchNorm
  • ReLU
  • 3x3 Conv2d (pad=, stride=, in_chans=, out_chans=) - “no resolution loss” - padding included
  • Dropout (.2)


DenseBlock

  • Input = FirstConvLayer, TransitionDown, or TransitionUp
  • Loop to create L DenseLayers (L=n_layers)
  • On TransitionDown we Concat(Input, FinalDenseLayerActivation)
  • On TransitionUp we do not Concat with input, instead pass FinalDenseLayerActivation to TransitionUp block


TransitionDown

  • BatchNorm
  • ReLU
  • 1x1 Conv2D (pad=, stride=, in_chans=, out_chans=)
  • Dropout (0.2)
  • 2x2 MaxPooling


Bottleneck

  • DenseBlock (15 layers)


TransitionUp

  • 3x3 Transposed Convolution (pad=, stride=2, in_chans=, out_chans=)
  • Concat(PreviousDenseBlock, SkipConnection) - from the corresponding DenseBlock on the transition down
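The padding of the transposed convolution is left blank above, and it decides whether the upsampled map lines up with the skip connection. A minimal sketch of the size arithmetic, assuming the standard transposed-convolution output formula with dilation 1 (the one documented for PyTorch's `ConvTranspose2d`). The function name and the padding values tried here are illustrative assumptions, not values taken from the paper:

```python
def transposed_conv_out_size(size_in, kernel=3, stride=2, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution (dilation assumed to be 1)."""
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding


# A 224x224 crop halved by five TransitionDown blocks is 7x7 at the bottleneck.
bottleneck_size = 224 // 2 ** 5

# Hypothetical setting 1: no padding. The output overshoots 2x by one pixel.
print(transposed_conv_out_size(bottleneck_size))

# Hypothetical setting 2: padding=1, output_padding=1 gives an exact 2x upsample.
print(transposed_conv_out_size(bottleneck_size, padding=1, output_padding=1))
```

With no padding the output would need cropping before the Concat with the skip connection; the second setting doubles the size exactly.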


Final 1x1 ConvLayer

  • 1x1 Conv2d (pad=, stride=, in_chans=256, out_chans=n_classes)
  • Softmax

FCDenseNet103 Architecture

  • input (in_chans=3 for RGB)
  • 3x3 ConvLayer (out_chans=48)
  • DB (4 layers) + TD
  • DB (5 layers) + TD
  • DB (7 layers) + TD
  • DB (10 layers) + TD
  • DB (12 layers) + TD
  • Bottleneck (15 layers)
  • TU + DB (12 layers)
  • TU + DB (10 layers)
  • TU + DB (7 layers)
  • TU + DB (5 layers)
  • TU + DB (4 layers)
  • 1x1 ConvLayer (out_chans=n_classes) n_classes=11 for CamVid
  • Softmax



Training

  • WeightInitialization = HeUniform
  • Optimizer = RMSProp
  • LR = .001 with exponential decay of 0.995 after each epoch
  • Data Augmentation = Random Crops, Vertical Flips
  • ValidationSet with early stopping based on IoU or MeanAccuracy with patience of 100 (50 during finetuning)
  • WeightDecay = .0001
  • Finetune with full-size images, LR = .0001
  • Dropout = 0.2
  • BatchNorm “we use current batch stats at training, validation, and test time”
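To get a feel for how slowly the learning rate shrinks under the schedule listed above, here is a minimal sketch. It assumes the decay is applied multiplicatively once per epoch, starting from epoch 0; the function name is made up for illustration:

```python
def lr_at_epoch(epoch, base_lr=1e-3, decay=0.995):
    """Learning rate after `epoch` full epochs of per-epoch exponential decay."""
    return base_lr * decay ** epoch


for epoch in (0, 1, 10, 100):
    print(epoch, lr_at_epoch(epoch))
```

Under this assumption the rate is still roughly 60% of its starting value after 100 epochs, so the decay is gentle compared with the patience of 100 used for early stopping.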


CamVid Dataset

  • TrainingSet = 367 frames
  • ValidationSet = 101 frames
  • TestSet = 233 frames
  • Images of resolution 360x480
  • Images random cropped to 224x224 for training
  • FullRes images used for finetuning
  • NumberOfClasses = 11 (output)
  • BatchSize = 3
  • Epochs = ??


FCDenseNet103

  • GrowthRate = 16 (k, the number of filters each DenseLayer adds to the ever-growing concatenated output)
  • No pretraining


FCDenseNet56

  • GrowthRate (k) = 12
  • 4 layers per dense block
  • 1 Conv Layer
  • 5 DenseBlocks Downsample (20 layers)
  • 5 TransitionDown
  • 4 Bottleneck layers
  • 5 Dense Blocks Upsample (20 layers)
  • 5 TransitionUp
  • 1 Conv Layer
  • 1 Softmax layer (doesn’t count)
  • 56 Total layers
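The layer tally above can be reproduced with a few lines of arithmetic. This is a sketch of the counting convention used in the list (softmax not counted, up path assumed to mirror the down path); the function name is hypothetical:

```python
def count_layers(down_blocks, bottleneck_layers):
    """Count layers the way the list above does.

    down_blocks: number of dense layers in each down-path block. The up path
    is assumed to use the same block sizes in reverse order.
    """
    first_conv = 1
    dense_down = sum(down_blocks)
    transitions_down = len(down_blocks)  # one TD after each down block
    transitions_up = len(down_blocks)    # one TU before each up block
    dense_up = sum(down_blocks)          # mirrored up path
    final_conv = 1                       # the softmax is not counted
    return (first_conv + dense_down + transitions_down + bottleneck_layers
            + transitions_up + dense_up + final_conv)


print(count_layers([4] * 5, 4))             # 4 layers per block, 4 bottleneck layers
print(count_layers([4, 5, 7, 10, 12], 15))  # block sizes from the FCDenseNet103 list
```

The first call gives 56 and the second 103, matching the two model names.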


(Matthew Kleinsmith) #2

A mini-question:

Do we need softmax on the final layer when doing inference? Won’t the result of argmax be the same without it?

(David Gutman) #3

I think you only need softmax for training (since it’s differentiable). That said, argmax(softmax(x)) should always equal argmax(x), and it is quick, so there is no real reason to remove it.
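The reason the two agree is that softmax is strictly increasing in each logit, so it preserves the ordering of the scores. A small pure-Python check on made-up logits (the values are arbitrary, not from any model):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.5, -0.3, 4.2, 0.0]  # arbitrary per-class scores for one pixel
probs = softmax(logits)
print(argmax(logits), argmax(probs))
```

Both calls pick the same class, so skipping the softmax at inference only matters if you need the probabilities themselves.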

(Brendan Fortuner) #4

Room 451 from 11am - 5pm today. Right now on 5th floor but will move down shortly.

(yad.faeq) #5

@brendan, just a quick clarification that might be helpful during the implementation:
Last night I implemented the Tiramisu, and it works as is, but I need to clean it up and perhaps replicate the results first. There is one problem, though:

The paper uses ‘m’ for the number of feature maps at the end of each block. The 103-layer model grows at a rate of 16, yet for the bottleneck in the middle the paper says:

880, which should be 896: the 656 feature maps from the previous stage + (16 growth_rate * 15 conv_layers)
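The arithmetic can be traced block by block. This sketch assumes that each down-path dense block concatenates its input with the output of every layer, that TransitionDown keeps the channel count unchanged, and that the bottleneck is counted the same way; the function name is made up:

```python
def feature_maps_down_path(first_conv_out, growth_rate, down_blocks, bottleneck_layers):
    """Feature-map count (m) after each down-path block and after the bottleneck."""
    m = first_conv_out
    per_block = []
    for n_layers in down_blocks:
        m += growth_rate * n_layers  # every dense layer adds growth_rate maps
        per_block.append(m)
    bottleneck = m + growth_rate * bottleneck_layers
    return per_block, bottleneck


per_block, bottleneck = feature_maps_down_path(48, 16, [4, 5, 7, 10, 12], 15)
print(per_block)   # m after each of the five down-path blocks
print(bottleneck)  # m after the 15-layer bottleneck
```

Under these assumptions the last down-path block ends at 656 and the bottleneck at 896, not 880.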

Also, the 103 will require at least 11 to 12 GB to compile (Titan X), so try one of the smaller models, such as the 56, which grows at a rate of 12 and has 4 layers per dense block, or the 67.

(Constantin) #6

Guys, in case you run into the InvalidArgumentError problem in Keras:
I kept running into this when trying to access layer outputs from Keras architectures that use merge layers, like ResNet, DenseNet, etc. You get an InvalidArgumentError asking you to initialize a tf.placeholder.
I finally found this GitHub issue, which solved my problem.
Just call K.set_learning_phase(0) at the beginning of your script and you are done with it.
It sets a tf.placeholder to a constant in the TensorFlow backend.
This might also help if you are trying to visualize intermediate activations or the like.
And, thanks for looking into this paper. Really excited about it.

(Brendan Fortuner) #7

I got the PyTorch DenseNet in the bamos repo working. I’m training it now with 40 layers, no reduction, no bottleneck. So far so good.

His repo provides some handy helper scripts. I copied these scripts into our repo and made some modifications. I think these could be really handy for us!

Here’s an example plot. It “grows” every epoch while you’re training:

Here’s a cool repo for visualizing the ConvTranspose layer, which we see in Tiramisu.

(Jeremy Howard) #8

I was assuming that we should use UpSampling2D and a normal conv, to handle the checkerboard issue?

(Kent) #9

Hi guys, I joined this late and I am trying to catch up. I’ve read the paper, and I am glad to contribute to tasks that have not been picked up by anybody yet. I noticed that the 3rd bullet point in the TODO list has no name yet, so I can start from that one. If there is anything else more important I can certainly help too.

@brendan can you brief me on what’s expected for this task? Technically speaking there are 4 types: DenseNet, DenseNet-B, DenseNet-C and DenseNet-BC. But B and C are not really used (except for being referred to once in the diagram below).

(Kent) #10

A question that is not the most relevant to this thread, but since the Tiramisu skills can potentially be very useful to the problem domain, I am asking here: in order to get state-of-the-art performance in a self-driving car competition like this one, what do you think are the major areas missing from what we have learned in this course so far?


Using the given data, competitors must:

Automatically detect and locate obstacles in 3D space to inform the driver/SDC system (e.g. using deep learning and classification approaches)
Fuse detection output from camera and LIDAR sensors
Remove noise and environment false detections

Round 1 - Vehicles

The first round will provide data collected from sensors on a moving car, and competitors must identify position as well as dimensions of multiple stationary and moving obstacles.

Round 2 - Vehicles, Pedestrians

The second round will also challenge participants to identify estimated orientation, in addition to added cyclists and pedestrians

(Matthew Kleinsmith) #11




bamos’ PyTorch implementation:

self.conv1 = nn.Conv2d(3, nChannels, kernel_size=3, padding=1,
                       bias=False)

The authors’ Lua code:

local function ConvInit(name)
  for k,v in pairs(model:findModules(name)) do
     local n = v.kW*v.kH*v.nOutputPlane
     v.weight:normal(0,math.sqrt(2/n))
     if cudnn.version >= 4000 then
        v.bias = nil
        v.gradBias = nil
     else
        v.bias:zero()
     end
  end
end

I don’t know why the authors turn off bias for conv layers, but bamos’ PyTorch implementation is consistent with their Lua code.

Weight initialization


bamos’ PyTorch implementation:

for m in self.modules():
    if isinstance(m, nn.Conv2d):
        # n = kernel area * number of output channels
        n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
        m.weight.data.normal_(0, math.sqrt(2. / n))
    elif isinstance(m, nn.BatchNorm2d):
        m.weight.data.fill_(1)
        m.bias.data.zero_()
    elif isinstance(m, nn.Linear):
        m.bias.data.zero_()

The authors’ Lua code:

local function ConvInit(name)
  for k,v in pairs(model:findModules(name)) do
     local n = v.kW*v.kH*v.nOutputPlane
     v.weight:normal(0,math.sqrt(2/n))
     if cudnn.version >= 4000 then
        v.bias = nil
        v.gradBias = nil
     else
        v.bias:zero()
     end
  end
end

v.nOutputPlane == # of output channels
v.kW, v.kH == kernel_width, kernel_height
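Both snippets compute n as the kernel area times the number of output channels and then use sqrt(2 / n) as the standard deviation of a zero-mean normal. A tiny sketch of that number for one layer; the helper name is made up, and the 3x3 / 48-channel example is borrowed from the first conv layer described in the wiki post:

```python
import math


def conv_init_std(kernel_h, kernel_w, out_channels):
    """Standard deviation used by the quoted init: sqrt(2 / (kH * kW * out_channels))."""
    n = kernel_h * kernel_w * out_channels
    return math.sqrt(2.0 / n)


# 3x3 kernel with 48 output channels: n = 432
print(conv_init_std(3, 3, 48))
```

Because n uses the output channels rather than the input channels, this is the fan-out flavour of He initialization with a normal distribution, which differs from the HeUniform setting listed for Tiramisu training in the wiki post.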

(Brendan Fortuner) #12

I’m free to come to 101 Howard today. Anyone want to meet me there at 10/11?

Here are my training results. I used 40 layers, no reduction, no bottleneck, but I did include data augmentation.

My final error rate was 5.49% (loss 0.24). This is fairly close to the authors’ reported value of 5.24% (C10+).

(Brendan Fortuner) #13

In the Tiramisu paper they mention a Transposed Convolution with stride 2 in the Transition Up block. However, I’m not familiar with the UpSampling2D layer.

(Jeremy Howard) #14

@brendan I’ll be there a bit after 11.30. 5th floor?

We learnt about UpSampling2D when we discussed checkerboard artifacts in class a couple of weeks ago.

(Brendan Fortuner) #15

Sounds good see you then

(Jeremy Howard) #16

Finally managed to get away. Will be 25 mins

(Kent) #17

I looked into the Keras implementation of DenseNet checked in by @jeremy and noticed a few points that I would like to confirm:

  • The dense_block is not implemented as “direct connections from any layers to all subsequent layers”. Rather, each block is only concatenated with the block immediately preceding it. Is this on purpose?

The visualized result is like this:

I was expecting something a lot messier visually like this, i.e. all previous blocks are concatenating into the current block:

  • The concatenation is on the last axis, rather than on the feature map axis (i.e. the first axis). My understanding from the paper is that each layer adds K feature maps, implying that the concatenation is at axis 0. But I guess it does not matter too much because the same information is captured anyway regardless which axis is used.

  • I trained the freshly cloned model without any changes and found that my validation loss is a lot higher, and my accuracy lower, compared to Jeremy’s result in the notebook.

In comparison, here is Jeremy’s final result:

I will do a few more experiments and report back.

(Matthew Kleinsmith) #18

Let “sub-block” refer to a batchnorm-relu-conv-dropout sequence. Dense blocks contain many sub-blocks.

Each sub-block receives a concatenation of the input and output of the previous sub-block, but the key is that the input of the previous sub-block is also a concatenation. These nested concatenations keep everything connected.

Exception: This doesn’t apply to the first sub-block, since it has no previous sub-block.


Here’s the beginning of the network:

x: the original input
x0: the output of the initial conv layer. It’s the input to the first sub-block
x1: the output of the first sub-block
c1: the concatenation of x1 and x0. It’s the input to the second sub-block
x2: the output of the second sub-block
c2: the concatenation of x2 and c1. It’s the input to the third sub-block
x3: the output of the third sub-block
c3: the concatenation of x3 and c2. It’s the input to the fourth sub-block

Let “[ ]” mean a concatenation.

c3 == [x3, c2] == [x3, [x2, c1]] == [x3, [x2, [x1, x0]]] == [x3, x2, x1, x0]

And so, each sub-block is receiving the outputs of all previous sub-blocks.

Exception: This doesn’t apply to the first sub-block, since it has no previous sub-block.
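The flattening argument can be checked mechanically. This sketch tracks feature maps by label ("x0", "x1", ...) rather than as tensors, so it models only the bookkeeping of the nested concatenations; the function name is made up:

```python
def dense_block_inputs(n_sub_blocks):
    """Return what each sub-block receives, plus the final concatenation."""
    current = ["x0"]  # x0: output of the initial conv layer
    inputs = []
    for i in range(1, n_sub_blocks + 1):
        inputs.append(list(current))    # input to sub-block i
        current = [f"x{i}"] + current   # c_i = [x_i, c_(i-1)]
    return inputs, current


inputs, c3 = dense_block_inputs(3)
for i, seen in enumerate(inputs, start=1):
    print(f"sub-block {i} receives {seen}")
print("c3 =", c3)
```

Each sub-block only ever concatenates with its own input, yet the third sub-block already sees x2, x1 and x0, and c3 comes out as the flat list [x3, x2, x1, x0].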

Note on implementations:

Implementations of DenseNets naturally leave the main idea of the architecture implicit, by using nested concatenations. This is unfortunate. The main idea is what you said: “direct connections from any layers to all subsequent layers”, where “layers” means sub-blocks, and “subsequent layers” include only those within the given dense block. I’m working on my own implementation, but I’ve also failed to make the main idea of the architecture explicit in the code. I have bigger problems at the moment (my implementation isn’t working), but if I find a way to make the main idea explicit I’ll post it here.

How did you create this diagram? It’s good.

Axis 0 is the batch axis.

In TensorFlow / Keras+TensorFlow:

Axis -1 (the last axis) is the channel axis (a.k.a. the feature map axis).

In PyTorch / Theano / Keras+Theano / Torch:

Axis 1 (the second axis) is the channel axis (a.k.a. the feature map axis).
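To see why the axis matters, here is a shape-only sketch of concatenation. It does not use any framework; the function name and the example shapes (batch of 3, 224x224 crops, 48 existing feature maps plus 16 new ones) are illustrative:

```python
def concat_shape(shape_a, shape_b, axis):
    """Shape of concatenating two arrays along `axis` (negative axes allowed)."""
    if len(shape_a) != len(shape_b):
        raise ValueError("shapes must have the same number of dimensions")
    axis = axis % len(shape_a)  # normalise negative axes such as -1
    for i, (a, b) in enumerate(zip(shape_a, shape_b)):
        if i != axis and a != b:
            raise ValueError(f"dimension {i} differs: {a} vs {b}")
    out = list(shape_a)
    out[axis] += shape_b[axis]
    return tuple(out)


# Channels-first layout (batch, channels, height, width): concatenate on axis 1.
print(concat_shape((3, 48, 224, 224), (3, 16, 224, 224), axis=1))

# Channels-last layout (batch, height, width, channels): concatenate on axis -1.
print(concat_shape((3, 224, 224, 48), (3, 224, 224, 16), axis=-1))
```

Concatenating on axis 0 instead would only work when the channel counts already match, and it would grow the batch rather than the feature maps.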

(Kent) #19

Thanks for your long answers @Matthew !

This is how I created the visualized models. A link is generated after running the code below. Simply click it to open the generated graph. PS: You will need to install pydot3 and graphviz for it to work.

Yes, it makes sense! I also read the original Torch implementation, which illustrates exactly the same implementation.

(Matthew Kleinsmith) #20

Thank you!

You can use “```python” on a line before your code on the forums to highlight its syntax. On the line after your code you’d type “```”. Pasted code has the advantage of being copyable.

For example:

def add(x, y):
    return x + y



def add(x, y):
    return x + y

Sometimes the “python” string isn’t needed.