Confused about segmentation tasks

I was trying out segmentation using the CamVid dataset, but I'm slightly confused about how everything works. I know that Jeremy has a notebook covering the 100 Layer Tiramisu, but I'm unsure about how the data should be set up in general.

1. What is the shape of the y (the labels we want to train on)?

I assumed that since the final convolution has 12 filters (it is a 1x1 convolution), the y 'image' should have shape (batch, height, width, 12), one category per filter. So I created a NumPy array of that shape, where each category channel is 1 wherever the pixel belongs to that class and 0 otherwise. Is this approach correct? If so, why did Jeremy use sparse categorical crossentropy?
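To make the difference concrete, here is a small sketch of the two target layouts I'm comparing (the sizes and the random label map are placeholders I made up, not values from the notebook):

```python
import numpy as np

# Illustrative sizes (my assumption): 224x224 crops, 12 CamVid classes
batch, height, width, n_classes = 8, 224, 224, 12

# Pretend label map: one integer class index per pixel (a CamVid mask boils down to this)
labels = np.random.randint(0, n_classes, size=(batch, height, width))

# What I built: one-hot targets of shape (batch, height, width, 12),
# i.e. channel c is 1 wherever the pixel belongs to class c -- pairs with categorical_crossentropy
y_onehot = np.eye(n_classes, dtype=np.float32)[labels]

# What sparse categorical crossentropy expects: the raw integer indices,
# shape (batch, height, width) -- same information, 12x less memory for the targets
y_sparse = labels.astype(np.int32)

print(y_onehot.shape, y_sparse.shape)  # (8, 224, 224, 12) (8, 224, 224)
```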

2. How much memory does it take?

I created a test model, starting from 64 filters (3x3 convolutions) and doubling until I reached 512 filters, then halving until I reached 12 filters. I used no max pooling or upsampling layers. This architecture gives me an out-of-memory error even with a batch size of 8. I am using a GTX 1070 with 8 GB of GPU memory.
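A rough back-of-the-envelope sketch (assuming 480x360 CamVid frames, float32 activations, and a guess at the layer widths, since the exact architecture above is only described loosely) suggests the forward activations alone already approach the card's limit:

```python
# Rough activation-memory estimate for a no-pooling model at full resolution.
# Assumptions (mine, not exact figures): 480x360 frames, float32 activations,
# batch size 8, and these layer widths -- adjust to the real architecture.
h, w, batch, bytes_per_float = 360, 480, 8, 4
filters = [64, 128, 256, 512, 256, 128, 64, 12]

# With no max pooling, every layer's output stays at h x w
total = sum(h * w * f * batch * bytes_per_float for f in filters)
print(f"forward activations alone: {total / 1e9:.1f} GB")  # roughly 7.9 GB, before gradients and weights
```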

Update:

I also tried lesson 14's large Tiramisu model and again ran out of memory. Was anybody able to get the large model working in 8 GB of VRAM? (The smaller 224x224 model works fine.)