Semantic segmentation with U-Net - Class review

Last updated on:9 days ago

Semantic image segmentation assigns a class label to every pixel in an image.

Semantic segmentation with U-Net

Semantic segmentation: locating objects in an image by predicting, for each pixel, the class it belongs to.

Motivation for U-Net

per-pixel class labels

Output: segmentation map

U-Net uses an equal number of convolutional blocks and transposed convolutions for down-sampling and up-sampling.

Deep learning for semantic segmentation

Output: $h \times w \times n$, where n = number of output classes
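As a small illustration of how an $h \times w \times n$ output becomes a segmentation map, the sketch below takes the highest-scoring class at each pixel. The function name `segmentation_map` and the toy scores are made up for this example, and plain nested lists stand in for a real tensor.

```python
def segmentation_map(scores):
    """Turn h x w x n per-pixel class scores into an h x w map of class indices.

    `scores` is a nested list: scores[i][j][k] is the score of class k at pixel (i, j).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# Toy 2 x 2 image with n = 3 classes (values are invented for illustration).
scores = [
    [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]
seg = segmentation_map(scores)
print(seg)  # [[1, 0], [2, 1]]
```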

Transpose convolution

Output dimension: $(s(n_h - 1) + f_h - 2p) \times (s(n_w - 1) + f_w - 2p)$, where $n_h \times n_w$ is the input size (excluding the channel dimension), $f_h \times f_w$ is the kernel size, $p$ is the padding, and $s$ is the stride.

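Motivation: turn a small input into a bigger output. Each input value scales the kernel, and the scaled copies are placed into the output one stride apart. Values that fall in the padding region are ignored, and values that overlap are added together.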
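The procedure can be sketched for a single-channel 2D input in plain Python. This is a minimal illustration under stated assumptions, not a library implementation: `transposed_conv2d` is a hypothetical helper, the same padding is applied on all sides, and there is no bias or extra output padding. The output size follows the formula $s(n - 1) + f - 2p$.

```python
def transposed_conv2d(x, kernel, stride=1, padding=0):
    """Single-channel 2D transposed convolution on nested lists (illustrative only)."""
    n_h, n_w = len(x), len(x[0])
    f_h, f_w = len(kernel), len(kernel[0])

    # Size before the padding region is cropped away: s(n - 1) + f.
    full_h = stride * (n_h - 1) + f_h
    full_w = stride * (n_w - 1) + f_w
    full = [[0.0] * full_w for _ in range(full_h)]

    # Each input value scales the kernel; copies are placed one stride apart
    # and overlapping values are added together.
    for i in range(n_h):
        for j in range(n_w):
            for a in range(f_h):
                for b in range(f_w):
                    full[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]

    # Ignore the padding region, leaving s(n - 1) + f - 2p per dimension.
    return [row[padding:full_w - padding] for row in full[padding:full_h - padding]]


x = [[1, 2], [3, 4]]
ones_2x2 = [[1, 1], [1, 1]]

# Stride 2, no overlap: 2 x 2 input becomes 2*(2-1) + 2 = 4 per side.
print(transposed_conv2d(x, ones_2x2, stride=2))

# Stride 1: neighbouring kernel copies overlap and are summed (3 x 3 output).
print(transposed_conv2d(x, ones_2x2, stride=1))
```

With stride 1 the centre of the output receives a contribution from all four input values, which is the "add overlap values together" step.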

U-Net Architecture

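Skip connections pass encoder feature maps directly to the decoder layers of matching resolution. In U-Net they are used to prevent the loss of border pixel information and to reduce overfitting. Drawn as a diagram, the architecture looks like a U.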
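To make the U shape concrete, the sketch below traces tensor shapes through a hypothetical U-Net. The assumptions are mine, not from the course material: convolutions keep the spatial size, each pooling step halves it, each transposed convolution doubles it, channel counts double at every encoder level starting from `base_ch`, and skip connections are concatenated along the channel axis. Shapes are written as (channels, height, width), and `unet_shapes` is an illustrative name.

```python
def unet_shapes(h, w, base_ch=64, depth=4, n_classes=3):
    """Trace (channels, height, width) through a hypothetical U-Net.

    Assumes h and w are divisible by 2**depth so that every skip connection
    matches the resolution of the decoder level it feeds.
    """
    trace, skips = [], []
    ch = base_ch

    # Encoder: convolution block, remember the feature map, then down-sample.
    for level in range(depth):
        trace.append((f"enc{level} conv", (ch, h, w)))
        skips.append((ch, h, w))
        h, w = h // 2, w // 2
        trace.append((f"enc{level} pool", (ch, h, w)))
        ch *= 2

    trace.append(("bottleneck conv", (ch, h, w)))

    # Decoder: up-sample, concatenate the matching skip, then convolve.
    for level in reversed(range(depth)):
        ch //= 2
        h, w = h * 2, w * 2
        trace.append((f"dec{level} upsample", (ch, h, w)))
        skip_ch, skip_h, skip_w = skips.pop()
        assert (skip_h, skip_w) == (h, w), "skip and decoder resolutions differ"
        trace.append((f"dec{level} concat skip", (ch + skip_ch, h, w)))
        trace.append((f"dec{level} conv", (ch, h, w)))

    # Final 1x1 convolution maps features to one score per class per pixel.
    trace.append(("1x1 conv output", (n_classes, h, w)))
    return trace


for name, shape in unet_shapes(128, 128):
    print(f"{name:20s} {shape}")
```

The trace has as many down-sampling steps as up-sampling steps, and the final layer restores the input resolution with one channel per class.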

align_corners in segmentation

Here is a simple illustration I made showing how a $4\times4$ image is upsampled to $8\times8$.
When align_corners=True, pixels are regarded as a grid of points.
Points at the corners are aligned.
When align_corners=False, pixels are regarded as $1\times1$ areas.
Area boundaries, rather than their centers, are aligned.
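The two conventions can be written as coordinate mappings from an output index back to a (fractional) input position. The sketch below uses the commonly described formulas for one dimension. `source_coords` is a hypothetical helper, and real library implementations may additionally clamp coordinates at the image borders, which is not modelled here.

```python
def source_coords(in_size, out_size, align_corners):
    """Input-space coordinate sampled by each output index along one dimension."""
    coords = []
    for d in range(out_size):
        if align_corners:
            # Pixels as points: the first and last points of both grids coincide.
            coords.append(d * (in_size - 1) / (out_size - 1))
        else:
            # Pixels as 1x1 areas: the outer area boundaries coincide,
            # so pixel centers are offset by half a pixel.
            coords.append((d + 0.5) * in_size / out_size - 0.5)
    return coords


# Upsampling 4 -> 8 along one axis, as in the illustration.
print(source_coords(4, 8, align_corners=True))
print(source_coords(4, 8, align_corners=False))
```

Under these formulas, the 4 to 8 case with align_corners=False steps through the input by exactly 1/2 per output pixel with a constant offset of -0.25. With align_corners=True the step is 3/7, so the offset relative to an exact factor-of-two grid grows with position.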

I had an issue with my U-Net not being equivariant to translations. It turns out align_corners=True was the culprit. As your figure clearly shows, in the “True” case there is a shift between the input and output grids that depends on location. This adds spatial bias in U-Nets. Your picture helped me see this fast.

align_corners=False is better in PASCAL VOC segmentation. (Deeplab, rescon, reslink)

Reference

[1] Deeplearning.ai, Convolutional Neural Networks

[2] What we should use align_corners