The U-Net architecture is monumental, and understanding it introduces you to many different areas of deep learning. So I’ve decided to dedicate an entire blog post to it.
U-Net is an architecture for semantic segmentation.
Semantic segmentation is the following task: given an image, classify each pixel as belonging to a class. It is like regular segmentation, with one difference: while regular segmentation typically looks at pixel intensities and other low-level image properties (edges, gradients, coherence, continuity, etc.), semantic segmentation is concerned with segmenting out parts of an image by what they mean, e.g. extracting the cat from an image.
It’s the process by which we can generate cool images, where we have an object “masked out” from its background.
U-Net’s key novel contribution is in its name! It uses a neural network architecture that “looks like a U”. That is, it downsamples the image and then upsamples it again. The output of the U-Net architecture is a K-channel output map (generally the same size as the input image), where K is the number of classes you want to segment out of the image.
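To make the K-channel output map concrete, here is a minimal sketch of how such a map is usually turned into a segmentation: take the highest-scoring channel at each pixel. The function name `logits_to_mask` and the toy scores are my own illustration, not part of U-Net itself.

```python
import numpy as np

def logits_to_mask(logits: np.ndarray) -> np.ndarray:
    """Turn a (K, H, W) score map into an (H, W) map of class indices."""
    # For every pixel, pick the channel (class) with the highest score.
    return np.argmax(logits, axis=0)

# Toy example (assumed values): K=2 classes on a 2x3 "image".
logits = np.array([
    [[2.0, 0.1, 0.3],
     [1.5, 0.2, 0.9]],   # scores for class 0 (background)
    [[0.5, 1.2, 0.8],
     [0.1, 2.2, 0.4]],   # scores for class 1 (object)
])
mask = logits_to_mask(logits)
print(mask)
```

Each entry of `mask` is the class index assigned to that pixel, so the mask has the same height and width as the score map.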
A note on upsampling: Upsampling is not something that you just “do” (unlike downsampling :P). Upsampling is a very difficult task: given an image at lower resolution, make it an image at higher resolution! This is an entire research field, and entire companies are dedicated to solving this type of problem (just like how I thought recommender systems were “just another topic”; boy was I wrong). Upsampling can be implemented in various ways: mode-filling, nearest-neighbour, simple bilinear interpolation, etc. (These are analogous to the ways we can fill in missing time-series data.) But the hot topic of the day is upsampling through learned weights in a neural network, variously known as upconvolution, deconvolution, fractionally-strided convolution and transpose convolution.
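As a point of comparison for the learned approach, here is a minimal sketch of the simplest non-learned option, nearest-neighbour upsampling: every pixel is just repeated. The helper name `upsample_nearest` is my own.

```python
import numpy as np

def upsample_nearest(image: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling of a 2D image by an integer factor."""
    # Repeat every row `factor` times, then every column `factor` times.
    rows_repeated = np.repeat(image, factor, axis=0)
    return np.repeat(rows_repeated, factor, axis=1)

small = np.array([[1, 2],
                  [3, 4]])
big = upsample_nearest(small, 2)
print(big)
```

No new information is created here: each pixel simply becomes a 2x2 block, which is why the result looks blocky and why learned upsampling is attractive.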
What is transpose convolution? Normally (without padding, or with a stride greater than 1), convolution reduces the spatial extent of our image, while expanding the number of filters/feature maps we get. The transpose convolution does the “transpose” operation: it expands the spatial extent of our image. Note that the name comes from the matrix formulation of convolution: if a convolution is written as a matrix multiplication, the expansion uses a matrix with the transposed shape (the same dimensions flipped, not necessarily the same weights). Transpose convolution can be implemented as simply as heavily zero-padding an image and then running a regular convolution on the padded input; the output is larger than our original image because of the copious padding we added.
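Here is a small sketch of that padding trick for a single channel, checked against the more direct view of transpose convolution in which every input pixel “stamps” a scaled copy of the kernel onto the output. All function names are my own, and this assumes no output cropping; real frameworks add channels, batching and extra padding options.

```python
import numpy as np

def conv2d_valid(x, k):
    """Regular 'valid' convolution (cross-correlation, as in deep learning)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def transpose_conv2d_scatter(x, k, stride=1):
    """Direct view: each input pixel stamps a scaled kernel onto the output."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * k
    return out

def transpose_conv2d_padded(x, k, stride=1):
    """Padding view: insert zeros, pad heavily, run a regular convolution."""
    kh, kw = k.shape
    h, w = x.shape
    # Insert (stride - 1) zeros between neighbouring input pixels.
    dilated = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    dilated[::stride, ::stride] = x
    # Zero-pad by (kernel size - 1) on every side.
    padded = np.pad(dilated, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    # Convolve with the kernel flipped in both spatial directions.
    return conv2d_valid(padded, k[::-1, ::-1])

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.arange(9.0).reshape(3, 3)
a = transpose_conv2d_scatter(x, k, stride=2)
b = transpose_conv2d_padded(x, k, stride=2)
print(a.shape, np.allclose(a, b))
```

With a 2x2 input, a 3x3 kernel and stride 2, both views produce the same 5x5 output, so the spatial extent has grown from 2 to 5 in each direction.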
Some applications: A semantic segmentation model can probably also be used for regular old segmentation. For instance, for the circle problem dataset we just need a U-Net with K=2, predicting both the circle and the background. Once we have this segmentation, we can either use RANSAC to fit a parametric circle, or compute the area numerically (from the pixel counts).
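The pixel-counting route can be sketched in a few lines. This is not RANSAC: it assumes a clean mask with a single filled circle, and the synthetic mask below stands in for a hypothetical K=2 U-Net prediction (the image size, centre and radius are made-up values).

```python
import numpy as np

# Synthetic stand-in for a predicted mask: 1 = circle, 0 = background.
h, w = 101, 101
true_cy, true_cx, true_r = 50, 40, 20
yy, xx = np.mgrid[:h, :w]
mask = ((yy - true_cy) ** 2 + (xx - true_cx) ** 2 <= true_r ** 2).astype(int)

# Area is simply the number of pixels labelled as circle.
area = int(mask.sum())

# Centre estimate: mean coordinate of the circle pixels.
ys, xs = np.nonzero(mask)
center = (ys.mean(), xs.mean())

# Radius estimate from area = pi * r^2.
radius = np.sqrt(area / np.pi)

print(area, center, radius)
```

On a noisy mask this estimate degrades quickly, since every mislabelled pixel counts toward the area; that is the case where a robust fit such as RANSAC earns its keep.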
Todo: Bottomless pit topics: these topics all seem at the same level of complexity, but then one of them actually is a bottomless pit of depth