Segmentation


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. 2D Convolution

tf.keras.layers.Conv2D(filters, kernel_size, strides, padding, activation, kernel_regularizer, input_shape)
    filters = 32
    kernel_size = (3,3)
    strides = (1,1)
    padding = 'SAME'
    activation = 'relu'
    kernel_regularizer = tf.keras.regularizers.l2(0.04)
    input_shape = tensor of shape ([input_h, input_w, input_ch])


  • filters
    • the number of output channels (i.e., the number of convolution filters).
  • kernel_size
    • the height and width of the 2D convolution window.
  • strides
    • the step size of the kernel when traversing the image.
  • padding
    • how the border of a sample is handled.
    • A padded convolution will keep the spatial output dimensions equal to the input, whereas an unpadded convolution will crop away some of the borders if the kernel is larger than 1.
    • 'SAME' : enable zero padding
    • 'VALID' : disable zero padding
  • activation
    • Activation function to use.
  • kernel_regularizer
    • Regularizer function applied to the kernel weights matrix.
  • input and output channels
    • A convolutional layer takes a certain number of input channels ($C$) and calculates a specific number of output channels ($D$).

Examples


input = [None, 4, 4, 1]
filter size = [3, 3, 1, 1]
strides = [1, 1, 1, 1]
padding = 'VALID'

input = [None, 5, 5, 1]
filter size = [3, 3, 1, 1]
strides = [1, 1, 1, 1]
padding = 'SAME'
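
For 'VALID' padding the output spatial size is $\lfloor (n - k)/s \rfloor + 1$ (here $(4-3)/1 + 1 = 2$, so the output is [None, 2, 2, 1]); for 'SAME' it is $\lceil n/s \rceil$ (here $5/1 = 5$, so the output stays [None, 5, 5, 1]). A minimal sketch, not from the original notes, that verifies both shapes:

import tensorflow as tf

# 4x4 input, 3x3 kernel, stride 1, 'VALID': (4 - 3)//1 + 1 = 2
x = tf.random.normal([1, 4, 4, 1])
y = tf.keras.layers.Conv2D(filters = 1, kernel_size = (3,3),
                           strides = (1,1), padding = 'VALID')(x)
print(y.shape)   # (1, 2, 2, 1)

# 5x5 input, 3x3 kernel, stride 1, 'SAME': ceil(5/1) = 5
x = tf.random.normal([1, 5, 5, 1])
y = tf.keras.layers.Conv2D(filters = 1, kernel_size = (3,3),
                           strides = (1,1), padding = 'SAME')(x)
print(y.shape)   # (1, 5, 5, 1)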

2. Transposed Convolution


  • The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution. For instance, one might use such a transformation as the decoding layer of a convolutional autoencoder or to project feature maps to a higher-dimensional space.
  • Some sources use the name deconvolution, which is inappropriate because it's not a deconvolution. To make things worse, deconvolutions do exist, but they're not common in the field of deep learning.

  • An actual deconvolution reverts the process of a convolution.

  • Imagine inputting an image into a single convolutional layer. Now take the output, throw it into a black box and out comes your original image again. This black box does a deconvolution. It is the mathematical inverse of what a convolutional layer does.

  • A transposed convolution is somewhat similar because it produces the same spatial resolution a hypothetical deconvolutional layer would. However, the actual mathematical operation that's being performed on the values is different.

  • A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.



tf.keras.layers.Conv2DTranspose(filters, kernel_size, strides, padding = 'SAME', activation)


    filters = number of output channels (e.g., 64)
    kernel_size = tensor of shape (3,3)
    strides = stride of the sliding window for each dimension of the input tensor
    padding = 'SAME'
    activation = activation function ('softmax', 'relu', ...)


  • 'SAME' : enable zero padding
  • 'VALID' : disable zero padding


An image of 5x5 is fed into a convolutional layer. The stride is set to 2, the padding is deactivated and the kernel is 3x3. This results in a 2x2 image.

2D convolution with no padding, stride of 2 and kernel of 3

If we wanted to reverse this process, we'd need the inverse mathematical operation so that 9 values are generated from each pixel we input. Afterward, we traverse the output image with a stride of 2. This would be a deconvolution.

A transposed convolution does not do that. The only thing in common is it guarantees that the output will be a 5x5 image as well, while still performing a normal convolution operation. To achieve this, we need to perform some fancy padding on the input.

Transposed 2D convolution with no padding, stride of 2 and kernel of 3

It merely reconstructs the spatial resolution from before and performs a convolution. This may not be the mathematical inverse, but for encoder-decoder architectures, it's still very helpful. This way we can combine the upscaling of an image with a convolution, instead of doing two separate processes.
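
A quick shape check, not from the original notes: with a 3x3 kernel, stride 2 and 'VALID' padding, a 2x2 input is upsampled back to 5x5, since $(2 - 1) \cdot 2 + 3 = 5$.

x = tf.random.normal([1, 2, 2, 1])
y = tf.keras.layers.Conv2DTranspose(filters = 1, kernel_size = (3,3),
                                    strides = (2,2), padding = 'VALID')(x)
print(y.shape)   # (1, 5, 5, 1)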

  • Another example of transposed convolution
Transposed 2D convolution with no padding, no stride and kernel of 3


Strides and padding for transposed convolution (optional)

3. Lab: Convolutional Autoencoder (CAE)





  • A transposed 2-D convolution layer upsamples feature maps.
  • This layer is sometimes incorrectly known as a "deconvolution" or "deconv" layer. This layer is the transpose of convolution and does not perform deconvolution.
In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/nTt_ajul8NY?start=725" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3.1. Import Library

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

3.2. Load MNIST Data

In [3]:
# Load Data

mnist = tf.keras.datasets.mnist
(train_imgs, train_labels), (test_imgs, test_labels) = mnist.load_data()

train_imgs, test_imgs = train_imgs/255.0, test_imgs/255.0
  • Only use (1, 5, 6) digits to visualize latent space in 2-D
In [4]:
# Use Only 1,5,6 Digits to Visualize

train_x = train_imgs[np.hstack([np.where(train_labels == 1), 
                                np.where(train_labels == 5), 
                                np.where(train_labels == 6)])][0]
train_y = train_labels[np.hstack([np.where(train_labels == 1),
                                  np.where(train_labels == 5),
                                  np.where(train_labels == 6)])][0]
test_x = test_imgs[np.hstack([np.where(test_labels == 1), 
                              np.where(test_labels == 5), 
                              np.where(test_labels == 6)])][0]
test_y = test_labels[np.hstack([np.where(test_labels == 1), 
                                np.where(test_labels == 5), 
                                np.where(test_labels == 6)])][0]
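
As an aside, the same selection can be written more compactly with np.isin; this is an equivalent alternative (it keeps the original sample order rather than grouping by digit), not what the notebook uses:

# alternative selection with a boolean mask
mask = np.isin(train_labels, [1, 5, 6])
train_x_alt, train_y_alt = train_imgs[mask], train_labels[mask]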
In [5]:
train_x = train_x.reshape(-1,28,28,1)
test_x = test_x.reshape(-1,28,28,1)

The following structure is implemented: (28, 28, 1) input → (14, 14, 32) → (7, 7, 64) → (1, 1, 2) latent, then decoded back to (28, 28, 1).

3.3. Build a Model

In [6]:
encoder = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 32, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),
    
    tf.keras.layers.Conv2D(filters = 64, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (14, 14, 32)),
    
    tf.keras.layers.Conv2D(filters = 2, 
                           kernel_size = (7,7),
                           padding = 'VALID',
                           input_shape = (7,7,64))
])





In [7]:
decoder = tf.keras.models.Sequential([
    tf.keras.layers.Conv2DTranspose(filters = 64, 
                                    kernel_size = (7,7),
                                    strides = (1,1), 
                                    activation = 'relu',
                                    padding = 'VALID',
                                    input_shape = (1, 1, 2)),

    tf.keras.layers.Conv2DTranspose(filters = 32, 
                                    kernel_size = (3,3),
                                    strides = (2,2), 
                                    activation = 'relu',
                                    padding = 'SAME',
                                    input_shape = (7, 7, 64)),

    tf.keras.layers.Conv2DTranspose(filters = 1, 
                                    kernel_size = (7,7),
                                    strides = (2,2),
                                    padding = 'SAME',
                                    input_shape = (14,14,32))
])
In [8]:
latent = encoder.output
result = decoder(latent)
In [9]:
model = tf.keras.Model(inputs = encoder.input, outputs = result)
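
As a quick sanity check (not in the original notebook), printing the model shapes confirms the 2-D bottleneck:

print(encoder.output_shape)   # (None, 1, 1, 2): a 2-D latent code per image
print(decoder.output_shape)   # (None, 28, 28, 1): reconstructed image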

3.4. Define Loss and Optimizer

In [10]:
model.compile(optimizer = 'adam',
              loss = 'mean_squared_error')

3.5. Define Optimization Configuration and Then Optimize

In [11]:
model.fit(train_x, train_x, epochs = 10)
Epoch 1/10
566/566 [==============================] - 4s 6ms/step - loss: 0.0422
Epoch 2/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0338
Epoch 3/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0321
Epoch 4/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0309
Epoch 5/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0301
Epoch 6/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0296
Epoch 7/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0292
Epoch 8/10
566/566 [==============================] - 4s 7ms/step - loss: 0.0289
Epoch 9/10
566/566 [==============================] - 4s 7ms/step - loss: 0.0287
Epoch 10/10
566/566 [==============================] - 4s 8ms/step - loss: 0.0283
Out[11]:
<tensorflow.python.keras.callbacks.History at 0x250698180f0>
In [12]:
test_img = test_x[[6]]
x_reconst = model.predict(test_img)

plt.figure(figsize = (10,8))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28,28), 'gray')
plt.title('Input image', fontsize = 15)
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(x_reconst.reshape(28,28), 'gray')
plt.title('Reconstructed image', fontsize = 15)
plt.axis('off')
plt.show()
In [13]:
idx = np.random.choice(test_y.shape[0], 500)
rnd_x, rnd_y = test_x[idx], test_y[idx]
In [14]:
rnd_latent = encoder.predict(rnd_x)
rnd_latent = rnd_latent.reshape(-1,2)

plt.figure(figsize = (10,10))
plt.scatter(rnd_latent[rnd_y == 1, 0], rnd_latent[rnd_y == 1, 1], label = '1')
plt.scatter(rnd_latent[rnd_y == 5, 0], rnd_latent[rnd_y == 5, 1], label = '5')
plt.scatter(rnd_latent[rnd_y == 6, 0], rnd_latent[rnd_y == 6, 1], label = '6')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(fontsize = 15)
plt.axis('equal')
plt.show()
In [15]:
new_latent = np.array([[2, -4]]).reshape(-1,1,1,2)

fake_img = decoder.predict(new_latent)

plt.figure(figsize = (16,7))
plt.subplot(1,2,1)
plt.scatter(rnd_latent[rnd_y == 1, 0], rnd_latent[rnd_y == 1, 1], label = '1')
plt.scatter(rnd_latent[rnd_y == 5, 0], rnd_latent[rnd_y == 5, 1], label = '5')
plt.scatter(rnd_latent[rnd_y == 6, 0], rnd_latent[rnd_y == 6, 1], label = '6')
plt.scatter(new_latent[:,:,:,0], new_latent[:,:,:,1], c = 'k', marker = 'o', s = 200, label = 'new data')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(loc = 2, fontsize = 12)
plt.axis('equal')
plt.subplot(1,2,2)
plt.imshow(fake_img.reshape(28,28), 'gray')
plt.title('Generated Fake Image', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.show()

4. Segmentation


  • The segmentation task is different from the classification task because it requires predicting a class for each pixel of the input image, instead of a single class for the whole input.
  • Classification needs to understand what is in the input (namely, the context).
  • However, in order to predict what is in the input for each pixel, segmentation needs to recover not only what is in the input, but also where.
  • Segmentation divides images into regions with different semantic categories; these semantic regions label and predict objects at the pixel level.

4.1. Fully Convolutional Networks (FCN)


  • An FCN is built only from locally connected layers, such as convolution, pooling and upsampling.
  • Note that no dense layer is used in this kind of architecture.
  • The network can work regardless of the original image size, without requiring any fixed number of units at any stage (see the sketch after this list).
  • To obtain a segmentation map (output), segmentation networks usually have 2 parts:
    • Downsampling path: capture semantic/contextual information
    • Upsampling path: recover spatial information
  • The downsampling path is used to extract and interpret the context (what), while the upsampling path is used to enable precise localization (where).
  • Furthermore, to fully recover the fine-grained spatial information lost in the pooling or downsampling layers, we often use skip connections.
  • Given a position in the spatial dimensions, the output along the channel dimension is the category prediction for the pixel at that location.
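
A minimal sketch, not part of the lab code, illustrating the size-agnostic property: since every layer is convolutional, the height and width can be left as None and the same model segments inputs of different sizes.

# toy all-convolutional network with unspecified spatial dimensions
inputs = tf.keras.Input(shape = (None, None, 3))
down = tf.keras.layers.Conv2D(8, (3,3), strides = (2,2),
                              padding = 'SAME', activation = 'relu')(inputs)
up = tf.keras.layers.Conv2DTranspose(2, (3,3), strides = (2,2),
                                     padding = 'SAME', activation = 'softmax')(down)
fcn_toy = tf.keras.Model(inputs, up)

print(fcn_toy(tf.zeros([1, 64, 64, 3])).shape)    # (1, 64, 64, 2)
print(fcn_toy(tf.zeros([1, 128, 96, 3])).shape)   # (1, 128, 96, 2)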

5. Lab: Segmentation

5.1. Segmented (Labeled) Images

Download data

In [16]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
In [17]:
train_imgs = np.load('./data_files/images_training.npy')
train_seg = np.load('./data_files/seg_training.npy')
test_imgs = np.load('./data_files/images_testing.npy')

n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]

print ("The number of training images : {}, shape : {}".format(n_train, train_imgs.shape))
print ("The number of segmented images : {}, shape : {}".format(n_train, train_seg.shape))
print ("The number of testing images : {}, shape : {}".format(n_test, test_imgs.shape))
The number of training images : 180, shape : (180, 224, 224, 3)
The number of segmented images : 180, shape : (180, 224, 224, 2)
The number of testing images : 27, shape : (27, 224, 224, 3)
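
The 2 channels of the segmentation targets form a per-pixel one-hot encoding (presumably channel 0 = background and channel 1 = object, matching the plots below). A hypothetical helper showing how such a target could be built from a binary (H, W) mask:

# hypothetical: convert a binary mask into the (H, W, 2) one-hot layout
def to_onehot(mask):
    return np.stack([1 - mask, mask], axis = -1).astype(np.float32)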
In [18]:
idx = np.random.randint(n_train)

plt.figure(figsize = (15,10))
plt.subplot(1,3,1)
plt.imshow(train_imgs[idx])
plt.axis('off')
plt.subplot(1,3,2)
plt.imshow(train_seg[idx][:,:,0])
plt.axis('off')
plt.subplot(1,3,3)
plt.imshow(train_seg[idx][:,:,1])
plt.axis('off')
plt.show()

5.2. From CAE to FCN


  • CAE
  • FCN
    • VGG16
    • Skip connections to fully recover the fine-grained spatial information lost in the pooling or downsampling layers

5.3. FCN with Transfer Learning

  • Utilize the VGG16 model for the encoder
In [19]:
model_type = tf.keras.applications.vgg16
base_model = model_type.VGG16()
base_model.trainable = False
base_model.summary()
Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000   
=================================================================
Total params: 138,357,544
Trainable params: 0
Non-trainable params: 138,357,544
_________________________________________________________________
  • tf.keras.layers is used to define the upsampling part

In [20]:
map5 = base_model.layers[-5].output   # block5_pool feature map, shape (7, 7, 512)

# sixth convolution layer
conv6 = tf.keras.layers.Conv2D(filters = 4096,
                               kernel_size = (7,7),
                               padding = 'SAME',
                               activation = 'relu')(map5)

# 1x1 convolution layers
fcn4 = tf.keras.layers.Conv2D(filters = 4096,
                              kernel_size = (1,1),
                              padding = 'SAME',
                              activation = 'relu')(conv6)

fcn3 = tf.keras.layers.Conv2D(filters = 2,
                              kernel_size = (1,1),
                              padding = 'SAME',
                              activation = 'relu')(fcn4)

# Upsampling layers
fcn2 =  tf.keras.layers.Conv2DTranspose(filters = 512,
                                        kernel_size = (4,4),
                                        strides = (2,2),
                                        padding = 'SAME')(fcn3)

# skip connection: add block4_pool features (14, 14, 512)
fcn1 =  tf.keras.layers.Conv2DTranspose(filters = 256,
                                        kernel_size = (4,4),
                                        strides = (2,2),
                                        padding = 'SAME')(fcn2 + base_model.layers[14].output)

# skip connection: add block3_pool features (28, 28, 256)
output =  tf.keras.layers.Conv2DTranspose(filters = 2,
                                          kernel_size = (16,16),
                                          strides = (8,8),
                                          padding = 'SAME',
                                          activation = 'softmax')(fcn1 + base_model.layers[10].output)

model = tf.keras.Model(inputs = base_model.inputs, outputs = output)
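
Note that adding two Keras tensors with + creates the TFOpLambda layers visible in the summary below; an equivalent, more explicit form of one skip connection (an alternative, not what the notebook uses) would be:

skip4 = tf.keras.layers.Add()([fcn2, base_model.layers[14].output])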
In [21]:
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
block1_conv1 (Conv2D)           (None, 224, 224, 64) 1792        input_1[0][0]                    
__________________________________________________________________________________________________
block1_conv2 (Conv2D)           (None, 224, 224, 64) 36928       block1_conv1[0][0]               
__________________________________________________________________________________________________
block1_pool (MaxPooling2D)      (None, 112, 112, 64) 0           block1_conv2[0][0]               
__________________________________________________________________________________________________
block2_conv1 (Conv2D)           (None, 112, 112, 128 73856       block1_pool[0][0]                
__________________________________________________________________________________________________
block2_conv2 (Conv2D)           (None, 112, 112, 128 147584      block2_conv1[0][0]               
__________________________________________________________________________________________________
block2_pool (MaxPooling2D)      (None, 56, 56, 128)  0           block2_conv2[0][0]               
__________________________________________________________________________________________________
block3_conv1 (Conv2D)           (None, 56, 56, 256)  295168      block2_pool[0][0]                
__________________________________________________________________________________________________
block3_conv2 (Conv2D)           (None, 56, 56, 256)  590080      block3_conv1[0][0]               
__________________________________________________________________________________________________
block3_conv3 (Conv2D)           (None, 56, 56, 256)  590080      block3_conv2[0][0]               
__________________________________________________________________________________________________
block3_pool (MaxPooling2D)      (None, 28, 28, 256)  0           block3_conv3[0][0]               
__________________________________________________________________________________________________
block4_conv1 (Conv2D)           (None, 28, 28, 512)  1180160     block3_pool[0][0]                
__________________________________________________________________________________________________
block4_conv2 (Conv2D)           (None, 28, 28, 512)  2359808     block4_conv1[0][0]               
__________________________________________________________________________________________________
block4_conv3 (Conv2D)           (None, 28, 28, 512)  2359808     block4_conv2[0][0]               
__________________________________________________________________________________________________
block4_pool (MaxPooling2D)      (None, 14, 14, 512)  0           block4_conv3[0][0]               
__________________________________________________________________________________________________
block5_conv1 (Conv2D)           (None, 14, 14, 512)  2359808     block4_pool[0][0]                
__________________________________________________________________________________________________
block5_conv2 (Conv2D)           (None, 14, 14, 512)  2359808     block5_conv1[0][0]               
__________________________________________________________________________________________________
block5_conv3 (Conv2D)           (None, 14, 14, 512)  2359808     block5_conv2[0][0]               
__________________________________________________________________________________________________
block5_pool (MaxPooling2D)      (None, 7, 7, 512)    0           block5_conv3[0][0]               
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 7, 7, 4096)   102764544   block5_pool[0][0]                
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 7, 7, 4096)   16781312    conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 7, 7, 2)      8194        conv2d_4[0][0]                   
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 14, 14, 512)  16896       conv2d_5[0][0]                   
__________________________________________________________________________________________________
tf.__operators__.add (TFOpLambd (None, 14, 14, 512)  0           conv2d_transpose_3[0][0]         
                                                                 block4_pool[0][0]                
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 28, 28, 256)  2097408     tf.__operators__.add[0][0]       
__________________________________________________________________________________________________
tf.__operators__.add_1 (TFOpLam (None, 28, 28, 256)  0           conv2d_transpose_4[0][0]         
                                                                 block3_pool[0][0]                
__________________________________________________________________________________________________
conv2d_transpose_5 (Conv2DTrans (None, 224, 224, 2)  131074      tf.__operators__.add_1[0][0]     
==================================================================================================
Total params: 136,514,116
Trainable params: 121,799,428
Non-trainable params: 14,714,688
__________________________________________________________________________________________________
In [22]:
model.compile(optimizer = 'adam',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])
In [23]:
model.fit(train_imgs, train_seg, batch_size = 5, epochs = 5)
Epoch 1/5
36/36 [==============================] - 63s 2s/step - loss: 0.5721 - accuracy: 0.8566
Epoch 2/5
36/36 [==============================] - 57s 2s/step - loss: 0.2491 - accuracy: 0.9080
Epoch 3/5
36/36 [==============================] - 59s 2s/step - loss: 0.2136 - accuracy: 0.9181
Epoch 4/5
36/36 [==============================] - 66s 2s/step - loss: 0.2021 - accuracy: 0.9219
Epoch 5/5
36/36 [==============================] - 61s 2s/step - loss: 0.1997 - accuracy: 0.9226
Out[23]:
<tensorflow.python.keras.callbacks.History at 0x250067f8898>
In [24]:
test_x = test_imgs[[1]]
test_seg = model.predict(test_x)

seg_mask = (test_seg[:,:,:,1] > 0.5).reshape(224, 224).astype(float)

plt.figure(figsize = (14,14))
plt.subplot(2,2,1)
plt.imshow(test_x[0])
plt.axis('off')
plt.subplot(2,2,2)
plt.imshow(seg_mask, cmap = 'Blues')
plt.axis('off')
plt.subplot(2,2,3)
plt.imshow(test_x[0])
plt.imshow(seg_mask, cmap = 'Blues', alpha = 0.5)
plt.axis('off')
plt.show() 
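
Since the two softmax outputs sum to 1 at every pixel, thresholding channel 1 at 0.5 makes the same per-pixel decision as an argmax over the channel axis; an equivalent alternative:

# equivalent per-pixel class decision via argmax over the channels
seg_mask_alt = np.argmax(test_seg, axis = -1).reshape(224, 224).astype(float)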