Segmentation


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

1. 2D Convolution

tf.keras.layers.Conv2D(filters, kernel_size, strides, padding, activation, kernel_regularizer, input_shape)
    filters = 32
    kernel_size = (3,3)
    strides = (1,1)
    padding = 'SAME'
    activation = 'relu'
    kernel_regularizer = tf.keras.regularizers.l2(0.04)
    input_shape = tensor of shape ([input_h, input_w, input_ch])


  • filters
    • the number of output channels (i.e., the number of convolution filters).
  • kernel_size
    • the height and width of the 2D convolution window.
  • strides
    • the step size of the kernel when traversing the image.
  • padding
    • how the border of a sample is handled.
    • A padded convolution will keep the spatial output dimensions equal to the input, whereas an unpadded convolution will crop away some of the borders if the kernel is larger than 1.
    • 'SAME' : enable zero padding
    • 'VALID' : disable zero padding
  • activation
    • Activation function to use.
  • kernel_regularizer
    • Regularizer function applied to the kernel weights matrix.
  • input and output channels
    • A convolutional layer takes a certain number of input channels ($C$) and calculates a specific number of output channels ($D$).

Examples


input = [None, 4, 4, 1]
filter size = [3, 3, 1, 1]
strides = [1, 1, 1, 1]
padding = 'VALID'

input = [None, 5, 5, 1]
filter size = [3, 3, 1, 1]
strides = [1, 1, 1, 1]
padding = 'SAME'
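
For 'VALID' padding the output spatial size is $\lfloor (n - k)/s \rfloor + 1$ (here $(4-3)/1 + 1 = 2$, so the output is [None, 2, 2, 1]); for 'SAME' it is $\lceil n/s \rceil$ (here $5/1 = 5$, so the output stays [None, 5, 5, 1]). A minimal sketch, not from the original notes, that verifies both shapes:

import tensorflow as tf

# 4x4 input, 3x3 kernel, stride 1, 'VALID': (4 - 3)//1 + 1 = 2
x = tf.random.normal([1, 4, 4, 1])
y = tf.keras.layers.Conv2D(filters = 1, kernel_size = (3,3),
                           strides = (1,1), padding = 'VALID')(x)
print(y.shape)   # (1, 2, 2, 1)

# 5x5 input, 3x3 kernel, stride 1, 'SAME': ceil(5/1) = 5
x = tf.random.normal([1, 5, 5, 1])
y = tf.keras.layers.Conv2D(filters = 1, kernel_size = (3,3),
                           strides = (1,1), padding = 'SAME')(x)
print(y.shape)   # (1, 5, 5, 1)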

2. Transposed Convolution


  • The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input while maintaining a connectivity pattern that is compatible with said convolution. For instance, one might use such a transformation as the decoding layer of a convolutional autoencoder or to project feature maps to a higher-dimensional space.
  • Some sources use the name deconvolution, which is inappropriate because it's not a deconvolution. To make things worse, deconvolutions do exist, but they're not common in the field of deep learning.

  • An actual deconvolution reverts the process of a convolution.

  • Imagine inputting an image into a single convolutional layer. Now take the output, throw it into a black box and out comes your original image again. This black box does a deconvolution. It is the mathematical inverse of what a convolutional layer does.

  • A transposed convolution is somewhat similar because it produces the same spatial resolution a hypothetical deconvolutional layer would. However, the actual mathematical operation that's being performed on the values is different.

  • A transposed convolutional layer carries out a regular convolution but reverts its spatial transformation.



tf.keras.layers.Conv2DTranspose(filters, kernel_size, strides, padding = 'SAME', activation)


    filters = number of output channels (e.g., 64)
    kernel_size = tensor of shape (3,3)
    strides = stride of the sliding window for each dimension of the input tensor
    padding = 'SAME'
    activation = activation function ('softmax', 'relu', ...)


  • 'SAME' : enable zero padding
  • 'VALID' : disable zero padding


An image of 5x5 is fed into a convolutional layer. The stride is set to 2, the padding is deactivated and the kernel is 3x3. This results in a 2x2 image.

2D convolution with no padding, stride of 2 and kernel of 3

If we wanted to reverse this process, we'd need the inverse mathematical operation so that 9 values are generated from each pixel we input. Afterward, we traverse the output image with a stride of 2. This would be a deconvolution.

A transposed convolution does not do that. The only thing in common is it guarantees that the output will be a 5x5 image as well, while still performing a normal convolution operation. To achieve this, we need to perform some fancy padding on the input.

Transposed 2D convolution with no padding, stride of 2 and kernel of 3

It merely reconstructs the spatial resolution from before and performs a convolution. This may not be the mathematical inverse, but for encoder-decoder architectures, it's still very helpful. This way we can combine the upscaling of an image with a convolution, instead of doing two separate processes.
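
A quick shape check, not from the original notes: with a 3x3 kernel, stride 2 and 'VALID' padding, a 2x2 input is upsampled back to 5x5, since $(2 - 1) \cdot 2 + 3 = 5$.

x = tf.random.normal([1, 2, 2, 1])
y = tf.keras.layers.Conv2DTranspose(filters = 1, kernel_size = (3,3),
                                    strides = (2,2), padding = 'VALID')(x)
print(y.shape)   # (1, 5, 5, 1)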

  • Another example of transposed convolution
Transposed 2D convolution with no padding, no stride and kernel of 3


Strides and padding for transposed convolution (optional)

3. Lab: Convolutional Autoencoder (CAE)





  • A transposed 2-D convolution layer upsamples feature maps.
  • This layer is sometimes incorrectly known as a "deconvolution" or "deconv" layer. This layer is the transpose of convolution and does not perform deconvolution.
In [1]:
%%html
<center><iframe src="https://www.youtube.com/embed/nTt_ajul8NY?start=725" 
width="560" height="315" frameborder="0" allowfullscreen></iframe></center>

3.1. Import Library

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

3.2. Load MNIST Data

In [3]:
# Load Data

mnist = tf.keras.datasets.mnist
(train_imgs, train_labels), (test_imgs, test_labels) = mnist.load_data()

train_imgs, test_imgs = train_imgs/255.0, test_imgs/255.0
  • Only use (1, 5, 6) digits to visualize latent space in 2-D
In [4]:
# Use Only 1,5,6 Digits to Visualize

train_x = train_imgs[np.hstack([np.where(train_labels == 1), 
                                np.where(train_labels == 5), 
                                np.where(train_labels == 6)])][0]
train_y = train_labels[np.hstack([np.where(train_labels == 1),
                                  np.where(train_labels == 5),
                                  np.where(train_labels == 6)])][0]
test_x = test_imgs[np.hstack([np.where(test_labels == 1), 
                              np.where(test_labels == 5), 
                              np.where(test_labels == 6)])][0]
test_y = test_labels[np.hstack([np.where(test_labels == 1), 
                                np.where(test_labels == 5), 
                                np.where(test_labels == 6)])][0]
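
As an aside, the same selection can be written more compactly with np.isin; this is an equivalent alternative (it keeps the original sample order rather than grouping by digit), not what the notebook uses:

# alternative selection with a boolean mask
mask = np.isin(train_labels, [1, 5, 6])
train_x_alt, train_y_alt = train_imgs[mask], train_labels[mask]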
In [5]:
train_x = train_x.reshape(-1,28,28,1)
test_x = test_x.reshape(-1,28,28,1)

The following structure is implemented: (28, 28, 1) input → (14, 14, 32) → (7, 7, 64) → (1, 1, 2) latent, then decoded back to (28, 28, 1).

3.3. Build a Model

In [6]:
encoder = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(filters = 32, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (28, 28, 1)),
    
    tf.keras.layers.Conv2D(filters = 64, 
                           kernel_size = (3,3), 
                           strides = (2,2), 
                           activation = 'relu',
                           padding = 'SAME',
                           input_shape = (14, 14, 32)),
    
    tf.keras.layers.Conv2D(filters = 2, 
                           kernel_size = (7,7),
                           padding = 'VALID',
                           input_shape = (7,7,64))
])





In [7]:
decoder = tf.keras.models.Sequential([
    tf.keras.layers.Conv2DTranspose(filters = 64, 
                                    kernel_size = (7,7),
                                    strides = (1,1), 
                                    activation = 'relu',
                                    padding = 'VALID',
                                    input_shape = (1, 1, 2)),

    tf.keras.layers.Conv2DTranspose(filters = 32, 
                                    kernel_size = (3,3),
                                    strides = (2,2), 
                                    activation = 'relu',
                                    padding = 'SAME',
                                    input_shape = (7, 7, 64)),

    tf.keras.layers.Conv2DTranspose(filters = 1, 
                                    kernel_size = (7,7),
                                    strides = (2,2),
                                    padding = 'SAME',
                                    input_shape = (14,14,32))
])
In [8]:
latent = encoder.output
result = decoder(latent)
In [9]:
model = tf.keras.Model(inputs = encoder.input, outputs = result)
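
As a quick sanity check (not in the original notebook), printing the model shapes confirms the 2-D bottleneck:

print(encoder.output_shape)   # (None, 1, 1, 2): a 2-D latent code per image
print(decoder.output_shape)   # (None, 28, 28, 1): reconstructed image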

3.4. Define Loss and Optimizer

In [10]:
model.compile(optimizer = 'adam',
              loss = 'mean_squared_error')

3.5. Define Optimization Configuration and Then Optimize

In [11]:
model.fit(train_x, train_x, epochs = 10)
Epoch 1/10
566/566 [==============================] - 4s 6ms/step - loss: 0.0422
Epoch 2/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0338
Epoch 3/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0321
Epoch 4/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0309
Epoch 5/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0301
Epoch 6/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0296
Epoch 7/10
566/566 [==============================] - 3s 6ms/step - loss: 0.0292
Epoch 8/10
566/566 [==============================] - 4s 7ms/step - loss: 0.0289
Epoch 9/10
566/566 [==============================] - 4s 7ms/step - loss: 0.0287
Epoch 10/10
566/566 [==============================] - 4s 8ms/step - loss: 0.0283
Out[11]:
<tensorflow.python.keras.callbacks.History at 0x250698180f0>
In [12]:
test_img = test_x[[6]]
x_reconst = model.predict(test_img)

plt.figure(figsize = (10,8))
plt.subplot(1,2,1)
plt.imshow(test_img.reshape(28,28), 'gray')
plt.title('Input image', fontsize = 15)
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(x_reconst.reshape(28,28), 'gray')
plt.title('Reconstructed image', fontsize = 15)
plt.axis('off')
plt.show()
In [13]:
idx = np.random.choice(test_y.shape[0], 500)
rnd_x, rnd_y = test_x[idx], test_y[idx]
In [14]:
rnd_latent = encoder.predict(rnd_x)
rnd_latent = rnd_latent.reshape(-1,2)

plt.figure(figsize = (10,10))
plt.scatter(rnd_latent[rnd_y == 1, 0], rnd_latent[rnd_y == 1, 1], label = '1')
plt.scatter(rnd_latent[rnd_y == 5, 0], rnd_latent[rnd_y == 5, 1], label = '5')
plt.scatter(rnd_latent[rnd_y == 6, 0], rnd_latent[rnd_y == 6, 1], label = '6')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(fontsize = 15)
plt.axis('equal')
plt.show()
In [15]:
new_latent = np.array([[2, -4]]).reshape(-1,1,1,2)

fake_img = decoder.predict(new_latent)

plt.figure(figsize = (16,7))
plt.subplot(1,2,1)
plt.scatter(rnd_latent[rnd_y == 1, 0], rnd_latent[rnd_y == 1, 1], label = '1')
plt.scatter(rnd_latent[rnd_y == 5, 0], rnd_latent[rnd_y == 5, 1], label = '5')
plt.scatter(rnd_latent[rnd_y == 6, 0], rnd_latent[rnd_y == 6, 1], label = '6')
plt.scatter(new_latent[:,:,:,0], new_latent[:,:,:,1], c = 'k', marker = 'o', s = 200, label = 'new data')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(loc = 2, fontsize = 12)
plt.axis('equal')
plt.subplot(1,2,2)
plt.imshow(fake_img.reshape(28,28), 'gray')
plt.title('Generated Fake Image', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.show()

4. Segmentation


  • The segmentation task is different from the classification task because it requires predicting a class for each pixel of the input image, instead of a single class for the whole input.
  • Classification needs to understand what is in the input (namely, the context).
  • However, in order to predict what is in the input for each pixel, segmentation needs to recover not only what is in the input, but also where.
  • Segmentation divides images into regions with different semantic categories; these semantic regions label and predict objects at the pixel level.

4.1. Fully Convolutional Networks (FCN)


  • An FCN is built only from locally connected layers, such as convolution, pooling and upsampling.
  • Note that no dense layer is used in this kind of architecture.
  • The network can work regardless of the original image size, without requiring any fixed number of units at any stage (see the sketch after this list).
  • To obtain a segmentation map (output), segmentation networks usually have 2 parts:
    • Downsampling path: capture semantic/contextual information
    • Upsampling path: recover spatial information
  • The downsampling path is used to extract and interpret the context (what), while the upsampling path is used to enable precise localization (where).
  • Furthermore, to fully recover the fine-grained spatial information lost in the pooling or downsampling layers, we often use skip connections.
  • Given a position in the spatial dimensions, the output along the channel dimension is the category prediction for the pixel at that location.
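
A minimal sketch, not part of the lab code, illustrating the size-agnostic property: since every layer is convolutional, the height and width can be left as None and the same model segments inputs of different sizes.

# toy all-convolutional network with unspecified spatial dimensions
inputs = tf.keras.Input(shape = (None, None, 3))
down = tf.keras.layers.Conv2D(8, (3,3), strides = (2,2),
                              padding = 'SAME', activation = 'relu')(inputs)
up = tf.keras.layers.Conv2DTranspose(2, (3,3), strides = (2,2),
                                     padding = 'SAME', activation = 'softmax')(down)
fcn_toy = tf.keras.Model(inputs, up)

print(fcn_toy(tf.zeros([1, 64, 64, 3])).shape)    # (1, 64, 64, 2)
print(fcn_toy(tf.zeros([1, 128, 96, 3])).shape)   # (1, 128, 96, 2)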

5. Lab: Segmentation

5.1. Segmented (Labeled) Images

Download data

In [16]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
In [17]:
train_imgs = np.load('./data_files/images_training.npy')
train_seg = np.load('./data_files/seg_training.npy')
test_imgs = np.load('./data_files/images_testing.npy')

n_train = train_imgs.shape[0]
n_test = test_imgs.shape[0]

print ("The number of training images : {}, shape : {}".format(n_train, train_imgs.shape))
print ("The number of segmented images : {}, shape : {}".format(n_train, train_seg.shape))
print ("The number of testing images : {}, shape : {}".format(n_test, test_imgs.shape))
The number of training images : 180, shape : (180, 224, 224, 3)
The number of segmented images : 180, shape : (180, 224, 224, 2)
The number of testing images : 27, shape : (27, 224, 224, 3)
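
The 2 channels of the segmentation targets form a per-pixel one-hot encoding (presumably channel 0 = background and channel 1 = object, matching the plots below). A hypothetical helper showing how such a target could be built from a binary (H, W) mask:

# hypothetical: convert a binary mask into the (H, W, 2) one-hot layout
def to_onehot(mask):
    return np.stack([1 - mask, mask], axis = -1).astype(np.float32)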
In [18]:
idx = np.random.randint(n_train)

plt.figure(figsize = (15,10))
plt.subplot(1,3,1)
plt.imshow(train_imgs[idx])
plt.axis('off')
plt.subplot(1,3,2)
plt.imshow(train_seg[idx][:,:,0])
plt.axis('off')
plt.subplot(1,3,3)
plt.imshow(train_seg[idx][:,:,1])
plt.axis('off')
plt.show()

5.2. From CAE to FCN


  • CAE
  • FCN
    • VGG16
    • Skip connections to fully recover the fine-grained spatial information lost in the pooling or downsampling layers

5.3. FCN with Transfer Learning

  • Utilize the VGG16 model for the encoder
In [19]:
model_type = tf.keras.applications.vgg16
base_model = model_type.VGG16()
base_model.trainable = False
base_model.summary()
Model: "vgg16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 224, 224, 3)]     0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000   
=================================================================
Total params: 138,357,544
Trainable params: 0
Non-trainable params: 138,357,544
_________________________________________________________________
  • tf.keras.layers is used to define the upsampling part

In [20]:
map5 = base_model.layers[-5].output   # block5_pool feature map, shape (7, 7, 512)

# sixth convolution layer
conv6 = tf.keras.layers.Conv2D(filters = 4096,
                               kernel_size = (7,7),
                               padding = 'SAME',
                               activation = 'relu')(map5)

# 1x1 convolution layers
fcn4 = tf.keras.layers.Conv2D(filters = 4096,
                              kernel_size = (1,1),
                              padding = 'SAME',
                              activation = 'relu')(conv6)

fcn3 = tf.keras.layers.Conv2D(filters = 2,
                              kernel_size = (1,1),
                              padding = 'SAME',
                              activation = 'relu')(fcn4)

# Upsampling layers
fcn2 =  tf.keras.layers.Conv2DTranspose(filters = 512,
                                        kernel_size = (4,4),
                                        strides = (2,2),
                                        padding = 'SAME')(fcn3)

# skip connection: add block4_pool features (14, 14, 512)
fcn1 =  tf.keras.layers.Conv2DTranspose(filters = 256,
                                        kernel_size = (4,4),
                                        strides = (2,2),
                                        padding = 'SAME')(fcn2 + base_model.layers[14].output)

# skip connection: add block3_pool features (28, 28, 256)
output =  tf.keras.layers.Conv2DTranspose(filters = 2,
                                          kernel_size = (16,16),
                                          strides = (8,8),
                                          padding = 'SAME',
                                          activation = 'softmax')(fcn1 + base_model.layers[10].output)

model = tf.keras.Model(inputs = base_model.inputs, outputs = output)
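
Note that adding two Keras tensors with + creates the TFOpLambda layers visible in the summary below; an equivalent, more explicit form of one skip connection (an alternative, not what the notebook uses) would be:

skip4 = tf.keras.layers.Add()([fcn2, base_model.layers[14].output])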
In [21]:
model.summary()
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 224, 224, 3) 0                                            
__________________________________________________________________________________________________
block1_conv1 (Conv2D)           (None, 224, 224, 64) 1792        input_1[0][0]                    
__________________________________________________________________________________________________
block1_conv2 (Conv2D)           (None, 224, 224, 64) 36928       block1_conv1[0][0]               
__________________________________________________________________________________________________
block1_pool (MaxPooling2D)      (None, 112, 112, 64) 0           block1_conv2[0][0]               
__________________________________________________________________________________________________
block2_conv1 (Conv2D)           (None, 112, 112, 128 73856       block1_pool[0][0]                
__________________________________________________________________________________________________
block2_conv2 (Conv2D)           (None, 112, 112, 128 147584      block2_conv1[0][0]               
__________________________________________________________________________________________________
block2_pool (MaxPooling2D)      (None, 56, 56, 128)  0           block2_conv2[0][0]               
__________________________________________________________________________________________________
block3_conv1 (Conv2D)           (None, 56, 56, 256)  295168      block2_pool[0][0]                
__________________________________________________________________________________________________
block3_conv2 (Conv2D)           (None, 56, 56, 256)  590080      block3_conv1[0][0]               
__________________________________________________________________________________________________
block3_conv3 (Conv2D)           (None, 56, 56, 256)  590080      block3_conv2[0][0]               
__________________________________________________________________________________________________
block3_pool (MaxPooling2D)      (None, 28, 28, 256)  0           block3_conv3[0][0]               
__________________________________________________________________________________________________
block4_conv1 (Conv2D)           (None, 28, 28, 512)  1180160     block3_pool[0][0]                
__________________________________________________________________________________________________
block4_conv2 (Conv2D)           (None, 28, 28, 512)  2359808     block4_conv1[0][0]               
__________________________________________________________________________________________________
block4_conv3 (Conv2D)           (None, 28, 28, 512)  2359808     block4_conv2[0][0]               
__________________________________________________________________________________________________
block4_pool (MaxPooling2D)      (None, 14, 14, 512)  0           block4_conv3[0][0]               
__________________________________________________________________________________________________
block5_conv1 (Conv2D)           (None, 14, 14, 512)  2359808     block4_pool[0][0]                
__________________________________________________________________________________________________
block5_conv2 (Conv2D)           (None, 14, 14, 512)  2359808     block5_conv1[0][0]               
__________________________________________________________________________________________________
block5_conv3 (Conv2D)           (None, 14, 14, 512)  2359808     block5_conv2[0][0]               
__________________________________________________________________________________________________
block5_pool (MaxPooling2D)      (None, 7, 7, 512)    0           block5_conv3[0][0]               
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 7, 7, 4096)   102764544   block5_pool[0][0]                
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 7, 7, 4096)   16781312    conv2d_3[0][0]                   
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 7, 7, 2)      8194        conv2d_4[0][0]                   
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 14, 14, 512)  16896       conv2d_5[0][0]                   
__________________________________________________________________________________________________
tf.__operators__.add (TFOpLambd (None, 14, 14, 512)  0           conv2d_transpose_3[0][0]         
                                                                 block4_pool[0][0]                
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 28, 28, 256)  2097408     tf.__operators__.add[0][0]       
__________________________________________________________________________________________________
tf.__operators__.add_1 (TFOpLam (None, 28, 28, 256)  0           conv2d_transpose_4[0][0]         
                                                                 block3_pool[0][0]                
__________________________________________________________________________________________________
conv2d_transpose_5 (Conv2DTrans (None, 224, 224, 2)  131074      tf.__operators__.add_1[0][0]     
==================================================================================================
Total params: 136,514,116
Trainable params: 121,799,428
Non-trainable params: 14,714,688
__________________________________________________________________________________________________
In [22]:
model.compile(optimizer = 'adam',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])
In [23]:
model.fit(train_imgs, train_seg, batch_size = 5, epochs = 5)
Epoch 1/5
36/36 [==============================] - 63s 2s/step - loss: 0.5721 - accuracy: 0.8566
Epoch 2/5
36/36 [==============================] - 57s 2s/step - loss: 0.2491 - accuracy: 0.9080
Epoch 3/5
36/36 [==============================] - 59s 2s/step - loss: 0.2136 - accuracy: 0.9181
Epoch 4/5
36/36 [==============================] - 66s 2s/step - loss: 0.2021 - accuracy: 0.9219
Epoch 5/5
36/36 [==============================] - 61s 2s/step - loss: 0.1997 - accuracy: 0.9226
Out[23]:
<tensorflow.python.keras.callbacks.History at 0x250067f8898>
In [24]:
test_x = test_imgs[[1]]
test_seg = model.predict(test_x)

seg_mask = (test_seg[:,:,:,1] > 0.5).reshape(224, 224).astype(float)

plt.figure(figsize = (14,14))
plt.subplot(2,2,1)
plt.imshow(test_x[0])
plt.axis('off')
plt.subplot(2,2,2)
plt.imshow(seg_mask, cmap = 'Blues')
plt.axis('off')
plt.subplot(2,2,3)
plt.imshow(test_x[0])
plt.imshow(seg_mask, cmap = 'Blues', alpha = 0.5)
plt.axis('off')
plt.show() 
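
Since the two softmax outputs sum to 1 at every pixel, thresholding channel 1 at 0.5 makes the same per-pixel decision as an argmax over the channel axis; an equivalent alternative:

# equivalent per-pixel class decision via argmax over the channels
seg_mask_alt = np.argmax(test_seg, axis = -1).reshape(224, 224).astype(float)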