
Friday, May 1, 2020

VGG-16 Inference with Different Image Dimensions

In this blog I will talk about how to create a classification network, or fine-tune a pre-trained classification network (VGG-16), so that it accepts images of any dimension rather than only the one dimension it was trained on. Generally, a model trained on MxNx3 images accepts only MxNx3 input and nothing else, but here we will see how to modify the network to accept input images of other dimensions as well.

I will explain this with the VGG-16 network.
(**Dimension here refers only to the width and height of the image/feature map, not the channel/depth.)

Let's talk about scenarios where a single network accepting inputs of different dimensions can be helpful.
  • Case 1: You have a few hundred images of a smaller dimension (say 100x100x3); then fine-tuning a pre-trained network is one of the best options you have, instead of training one from scratch.
  • Case 2: You trained a classification network for a fixed input dimension (224x224x3), at inference time you get inputs of different dimensions (ranging from 200x200x3 to 300x300x3), and you don't want to pad or resize inference images and lose information.
  • Case 3: You have multiple datasets with different input dimensions; with this kind of network you can easily train a classifier without modifying the input datasets.
There can be other scenarios where it is helpful, but let's not cover them all and move forward to the solution.

So the first question is: what is the problem if we pass images of different dimensions to a VGG-16 network, and which layer causes it, the convolution, pooling, flatten, or fully connected (FC) layer?

Let's see briefly what these layers do (a short code sketch follows this list).
  • A convolution layer accepts input of any dimension and convolves it with its kernels; its output dimension depends on the padding and stride of the kernel. It may or may not reduce the dimension.
  • A pooling layer accepts input of any dimension, and its output dimension depends on the stride of the pooling operation. It reduces the dimension.
  • A flatten layer accepts input of any dimension, and its output is the input reshaped into a single dimension.
  • A fully connected (FC) layer accepts an input of fixed dimension, and its output dimension is fixed by its number of units, so both its input and output dimensions are fixed. It may or may not reduce the dimension.
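To make the flatten/FC constraint concrete, here is a minimal Keras sketch (the layer sizes are illustrative, not the actual VGG-16 configuration): convolution and pooling happily process any spatial size, while a flatten followed by a dense layer pins the network to one size.

import tensorflow as tf
from tensorflow.keras import layers

# Conv and pooling only fix the channel count, not the spatial size.
conv = layers.Conv2D(64, 3, padding='same')
pool = layers.MaxPooling2D(2)

for size in (224, 150):
    x = tf.random.normal((1, size, size, 3))
    y = pool(conv(x))
    print(size, '->', y.shape)  # (1, 112, 112, 64) and (1, 75, 75, 64)

# Flattening these outputs gives vectors of length 112*112*64 and 75*75*64,
# so one Dense layer cannot serve both: its weight matrix is built for
# exactly one input length.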
Now let's compare the network architecture of the original VGG-16, trained on 224x224x3 images, with the same VGG-16 network trained on 150x150x3 images.



For different input image dimensions, the feature map after the block5_pool layer (a feature map is simply the output of a convolution or pooling layer of a CNN) has different dimensions, because the convolution and pooling layers reduce the feature map size by a constant factor; the flatten layer then just reshapes that feature map into a one-dimensional vector.
We can see the input to the first FC layer is 25088 (7x7x512) when the image is 224x224x3 and 8192 (4x4x512) when the input is 150x150x3, so passing a 150x150x3 image to a network trained for 224x224x3 creates a problem.

Below, I load the original VGG-16 model for 1000 classes and pass it a 374x500x3 input image, which raises an error; but if you uncomment the resize line, it runs and gives probabilities for the 1000 classes.
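The original screenshot is not reproduced here; this is a minimal sketch of what it showed (the random array stands in for loading a real 374x500x3 image).

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

model = VGG16(weights='imagenet')  # full network with FC layers: input fixed at 224x224x3

img = np.random.rand(374, 500, 3).astype('float32') * 255  # stand-in for a real 374x500x3 photo
# img = tf.image.resize(img, (224, 224)).numpy()  # uncomment this resize line and it runs

x = preprocess_input(img[np.newaxis])  # add the batch dimension
preds = model.predict(x)               # without the resize: error, incompatible input shape
print(preds.shape)                     # (1, 1000) once the image is resized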

So a network trained for 224x224x3 will only take input of dimension 224x224x3, not 150x150x3, and vice versa.

If we want a single network to accept both images, the output dimension of the flatten layer must be fixed, so that the FC layer always receives an input of the size it expects. The problem is solved if we can somehow always pass a fixed-dimension input to the FC layer, and that's where the Global Average Pooling layer helps us.

The conclusion so far: because the FC layer accepts a fixed-length input, passing an image of a different dimension to the VGG-16 network results in an error.

Global Average Pooling Layer -

Global Average Pooling (GAP) is an operation that average-pools each channel of the input feature map: it transforms a feature map of dimension HxWxK to 1x1xK by taking the average of each HxW channel.

GAP layer transforming a feature map from 6x6x3 to 1x1x3 by taking the average of each channel

Hence, if you use a GAP layer instead of a flatten layer, it can handle a feature map of any dimension and always produces a 1x1xK output, where K is the number of channels of the input feature map, which is fixed for any CNN. So the next FC layer will always receive a fixed-dimension input.
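A quick NumPy sketch of the operation (the array values are random placeholders):

import numpy as np

def global_average_pool(feature_map):
    # Average over the spatial axes: HxWxK -> K values, one per channel.
    return feature_map.mean(axis=(0, 1))

fmap_224 = np.random.rand(7, 7, 512)   # block5_pool output for a 224x224x3 input
fmap_150 = np.random.rand(4, 4, 512)   # block5_pool output for a 150x150x3 input

print(global_average_pool(fmap_224).shape)  # (512,)
print(global_average_pool(fmap_150).shape)  # (512,) -- same length either way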

As we are talking about the GAP layer, let's note its other benefits:
  • It is used in many networks to handle images of different dimensions.
  • It is also used as a replacement for FC layers, meaning the output of the GAP layer is fed directly to the softmax layer.
  • It reduces the number of trainable parameters of the network and hence acts as a regularizer.
  • It is less prone to overfitting than traditional fully connected layers.

Let's see the VGG-16 network architecture with a Global Average Pooling layer.


Here we can see that in both cases the output after the global average pooling layer is a 512-dimensional vector.

Hence the problem is solved: with a GAP layer we can feed smaller or even bigger images to the network.

Does this mean we can pass images of any dimension to this network? NO! Why?

You can see that by the time the input reaches the GAP layer, its dimension has been reduced 32 times, because VGG-16 has five max-pooling layers that each halve the spatial size (2^5 = 32): 224 → 112 → 56 → 28 → 14 → 7, and 150 → 75 → 37 → 18 → 9 → 4. So the minimum input image dimension for VGG-16 should be greater than or equal to 32x32.

Does this mean we can train VGG-16 with a GAP layer on images of any dimension (>= 32x32)? NO! Why?

With a GAP layer in the VGG-16 network we can run inference on images of different dimensions, but not training, at least not directly, because at training time we train the network in batches (of 32, 64, or 128 images). That means we pass multiple images to the network at the same time, and if a batch contains images of different dimensions it creates a problem: batch processing cannot handle different dimensions across the images stacked in one tensor. The solution is to create an image loader that loads images of the same dimensions in each batch, as sketched below.
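A minimal sketch of such a loader (the bucketing-by-shape scheme here is my own illustration, not code from the original post):

import numpy as np
from collections import defaultdict

def same_size_batches(images, labels, batch_size=32):
    # Yield (batch_x, batch_y) where every image in a batch shares one HxW.
    buckets = defaultdict(list)
    for img, lbl in zip(images, labels):
        buckets[img.shape[:2]].append((img, lbl))  # group by (H, W)
    for _, items in buckets.items():
        for i in range(0, len(items), batch_size):
            chunk = items[i:i + batch_size]
            xs = np.stack([img for img, _ in chunk])  # stacking works: same shape
            ys = np.array([lbl for _, lbl in chunk])
            yield xs, ys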

At inference time we typically pass a single image at a time, so different image dimensions pose no problem for inference.

Notice one more thing: the number of parameters of the network without and with the GAP layer. The total parameter count decreases from 138,357,544 to 37,694,248, almost all of it in the first FC layer, whose weight matrix shrinks from 25088x4096 (about 102.8M parameters) to 512x4096 (about 2.1M). This supports the point that the GAP layer acts as a regularizer and makes the network less prone to overfitting.

Code -

Let's see the code to create a model for inference on input images of different dimensions.
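The original code screenshot is not reproduced here; this is a minimal sketch of the model it described (num_classes = 10 is a placeholder, set it for your dataset; the 37,694,248 parameter count quoted above corresponds to 1000 classes):

from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

num_classes = 10  # placeholder: set this to your dataset's class count

# Convolutional blocks only (include_top=False drops the FC layers), with
# imagenet weights for fine-tuning and no fixed spatial input dimension.
base = VGG16(include_top=False, weights='imagenet', input_shape=(None, None, 3))

x = GlobalAveragePooling2D()(base.output)  # HxWx512 -> 512, for any HxW
x = Dense(4096, activation='relu')(x)      # two dense layers, as in VGG-16
x = Dense(4096, activation='relu')(x)
out = Dense(num_classes, activation='softmax')(x)

model = Model(inputs=base.input, outputs=out)
model.summary()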


Here we are loading only the convolutional blocks of the VGG-16 network, not the FC layers, and not fixing any spatial input dimension for the image. If you load the full network with FC layers, you have to pass a fixed input dimension.

Let's say you want to train the model for 150x150x3; then at data-loading time you resize the images to 150x150x3. Your model will be very accurate for this dimension but will also be able to handle images of other dimensions.

We can load the imagenet weights; this is helpful for fine-tuning.

We add a Global Average Pooling layer to the network, followed by two dense layers and an output layer, exactly as in the original VGG-16 network.

Now the model is ready; you can change the number of classes and train it on your dataset. Assuming the model is trained, let's look at inference with it.

Here I pass one image without resizing, and the model is able to run inference on it. The original image size is 374x500x3.
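A sketch of the inference call, continuing from the model built above (the image path is a placeholder):

import numpy as np
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input

img = image.load_img('test.jpg')  # placeholder path; no target_size, so the 374x500 size is kept
x = preprocess_input(image.img_to_array(img)[np.newaxis])
print(model.predict(x).shape)     # (1, num_classes)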

Let's reduce the image size.

You can see I have reduced the image dimension to 32x32x3, and the model is still able to run inference.
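The same call, with the image resized down to the 32x32 minimum (reusing the imports and model from the snippet above):

img_small = image.load_img('test.jpg', target_size=(32, 32))  # placeholder path
x_small = preprocess_input(image.img_to_array(img_small)[np.newaxis])
print(model.predict(x_small).shape)  # still (1, num_classes)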

Inference -

In my previous blog I talked about image classification and general model fine-tuning. I'm going to use the same dataset and network, replace the flatten layer with a GAP layer, and see how the model performs for images of different dimensions.

Here are some predictions.
I loaded the VGG-16 network trained with a GAP layer; the training images were resized to 224x224 at training time.
You can see below that the model performs well even for 32x32 images.


Now I passed the same image without resizing; the image dimension is 499x403.


So we see here that a model trained for 224x224 images is able to perform on different dimensions, and quite accurately.

I won't say this is a very great result, since in real scenarios it's very difficult to make correct predictions for such small images when the model is trained on large-dimension images, because real data contains a lot of noise, but this is quite GOOD.

Complete training and inference code is on GitHub. 

Conclusion -
  • Because of the flatten and fully connected layers, a CNN classification network can't process images of different dimensions.
  • With a global average pooling layer, any such network can run inference on images of different dimensions.
  • A global average pooling layer reduces the number of trainable parameters in the network, hence acts as a regularizer and makes the network less prone to overfitting.

That's all for this blog; hope you find it informative.
Thanks for reading !!


Code and Model Link-



