In this post you will learn about Google’s Inception network architecture, its different versions, and how to use it for transfer learning and real-time implementation.
Table of contents:
- What is Inception network?
- Problems solved by Inception Network architecture
- Inception v1
- Inception v2
- Inception v3
What is Inception Network?
The Inception network is a complex, well-designed architecture for solving computer vision tasks. It was introduced in 2014 and won the classification task of that year’s “ImageNet Large Scale Visual Recognition Challenge”.
Problems solved by inception network
The Inception architecture addressed three major challenges in applying deep learning to computer vision:
- Before the Inception network, people kept making models deeper and deeper to improve accuracy. But making the network deeper caused problems like overfitting.
- Increasing the number of neurons (parameters) also demanded far more computational resources, which was the biggest issue.
- Existing architectures at the time could not capture all the information efficiently when the location of the object varied from image to image.
As you can see in the images above, the cat’s location is different in the two cases. If you use a fixed kernel size, it is difficult to capture the appropriate information. The Inception network architecture handles this issue very neatly.
A filter with a large kernel captures information that is distributed over a wide area of the image, i.e. more global, high-level features, while a filter with a small kernel extracts local information, i.e. low-level features.
What’s so special about the Inception network?
Okay, now let me tell you the secret behind the performance of Google’s Inception network!
Let’s solve each problem one by one:
Extracting information whose location varies.
Usually all the filters in one layer have the same size, but what if we use filters of different sizes at the same time?
If you apply filters of several sizes, like (1×1), (3×3) and (5×5), in parallel at the same level, the inception layer will look like this:
The benefits of adding an inception layer are as follows:
- The network becomes wider rather than deeper, which helps to avoid overfitting.
- With filters of different sizes, it can capture the information irrespective of the location of the object.
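To make this concrete, here is a minimal pure-Python sketch of a naive inception layer on a single-channel image. The helper names `conv2d_same` and `naive_inception`, the all-ones filters and the 5×5 toy image are assumptions made for illustration only. A real network learns its filter weights and works on many channels.

```python
def conv2d_same(image, kernel):
    """Slide a square kernel over a single-channel image with zero padding,
    so the output has the same height and width as the input."""
    height, width = len(image), len(image[0])
    size = len(kernel)
    pad = size // 2
    output = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in range(size):
                for dj in range(size):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < height and 0 <= x < width:
                        total += image[y][x] * kernel[di][dj]
            output[i][j] = total
    return output


def naive_inception(image, kernel_sizes=(1, 3, 5)):
    """Apply one all-ones filter per kernel size in parallel and stack the
    resulting maps as channels (hypothetical helper, not a library call)."""
    branches = []
    for size in kernel_sizes:
        kernel = [[1.0] * size for _ in range(size)]
        branches.append(conv2d_same(image, kernel))
    return branches


# A 5x5 image with a single bright pixel in the centre.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0

channels = naive_inception(image)
for size, feature_map in zip((1, 3, 5), channels):
    print(f"{size}x{size} response at the corner:", feature_map[0][0])
```

In this toy case only the 5×5 branch responds at the corner, because only its window reaches the bright pixel, while all three branches keep the same 5×5 output size and can therefore be stacked as channels.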
Now, the second issue: a “computationally expensive model”.
This issue is solved by adding a (1×1) convolution before the larger filters like (3×3) or (5×5).
You might wonder: adding an extra convolution should increase the computation time, so how can it reduce it?
“By reducing the number of depth channels”
A 1×1 convolution simply maps an input pixel, with all its channels, to an output pixel, without looking at anything around it. It is used to reduce the number of depth channels, because it is often very slow to multiply volumes with extremely large depths. For example:
input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
input (256 depth) -> 4x4 convolution (256 depth)
The second pipeline, without the 1×1 reduction, is about 3.7x slower.
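The 3.7x figure can be checked by counting multiplications per output pixel. This is a rough sketch that assumes cost is proportional to input depth × kernel area × output depth and ignores biases and memory traffic. The function name `conv_mults_per_pixel` is made up for this example.

```python
def conv_mults_per_pixel(in_depth, kernel_size, out_depth):
    """Multiplications needed to produce one output pixel of a convolution."""
    return in_depth * kernel_size * kernel_size * out_depth


# input (256 depth) -> 4x4 convolution (256 depth)
direct = conv_mults_per_pixel(256, 4, 256)

# input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
bottleneck = conv_mults_per_pixel(256, 1, 64) + conv_mults_per_pixel(64, 4, 256)

ratio = direct / bottleneck
print(direct, bottleneck, round(ratio, 2))  # 1048576 278528 3.76
```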
In the simplest terms, the 1×1 convolution lets the neural network ‘choose’ which input ‘colors’ (channels) to look at, instead of brute-force multiplying everything.
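Here is a small sketch of that idea: a 1×1 convolution is just a weighted mix of the channels at each pixel, applied independently at every location. The function `conv1x1`, the 2×2 feature map and the hand-picked weights are hypothetical. In a trained network the weights are learned.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map: nested lists of shape H x W x C_in
    weights:     nested lists of shape C_out x C_in
    Returns nested lists of shape H x W x C_out.
    """
    return [
        [
            [sum(w * c for w, c in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# 2x2 image with 4 channels per pixel.
feature_map = [
    [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 3.0, 2.0, 1.0]],
]

# Two output channels: the first keeps channel 0, the second averages channels 2 and 3.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

reduced = conv1x1(feature_map, weights)
print(reduced[0][0])  # [1.0, 3.5]
```

The spatial size stays 2×2, but the depth drops from 4 channels to 2, which is exactly what makes the following (3×3) or (5×5) filter cheaper.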
Versions of the Inception architecture
Several versions were released, each one improving on the previous Inception architecture.
Inception v1
“GoogLeNet” was designed around the inception layer. It contains 9 inception modules stacked linearly on top of each other.
It is 22 layers deep (27, including the pooling layers). It uses global average pooling at the end of the last inception module.
Needless to say, it is a pretty deep classifier.
But because it is so deep, it is likely to suffer from the vanishing gradient problem.
To counter the vanishing gradient problem, the authors of the paper added “auxiliary classifiers” in the middle of the architecture. The auxiliary classifiers apply a softmax function to the outputs of intermediate inception modules and compute an auxiliary loss. The total loss is the sum of the real loss and the weighted auxiliary losses.
# The total loss used by the inception net during training.
total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2
Here 0.3 is the weight given to each auxiliary loss in the weighted sum.
The auxiliary loss is used for training only, not for inference.
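As a rough illustration, the weighted sum can be written out with a plain softmax cross-entropy for each classifier head. The function names and the toy logits below are assumptions made for this sketch, not the authors’ code. Only the 0.3 weight comes from the formula above.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a single true class."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target_index]


def inception_total_loss(main_logits, aux_logits_list, target_index, aux_weight=0.3):
    """Real loss plus the weighted auxiliary losses (training only)."""
    real_loss = softmax_cross_entropy(main_logits, target_index)
    aux_losses = [softmax_cross_entropy(a, target_index) for a in aux_logits_list]
    return real_loss + aux_weight * sum(aux_losses)


# Toy example with 4 classes; the true class is index 2.
main_logits = [0.1, 0.2, 3.0, 0.3]
aux_logits = [[0.0, 0.5, 1.5, 0.2], [0.3, 0.1, 1.0, 0.4]]
loss = inception_total_loss(main_logits, aux_logits, target_index=2)
print(round(loss, 4))
```

At inference time you would simply read the prediction from the main classifier and ignore the auxiliary heads.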
Inception v2
The authors improved Inception v1 to increase accuracy and to reduce computational complexity even further.
If you use dimension reduction too aggressively, you may lose information. This is known as a “representational bottleneck”.
However, the cost can be cut without such extreme reduction if you use two stacked (3×3) filters instead of one (5×5): together they cover the same 5×5 region with fewer weights.
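A quick way to check this is to compare the receptive field and the number of weights of the two options. The sketch below assumes stride 1, counts weights per input/output channel pair, and ignores biases. The helper names are made up for this example.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for size in kernel_sizes:
        field += size - 1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Number of weights per input/output channel pair, biases ignored."""
    return sum(size * size for size in kernel_sizes)


one_5x5 = [5]
two_3x3 = [3, 3]

print(stacked_receptive_field(one_5x5), stacked_receptive_field(two_3x3))    # 5 5
print(weights_per_channel_pair(one_5x5), weights_per_channel_pair(two_3x3))  # 25 18
```

So both options see a 5×5 patch of the input, but the two (3×3) filters need 18 weights instead of 25, which is 28% fewer.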
Also, factorizing an n×n convolution into a 1×n convolution followed by an n×1 convolution lowers the network’s computational cost even more.
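The saving is easy to count: an n×n filter needs n·n multiplications per output position, while the 1×n plus n×1 pair needs only 2·n. The sketch below also shows which n×n kernel such a pair is equivalent to, assuming no nonlinearity between the two steps. Both function names and the example vectors are hypothetical.

```python
def factorized_cost(n):
    """Multiplications per output position: (full n x n, 1 x n plus n x 1)."""
    return n * n, 2 * n


def effective_kernel(column, row):
    """The n x n kernel equivalent to a 1 x n convolution (row) followed by
    an n x 1 convolution (column) when nothing nonlinear sits in between."""
    return [[c * r for r in row] for c in column]


for n in (3, 7):
    full, factored = factorized_cost(n)
    print(f"n={n}: {full} vs {factored} multiplications")

print(effective_kernel([1, 2, 1], [1, 0, -1]))
# [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
```

For n = 3 that is 6 multiplications instead of 9, and the gap grows quickly for larger kernels such as 7×7 (14 instead of 49).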
Also, the filter banks are expanded in parallel, making the module wider instead of deeper. This prevents excessive dimension reduction and thus removes the representational bottleneck.
The authors had introduced “auxiliary classifiers” to counter the vanishing gradient problem, but according to their own observations, these did not solve the problem effectively. Instead, they worked as a regularizer, much like dropout or batch normalization.
Inception v3
Inception v3 includes all of the improvements made in Inception v2, along with the following additions:
- RMSProp Optimizer.
- Factorized 7×7 convolutions.
- BatchNorm in the Auxiliary Classifiers.
- Label Smoothing.
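Of these, label smoothing is the easiest to show in a few lines: instead of a one-hot target, the true class gets slightly less than 1 and the remainder is spread evenly over all classes. This is a minimal sketch. The function name `smooth_labels` and the value epsilon = 0.1 are assumptions for illustration, so check the settings of your own training setup.

```python
def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Turn a one-hot target into a smoothed distribution that sums to 1."""
    uniform_share = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform_share if k == true_index else uniform_share
        for k in range(num_classes)
    ]


targets = smooth_labels(true_index=1, num_classes=4)
print([round(p, 3) for p in targets])  # [0.025, 0.925, 0.025, 0.025]
```

Training against these softer targets discourages the model from becoming over-confident about a single class.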
You might think: Inception v1 started with different kernel sizes to cope with the huge variation in the location of the information. But in the later versions the kernels are factorized into smaller ones, so mostly 3×3 and 1×1 filters are used, and it worked well. So does this mean the assumption behind the different kernel sizes in Inception v1 was wrong?
Not really. Factorizing a large kernel into smaller ones only reduces the computational complexity. To replace a 5×5 filter you need two stacked 3×3 filters, so that together they can do the same job the 5×5 filter did.
Now you understand how Inception works, what the benefits of each version are, and which changes were made in each successive version and why. In the next part, you will be able to implement an Inception network for your own classification problem.
Let us know in the comments section if you have any doubts or questions.