Through this post, you will understand Google’s Inception network architecture, its different versions, how to use it for transfer learning, and how to implement it in practice.

Let’s start:

Table of contents:

  • What is Inception network?
  • Problems solved by Inception Network architecture
  • Inception v1
  • Inception v2
  • Inception v3

What is Inception Network?

The Inception network is a deep, carefully designed convolutional architecture for computer vision tasks. It was introduced in 2014 and won the classification task of the “ImageNet Large Scale Visual Recognition Challenge” that year.

Problems solved by inception network

The Inception architecture addressed three major challenges faced when applying deep learning to computer vision:

  • Before the Inception network existed, people kept making models deeper and deeper to improve accuracy. But making the network deeper caused problems like overfitting.
  • Increasing the number of neurons (parameters) also required more computational resources, which was a big issue.
  • Existing architectures at that time could not capture all the information efficiently when the location of the object varied.
[Image Source: Bored Panda]

As you can see in the above images, the cat’s location is different in both cases. If you use a fixed kernel size, it’s difficult to capture the appropriate information. The Inception network architecture handles this issue neatly.


A filter with a large kernel captures information that is spread over a larger area of the image, i.e. more global features, while a filter with a small kernel extracts local information, i.e. fine-grained features.

What’s so special in inception network?

Okay, now let me tell you the secret behind the performance of Google’s Inception network!

Let’s start with solving each problem one by one:

Extracting information whose location varies.

Usually, all the filters in one layer have the same size. But what if we use filters of different sizes at the same time?

If you add multiple size filters at the same level, like (1×1), (3×3), (5×5), the inception layer will look like this:

[Image Source: ResearchGate]

The benefits of adding inception layer will be as follows:

  1. The network becomes wider instead of deeper, which helps to avoid overfitting.
  2. With filters of different sizes, it can capture the information irrespective of the location of the object.
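As a rough sketch of what "filters of different sizes at the same level" means in practice: each branch sees the same input, keeps the same height and width (assuming stride 1 and "same" padding), and the branch outputs are concatenated along the channel axis. The function name and the filter counts below are illustrative assumptions, and a pooling branch is included because the original module also has one.

```python
# Hypothetical shape bookkeeping for one naive inception layer.
# Assumption: every branch uses stride 1 and "same" padding, so height and
# width are unchanged and only the channel counts differ between branches.

def inception_output_shape(height, width, branch_channels):
    """Return (height, width, channels) after concatenating all branches.

    branch_channels maps a branch name to its number of output filters.
    """
    total_channels = sum(branch_channels.values())
    return (height, width, total_channels)


# Illustrative filter counts, assumed for this example only.
branches = {"1x1": 64, "3x3": 128, "5x5": 32, "pool": 32}
out_shape = inception_output_shape(28, 28, branches)
print(out_shape)  # (28, 28, 256)
```

The point of the sketch is only that the layer grows wider (more channels) while the spatial size stays the same.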

Okay, understood!

Now, the second issue, i.e. the “computationally expensive model”.

This issue is solved by adding a (1×1) convolution before the larger filters like (3×3) or (5×5).

You might think that adding an extra convolution should increase the computation time, so how does it reduce it?

“By reducing the number of depth channels”

A 1×1 convolution simply maps an input pixel with all its channels to an output pixel, without looking at anything around it. It is used to reduce the number of depth channels, since multiplying volumes with extremely large depths is often very slow. For example:

input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)

input (256 depth) -> 4x4 convolution (256 depth)

The bottom one is about 3.7x slower.
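The figure above can be checked with a quick count of multiplications per output position. This is a back-of-the-envelope sketch that assumes cost is proportional to the number of multiplications and that both paths produce the same spatial size; the helper name is made up for this example.

```python
def conv_multiplies(in_depth, out_depth, kernel_size):
    """Multiplications needed per output position for one conv layer."""
    return in_depth * out_depth * kernel_size * kernel_size


# input (256) -> 1x1 conv (64) -> 4x4 conv (256)
with_bottleneck = conv_multiplies(256, 64, 1) + conv_multiplies(64, 256, 4)

# input (256) -> 4x4 conv (256)
direct = conv_multiplies(256, 256, 4)

ratio = direct / with_bottleneck
print(with_bottleneck, direct, round(ratio, 2))  # 278528 1048576 3.76
```

So the direct path needs roughly 3.7 to 3.8 times as many multiplications, which matches the number quoted above.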

In the simplest words, the neural network can ‘choose’ which input ‘colors’ to look at using a 1×1 convolution, instead of brute-force multiplying everything.
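To make the "choosing colors" idea concrete, here is a tiny pure-Python sketch of a 1×1 convolution on nested lists. It is only meant to show that each output pixel is a weighted mix of the channels of the same input pixel; the function name and the toy values are assumptions, and bias and activation are left out.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution.

    image:   H x W x C_in nested lists
    weights: C_out x C_in nested lists (one row per output channel)
    """
    return [
        [
            [sum(w * x for w, x in zip(w_row, pixel)) for w_row in weights]
            for pixel in row
        ]
        for row in image
    ]


# A 2 x 2 image with 3 channels per pixel.
image = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
    [[2.0, 2.0, 2.0], [1.0, 0.0, 1.0]],
]
# Reduce 3 channels to 2.
weights = [
    [1.0, 0.0, 0.0],  # output channel 0 keeps only input channel 0
    [0.5, 0.5, 0.0],  # output channel 1 averages input channels 0 and 1
]
reduced = conv1x1(image, weights)
print(reduced[0][0])  # [1.0, 1.5]
```

The spatial size stays 2×2, while the depth drops from 3 to 2.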

Versions of Inception architecture

Each new version was introduced to improve on the previous architecture of the Inception network.

Inception v1

Based on the inception layer, “GoogLeNet” was designed. It contains 9 inception modules stacked linearly on top of each other.

It is 22 layers deep (27, including the pooling layers). It uses global average pooling at the end of the last inception module.

Needless to say, it is a pretty deep classifier.

But because it is so deep, it is prone to the vanishing gradient problem.

To counter the vanishing gradient problem, the authors of the paper added two “auxiliary classifiers” in the middle of the architecture. The auxiliary classifiers apply the softmax function to the outputs of intermediate inception modules and compute an auxiliary loss. The total loss is the sum of the real loss and the weighted auxiliary losses.

# The total loss used by the inception net during training.
total_loss = real_loss + 0.3 * aux_loss_1 + 0.3 * aux_loss_2

Here 0.3 is the weight applied to each auxiliary loss.

The auxiliary loss is used for training only, not for inference.
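The difference between training and inference can be sketched as a small helper. This is a minimal illustration, not the original implementation: the function name, the `training` flag, and the example loss values are all assumptions, and only the 0.3 weight comes from the text above.

```python
def inception_loss(real_loss, aux_losses=(), aux_weight=0.3, training=True):
    """Combine the main loss with the weighted auxiliary losses.

    During inference (training=False) the auxiliary heads are ignored,
    so only the main loss is returned.
    """
    if not training:
        return real_loss
    return real_loss + sum(aux_weight * loss for loss in aux_losses)


# Made-up loss values, just to exercise the function.
train_value = inception_loss(1.0, aux_losses=(0.5, 0.5))
eval_value = inception_loss(1.0, aux_losses=(0.5, 0.5), training=False)
print(train_value, eval_value)
```

With these toy values, each auxiliary head contributes 0.3 × 0.5 to the training loss, and nothing to the inference value.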

Inception v2

The authors improved Inception v1 to increase accuracy further and to reduce computational complexity even more.

If you reduce dimensions too aggressively, you might lose information. This is known as a “representational bottleneck”.

One of the changes was to use two stacked (3×3) convolutions instead of one (5×5). They cover the same receptive field at a lower computational cost.

[Figure: Structure of an Inception-Resnet-v2 layer]
[Image Source: ResearchGate]

Also, factorizing convolutions of filter size n×n into a combination of 1×n and n×1 convolutions lowers the network’s computational cost.

In addition, the filter banks are expanded in parallel (made wider instead of deeper). This prevents excessive dimension reduction and so removes the representational bottleneck.
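A quick way to see the savings from both factorizations is to count the weights in each kernel. The sketch below counts weights per input/output channel pair only and assumes the channel counts stay the same across the stacked layers; the helper name is made up for this example.

```python
def kernel_weights(height, width):
    """Number of weights in one kernel for one input/output channel pair."""
    return height * width


# One 5x5 convolution vs. two stacked 3x3 convolutions.
five_by_five = kernel_weights(5, 5)            # 25
two_three_by_three = 2 * kernel_weights(3, 3)  # 18

# One n x n convolution vs. a 1 x n followed by an n x 1 convolution.
n = 7
full = kernel_weights(n, n)                               # 49
factorized = kernel_weights(1, n) + kernel_weights(n, 1)  # 14

print(five_by_five, two_three_by_three)  # 25 18
print(full, factorized)                  # 49 14
```

Under these assumptions, the two 3×3 filters need 28% fewer weights than one 5×5, and the asymmetric 7×7 factorization needs less than a third of the original weights.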

Inception v3

The authors had added “auxiliary classifiers” to counter the vanishing gradient problem, but they observed that these did not solve the problem effectively. Instead, the auxiliary classifiers worked as a regularizer, similar to dropout or batch normalization.

Inception v3 includes all of the improvements made in Inception v2, along with the following additional changes:

  1. RMSProp optimizer
  2. Factorized 7×7 convolutions
  3. BatchNorm in the auxiliary classifiers
  4. Label smoothing
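Of the items above, label smoothing is the easiest to show in a few lines. The idea is to move a small amount of probability from the true class to all classes, so the model is not pushed towards over-confident predictions. This is a minimal sketch with uniform smoothing; the function name and the value epsilon = 0.1 are assumptions for illustration.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    num_classes = len(one_hot)
    return [
        (1.0 - epsilon) * value + epsilon / num_classes
        for value in one_hot
    ]


# Four classes, the true class is index 2.
smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0])
print(smoothed)
```

With four classes and epsilon = 0.1, the true class ends up at about 0.925 and every other class at about 0.025, and the values still sum to 1.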

You might wonder: Inception v1 started with different kernel sizes to cover large variations in the location of the information. But in the later versions, the large kernels are factorized, and mostly small sizes like 3×3 or 1×1 are used, and it worked well. So, does this mean the idea of using different kernel sizes in Inception v1 was wrong?

Actually, no. Factorizing a large kernel into smaller ones just reduces the computational complexity. If you replace a 5×5 filter, you need two stacked 3×3 filters so that together they cover the same area as the 5×5 filter did.
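The claim that two 3×3 filters cover the same area as one 5×5 filter can be checked with a small receptive-field calculation. This sketch assumes stride 1 and no dilation for every layer; the function name is made up for this example.

```python
def receptive_field(kernel_sizes):
    """Receptive field (in pixels per side) of stacked stride-1 convolutions."""
    field = 1
    for kernel_size in kernel_sizes:
        # Each stride-1 layer widens the field by (kernel_size - 1).
        field += kernel_size - 1
    return field


print(receptive_field([5]))     # 5
print(receptive_field([3, 3]))  # 5
```

Both stacks see a 5×5 patch of the input, so the multi-scale idea is preserved while the cost goes down.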


Now you understand how Inception works, what the benefits of each version are, and which changes were made in each later version and why. In the next part, you will implement an Inception network for your own classification problem.

Let us know if you have any doubts or any questions through the comment section.

Happy reading!!