What is translation invariance in computer vision and convolutional neural networks?

2 min read · Jun 6, 2020

Invariance means that you can recognize an object as an object, even when its appearance varies in some way. This is generally a good thing, because it lets you abstract an object’s identity or category from the specifics of the visual input, such as the relative positions of the viewer/camera and the object.

The image below contains many views of the same statue. You (and well-trained neural networks) can recognize that the same object appears in every picture, even though the actual pixel values are quite different.

Note that translation here has a specific meaning in vision, borrowed from geometry. It does not refer to any type of conversion, unlike, say, a translation from French to English or between file formats. Instead, it means that each point/pixel in the image has been moved the same amount in the same direction. Alternatively, you can think of the origin as having been shifted an equal amount in the opposite direction. For example, we can generate the 2nd and 3rd images in the first row from the first by moving each pixel 50 or 100 pixels to the right.
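To make the geometric meaning concrete, here is a minimal sketch of a rightward translation in NumPy. The helper name translate_right, the tiny 3x4 test image, and the choice to fill the vacated columns with zeros are assumptions made for illustration; they are not taken from the statue images discussed above.

```python
import numpy as np

def translate_right(image, shift):
    """Move every pixel `shift` columns to the right.

    Columns vacated on the left are filled with zeros (an assumption;
    other fill strategies such as wrapping or edge replication exist).
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    out = np.zeros_like(image)
    if shift == 0:
        return image.copy()
    if shift < image.shape[1]:
        # Every pixel moves by the same amount in the same direction.
        out[:, shift:] = image[:, :-shift]
    return out

# A tiny stand-in "image" so the effect is easy to inspect.
image = np.arange(12).reshape(3, 4)
shifted = translate_right(image, 1)
print(image)
print(shifted)
```

Every pixel value that survives the shift is unchanged; only its column index differs, which is why the raw pixel arrays of the two images no longer match position by position.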

One can show that the convolution operator commutes with translation. If you convolve f with g, it doesn’t matter whether you translate the convolved output f∗g, or translate f or g first and then convolve them. Wikipedia has a bit more.
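The commutation property can be checked numerically on a small 1-D example. This is only a sketch under stated assumptions: the signals f and g, the shift k, and the helper shift_right are made up for illustration, and translation is modeled by prepending zeros so that nothing falls off the end of the array.

```python
import numpy as np

def shift_right(signal, k):
    """Translate a 1-D signal by k samples by prepending k zeros."""
    return np.concatenate([np.zeros(k), signal])

# Hypothetical example signals and shift amount.
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
k = 2

# np.convolve uses "full" mode by default, so no samples are cut off.
translate_f_then_convolve = np.convolve(shift_right(f, k), g)
translate_g_then_convolve = np.convolve(f, shift_right(g, k))
convolve_then_translate = shift_right(np.convolve(f, g), k)

print(np.allclose(translate_f_then_convolve, convolve_then_translate))
print(np.allclose(translate_g_then_convolve, convolve_then_translate))
```

All three orderings give the same array, which is the sense in which shifting the input of a convolution simply shifts its output.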

One approach to translation-invariant object recognition is to take a “template” of the object and convolve it with the image, so that the template is compared against every possible location of the object. If you get a large response at a location, it suggests that an object resembling the template is present there. This approach is often called template matching.
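A minimal sketch of template matching follows, assuming a plain sliding-window product-and-sum as the response (strictly a cross-correlation, which equals a convolution with the flipped template). The function names match_template and locate, the 2x2 template, and the blank 6x8 images are hypothetical. Note that this raw score favors bright regions in real images, so practical systems usually normalize it.

```python
import numpy as np

def match_template(image, template):
    """Return the response of `template` at every valid offset in `image`."""
    ih, iw = image.shape
    th, tw = template.shape
    response = np.zeros((ih - th + 1, iw - tw + 1))
    for r in range(response.shape[0]):
        for c in range(response.shape[1]):
            patch = image[r:r + th, c:c + tw]
            # Large when the patch resembles the template.
            response[r, c] = np.sum(patch * template)
    return response

def locate(image, template):
    """Return the (row, col) offset with the largest response."""
    response = match_template(image, template)
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

template = np.array([[1.0, 2.0],
                     [3.0, 4.0]])

# Place the same "object" at two different positions on a blank background.
image_a = np.zeros((6, 8))
image_a[1:3, 2:4] = template
image_b = np.zeros((6, 8))
image_b[3:5, 5:7] = template

print(locate(image_a, template))
print(locate(image_b, template))
```

The peak of the response map moves with the object, so taking the maximum over all locations yields a detection that does not depend on where the object sits in the image.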

Written by Ke Gui