Convolution is an operation which takes two functions as input, and produces a single function output (much like addition or multiplication of functions). The method of combining these functions is defined as
$$(f\star g)(x)=\int_\Omega f(y)g(x-y)dy$$
Where $x$, $y$ both range over all of $\Omega$. Here I will try and present convolution as a very convenient way of solving certain problems, and then abstracting from that; as a mathematical operation important in its own right.
Stones in Water
We will start our investigation by looking at some waves. Imagine you're standing on the edge of a pond, and you throw a small stone in. The stone will cause a ripple to travel outwards across the surface. Bigger stones make bigger ripples, and smaller stones make smaller ripples, like so:
All three of the ripple patterns above share the same general shape, but differ only in their magnitude. How do we go about expressing this observation mathematically? First, let's abstract a bit: if we look at the impacting stone as our input, and the resulting ripple as our output, we then have a process to model: $stone\to ripple$. We can express the dependence of the ripple's size on the weight of the stone by saying that the output scales linearly with the input:
$c\times stone\to c\times ripple$.
This property lets us see that there is some "special" stone we could study: that of unit weight. If we know the ripple caused by this stone, we can then find the ripple caused by any stone (lighter or heavier) by just scaling by the proper amount. If we think of this stone as appropriately "small" (technically, a point), then mathematically we can model it by an impulse. The wave caused by this unit weight pebble is called the impulse response, a relationship we can throw into symbols as follows: $\delta\to h$ where $\delta$ is the impulse and $h$ symbolizes the ripple. So far, by just knowing $h$ we are able to find the output which results from an input of any magnitude. What else can we do with it? Well; a lot it turns out. For one, so far we have been dropping our stones at the (mathematical) origin, but we intuitively expect the shape of our ripple to look the same no matter where we drop the rock. We can express this by saying that if we translate the impulse, we translate the response: $\delta(r-r_0)\to h(r-r_0)$.
I'm going to replace the arrow with a function, $w$, so that instead of saying $\delta\to h$ we will say $w(\delta)=h$. Let's say now that we have a handful of small pebbles of all different weights. We throw them into the water and they spread out, impacting at all different locations. How are we to model the complex ripple that will result? Well, upon impact, the surface just experiences a sum of stones $impact=\sum stones$, and each stone can be viewed as a scaled and translated version of our unit stone ($\delta$). This lets us say it a bit more precisely:
We are now in a position to use our previous knowledge of the linearity of our "wave operator" to write this more explicitly: we know the wave of a sum is the sum of waves, and the wave of a scalar multiple is the scaled wave of the impulse, so we can go ahead and express this symbolically
$$wave=w\left(\sum_i c_i \delta(r-r_i)\right)=\sum_i c_i w(\delta(r-r_i))=\sum_i c_i h(r-r_i)$$
Assuming we know the form of this function h then (the impulse response), we know the full form of our solution for the handful of pebbles:
This idea is at the heart of convolution as a solution technique - we found an initial configuration we understood well (a single impulse) and then when asked to solve a harder problem we attempted to write our new, more complex input as a sum of scaled and translated copies of this impulse we understood. Then we know the solution itself is just a sum of scaled copies of the response to that impulse!
However, the operation of convolution itself is an integral, and we sure don't have anything like that showing up in our formulas yet. This motivating example is kind of like a "discrete convolution" - to arrive at the formula that begins this post we need to look at a different, more continuous example.
Instead of water waves, lets say we are going to take a picture. The setup is something like this: an object to be photographed, the camera lens system, and a piece of film.
The goal of our camera is to take light from the original object and deposit it on the lens; so we can think of our camera as a function from object to film. If we call the object to be photographed the Source, the function to the image the Camera, and the film the Photograph, we can reduce this to a simpler schematic diagram.
If we shine a bright light at the camera, we expect a bright image; if we shine a dim light, we expect a dim image. If we shine two lights, we expect two images. So, we can say that our camera function is linear, much like the waves. We can symbolize this relationship
If our camera is in focus, we expect to see on the image exactly what what was around in the real world. That is, we expect a 1-1 correspondance between the source plane and the photograph.
What if our camera is out of focus? We know from experience that this means what appears to be a single point in the source (think a star) is a smudged blob on our photo. Thus, we don't have a 1-1 correspondence anymore, single source points spread their influence to multiple image points:
To make things easier to keep track of, we will let $h(x,y)$ stand for the blurry image of a star which was originally centered on the source screen (ie at $(0,0)$), after our crappy camera imaged it. Since this star is rather point-like, we will treat it as an impulse and say
or $C[\delta(x,y)]=h(x,y)$. In exact analogy with our wave example, lets say now we take our camera and image the night sky. The source plane (the cosmos) contains a multitude of stars, each of a different brightness. What will our image look like? Well, just knowing how a unit impulse at the origin is imaged, we can figure it out! Viewing each star as a stretched and translated unit impulse; our image will be a collection of stretched and translated blurs. We can decompose the night sky into a sum of these modified impulses, and then distribute our imaging function over the sum, blurring each star individually and adding the results.
$$Source=\sum Stars =\sum_i m_i \delta(x-x_i,y-y_i)$$
Where $m_i$ is the brightness of each star, and $(x_i,y_i)$ is its spatial location.
If we want to determine what our photograph will look like, we can apply our function which represents the camera response to the source:
$$Image=C[Source]=C\left[ \sum_i m_i \delta(x-x_i,y-y_i)\right]=\sum_i m_i C\left[\delta(x-x_i,y-y_i)\right]=\sum_i m_i h(x-x_i,y-y_i)$$
So far we have come up with two seemingly different problems; both which have a simple solution if we first understand how an "impulse" is propagated through them. Now lets take a picture of a portion of the sky with even more stars in it: for example, this small globular cluster which orbits our home galaxy:
We can apply the exact same reasoning as above here: each star is like an impulse, so to see how each star will look on our final picture, we simply take that impulse, propagate it through the camera until it becomes our response function h, and then add all of them back together. The resulting blurry photo looks like this:
It's time to make sure the mathematical formalism of this process really makes sense: at the location of each impulse (star), we apply our "blur" function which models our camera's lack of focus. This blur function spreads the light out from this original point source to a more extended region. If our star is located on the source plane at point $(x_i, y_i)$, we need to make sure that our final image of the star is centered on that location as well. We defined our function $h(x,y)$ to be the blur caused by a unit impulse at the origin, so we need to slide this impulse to occur at our new point $(x_i, y_i)$ instead. This is what the term $h(x-x_i,y-y_i)$ says, it's just a translated version of $h(x,y)$ centered at $(x_i,y_i)$ instead of $(0,0)$. Now if the stars original brightness was not simply $1$ but some other amount, we must scale the brightness of our blur accordingly. Thus the image of a star of brightness $m_i$ located at $(x_i, y_i)$ will be $m_i h(x-x_i,y-y_i)$.
So far, this has just been a more careful repetition of what we have done above. What we are going to do now seems to be a simple change of symbols, but will actually give us some profound insight into convolution itself. The number $m_i$ represents the brightness of our $i$th star, which is located at the point $(x_i, y_i)$ on the source plane. What this is really saying then is that the brightness of the source plane at location $(x_i, y_i)$ is given by $m_i$. That is, we can define a function on the source plane which gives us the brightness at each point. Lets call this function $m(u,v)$. Given a point $(u,v)$ of our source plane (the night sky), $m$ tells us what the brightness is there. Of course, in our star examples so far most points of the plane have a brightness of zero (black sky). But as we saw in the globular cluster, if we try to image a part of the sky with lots and lots of stars, some of them will seem to touch, and form extended areas of brightness.
In fact, in the limit lets say we find a point of the night sky where there are stars everywhere. There are no dark points, every line of sight ends in a star. Our brightness function $m$ would be nonzero everywhere, and every single point of the sky would have an impulse (star) located at it. Can we use what we have learned so far to figure out what our photograph of this area would look like? Sure we can! The only difference now is that we have a continuum of stars, instead of discrete points. But, at each location $(u,v)$ in the night sky, we have an impulse with brightness $m(u,v)$. We can represent this star by this brightness multiplied by an impulse function shifted to be located at $(u,v)$:
This single point source doesn't propagate through our camera unchanged, however; it becomes blurred. This blur is still centered around the original location of the star however, and scaled for brightness by $m(u,v)$ so we can say the image of this particular star is $m(u,v)h(x-u,y-v)$.
Just like in our discrete cases above, we now just need to sum up the images of each star to get the final image. This sum needs to be taken over each $(u,v)$ which has nonzero brightness, which in the present case is all of them! We will need to perform a continuous sum then, over each position $(u,v)$.
This last equation just says the image of our continuous field of stars is the blurred image of each star individually, all added back together. However, this also happens to be a realization of the formula for convolution presented at the top of this post! The resulting image of our star field is just the convolution of our brightness field with the blurred image of a unit star at the origin!
How about the image of a lizard? Well, we were able to treat a continuous field of stars successfully by modeling it via a continuous "brightness" field. What if we just view our lizard as a continuous field of brightness/color? With the stars it was intuitive to say that each point was in reality an "impulse," which was the crux of our argument. Does the same thing carry over here? Is the image of a lizard really nothing more than a weighted collection of impulses? Let's think for a second about a television set displaying our lizard on it's screen. If you get really close to this image, you'll notice that its actually made out of teeny tiny pixels, its nothing more than a collection of tiny impulses! If we shrink these pixels to zero size (as an ideal impulse is), then it seems we can really say that a lizard's image is nothing more than a collection of infinitly small pixels, or impulses!
Since we already know how our camera responds to a unit impulse, all we have to do to model the photo of our lizard with the blurry camera is to take this response, scale it by the brightness at each location, and add them all up.
(Its probably time for a new camera!)
The blurry image is then just the convolution of our original image of the lizard (the brightness field), and the response of the camera to a single impulse.