3D projection is the process of rendering numerically-described 3D geometry as a 2D image. In simple wire-frame rendering, this means transforming the 3D vertices of our objects into 2D coordinates–taking into account our chosen vantage point in the world–and then simply drawing straight lines between those coordinates. So to render a wire-frame cube, for example, we transform the 8 3D vertices of its corners into 2D vertices, and in our image draw straight lines between those vertices. The trick, of course, is in how to do this transformation, taking into account the optics of cameras and the human eye, which cause distant points to converge to the center of the image. In other words, we must take into account the phenomenon of perspective.

In this video, I’ll explain why cameras and our eyes perceive the world in perspective and explain how to compute a perspective projection. We’ll also briefly discuss the simpler alternative, orthogonal projection, which is the style of rendering seen in architectural blueprints, where distant points do NOT converge to the center of the image.

In a later video, we’ll revisit how the projection computation is more commonly described using matrices. For now, though, we’ll stick to just simple algebra, geometry, and trigonometry.

In traditional photography, an image is formed by light striking a piece of light-sensitive film such that the different intensities and frequencies (aka colors) of light affect the different parts of the film surface differently. In digital photography, it’s the same idea, except instead of a surface of light-sensitive chemicals, we have a surface with a grid of sensors which each report a digital measure of the light striking them. These sensors are called CCD light sensors: charge-coupled devices. (Light sensors are actually just one kind of CCD, but in the context of digital cameras, CCD usually implies a CCD light sensor.)

Now, to get an image from the world, what we *cannot* do is simply hold a film or light-sensor array in front of the scene, as shown here. We need the light from the respective parts of the scene to hit the corresponding parts of the surface, e.g. the light coming from the top of the scene–and only that light–should hit the top of the surface, and only the top of the surface. As depicted by the white lines here, the ray of light from the point at the top of the Eiffel tower should hit a point towards the top middle of our surface. What happens in the real world, however, is that light is bouncing all around, such that generally light from all parts of the scene hits all parts of the surface. Here we see some of the unwanted rays of light in red: light from all parts of the scene is hitting the same point on the surface, adding up to far too much light and probably not the right frequency. This happens for all points on the surface, such that we get a blank, all-white image. This is why cameras and the human eye have lenses: to focus the light from different parts of the scene onto different parts of the film or the sensor array.

The very simplest kind of lens is a pinhole lens, which is exactly what it sounds like. Punch a small hole in a box and you have a pinhole camera. As the diagram shows, light from a point at the top of the tree passes through the hole and only strikes a point at the bottom of the box’s interior; light from a point at the bottom of the tree passes through the hole and only strikes a point at the top. Effectively then, the image on the interior back surface gets mirrored upside down. This happens on the horizontal axis as well, at least from a person’s perspective standing behind the box: light from points on the left side of the tree end up on the right side of the box and light from points of the right side of the tree end up on the left side. (This isn’t actually depicted correctly in the diagram. Look closely at the smaller tree and you’ll see it hasn’t been correctly flipped horizontally.)

As you might imagine, a pinhole lens doesn’t produce a great image. Making the pinhole small tends not to allow in enough light, producing a weak image, but making the pinhole larger produces a blurry image: as the pinhole becomes larger, light from each point in the scene passes through the lens in a larger cone; these cones of light strike the back interior of the camera, producing overlapping blotches of color instead of focused points.

Most practical cameras use a lens made up of one or more elements of glass. The idea is that the pieces of glass are shaped such that all light rays from one point in the scene get refracted to the proper point on the film. Here, all light rays from the same point in the scene that reach the front side of the lens are refracted onto the same point on the film. Notice that, like in a pinhole camera, the scene is getting flipped vertically and horizontally.

The proper distance from the lens to the film is called the focal length of the lens. If we moved the film closer to the lens, or if we moved it further away, the rays of light from one point in the scene wouldn’t converge onto the same point of the surface, producing a blurry image.

For detailed reasons of optics we won’t get into, a lens cannot perfectly focus points at all distances in the scene onto the image at once, producing an effect called ‘depth of field’, in which objects in the foreground and/or background may be blurry. With certain lenses and lighting conditions, points in a large range of distances can be nearly focused. Probably the most famous examples of such shots are in *Citizen Kane*, such as this one where the woman in the foreground is in good focus but the men far behind her remain mostly in focus too.

Lenses with different focal lengths produce different images. Here we have a longer focal length on top and a shorter focal length on the bottom. Because light through the long lens refracts at a shallower angle, fewer parts of the scene get focused onto the film, capturing a narrower portion of the scene. With the shorter lens, we get more of the scene onto the film. This is why shorter lenses are also known as wide angle lenses. (Somewhat confusingly, though, it’s not common to refer to a long focal length lens as a ‘narrow lens’, even though they do capture narrower portions of the scene.)

Now, consider an observer standing in a hall with a floor and ceiling that run parallel. Looking down the hall, the light rays from the floor and ceiling that reach the observer converge towards the same vertical part of the observer’s vision–in fact, if the hallway were long enough, the position of the distant parts of the floor and ceiling would be only imperceptibly different. This is perspective: light from the scene converges on the observer at a single point such that points further away in the scene converge towards the center of the image.

Now, with a simple, round lens, light converges directly towards the middle of the image, producing a *curvilinear* effect in which straight lines from the world may end up curved. More elaborate “rectilinear” lenses “correct” for this distortion by converging distant points separately along the X and Y axis instead of directly to the center. This preserves straight lines but at the cost of artificially stretching the image at the edges. In the example here, notice how much wider the rectilinear image is and notice how the wall panels on the left seem to get larger towards the left edge of the image even though those panels are all the same size.

Understand that the distortions of both curvilinear and rectilinear lenses become more apparent for wider angle lenses because, with a wider angle lens, points in the distance effectively converge faster to the center: an object which is a given distance from the lens converges more towards the center the wider the lens.

So now, the question is, which is more realistic, curvilinear or rectilinear? Well on the one hand, curvilinear perspective better preserves the relative sizes of objects, and it arguably better reflects an idealized perspective, in which light from a scene converges to a single point and so distant points converge directly towards the center. On the other hand, rectilinear simply looks *right*: to most people, rectilinear better matches human perception. Answering *why* this is the case quickly gets bogged down in the murky philosophy and science of perception, so we’ll just take it as given.

Getting to 3D rendering now, what we generally want to simulate is a rectilinear projection of a virtual world. In some advanced cases, we might strive to simulate aspects of real-world cameras or the human eye, such as depth of field or some degree of curvilinear bend. We’re just going to keep things simple, though, by ignoring such issues. What we’ll produce are perfectly rectilinear images with an infinite depth of field. This in fact, is basically the default case used in much 3D rendering, games especially.

If what we’re simulating isn’t necessarily a camera or human eye, we can think of our task this way: we have a virtual world and a virtual window in that world, and we want the 2D image which a virtual observer sees looking through that window. (Be clear that our observer is really neither a camera nor a pair of human eyes but instead just a point in space.) Anyway, the first thing to note in this setup is that the observer’s distance to the window changes the field of view: getting close to the window widens the field of view; backing away narrows the field of view. To determine what point from the scene should appear on a point of the image, we extrapolate a line from the observer through that point on the window until it collides with something in the scene. So, here, what the observer sees at the red dot on the window corresponds to a dark point on the tree; the color at that point on the tree is the color we want to see at the point on the image. Extrapolating like this through every point on the window gets us our image.

The question now is how to do this extrapolation from 3D coordinates.

Consider this process in just two dimensions, here from a side view. The line extrapolated through the window passes through a certain point on the window but hits the green apple at a higher point. In effect, this point of the scene gets translated down to where it should appear on the window; notably, were the apple closer along the line of extrapolation, it would require a smaller translation. Also note that the angle of the extrapolation line affects the size of the translation: the smaller the angle, the smaller the translation; once the angle is zero–that is, where the observer sees straight through the window–the translation is also zero. The point in the scene directly straight ahead never gets translated, no matter how close or far the point is.

For points that do need translation, however, the formula to find the point on the window is quite simple, derived by noting that the line of extrapolation forms two overlapping right triangles with the line extrapolated straight through the window. The value of a1 here, the distance from observer to window center, is our focal length, and it is up to us to select when rendering. Assuming then that we have the values a2 and b2, we can find b1 by noting that the ratio of b1 to a1 equals the ratio of b2 to a2 because these are corresponding sides of two right triangles of the same angle. Solving for b1, we get b1 equals a1 times the quantity b2 divided by a2. Assuming the window center to be our origin, then b1 is the height coordinate on the image of our point in the scene.

We can apply the exact same logic to find the point’s *horizontal* position on the image. The only change is that, this time, b1 and b2 are coordinates of our horizontal axis, so finding b1 gets us our horizontal coordinate.
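Written out as equations, the similar-triangles argument looks roughly like this. The symbols a1, a2, b1, and b2 are the ones from the diagram; the names x and y (the point's horizontal and vertical offsets from the center axis) and x' and y' (its coordinates on the image) are just labels assumed here for illustration.

```latex
\[
\frac{b_1}{a_1} = \frac{b_2}{a_2}
\quad\Longrightarrow\quad
b_1 = a_1 \cdot \frac{b_2}{a_2}
\]
\[
x' = a_1 \cdot \frac{x}{a_2},
\qquad
y' = a_1 \cdot \frac{y}{a_2}
\]
```

Notice that both coordinates share the same a1 and a2: only the offset being scaled changes between the vertical and horizontal cases.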

Another way of thinking about this process is that we are squeezing the observer’s field of view into a rectangle such that all the points in our field of view get squeezed along with it: points farther back from the window and farther from the center axis get squeezed proportionally more. And, again, because we’re going for a rectilinear projection, we squeeze the vertical and horizontal axes separately, one before the other instead of at the same time (which would produce a curvilinear projection). So, here, this is what we end up with. Looking at the before and after side-by-side, note that the two red dots had the same distance from the center axis before the squeeze, but the red dot farther from the window gets squeezed more towards the center axis. Imagine then, that those two points described the side of a wall running parallel with the observer’s direction of vision. Once we account for projection, the wall seems to converge in the distance towards the observer’s center of vision. This is just like what we observe with our eyes or a camera: looking down these train tracks, the parallel lines of the rails converge to a point in the distance.

In any case, once we’ve squeezed all our vertices, we have their coordinates as they should appear on our 2D image. Assuming again that the window is the center of our coordinate system and that the horizontal axis is X, the vertical axis is Y, and the depth axis is Z, then the X Y coordinates of the vertices denote their position on the image. That is, assuming 0, 0 denotes the center of the image, which is actually often not the case. Recall from earlier units that 2D pixel coordinates are commonly described in terms of 0,0 at the top left corner with the Y axis pointing down. Also important, 0, 0 in a 2D image usually denotes the corner of the top-left pixel rather than its center. Moreover, we don’t necessarily want one world coordinate unit to have the same dimensions as a single pixel in our image. So to account for all of this, we must translate from our 3D XY coordinate system centered on the window to a 2D XY coordinate system which is possibly centered elsewhere and possibly of a different scale. It’s a simple translation we’ll come back to shortly.

Now, ideally, we’re modeling a field of view that’s shaped like a pyramid with its tip at the focal point, our so-called observer. Any object inside or overlapping this pyramid (here shown as a blue triangle from the side), should show up in our image.

So far, though, we’ve thought of our projection as capturing a virtual observer’s view through a virtual window, implicitly disregarding anything between the observer and the window. To render objects in front of the window, however, we can actually use the very same formula. Just like with objects behind the window, we can draw a line from the observer through an object on the near side, and where that line intersects the window represents where it should appear in the image. The only difference is that the object vertices end up expanding *away* from the center axis instead of contracting towards it.

Problems arise, however, when rendering objects very close to the focal point, for reasons having to do with floating-point rounding error and aspects of rendering which we’ll get into later. Briefly, imagine what happens in our formula as a2, the distance to the vertex, approaches zero: dividing by a smaller and smaller value produces larger and larger results, which may begin to exceed the limits of our floating-point precision, producing ugly errors, especially when we start drawing filled-in polygons. Even worse, a coordinate lying on the focal point would have an a2 value of zero, which in our formula would trigger a divide by zero and thus break the code. To avoid these issues, usual practice is to simply clip the drawn geometry with a Near Clipping Plane such that only geometry behind the plane gets rendered. Here, for example, this apple lies in the field of view but in front of the Near Clipping Plane, so we ignore it in our rendering.
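As a rough sketch of the idea, here is one way a near-plane check might guard the division. The function name `project_or_clip` and the parameter `near_z` are made up for illustration, and the setup assumes the view plane sits at z = 0 with the observer at z = -focal_length. Note also that this simply rejects whole vertices; a real renderer would typically clip lines and polygons against the plane instead.

```python
def project_or_clip(vertex, focal_length, near_z=0.0):
    """Project a vertex onto the view plane, or return None if it is clipped.

    Assumes the view plane sits at z = 0 and the observer at z = -focal_length.
    near_z is the z position of the near clipping plane (hypothetical name).
    """
    if near_z <= -focal_length:
        raise ValueError("near_z must be greater than -focal_length")
    x, y, z = vertex
    if z < near_z:
        # The vertex is on the observer's side of the near plane: skip it.
        return None
    a2 = focal_length + z  # distance from the observer; always > 0 here
    return (focal_length * x / a2, focal_length * y / a2)


print(project_or_clip((5, 10, 30), 20))   # behind the plane, so it is projected
print(project_or_clip((5, 10, -19), 20))  # too close to the observer, so None
```

Because the near plane is required to sit strictly beyond the focal point, a2 can never reach zero for a vertex that survives the check.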

For different reasons, it’s usual to also specify a Far Clipping Plane to cap rendering of objects *past* a certain distance. By rendering only objects within a certain distance, the rendering job can often be greatly simplified and hence made faster, which is of course especially important in games. Earlier 3D games often had to set the clipping distance distractingly close, using distance fog to hide the appearance of distant objects popping in and out of the world. Modern games still use these techniques, but usually set the clipping and fog distance far enough away that it’s not noticeable in most scenes.

By the way, the lopped-off pyramid formed by our truncated field of view is known as the *frustum*. (Make sure to get that correct: there’s no R after the T.)

Now, as a matter of convenience and simplification, it’s common to assume that the viewing plane corresponds with the Near Clipping Plane, that is, assume that it corresponds with the small end of our lopped-off pyramid. Does this mean we lose control of our focal length? Well yes, but no. Be clear that, within a field of view, changing the distance to the view plane doesn’t change the resulting image as long as the view plane changes size to fit the same field of view. So by fixing the view plane to coincide with the Near Clipping Plane instead of specifying them independently, we do lose *direct* control over the focal length, but we can set it *in*directly by setting the field of view. Here, for example, the frustum on the right has a larger field of view and hence a shorter focal length.
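To make the relationship between field of view and focal length concrete, here is a small sketch using basic trigonometry: half the view plane's width is the side opposite half the field-of-view angle, and the focal length is the adjacent side. The function name and parameters are assumptions for illustration, and the angle here is the horizontal field of view in degrees.

```python
import math


def focal_length_from_fov(fov_degrees, plane_width):
    """Distance from the focal point to a view plane of the given width.

    tan(fov / 2) = (plane_width / 2) / focal_length, solved for focal_length.
    """
    if not 0 < fov_degrees < 180:
        raise ValueError("field of view must be between 0 and 180 degrees")
    half_angle = math.radians(fov_degrees) / 2
    return (plane_width / 2) / math.tan(half_angle)


# For the same view plane width, a wider field of view means a shorter focal length.
for fov in (60, 90, 120):
    print(fov, focal_length_from_fov(fov, 100))
```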

Now, if the image plane and clipping plane are tied together, what if an object you want to see is getting clipped? How can we fix this without changing the apparent camera position and field of view? Well, we have two solutions. First, simply scale up the coordinates in your world such that everything is bigger and further apart. This would be as if the world outside your window grew but also moved away from the window such that everything through the window looks the same. Crucially, though, we scale the world relative to the focal point, not the origin at the center of the image plane. This way, some vertices may get moved onto the back side of the near-clipping-plane. Here, for example, one of the two points lies in front of the near clipping plane and so won’t get rendered. If we double the scale of the world, relative to the focal point, now both points get drawn, but we otherwise haven’t changed the image.
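We can check the first solution numerically with a short sketch. It assumes the view plane is at z = 0 and the focal point at z = -focal, and the function names are invented for illustration. Scaling about the focal point multiplies a vertex's offsets and its distance from the observer by the same factor, so the ratio in the projection formula is unchanged.

```python
def project(vertex, focal):
    """Perspective-adjust a vertex, with the view plane at z = 0."""
    x, y, z = vertex
    a2 = focal + z  # distance from the focal point along the view axis
    return (focal * x / a2, focal * y / a2)


def scale_about_focal_point(vertex, focal, factor):
    """Scale a vertex relative to the focal point at (0, 0, -focal)."""
    x, y, z = vertex
    return (factor * x, factor * y, -focal + factor * (z + focal))


focal = 20
point = (3, 6, -5)  # z < 0: in front of the view plane, so it would be clipped
scaled = scale_about_focal_point(point, focal, 2)

print(scaled)                  # z is now 10, behind the view plane
print(project(point, focal))   # same image position before...
print(project(scaled, focal))  # ...and after scaling
```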

Alternatively, we can proportionally scale down the image plane and focal length, effectively moving the image plane closer to the focal point without changing the apparent image. This is certainly simpler, but as mentioned earlier, we should be cautious of letting the distance from the focal point to the Near Clipping Plane get too close to zero.

Anyway, let’s complete the process of rendering a wire-frame image. So again, we start by defining our camera to be facing up the z axis with our view plane centered on the origin. Having specified a focal length–i.e. the distance of the observer to the origin–we then perspective adjust each vertex in our scene.

For example, given a focal length of 20, the vertex at 5, 10, 30 gets adjusted with a1 equal to 20 and a2 equal to 20 plus 30. Computing for x, we plug in the x coord 5 as b2, giving us a new x coord 2. Computing for y, we plug in the y coord 10 as b2, giving us a new y coord 4. So this vertex gets perspective adjusted to coordinate 2, 4 on our view plane. Once we start filling in our polygons, we’ll have further use for the Z values, but in wire-frame rendering, we can ignore the Z values once we have our perspective adjusted vertices.
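Here is that same computation as a minimal sketch, assuming as above that the view plane is centered on the origin so that a2 is the focal length plus the vertex's z coordinate. The function name `perspective_adjust` is just an illustrative choice.

```python
def perspective_adjust(vertex, focal_length):
    """Return the (x, y) view plane coordinates of a 3D vertex.

    Applies b1 = a1 * b2 / a2 once for x and once for y.
    """
    x, y, z = vertex
    a1 = focal_length      # observer to view plane
    a2 = focal_length + z  # observer to the vertex's depth
    return (a1 * x / a2, a1 * y / a2)


print(perspective_adjust((5, 10, 30), 20))  # (2.0, 4.0)
```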

The last coordinate adjustment is to account for a possible difference between the image plane dimensions and the destination image dimensions. In other words, we must translate between view plane coordinates and pixel coordinates because 1 coordinate unit does not necessarily equal the width or height of 1 pixel. Here for example, if our view plane is 100 units wide and 70 units tall but our destination image is 200 pixels wide and 210 pixels tall–and assuming that our pixel coordinates system is centered at the center of the destination image–then a coordinate 20, 15 on the view plane gets translated to 40, 45 in pixel coordinates.

The formula is quite obvious: we observe that the ratio of the pixel x coordinate to the view plane x coordinate equals the ratio of the pixel grid width to the view plane width; likewise, the ratio of the pixel y coordinate to the view plane y coordinate equals the ratio of the pixel grid height to the view plane height. So solving for x of the pixel grid gets us x of the pixel grid equals x of the view plane times the pixel grid width divided by the view plane width. And solving for y of the pixel grid gets us y of the pixel grid equals y of the view plane times the pixel grid height divided by the view plane height.

The last complication is that the pixel grid origin is usually not at the center of the image but rather at the top left or sometimes the *bottom* left. When at the bottom left, we add half the pixel grid width to x and half the pixel grid height to y, which in this example would mean adding 100 and 105, yielding a coordinate 140, 150. When the origin is at the *top* left, we do the same thing but flip our y coordinate by subtracting everything from the pixel grid height, which in this example would mean subtracting 150 from 210, yielding a y coordinate 60.
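Putting the scaling and the origin shift together, a sketch of the conversion might look like the following. The function name and the `origin` parameter with its three string values are assumptions made for this example.

```python
def view_to_pixel(x, y, plane_w, plane_h, image_w, image_h, origin="top_left"):
    """Convert view plane coordinates (centered origin) to pixel coordinates."""
    if origin not in ("center", "bottom_left", "top_left"):
        raise ValueError("unknown origin: " + origin)
    # Scale from view plane units to pixels.
    px = x * image_w / plane_w
    py = y * image_h / plane_h
    if origin == "center":
        return (px, py)
    # Shift the origin from the image center to the bottom left corner.
    px += image_w / 2
    py += image_h / 2
    if origin == "bottom_left":
        return (px, py)
    # Top left origin: y points down, so measure from the top instead.
    return (px, image_h - py)


print(view_to_pixel(20, 15, 100, 70, 200, 210, "center"))       # (40.0, 45.0)
print(view_to_pixel(20, 15, 100, 70, 200, 210, "bottom_left"))  # (140.0, 150.0)
print(view_to_pixel(20, 15, 100, 70, 200, 210, "top_left"))     # (140.0, 60.0)
```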

Finally, once we have our pixel coordinates of the vertices, we get our wireframe rendering by simply drawing lines between the connected vertices. So here on the left, for example, the cube is made up of 8 vertices, and we simply draw lines between the vertices which define our edges. How exactly we keep track of which vertices connect to which is just a detail of how exactly we define our polygons in our data.
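One simple way to keep track of the connections, sketched here purely as an assumption about the data layout, is a list of vertices plus a list of edges given as pairs of vertex indices. The cube's size and position and all of the names below are invented for the example. The result is a list of 2D line segments on the view plane, ready to be converted to pixels and drawn.

```python
from itertools import combinations

# The 8 corners of a cube sitting behind the view plane (z from 10 to 30).
CUBE_VERTICES = [(x, y, z) for x in (-10, 10) for y in (-10, 10) for z in (10, 30)]

# Two corners share an edge when they differ in exactly one coordinate.
CUBE_EDGES = [
    (i, j)
    for i, j in combinations(range(len(CUBE_VERTICES)), 2)
    if sum(a != b for a, b in zip(CUBE_VERTICES[i], CUBE_VERTICES[j])) == 1
]


def wireframe_segments(vertices, edges, focal):
    """Project every vertex, then pair up the 2D points along each edge."""
    projected = [
        (focal * x / (focal + z), focal * y / (focal + z)) for x, y, z in vertices
    ]
    return [(projected[i], projected[j]) for i, j in edges]


for start, end in wireframe_segments(CUBE_VERTICES, CUBE_EDGES, focal=20):
    print(start, "->", end)
```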

The only other thing to note here is that, were our projection meant to be curvilinear rather than rectilinear, we couldn’t simply draw straight lines between the vertices because we’d have to account for how straight lines should bend around the image center. A rectilinear projection spares us that problem.

Because this video already runs long, I’ll not go over actual code implementing this wire-frame rendering, but you’ll find running code examples on the site that build upon our previous 2D drawing code. There are only a couple hundred lines to it, actually, so it shouldn’t take much effort to read and understand.

Anyway, by now, we should understand how to render a 2D image from 3D geometry for the special case where our virtual camera view plane is at the origin of the world and the camera is looking up the Z axis. But what if we want to render the world from another angle and position? Well it turns out to be easiest not to move the camera in our virtual world but rather to move the world around the virtual camera. Whatever camera moves we want to make away from the origin, we get the same effect by actually moving everything in the world in the inverse direction. So for example, if we want to dolly this camera forward closer to the apple, we could instead just move the apple the same distance in the opposite direction. Likewise, instead of moving the camera up, we could just move the apple down the same distance. The same idea applies to rotations. If we want to pitch the camera down, pivoting around the origin, we could instead rotate every object in the world the same angle the opposite way, pivoting around the origin.
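As a small preview, here is a sketch of moving the world instead of the camera for a translation plus a pitch. The function name and parameters are hypothetical, the pitch is treated as a rotation about the X axis in radians, and which direction counts as positive depends on the handedness of your coordinate system.

```python
import math


def world_to_camera(vertex, camera_position, camera_pitch=0.0):
    """Apply the inverse of the camera's move to a world vertex."""
    x, y, z = vertex
    cx, cy, cz = camera_position
    # Undo the camera's translation: move the vertex the opposite way.
    x, y, z = x - cx, y - cy, z - cz
    # Undo the camera's pitch: rotate about the X axis by the opposite angle.
    c, s = math.cos(-camera_pitch), math.sin(-camera_pitch)
    return (x, y * c - z * s, y * s + z * c)


# Dollying the camera 10 units forward is the same as moving the apple 10 units closer.
print(world_to_camera((0, 0, 30), (0, 0, 10)))
# Raising the camera 5 units is the same as lowering the apple 5 units.
print(world_to_camera((0, 0, 30), (0, 5, 0)))
```

The translation is undone before the rotation so that the pivot of the rotation ends up at the camera's own position.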

So now the question is, how do we move objects and rotate them? It’s an important question not just for moving our camera but also moving objects in our world relative to each other, and it’s something we’ll look at in detail when we talk about transformations.

Lastly here, recall that, for a given view plane size, longer focal lengths effectively narrow the field of view. Imagine, then, what happens as our focal length grows into infinity: the bounds of the field of view run parallel to the center axis of vision. What this means is that when it comes time to squeeze our coordinates, none of them move at all: they’re already in their proper position for an infinite focal length. This special case is called an orthogonal projection (also commonly known as an orthographic projection), and it has the effect that lines running into the distance parallel with our vision never converge. Orthogonal projection is the sort of projection used in architectural blueprints because, while it makes object depth often difficult to interpret, it usefully cuts down on the number of visible lines in the image and preserves the relative distances between every point in the 2D plane.
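We can see this limit directly in our projection formula. This sketch assumes, as in the earlier worked example, that the view plane is centered on the origin, so that a2 equals a1 plus the vertex's depth z.

```latex
\[
b_1 = \frac{a_1 \, b_2}{a_1 + z}
    = \frac{b_2}{1 + z / a_1}
\;\longrightarrow\; b_2
\quad \text{as } a_1 \to \infty
\]
```

In other words, as the focal length grows, the depth z matters less and less, until each x and y coordinate simply passes through unchanged.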

You might assume orthogonal projection has no place in games, but most 2D games today are modeled as layers of flat images stacked in 3D space but rendered with an orthogonal projection. Here, for example, this scene is likely made up of a few layers of flat images: a couple layers of background depicting scenery at a few distances, then the foreground floors and walls that actually collide with the player, then the player avatar and the enemies, along with the gun projectile particle sprites and effects, and then a layer on top of the vines, plants, and other decorative objects. This layering not only makes it more straightforward to construct layered scenes, it allows 2D games to utilize the pixel-pushing power of the GPU.

The HUD elements are also likely drawn as layers of an orthogonal projection, but because they’re always rendered on top in fixed positions on the screen, they’re likely rendered in a separate coordinate system. This is in fact also how ‘proper’ 3D games render HUD elements: the 3D world is drawn with a perspective projection, then the HUD is drawn on top with a separate orthogonal projection.