Rendering a Screen-space Skybox
NOTE: This won't be an article that goes very deep into details; it's more of a way to document how I implemented the skybox in my renderer. I use Metal, but it should translate to other APIs almost 1-to-1.
When I decided to implement a skybox in my renderer, my first impulse was to follow the ol' reliable LearnOpenGL article on cubemaps, which essentially consists of rendering a giant cube around the camera.
It's a simple and very intuitive approach, but I wanted something a wee bit fancier.
Also, this technique requires uploading extra data (all the cube's vertices, a new view matrix so it "stays in place" as you move the camera, etc.), which wouldn't be necessary with a screen-space approach.
Screen-space implementation
For those unfamiliar with the term, screen-space basically means that we render onto a plane (normally a quad or a giant triangle) that covers the whole screen.
If you want to be super efficient you can use a giant triangle but I opted for a quad because it was simpler to set up (I can hardcode the output vertex positions as the corners of the normalised device coordinates space at the near plane).
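As a rough illustration of the two options, the sketch below computes the hardcoded positions from a vertex index, the way a vertex shader could from its vertex ID. It is plain Swift so it can run on the CPU; the type and function names, and the triangle-strip ordering of the quad, are illustrative assumptions rather than the renderer's actual code.

```swift
/// A 2D position in normalised device coordinates (the screen spans -1...1 on both axes).
struct NDCPosition: Equatable {
    var x: Float
    var y: Float
}

/// Corner of a full-screen quad drawn as a 4-vertex triangle strip.
/// Hypothetical helper: IDs 0...3 map to bottom-left, bottom-right, top-left, top-right.
func fullScreenQuadCorner(vertexID: Int) -> NDCPosition {
    let x: Float = (vertexID & 1) == 0 ? -1 : 1
    let y: Float = (vertexID & 2) == 0 ? -1 : 1
    return NDCPosition(x: x, y: y)
}

/// Vertex of a single oversized triangle that covers the whole screen.
/// Hypothetical helper: IDs 0...2 map to (-1, -1), (3, -1), (-1, 3).
/// Everything outside -1...1 is clipped away by the GPU.
func fullScreenTriangleVertex(vertexID: Int) -> NDCPosition {
    let x: Float = vertexID == 1 ? 3 : -1
    let y: Float = vertexID == 2 ? 3 : -1
    return NDCPosition(x: x, y: y)
}

let quad = (0..<4).map(fullScreenQuadCorner)
let triangle = (0..<3).map(fullScreenTriangleVertex)
```

The triangle's long edge lies on the line x + y = 2, which passes through the top-right corner (1, 1), so the visible screen is fully covered with one primitive and no diagonal seam.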
The fragment shader will sample the cube texture with the view direction in world space.
To get it, I simply multiply the vertex position by the inverse view and projection matrices and let the GPU interpolate it.
Normally, in a vertex shader we go from object to world, to camera, and finally to clip/device space, but in this case we already have the device coordinates and we need the world's.
// So instead of doing:
device_space_pos = projection * view * model * object_space_pos;
// We need the inverse transformation, which implies multiplying by the inverse
// matrices, in reverse order:
world_space_pos = inverse(view) * inverse(projection) * device_space_pos;
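To make the inverse chain concrete, here is a minimal CPU-side sketch in plain Swift. It assumes a symmetric perspective projection and a view space that looks down -Z, and it writes out the two inverses by hand instead of calling a generic matrix inverse. `Vec3`, `Camera` and `worldViewDirection` are hypothetical names, not part of Metal or of the renderer described here.

```swift
import Foundation

struct Vec3 {
    var x: Double, y: Double, z: Double
    static func + (a: Vec3, b: Vec3) -> Vec3 { Vec3(x: a.x + b.x, y: a.y + b.y, z: a.z + b.z) }
    static func * (v: Vec3, s: Double) -> Vec3 { Vec3(x: v.x * s, y: v.y * s, z: v.z * s) }
    var normalized: Vec3 {
        let length = (x * x + y * y + z * z).squareRoot()
        return self * (1 / length)
    }
}

/// Hypothetical camera description: its axes expressed in world space
/// plus the parameters of a symmetric perspective projection.
struct Camera {
    var right: Vec3
    var up: Vec3
    var forward: Vec3        // direction the camera looks along (view-space -Z)
    var verticalFOV: Double  // radians
    var aspectRatio: Double  // width / height
}

/// World-space view direction for a point given in normalised device coordinates.
func worldViewDirection(ndcX: Double, ndcY: Double, camera: Camera) -> Vec3 {
    // inverse(projection): undo the x/y scaling of a symmetric perspective matrix,
    // giving the view-space direction (viewX, viewY, -1).
    let tanHalfFOV = tan(camera.verticalFOV / 2)
    let viewX = ndcX * camera.aspectRatio * tanHalfFOV
    let viewY = ndcY * tanHalfFOV
    // inverse(view), rotation only: re-express that direction with the camera's
    // world-space axes. The view-space -Z axis is the camera's forward vector.
    let direction = camera.right * viewX + camera.up * viewY + camera.forward
    return direction.normalized
}

// Usage: a camera at the default orientation with a 90 degree vertical FOV.
let camera = Camera(right: Vec3(x: 1, y: 0, z: 0),
                    up: Vec3(x: 0, y: 1, z: 0),
                    forward: Vec3(x: 0, y: 0, z: -1),
                    verticalFOV: .pi / 2,
                    aspectRatio: 1)
let centerDirection = worldViewDirection(ndcX: 0, ndcY: 0, camera: camera)
let cornerDirection = worldViewDirection(ndcX: 1, ndcY: 1, camera: camera)
```

The camera position never appears: a cubemap lookup only needs a direction, so the translation part of the view matrix can be ignored.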
Performance tip: Inverting a matrix is an expensive operation, but we can take a shortcut.
Orthonormal matrices (matrices whose columns are perpendicular unit vectors, such as pure rotations) have the convenient property that their inverse is the same as their transpose, which is way easier to compute.
Thankfully, the rotation part of the view matrix is orthonormal (and a skybox ignores the camera translation anyway), so we can save some work there. However, the same doesn't apply to the projection matrix and it has to be inverted "properly".
Note: You could precompute these inverse matrices on the CPU instead of recalculating them for every vertex, but since there are only 4 vertices (potentially 3), the cost of uploading and binding the extra matrices would probably be higher.
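A quick numeric check of the transpose shortcut, as a plain Swift sketch. The 3x3 matrices are made-up stand-ins: a rotation around Y for the rotation part of a view matrix, and a non-uniform scale for the kind of x/y scaling a projection matrix applies. The helper names are hypothetical.

```swift
import Foundation

typealias Matrix = [[Double]]  // square, row-major

func transpose(_ m: Matrix) -> Matrix {
    m.indices.map { row in m.indices.map { col in m[col][row] } }
}

func multiply(_ a: Matrix, _ b: Matrix) -> Matrix {
    a.indices.map { i in
        a.indices.map { j in
            a.indices.reduce(0.0) { sum, k in sum + a[i][k] * b[k][j] }
        }
    }
}

/// True when m equals the identity matrix within a small tolerance.
func isIdentity(_ m: Matrix, tolerance: Double = 1e-9) -> Bool {
    m.indices.allSatisfy { i in
        m.indices.allSatisfy { j in abs(m[i][j] - (i == j ? 1 : 0)) <= tolerance }
    }
}

// A 30 degree rotation around Y: its columns are perpendicular unit vectors.
let angle = 30.0 * Double.pi / 180
let rotation: Matrix = [
    [ cos(angle), 0, sin(angle)],
    [ 0,          1, 0         ],
    [-sin(angle), 0, cos(angle)],
]

// A non-uniform scale: its columns are perpendicular but not unit length.
let scale: Matrix = [
    [0.5, 0, 0],
    [0,   2, 0],
    [0,   0, 1],
]

// If transpose(M) * M is the identity, the transpose really is the inverse.
let rotationTransposeIsInverse = isIdentity(multiply(transpose(rotation), rotation))
let scaleTransposeIsInverse = isIdentity(multiply(transpose(scale), scale))
```

The first check holds and the second does not, which is why transposing works for the camera rotation but not for the projection.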
If we render the skybox as the first thing in the pass, that'd be it. The skybox will "clear" the render target and everything else will go on top (just remember to disable depth writes).
However I wanted to go further and optimise it a bit more.
Avoiding work
I got the idea from this reply to one of my tweets:
It makes a lot of sense!
My test scene only has 3 objects at the moment: the floor, a Stanford Bunny and a Utah Teapot, but a "real" environment will have a lot of things on the screen, and the skybox will take up only a small percentage of the render target. Why would we want to calculate the skybox for every single fragment if most of them will get overwritten anyway? And keep in mind that we are doing 2 expensive operations here: inverting a matrix (vertex stage) and sampling a cubemap (fragment stage).
A stencil is the perfect fit.
In this case we don't need anything fancy; we just need to know if that fragment is "free". So what I did was:
- Move the skybox draw call to the end of the scene pass.
- For every scene object, increment the stencil when a fragment is drawn.
- When rendering the skybox, only draw the fragments that still have a stencil of 0.
This is how my stencil looks at the end of the pass:
The skybox fragment shader will only run for the pixels in black, saving a lot of computing.
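The three steps above can be modelled on the CPU to see which pixels the skybox ends up touching. This is only a toy sketch: `StencilBuffer` is a hypothetical type, the 4x2 "screen" and the object coverage are made up, and the depth test that would normally gate the increment is ignored.

```swift
/// Minimal CPU model of an 8-bit stencil buffer (hypothetical, for illustration only).
struct StencilBuffer {
    let width: Int
    let height: Int
    private(set) var values: [UInt8]

    init(width: Int, height: Int) {
        self.width = width
        self.height = height
        // The stencil is cleared to 0 at the start of the pass.
        values = [UInt8](repeating: 0, count: width * height)
    }

    /// Scene object fragment: compare function "always", pass operation "increment and clamp".
    mutating func drawSceneFragment(x: Int, y: Int) {
        let index = y * width + x
        if values[index] < UInt8.max { values[index] += 1 }
    }

    /// Skybox fragment: compare function "equal" against the reference value.
    /// The stencil itself is left untouched.
    func skyboxFragmentPasses(x: Int, y: Int, reference: UInt8 = 0) -> Bool {
        values[y * width + x] == reference
    }
}

var stencil = StencilBuffer(width: 4, height: 2)

// One object covers the left half of the screen...
for y in 0..<2 {
    for x in 0..<2 {
        stencil.drawSceneFragment(x: x, y: y)
    }
}
// ...and a second object overlaps a single pixel of it.
stencil.drawSceneFragment(x: 0, y: 0)

// Count the pixels the skybox fragment shader would actually run for.
var skyboxPixels = 0
for y in 0..<stencil.height {
    for x in 0..<stencil.width where stencil.skyboxFragmentPasses(x: x, y: y) {
        skyboxPixels += 1
    }
}
```

Only the 4 uncovered pixels on the right half pass the test. The clamped increment also means heavy overdraw cannot wrap the counter back to 0 and accidentally let the skybox through.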
To achieve this in the code I had to:
1. Change the Z buffer format to include stencil.
2. Create 2 separate stencil states: one for the scene rendering and another for the skybox.
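// Scene objects always pass the stencil test and increment the stencil wherever a fragment is drawn
mainStencilDesc.depthStencilPassOperation = .incrementClamp
mainStencilDesc.stencilCompareFunction = .always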
// Only render the skybox for the fragments that have the same stencil value as the reference
skyboxStencilDesc.stencilCompareFunction = .equal
2.1 BONUS: Disable depth writing and testing for the skybox. It's not 100% necessary because the quad will be so close to the camera that it'll always pass, but it has a cost that we can easily avoid.
skyboxDSDesc.frontFaceStencil = skyboxStencilDesc
skyboxDSDesc.isDepthWriteEnabled = false
skyboxDSDesc.depthCompareFunction = .always
3. Swap the states and set the reference value
...
// All other scene draw call encoding
...
cmdEncoder.setDepthStencilState(skyboxDepthStencilState)
cmdEncoder.setStencilReferenceValue(0)
// Encode the skybox draw call
...
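Putting the three steps together, a possible end-to-end setup could look like the sketch below. The helper function names are hypothetical, and a few details are assumptions not stated above: `.less` as the scene depth compare function, mirroring the stencil settings onto `backFaceStencil` so the result does not depend on winding, `.depth32Float_stencil8` as the combined depth/stencil format, and drawing the quad as a 4-vertex triangle strip.

```swift
import Metal

/// Depth-stencil descriptor for regular scene objects (hypothetical helper).
func makeSceneDepthStencilDescriptor() -> MTLDepthStencilDescriptor {
    let stencil = MTLStencilDescriptor()
    stencil.stencilCompareFunction = .always
    stencil.depthStencilPassOperation = .incrementClamp

    let descriptor = MTLDepthStencilDescriptor()
    descriptor.depthCompareFunction = .less  // assumption: a standard depth test for the scene
    descriptor.isDepthWriteEnabled = true
    descriptor.frontFaceStencil = stencil
    descriptor.backFaceStencil = stencil     // assumption: same behaviour for both windings
    return descriptor
}

/// Depth-stencil descriptor for the skybox (hypothetical helper).
func makeSkyboxDepthStencilDescriptor() -> MTLDepthStencilDescriptor {
    let stencil = MTLStencilDescriptor()
    stencil.stencilCompareFunction = .equal  // compared against the reference value set at encode time

    let descriptor = MTLDepthStencilDescriptor()
    descriptor.depthCompareFunction = .always
    descriptor.isDepthWriteEnabled = false
    descriptor.frontFaceStencil = stencil
    descriptor.backFaceStencil = stencil
    return descriptor
}

/// One possible combined format; the depth texture and both the depth and stencil
/// attachment formats of the render pipelines would need to match it.
let depthStencilPixelFormat: MTLPixelFormat = .depth32Float_stencil8

/// Builds both states once, at setup time.
func makeDepthStencilStates(device: MTLDevice) -> (scene: MTLDepthStencilState, skybox: MTLDepthStencilState)? {
    guard let scene = device.makeDepthStencilState(descriptor: makeSceneDepthStencilDescriptor()),
          let skybox = device.makeDepthStencilState(descriptor: makeSkyboxDepthStencilDescriptor())
    else { return nil }
    return (scene, skybox)
}

/// Encodes the skybox after all scene draw calls. Assumes the skybox pipeline
/// state and the cubemap are already bound on the encoder.
func encodeSkybox(with encoder: MTLRenderCommandEncoder, skyboxState: MTLDepthStencilState) {
    encoder.setDepthStencilState(skyboxState)
    encoder.setStencilReferenceValue(0)
    encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
}
```

Building the descriptors in separate functions keeps them inspectable without a GPU, while the states themselves are created once from the device and simply swapped on the encoder each frame.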
Et voilà!
Conclusions
The good
- Pretty simple to implement.
- Ridiculously small bandwidth cost: This approach only requires us to upload the cubemap and 4 vertices (3 if we used a triangle) that don't really need to carry any information. We could even just upload a point and generate the quad in the geometry stage. The cubemap can (and will) be used for other techniques, so it's not really an extra cost and it can stay bound; the skybox uses the same view and projection matrices, so we don't need to bind any; and it uses the same attachments as the rest of the scene, so it can stay in the same pass.
- The most expensive part (cubemap sampling) is only done where it's strictly necessary.
The bad
- Not as "intuitive" as the giant cube approach?
The ugly
- I spent more time than I'm proud to admit debugging a "lens distortion" and doubting my math and graphics knowledge. Turns out it was caused by me "inverting" the projection matrix by transposing it, despite it not being orthonormal.
- Managing the depth-stencil state can be annoying in some graphics APIs?