20100101

Xbox 360: Consideration of an Optimal API

This blog is for rendering engineers: specifically, engineers who work on performance-critical applications and who benefit from engine optimization at the milliseconds-per-frame level.

Over the next series of entries, we are going to look at how the GPU operates, how the CPU “drives” the GPU, and how to distinguish the abstractions that we use from the actual hardware implementation. At the abstraction level, it doesn’t matter what platform we are using – that is the entire purpose of the layer. On the other hand, to have the implementation conversation, we need a real CPU, a real GPU, and a real API – similar to that frog in science class. Without a living, breathing piece of hardware, we can’t get to the meaty stuff … and that is the most important part! I have the benefit of knowing how this conversation “turns out”, and – of the available choices – the Xbox 360 is the best option.

Don’t read anything into this. I am not endorsing the platform – and I am not bashing it either. The Xenon is simply the best vehicle for delivering my communication. I can deliver my distinctions with the least amount of work. This is, of course, the definition of optimization.

A significant benefit of the 360 is the recent release of the “Open GPU” documentation by AMD. The Open GPU release includes Programming Guides for the chipsets “surrounding” the Xenos: the r4xx, r5xx, and r6xx. Using a combination of public information on the Xenos and information in the Open GPU documentation, I can discuss architectural details without endangering the NDAs that I have with AMD.

When non-public details on the Xenos do come up, I will switch to r6xx information. This won’t impact the conversation. Examples of such details are the exact names of packet opcodes (they tend to change slightly between chipset releases) and the exact register offsets on the Xenos GPU. In these cases, a quick look at d3d9gpu.h (in the 360 XDK) will reveal the true opcode name or register offset.

Additionally, there is information that is not public AND not documented in the XDK release. These are symbols that I extracted from the debug version of the Xenon libs. This includes functions for manually creating packets and adding them directly to the active ring (fifo). For PS3 developers – this is similar to functionality provided by libgcm. Unfortunately, I can’t reveal the linker names in this blog. I will use fictitious names, but keep the signatures and describe the actual functionality. Extracting the actual linker symbols will be an exercise for the reader. But – I will be giving you all of the hard stuff: how to build packets, and how to use the fifo submission functions. All you need to do is replace the names!

Whew – I think that completes the disclaimers.

I want to end this post with a “hook” – to get you interested in this conversation, and to give you a sense of the payoff that is available here… We will be discussing the CPU cost of “driving” the GPU during dispatch. We will specifically be looking at the “small batch problem”, and expose it as a myth … an artifact of the fundamental model used for dispatching / configuring the GPU. This will be followed by a proposed model that transforms this “problem” into an insignificant cost. The bottom line: you may not need to sacrifice an entire CPU core to dispatch a frame.

Last: here’s some food for thought… Look at one of your pix2 GPU captures. Check the bottom line (where the totals are). What is the total amount of data that you pushed into the fifo/ring (command buffer data)? 3 MB? More?

Let’s look at what information goes into the command stream:

  • All of the states that control rendering (other than shaders and constants) are 496 bytes. That includes all of the render-states … including the ones that never change.
  • Texture fetch constants are 768 bytes each (vertex fetch constants are smaller)
  • The entire block of ALU shader constants (vertex and pixel) is 8k

So: 496 bytes. 768 bytes. 8k for ALU.

How do you end up sending that much data? PIX doesn’t show you redundant data-setting. (HINT: You are doing a lot of pathological data-commits to the GPU)
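
As a rough back-of-the-envelope check on those figures, here is a minimal C++ sketch. It assumes one block of each kind (496 + 768 + 8k) is re-sent in full, and it reads the 3 MB figure as 3 × 1024 × 1024 bytes. The function names are made up for illustration; nothing here comes from the XDK.

```cpp
#include <cstddef>

// Sizes quoted in the post.
constexpr std::size_t kRenderStateBytes  = 496;       // all render state except shaders and constants
constexpr std::size_t kTextureFetchBytes = 768;       // one texture fetch constant block, as quoted
constexpr std::size_t kAluConstantBytes  = 8 * 1024;  // full vertex + pixel ALU constant block ("8k")

// Bytes pushed into the ring if one commit re-sends one block of each kind.
// 496 + 768 + 8192 = 9456 bytes.
constexpr std::size_t fullSnapshotBytes() {
    return kRenderStateBytes + kTextureFetchBytes + kAluConstantBytes;
}

// How many complete snapshots fit into a command-buffer budget (integer division).
constexpr std::size_t snapshotsThatFit(std::size_t budgetBytes) {
    return budgetBytes / fullSnapshotBytes();
}

// Assumed reading of the "3 MB" total from the capture.
constexpr std::size_t kThreeMegabytes = 3u * 1024u * 1024u;
```

Under those assumptions, `snapshotsThatFit(kThreeMegabytes)` comes out at 332. In other words, a 3 MB command stream is equivalent to re-sending the complete state a few hundred times in a single frame, even though most of it has not changed.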

On each draw event (dispatch) the CPU is responsible for packing this pathological data into bit-packed structures (that match the hardware register layout) and copying the data into the active ring. The preparing of this GPU “configuration” data is where all the CPU time goes. So, rigorously speaking: there isn’t actually a “small batch problem” – there is a “batch overhead problem” … and this is exposed if you have a large number of batches in a frame.
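
To make that per-draw work concrete, here is a toy C++ sketch of the two steps: packing loose state into a bit-packed word, then copying it into a ring. Everything in it is hypothetical. The `BlendState` fields, the bit layout, the two-word "header + payload" packet and the `Ring` type are invented for illustration and do not match the real Xenos registers, packet formats, or any XDK function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// HYPOTHETICAL state block; the field widths are made up.
struct BlendState {
    std::uint32_t srcFactor;  // stored in 5 bits
    std::uint32_t dstFactor;  // stored in 5 bits
    std::uint32_t blendOp;    // stored in 3 bits
    bool          enable;     // stored in 1 bit
};

// Step 1: pack the loose fields into one 32-bit "register" word.
constexpr std::uint32_t packBlendState(const BlendState& s) {
    return  (s.srcFactor & 0x1Fu)
         | ((s.dstFactor & 0x1Fu) << 5)
         | ((s.blendOp   & 0x07u) << 10)
         | ((s.enable ? 1u : 0u)  << 13);
}

// Toy stand-in for the command fifo: a fixed-size ring of 32-bit words.
template <std::size_t N>
struct Ring {
    std::array<std::uint32_t, N> words{};
    std::size_t writeIndex   = 0;  // next slot to fill; wraps at N
    std::size_t wordsWritten = 0;  // running total, i.e. the per-frame cost

    void push(std::uint32_t w) {
        words[writeIndex] = w;
        writeIndex = (writeIndex + 1) % N;
        ++wordsWritten;
    }
};

// Step 2: copy the packed word into the ring. Note that nothing here checks
// whether the state actually changed since the previous draw.
template <std::size_t N>
void commitBlendState(Ring<N>& ring, std::uint32_t registerIndex, const BlendState& s) {
    ring.push(registerIndex);      // hypothetical header word: which register to write
    ring.push(packBlendState(s));  // payload: the bit-packed state
}
```

Calling `commitBlendState` three times with the same state writes six words into the ring, although only the first two carry new information. Scale that up to every state block on every draw, and you get the overhead described above.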

Over the next several weeks, we will be discussing all of the details and considering what an “optimal” API might look like.

Consider this: forgetting about the current API … and looking at the hardware in the console … what would your API design look like?

Consider it.

Until next time -----