The last 90%

At this point, with the default configuration of 34 simulation steps per output image, our computation is fully I/O-bound and has all of the obvious I/O optimizations applied. Therefore, for this default configuration, the next step would be to spend more time studying the HDF5 configuration, in order to figure out whether there is any other I/O tuning knob that we could leverage to speed up the I/O part further. That investigation goes beyond the scope of this GPU-centric chapter.

However, if you reduce the number of saved images by using CLI options like -e340 -n100, -e3400 -n10, or even -e34000 -n1, you will find that the simulation starts speeding up, then saturates at some maximal speed. This is the point where the computation stops being I/O-bound and becomes compute-bound again.

What options would we have to speed the GPU code up in this configuration? Here are some possible tracks that you may want to explore:

  • We have a JIT compiler around, so it's a shame that right now we only treat it as a startup-latency liability and never leverage its benefits. Use specialization constants to ensure that the GPU code is specialized for the simulation parameters that you are using (see the specialization constant sketch after this list).
  • We are doing so little with samplers that it is dubious whether using them is worthwhile, and we should test this. Stop using samplers for input images, and instead try two alternatives: if-testing in the shader, and a strip of zeros on the edges of the GPU image (see the bounds-checking sketch after this list).
  • While we are at it, are we actually sure that 2D images are faster than manual 2D indexing of a 1D buffer, as they intuitively should be? It would be a good idea to write another version that uses buffers and manually caches data in local memory (see the local memory sketch after this list). This will likely change the optimal work-group size, so we will want to make this parameter easily tunable via specialization constants, then tune it.
  • Modern GPUs also have explicit SIMD instructions, which in Vulkan are accessible via the optional subgroup extension. It should be possible to use them to exchange neighboring data between threads faster than local memory allows (see the subgroup sketch after this list). But in doing so, we will need to handle the fact that subgroups are not portable across all hardware (the Vulkan extension may or may not be present), which will likely require some code duplication.
  • The advanced SIMD chapter’s data layout was designed for maximal efficiency on SIMD hardware, and modern GPUs are basically a thin shell around a bunch of SIMD ALUs. Would this layout also help GPU performance? There is only one way to find out.
  • Our microbenchmarks tell us that our GPU is not quite operating at peak throughput when processing a single 1920x1080 image. It would be nice to try processing multiple images in a single compute dispatch, but this will require implementing a global synchronization protocol during the execution of a single kernel, which will in turn require some rather tricky lock-free programming in global memory.
  • So far, we have not attempted to overlap GPU computing with CPU-GPU data transfers, as those data transfers were relatively inexpensive with respect to both compute and storage I/O costs. But if we optimize compute enough, this may change. We would then want to allocate a dedicated data transfer queue, and carefully tune our resource allocations so that those that need to be accessible from both queues are actually marked as such. And then we will need to set up a third image as a GPU-side staging buffer (so the double buffer can still be used for compute when a data transfer is ongoing) and refine our CPU/GPU synchronization logic to get the compute/transfer overlap to work.
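
To make the specialization constant track more concrete, here is a minimal GLSL sketch of how simulation parameters and the work-group size could be exposed as specialization constants. The constant names and default values are hypothetical; the real shader would expose whatever parameters its update rule actually uses.

```glsl
#version 460

// Hypothetical parameter names and defaults. The host overrides them at
// compute pipeline creation time, so the driver's JIT compiler can fold the
// actual values into the generated code as if they were literals.
layout(constant_id = 0) const float DIFFUSION_RATE_U = 0.1;
layout(constant_id = 1) const float DIFFUSION_RATE_V = 0.05;
layout(constant_id = 2) const float FEED_RATE = 0.014;
layout(constant_id = 3) const float KILL_RATE = 0.054;
layout(constant_id = 4) const float TIME_STEP = 1.0;

// The work-group size can also be specialized, which makes it easy to tune
// without touching the GLSL source.
layout(local_size_x_id = 5, local_size_y_id = 6) in;

void main() {
    // ... simulation update using the specialized constants above ...
}
```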
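
For the sampler experiment, the if-testing variant could look something like the following bounds-checking sketch. The descriptor bindings, image format and names are assumptions, and only one neighbor load is shown.

```glsl
#version 460

layout(local_size_x = 8, local_size_y = 8) in;

// Assumed bindings: a read-only input image and a write-only output image.
layout(set = 0, binding = 0, rg32f) uniform readonly image2D input_img;
layout(set = 0, binding = 1, rg32f) uniform writeonly image2D output_img;

// Read a neighbor, substituting zero outside of the image bounds, i.e. the
// boundary handling that the sampler would otherwise provide.
vec2 load_or_zero(ivec2 pos) {
    ivec2 size = imageSize(input_img);
    if (pos.x < 0 || pos.y < 0 || pos.x >= size.x || pos.y >= size.y) {
        return vec2(0.0);
    }
    return imageLoad(input_img, pos).xy;
}

void main() {
    ivec2 pos = ivec2(gl_GlobalInvocationID.xy);
    if (any(greaterThanEqual(pos, imageSize(input_img)))) {
        return;
    }
    vec2 center = imageLoad(input_img, pos).xy;
    vec2 left = load_or_zero(pos + ivec2(-1, 0));
    // ... load the remaining neighbors here ...
    vec2 result = center + left;  // placeholder for the real stencil update
    imageStore(output_img, pos, vec4(result, 0.0, 0.0));
}
```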
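
And here is a sketch of what manual caching in local memory could look like in a buffer-based variant. The tile size is hard-coded for brevity (the tunable version would turn it into a specialization constant), and the bindings, push constants and names are assumptions.

```glsl
#version 460

// Hard-coded tile size for brevity; the tunable version would use a
// specialization constant instead.
#define TILE_X 8
#define TILE_Y 8
layout(local_size_x = TILE_X, local_size_y = TILE_Y) in;

// Assumed bindings: 1D storage buffers indexed manually in 2D.
layout(set = 0, binding = 0) readonly buffer Input { vec2 input_data[]; };
layout(set = 0, binding = 1) writeonly buffer Output { vec2 output_data[]; };
layout(push_constant) uniform Domain { uint width; uint height; } domain;

// Local memory tile with a one-cell halo on each side.
shared vec2 tile[TILE_Y + 2][TILE_X + 2];

vec2 load_or_zero(ivec2 pos) {
    if (pos.x < 0 || pos.y < 0
        || pos.x >= int(domain.width) || pos.y >= int(domain.height)) {
        return vec2(0.0);
    }
    return input_data[pos.y * int(domain.width) + pos.x];
}

void main() {
    ivec2 global = ivec2(gl_GlobalInvocationID.xy);
    ivec2 local = ivec2(gl_LocalInvocationID.xy) + ivec2(1);

    // Every invocation caches its own cell; invocations on the edges of the
    // work group would also load the halo cells (elided for brevity).
    tile[local.y][local.x] = load_or_zero(global);
    barrier();

    // The stencil update then reads its neighbors from the shared tile
    // instead of going back to global memory.
    // ...
}
```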
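
Finally, here is a sketch of the kind of neighbor exchange that subgroup shuffles enable. It assumes that consecutive invocations along the x axis map to consecutive subgroup lanes, which Vulkan does not guarantee and which is part of what makes this approach non-portable; bindings and data loads are elided.

```glsl
#version 460
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_KHR_shader_subgroup_shuffle_relative : require

layout(local_size_x = 32, local_size_y = 1) in;

// ... bindings elided ...

void main() {
    // Hypothetical per-invocation value, e.g. the concentration at this cell
    // (it would normally be loaded from the input data).
    float center = 0.0;

    // Exchange values with the invocations handling the left and right
    // neighbors, without going through local memory.
    float left = subgroupShuffleUp(center, 1);
    float right = subgroupShuffleDown(center, 1);

    // Invocations at the edges of the subgroup receive an undefined value
    // from the shuffle and must fall back to a regular load.
    if (gl_SubgroupInvocationID == 0) {
        left = 0.0;  // placeholder fallback
    }
    if (gl_SubgroupInvocationID == gl_SubgroupSize - 1) {
        right = 0.0;  // placeholder fallback
    }

    // ... use left/center/right in the stencil update ...
}
```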

Also, there are other GPU APIs available from Rust. How much performance do we lose when we improve portability by using wgpu instead of Vulkan? How far along is rust-gpu these days, and is krnl anywhere close to the portable CUDA clone that it aims to be? These are all interesting questions that we should probably explore once we have a good Vulkan version as a reference point, telling us what a mature GPU API is capable of.

As you can probably guess at this point, GPU computing is not immune to old software project management wisdom: once you are done with the first 90% of a project, you are ready to take on the remaining 90%.