Execution

At this point, we have a Vulkan context, we have a compute pipeline, and we have descriptor sets that we can use to configure the pipeline’s inputs and outputs. Now all we need to do is put them together, and we will finally have a complete GPU simulation that we can test and optimize. Getting to this major milestone will be the goal of this chapter.

Work-groups

Back when we introduced this simulation’s GLSL shader, we mentioned in passing that work-groups are the Vulkan (more generally, Khronos) equivalent of NVidia CUDA’s thread blocks. As part of executing our compute pipeline, we will meet them again, so let’s discuss them a little more.

How work-groups came to be

Over two decades ago, when GPU APIs started to expose programmable shaders to developers, they made one very important API design choice: to abstract away the many forms of hardware concurrency and parallelism (SIMD, VLIW/superscalar, hardware threading, multicore…) behind a single unifying “data-parallel for loop” interface.

In this grand unified design, the developer would write their program in terms of how a single visual component1 of the 3D scene is processed, then specify on the CPU side the set of visual components that the scene is made of. The GPU driver and hardware would then be in charge of ensuring that all specified visual components are eventually processed by the developer-specified program, possibly in parallel across multiple execution units.

To give the implementation maximal freedom, GPU APIs only exposed a minimal ability for these shader executions to communicate with each other. Basically, the only API-sanctioned way was to run multiple GPU jobs in a sequence, using outputs from job N to adjust the configuration of job N+1 from the CPU side. Anything else was a non-portable hack that required very good knowledge of the underlying hardware and how the GPU driver would map the GPU API to this hardware.

This model has served the graphics community very well, enabling GPU programs to achieve good compatibility across many different kinds of hardware, and to scale to very high degrees of hardware parallelism due to the small amount of communication that the API enforces. But as the numerical computing community started abusing GPUs for scientific computations, the lack of a fast communication primitive quickly emerged as a problem that warranted a better solution.

Work-groups today

Thus came work-groups, which introduced some hierarchy to the execution domain:

  • The 1/2/3D range of indices over which the compute work is distributed is now sliced into contiguous blocks of uniform size, with blocks of size [1, 1, 1] roughly matching the API semantics of the former configuration.2
  • Each work-item within a work-group is guaranteed to be executed concurrently on the same GPU compute unit, enabling fast communication through hardware channels like the compute unit’s L1 cache (aka shared/local memory) and group-local execution and memory barriers. In contrast, work-items from different work-groups may still only communicate with each other through very clever hacks of dubious portability, as before.
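
To make this hierarchy a little more concrete, here is a small Rust sketch of how a work-item’s global index decomposes into a work-group index and a local index within that group, mirroring what GLSL exposes to compute shaders as gl_WorkGroupID, gl_LocalInvocationID and gl_GlobalInvocationID. The numbers are arbitrary and purely illustrative.

// Illustrative only: how a 2D global work-item index decomposes into
// (work-group index, local index within the work-group).
let work_group_size = [8_u32, 8]; // like GLSL's gl_WorkGroupSize
let work_group_id = [3_u32, 1];   // like gl_WorkGroupID
let local_id = [5_u32, 2];        // like gl_LocalInvocationID
let global_id = [                 // like gl_GlobalInvocationID
    work_group_id[0] * work_group_size[0] + local_id[0],
    work_group_id[1] * work_group_size[1] + local_id[1],
];
assert_eq!(global_id, [29, 10]);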

Since that time, further performance concerns have led to the exposure of even more hardware concepts in the GPU APIs3, further complicating the execution model for those programs that want to use optimally efficient communication patterns. But work-groups remain to this day the only part of the GPU execution model that one needs to care about in order to be able to use a GPU for computation at all. Even if your computation is purely data-parallel, like our first Gray-Scott reaction simulation, you will still need to set a work-group size in order to be able to run it.

Picking a work-group size

So what work-group size should you pick when your code is not communicating and you therefore do not really care? Well, the answer depends on several factors:

  • Work-groups should be at least as large as the GPU’s SIMD vector width, otherwise execution performance will drop by a corresponding factor. More generally, their size should be a multiple of the GPU’s SIMD vector width.
  • Beyond that, if you do not need to communicate or otherwise leverage shared resources, smaller work-groups are often better. They allow the GPU to distribute the work across more compute units, improve load balancing, and reduce pressure on shared compute unit resources like registers and the L1 cache.
  • If you do use the shared resources, then you can try to increase the work-group size until either the trade-off becomes unfavorable and speed goes down or you reach the hardware limit. But that is generally a hardware-specific empirical tuning process.

Knowing that all GPU hardware in common use has a SIMD vector width that is a power of two no larger than 64, a work-group of 64 work-items (a multiple of any such width) sounds like a reasonable start. We may then fine-tune later on by increasing the work-group size in increments of 64, to see how it affects performance.

Moving on, in 2D and 3D problems, there is also a question of work-group shape. Should work-groups be square/cubic? Rectangular? Horizontal or vertical lines? Here again, there is a trade-off:

  • Hardware loves long streams of contiguous memory accesses. So whenever possible, work-group shape should follow the direction of contiguous data in memory and be as elongated as possible in this direction.
  • But in stencil problems like ours, spatial locality also matters: it’s good if the data that is loaded by one work-item can be reused by another work-item within the same work-group, as the two work-items will then be able to leverage the fact that they share an L1 cache for improved memory access performance.

Here we are using images, whose memory layout is unspecified but optimized for 2D locality. And we obviously care about spatial locality for the purpose of loading our input data. Therefore, a square work-group sounds like a good starting point. And all in all, that is why we have initially set the GLSL work-group size to 8x8.
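
On the Rust side, this choice is mirrored by the WORK_GROUP_SHAPE constant of the gpu::pipeline module, which the dispatch code will use later on. The following is only a sketch of what such a constant could look like, assuming a two-element [width, height] layout; whatever its exact form, the important part is that it must be kept in sync with the local_size declaration of the GLSL shader.

/// Work-group shape in [width, height] order, i.e. [columns, rows].
/// Must match the `local_size_x`/`local_size_y` declaration of the GLSL shader.
pub const WORK_GROUP_SHAPE: [u32; 2] = [8, 8];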

A single update step

For the purpose of making the CPU-to-GPU transition faster and easier, we will again mimic the design of the CPU API in the GPU API, even when it means doing some sub-optimal things in the first version. We will therefore again have a function that performs one GPU compute step, except this time said function will have this signature:

use crate::{
    gpu::context::VulkanContext,
    options::RunnerOptions,
};
use std::{error::Error, sync::Arc};
use vulkano::{
    descriptor_set::PersistentDescriptorSet,
    pipeline::ComputePipeline,
};

/// Run a simulation step on the GPU
pub fn update(
    context: &VulkanContext,
    options: &RunnerOptions,
    pipeline: Arc<ComputePipeline>,
    parameters: Arc<PersistentDescriptorSet>,
    concentrations: Arc<PersistentDescriptorSet>,
) -> Result<(), Box<dyn Error>> {
    // TODO: Actually do the work
    todo!()
}

Notice that in addition to the three parameters that you most likely expected (the compute pipeline and the two resource descriptor sets that will be bound to it), we will also need a Vulkan context to submit commands, and the simulation runner options, which tell us how large the simulation domain is. And because this is GPU code, we also need to account for the possibility of API errors.

Inside of this function, we will first translate our simulation domain shape into a GPU dispatch size. This means that we will need to check that the simulation domain is composed of an integer amount of work-groups (a restriction of this version of the simulation which would be hard to work around4), and then translate the simulation domain shape into a number of work-groups that the GPU should run across each spatial dimension.

use crate::gpu::pipeline::WORK_GROUP_SHAPE;

assert!(
    options.num_rows.max(options.num_cols) < u32::MAX as usize,
    "Simulation domain has too many rows/columns for a GPU"
);
let num_rows = options.num_rows as u32;
let num_cols = options.num_cols as u32;
let work_group_cols = WORK_GROUP_SHAPE[0];
let work_group_rows = WORK_GROUP_SHAPE[1];
assert!(
    (num_cols % work_group_cols).max(num_rows % work_group_rows) == 0,
    "Simulation domain size must be a multiple of GPU work-group size"
);
let dispatch_size = [num_cols / work_group_cols, num_rows / work_group_rows, 1];

Notice how, as previously mentioned, we must be careful when mixing simulation domain shapes (where dimensions are normally specified in [rows, columns] order) and GPU 2D spaces (where dimensions are normally specified in [width, height] order).
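
As a quick sanity check of the index gymnastics above, here is a hypothetical example with made-up domain dimensions:

// Hypothetical example: a domain of 480 rows x 640 columns, sliced into 8x8
// work-groups, is covered by 640/8 = 80 work-groups along x (columns) and
// 480/8 = 60 work-groups along y (rows).
let (num_rows, num_cols) = (480_u32, 640);
let [work_group_cols, work_group_rows] = [8_u32, 8];
let dispatch_size = [num_cols / work_group_cols, num_rows / work_group_rows, 1];
assert_eq!(dispatch_size, [80, 60, 1]);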

After this, we’ll start to record a command buffer, and also grab a copy of our compute pipeline’s I/O layout for reasons that will become clear in the next step.

use vulkano::{
    command_buffer::{AutoCommandBufferBuilder, CommandBufferUsage},
    pipeline::Pipeline,
};

let mut builder = AutoCommandBufferBuilder::primary(
    &context.command_alloc,
    context.queue.queue_family_index(),
    CommandBufferUsage::OneTimeSubmit,
)?;
let pipeline_layout = pipeline.layout().clone();

We will then bind our compute pipeline (which moves it away, hence the previous layout copy) and its parameters (which requires a copy of the pipeline layout).

use crate::gpu::pipeline::PARAMS_SET;
use vulkano::pipeline::PipelineBindPoint;

builder
    .bind_pipeline_compute(pipeline)?
    .bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipeline_layout.clone(),
        PARAMS_SET,
        parameters,
    )?;

And then we will bind the current simulation inputs and outputs and schedule an execution of the compute pipeline with the previously computed number of work-groups…

use crate::gpu::pipeline::IMAGES_SET;

builder
    .bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipeline_layout,
        IMAGES_SET,
        concentrations,
    )?
    .dispatch(dispatch_size)?;

Finally, we will build the command buffer, submit it to the command queue, flush the command queue to the GPU, and wait for execution to finish.

use vulkano::{
    command_buffer::PrimaryCommandBufferAbstract,
    sync::GpuFuture,
};

builder
    .build()?
    .execute(context.queue.clone())?
    .then_signal_fence_and_flush()?
    .wait(None)?;

Ok(())

With this, we get an update() function whose semantics roughly match those of the CPU version, and which will thus be easiest to integrate into the existing code. But as you may have guessed while reading the previous code, some sacrifices had to be made to achieve this:

  • We are not leveraging Vulkan’s command batching capabilities. On every simulation step, we need to build a new command buffer that binds the simulation compute pipeline and parameters set, only to execute a single compute dispatch and wait for it. These overheads are especially bad because they are paid once per simulation step, so scaling up the number of compute steps per saved image will not amortize them.
  • Compared to an optimized alternative that would just bind the compute descriptor set and schedule a compute pipeline dispatch, leaving the rest up to higher-level code, our update() function needs to handle many unrelated concerns, and gets a more complex signature and complicated calling procedure as a result.

Therefore, as soon as we get the simulation to run, this should be our first target for optimization.
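
To give an idea of the direction, here is a rough, hypothetical sketch of what a more batch-friendly recording loop could look like. The names steps_per_image, front_to_back and back_to_front are made up for illustration: they stand for the number of compute steps between saved images and for two pre-built descriptor sets implementing double buffering. This is not the design we will actually adopt, just an illustration of the kind of amortization that command batching makes possible.

// Hypothetical sketch: bind the pipeline and parameters once, then record
// many compute steps into a single command buffer, alternating between two
// descriptor sets so that each step reads the previous step's output.
builder
    .bind_pipeline_compute(pipeline)?
    .bind_descriptor_sets(
        PipelineBindPoint::Compute,
        pipeline_layout.clone(),
        PARAMS_SET,
        parameters,
    )?;
for step in 0..steps_per_image {
    let images = if step % 2 == 0 {
        front_to_back.clone()
    } else {
        back_to_front.clone()
    };
    builder
        .bind_descriptor_sets(
            PipelineBindPoint::Compute,
            pipeline_layout.clone(),
            IMAGES_SET,
            images,
        )?
        .dispatch(dispatch_size)?;
}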

Exercise

At long last, we are ready to roll out a full Vulkan-based GPU simulation.

Integrate the simulation update function discussed above at the root of the gpu module of the codebase, then write a new version of run_simulation() in the same module that uses all the infrastructure introduced so far to run the simulation on the GPU.

Next, make the simulation use the gpu version of run_simulation(). To enable CPU/GPU comparisons, it is a good idea to keep it easy to run the CPU version. For example, you could have an off-by-default cpu: bool runner option which lets you switch back to CPU mode.
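
For instance, the simulation binary could pick the implementation with something along these lines. This is only a sketch: the exact module paths and function signatures will depend on how your codebase is organized, and the cpu module name is assumed here for illustration.

// Hypothetical dispatch between the two implementations, based on an
// off-by-default `cpu` runner option.
if options.cpu {
    cpu::run_simulation(&options)?;
} else {
    gpu::run_simulation(&options)?;
}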

Once the simulation builds, make sure that it runs successfully and without significant5 Vulkan validation warnings. You can check the latter by running the simulation binary in debug mode (cargo run --bin simulate without the usual --release option). Beware that such a debug binary can be slow: you will want to run it with a smaller number of simulation steps.

And finally, once the simulation binary seems to run successfully, you will adjust the microbenchmark to evaluate its performance:

  • Add a new GPU update microbenchmark in addition to the existing CPU microbenchmark, so that you can measure how the two implementations compare.
  • Add another GPU microbenchmark that measures the performance of getting the data from the GPU to the CPU (via Concentrations::current_v()).
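
If your existing microbenchmarks are based on criterion, the new GPU update benchmark could follow a skeleton like the one below. Everything here except update() and the types it uses is an assumption made for illustration: the bench_gpu_update name, the idea of having the caller perform the one-time setup, and the crate:: paths, which you will need to adjust to your crate layout.

use crate::{gpu::context::VulkanContext, options::RunnerOptions};
use criterion::Criterion;
use std::sync::Arc;
use vulkano::{descriptor_set::PersistentDescriptorSet, pipeline::ComputePipeline};

/// Hypothetical benchmark body: the caller sets up the Vulkan context,
/// pipeline and descriptor sets once, so that only the per-step update()
/// cost is measured.
pub fn bench_gpu_update(
    c: &mut Criterion,
    context: &VulkanContext,
    options: &RunnerOptions,
    pipeline: Arc<ComputePipeline>,
    parameters: Arc<PersistentDescriptorSet>,
    concentrations: Arc<PersistentDescriptorSet>,
) {
    c.bench_function("gpu_update", |b| {
        b.iter(|| {
            crate::gpu::update(
                context,
                options,
                pipeline.clone(),
                parameters.clone(),
                concentrations.clone(),
            )
            .expect("GPU update should succeed")
        })
    });
}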

Then run the resulting microbenchmark, and compare the performance characteristics of our first GPU version to the optimized CPU version.


1

I’m purposely being vague about what a “visual component” is, since even back in the day there were already two kinds of components with configurable processing (triangle vertex and pixel/fragment), and the list has grown enormously since: geometry primitives, tessellator inputs and outputs, rays, meshlets…

2

…albeit with bad performance, because SIMD cannot help leaking through every nice abstraction and making everyone’s life more complicated.

3

First came sub-groups/warps, which expose hardware SIMD instructions. Then came NVidia’s thread-block clusters, which to my knowledge do not have a standardized Khronos name yet, but are basically about modeling L2 cache shards shared by multiple GPU compute units.

4

It could be done with a simple if in the current version, but would become much harder when introducing later optimizations that leverage the communication capabilities of work-groups and subgroups.

5

As a minor spoiler, you will find that even vulkano gets some subtleties of Vulkan wrong.