Shader

So far, we have mostly been working on common Vulkan boilerplate that could easily be reused in another application, with minor changes and extra configuration options. Now it’s time to actually start writing some Gray-Scott reaction simulation code.

In this chapter, we will write the part of the simulation code that actually runs on the GPU. Vulkan calls it a shader, but if you feel more comfortable calling it a kernel, I won’t judge you. I’ll let the Linux contributors in the room do it for me. 😉

Introducing GLSL

One benefit of Khronos’ decision to switch to the SPIR-V intermediate representation was that it became theoretically possible to use any compiled programming language you like to write the GPU-side code. All you need is a new compiler backend that emits SPIR-V instead of machine code.

Sadly, compiler backends are big pieces of work, so Rust is not quite there yet. All we have today is a promising prototype called rust-gpu. I hope to introduce it in a future edition of this course[1], but it felt a bit too experimental for this year’s edition.

Therefore, we will prefer the tried and true alternative of Khronos’ standard GPU programming language: the OpenGL Shading Language, or GLSL for short.

GLSL is, along with HLSL, one of the two most important programming languages in the history of GPU computing. It was originally designed as a companion to the OpenGL GPU API, the predecessor to Vulkan in Khronos’ family of GPU APIs, and therefore its specification can be found right next to the OpenGL specification on the Khronos Group website.

GLSL is a derivative of the C programming language that dropped some of C’s most controversial features[2] and introduced many extensions to meet the needs of GPU computing.

To give some examples (a short illustrative snippet follows this list), in GLSL you can easily…

  • Specify how the GPU code will interface with Vulkan on the CPU side.
  • Manipulate vectors and matrices, though only up to dimension 4x4.[3]
  • Exploit GPU texturing units to more efficiently manipulate 2/3D arrays, interpolate 1/2/3D data, and perform on-the-fly conversions between many image formats.
  • Selectively let the compiler optimize operations in a numerically unstable fashion (similar to GCC’s -ffast-math, except SPIR-V makes it fine-grained and calls it RelaxedPrecision).
  • Control how the GPU program’s execution will be decomposed into work-groups (the Vulkan equivalent of NVidia CUDA’s thread blocks).
  • Allow the value of a compilation constant to be specified right before the shader’s SPIR-V is compiled into a device binary (a feature known as specialization constants), to improve configurability and compiler optimizations.
  • Enjoy a more extensive library of built-in math functions than C could ever dream of.
  • …and many other things revolving around the high-speed drawing of textured triangles that we will not care much about in everyday numerical computing.[4]
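As a taste of these conveniences, here is a purely illustrative snippet (the example function is made up, it is not part of our simulation) showing vector/matrix math and a few built-in functions in action:

// Illustrative sketch: component-wise vector math, matrix products,
// and built-in math functions
vec3 example(mat3 m, vec3 v) {
    vec3 rotated = m * v;                   // Matrix-vector product
    vec3 blended = mix(rotated, v, 0.25);   // Linear interpolation
    return clamp(sin(blended), -0.5, 0.5);  // Built-ins apply component-wise
}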

As you can see, this opens quite a few interesting possibilities. And that’s part of why rust-gpu is still experimental in spite of having been in development for quite a while: making a good new GPU programming language is not just a matter of adding a SPIR-V backend to the compiler (which is already a large amount of work), it is also a matter of extending the source programming language and its standard library to expose all the GPU features that GPU-specific programming languages have been exposing forever.[5]

Our first GLSL Gray-Scott

Let us now go through a GLSL implementation of the Gray-Scott reaction simulation. This first implementation isn’t heavily tuned for performance, but it does try to leverage GLSL-specific features wherever the impact on readability is small and a runtime performance benefit can be expected.

First of all, we specify which version of the GLSL specification the program is written against. Here, we are using the latest and greatest GLSL 4.60 specification.

#version 460

Then we start to specify the interface between the CPU and the GPU code. This is a danger zone: anytime we change this part of the GPU code, we must remember to keep the matching CPU-side code in sync, or else the GPU program will produce completely wrong results, if it runs at all.

// Initial concentrations of species U and V
layout(set = 0, binding = 0) uniform sampler2D us;
layout(set = 0, binding = 1) uniform sampler2D vs;

// Final concentrations of species U and V
layout(set = 0, binding = 2) uniform restrict writeonly image2D nextUs;
layout(set = 0, binding = 3) uniform restrict writeonly image2D nextVs;

// Computation parameters uniform
layout(set = 1, binding = 0) uniform Parameters {
    mat3 weights;
    float diffusion_rate_u, diffusion_rate_v,
          feed_rate, kill_rate,
          time_step;
} params;

// Workgroup dimensions
layout(local_size_x = 8, local_size_y = 8) in;

There is a fair bit to unpack here:

  • us and vs are 2D samplers representing our input data. There are two interfaces to GPU texturing units in Khronos APIs, images and samplers. Images let us leverage the optimized data layout of GPU textures for 2/3D spatial locality. Samplers build on top of images to provide us with optional interpolation features (not used here) and well-defined handling of out-of-bound accesses (used here to enforce the simulation’s zero boundary condition).
  • nextUs and nextVs are 2D images representing our output data. Here we do not need samplers because we will not be performing out-of-bounds accesses on the output side. We can also help the GPU compiler optimize code better by telling it that these images do not alias with any other input/output data in memory and will not be read from.
  • params is a struct that lets us pass simulation parameters to the GPU code. Notice that unlike the CPU code, the GPU code does not know about some computation parameters at compile time, which puts it at a slight performance disadvantage. By using specialization constants, we could reverse this and let the GPU code know about all simulation parameters at compile time.
  • All variables introduced so far are uniforms, which means that all work-items (same thing as CUDA threads) in a given shader execution agree on their value. This greatly improves GPU code performance by letting the GPU hardware share associated state (registers, etc) between concurrent computations, and is in fact mandatory for images and samplers.
  • To enable binding associated resources from the CPU side, each of these inputs and outputs is assigned a set and a binding number. Here, we have made all inputs and outputs part of set 0, and all parameters part of set 1. This set-binding hierarchy lets us flip simulation inputs/outputs in a single API transaction, without needing to rebind the unchanging parameters.
  • Finally, we specify the work-group size, which is Vulkan’s equivalent of CUDA’s thread-block size. For now, we hardcode it in the shader code, but later we could make it runtime-configurable using specialization constants, as sketched right after this list.
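To give an idea of where we are headed, here is a hypothetical sketch of how such specialization constants are declared in GLSL (the constant_id numbers and the FEED_RATE example are made up, and none of this is used in this chapter yet):

// Hypothetical: let the CPU side override the work-group size and one
// simulation parameter right before the SPIR-V is compiled into a
// device binary
layout(local_size_x_id = 0, local_size_y_id = 1) in;
layout(constant_id = 2) const float FEED_RATE = 0.014;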

After specifying the CPU/GPU interface, we add a few simple utilities to make it easier to use. First, we write a pair of more ergonomic wrappers on top of GLSL’s very general but somewhat unwieldy image and sampler access functions…

// Read the current value of the concentration
vec2 read_uv(const ivec2 input_idx) {
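    // NOTE: this assumes the sampler is configured with unnormalized
    // coordinates on the CPU side, so that adding 0.5 to the integer
    // texel index makes input_pos point to the texel's center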
    const vec2 input_pos = vec2(input_idx) + 0.5;
    return vec2(
        texture(us, input_pos).r,
        texture(vs, input_pos).r
    );
}

// Write the next value of the concentration
void write_uv(const ivec2 output_idx, const vec2 uv) {
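    // imageStore() always takes a vec4; vec4(x) replicates the scalar,
    // and components beyond those of the image format are ignored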
    imageStore(nextUs, output_idx, vec4(uv.x));
    imageStore(nextVs, output_idx, vec4(uv.y));
}

…and second, we give ourselves a way to refer to the stencil’s 2D dimensions and center position:

// Diffusion stencil properties
const ivec2 stencil_shape = ivec2(params.weights.length(),
                                  params.weights[0].length());
const ivec2 stencil_offset = ivec2((stencil_shape - 1) / 2);
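
For the 3x3 weights matrix used by this simulation, stencil_shape thus evaluates to (3, 3) and stencil_offset to (1, 1): the stencil extends one texel away from its center in every direction.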

Finally, we introduce the body of the Gray-Scott reaction computation:

// What each shader invocation does
void main() {
    // Determine which central input and output location we act on
    const ivec2 center_idx = ivec2(gl_GlobalInvocationID.xy);

    // Read the center value of the concentration
    const vec2 uv = read_uv(center_idx);

    // Compute diffusion term for U and V
    vec2 full_uv = vec2(0.);
    const ivec2 top_left = center_idx - stencil_offset;
    for (int x = 0; x < stencil_shape.x; ++x) {
        for (int y = 0; y < stencil_shape.y; ++y) {
            const vec2 stencil_uv = read_uv(top_left + ivec2(x, y));
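            // NOTE: GLSL matrices are indexed column-first, so
            // weights[x][y] selects row y of column x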
            full_uv += params.weights[x][y] * (stencil_uv - uv);
        }
    }
    const float diffusion_u = params.diffusion_rate_u * full_uv.x;
    const float diffusion_v = params.diffusion_rate_v * full_uv.y;

    // Deduce rate of change in u and v
    const float u = uv.x;
    const float v = uv.y;
    const float uv_square = u * v * v;
    const float du = diffusion_u - uv_square + params.feed_rate * (1.0 - u);
    const float dv = diffusion_v + uv_square
                                 - (params.feed_rate + params.kill_rate) * v;

    // Update u and v accordingly
    write_uv(center_idx, uv + vec2(du, dv) * params.time_step);
}

Again, this could use some unpacking:

  • Like CUDA, Vulkan and GLSL use a data-parallel execution model where each work-item is tasked with processing one element of a 1/2/3-dimensional grid. We can read the position that we are tasked to process from the gl_GlobalInvocationID built-in, a 3D vector of unsigned integers. Because our problem is 2D, we extract the first two coordinates of this vector, and convert them to signed integers because the stencil arithmetic can produce negative coordinates near the edges of the simulation domain (see also the note after this list).
  • We then compute the diffusion term of the differential equation. Here we played a little with GLSL’s vector type to make the computations of U and V less redundant and possibly a little more efficient (depending on how smart your GPU driver’s compiler feels like being today).
  • Finally, we update U and V using almost exactly the same logic as the CPU version.
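
For reference, the global invocation ID is derived from the work-group coordinates through a simple identity between built-in uvec3 variables, which the GLSL specification defines as:

gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID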

Basic CPU-side integration

Following the historical conventions of GPU programmers, we first save our GPU-side code into a source file with a .comp extension, which tells our IDE tooling that this is a compute shader. We propose that you call it grayscott.comp and put it inside of the exercises/src/gpu directory.

We then ensure that this shader is compiled to SPIR-V whenever our project is built by using the vulkano-shaders library provided by the vulkano project. We first add it as a dependency…

cargo add --optional vulkano-shaders

…and within the project’s Cargo.toml, we make it part of our gpu optional feature.

[features]
gpu = ["dep:directories", "dep:vulkano", "dep:vulkano-shaders"]

Then, within our code’s gpu::pipeline module, we create a new Rust module and ask vulkano_shaders to generate the shader’s SPIR-V and some Rust support code inside of it:

/// Compute shader used for GPU-side simulation
mod shader {
    vulkano_shaders::shader! {
        ty: "compute",
        path: "src/gpu/grayscott.comp",
    }
}
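
Among other things, the generated support code lets the CPU side turn the embedded SPIR-V into a Vulkan shader module later on. A minimal sketch, assuming vulkano’s current codegen (which emits a load() function) and an already-created device handle:

// Sketch: `device` is assumed to be an existing Arc<Device>
let shader_module = shader::load(device.clone())?;
let entry_point = shader_module
    .entry_point("main")
    .expect("compute shader should have a main() entry point");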

To make the code more maintainable, we give human-readable names to the shader binding points. These constants will still need to be kept up to date if the associated GLSL interface evolves over time, but at least we won’t need to update magic numbers spread all over the code:

/// Shader descriptor set to which input and output images are bound
pub const IMAGES_SET: u32 = 0;

/// Descriptor within `IMAGES_SET` for sampling of input U concentration
pub const IN_U: u32 = 0;

/// Descriptor within `IMAGES_SET` for sampling of input V concentration
pub const IN_V: u32 = 1;

/// Descriptor within `IMAGES_SET` for writing to output U concentration
pub const OUT_U: u32 = 2;

/// Descriptor within `IMAGES_SET` for writing to output V concentration
pub const OUT_V: u32 = 3;

/// Shader descriptor set to which simulation parameters are bound
pub const PARAMS_SET: u32 = 1;

/// Descriptor within `PARAMS_SET` for simulation parameters
pub const PARAMS: u32 = 0;

/// Work-group shape of the shader
pub const WORK_GROUP_SHAPE: [u32; 2] = [8, 8];
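As a bonus, WORK_GROUP_SHAPE gives us everything we need to compute dispatch sizes later on. A hypothetical helper along these lines could live next to these constants (work_group_count is our own name, not something vulkano provides):

/// Hypothetical helper: number of work-groups needed to cover a 2D
/// simulation domain, rounding up when the domain shape is not a
/// multiple of the work-group shape
pub fn work_group_count(domain_shape: [u32; 2]) -> [u32; 3] {
    [
        domain_shape[0].div_ceil(WORK_GROUP_SHAPE[0]),
        domain_shape[1].div_ceil(WORK_GROUP_SHAPE[1]),
        1,
    ]
}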

Finally, we will need a Rust struct which matches the definition of the GPU-side simulation parameters. Thankfully, vulkano_shaders saves us from the trouble of a duplicate type definition by automatically generating one Rust version of each type from the input GLSL code. All we need to do is to expose it so other code can use it:

/// Simulation parameters, in a GPU-friendly layout
///
/// This syntax lets you re-export a type defined in another module as if you
/// defined said type yourself
pub use shader::Parameters;

And that’s it: now we have autogenerated SPIR-V in our code, along with a CPU-side version of our CPU-GPU interface declarations.

Exercise

Integrate all of the above into the Rust project’s gpu::pipeline module, and make sure that the code still compiles. A test run is not useful here as runtime behavior is unchanged for now.


[1] Along with krnl, a CUDA-like simplified GPU computing layer built on top of rust-gpu and Vulkan.

[2] In particular, GLSL removed a lot of C’s implicit conversions, and fully dropped pointers. The latter are largely replaced by much improved dynamically sized array support and function in/out/inout parameters.

[3] What else would you expect from a language designed by and for computer graphics experts?

[4] …though people these days are increasingly interested in the potential physics simulation applications of the hardware ray-tracing acceleration units that were introduced by recent GPUs from NVidia and AMD.

[5] In case you are wondering, on the C++ side, CUDA and SYCL mostly ignore this problem by not supporting half of GLSL’s most useful features. But the rust-gpu developers aim for performance parity with well-optimized GLSL code, and are therefore unwilling to settle for this shortcut.