Resources

Little by little, we are making progress in our Vulkan exploration, to the point where we now have a compute pipeline that lets us run our code on the GPU. However, we do not yet have a way to feed this code with data inputs and let it emit data outputs. In this chapter, we will see how that is done.

From buffers to descriptor sets

Vulkan has two types of memory resources, buffers and images. Buffers are both simpler to understand and a prerequisite for using images, so we will start with them. In short, a buffer is nothing more than a fixed-sized allocation of CPU or GPU memory that is managed by Vulkan.

By virtue of this simple definition, buffers are very flexible. We can use them for CPU memory that can be (slowly) accessed by the GPU, for sources and destinations of CPU <=> GPU data transfers, for fast on-GPU memory that cannot be accessed by the CPU… But there is a flip side to this flexibility, which is that the Vulkan implementation is going to need some help from our side in order to wisely decide how the memory that backs a buffer should be allocated.

As a first example, let us write down our simulation parameters to memory that is accessible from the CPU, but preferably resident on the GPU, so that subsequent accesses from the GPU are fast.

First of all, we initialize our GPU parameters struct, assuming availability of simulation options:

use crate::{
    gpu::pipeline::Parameters,
    options::{
        UpdateOptions, DIFFUSION_RATE_U, DIFFUSION_RATE_V, STENCIL_WEIGHTS,
    },
};
use vulkano::padded::Padded;

/// Collect GPU simulation parameters
fn create_parameters_struct(update_options: &UpdateOptions) -> Parameters {
    Parameters {
        // Beware that GLSL matrices are column-major...
        weights: std::array::from_fn(|col| {
            // ...and each column is padded to 4 elements for SIMD reasons.
            Padded::from(
                std::array::from_fn(|row| {
                    STENCIL_WEIGHTS[row][col]
                })
            )
        }),
        diffusion_rate_u: DIFFUSION_RATE_U,
        diffusion_rate_v: DIFFUSION_RATE_V,
        feed_rate: update_options.feedrate,
        kill_rate: update_options.killrate,
        time_step: update_options.deltat,
    }
}
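
To make the layout concrete, here is a minimal std-only sketch of the same transposition, with a plain `[f32; 4]` standing in for vulkano's `Padded` wrapper. The `to_padded_columns` helper is hypothetical and only illustrates the idea; it is not part of the project code.

```rust
/// Convert a row-major 3x3 matrix into three columns that are each padded
/// to 4 floats, mimicking the column-major, padded layout described above.
/// (Hypothetical helper: a plain `[f32; 4]` stands in for `Padded`.)
fn to_padded_columns(row_major: [[f32; 3]; 3]) -> [[f32; 4]; 3] {
    std::array::from_fn(|col| {
        // The fourth element is padding and is left at zero
        let mut column = [0.0; 4];
        for row in 0..3 {
            column[row] = row_major[row][col];
        }
        column
    })
}
```

Each output entry is one column of the input matrix followed by one padding element, so the first entry contains the first element of every input row.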

We then create a CPU-accessible buffer that contains this data:

use vulkano::{
    buffer::{Buffer, BufferCreateInfo, BufferUsage},
    memory::allocator::{AllocationCreateInfo, MemoryTypeFilter},
};

let buffer = Buffer::from_data(
    context.memory_alloc.clone(),
    BufferCreateInfo {
        usage: BufferUsage::UNIFORM_BUFFER,
        ..Default::default()
    },
    AllocationCreateInfo {
        memory_type_filter:
            MemoryTypeFilter::HOST_SEQUENTIAL_WRITE
            | MemoryTypeFilter::PREFER_DEVICE,
        ..Default::default()
    },
    create_parameters_struct(update_options),
)?;

Notice how vulkano lets us specify various metadata about how we intend to use the buffer. Having this metadata around allows vulkano and the Vulkan implementation to make better-informed decisions about where memory should be allocated.

This matters because if you run vulkaninfo on a real-world GPU, you will realize that higher-level compute APIs like CUDA and SYCL have been hiding things from you all this time: it is not uncommon for a GPU to expose around 10 different memory types, spread over several memory heaps, with different limitations and performance characteristics. Picking between all of these without having any idea of what you will be doing with your memory objects is ultimately nothing more than an educated guess.

Vulkan memory heaps on my laptop's AMD Radeon RX 5600M

VkPhysicalDeviceMemoryProperties:
=================================
memoryHeaps: count = 3
    memoryHeaps[0]:
        size   = 6174015488 (0x170000000) (5.75 GiB)
        budget = 6162321408 (0x16f4d9000) (5.74 GiB)
        usage  = 0 (0x00000000) (0.00 B)
        flags: count = 1
            MEMORY_HEAP_DEVICE_LOCAL_BIT
    memoryHeaps[1]:
        size   = 33395367936 (0x7c684e000) (31.10 GiB)
        budget = 33368653824 (0x7c4ed4000) (31.08 GiB)
        usage  = 0 (0x00000000) (0.00 B)
        flags:
            None
    memoryHeaps[2]:
        size   = 268435456 (0x10000000) (256.00 MiB)
        budget = 266260480 (0x0fded000) (253.93 MiB)
        usage  = 0 (0x00000000) (0.00 B)
        flags: count = 1
            MEMORY_HEAP_DEVICE_LOCAL_BIT
memoryTypes: count = 11
    memoryTypes[0]:
        heapIndex     = 0
        propertyFlags = 0x0001: count = 1
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[1]:
        heapIndex     = 0
        propertyFlags = 0x0001: count = 1
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                None
    memoryTypes[2]:
        heapIndex     = 1
        propertyFlags = 0x0006: count = 2
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[3]:
        heapIndex     = 2
        propertyFlags = 0x0007: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[4]:
        heapIndex     = 2
        propertyFlags = 0x0007: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                None
    memoryTypes[5]:
        heapIndex     = 1
        propertyFlags = 0x000e: count = 3
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[6]:
        heapIndex     = 1
        propertyFlags = 0x000e: count = 3
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
        usable for:
            IMAGE_TILING_OPTIMAL:
                None
            IMAGE_TILING_LINEAR:
                None
    memoryTypes[7]:
        heapIndex     = 0
        propertyFlags = 0x00c1: count = 3
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD
            MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[8]:
        heapIndex     = 1
        propertyFlags = 0x00c6: count = 4
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD
            MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[9]:
        heapIndex     = 2
        propertyFlags = 0x00c7: count = 5
            MEMORY_PROPERTY_DEVICE_LOCAL_BIT
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD
            MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
    memoryTypes[10]:
        heapIndex     = 1
        propertyFlags = 0x00ce: count = 5
            MEMORY_PROPERTY_HOST_VISIBLE_BIT
            MEMORY_PROPERTY_HOST_COHERENT_BIT
            MEMORY_PROPERTY_HOST_CACHED_BIT
            MEMORY_PROPERTY_DEVICE_COHERENT_BIT_AMD
            MEMORY_PROPERTY_DEVICE_UNCACHED_BIT_AMD
        usable for:
            IMAGE_TILING_OPTIMAL:
                color images
                FORMAT_D16_UNORM
                FORMAT_D32_SFLOAT
                FORMAT_S8_UINT
                FORMAT_D16_UNORM_S8_UINT
                FORMAT_D32_SFLOAT_S8_UINT
            IMAGE_TILING_LINEAR:
                color images
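
To illustrate how usage hints can narrow down this list, here is a simplified sketch of a memory type selection. It is only an illustration under stated assumptions: `pick_memory_type` is a hypothetical helper, not the algorithm that vulkano or a Vulkan driver actually uses, and the `RX_5600M_TYPES` table simply copies the `propertyFlags` values from the listing above.

```rust
// Values of the standard Vulkan memory property flag bits
const DEVICE_LOCAL: u32 = 0x1;
const HOST_VISIBLE: u32 = 0x2;

/// `propertyFlags` of the 11 memory types from the listing above
const RX_5600M_TYPES: [u32; 11] = [
    0x01, 0x01, 0x06, 0x07, 0x07, 0x0e, 0x0e, 0xc1, 0xc6, 0xc7, 0xce,
];

/// Pick the first memory type that has all `required` flags, favoring
/// types that also have all `preferred` flags.
/// (Simplified illustration, not vulkano's actual selection algorithm.)
fn pick_memory_type(types: &[u32], required: u32, preferred: u32) -> Option<usize> {
    let has = |flags: u32, wanted: u32| (flags & wanted) == wanted;
    types
        .iter()
        .position(|&flags| has(flags, required | preferred))
        // Fall back to the required flags only if the preference cannot be met
        .or_else(|| types.iter().position(|&flags| has(flags, required)))
}
```

With `HOST_VISIBLE` required and `DEVICE_LOCAL` preferred, which is loosely analogous to the filter used for our parameters buffer, this sketch picks memory type 3, which lives in the small 256 MiB heap that is both device-local and host-visible.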

Finally, as we mentioned previously, Vulkan makes you group memory resources together in descriptor sets in order to let you amortize the cost of binding resources to shaders. These groups should not be assembled randomly, of course. Resources should be grouped such that shader parameters that change together are grouped together, and slow-changing parameters are not grouped together with fast-changing parameters. In this way, most of the time, you should be able to rebind only those GPU resources that did change, in a single API binding call.

Because our Gray-Scott simulation is relatively simple, we have no other memory resource with a lifetime that is similar to the simulation parameters (which can be bound for the duration of the entire simulation). Therefore, our simulation parameters will be alone in their descriptor set:

use crate::gpu::pipeline::{PARAMS_SET, PARAMS};
use vulkano::{
    descriptor_set::{
        persistent::PersistentDescriptorSet,
        WriteDescriptorSet
    },
    pipeline::Pipeline,
};

let descriptor_set = PersistentDescriptorSet::new(
    &context.descriptor_alloc,
    pipeline.layout().set_layouts()[PARAMS_SET as usize].clone(),
    [WriteDescriptorSet::buffer(PARAMS, buffer)],
    [],
)?;

All in all, if we leverage the shortcuts that vulkano provides us with, it essentially takes three steps to go from having CPU-side simulation parameters to having a parameters descriptor set that we can bind to our compute pipeline:

use crate::gpu::context::VulkanContext;
use std::{sync::Arc, error::Error};
use vulkano::{
    pipeline::compute::ComputePipeline,
};

pub fn create_parameters(
    context: &VulkanContext,
    pipeline: &ComputePipeline,
    update_options: &UpdateOptions,
) -> Result<Arc<PersistentDescriptorSet>, Box<dyn Error>> {
    // Assemble simulation parameters into a struct with a GPU-compatible layout
    let parameters = create_parameters_struct(update_options);

    // Create a buffer and put the simulation parameters into it
    let buffer = Buffer::from_data(
        context.memory_alloc.clone(),
        BufferCreateInfo {
            usage: BufferUsage::UNIFORM_BUFFER,
            ..Default::default()
        },
        AllocationCreateInfo {
            memory_type_filter:
                MemoryTypeFilter::HOST_SEQUENTIAL_WRITE
                | MemoryTypeFilter::PREFER_DEVICE,
            ..Default::default()
        },
        parameters,
    )?;

    // Create a descriptor set containing only this buffer, since no other
    // resource has the same usage pattern as the simulation parameters
    let descriptor_set = PersistentDescriptorSet::new(
        &context.descriptor_alloc,
        pipeline.layout().set_layouts()[PARAMS_SET as usize].clone(),
        [WriteDescriptorSet::buffer(PARAMS, buffer)],
        [],
    )?;
    Ok(descriptor_set)
}

Concentration images

Did you say images?

As previously mentioned, Vulkan images give us access to the power of GPU texturing units, which were designed to speed up and simplify the chore of handling multidimensional data with optional linear interpolation, proper handling of boundary conditions, on-the-fly data decompression, and easy conversions between many common pixel formats.

But alas, this is the GPU world¹, so with great power comes great API complexity:

To handle multidimensional data efficiently, GPU texturing units must store it in an optimized memory layout. The actual layout is hidden from you², which means that you must prepare your data into a buffer with a standard layout, and then use a special copy command to make the GPU translate from your standard layout to its optimized internal texture layout.
To handle on-the-fly data decompression, interpolation and pixel format conversions, the GPU hardware must know about the data types that you are manipulating. This means that images cannot accept all 1/2/3D arrays of any arbitrary user-defined data types, and must instead be restricted to a finite set of pixel types whose support varies from one GPU to another.

Staging buffer

Since we have to, we will start by building a buffer that contains our image data in Vulkan’s standard non-strided row-major layout. Because buffers are one-dimensional, this will require a little bit of index wrangling, but a survivor of the Advanced SIMD chapter should easily handle it.

use crate::{data::Float, options::RunnerOptions};
use vulkano::buffer::{AllocateBufferError, Subbuffer};

/// Set up U and V concentration buffers
///
/// Returns first U, then V. We are not making this a struct yet because there
/// is a bit more per-species state that we will eventually need.
fn create_concentration_buffers(
    context: &VulkanContext,
    options: &RunnerOptions,
) -> Result<[Subbuffer<[Float]>; 2], Validated<AllocateBufferError>> {
    // Define properties that are common to both buffers
    let [num_rows, num_cols] = [options.num_rows, options.num_cols];
    let num_pixels = num_rows * num_cols;
    let create_info = BufferCreateInfo {
        usage: BufferUsage::TRANSFER_SRC | BufferUsage::TRANSFER_DST,
        ..Default::default()
    };
    let allocation_info = AllocationCreateInfo {
        memory_type_filter:
            MemoryTypeFilter::HOST_SEQUENTIAL_WRITE
            | MemoryTypeFilter::PREFER_HOST,
        ..Default::default()
    };
    let pattern = |idx: usize| {
        let (row, col) = (idx / num_cols, idx % num_cols);
        (row >= (7 * num_rows / 16).saturating_sub(4)
            && row < (8 * num_rows / 16).saturating_sub(4)
            && col >= 7 * num_cols / 16
            && col < 8 * num_cols / 16) as u8 as Float
    };

    // Concentration of the U species
    let u = Buffer::from_iter(
        context.memory_alloc.clone(),
        create_info.clone(),
        allocation_info.clone(),
        (0..num_pixels).map(|idx| 1.0 - pattern(idx)),
    )?;

    // Concentration of the V species
    let v = Buffer::from_iter(
        context.memory_alloc.clone(),
        create_info,
        allocation_info,
        (0..num_pixels).map(pattern),
    )?;
    Ok([u, v])
}
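
The index wrangling performed by the `pattern` closure boils down to the following pair of conversions, shown here as a minimal std-only sketch. The `flatten` and `unflatten` names are hypothetical helpers introduced for illustration.

```rust
/// Convert a flat buffer index into 2D (row, column) coordinates,
/// assuming a non-strided row-major layout
fn unflatten(idx: usize, num_cols: usize) -> (usize, usize) {
    (idx / num_cols, idx % num_cols)
}

/// Convert 2D (row, column) coordinates back into a flat buffer index
fn flatten(row: usize, col: usize, num_cols: usize) -> usize {
    row * num_cols + col
}
```

Because the layout is row-major, consecutive flat indices walk along a row first, and the row index only increases once every `num_cols` elements.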

Creating images

Now that we have buffers of initial data, we can make images…

use vulkano::{
    format::Format,
    image::{AllocateImageError, Image, ImageCreateInfo, ImageUsage},
};

/// Create 4 images for storing the input and output concentrations of U and V
fn create_images(
    context: &VulkanContext,
    options: &RunnerOptions,
) -> Result<Vec<Arc<Image>>, Validated<AllocateImageError>> {
    let create_info = ImageCreateInfo {
        format: Format::R32_SFLOAT,
        extent: [options.num_cols as u32, options.num_rows as u32, 1],
        usage: ImageUsage::TRANSFER_SRC | ImageUsage::TRANSFER_DST
               | ImageUsage::SAMPLED | ImageUsage::STORAGE,
        ..Default::default()
    };
    let allocation_info = AllocationCreateInfo::default();
    let images =
        std::iter::repeat_with(|| {
            Image::new(
                context.memory_alloc.clone(),
                create_info.clone(),
                allocation_info.clone(),
            )
        })
        .take(4)
        .collect::<Result<Vec<_>, _>>()?;
    Ok(images)
}

…but some choices made in the above code could use a little explanation:

You need to be careful that numerical computing APIs and GPU APIs do not use the same conventions when denoting multidimensional array shapes. Numerical scientists tend to think of 2D shapes as [num_rows, num_cols], whereas graphics programmers tend to think of them as [width, height], which is the reverse order. Pay close attention to this, as it is a common source of bugs when translating numerical code to GPU APIs.
Our images need to have all of these usage bits set because we will need to send initial data from the CPU (TRANSFER_DST), use some images as input (SAMPLED) and other images as output (STORAGE) in a manner where images keep alternating between the two roles, and in the end bring output data back to the CPU (TRANSFER_SRC).
- Here the Vulkan expert may reasonably wonder if the implementation could go faster if we allocated more images with more specialized usage flags and copied data between them as needed. We will get back to this question once we are done implementing the naive simulation and start discussing possible optimizations.
The final code statement uses semi-advanced iterator trickery to create 4 images without duplicating the associated line of code 4 times, even though vulkano’s Image type is not clonable because Vulkan provides no easy image-cloning operation.
- The iterator pipeline is made a little bit more complex by the fact that image creation can fail, and therefore returns a Result. We do not want to end up with a Vec<Result<Image, _>> in the end, so instead we leverage the fact that Rust allows you to collect an iterator of Result<Image, Err> into a Result<Vec<Image>, Err>.
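
The two Rust-level points from this list, the shape convention swap and the collection of fallible results, can be sketched with the standard library alone. Both `extent_from_shape` and `create_many` are hypothetical helpers introduced for illustration, not vulkano APIs.

```rust
/// Convert an ndarray-style [num_rows, num_cols] shape into a
/// graphics-style [width, height, depth] extent for a 2D image
fn extent_from_shape([num_rows, num_cols]: [usize; 2]) -> [u32; 3] {
    [num_cols as u32, num_rows as u32, 1]
}

/// Call a fallible constructor up to `count` times, stopping at the first
/// error instead of building a `Vec` of `Result`s
fn create_many<T, E>(
    count: usize,
    create: impl FnMut() -> Result<T, E>,
) -> Result<Vec<T>, E> {
    std::iter::repeat_with(create).take(count).collect()
}
```

Collecting into a `Result<Vec<T>, E>` short-circuits: as soon as one call fails, the error is returned and the constructor is not called again.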

Introduction to command buffers

As we have discussed previously, Vulkan images have no initial content. We need to transfer the initial data from our U and V concentration buffers to a pair of images, which will be our initial input. To do this, we will have to send the GPU some buffer-to-image copy commands, which means that for the first time in this Vulkan practical, we will actually be asking the GPU to do some work.

Sending work to the GPU is very resource-intensive for several reasons, one of which is that CPU-GPU communication over the PCIe bus has a high latency and involves a complicated protocol. This means that the only way to make good use of this hardware interconnect is to send not one, but many commands at the same time. Vulkan helps you to do so by fully specifying its GPU command submission API in terms of batches of commands called command buffers, which are explicitly exposed in the API and therefore can be easily generated by multiple CPU threads in scenarios where command preparation on the CPU side becomes a bottleneck.³
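
To get a feel for why batching matters, here is a toy cost model. The `total_time_us` helper and every number fed into it are made up for illustration; this is not a measurement of any real hardware, and real submission costs are not this simple.

```rust
/// Estimate the total time (in microseconds) needed to run `num_commands`
/// GPU commands when they are submitted in batches of `batch_size`,
/// paying a fixed `submit_latency_us` for each submission.
/// (Toy cost model with made-up parameters, not a measurement.)
fn total_time_us(
    num_commands: u64,
    batch_size: u64,
    submit_latency_us: u64,
    command_time_us: u64,
) -> u64 {
    assert!(batch_size > 0, "batches must contain at least one command");
    // Round up, since a partially filled batch still costs one submission
    let num_submissions = (num_commands + batch_size - 1) / batch_size;
    num_submissions * submit_latency_us + num_commands * command_time_us
}
```

In this model, with a hypothetical 10 µs submission latency and 1 µs per command, submitting 1000 commands one at a time costs 11000 µs, whereas submitting them as a single batch costs 1010 µs.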

In fact, Vulkan took this one step further, and introduced the notion of secondary command buffers to let you use command buffers as commands inside of other command buffers, which is convenient when you want to reuse some but not all GPU commands from one computation to another…

Mandatory meme regarding command buffer nesting

…but this Vulkan feature is rather controversial and GPU vendors provide highly contrasting advice regarding its use (just compare the sections of the AMD and NVIDIA optimization guides on command submission). Therefore, it will come as no surprise that we will not cover the topic any further in this course. Instead, we will focus our attention on primary command buffers, which are the “complete” buffers that are submitted to the GPU at the end.

Initializing input images

To initialize our input images, we will start by creating an AutoCommandBufferBuilder, which is vulkano’s high-level abstraction for making Vulkan command buffers easier to build:

use vulkano::command_buffer::{
    auto::AutoCommandBufferBuilder,
    CommandBufferUsage,
};

let mut upload_builder = AutoCommandBufferBuilder::primary(
    &context.command_alloc,
    context.queue.queue_family_index(),
    CommandBufferUsage::OneTimeSubmit,
)?;

Then we’ll add a pair of commands to fill the first images with data from our concentration buffers:

use vulkano::command_buffer::CopyBufferToImageInfo;

for (buffer, image) in buffers.iter().zip(images) {
    upload_builder.copy_buffer_to_image(
        CopyBufferToImageInfo::buffer_image(
            buffer.clone(),
            image.clone(),
        )
    )?;
}

As it turns out, we have no other GPU work to submit during application initialization. Therefore, we will build a command buffer with just these two upload commands…

let command_buffer = upload_builder.build()?;

…submit it to our command queue…

use vulkano::command_buffer::PrimaryCommandBufferAbstract;

let execute_future = command_buffer.execute(context.queue.clone())?;

…ask Vulkan to send all pending commands to the GPU side and tell us when they are done executing via a synchronization primitive called a fence⁴…

use vulkano::sync::GpuFuture;

let fence_future = execute_future.then_signal_fence_and_flush()?;

…and finally wait for the work to complete without any timeout:

fence_future.wait(None)?;

After this line of code, our first two concentration images will be initialized. To put it all together…

fn initialize_images(
    context: &VulkanContext,
    buffers: &[Subbuffer<[Float]>],
    images: &[Arc<Image>],
) -> Result<(), Box<dyn Error>> {
    assert!(images.len() > buffers.len());
    let mut upload_builder = AutoCommandBufferBuilder::primary(
        &context.command_alloc,
        context.queue.queue_family_index(),
        CommandBufferUsage::OneTimeSubmit,
    )?;
    for (buffer, image) in buffers.iter().zip(images) {
        upload_builder.copy_buffer_to_image(
            CopyBufferToImageInfo::buffer_image(
                buffer.clone(),
                image.clone(),
            )
        )?;
    }
    // Notice how these APIs are meant to chain nicely, like iterator adapters
    upload_builder
        .build()?
        .execute(context.queue.clone())?
        .then_signal_fence_and_flush()?
        .wait(None)?;
    Ok(())
}

Descriptor sets at last

With all the work it took to initialize our concentration images, it is easy to forget that what we will eventually bind to our compute pipeline is not individual images, but descriptor sets composed of four images: two input images with sampling, and two output images.

Unlike with simulation parameters, we will need two descriptor sets this time due to double buffering: one descriptor set where images #0 and #1 serve as inputs and images #2 and #3 serve as outputs, and another descriptor set where images #2 and #3 serve as inputs and images #0 and #1 serve as outputs. We will then start with the first descriptor set and alternate between the two descriptor sets as the simulation keeps running.

To this end, we will first write a little utility closure that sets up one descriptor set with two input images and two output images…

use crate::gpu::pipeline::{IMAGES_SET, IN_U, IN_V, OUT_U, OUT_V};
use vulkano::{
    image::view::{ImageView, ImageViewCreateInfo},
    Validated, VulkanError,
};

let create_set =
    |[in_u, in_v, out_u, out_v]: [Arc<Image>; 4]| -> Result<_, Validated<VulkanError>> {
        let layout = pipeline.layout().set_layouts()[IMAGES_SET as usize].clone();
        let binding =
            |binding, image: Arc<Image>, usage| -> Result<_, Validated<VulkanError>> {
                let view_info = ImageViewCreateInfo {
                    usage,
                    ..ImageViewCreateInfo::from_image(&image)
                };
                Ok(WriteDescriptorSet::image_view(
                    binding,
                    ImageView::new(image, view_info)?,
                ))
            };
        let descriptor_set = PersistentDescriptorSet::new(
            &context.descriptor_alloc,
            layout,
            [
                binding(IN_U, in_u, ImageUsage::SAMPLED)?,
                binding(IN_V, in_v, ImageUsage::SAMPLED)?,
                binding(OUT_U, out_u, ImageUsage::STORAGE)?,
                binding(OUT_V, out_v, ImageUsage::STORAGE)?,
            ],
            [],
        )?;
        Ok(descriptor_set)
    };

…and then we will use it to build our double-buffered descriptor set configuration:

fn create_concentration_sets(
    context: &VulkanContext,
    pipeline: &ComputePipeline,
    images: &[Arc<Image>],
) -> Result<[Arc<PersistentDescriptorSet>; 2], Validated<VulkanError>> {
    let create_set = /* ... as above ... */;
    let [u1, v1, u2, v2] = [&images[0], &images[1], &images[2], &images[3]];
    let descriptor_sets = [
        create_set([u1.clone(), v1.clone(), u2.clone(), v2.clone()])?,
        create_set([u2.clone(), v2.clone(), u1.clone(), v1.clone()])?,
    ];
    Ok(descriptor_sets)
}
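
The alternation between the two descriptor sets can be summarized by the following std-only sketch, which follows the image ordering used in `create_concentration_sets`. The `image_roles` helper is hypothetical and only exists to make the double buffering pattern explicit.

```rust
/// Indices of the (U, V) input images and (U, V) output images that are
/// used at a given simulation step (hypothetical helper, for illustration)
fn image_roles(step: usize) -> ([usize; 2], [usize; 2]) {
    if step % 2 == 0 {
        // Even steps use the first descriptor set
        ([0, 1], [2, 3])
    } else {
        // Odd steps use the second descriptor set
        ([2, 3], [0, 1])
    }
}
```

The key property is that the output images of one step are the input images of the next step, so no data needs to be copied between steps.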

Concentration state

As you can see, setting up the concentration images involved creating a fair amount of Vulkan objects. Thankfully, owing to vulkano’s high-level API design⁵, we will not need to keep all of them around, and it will suffice to keep the objects that we use directly:

Descriptor sets are used to specify our compute pipeline’s concentration inputs and outputs.
The V species images are used, together with a matching buffer, to download the concentration of V from the GPU for the purpose of saving output data to HDF5 files.

We will bundle this state together using a double buffering abstraction similar to the one that we have used previously in the CPU code, trying to retain as much API compatibility as we can to ease the migration. But as you will see, this comes at a significant cost, so we may want to revisit this decision later on, after we get the code in a runnable and testable state.

use crate::gpu::context::CommandBufferAllocator;
use ndarray::Array2;
use vulkano::{
    command_buffer::CopyImageToBufferInfo,
    device::Queue,
};

pub struct Concentrations {
    // Double buffered data
    descriptor_sets: [Arc<PersistentDescriptorSet>; 2],
    v_images: [Arc<Image>; 2],
    src_is_1: bool,

    // Area where we download the concentration of the V species
    v_buffer: Subbuffer<[Float]>,
    // Unsatisfying copy of the contents of `v_buffer`, present for API
    // compatibility reasons.
    v_ndarray: Array2<Float>,

    // Subset of the GPU context that we need to keep around
    // in order to provide CPU API compatibility
    command_alloc: Arc<CommandBufferAllocator>,
    queue: Arc<Queue>,
}
//
impl Concentrations {
    /// Set up the simulation state
    pub fn new(
        context: &VulkanContext,
        options: &RunnerOptions,
        pipeline: &ComputePipeline,
    ) -> Result<Self, Box<dyn Error>> {
        let buffers = create_concentration_buffers(context, options)?;
        let images = create_images(context, options)?;
        initialize_images(context, &buffers, &images)?;
        let descriptor_sets = create_concentration_sets(context, pipeline, &images)?;
        let [_, v_buffer] = buffers;
        Ok(Self {
            descriptor_sets,
            v_images: [images[1].clone(), images[3].clone()],
            src_is_1: false,
            v_buffer,
            v_ndarray: Array2::zeros([options.num_rows, options.num_cols]),
            command_alloc: context.command_alloc.clone(),
            queue: context.queue.clone(),
        })
    }

    /// Shape of the simulation domain (using ndarray conventions)
    pub fn shape(&self) -> [usize; 2] {
        assert_eq!(self.v_images[0].extent(), self.v_images[1].extent());
        let extent = &self.v_images[0].extent();
        assert!(extent[2..].iter().all(|dim| *dim == 1));
        [extent[1] as usize, extent[0] as usize]
    }

    /// Read out the current V species concentration
    pub fn current_v(&mut self) -> Result<&Array2<Float>, Box<dyn Error>> {
        // Download the current V species concentration
        let current_image = self.v_images[self.src_is_1 as usize].clone();
        let mut download_builder = AutoCommandBufferBuilder::primary(
            &self.command_alloc,
            self.queue.queue_family_index(),
            CommandBufferUsage::OneTimeSubmit,
        )?;
        download_builder.copy_image_to_buffer(CopyImageToBufferInfo::image_buffer(
            current_image,
            self.v_buffer.clone(),
        ))?;
        download_builder
            .build()?
            .execute(self.queue.clone())?
            .then_signal_fence_and_flush()?
            .wait(None)?;

        // Access the CPU-side buffer
        let v_data = self.v_buffer.read()?;
        let v_target = self
            .v_ndarray
            .as_slice_mut()
            .expect("Failed to access ndarray as slice");
        v_target.copy_from_slice(&v_data);
        Ok(&self.v_ndarray)
    }

    /// Run a simulation step
    ///
    /// The user callback function `step` will be called with the proper
    /// descriptor set for executing the GPU compute
    pub fn update(
        &mut self,
        step: impl FnOnce(Arc<PersistentDescriptorSet>) -> Result<(), Box<dyn Error>>,
    ) -> Result<(), Box<dyn Error>> {
        step(self.descriptor_sets[self.src_is_1 as usize].clone())?;
        self.src_is_1 = !self.src_is_1;
        Ok(())
    }
}
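
The behavior of `update()` can be explored without a GPU using a std-only stand-in, sketched below. `DoubleBuffered` is a hypothetical type in which an arbitrary `T` plays the role of a descriptor set; it only mirrors the flipping logic shown above.

```rust
use std::error::Error;

/// Std-only stand-in for `Concentrations`, where `T` plays the role of
/// a descriptor set (hypothetical type, for illustration only)
struct DoubleBuffered<T> {
    sets: [T; 2],
    src_is_1: bool,
}

impl<T: Clone> DoubleBuffered<T> {
    /// Hand the current set to `step`, then flip only if `step` succeeded
    fn update(
        &mut self,
        step: impl FnOnce(T) -> Result<(), Box<dyn Error>>,
    ) -> Result<(), Box<dyn Error>> {
        // The `?` returns early on error, before the flip below
        step(self.sets[self.src_is_1 as usize].clone())?;
        self.src_is_1 = !self.src_is_1;
        Ok(())
    }
}
```

Note that if the callback fails, the early return leaves `src_is_1` unchanged, so the same set is handed out again on the next call.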

Some sacrifices were made to keep the API similar to the CPU version in this first GPU version:

The Concentrations struct needs to store a fair bit more state than its CPU version, including state that is only remotely related to its purpose like some elements of the Vulkan context.
current_v() needs to build, submit, flush and await a command buffer for a single download command, which is not ideal in terms of CPU/GPU communication efficiency.
current_v() needs to copy the freshly downloaded GPU data into a separate Array2 in order to keep the API return type similar.
Although update() was adapted to propagate GPU errors (which is necessary because almost every Vulkan function can error out), its API design was not otherwise modified to accommodate the asynchronous command buffer based workflow of Vulkan. As a result, we will need to “work around the API” through quite inelegant code later on.

Given these compromises, the minimal set of API changes needed is that…

new() has completely different parameters because it needs access to all the other GPU state, and we definitely don’t want to make Concentrations responsible for managing that state too.
Both current_v() and update() must be adapted to handle the fact that GPU APIs can error out a lot more often than CPU APIs.⁶
The update() callback needs a different signature because GPU code manipulates inputs and outputs very differently with respect to CPU code.

Exercise

Integrate all of the above into the Rust project’s gpu::resources module, then make sure that the code still compiles. We are still not ready for a test run, but certainly getting much closer.

If you are going fast and want a more challenging exercise…

Get a bit closer to a production application by using the device-filtering hook from the GPU context creation function to make sure that the GPU that you automatically select actually does support the image format that you want to use.
Explore the hardware support statistics provided by gpuinfo.org to learn more about what pixel formats are commonly supported and for what kind of API usage.

¹

The real GPU world, not the tiny subset that CUDA and SYCL restrict you to in a vain attempt to make you believe that programming a GPU is just like programming a CPU and porting your apps will be easy.

²

Because it is considered a hardware implementation trade secret. But we can guess that it very likely involves some kind of space-filling curve like the Hilbert curve or the Morton curve.

³

Readers familiar with CUDA or SYCL may wonder why they have never heard of command buffers. As it turns out, those GPU compute APIs were modeled after old graphics APIs like OpenGL, where command batching was implicitly taken care of by the GPU driver through hidden global state. However, decades of experience in using OpenGL have taught us that this approach scales poorly to multicore CPUs and is bad for any kind of application with real-time constraints, as it introduces unpredictable CPU thread slowdowns whenever the driver decides to turn a random command into a command batch submission. All modern graphics APIs have therefore switched to a Vulkan-like explicit command batch model, and even CUDA has semi-recently hacked together a similar abstraction called CUDA graphs.

⁴

Vulkan provides many synchronization primitives, including fences and semaphores. Fences can be used for CPU-GPU synchronization. Semaphores allow for this too but in addition they let you synchronize multiple GPU command queues with each other without round-tripping through the CPU. In exchange for this extra flexibility, semaphores can be expected to be slower at CPU-GPU synchronization.

⁵

This API design does require a fair amount of atomic reference counting under the hood. But compared to the cost of other things we do when interacting with the GPU, the atomic increments of a few Arc clones here and there are not normally a significant cost. So I think that for a GPU API binding, vulkano’s Arc-everywhere design is a reasonable choice.

⁶

We could adjust the CPU API to keep them in sync here by making its corresponding entry points return Result<(), Box<dyn Error>> too. It would just happen that the error case is never hit.
