From zero to device

In the previous chapters, we have seen how to use Rust to implement an efficient CPU-based simulation of the Gray-Scott reaction. However, these days, CPUs are only half of the story, as most computers also come equipped with one or more GPUs.

Originally designed for the low-dimensional linear algebra workloads of real-time computer graphics, GPUs have since been remarketed as general-purpose computing hardware, and successfully sold in large numbers to high performance computing centers worldwide. As a result, if you intend to run your computations on HPC centers, you should know that the ability to use a compute node’s GPUs is increasingly becoming a requirement for larger computing resource allocations.

But even if you don’t care about HPC centers, it is still good to know how to use a GPU for compute, as it is not uncommon for other people to run your code on gaming- or CAD-oriented hardware that has around 10x more single-precision1 floating-point computing power and RAM bandwidth on the GPU side than on the CPU side.

Therefore, in this part of the course, we will learn how Rust code can be made to use almost any available GPU (not just NVidia ones), with minimal end user installation requirements and good portability across operating systems, by leveraging the Vulkan API.

Being fair about Vulkan complexity

Like most other APIs designed by the Khronos Group, Vulkan has a reputation for being very verbose, requiring hundreds of lines of code in order to perform the simplest tasks. While that statement is not wrong, it is arguably incomplete and misleading:

  • Vulkan is specified as a C API. This is necessary for it to be callable from many programming languages, but due to the limitations of its type system, C tends to make patterns like resource management and optional features verbose. This is why it is a good idea to avoid using C APIs directly, and to instead prefer higher-level language-specific wrappers that leverage the extra capabilities of modern type systems for ergonomics, like vulkano in Rust.
  • While a Vulkan “Hello world” is indeed quite long, much of it revolves around one-time initialization work that would not need to be repeated in a larger application. As a consequence, less trivial Vulkan programs will be more comparable in size to programs written against other APIs that have a less flexible initialization process.
  • Many lines of the typical Vulkan “Hello world” revolve around things that will not vary much across applications, like enumerating available computing devices and finding out which device you want to use. With a bit of care, these patterns can be extracted into libraries of common abstractions that can easily be reused from one application to another.
  • Once all this “accidental verbosity” is taken care of, what remains is extra configuration steps that let you control more details of the execution of your application than is possible in other APIs. This control can be leveraged for performance optimizations, enabling well-written Vulkan applications to use the GPU more efficiently than e.g. CUDA or SYCL applications could.

Keeping these facts in mind, and balancing them against the need to fit this course in half a day without making you run away as new learners, this GPU course will mimic the CPU course by providing you with lots of pre-written boilerplate code upfront.

As before, we will spend the next few sections explaining how the provided code works. But because Vulkan programming is a fair bit more complex than CPU programming, here it will actually take not just a few sections, but a few chapters.

Adding the vulkano dependency

First of all, we add vulkano to our list of dependencies. This is a high-level Vulkan binding that will take care of providing an idiomatic Rust interface to the Vulkan C API, and also provide us with a good default policy for handling particularly unpleasant aspects of Vulkan like functionality enumeration2, GPU buffer sub-allocation3 and pipeline barriers4.

We started by adding the library to our dependencies, as is now customary…

cargo add --optional vulkano

…but this time we made it an optional dependency, which is not built by default, so that your CPU builds are not affected by this relatively heavy dependency.
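For reference, this roughly translates to the following entry in Cargo.toml (the version number shown here is only illustrative, cargo will pick the latest available one):

[dependencies]
vulkano = { version = "0.34", optional = true }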

We then added this dependency to a gpu optional feature inside of the project’s Cargo.toml file:

[features]
gpu = ["dep:vulkano"]

This way, when the project is built with the --features=gpu cargo option, vulkano will be built and linked into our project.

This means that all of our GPU code will more generally need to be guarded so that it is only compiled when this optional feature is enabled. We will later see how that is done throughout the codebase, but the basic idea is sketched below.
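As a minimal preview (the gpu module name is a placeholder), such guards are conditional compilation attributes on the GPU-only items:

// Only compile the GPU-specific module when the "gpu" feature is enabled
#[cfg(feature = "gpu")]
pub mod gpu;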

Loading the Vulkan library

Now that we have vulkano in our dependency tree, we must make it load our operating system’s Vulkan implementation. This is harder than it sounds on the implementation side2, but vulkano makes it easy enough for us:

use vulkano::VulkanLibrary;

// Try to load the Vulkan library, and handle errors (here by propagating them)
let library = VulkanLibrary::new()?;

We can then query the resulting VulkanLibrary object in order to learn more about the Vulkan implementation that we are dealing with (see the example queries after this list):

  • Which version of the Vulkan specification is supported?
  • Are there optional extensions that we can enable? These enable us to do things like log driver error messages or display images on the screen.
  • Are there optional layers that we can enable? These intercept each of our API calls, typically for profiling purposes or to validate that our usage of Vulkan is correct.
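As a sketch of what such queries can look like, using vulkano’s api_version(), supported_extensions() and layer_properties() accessors:

// Inspect the Vulkan implementation that we just loaded
println!("Supported Vulkan version: {:?}", library.api_version());
println!("Supported extensions: {:?}", library.supported_extensions());
for layer in library.layer_properties()? {
    println!("Available layer: {}", layer.name());
}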

Creating an instance

Once we have learned everything that we need to know about the Vulkan implementation of the host system, we proceed to create an Instance. This is where we actually initialize the Vulkan library by telling it more about our application and what optional Vulkan functionality it needs.

For this purpose, like for many others, vulkano uses a pattern of configuration structs with a default value. You can keep the defaults for most fields and override only the fields that you care about, using the following struct update syntax:

use vulkano::instance::InstanceExtensions;

// Let us log Vulkan errors and warnings
let enabled_extensions = InstanceExtensions {
    ext_debug_utils: true,
    ..Default::default()
};

By using this pattern on a larger scale, we can set up a Vulkan instance that enables minimal debugging functionality (error and warning logging) in release builds, and a more extensive set of debug functionality in debug builds.

use vulkano::instance::{
    Instance, InstanceCreateInfo, InstanceExtensions,
    debug::{
        DebugUtilsMessengerCallback, DebugUtilsMessengerCreateInfo,
        DebugUtilsMessageSeverity, DebugUtilsMessageType
    },
};

// Basic logging in release builds, more logging in debug builds
let mut message_severity =
    DebugUtilsMessageSeverity::ERROR | DebugUtilsMessageSeverity::WARNING;
if cfg!(debug_assertions) {
    message_severity |=
        DebugUtilsMessageSeverity::INFO | DebugUtilsMessageSeverity::VERBOSE;
}
let mut message_type = DebugUtilsMessageType::GENERAL;
if cfg!(debug_assertions) {
    message_type |=
        DebugUtilsMessageType::VALIDATION | DebugUtilsMessageType::PERFORMANCE;
}

// Logging configuration and callback
let messenger_info = DebugUtilsMessengerCreateInfo {
    message_severity,
    message_type,
    // This is unsafe because we promise not to call the Vulkan API
    // inside of this callback
    ..unsafe {
        DebugUtilsMessengerCreateInfo::user_callback(DebugUtilsMessengerCallback::new(
            |severity, ty, data| {
                eprintln!("[{severity:?} {ty:?}] {}", data.message);
            },
        ))
    }
};

// Set up a Vulkan instance
let instance = Instance::new(
    library,
    InstanceCreateInfo {
        // Enable validation layers in debug builds to catch many Vulkan usage bugs
        enabled_layers: if cfg!(debug_assertions) {
            vec![String::from("VK_LAYER_KHRONOS_validation")]
        } else {
            Vec::new()
        },
        // Enable debug utils extension to let us log Vulkan messages
        enabled_extensions: InstanceExtensions {
            ext_debug_utils: true,
            ..Default::default()
        },
        // Set up a first debug utils messenger that logs Vulkan messages during
        // instance initialization and teardown
        debug_utils_messengers: vec![messenger_info],
        ..Default::default()
    }
)?;

Debug logging after instance creation

The debug_utils_messengers field of the InstanceCreateInfo struct only affects how Vulkan errors and warnings are going to be logged during the instance creation and teardown process. For reasons known to the Vulkan specification authors alone, a separate DebugUtilsMessenger must be configured in order to keep logging messages while the application is running.

Of course, they can both use the same configuration, and we can refactor the code accordingly. First we adjust the InstanceCreateInfo to clone the messenger_info struct instead of moving it away…

debug_utils_messengers: vec![messenger_info.clone()],

…and then we create our separate messenger, like this:

use vulkano::instance::debug::DebugUtilsMessenger;

let messenger = DebugUtilsMessenger::new(instance.clone(), messenger_info)?;

Log messages will only be emitted as long as this object exists, so we will want to keep it around along with other Vulkan state. We’ll get back to this by eventually stashing all useful Vulkan state into a single VulkanContext struct.
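As a preview, such a struct might start out like this (a sketch: only the state discussed so far, and the exact layout is an assumption):

use std::sync::Arc;
use vulkano::instance::{debug::DebugUtilsMessenger, Instance};

pub struct VulkanContext {
    pub instance: Arc<Instance>,
    // Never read, but must be kept alive for log messages to keep being emitted
    pub _messenger: DebugUtilsMessenger,
    // ...device, queue, and friends will be added in later sections
}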


There is one more matter that we need to take care of, however: logging to stderr is fundamentally incompatible with displaying ASCII art in the terminal, like the indicatif progress bars that we have been using so far. If a log message is printed while a progress bar is on screen, it will corrupt the bar’s display.

indicatif provides us with a tool to handle this, in the form of the ProgressBar::suspend() method. But our debug utils messenger must be configured to use it, even though this will not be needed in builds without progress bars, such as our microbenchmarks.

To handle this concern, we pass down a callback to our instance creation function, which receives a string and is in charge of printing it in whatever manner is appropriate for the current binary…

use std::{error::Error, panic::RefUnwindSafe, sync::Arc};
use vulkano::{Validated, VulkanError};

fn create_instance(
    // The various trait bounds are used to assert that it is fine for vulkano
    // to move and use our debug callback anywhere, including on another thread
    debug_println: impl Fn(String) + RefUnwindSafe + Send + Sync + 'static
) -> Result<(Arc<Instance>, DebugUtilsMessenger), Box<dyn Error>> {
    // TODO: Create the instance and its debug messenger
}

…and we use it in our Vulkan logging callback instead of calling eprintln!() directly:

..unsafe {
    DebugUtilsMessengerCreateInfo::user_callback(DebugUtilsMessengerCallback::new(
        move |severity, ty, data| {
            let message = format!("[{severity:?} {ty:?}] {}", data.message);
            debug_println(message);
        },
    ))
}

We will provide a version of this callback that directly sends logs to stderr…

pub fn debug_println_stderr(log: String) {
    eprintln!("{log}");
}

…and a recipe to make other callbacks that use the suspend method of an indicatif ProgressBar to correctly print out logs when such a progress bar is active:

use indicatif::ProgressBar;

pub fn make_debug_println_indicatif(
    progress_bar: ProgressBar
) -> impl Fn(String) + RefUnwindSafe + Send + Sync + 'static {
    move |log| progress_bar.suspend(|| eprintln!("{log}"))
}
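For example, a simulation binary with a progress bar might set things up like this (a hypothetical usage sketch):

use indicatif::ProgressBar;

// Route Vulkan log messages through the progress bar's suspend mechanism
let progress_bar = ProgressBar::new(1000);
let (instance, _messenger) =
    create_instance(make_debug_println_indicatif(progress_bar.clone()))?;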

To conclude this instance creation tour, let’s explain the return type of create_instance():

Result<(Arc<Instance>, DebugUtilsMessenger), Box<dyn Error>>

There are two layers there:

  • We have a Result, which indicates that the function may fail in a manner that the caller can recover from, or report in a customized fashion.
  • In case of success, we return the Instance, wrapped in an Arc (Atomically Reference-Counted) pointer so that cheap copies can be shared around, including across threads5, along with the DebugUtilsMessenger that must be kept alive for log messages to keep being emitted.
  • In case of error, we return Box<dyn Error>, which is the lazy person’s type for “any error can happen, I don’t care about enumerating them in the output type”.
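To make this concrete, here is a hypothetical caller sketch (the setup_vulkan function name is made up for illustration):

fn setup_vulkan() -> Result<(), Box<dyn Error>> {
    // `?` propagates any failure as a Box<dyn Error>; binding the messenger
    // keeps Vulkan logging alive for as long as it stays in scope
    let (instance, _messenger) = create_instance(debug_println_stderr)?;
    // Arc makes copies cheap: this clones a pointer, not the whole instance
    let _shared_instance = Arc::clone(&instance);
    Ok(())
}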

And that’s finally it for Vulkan instance creation!

Picking a physical device

People who are used to CUDA or SYCL may be surprised to learn that Vulkan works on all modern6 GPUs with no other setup work needed than having a working OS driver, and can be easily emulated on CPU for purposes like unit testing in CI.

There is a flip side to this portability, however, which is that it is very common to have multiple Vulkan devices available on a given system, and production-grade code should ideally be able to…

  • Tell which of the available devices (if any) match its hardware requirements (e.g. amount of RAM required for execution, desired Vulkan spec version/extensions…).
  • If only a single device is to be used7, pick the most suitable device amongst the available options (which may involve various trade-offs like peak performance vs power efficiency).

We will at first do the simplest thing that should work on most machines8: accept all devices, prioritize them by type (discrete GPU, integrated GPU, CPU-based emulation…) according to expected peak throughput, and pick the best available device according to this priority order.

But as we move forward through the course, we may have to revisit this part of the code by filtering out devices that do not implement certain optional features that we need. Hence we should plan ahead by having a device filtering callback that tells whether each device can be used or not, using the same logic as Iterator::filter().
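For example, a hypothetical filter that only accepts devices implementing Vulkan 1.2 or newer could look like this:

use vulkano::device::physical::PhysicalDevice;

// Only accept devices whose driver implements Vulkan >= 1.2
let device_filter = |device: &PhysicalDevice| {
    device.properties().api_version >= vulkano::Version::V1_2
};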

Finally, we need to decide what will happen if we fail to detect a suitable GPU device. In production-quality code, the right thing to do would be to log a warning and fall back to a CPU-based computation. And if we have such a CPU fallback available, we should probably also ignore CPU emulations of GPU devices, which are likely to be slower than the fallback. But because our goal here is to learn how to write Vulkan code, we will instead fail with a runtime panic when no GPU or GPU emulation is found, as this will tell us when something is wrong with our device selection callback.

Overall, our physical device selection code looks like this:

use vulkano::{
    device::physical::{PhysicalDevice, PhysicalDeviceType},
    VulkanError,
};

/// Pick the best physical device that matches our requirements
fn pick_physical_device(
    instance: &Arc<Instance>,
    mut device_filter: impl FnMut(&PhysicalDevice) -> bool,
) -> Result<Arc<PhysicalDevice>, VulkanError> {
    Ok(instance
        .enumerate_physical_devices()?
        .filter(|device| device_filter(&device))
        .fold(None, |best_so_far, current| {
            // The first device that comes up is always the best
            let Some(best_so_far) = best_so_far else {
                return Some(current);
            };

            // Compare previous best device type to current device type
            let best_device = match (
                best_so_far.properties().device_type,
                current.properties().device_type,
            ) {
                // Discrete GPU should always be the best performing option
                (PhysicalDeviceType::DiscreteGpu, _) => best_so_far,
                (_, PhysicalDeviceType::DiscreteGpu) => current,
                // Virtual GPU is hopefully a discrete GPU accessed via PCIe passthrough
                (PhysicalDeviceType::VirtualGpu, _) => best_so_far,
                (_, PhysicalDeviceType::VirtualGpu) => current,
                // Integrated GPU is, at least, still a GPU.
                // It will likely be less performant, but more power-efficient.
                // In this basic codebase, we'll only care about performance.
                (PhysicalDeviceType::IntegratedGpu, _) => best_so_far,
                (_, PhysicalDeviceType::IntegratedGpu) => current,
                // CPU emulation is probably going to be pretty bad...
                (PhysicalDeviceType::Cpu, _) => best_so_far,
                (_, PhysicalDeviceType::Cpu) => current,
                // ...but at least we know what we're dealing with, unlike the rest
                (PhysicalDeviceType::Other, _) => best_so_far,
                (_, PhysicalDeviceType::Other) => current,
                (_unknown, _other_unknown) => best_so_far,
            };
            Some(best_device)
        })
        // This part (and the function return type) would change if you wanted
        // to switch to a CPU fallback when no GPU is found.
        .expect("No usable Vulkan device found"))
}
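And since we accept all devices for now, as discussed above, the call site is simply:

// Accept every device; we will add real filtering criteria later
let physical_device = pick_physical_device(&instance, |_device| true)?;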

Creating a logical device and command queue

Once we have found a suitable physical device, we need to set up a logical device, which will be used to allocate resources. We will also need to set up one or more command queues, which will be used to submit work. Both of these will be created in a single API transaction.

The process of creating a logical device and associated command queues is very similar to that of creating a Vulkan instance, and exists partially for the same reasons: we need to pick which optional API features we want to enable, at the expense of reducing application portability.

But here there is also a new concern, which is command queue creation. In a nutshell, Vulkan devices support simultaneous submission and processing of commands over multiple independent hardware channels, which is typically used to…

  • Overlap graphics rendering with general-purpose computing.
  • Overlap PCIe data transfers to and from the GPU with other operations.

Some channels are specialized for specific types of operations, and may perform them better than other channels. Unlike most other GPU APIs, Vulkan exposes this hardware feature in the form of queue families with flags that tell you what each family can do. For example, here are the queue families available on my laptop’s AMD Radeon 5600M GPU, as reported by vulkaninfo:

VkQueueFamilyProperties:
========================
    queueProperties[0]:
    -------------------
        minImageTransferGranularity = (1,1,1)
        queueCount                  = 1
        queueFlags                  = QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
        timestampValidBits          = 64
        present support             = true
        VkQueueFamilyGlobalPriorityPropertiesKHR:
        -----------------------------------------
            priorityCount  = 4
            priorities: count = 4
                QUEUE_GLOBAL_PRIORITY_LOW_KHR
                QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
                QUEUE_GLOBAL_PRIORITY_HIGH_KHR
                QUEUE_GLOBAL_PRIORITY_REALTIME_KHR


    queueProperties[1]:
    -------------------
        minImageTransferGranularity = (1,1,1)
        queueCount                  = 4
        queueFlags                  = QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
        timestampValidBits          = 64
        present support             = true
        VkQueueFamilyGlobalPriorityPropertiesKHR:
        -----------------------------------------
            priorityCount  = 4
            priorities: count = 4
                QUEUE_GLOBAL_PRIORITY_LOW_KHR
                QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
                QUEUE_GLOBAL_PRIORITY_HIGH_KHR
                QUEUE_GLOBAL_PRIORITY_REALTIME_KHR


    queueProperties[2]:
    -------------------
        minImageTransferGranularity = (1,1,1)
        queueCount                  = 1
        queueFlags                  = QUEUE_SPARSE_BINDING_BIT
        timestampValidBits          = 64
        present support             = false
        VkQueueFamilyGlobalPriorityPropertiesKHR:
        -----------------------------------------
            priorityCount  = 4
            priorities: count = 4
                QUEUE_GLOBAL_PRIORITY_LOW_KHR
                QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
                QUEUE_GLOBAL_PRIORITY_HIGH_KHR
                QUEUE_GLOBAL_PRIORITY_REALTIME_KHR

As you can see, there is one general-purpose queue family that can do everything, another queue family that is specialized for asynchronous compute tasks and data transfers running in parallel with the main compute/graphics work, and a third queue family that lets you manipulate sparse memory resources for the purpose of handling resources larger than GPU VRAM less badly.
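If you would rather perform this kind of inspection from Rust than with vulkaninfo, a small sketch along these lines will print the same queue family information:

// Print each queue family's capabilities and number of available queues
for (index, properties) in physical_device
    .queue_family_properties()
    .iter()
    .enumerate()
{
    println!(
        "Queue family #{index}: {:?}, {} queue(s)",
        properties.queue_flags, properties.queue_count
    );
}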

From these physical queue families, you may then allocate one or more logical command queues that will allow you to submit commands to the matching hardware command processor.

To keep this introductory course simpler, and because the Gray-Scott reaction simulation does not easily lend itself to the parallel execution of compute and data transfer commands9, we will not use multiple command queues. Instead, we will always use a single command queue, allocated from the first queue family that supports general-purpose computations and data transfers:

use vulkano::device::QueueFlags;

// Pick the first queue family that supports compute
let queue_family_index = physical_device
    .queue_family_properties()
    .iter()
    .position(|properties| {
        properties
            .queue_flags
            .contains(QueueFlags::COMPUTE | QueueFlags::TRANSFER)
    })
    .expect("Vulkan spec mandates availability of at least one compute queue");

As for optional device features, for now we will…

  • Only enable features that are useful for debugging (basically turning out-of-bounds data access UB into well-defined behavior).
  • Only do so in debug builds, as these features may come at a runtime performance cost.
  • Only enable them if they are supported by the device, and silently do without them otherwise.

Overall, it looks like this:

use vulkano::device::Features;

// Enable debug features in debug builds, if supported by the device
let supported_features = physical_device.supported_features();
let enabled_features = if cfg!(debug_assertions) {
    Features {
        robust_buffer_access: supported_features.robust_buffer_access,
        robust_buffer_access2: supported_features.robust_buffer_access2,
        robust_image_access: supported_features.robust_image_access,
        robust_image_access2: supported_features.robust_image_access2,
        ..Default::default()
    }
} else {
    Features::default()
};

Now that we know which optional features we want to enable and which command queues we want to use, all that is left to do is to create our logical device and its single command queue:

use vulkano::device::{Device, DeviceCreateInfo, QueueCreateInfo};

// Create our device and command queue
let (device, mut queues) = Device::new(
    physical_device,
    DeviceCreateInfo {
        // Request a single command queue from the previously selected family
        queue_create_infos: vec![QueueCreateInfo {
            queue_family_index: queue_family_index as u32,
            ..Default::default()
        }],
        enabled_features,
        ..Default::default()
    },
)?;
let queue = queues
    .next()
    .expect("We asked for one queue, we should get one");

And that concludes our create_device_and_queue() function, which itself is the last part of the basic application setup work that we will always need no matter what we want to do with Vulkan.
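For reference, here is one way to assemble the snippets above into a complete function. This is a sketch: the exact signature and error handling are assumptions, not necessarily the shape of the provided course code.

use std::{error::Error, sync::Arc};
use vulkano::device::{
    physical::PhysicalDevice, Device, DeviceCreateInfo, Features, Queue,
    QueueCreateInfo, QueueFlags,
};

/// Create a logical device and a single command queue (hypothetical assembly)
fn create_device_and_queue(
    physical_device: Arc<PhysicalDevice>,
) -> Result<(Arc<Device>, Arc<Queue>), Box<dyn Error>> {
    // Pick the first queue family that supports compute and transfers
    let queue_family_index = physical_device
        .queue_family_properties()
        .iter()
        .position(|properties| {
            properties
                .queue_flags
                .contains(QueueFlags::COMPUTE | QueueFlags::TRANSFER)
        })
        .expect("Vulkan spec mandates availability of at least one compute queue");

    // Enable debug features in debug builds, if supported by the device
    let supported_features = physical_device.supported_features();
    let enabled_features = if cfg!(debug_assertions) {
        Features {
            robust_buffer_access: supported_features.robust_buffer_access,
            robust_buffer_access2: supported_features.robust_buffer_access2,
            robust_image_access: supported_features.robust_image_access,
            robust_image_access2: supported_features.robust_image_access2,
            ..Default::default()
        }
    } else {
        Features::default()
    };

    // Create the logical device and extract our single command queue
    let (device, mut queues) = Device::new(
        physical_device,
        DeviceCreateInfo {
            queue_create_infos: vec![QueueCreateInfo {
                queue_family_index: queue_family_index as u32,
                ..Default::default()
            }],
            enabled_features,
            ..Default::default()
        },
    )?;
    let queue = queues
        .next()
        .expect("We asked for one queue, we should get one");
    Ok((device, queue))
}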

In the next chapter, we will see a few more Vulkan setup steps that are a tiny bit more specific to this application, but could still be reused across many different applications.


1. It is not uncommon for GPUs to run double-precision computations >10x slower than single-precision ones.

2. Vulkan is an API that was designed to evolve over time and to be freely extensible by hardware vendors. This means that the set of functions provided by your operating system’s Vulkan library is not fully known at compile time. Instead, the list of available function pointers must be queried at runtime, and the absence of certain “newer” or “extended” function pointers must be correctly handled.

3. GPU memory allocators are very slow and come with all sorts of unpleasant limitations, like only allowing a small number of allocations or preventing any other GPU command from running in parallel with the allocator. Therefore, it is good practice to only allocate a few very large blocks of memory from the GPU driver, and to use a separate library-based allocator like the Vulkan Memory Allocator to sub-allocate smaller application-requested buffers from these large blocks.

4. In Vulkan, GPUs are directed to do things using batches of commands. By default, these batches of commands are almost completely unordered, and it is legal for the GPU driver to e.g. start processing a command on a compute unit while the previous command is still running on another. Because of this, and because GPU hardware uses incoherent caches, the view of memory of two separate commands may be inconsistent, e.g. a command may not observe the writes to memory that were performed by a previous command in the batch. Pipeline barriers are a Vulkan-provided synchronization primitive that can be used to enforce execution ordering and memory consistency constraints between commands, at a runtime performance cost.

5. vulkano makes heavy use of Arc, which is the Rust equivalent of C++’s shared_ptr. Unlike C++, Rust lets you choose between atomic and non-atomic reference counting, in the form of Arc and Rc respectively, so that you do not need to pay the cost of atomic operations in single-threaded sections of your code.

6. Vulkan 1.0 is supported by most GPUs starting from around 2012-2013. The main exception is Apple GPUs. Due to lack of interest from Apple, who prefer to work on their proprietary Metal API, these GPUs can only support Vulkan via the MoltenVk library, which implements a “portability subset” of Vulkan that cuts off some minor features and thus does not meet the minimal requirements of Vulkan 1.0. Most Vulkan applications, however, will run just fine with the subset of Vulkan implemented by MoltenVk.

7. The barrier to using multiple devices is significantly lower in Vulkan than in most other GPU APIs, because they integrate very nicely into Vulkan’s synchronization model and can be manipulated together as “device groups” since Vulkan 1.1. But writing a correct multi-GPU application is still a fair amount of extra work, and many of us will not have access to a setup that would allow us to check that our multi-GPU support actually works. Therefore, I will only cover usage of a single GPU during this course.

8. Most computers either have a single hardware GPU (discrete, virtual or integrated), a combination of a discrete and an integrated GPU, or multiple discrete GPUs with identical performance characteristics. Therefore, treating all devices of a single type as equal can only cause device selection problems in exotic configurations (e.g. machines mixing discrete GPUs from NVidia and AMD), and is thus normally an acceptable tradeoff in beginner-oriented Vulkan code. The best way to handle more exotic configurations is often not to auto-select GPUs anyway, but to give the end user an option (via CLI parameters, environment variables or config files) to choose which GPU(s) should be used. If they are using such an unusual computer, they probably know enough about their hardware to make an informed choice here.

9. If we keep using the double-buffered design that has served us well so far, then as long as one GPU concentration buffer is in the process of being downloaded to the CPU, we can only run one computation step before needing to wait for that buffer to be available for writing, and that single overlapping compute step will not save us a lot of time. To do asynchronous downloads more efficiently, we would need to switch to a triple-buffered design, where an on-GPU copy of the current concentration array is first made to a third independent buffer that is not used for compute, and then the CPU download is performed from that buffer. But that adds a fair amount of complexity to the code, and is thus beyond the scope of this introductory course.