# From zero to device
In the previous chapters, we have seen how to use Rust to implement an efficient CPU-based simulation of the Gray-Scott reaction. However, these days, CPUs are only half of the story, as most computers also come equipped with at least one GPU.
Originally designed for the low-dimensional linear algebra workloads of real-time computer graphics, GPUs have since been remarketed as general-purpose computing hardware, and successfully sold in large numbers to high performance computing centers worldwide. As a result, if you intend to run your computations on HPC centers, you should be aware that the ability to use a compute node's GPUs is increasingly becoming a mandatory requirement for larger computing resource allocations.
But even if you don't care about HPC centers, it is still good to know how to use a GPU for compute, as it is not uncommon for other people to run your code on gaming- or CAD-oriented hardware that has around 10x more single-precision[^1] floating-point computing power and RAM bandwidth on the GPU side than on the CPU side.
Therefore, in this part of the course, we will learn how Rust code can use almost any available GPU (not just NVidia ones), with minimal end-user installation requirements and good portability across operating systems, by leveraging the Vulkan API.
## Being fair about Vulkan complexity
Like most other APIs designed by the Khronos Group, Vulkan has a reputation for being very verbose, requiring hundreds of lines of code to perform even the simplest tasks. While that statement is not wrong, it is arguably incomplete and misleading:
- Vulkan is specified as a C API. This is necessary for it to be callable from many programming languages, but due to the limitations of its type system, C tends to make patterns like resource management and optional features verbose. This is why it is a good idea to avoid using C APIs directly, and instead prefer higher-level wrappers that leverage the extra capabilities of modern programming languages for ergonomics, like `vulkano` in Rust.
- While a Vulkan "Hello world" is indeed quite long, much of it revolves around one-time initialization work that would not need to be repeated in a larger application. As a consequence, less trivial Vulkan programs will be more comparable in size to programs written against other APIs that have a less flexible initialization process.
- Many lines of the typical Vulkan "Hello world" revolve around things that will not vary much across applications, like enumerating available computing devices and finding out which device you want to use. With a bit of care, these patterns can be extracted into libraries of common abstractions that you can reuse from one application to another.
- Once all this "accidental verbosity" is taken care of, what remains is extra configuration steps that let you control more details of the execution of your application than is possible in other APIs. This control can be leveraged for performance optimizations, enabling well-written Vulkan applications to use the GPU more efficiently than e.g. CUDA or SYCL applications could.
Keeping these facts in mind, and balancing them against the need to fit this course in half a day, this Vulkan course will mimic the CPU course by providing you with lots of pre-written boilerplate code upfront, and heavily guiding you through the remaining work.
As before, we will spend the next few sections explaining how the provided code works. But because Vulkan programming is a fair bit more complex than CPU programming, here it will actually take not just a few sections, but a few chapters.
## Adding the `vulkano` dependency
First of all, we add `vulkano` to our list of dependencies. This is a high-level Vulkan binding that will take care of providing an idiomatic Rust interface to the Vulkan C API, and also provide us with a good default policy for handling particularly unpleasant aspects of Vulkan like functionality enumeration[^2], GPU buffer sub-allocation[^3] and pipeline barriers[^4].
We use the now-familiar `cargo add` command for this…

```bash
cargo add --optional vulkano
```
…but this time we make `vulkano` an optional dependency that is not built by default, so that your CPU builds are not affected by this relatively heavy dependency.
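For reference, this produces a dependency entry along these lines in `Cargo.toml` (the version number is illustrative; `cargo add` will pick the latest available release):

```toml
[dependencies]
vulkano = { version = "0.34", optional = true }
```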
We then add this dependency to a `gpu` optional feature inside of the project's `Cargo.toml` file:
```toml
[features]
gpu = ["dep:vulkano"]
```
This way, when the project is built with the `--features=gpu` cargo option, `vulkano` will be built and linked into our project. More generally, this means that all of our GPU code will need to be guarded so that it is only compiled when this optional feature is enabled. We will later see how that is done.
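As a quick preview, such guarding is done with `cfg` attributes; a minimal sketch, assuming the GPU code lives in a hypothetical `gpu` module, could look like this:

```rust
// Only compiled when the crate is built with --features=gpu
#[cfg(feature = "gpu")]
pub mod gpu;
```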
## Loading the Vulkan library
Now that we have `vulkano` in our dependency tree, we must make it load our operating system's Vulkan implementation. This is harder than it sounds on the implementation side[^2], but `vulkano` makes it easy enough for us:
```rust
use vulkano::VulkanLibrary;

// Try to load the Vulkan library, and handle errors (here by propagating them)
let library = VulkanLibrary::new()?;
```
We can then query the resulting `VulkanLibrary` object in order to learn more about the Vulkan implementation that we are dealing with:
- Which version of the Vulkan specification is supported?
- Are there optional extensions that we can enable? These let us do things like log driver error messages or display images on the screen.
- Are there optional layers that we can enable? These intercept each of our API calls, typically for the purpose of profiling or validating that our usage of Vulkan is correct.
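As an illustration, here is a sketch of how these questions could be answered programmatically, using `VulkanLibrary` accessors from `vulkano` (the printed output will vary from one system to another):

```rust
// Probe the Vulkan implementation that we just loaded
println!("Vulkan API version: {:?}", library.api_version());
println!("Supported instance extensions: {:?}", library.supported_extensions());
for layer in library.layer_properties()? {
    println!("Available layer: {}", layer.name());
}
```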
## Creating an instance
Once we have learned everything that we need to know about the Vulkan implementation of the host system, we proceed to create an `Instance`. This is where we actually initialize the Vulkan library by telling it more about our application and what optional Vulkan functionality it needs.
For this purpose, and many others, `vulkano` uses a pattern of configuration structs with default values. You can keep the defaults for most fields and override only the fields that you care about, using the following syntax:
```rust
use vulkano::instance::InstanceExtensions;

// Let us log Vulkan errors and warnings
let enabled_extensions = InstanceExtensions {
    ext_debug_utils: true,
    ..Default::default()
};
```
By using this pattern on a larger scale, and leveraging `cfg!(debug_assertions)`, which lets us detect whether the program is a debug or release build, we can set up a Vulkan instance that enables minimal debug logging (errors and warnings) in release builds, and more verbose logging in debug builds.
In debug builds, we also enable Vulkan’s validation layer, which instruments every API call to detect many flavors of invalid and inefficient API usage.
```rust
use vulkano::instance::{
    Instance, InstanceCreateInfo, InstanceExtensions,
    debug::{
        DebugUtilsMessengerCallback, DebugUtilsMessengerCreateInfo,
        DebugUtilsMessageSeverity, DebugUtilsMessageType
    },
};

// Basic logging in release builds, more logging in debug builds
let mut message_severity =
    DebugUtilsMessageSeverity::ERROR | DebugUtilsMessageSeverity::WARNING;
if cfg!(debug_assertions) {
    message_severity |=
        DebugUtilsMessageSeverity::INFO | DebugUtilsMessageSeverity::VERBOSE;
}
let mut message_type = DebugUtilsMessageType::GENERAL;
if cfg!(debug_assertions) {
    message_type |=
        DebugUtilsMessageType::VALIDATION | DebugUtilsMessageType::PERFORMANCE;
}

// Logging configuration and callback
let messenger_info = DebugUtilsMessengerCreateInfo {
    message_severity,
    message_type,
    // This is unsafe because we promise not to call the Vulkan API
    // inside of this callback
    ..unsafe {
        DebugUtilsMessengerCreateInfo::user_callback(DebugUtilsMessengerCallback::new(
            |severity, ty, data| {
                eprintln!("[{severity:?} {ty:?}] {}", data.message);
            },
        ))
    }
};

// Set up a Vulkan instance
let instance = Instance::new(
    library,
    InstanceCreateInfo {
        // Enable validation layers in debug builds to catch many Vulkan usage bugs
        enabled_layers: if cfg!(debug_assertions) {
            vec![String::from("VK_LAYER_KHRONOS_validation")]
        } else {
            Vec::new()
        },
        // Enable debug utils extension to let us log Vulkan messages
        enabled_extensions: InstanceExtensions {
            ext_debug_utils: true,
            ..Default::default()
        },
        // Set up a first debug utils messenger that logs Vulkan messages during
        // instance initialization and teardown
        debug_utils_messengers: vec![messenger_info],
        ..Default::default()
    },
)?;
```
### Debug logging after instance creation
The `debug_utils_messengers` field of the `InstanceCreateInfo` struct only affects how Vulkan errors and warnings are logged during the instance creation and teardown process. For reasons known to the Vulkan specification authors alone, a separate `DebugUtilsMessenger` must be configured in order to keep logging messages while the application is running.
Of course, both can use the same configuration, and we can refactor the code accordingly. First we adjust the `InstanceCreateInfo` to clone the `messenger_info` struct instead of moving it away…
```rust
debug_utils_messengers: vec![messenger_info.clone()],
```
…and then we create our separate messenger, like this:
```rust
use vulkano::instance::debug::DebugUtilsMessenger;

let messenger = DebugUtilsMessenger::new(instance.clone(), messenger_info)?;
```
Log messages will only be emitted as long as this object exists, so we will want to keep it around along with the rest of our Vulkan state. We will get back to this by eventually stashing all useful Vulkan state into a single `VulkanContext` struct.
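As a rough preview, such a struct might initially look like the following sketch (the struct name comes from the text above, but the exact field list is an assumption that will grow as we set up more Vulkan state):

```rust
// Hypothetical first draft of the VulkanContext struct
pub struct VulkanContext {
    pub instance: Arc<Instance>,
    // Kept alive so that Vulkan messages keep being logged to our callback
    pub messenger: DebugUtilsMessenger,
}
```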
There is one more matter that we need to take care of, however: logging to stderr is fundamentally incompatible with displaying ASCII art in the terminal, like the `indicatif` progress bars that we have been using so far. If a log is printed while a progress bar is on-screen, it will corrupt the bar's display. `indicatif` provides us with a tool to handle this, in the form of the `ProgressBar::suspend()` method. But our debug utils messenger must be configured to use it, and this configuration will not be needed in builds without progress bars, like our microbenchmarks.
To handle this concern, we pass down a callback to our instance creation function, which receives a string and is in charge of printing it to stderr as appropriate…
```rust
use std::{error::Error, panic::RefUnwindSafe, sync::Arc};
use vulkano::{Validated, VulkanError};

fn create_instance(
    // The various trait bounds are used to assert that it is fine for vulkano
    // to move and use our debug callback anywhere, including on another thread
    debug_println: impl Fn(String) + RefUnwindSafe + Send + Sync + 'static
) -> Result<(Arc<Instance>, DebugUtilsMessenger), Box<dyn Error>> {
    // TODO: Create the instance and its debug messenger
    todo!()
}
```
…and then we use it in our Vulkan logging callback instead of calling `eprintln!()` directly:
```rust
..unsafe {
    DebugUtilsMessengerCreateInfo::user_callback(DebugUtilsMessengerCallback::new(
        move |severity, ty, data| {
            let message = format!("[{severity:?} {ty:?}] {}", data.message);
            debug_println(message);
        },
    ))
}
```
We will provide a version of this callback that directly sends logs to stderr…
```rust
pub fn debug_println_stderr(log: String) {
    eprintln!("{log}");
}
```
…and a recipe to make other callbacks that use the `suspend()` method of an indicatif `ProgressBar` to correctly print out logs when such a progress bar is active:
```rust
use indicatif::ProgressBar;

pub fn make_debug_println_indicatif(
    progress_bar: ProgressBar
) -> impl Fn(String) + RefUnwindSafe + Send + Sync + 'static {
    move |log| progress_bar.suspend(|| eprintln!("{log}"))
}
```
To conclude this tour of instance creation, let's explain the return type of `create_instance()`:

```rust
Result<(Arc<Instance>, DebugUtilsMessenger), Box<dyn Error>>
```
There are two layers there:

- We have a `Result`, which indicates that the function may fail in a manner that the caller can recover from, or report in a customized fashion.
- In case of success, we return an `Instance`, wrapped in an `Arc` atomically reference-counted pointer so that cheap copies can be shared around, including across threads[^5].
- In case of error, we return `Box<dyn Error>`, which is the lazy person's type for "any error can happen, I don't care about enumerating them in the output type".
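As a hypothetical usage example, a caller could handle both layers by propagating errors out of a fallible `main()` with the `?` operator:

```rust
fn main() -> Result<(), Box<dyn Error>> {
    // Keep the messenger alive for as long as Vulkan logging is desired
    let (instance, _messenger) = create_instance(debug_println_stderr)?;
    // ...the rest of the Vulkan setup goes here...
    Ok(())
}
```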
And that’s finally it for Vulkan instance creation!
## Picking a physical device
People who are used to CUDA or SYCL may be surprised to learn that Vulkan works on all modern[^6] GPUs with no other setup work needed than having a working OS driver, and can be easily emulated on CPU for purposes like unit testing in CI.
There is a flip side to this portability, however: it is very common to have multiple Vulkan devices available on a given system, and production-grade code should ideally be able to…

- Tell which of the available devices (if any) match its hardware requirements (e.g. amount of RAM required for execution, desired Vulkan spec version/extensions…).
- If only a single device is to be used[^7], pick the most suitable device among the available options (which may involve various trade-offs like peak performance vs power efficiency).
We will at first do the simplest thing that should work on most machines[^8]: accept all devices, prioritize them by device type (discrete GPU, integrated GPU, CPU-based emulation…) according to expected peak throughput, and pick the first matching device in this priority order.
But as we move forward through the course, we may have to revisit this part of the code by filtering out devices that do not implement certain optional features that we need. Hence we should plan ahead by having a device filtering callback that tells whether each device can be used or not, using the same logic as `Iterator::filter()`.
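For example, a filter that rejects devices lacking a certain Vulkan version could look like this sketch (the 1.1 cutoff is an arbitrary illustration; our first version will simply accept everything):

```rust
use vulkano::Version;

// Reject devices that only implement Vulkan 1.0
let device_filter =
    |device: &PhysicalDevice| device.api_version() >= Version::V1_1;
```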
Finally, we need to decide what will happen if we fail to detect a suitable GPU device. In production-quality code, the right thing to do would be to log a warning and fall back to a CPU-based computation. And if we had such a CPU fallback available, we should probably also ignore CPU emulations of GPU devices, which are likely to be slower. But because our goal here is to learn how to write Vulkan code, we will instead fail with a runtime panic when no GPU or GPU emulation is found, as this will tell us if something is wrong with our device selection callback.
Overall, our physical device selection code looks like this:
```rust
use vulkano::{
    device::physical::{PhysicalDevice, PhysicalDeviceType},
    VulkanError,
};

/// Pick the best physical device that matches our requirements
fn pick_physical_device(
    instance: &Arc<Instance>,
    mut device_filter: impl FnMut(&PhysicalDevice) -> bool,
) -> Result<Arc<PhysicalDevice>, VulkanError> {
    Ok(instance
        .enumerate_physical_devices()?
        .filter(|device| device_filter(device))
        .fold(None, |best_so_far, current| {
            // The first device that comes up is always the best
            let Some(best_so_far) = best_so_far else {
                return Some(current);
            };
            // Compare previous best device type to current device type
            let best_device = match (
                best_so_far.properties().device_type,
                current.properties().device_type,
            ) {
                // Discrete GPU should always be the best performing option
                (PhysicalDeviceType::DiscreteGpu, _) => best_so_far,
                (_, PhysicalDeviceType::DiscreteGpu) => current,
                // Virtual GPU is hopefully a discrete GPU accessed via PCIe passthrough
                (PhysicalDeviceType::VirtualGpu, _) => best_so_far,
                (_, PhysicalDeviceType::VirtualGpu) => current,
                // Integrated GPU is, at least, still a GPU.
                // It will likely be less performant, but more power-efficient.
                // In this basic codebase, we'll only care about performance.
                (PhysicalDeviceType::IntegratedGpu, _) => best_so_far,
                (_, PhysicalDeviceType::IntegratedGpu) => current,
                // CPU emulation is probably going to be pretty bad...
                (PhysicalDeviceType::Cpu, _) => best_so_far,
                (_, PhysicalDeviceType::Cpu) => current,
                // ...but at least we know what we're dealing with, unlike the rest
                (PhysicalDeviceType::Other, _) => best_so_far,
                (_, PhysicalDeviceType::Other) => current,
                (_unknown, _other_unknown) => best_so_far,
            };
            Some(best_device)
        })
        // This part (and the function return type) would change if you wanted
        // to switch to a CPU fallback when no GPU is found.
        .expect("No usable Vulkan device found"))
}
```
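With the simple "accept everything" policy described above, calling this function is a one-liner:

```rust
// Accept all devices for now; the filter may grow stricter in later chapters
let physical_device = pick_physical_device(&instance, |_device| true)?;
```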
## Creating a logical device and command queue
Once we have found a suitable physical device, we need to set up a logical device, which will be used to allocate resources. We will also need to set up one or more command queues, which will be used to submit work. Both of these will be created in a single API transaction.
The process of creating a logical device and associated command queues is very similar to that of creating a Vulkan instance, and exists partially for the same reasons: we need to pick which optional API features we want to enable, at the expense of reducing application portability.
But here there is also a new concern, which is command queue creation. In a nutshell, Vulkan devices support simultaneous submission and processing of commands over multiple independent hardware channels, which is typically used to…
- Overlap graphics rendering with general-purpose computing.
- Overlap PCIe data transfers to and from the GPU with other operations.
Some channels are specialized for a specific type of operation, and may perform it better than other channels do. Unlike most other GPU APIs, Vulkan exposes this hardware feature in the form of queue families, with flags that tell you what each family can do. For example, here are the queue families available on my laptop's AMD Radeon 5600M GPU, as reported by `vulkaninfo`:
```text
VkQueueFamilyProperties:
========================
queueProperties[0]:
-------------------
    minImageTransferGranularity = (1,1,1)
    queueCount                  = 1
    queueFlags                  = QUEUE_GRAPHICS_BIT | QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
    timestampValidBits          = 64
    present support             = true
    VkQueueFamilyGlobalPriorityPropertiesKHR:
    -----------------------------------------
        priorityCount = 4
        priorities: count = 4
            QUEUE_GLOBAL_PRIORITY_LOW_KHR
            QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
            QUEUE_GLOBAL_PRIORITY_HIGH_KHR
            QUEUE_GLOBAL_PRIORITY_REALTIME_KHR
queueProperties[1]:
-------------------
    minImageTransferGranularity = (1,1,1)
    queueCount                  = 4
    queueFlags                  = QUEUE_COMPUTE_BIT | QUEUE_TRANSFER_BIT
    timestampValidBits          = 64
    present support             = true
    VkQueueFamilyGlobalPriorityPropertiesKHR:
    -----------------------------------------
        priorityCount = 4
        priorities: count = 4
            QUEUE_GLOBAL_PRIORITY_LOW_KHR
            QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
            QUEUE_GLOBAL_PRIORITY_HIGH_KHR
            QUEUE_GLOBAL_PRIORITY_REALTIME_KHR
queueProperties[2]:
-------------------
    minImageTransferGranularity = (1,1,1)
    queueCount                  = 1
    queueFlags                  = QUEUE_SPARSE_BINDING_BIT
    timestampValidBits          = 64
    present support             = false
    VkQueueFamilyGlobalPriorityPropertiesKHR:
    -----------------------------------------
        priorityCount = 4
        priorities: count = 4
            QUEUE_GLOBAL_PRIORITY_LOW_KHR
            QUEUE_GLOBAL_PRIORITY_MEDIUM_KHR
            QUEUE_GLOBAL_PRIORITY_HIGH_KHR
            QUEUE_GLOBAL_PRIORITY_REALTIME_KHR
```
As you can see, there is one general-purpose queue family that can do everything, a second queue family that is specialized for asynchronous compute tasks and data transfers running in parallel with the main compute/graphics work, and a third queue family that lets you manipulate sparse memory resources, which is useful for handling resources larger than GPU VRAM less badly.
From these physical queue families, you may then allocate one or more logical command queues that will allow you to submit commands to the matching hardware command processor.
To keep this introductory course simpler, and because the Gray-Scott reaction simulation does not easily lend itself to the parallel execution of compute and data transfer commands[^9], we will not use multiple command queues in the beginning. Instead, we will use a single command queue from the first queue family that supports general-purpose computations and data transfers:
```rust
use vulkano::device::QueueFlags;

// Pick the first queue family that supports compute
let queue_family_index = physical_device
    .queue_family_properties()
    .iter()
    .position(|properties| {
        properties
            .queue_flags
            .contains(QueueFlags::COMPUTE | QueueFlags::TRANSFER)
    })
    .expect("Vulkan spec mandates availability of at least one compute queue");
```
As for optional device features, for now we will…

- Only enable features that are useful for debugging (basically turning out-of-bounds data access UB into well-defined behavior).
- Only do so in debug builds, as these features may come at a significant runtime performance cost.
- Only enable them if supported by the device, ignoring their absence otherwise.
Overall, it looks like this:
```rust
use vulkano::device::Features;

// Enable debug features in debug builds, if supported by the device
let supported_features = physical_device.supported_features();
let enabled_features = if cfg!(debug_assertions) {
    Features {
        robust_buffer_access: supported_features.robust_buffer_access,
        robust_buffer_access2: supported_features.robust_buffer_access2,
        robust_image_access: supported_features.robust_image_access,
        robust_image_access2: supported_features.robust_image_access2,
        ..Default::default()
    }
} else {
    Features::default()
};
```
Now that we know which optional features we want to enable and which command queue we want to use, all that is left to do is to create our logical device and single command queue:
```rust
use vulkano::device::{Device, DeviceCreateInfo, QueueCreateInfo};

// Create our device and command queue
let (device, mut queues) = Device::new(
    physical_device,
    DeviceCreateInfo {
        // Request a single command queue from the previously selected family
        queue_create_infos: vec![QueueCreateInfo {
            queue_family_index: queue_family_index as u32,
            ..Default::default()
        }],
        enabled_features,
        ..Default::default()
    },
)?;
let queue = queues
    .next()
    .expect("We asked for one queue, we should get one");
```
And that concludes our `create_device_and_queue()` function, which is itself the last part of the basic application setup work that we will always need, no matter what we want to do with Vulkan.
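For reference, here is a plausible signature for this function, assembled from the snippets above (the exact signature is an assumption, not code from the course):

```rust
use vulkano::device::Queue;

// Hypothetical signature; the body chains the queue family selection, feature
// selection and Device::new() calls shown above
fn create_device_and_queue(
    physical_device: Arc<PhysicalDevice>,
) -> Result<(Arc<Device>, Arc<Queue>), Box<dyn Error>> {
    // 1. Pick a queue family that supports compute and transfers
    // 2. Select the optional features to enable
    // 3. Call Device::new() and extract the single requested queue
    todo!("assemble the three snippets above")
}
```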
In the next chapter, we will see a few more Vulkan setup steps that are a tiny bit more specific to this application, but could still be reused across many different applications.
[^1]: It is not uncommon for GPUs to run double-precision computations >10x slower than single-precision ones.

[^2]: Vulkan is an API that was designed to evolve over time and be freely extensible by hardware vendors. This means that the set of functions provided by your operating system's Vulkan library is not fully known at compile time. Instead, the list of available function pointers must be queried at runtime, and the absence of certain "newer" or "extended" function pointers must be correctly handled.

[^3]: GPU memory allocators are very slow and come with all sorts of unpleasant limitations, like only allowing a small number of allocations or preventing any other GPU command from running in parallel with the allocator. Therefore, it is good practice to only allocate a few very large blocks of memory from the GPU driver, and use a separate library-based allocator like the Vulkan Memory Allocator to sub-allocate smaller application-requested buffers from these large blocks.

[^4]: In Vulkan, GPUs are directed to do things using batches of commands. By default, these batches of commands are almost completely unordered, and it is legal for the GPU driver to e.g. start processing a command on one compute unit while the previous command is still running on another. Because of this, and because GPU hardware uses incoherent caches, the views of memory of two separate commands may be inconsistent, e.g. a command may not observe the writes to memory that were performed by a previous command in the batch. Pipeline barriers are a Vulkan-provided synchronization primitive that can be used to enforce execution ordering and memory consistency constraints between commands, at a runtime performance cost.

[^5]: `vulkano` makes heavy use of `Arc`, which is the Rust equivalent of C++'s `shared_ptr`. Unlike C++, Rust lets you choose between atomic and non-atomic reference counting, in the form of `Arc` and `Rc` respectively, so that you do not need to pay the cost of atomic operations in single-threaded sections of your code.

[^6]: Vulkan 1.0 is supported by most GPUs released since around 2012-2013. The main exception is Apple GPUs. Due to lack of interest from Apple, who prefer to work on their proprietary Metal API, these GPUs can only support Vulkan via the MoltenVK library, which implements a "portability subset" of Vulkan that cuts off some minor features and thus does not meet the minimal requirements of Vulkan 1.0. Most Vulkan applications, however, will run just fine with the subset of Vulkan implemented by MoltenVK.

[^7]: The barrier to using multiple devices is significantly lower in Vulkan than in most other GPU APIs, because they integrate very nicely into Vulkan's synchronization model, and can be manipulated together as "device groups" since Vulkan 1.1. But writing a correct multi-GPU application is still a fair amount of extra work, and many of us will not have access to a setup that allows them to check that their multi-GPU support actually works. Therefore, I will only cover the usage of a single GPU in this course.

[^8]: Most computers either have a single hardware GPU (discrete, virtual or integrated), a combination of a discrete and an integrated GPU, or multiple discrete GPUs with identical performance characteristics. Therefore, treating all devices of a given type as equal can only cause device selection problems in exotic configurations (e.g. machines mixing discrete GPUs from NVidia and AMD), and is thus normally an acceptable tradeoff in beginner-oriented Vulkan code. The best way to handle more exotic configurations is often not to auto-select GPUs anyway, but to give the end user an option (via CLI parameters, environment variables or config files) to choose which GPU(s) should be used. If they are using such an unusual computer, they probably know enough about their hardware to make an informed choice here.

[^9]: If we keep using the double-buffered design that has served us well so far, then as long as one GPU concentration buffer is in the process of being downloaded to the CPU, we can only run one computation step before needing to wait for that buffer to be available for writing again, and that single overlapping compute step will not save us a lot of time. To do asynchronous downloads more efficiently, we would need to switch to a triple-buffered design, where an on-GPU copy of the current concentration array is first made to a third independent buffer that is not used for compute, and the CPU download is then performed from that buffer. But that would add a fair amount of complexity to the code, and is thus beyond the scope of this introductory course.