# Faster updates
At this point, we’re done with the basic simulation implementation work, and it
should hopefully now run for you. Now let’s tune it for better performance,
starting with the `gpu::update()` function.
## Why `update()`?
The `gpu::update()` function is very performance-critical because it is the only
nontrivial function that runs once per compute step, and therefore the only
function whose performance impact cannot be amortized by increasing the number
of compute steps per saved image.
Yet this function currently uses Vulkan command buffers in a highly naive fashion, recording only one compute dispatch per invocation and waiting for it to complete before submitting the next one. As a result, many steps that could be performed once per saved image are currently performed once per compute step:
- Creating a command buffer builder.
- Binding the compute pipeline.
- Binding the simulation parameters.
- Building the command buffer.
- Submitting it to the command queue for execution.
- Flushing the command queue.
- Waiting for the job to complete.
And then there is the work of computing the dispatch size, which is admittedly cheap, but could still be done once per simulation run instead of once per iteration of the hottest program loop.
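To illustrate hoisting that last computation, here is a minimal sketch of a helper that would be called once per simulation run. The helper name and the 2D domain/workgroup shapes are illustrative, not part of the course code:

```rust
/// Hypothetical helper: compute the compute dispatch size once per
/// simulation run, instead of recomputing it on every compute step
fn dispatch_size(domain_shape: [u32; 2], workgroup_shape: [u32; 2]) -> [u32; 3] {
    // Round up so that the workgroups fully cover a domain whose size
    // is not a multiple of the workgroup size
    [
        domain_shape[0].div_ceil(workgroup_shape[0]),
        domain_shape[1].div_ceil(workgroup_shape[1]),
        1,
    ]
}
```

The result can then be passed down to the per-step code as a plain `[u32; 3]` value, as done in the exercise skeleton below.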
## Exercise
Rework the simulation code to replace the current `update()` function with a
multi-update function, the basic idea of which is provided below:
```rust
use crate::gpu::context::CommandBufferAllocator;
use std::{error::Error, sync::Arc};
use vulkano::{
    command_buffer::{AutoCommandBufferBuilder, PrimaryAutoCommandBuffer},
    descriptor_set::PersistentDescriptorSet,
    pipeline::{ComputePipeline, PipelineBindPoint, PipelineLayout},
};

/// Command buffer builder type that we are going to use everywhere
type CommandBufferBuilder = AutoCommandBufferBuilder<
    PrimaryAutoCommandBuffer<Arc<CommandBufferAllocator>>,
    Arc<CommandBufferAllocator>,
>;

/// Add a single simulation update to the command buffer
fn add_update(
    commands: &mut CommandBufferBuilder,
    pipeline_layout: Arc<PipelineLayout>,
    concentrations: Arc<PersistentDescriptorSet>,
    dispatch_size: [u32; 3],
) -> Result<(), Box<dyn Error>> {
    commands
        .bind_descriptor_sets(
            PipelineBindPoint::Compute,
            pipeline_layout,
            IMAGES_SET,
            concentrations,
        )?
        .dispatch(dispatch_size)?;
    Ok(())
}

/// Run multiple simulation steps and wait for them to complete
pub fn run_updates(
    context: &VulkanContext,
    dispatch_size: [u32; 3],
    pipeline: Arc<ComputePipeline>,
    parameters: Arc<PersistentDescriptorSet>,
    concentrations: &mut Concentrations,
    num_steps: usize,
) -> Result<(), Box<dyn Error>> {
    // TODO: Write this code
}
```
…then update the simulation and the microbenchmark, modifying the latter to
test with multiple numbers of compute steps. Don’t forget to adjust
`criterion`’s throughput computation!
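As a sketch of what the adjusted benchmark might look like, here is one possible shape using criterion’s real `Throughput::Elements` mechanism. The group name, the list of step counts, and the commented-out call site are placeholders to be adapted to your code:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

pub fn update_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("updates");
    for num_steps in [1, 8, 64, 512] {
        // Count one throughput "element" per simulation step, so that
        // criterion reports a per-step processing rate
        group.throughput(Throughput::Elements(num_steps as u64));
        group.bench_with_input(
            BenchmarkId::from_parameter(num_steps),
            &num_steps,
            |b, &num_steps| {
                b.iter(|| {
                    // run_updates(&context, dispatch_size, pipeline.clone(),
                    //             parameters.clone(), &mut concentrations,
                    //             num_steps)
                });
            },
        );
    }
    group.finish();
}

criterion_group!(benches, update_benchmark);
criterion_main!(benches);
```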
You should find that this optimization is highly beneficial up to a certain batch size, where it starts becoming detrimental. This can be handled in two different ways:
- From `run_simulation()`, call `run_updates()` multiple times with a maximal number of steps when the number of steps becomes sufficiently large. This is easiest, but a little wasteful, as we need to await the GPU every time when we could instead be building and submitting command buffers in a loop, without waiting for the previous ones to complete until the very end.
- Keep a single `run_updates()` call, but inside of it, build and submit multiple command buffers, and await all of them at the end. Getting there will require you to learn more about `vulkano`’s futures-based synchronization mechanism, as you will want the execution of each command buffer to be scheduled after the previous one completes.
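For the second approach, the chaining could be sketched as follows using vulkano’s `GpuFuture` trait. The `context.device`/`context.queue` field names and the prebuilt `command_buffers` collection are assumptions about the surrounding code:

```rust
use vulkano::sync::{self, GpuFuture};

// Schedule each command buffer's execution after the previous one,
// and only wait for the GPU once, at the very end of the batch
let mut future = sync::now(context.device.clone()).boxed();
for commands in command_buffers {
    future = future
        .then_execute(context.queue.clone(), commands)?
        .boxed();
}
future.then_signal_fence_and_flush()?.wait(None)?;
```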
Overall, since the optimal command buffer size is likely to be hardware-dependent, you will want to make it configurable via command-line arguments, with a good (tuned) default value.