Avoiding copies

After cleaning up the CPU side of the simulation inner loop, let us now look into the outer loop, which generates images and saves them to disk.

Since writing to disk is mostly handled by HDF5, we will first look into the code which generates the output images, on which we have more leverage.

An unnecessary copy

Remember that, to simplify the porting of the code from CPU to GPU, we initially made the GPU version of Concentrations expose a current_v() interface with the same happy-path return type as its CPU counterpart…

pub fn current_v(&mut self) -> Result<&Array2<Float>, Box<dyn Error>> {

…which forced us to engage in some ugly data copying at the end of the function:
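    // ... download the data ...
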

    // Access the CPU-side buffer
    let v_data = self.v_buffer.read()?;
    let v_target = self
        .v_ndarray
        .as_slice_mut()
        .expect("Failed to access ndarray as slice");
    v_target.copy_from_slice(&v_data);
    Ok(&self.v_ndarray)
}

As a trip through a CPU profiler will tell you, we spend most of our CPU time copying data around in our current default configuration, where an output image is emitted every 34 simulation steps. So this copy might be performance-critical.

We could try to speed it up by parallelizing it. But it would be more satisfying and more efficient to get rid of it entirely, along with the associated v_ndarray member of the Concentrations struct. Let’s see what it would take.
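To set the stage, recall roughly what the GPU version of the Concentrations struct looks like. This is only a sketch for orientation: every field name other than v_buffer and v_ndarray is an assumption.

use ndarray::Array2;
use vulkano::buffer::Subbuffer;

pub struct Concentrations {
    // GPU-accessible storage for the chemical species (assumed layout)
    u_buffer: Subbuffer<[Float]>,
    v_buffer: Subbuffer<[Float]>,
    // CPU-side copy of V that exists only so that current_v() can hand
    // out an &Array2. This is the member we would like to get rid of.
    v_ndarray: Array2<Float>,
}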

What’s in a buffer read()?

To access the vulkano buffer into which the simulation results were downloaded, we have used the Subbuffer::read() method. This method exists to support vulkano's goal of extending Rust's memory safety guarantees to GPU-accessible memory.

In Vulkan, both the GPU and the CPU can access the contents of a buffer. But when we are working in Rust, they must do so in a manner that upholds Rust's invariants: at any point in time, either a single code path can access the inner data for writing, or any number of code paths can access it in a read-only fashion. This guarantee is not built into the Vulkan API, so vulkano has to implement it itself, and it opted to do so at run time using a reader-writer lock. Subbuffer::read() is how we acquire this lock for reading on the CPU side.
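Concretely, host-side accesses follow the usual reader-writer lock rules. Here is a minimal sketch of the semantics, assuming a Subbuffer<[Float]> named buffer (this is not code from the simulation):

// Any number of concurrent read locks can be acquired...
let read_guard_1 = buffer.read()?;
let read_guard_2 = buffer.read()?;

// ...but while they are alive, buffer.write() would return an error
// instead of granting exclusive access.
drop((read_guard_1, read_guard_2));

// Once every read guard is gone, exclusive write access becomes possible
let mut write_guard = buffer.write()?;
write_guard.fill(0.0);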

Of course, if we acquire a lock, we must release it at some point. This is done using an RAII type called BufferReadGuard, which lets us access the buffer's data while we hold it and automatically releases the lock when we drop it. Unfortunately, this design means that we cannot just wrap the lock's inner data into an ArrayView2 and return it like this:

use ndarray::ArrayView2;

// ...

pub fn current_v(&mut self) -> Result<ArrayView2<'_, Float>, Box<dyn Error>> {
    // ... download the data ...

    // Return a 2D array view of the freshly downloaded buffer
    let v_data = self.v_buffer.read()?;
    // ERROR: Returning a borrow of a stack variable
    Ok(ArrayView2::from_shape(self.shape(), &v_data[..])?)
}

…because if we were allowed to do this, we would be returning a reference to the memory of v_buffer after the v_data read lock has been dropped. At that point, another thread could trivially start a job that overwrites v_buffer, creating a data race. Which is not what Rust stands for.
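This problem is not specific to vulkano: any guard-based lock behaves the same way, as this minimal standalone reproduction using std's RwLock shows:

use std::sync::RwLock;

// Rejected by the borrow checker for the same reason as current_v() above
fn broken(lock: &RwLock<Vec<f32>>) -> &[f32] {
    let guard = lock.read().expect("Lock was poisoned");
    // ERROR: cannot return value referencing local variable `guard`.
    // The guard is dropped (and the lock released) when this function
    // returns, which would leave the returned slice dangling.
    &guard[..]
}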

If not ArrayView2, what else?

The easiest alternative that we have at our disposal¹ is to return the vulkano buffer's RAII lock object from Concentrations::current_v()…

use vulkano::buffer::BufferReadGuard;

pub fn current_v(&mut self) -> Result<BufferReadGuard<'_, [Float]>, Box<dyn Error>> {
    // ... download the data ...

    // Expose the inner buffer's read lock
    Ok(self.v_buffer.read()?)
}

…and then, in run_simulation(), turn it into an ArrayView2 before sending it to the HDF5 writer:

use ndarray::ArrayView2;

let shape = concentrations.shape();
let data = concentrations.current_v()?;
hdf5.write(ArrayView2::from_shape(shape, &data[..])?)?;

Notice that this compiles because the read guard now lives in run_simulation()'s scope: the ArrayView2 borrows data that remains locked until after the HDF5 write completes. We must then adapt said HDF5 writer so that it accepts an ArrayView2 instead of an &Array2…

use ndarray::ArrayView2;

pub fn write(&mut self, result_v: ArrayView2<Float>) -> hdf5::Result<()> {
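    // Hypothetical body sketch, not given in this section: assuming the
    // writer appends each image as one 2D slice of a growing 3D dataset,
    // with `dataset` and `position` as assumed member names
    self.dataset.write_slice(result_v, (self.position, .., ..))?;
    self.position += 1;
    Ok(())
}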

…and then modify the CPU version of run_simulation() so that it turns the &Array2 that it has into an ArrayView2, which is done with a simple call to the view() method:

hdf5.write(concentrations.current_v().view())?;
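For completeness, the CPU version's accessor can stay as simple as before. A sketch, assuming the CPU struct stores V as an Array2 member named v:

// CPU version: V already lives in host memory, so handing out a borrow
// is infallible and costs nothing (the member name `v` is an assumption)
pub fn current_v(&self) -> &Array2<Float> {
    &self.v
}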

Exercise

Integrate these changes, and measure their effect on runtime performance.

You may notice that your microbenchmarks tell a different story than the running time of the main simulation binary. Can you guess why?


¹ Alternatives with nicer-looking APIs involve creating self-referential objects, which in Rust are a rather advanced topic, to put it mildly.