Avoiding copies
After cleaning up the CPU side of the simulation inner loop, let us now look into the outer loop, which generates images and saves them to disk.
Since writing to disk is mostly handled by HDF5, we will first look into the code which generates the output images, on which we have more leverage.
An unnecessary copy
Remember that to simplify the porting of the code from CPU to GPU, we initially
made the GPU version of Concentrations expose a current_v() interface with the
same happy-path return type as its CPU counterpart…
pub fn current_v(&mut self) -> Result<&Array2<Float>, Box<dyn Error>> {
…which forced us to engage in some ugly data copying at the end of the function:
    // Access the CPU-side buffer
    let v_data = self.v_buffer.read()?;
    let v_target = self
        .v_ndarray
        .as_slice_mut()
        .expect("Failed to access ndarray as slice");
    v_target.copy_from_slice(&v_data);
    Ok(&self.v_ndarray)
}
As a trip through a CPU profiler will tell you, we spend most of our CPU time copying data around in our current default configuration, where an output image is emitted every 34 simulation steps. So this copy might be performance-critical.
We could try to speed it up by parallelizing it. But it would be more satisfying
and more efficient to get rid of it entirely, along with the associated
v_ndarray
member of the Concentrations
struct. Let’s see what it would take.
What’s in a buffer read()?
To access the vulkano buffer into which the simulation results were downloaded,
we have used the Subbuffer::read() method. This method, in turn, exists to
support vulkano’s goal of extending Rust’s memory safety guarantees to
GPU-accessible memory.
In Vulkan, both the GPU and the CPU can access the contents of a buffer. But
when we are working in Rust, they have to do so in a manner that upholds Rust’s
invariants: at any point in time, either a single code path can access the
inner data for writing, or any number of code paths can access it in a
read-only fashion. This guarantee is not built into the Vulkan API, so vulkano
has to implement it, and it opted to do so at run time using a reader-writer
lock. Subbuffer::read() is how we acquire this lock for reading on the CPU
side.
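The semantics are analogous to the standard library’s reader-writer lock. The sketch below uses std::sync::RwLock as a stand-in for vulkano’s run-time borrow checking; the SharedBuffer type and its methods are illustrative inventions, not vulkano’s actual implementation:

```rust
use std::sync::RwLock;

// Stand-in for a CPU-accessible GPU buffer: a flat buffer behind a
// reader-writer lock (analogy only; vulkano's internals differ)
struct SharedBuffer {
    data: RwLock<Vec<f32>>,
}

impl SharedBuffer {
    fn new(len: usize) -> Self {
        Self { data: RwLock::new(vec![0.0; len]) }
    }

    // Like Subbuffer::read(): acquire a read guard; any number of
    // readers may coexist while no writer holds the lock
    fn read_sum(&self) -> f32 {
        let guard = self.data.read().unwrap();
        guard.iter().sum()
    }

    // Like a GPU-side write: requires exclusive access
    fn fill(&self, value: f32) {
        let mut guard = self.data.write().unwrap();
        guard.iter_mut().for_each(|x| *x = value);
    }
}

fn main() {
    let buf = SharedBuffer::new(4);
    buf.fill(2.5);
    assert_eq!(buf.read_sum(), 10.0);
    println!("sum = {}", buf.read_sum());
}
```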
Of course, if we acquire a lock, we must release it at some point. This is done
using an RAII type called BufferReadGuard, which lets us access the buffer’s
data while we hold it, and automatically releases the lock when we drop it.
Unfortunately, this design means that we cannot just wrap the lock’s inner
data into an ArrayView2 and return it like this:
use ndarray::ArrayView2;

// ...

pub fn current_v(&mut self) -> Result<ArrayView2<'_, Float>, Box<dyn Error>> {
    // ... download the data ...

    // Return a 2D array view of the freshly downloaded buffer
    let v_data = self.v_buffer.read()?;

    // ERROR: Returning a borrow of a stack variable
    Ok(ArrayView2::from_shape(self.shape(), &v_data[..])?)
}
…because if we were allowed to do it, we would be returning a reference to the
memory of v_buffer after the v_data read lock has been dropped, and then
another thread could trivially start a job that overwrites v_buffer and create
a data race. Which is not what Rust stands for.
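The standard-library analogue of the problem, and of the fix that follows, is to return the guard itself instead of a reference derived from it: the lock then stays held for as long as the caller keeps the guard alive. The Sim type and field names below are illustrative, not from the actual codebase:

```rust
use std::sync::{RwLock, RwLockReadGuard};

// Illustrative stand-in for the GPU Concentrations struct
struct Sim {
    v_buffer: RwLock<Vec<f32>>,
}

impl Sim {
    // Returning the RAII guard (rather than a &[f32] borrowed through
    // it) means the read lock is only released when the caller drops
    // the guard, so no dangling reference can escape
    fn current_v(&self) -> RwLockReadGuard<'_, Vec<f32>> {
        self.v_buffer.read().unwrap()
    }
}

fn main() {
    let sim = Sim { v_buffer: RwLock::new(vec![1.0, 2.0, 3.0]) };
    let data = sim.current_v();
    assert_eq!(data.len(), 3);
    // The read lock is released here, when `data` goes out of scope
}
```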
If not ArrayView2, what else?
The easiest alternative that we have at our disposal¹ is to return the Vulkan
buffer RAII lock object from Concentration::current_v()…
use vulkano::buffer::BufferReadGuard;
pub fn current_v(&mut self) -> Result<BufferReadGuard<'_, [Float]>, Box<dyn Error>> {
    // ... download the data ...

    // Expose the inner buffer's read lock
    Ok(self.v_buffer.read()?)
}
…and then, in run_simulation()
, turn it into an ArrayView2
before sending
it to the HDF5 writer:
use ndarray::ArrayView2;
let shape = concentrations.shape();
let data = concentrations.current_v()?;
hdf5.write(ArrayView2::from_shape(shape, &data[..])?)?;
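Note that this conversion copies nothing: conceptually, ArrayView2::from_shape reinterprets the flat slice as a 2D array using row-major index arithmetic, after checking that the slice length matches the shape. A dependency-free sketch of that index mapping (illustrative, not ndarray’s actual code):

```rust
// Row-major (C-order, ndarray's default) mapping from 2D coordinates
// to an index into the flat buffer
fn flat_index(row: usize, col: usize, ncols: usize) -> usize {
    row * ncols + col
}

fn main() {
    // A 2x3 "array" stored as a flat buffer, like our Vulkan buffer
    let (nrows, ncols) = (2, 3);
    let data: Vec<u32> = (0..(nrows * ncols) as u32).collect();

    // Element at row 1, column 2 of a 2x3 row-major array is data[5]
    assert_eq!(data[flat_index(1, 2, ncols)], 5);

    // The shape check from_shape performs: len must equal nrows * ncols
    assert_eq!(data.len(), nrows * ncols);
}
```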
We must then adapt said HDF5 writer so that it accepts an ArrayView2
instead
of an &Array2
…
use ndarray::ArrayView2;
pub fn write(&mut self, result_v: ArrayView2<Float>) -> hdf5::Result<()> {
…and then modify the CPU version of run_simulation()
so that it turns the
&Array2
that it has into an ArrayView2
, which is done with a simple call to
the view()
method:
hdf5.write(concentrations.current_v().view())?;
Exercise
Integrate these changes, and measure their effect on runtime performance.
You may notice that your microbenchmarks tell a different story than the running time of the main simulation binary. Can you guess why?
Alternatives with nicer-looking APIs involve creating self-referential objects, which in Rust are a rather advanced topic, to put it mildly.