For some, Julia's distinctive feature is its speed. For others it is how easy it Julia makes it to combine together entirely unrelated projects into something new and useful.

Here we'll try and demonstrate this by parallelizing the creation and usage of banded block-banded matrices. These matrices have a very specific super-structure and inner structure: they are composed of bands of block matrices, where each block is itself banded. Matrix-vector operations are much less demanding for banded matrices than for square matrices. They can arise when solving differential equations.

There are quite a few ways the parallelization could be done, but here we want to use Julia's ability to (fairly) easily mix and match concepts. So we'll mix the concept of banded block-banded matrices themselves with Julia's standard shared arrays. If we can parallelize simply by switching storage types that the banded block-banded matrix operate on, then it should be fairly easy to extend the same technique to different parallelization schemes, whether MPI or GPUs.

In order to run the code in parallel, we first need to add worker processes. The following cell adds 7 processes on the current machine. We also create an environment with a few libraries that will be used in this notebook. Downloading and pre-compiling libraries can be quite slow. That's okay, it's a one-off cost we incur only when setting up a new project.

using Pkg Pkg.activate(pwd())

Activating environment at `~/blog/Project.toml`

packages = ( "Distributed", "BlockBandedMatrices", "SharedArrays", "ArrayLayouts", "DataFrames", "VegaLite", "BenchmarkTools", "BlockArrays" ) for package in packages Pkg.add(package) end

Updating registry at `~/.julia/registries/General` Resolving package versions... [no changes]

Pkg.instantiate() using DataFrames, BlockBandedMatrices, SharedArrays, VegaLite, Distributed, BenchmarkTools nprocs() < 8 && addprocs(8 - nprocs()) nprocs()

8

There is still an extra setup step before we can get going. Now that workers have been created, they need to
load the same environment that we have on the main process. Otherwise, when we ask them to create a
block-banded matrix, the workers will have no idea what that is. This is where `@everywhere`

comes
in. Its purpose is to have all processes run the same commands:

@everywhere begin using Pkg Pkg.activate($(pwd())) end

Activating environment at `~/blog/Project.toml` From worker 8: Activating environment at `~/blog/Project.toml` From worker 7: Activating environment at `~/blog/Project.toml` From worker 4: Activating environment at `~/blog/Project.toml` From worker 3: Activating environment at `~/blog/Project.toml` From worker 6: Activating environment at `~/blog/Project.toml` From worker 5: Activating environment at `~/blog/Project.toml` From worker 2: Activating environment at `~/blog/Project.toml`

@everywhere begin using DataFrames using BlockBandedMatrices: BandedBlockBandedMatrix, _BandedBlockBandedMatrix using BlockBandedMatrices: BandedBlockBandedSizes, blockbandwidths, subblockbandwidths using BlockArrays: PseudoBlockArray, Block, nblocks using SharedArrays: SharedArray using ArrayLayouts: muladd!, MemoryLayout using VegaLite end

Finally we can get to work. The goal is to mix and match two different concepts Julia understands: block-banded matrix and shared arrays. Thanks to its typing system and multiple dispatch, Julia is extremely adept at creating unholy love children between unlikely pairs. So the next few lines create a type alias, i.e. a nickname, for a banded block-banded matrix where the underlying storage type is a shared array:

@everywhere begin const SharedPseudoBlock = PseudoBlockArray{T, 2, SharedArray{T, 2}, B} where {T, B} const SharedBandedBlockBandedMatrix = BandedBlockBandedMatrix{T, SharedPseudoBlock{T, B}} where {T, B} end

`T`

refers to the element type. It is left unspecified for now thanks to the `where`

syntax. Later we will be able to specialize it and create matrices of integers or floating points
(`B`

refers to the type coding for block-sizes and is an implementation detail as far we are
concerned).

There two types above are nested matryoshka dolls. In practice each type adds a bit more structure until
we have full banded block-banded matrix. At the very bottom, we have the shared array, responsible for
ensuring the data can be viewed and modified by different Julia processes. Then above that,
`PseudoBlockArray`

views that same data as blocks: we will be able to address each block
independently as though it were a separate array. Finally, `BandedBlockBandedMatrix`

adds a banded
structure to the blocks: only blocks that are non-zero actually are represented in memory, and each block
itself contains only non-zero elements (This is almost true. For more information see blas.).

Now we add some functions to create un-initialized matrices. An un-initialized matrix is fully functional. However, the code makes no guarantee with respect to the matrix coefficients. It could be a matrix filled with zeros, or filled with random or less random numbers, depending on what the memory now allocated to the array used to hold. The constructors for the block-banded matrices are somewhat faffy. The main point is that we can create a banded block-banded matrix by specifying the size of the blocks, the bandwidth of the matrix of blocks and the bandwidth of the blocks themselves, where bandwidth means the number of non-zero upper and lower diagonals.

@everywhere begin function SharedBandedBlockBandedMatrix{T}( ::UndefInitializer, dims::NTuple{2, AbstractVector{Int}}, lu::NTuple{2, Int}, λμ::NTuple{2, Int}; kwargs... ) where T blocksizes = BandedBlockBandedSizes(dims..., lu..., λμ...) shared = SharedArray{T}(size(blocksizes); kwargs...) block = PseudoBlockArray(shared, blocksizes.data_block_sizes) _BandedBlockBandedMatrix(block, blocksizes) end end

So now we have a block-banded matrix which is shared by multiple processes. The magic works because
`BandedBlockBandedMatrix`

is an abstraction, namely a thin layer of behaviours sprinkled over a
type that understands blocks. This type, `PseudoBlockArray`

is itself an abstraction that sprinkles
a few behaviours over an array. And this array is a `SharedArray`

. So when the block banded matrix
needs to see one of its elements, it knows it should refer that request via a block. And the block knows it
should refer the request further down an array. And the array happens to be shared across processes. It's
turtles all the way down except for the turtles.

To illustrate this, let's create a banded block-banded matrix with blocks composed of 3x4 matrices
(`n`

and `m`

), with the band of blocks from `l`

to `u`

and
each block-banded from `λ`

to `μ`

.

# Number of blocks of rows, and number of rows per block N, n = 2, 3 # Number of blocks of columns, and number of columns per block M, m = N + 1, n + 1 # number of non-zero lower and upper diagonals of the banded "super" matrix of blocks l, u = 1, 0 # number of lower and upper diagonals of the banded block λ, μ = 0, 2 A = SharedBandedBlockBandedMatrix{Float64}(undef, (repeat([n], N), repeat([m], M)), (l, u), (λ, μ))

2×3-blocked 6×12 BandedBlockBandedMatrix{Float64,PseudoBlockArray{Float64,2 ,SharedArray{Float64,2},BlockArrays.BlockSizes{2,Tuple{Array{Int64,1},Array {Int64,1}}}}}: 0.0 0.0 0.0 ⋅ │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ 0.0 0.0 0.0 │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.0 0.0 │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ────────────────────┼──────────────────────┼──────────────────── 0.0 0.0 0.0 ⋅ │ 0.0 0.0 0.0 ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ 0.0 0.0 0.0 │ ⋅ 0.0 0.0 0.0 │ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 0.0 0.0 │ ⋅ ⋅ 0.0 0.0 │ ⋅ ⋅ ⋅ ⋅

But it's not all magic. We can create a matrix that is accessible to all the workers. However, that's just data. Logic itself is still sequential. In other words, if we multiply the shared matrix with a vector, then each process will run the whole operation. We still have to teach them how to work together.

We first try and get processes to collaborate on filling up the matrix with meaningful values:

@everywhere begin """Populates the shared banded block-banded matrix in parallel""" function populate!( block_populate!::Function, A::SharedBandedBlockBandedMatrix) @sync begin for (p, proc) in enumerate(procs(A.data.blocks)) @async remotecall_wait(populate!, proc, block_populate!, A, p) end end A end """Populates the blocks of A that are assigned to process p.""" function populate!( block_populate!::Function, A::SharedBandedBlockBandedMatrix, p) n = sum( min(i + A.u, nblocks(A, 2)) - max(i - A.l, 1) + 1 for i in 1:nblocks(A, 1) ) m = length(procs(A.data.blocks)) start = (n ÷ m) * (p - 1) + min(n % m, p - 1) stop = (n ÷ m) * p + min(n % m, p) k = 0 for i in 1:nblocks(A, 1), j in max(i - A.l, 1):min(i + A.u, nblocks(A, 2)) if k >= stop break elseif k >= start block_populate!(view(A, Block(i, j)), i, j) end k += 1 end A end end

The code proceeds in a cascade of three steps. First we have a function to signal workers to start working. Then we have a function that will run on each worker process. It simply parcels out the list of mutable blocks in the banded matrix and figures out which are assigned to its process id. For each assigned block, it then calls the third function in the cascade, to actually populate the given block.

If the number of block $n$ were exactly divisible by the number of processes $m$, then they would be parcelled out by assigning the first $\frac{n}{m}$ blocks to process 0, the second $\frac{n}{m}$
blocks to process 1, and so on. The code does just that. However it also takes care of the case where $n$ is not a multiple of $m$. In that case, assuming there
$n \% m$ leftover blocks ($\%$ is the modulo
operation, or remainder), then each of the first $n \% m$ processes get one
extra block. The expression `min(n % m, p)`

computes the cumulative number of extra
blocks assigned to processes $p$ and lower.

Performance-hungry users will notice that this simple-minded implementation runs on idle for a fair bit of the loop over blocks. But in practice, it turns out to be a minimal waste of resources compared to the actual cost of the calculation. Frankly, it's the lack of elegance that grates, rather than any performance consideration. But it's simple.

The parallelization magic happens in the `@sync`

for loop. In practice, each `@async`

loop item generates a task and schedules it to be sent and executed by process `proc`

. Then, right
before exiting the `@sync`

code-block, the tasks are sent over to the processes, and the code waits
until the tasks are finished running.

In practice, usage looks like this:

populate!(A) do block, i, j # block: a view of the block to populate # i, j: indices of the block for x in 1:size(block, 1), y in max(x - A.λ, 1):min(x + A.μ, size(block, 2)) block[x, y] = 1 end end

2×3-blocked 6×12 BandedBlockBandedMatrix{Float64,PseudoBlockArray{Float64,2 ,SharedArray{Float64,2},BlockArrays.BlockSizes{2,Tuple{Array{Int64,1},Array {Int64,1}}}}}: 1.0 1.0 1.0 ⋅ │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ 1.0 1.0 1.0 │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1.0 1.0 │ ⋅ ⋅ ⋅ ⋅ │ ⋅ ⋅ ⋅ ⋅ ────────────────────┼──────────────────────┼──────────────────── 1.0 1.0 1.0 ⋅ │ 1.0 1.0 1.0 ⋅ │ ⋅ ⋅ ⋅ ⋅ ⋅ 1.0 1.0 1.0 │ ⋅ 1.0 1.0 1.0 │ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ 1.0 1.0 │ ⋅ ⋅ 1.0 1.0 │ ⋅ ⋅ ⋅ ⋅

For comparison we also construct the serial version of populate, as well as a simple scheme to fill blocks with ones:

@everywhere function populate!( block_populate!::Function, A::BandedBlockBandedMatrix) for i in 1:nblocks(A, 1), j in max(i - A.l, 1):min(i + A.u, nblocks(A, 2)) block_populate!(view(A, Block(i, j)), i, j) end A end @everywhere function simplefill!(A::BandedBlockBandedMatrix) populate!(A) do bk, i, j for x in 1:size(bk, 1), y in max(x - A.λ, 1):min(x + A.μ, size(bk, 2)) bk[x, y] = x + y + i + j end end end

And now we can run benchmarks:

using BenchmarkTools N, n = 8, 3 l, u, λ, μ = 1, 2, 0, 1 M, m = 6, n A = SharedBandedBlockBandedMatrix{Float64}(undef, (repeat([n], N), repeat([m], M)), (l, u), (λ, μ)) @benchmark simplefill!($A)

BenchmarkTools.Trial: memory estimate: 58.13 KiB allocs estimate: 991 -------------- minimum time: 1.037 ms (0.00% GC) median time: 1.201 ms (0.00% GC) mean time: 1.247 ms (1.13% GC) maximum time: 37.296 ms (49.16% GC) -------------- samples: 3993 evals/sample: 1

B = BandedBlockBandedMatrix{Float64}(undef, (repeat([n], N), repeat([m], M)), (l, u), (λ, μ)) @benchmark simplefill!($B)

BenchmarkTools.Trial: memory estimate: 9.86 KiB allocs estimate: 127 -------------- minimum time: 5.882 μs (0.00% GC) median time: 6.181 μs (0.00% GC) mean time: 7.355 μs (10.07% GC) maximum time: 685.475 μs (98.99% GC) -------------- samples: 10000 evals/sample: 5

And that's not a win for the multi-core team... But we do not always expect a win in every situation. Surely, the performance depends on the size of the matrix? So let's try a few:

n, l, u, λ, μ = 500, 1, 2, 2, 1 serial, parallel, Ns = [], [], [1, 3, 10, 50, 100, 150, 200, 300, 500] for N = Ns A = SharedBandedBlockBandedMatrix{Float64}( undef, (repeat([n], N), repeat([n], N)), (l, u), (λ, μ)) bench = @benchmark simplefill!($A) push!(parallel, bench) B = BandedBlockBandedMatrix{Float64}( undef, (repeat([n], N), repeat([n], N)), (l, u), (λ, μ)) bench = @benchmark simplefill!($B) push!(serial, bench) end

To plot the results we will use VegaLite.jl, a Julia wrapper around VegaLite. So we will first create a dataset with all the accumulated dated aggregated by the benchmarks and then feed it to a VegaLite schema.

function timing_data(Ns, serial, parallel) times = DataFrame() for (n, strial, ptrial) in zip(Ns, serial, parallel) times = vcat( times, DataFrame( n=repeat([n], length(strial.times) + length(ptrial.times)), method=vcat( repeat([:serial], length(strial.times)), repeat([:parallel], length(ptrial.times)) ), time=vcat(strial.times, ptrial.times) ) ) end times end function plot_times(times::DataFrame) times |> @vlplot( config = { background="#333", title={color="#fff"}, style={ "guide-label"={fill="#fff"}, "guide-title"={fill="#fff"}, }, axis={ domainColor="#fff", gridColor="#fff", tickColor="#fff" } }, x={:n, title="Number of blocks", type=:quantitative}, color={:method, type=:nominal} ) + @vlplot(mark={:line, point=:true}, y="mean(time)") + @vlplot( mark={:errorband, extent=:stdev}, y={ :time, title="mean time per fill(nanoseconds)", type=:quantitative } ) end plot_times(Ns, serial, parallel) = (plot_times∘timing_data)(Ns, serial, parallel) plot_times(Ns, serial, parallel)

The results are rather encouraging. The behavior of both serial and parallel versions are linear with the number of non-zero blocks, and hence with the number of non-zero numbers entries in the matrix. This is what we hoped for. And the parallel version is faster.

Next we try to teach the shared block-banded matrix to do matrix-vector multiplications in parallel. First, and following the previous schema, we create a function that will offload work to the different processes.

@everywhere begin function multiply!( out::SharedArray{T, 1}, A::SharedBandedBlockBandedMatrix{T}, x::SharedArray{T, 1} ) where T @sync begin for (p, proc) in enumerate(procs(A.data.blocks)) @async remotecall_wait(multiply!, proc, out, A, x, p) end end A end end

Then, each process figures out which blocks are assigned to it. And it calls the multiplication operation for each particular block of the input matrix with each particular segment of the input vector.

@everywhere begin """Populates the blocks of A that are assigned to process p.""" function multiply!( out::SharedArray{T, 1}, A::SharedBandedBlockBandedMatrix{T}, x::SharedArray{T, 1}, p::Integer ) where T l, u = blockbandwidths(A) λ, μ = subblockbandwidths(A) N, M = nblocks(A) n = sum( min(i + A.u, nblocks(A, 2)) - max(i - A.l, 1) + 1 for i in 1:nblocks(A, 1) ) m = length(procs(A.data.blocks)) start = (n ÷ m) * (p - 1) + min(n % m, p - 1) stop = (n ÷ m) * p + min(n % m, p) k = 0 for i in 1:nblocks(A, 1), j in max(i - A.l, 1):min(i + A.u, nblocks(A, 2)) if k >= stop break elseif k >= start Ablock = view(A, Block(i, j)) muladd!( one(T), Ablock, view(x, Ablock.indices[2].indices), zero(T), view(out, Ablock.indices[1].indices) ) end k += 1 end A end end

The clever part is not here. The clever part is the machinery that translates multiplying a view of a
block-banded matrix and a view of a vector into something that a BLAS function designed in the 70s
can consume, i.e. a mix of integers, floating points, and raw pointers to memory locations. That part is done
by ArrayLayouts.jl. Without going into to many
details, ArrayLayouts defines a trait,
`MemoryLayout`

which informs on how a specialized matrix is arranged in memory. With that
information, `ArrayLayout`

can figure out whether it can call BLAS directly, or whether it first
needs to transform the inputs to objects with standard memory layout:

MemoryLayout(typeof(A)), MemoryLayout(typeof([1.0, 2.0, 4.0]))

(BlockBandedMatrices.BandedBlockBandedColumnMajor(), ArrayLayouts.DenseColu mnMajor())

What sets Julia apart from other languages is the how relatively simple it is to extract common tropes
between diverse packages and leverage them into a single common grammar. The drawback is that we are seeing a
proliferation of small packages meant to define traits and methods for other packages downstream. It is not
clear to me what will happen as the ecosystem around Julia gets older, and as maintainers and codes come and
go. But leaving `left-pad`

behind, lets get back to shared arrays and check whether parallelizing
the matrix-vector operation has had any success:

n, l, u, λ, μ = 500, 1, 2, 2, 1 serial, parallel, Ns = [], [], [1, 3, 10, 50, 100, 150, 200, 300, 500] for N = Ns A = SharedBandedBlockBandedMatrix{Float64}( undef, (repeat([n], N), repeat([n], N)), (l, u), (λ, μ)) x = SharedArray{Float64}(rand(size(A, 1))) out = SharedArray{Float64}(rand(size(A, 2))) simplefill!(A) bench = @benchmark multiply!($out, $A, $x) push!(parallel, bench) B = BandedBlockBandedMatrix{Float64}( undef, (repeat([n], N), repeat([n], N)), (l, u), (λ, μ)) x = rand(size(A, 1)) out = rand(size(A, 2)) simplefill!(B) bench = @benchmark muladd!(1.0, $B, $x, 0.0, $out) push!(serial, bench) end

plot_times(Ns, serial, parallel)

Once again, the result is rather encouraging. We are getting some speedup. Exactly how much will depend on your setup. In our case, we are using a Jupyter notebook running on the login node of the Imperial College HPC, and results can be a bit noisy.

We've shown how easy it can be to parallelize operations over an existing and somewhat complicated data
structure. Banded block-banded matrices are quite specialized. It would not be surprising if we could do
little to parallelize theirs operations without digging deep into their internals and modifying the orginal
package itself. But this is Julia. We have implemented a major new feature for banded block-banded matrices
*without touching the original code*. Even more interesting is that we have actually leveraged two
disparate packages to get the work done. We have combined `SharedArrays`

for its ability to
share arrays across processes with `BlockBandedMatrices`

for its specialized matrix type. Julia
does provide speed and nerdy street-creds, but most of all its design promotes code reuse and bottom-up,
unplanned collaboration beyond anything we've seen in any major language.

We gratefully acknowledge sponsorship from NumFocus Open-Source Grant from Summer 2018. The grant allowed us to parametrize BlockBandedMatrices.jl so that it can accept different storage backends, with the goal of pushing the matrices onto the GPU.