GPU and profiling¶
In this example, we will run the ring network created in an earlier tutorial on a GPU, and show how to profile the performance difference. Only the differences with that tutorial will be described.
Note
Concepts covered in this example:
- Building an arbor.context that will use a GPU. This requires that you have built Arbor with GPU support enabled.
- Building an arbor.domain_decomposition and providing an arbor.partition_hint.
- Profiling an Arbor simulation using arbor.meter_manager.
The hardware context¶
An execution context describes the hardware resources on which the simulation will run.
It contains the thread pool used to parallelise work on the local CPU, and optionally describes GPU resources
and the MPI communicator for distributed simulations. In some other examples, the arbor.single_cell_model
object created the execution context arbor.context behind the scenes. The details of the execution
context can be customized by the user: we may specify the number of threads in the thread pool, determine the
id of the GPU to be used, or create our own MPI communicator.
Step (11) creates a hardware context where we set the gpu_id. This requires
that you have built Arbor manually, with GPU support (see here how to do that). On a regular
consumer device with a single GPU, the index you should pass is 0. Change the value to run the example with and
without a GPU.
# (11) Set up the hardware context
# gpu_id set to None will not use a GPU.
# gpu_id=0 instructs Arbor to use the first GPU present in your system
context = arbor.context(threads="avail_threads", gpu_id=None)
print(context)
Profiling¶
Arbor comes with an arbor.meter_manager to help you profile your simulations. In this case, you can run the
example with gpu_id=None and gpu_id=0 and observe the difference with the meter_manager.
Step (12) sets up the meter manager and starts it using the (only) context. This way, only Arbor-related execution is measured, not Python code.
Step (13) instantiates the recipe and sets the first checkpoint on the meter manager. We now have the time it took to construct the recipe.
# (12) Set up and start the meter manager
meters = arbor.meter_manager()
meters.start(context)
# (13) Instantiate recipe
ncells = 50
recipe = ring_recipe(ncells)
meters.checkpoint("recipe-create", context)
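The checkpoint mechanism can be pictured as a series of named timestamps, where each checkpoint records the time elapsed since the previous one. Below is a minimal pure-Python sketch of that idea using time.perf_counter; it is an analogy to illustrate the concept, not Arbor's actual meter_manager implementation (which also tracks memory and aggregates across ranks):

```python
import time

class SimpleMeters:
    """Toy analogue of a meter manager: records named checkpoints and
    the wall-clock time elapsed between consecutive checkpoints."""

    def __init__(self):
        self.checkpoints = []

    def start(self):
        # Begin timing; subsequent checkpoints measure from this point.
        self._last = time.perf_counter()

    def checkpoint(self, name):
        # Store the interval since the previous checkpoint (or start).
        now = time.perf_counter()
        self.checkpoints.append((name, now - self._last))
        self._last = now

meters = SimpleMeters()
meters.start()
_ = [i * i for i in range(100_000)]  # stand-in for recipe construction
meters.checkpoint("recipe-create")
for name, dt in meters.checkpoints:
    print(f"{name}: {dt:.3f} s")
```

Each call to checkpoint closes one measured interval and opens the next, which is exactly how the checkpoints in steps (13) through (17) partition the run into phases.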
The domain decomposition¶
The domain decomposition describes the distribution of the cells over the available computational resources.
The arbor.single_cell_model also handled that without our knowledge in the previous examples.
Now, we have to define it ourselves.
The arbor.domain_decomposition class can be manually created by the user, by deciding which cells
go on which ranks. Or we can use a load balancer that can partition the cells across ranks according to
some rules. Arbor provides arbor.partition_load_balance, which, using the recipe and execution
context, creates the arbor.domain_decomposition object for us.
A way to customize arbor.partition_load_balance is by providing an arbor.partition_hint. Hints let
you configure how cells are distributed over the resources in the context, but without requiring you
to know the precise configuration of a context up front. Whether you run your simulation on your
laptop CPU, desktop GPU, CPU cluster or GPU cluster, using partition hints you can
just say: use GPUs, if available. You only have to change the context to actually define which
hardware Arbor will execute on.
Step (14) creates an arbor.partition_hint, and tells it to put up to 1000 cells in a group allocated to a GPU,
and to prefer the utilisation of the GPU if present. In fact, the default distribution strategy of
arbor.partition_load_balance already spreads out cells as evenly as possible over CPUs, and groups
them (up to 1000) on GPUs, so strictly speaking it was not necessary to give that part of the hint.
Lastly, a dictionary is created in which hints are assigned to a particular arbor.cell_kind.
Different kinds may favor different execution, hence the option.
In this simulation, there are only arbor.cell_kind.cable cells, so we assign the hint to that kind.
Step (15) creates a arbor.partition_load_balance
with the recipe, context and hints created above.
Another checkpoint will help us understand how long creating the load balancer took.
# (14) Define a hint for the execution.
hint = arbor.partition_hint()
hint.prefer_gpu = True
hint.gpu_group_size = 1000
print(hint)
hints = {arbor.cell_kind.cable: hint}
# (15) Create the domain decomposition
decomp = arbor.partition_load_balance(recipe, context, hints)
print(decomp)
meters.checkpoint("load-balance", context)
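The grouping the hint asks for can be illustrated with a plain-Python sketch. This is an illustration of the strategy described above (groups of up to gpu_group_size cells on a GPU, one cell per group on a CPU), not Arbor's actual load balancer, which also accounts for ranks and connectivity:

```python
def group_cells(ncells, gpu_group_size=1000, have_gpu=True):
    """Sketch of hint-driven grouping: on a GPU, cells are gathered into
    groups of up to `gpu_group_size`; on a CPU, each cell gets its own group."""
    size = gpu_group_size if have_gpu else 1
    return [list(range(start, min(start + size, ncells)))
            for start in range(0, ncells, size)]

# 50 cells all fit into a single GPU group of up to 1000 cells:
print(len(group_cells(50, have_gpu=True)))   # → 1
# On the CPU, the same 50 cells become 50 single-cell groups:
print(len(group_cells(50, have_gpu=False)))  # → 50
```

Larger groups let the GPU batch work over many cells at once, whereas one-cell groups let the CPU thread pool balance work at a fine grain; this is why the preferred group size differs per backend.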
The simulation¶
Step (16) creates an arbor.simulation, sets the spike recorders to record, creates handles
to their eventual results, and makes another checkpoint.
# (16) Simulation init and set spike generators to record
sim = arbor.simulation(recipe, context, decomp)
sim.record(arbor.spike_recording.all)
handles = [sim.sample((gid, 0), arbor.regular_schedule(1)) for gid in range(ncells)]
meters.checkpoint("simulation-init", context)
The execution¶
Step (17) runs the simulation. Since we have more cells this time, which are connected in series,
it will take some time for the action potential to propagate. In the ring network
we saw that it takes about 5 ms for the signal to propagate through one cell, so let's set the runtime to
5*ncells. Then, another checkpoint, so that we'll know how long the simulation took.
# (17) Run simulation
sim.run(ncells * 5)
print("Simulation finished")
meters.checkpoint("simulation-run", context)
The results¶
The scientific results should be similar to those in the ring network, apart from the number of cells,
so we'll not discuss them here. Let's turn our attention to the meter_manager.
# (18) Results
# Print profiling information
print(f"{arbor.meter_report(meters, context)}")
Step (18) shows how arbor.meter_report can be used to read out the meter_manager.
It generates a table with the time between checkpoints. As an example, the following table is the result of a run
on a 2019 laptop CPU:
---- meters -------------------------------------------------------------------------------
meter time(s) memory(MB)
-------------------------------------------------------------------------------------------
recipe-create 0.000 0.059
load-balance 0.000 0.007
simulation-init 0.012 0.662
simulation-run 0.037 0.319
meter-total 0.049 1.048
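The meter-total row is simply the sum of the per-checkpoint rows, which makes it easy to see which phase dominates. Checking that against the (run-specific) numbers in the table above:

```python
# Per-checkpoint times from the example table above (values vary per run).
times = {
    "recipe-create":   0.000,
    "load-balance":    0.000,
    "simulation-init": 0.012,
    "simulation-run":  0.037,
}
total = sum(times.values())
print(f"meter-total: {total:.3f} s")  # → meter-total: 0.049 s
```

In this run the actual time stepping (simulation-run) dominates, so that is the phase where enabling the GPU should make the clearest difference.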
The full code¶
You can find the full code of the example at python/examples/network_ring_gpu.py.