
become more complex, making it far from trivial to leverage the available performance offered by
supercomputers. Individual computers of HPC clusters (called computational nodes) can consist
of hundreds of CPU cores each, yet it is challenging to write programs that can scale to such high
core counts. The main memory (RAM, Random-Access Memory) of each node is accessed through multiple levels of complex cache hierarchies, and its capacity is so large that it has to be split into multiple physical locations with varying access latencies (NUMA, Non-Uniform Memory Access), which requires specialized programming techniques to achieve optimal performance. Moreover, the ever-present accelerators, for example GPUs, might require their users to adopt completely different programming models and frameworks.
Historically, optimized HPC software was primarily written using systems programming or scientifically focused languages (e.g. C, C++ or Fortran) and specialized libraries for parallelizing and distributing computation, such as OpenMP (Open Multi-Processing) [14], CUDA [15] or MPI (Message Passing Interface) [16]. While these rather low-level technologies are able to provide the best possible performance, developing (and maintaining) applications that use them can be challenging and slow. It is unreasonable to expect that most domain scientists who develop software for HPC clusters (and who are often not primarily software developers) will be able to use all these technologies efficiently without making the development process slow and cumbersome. This task should be left to specialized performance engineers, enabling the scientists to focus on their problem domain [17].
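To illustrate how explicit this style of programming is, the following minimal sketch shows point-to-point communication with MPI. MPI programs are typically written in C, C++ or Fortran; the mpi4py Python binding is used here purely for brevity, and the exchanged data is a made-up example.

    # Minimal MPI sketch (mpi4py); run with e.g. `mpirun -np 2 python example.py`.
    # The programmer must decide explicitly which process sends which data,
    # to whom, and when; there is no global view of the computation.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = [x * x for x in range(1000)]   # illustrative payload
        comm.send(data, dest=1, tag=0)        # explicit send to rank 1
    elif rank == 1:
        data = comm.recv(source=0, tag=0)     # explicit receive from rank 0
        print(f"rank 1 received {len(data)} items")

Even this toy exchange forces the author to reason about process ranks, message tags and the placement of data; production codes additionally have to deal with collective operations, overlapping communication with computation and failure handling.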
With the advent of more powerful hardware, HPC systems are able to solve new and increasingly demanding problems, not only in terms of the required computational power, but also in terms of data management, network communication patterns, and general software design and architecture. Areas such as weather prediction, machine-learning model training or big data analysis
require executing thousands or even millions of simulations and experiments. These experiments
can be very complex, consisting of multiple dependent steps, such as data ingestion, preprocessing,
computation, postprocessing, visualization, etc. It is imperative for scientists to have a quick
way of prototyping these applications, because their requirements change rapidly, and it would be
infeasible to develop them using only very low-level technologies.
The growing complexity of HPC hardware, software and use cases has given rise to the popularity of task-based programming models and paradigms. Task-based programming models allow users to focus on their problem domain and to prototype quickly, while still being able to describe complicated computations with a large number of individual steps and to utilize the available computational resources efficiently. With a task-based approach, a complex computation is described as a set of atomic computational blocks (tasks) that are composed together into a task graph, which captures the dependencies between the individual tasks.
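As a concrete illustration, the following minimal sketch shows one possible way to represent and execute such a graph in Python; the task functions, their names and the toy executor are hypothetical and do not correspond to any particular task runtime.

    # A tiny task graph: each task is an atomic function, and the graph records
    # which other tasks it depends on. All names here are illustrative only.

    def load():
        return list(range(10))

    def preprocess(data):
        return [value * 2 for value in data]

    def analyze(data):
        return sum(data)

    def visualize(result):
        return f"result = {result}"

    # Task graph: task name -> (function, names of the tasks it depends on).
    GRAPH = {
        "load": (load, []),
        "preprocess": (preprocess, ["load"]),
        "analyze": (analyze, ["preprocess"]),
        "visualize": (visualize, ["analyze"]),
    }

    def execute(graph):
        """Run each task once all of its dependencies have finished. A real
        scheduler would run independent tasks in parallel, possibly on
        different cluster nodes."""
        results = {}
        remaining = dict(graph)
        while remaining:
            for name, (fn, deps) in list(remaining.items()):
                if all(dep in results for dep in deps):
                    results[name] = fn(*(results[dep] for dep in deps))
                    del remaining[name]
        return results

    print(execute(GRAPH)["visualize"])  # prints "result = 90"

Note that the user only defines what each task computes and which tasks it depends on; deciding where and when each task runs is left to the executing runtime.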
Task graphs abstract away most of the complexity of network communication and parallelization, and they are general enough to describe a large set of programs in a practical and simple way. At the same time, they remain amenable to automatic optimization and parallelization, which helps to bring the performance of programs described by a task graph close to that of manually parallelized and distributed programs, at a fraction of the development cost for the application developer. Task graphs are also relatively portable by default, as the task graph programming model typically does not make many assumptions about the target platform; therefore, the same task graph can be executed on various