International Workshop on Runtime and Operating Systems for Supercomputers
Washington, D.C., USA, June 27, 2017
Outline
Application Challenges and Motivation
Telemetry as HPC Platform Service
Context Graph Model
Interaction and Interface
Prototype
Discussion
Definition
HPC Telemetry Data
Any data that describes the state of an HPC platform and the state of
the process-based representation of the applications running on it.
Application Challenges & Motivation
A Normal Day at the Office
Strange runtime distribution of homogeneous tasks
Finding the Culprit
Added logging to the application to understand where time is spent
Some tasks spent 10x longer downloading input dataset
A faulty edge switch caused external connectivity issues on some nodes
Introduced helper tasks that collect process-level metrics
Some tasks spent a huge amount of time in I/O wait
A strange problem with Lustre caused slow filesystem I/O on a small set of nodes
Another Interesting Case
Again, an unexpected runtime distribution of supposedly
homogeneous simulation tasks
Finding the Culprit
Used the same instrumentation strategy
Outlier tasks ran out of memory and stalled
Specific structural properties of the input data would cause the
algorithm to take a different trajectory
Consequences
We encountered unexpected "dynamic behavior", both on the system and
on the application side
Knowing that these were not edge cases, we started making
our "debugging" approach a more vital part of the application framework:
Collecting process- and OS-level information during all runs
Applying simple adaptive strategies to mitigate issues at runtime:
Blacklisting 'weird' nodes
Reducing the task packing (preempting other tasks on the node) when
memory usage exceeds a threshold
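The two adaptive strategies above can be sketched in a few lines. This is an illustrative stub, not the actual framework code: the metric names (`io_wait`, `mem_used`) and the returned action labels are assumptions for the example.

```python
# Sketch of the adaptive mitigations described above: blacklist
# 'weird' nodes and trigger preemption when memory runs low.
# Thresholds and metric names are illustrative assumptions.

MEM_THRESHOLD = 0.9   # fraction of node memory considered critical
IO_WAIT_THRESHOLD = 0.5  # fraction of time in I/O wait that marks a node 'weird'

blacklist = set()

def check_node(node_id, metrics):
    """Return the mitigation actions to apply for one node,
    given its latest process-/OS-level metrics."""
    actions = []
    if metrics.get("io_wait", 0.0) > IO_WAIT_THRESHOLD:
        blacklist.add(node_id)        # stop scheduling tasks here
        actions.append("blacklist")
    if metrics.get("mem_used", 0.0) > MEM_THRESHOLD:
        actions.append("preempt")     # reduce task packing on this node
    return actions
```

A scheduler loop would call `check_node` for every node each sampling interval and skip blacklisted nodes when placing new tasks.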
Experience & Lessons Learned
Instrumentation requires a lot of effort
Collecting and analysing data (at scale) is non-trivial
Interpreting and feeding the data to the application is difficult
Existing tooling is sparse and mostly geared toward post-mortem,
parallel-code debugging
Without knowing and understanding the platform "anatomy" and context, data
can be difficult to interpret, e.g., what is considered "poor" I/O,
what is the spatial layout of processes across nodes?
Experience & Lessons Learned cont.
Application-specific instrumentation is a widespread technique to mitigate heterogeneity, dynamic behavior, etc.
Addressing the issue is expensive, but ignoring it can be expensive, too.
Telemetry as HPC Platform Service
Status Quo: Application-Driven
Application-level collection and processing of telemetry data can cause a lot of overhead.
Platform Service Approach
Telemetry service takes over data collection and provides data access and higher-level functions to applications
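From the application's perspective, such a service reduces to a query interface plus a notification channel. The sketch below shows one hypothetical client-side shape; the class and method names are assumptions for illustration, not an existing API.

```python
# Hypothetical client view of a telemetry platform service: the
# application queries collected data and registers callbacks instead
# of instrumenting and collecting metrics itself.

class TelemetryClient:
    def __init__(self):
        self._subscribers = []

    def query(self, metric, node):
        """Programmatic access to collected data (stubbed here)."""
        return {"metric": metric, "node": node, "values": []}

    def subscribe(self, predicate, callback):
        """Notification capability: invoke callback for matching samples."""
        self._subscribers.append((predicate, callback))

    def _publish(self, sample):
        # Invoked by the service side when a new sample arrives.
        for predicate, callback in self._subscribers:
            if predicate(sample):
                callback(sample)
```

An application might subscribe with `lambda s: s["io_wait"] > 0.5` to be notified of nodes showing the I/O-wait symptom from the earlier example, without running its own helper tasks.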
Requirements
Captures the time-variant anatomy and
properties of applications
Captures the time-variant physical anatomy and
properties of the HPC platform
Describes the mapping between the two (context!)
Allows for arbitrary levels of detail
Provides programmatic access to the data
Allows offloading data analytics, e.g., extracting trends from streams of raw data
Has notification capabilities
Requirements cont.
Keeps historic data (possibly in condensed form)
Is deployable at scale (think exascale!)
Is consistent across platforms
Context Graph Model
Graph-Based Model
Provides the context in which time-series can be embedded
We use attributed graphs to describe entities and their relationships
Graphs provide an intuitive way to model arbitrary levels of complexity
A single context graph (CG) captures the connections between the
platform anatomy (sub-)graph (PAG) and the application anatomy
(sub-)graphs (AAG)
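A minimal sketch of such a context graph, using plain attributed nodes and edges. The `kind` attribute separates the platform anatomy sub-graph (PAG) from an application anatomy sub-graph (AAG); all entity names and attributes here are illustrative, not a fixed schema.

```python
# Context graph (CG) sketch: attributed nodes and edges, with the
# application-to-platform mapping expressed as a "runs_on" edge.

nodes = {
    "node042": {"kind": "PAG", "type": "compute_node", "cores": 24},
    "nic042":  {"kind": "PAG", "type": "nic"},
    "task.17": {"kind": "AAG", "type": "process", "pid": 4711},
}

edges = [
    ("node042", "nic042", {"rel": "has"}),       # platform structure
    ("task.17", "node042", {"rel": "runs_on"}),  # mapping between AAG and PAG
]

def subgraph(kind):
    """Return the IDs of all nodes in one anatomy sub-graph."""
    return {n for n, attrs in nodes.items() if attrs["kind"] == kind}
```

Time series (e.g., per-process I/O wait) can then be attached to the graph entities they describe, which is what gives the raw data its interpretable context.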
Spatial-Temporal Dynamics
Anatomy and structure of platform and applications are not static:
Application processes start and stop
Nodes appear and disappear
Hardware (e.g., GPUs or FPGAs) is added
...
All nodes and edges have timestamps that qualify their existence
To get a snapshot of the platform and applications at a
specific point in time, the graph can be queried for a specific
time or time range
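The timestamp-qualified existence of graph elements can be sketched as a validity interval per node, with a snapshot query filtering by a point in time. Field names (`valid_from`, `valid_to`) are assumptions for the example.

```python
# Time-qualified graph elements: each node carries a validity
# interval; valid_to=None means the element still exists.

nodes = [
    {"id": "node042", "valid_from": 0,   "valid_to": None},  # still present
    {"id": "task.17", "valid_from": 100, "valid_to": 250},   # finished task
]

def snapshot(t):
    """Return the IDs of all nodes that existed at time t."""
    return {n["id"] for n in nodes
            if n["valid_from"] <= t
            and (n["valid_to"] is None or t <= n["valid_to"])}
```

A range query works the same way with interval overlap instead of point containment, which is what reconstructing "the platform as it was during run X" would use.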