Towards a Unified Telemetry Service Framework for HPC Environments


Ole Weidner
School of Informatics
University of Edinburgh
ole.weidner@ed.ac.uk
Adam Barker
School of Computer Science
University of St Andrews
adam.barker@st-andrews.ac.uk
Malcolm Atkinson
School of Informatics
University of Edinburgh
malcolm.atkinson@ed.ac.uk


International Workshop on Runtime and Operating Systems for Supercomputers

Washington, D.C., USA, June 27, 2017

Outline

  1. Application Challenges and Motivation
  2. Telemetry as HPC Platform Service
  3. Context Graph Model
  4. Interaction and Interface
  5. Prototype
  6. Discussion

Definition

HPC Telemetry Data

Any data that describes the state of an HPC platform and the state of the process-based representation of the applications running on it.

1. Application Challenges & Motivation

A Normal Day at the Office

Strange runtime distribution of homogeneous tasks


Finding the Culprit

  • Added logging to the application to understand where time is spent
    • Some tasks spent 10x longer downloading the input dataset
    • A faulty edge switch caused external connectivity issues on some nodes
  • Introduced helper tasks that collect process-level metrics (sketched below)
    • Some tasks spent a huge amount of time in I/O wait
    • A strange problem with Lustre caused slow filesystem I/O on a small set of nodes
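
A minimal sketch of such a helper task, assuming the psutil library; the sampling loop and output format are illustrative, not our exact instrumentation:

 # Helper task: periodically sample process-level metrics for one task.
 # On Linux, cpu_times() includes an iowait field.
 import sys
 import time
 import psutil

 def sample(pid, interval=5.0):
     proc = psutil.Process(pid)
     while proc.is_running():
         cpu = proc.cpu_times()                # user/system/iowait seconds
         rss = proc.memory_info().rss          # resident memory in bytes
         iowait = getattr(cpu, "iowait", 0.0)  # not present on all platforms
         print(f"t={time.time():.0f} pid={pid} iowait={iowait:.1f}s rss={rss}")
         time.sleep(interval)

 if __name__ == "__main__":
     sample(int(sys.argv[1]))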

Another Interesting Case

Again, an unexpected runtime distribution of supposedly homogeneous simulation tasks


Finding the Culprit

  • Used the same instrumentation strategy
    • Outlier tasks ran out of memory and stalled
    • Specific structural properties of the input data caused the algorithm to take a different trajectory

Consequences

  • We encountered unexpected "dynamic behavior" on both the system and the application side
  • Knowing that these are not edge cases, we started making our "debugging" approach a more integral part of the application framework:
    • Collecting process- and OS-level information during all runs
  • Applying simple adaptive strategies to mitigate issues at runtime (see the sketch after this list):
    • Blacklisting 'weird' nodes
    • Reducing the task-packing (preempting other tasks on the node) when memory usage exceeds a threshold
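
A sketch of what these mitigation rules could look like; on_node_report, preempt_one_task, and the metric names are hypothetical:

 # Simple adaptive strategies driven by per-node telemetry reports.
 MEM_THRESHOLD = 0.9   # fraction of node memory considered critical
 blacklist = set()     # nodes excluded from further task placement

 def preempt_one_task(node):
     """Hypothetical scheduler hook: requeue one task running on the node."""

 def on_node_report(node, metrics):
     if metrics.get("health") == "degraded":
         blacklist.add(node)        # blacklist 'weird' nodes
     if metrics.get("mem_used_frac", 0.0) > MEM_THRESHOLD:
         preempt_one_task(node)     # reduce task-packing on that node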

Experience & Lessons Learned

  • Instrumentation requires a lot of effort
  • Collecting and analysing data (at scale) is non-trivial
  • Interpreting and feeding the data to the application is difficult
  • Existing tooling is sparse and mostly geared toward post-mortem, parallel-code debugging
  • Without knowing and understanding the platform "anatomy" and context, data can be difficult to interpret, e.g., what is considered "poor" I/O, what is the spatial layout of processes across nodes?

Experience & Lessons Learned cont.

  • Application-specific instrumentation is a widespread technique for mitigating heterogeneity, dynamic behavior, etc.
  • Addressing the issue is expensive, but ignoring it can be expensive, too.

2. Telemetry as HPC Platform Service

Status Quo: Application-Driven

Application-level collection and processing of telemetry data can cause a lot of overhead.


Platform Service Approach

Telemetry service takes over data collection and provides data access and higher-level functions to applications


Requirements

  • Captures the time-variant physical anatomy and properties of applications
  • Captures the time-variant anatomy and properties of the HPC platform
  • Describes the mapping between the two (context!)
  • Allows for arbitrary levels of detail
  • Provides programmatic access to the data
  • Allows offloading data analytics, e.g., extracting trends from streams of raw data
  • Has notification capabilities

Requirements cont.

  • Keeps historic data (possibly in condensed form)
  • Is deployable at scale (think exascale!)
  • Is consistent across platforms

3. Context Graph Model


Graph-Based Model

  • Provides the context in which time-series can be embedded
  • We use attributed graphs to describe entities and their relationships
  • Graphs provide an intuitive way to model arbitrary levels of complexity
  • A single context graph (CG) captures the connections between the platform anatomy (sub-)graph (PAG) and the application anatomy (sub-)graphs (AAG), as sketched below
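
A small example of a context graph built as an attributed graph, assuming the networkx library; entity names and attributes are illustrative:

 # Context graph (CG) = platform anatomy (PAG) + application anatomy (AAG),
 # plus the edges that map one onto the other.
 import networkx as nx

 cg = nx.Graph()
 # PAG: a compute node that mounts a Lustre filesystem
 cg.add_node("node042", kind="compute_node", cores=24)
 cg.add_node("lustre1", kind="filesystem")
 cg.add_edge("node042", "lustre1", relation="mounts")
 # AAG: one application process with attached metrics
 cg.add_node("proc:1", kind="process", cpu_iowait=0.2)
 # The mapping between application and platform: the context
 cg.add_edge("proc:1", "node042", relation="runs_on")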

Spatial-Temporal Dynamics

  • The anatomy and structure of the platform and applications are not static:
    • Application processes start and stop
    • Nodes appear and disappear
    • Hardware (e.g., GPUs or FPGAs) is added
    • ...
  • All nodes and edges have timestamps that qualify their existence
  • To get a snapshot of the platform and applications at a specific point in time, the graph can be queried for a specific time or time range (see the sketch below)
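
A sketch of such a time-scoped query, continuing the networkx example above; the t_start/t_end attributes are illustrative:

 # Nodes and edges carry timestamps that qualify their existence; a
 # snapshot filters the graph at a single point in time.
 import networkx as nx

 def alive(attrs, t):
     return attrs["t_start"] <= t <= attrs.get("t_end", float("inf"))

 def snapshot(cg, t):
     return nx.subgraph_view(
         cg,
         filter_node=lambda n: alive(cg.nodes[n], t),
         filter_edge=lambda u, v: alive(cg.edges[u, v], t),
     )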

4. Interaction and Interface

User-/Application-Facing API

  • Language-agnostic HTTP/REST API allows applications to:
    • Explore / traverse the context graph
    • Register simple "server-side" "derived metrics" functions
    • Define and register call-backs (Websockets)
    • GraphQL for complex graph queries
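
For example, the following query retrieves CPU I/O wait and memory usage for the sibling processes of a given process: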

{
 process(id: 1) {
   siblings {
     processes {
       cpu_iowait
       memory_usage
     }
   }
 }
}
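
As a sketch of the other API functions, the snippet below explores the context graph and registers a server-side derived metric over the REST API; the endpoint paths, payload fields, and host name are assumptions for illustration, not the prototype's actual interface:

 # Client-side sketch of the REST API (Python 'requests' library).
 import requests

 BASE = "http://telemetry.example:8080/api"

 # Explore the context graph: processes running on one node
 procs = requests.get(f"{BASE}/nodes/node042/processes").json()

 # Register a simple server-side derived metric: 5-minute mean I/O wait
 requests.post(f"{BASE}/metrics/derived", json={
     "name": "cpu_iowait_avg_5m",
     "source": "cpu_iowait",
     "window": "5m",
     "function": "mean",
 })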
      

5. Prototype

System Components


6. Discussion

This is how we envision an ideal system from the application developer's and user's perspective.

THANK YOU

Slides available online:
https://oweidner.github.io/ross-2017-talk