Large-scale Systems Performability

Large-scale systems integrate many networked, computer-based nodes. Performability describes the ability of such a system to deliver a certain level of performance to its users, depending on factors in its environment, such as the presence of failures, the availability of Internet bandwidth, and the amount of energy or money available for renting cloud instances. The core observation is that such systems can deliver different levels of performance, described in terms meaningful to their users, within the constraints given by the elementary resources (functioning processor cores, memory, storage, network bandwidth, energy, money, etc.). Typically, large-scale systems consist of so many elementary resources that monitoring or modeling them all is a hopeless endeavour. Instead, our work treats these resources as “black boxes”, attempting to build systems that, for their users, are more than the sum of their parts. Our most important efforts are summarized here; details can be found in our publications.

Our research in a visual nutshell:

(Image: Wordle of publication titles)

Budget-controlled scheduling for the cloud

Cloud Infrastructure-as-a-Service (IaaS) comes in many offerings that differ in promised compute speed and memory size. Users can rent IaaS instances on demand, typically paying hourly rates. They face the problem of selecting the most suitable kinds of IaaS machines (like Amazon EC2's “small” or “large” instances) for an application's needs. This requires translating application behavior into execution speed and cost efficiency on particular IaaS machine types, a step for which users need guidance. We have developed BaTS, a budget-aware scheduler for clouds that keeps the execution of large bags of tasks under control, allowing users to prefer faster execution, less money spent, or anything in between. BaTS uses tiny task samples to accurately predict overall costs and completion times. BaTS is available as part of ConPaaS.
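To make the idea of sample-based prediction concrete, here is a minimal sketch in Python. It is not the BaTS algorithm itself: it assumes a single machine type, perfectly divisible work, and billing per started hour, and the function names (`estimate_plan`, `fastest_within_budget`) and all numbers are hypothetical. It only illustrates how the runtimes of a small task sample could be extrapolated to an estimated completion time and cost, and how a budget then limits the choice of how many machines to rent.

```python
import math
from statistics import mean


def estimate_plan(sample_runtimes_s, n_tasks, n_machines, rate_cents_per_hour):
    """Extrapolate a small runtime sample to the whole bag of tasks.

    Assumption (illustrative only): tasks spread evenly over the machines,
    and every machine is billed per started hour.
    Returns (estimated makespan in hours, estimated cost in cents).
    """
    avg_runtime_s = mean(sample_runtimes_s)      # estimate from the sample
    total_work_s = avg_runtime_s * n_tasks       # predicted total work
    makespan_h = total_work_s / n_machines / 3600.0
    billed_hours = math.ceil(makespan_h)         # hourly billing rounds up
    cost_cents = billed_hours * n_machines * rate_cents_per_hour
    return makespan_h, cost_cents


def fastest_within_budget(sample_runtimes_s, n_tasks, rate_cents_per_hour,
                          budget_cents, max_machines):
    """Return (machines, makespan_h, cost_cents) of the fastest plan that
    stays within the budget, or None if no plan is affordable."""
    best = None
    for n_machines in range(1, max_machines + 1):
        makespan_h, cost = estimate_plan(
            sample_runtimes_s, n_tasks, n_machines, rate_cents_per_hour)
        if cost <= budget_cents and (best is None or makespan_h < best[1]):
            best = (n_machines, makespan_h, cost)
    return best


if __name__ == "__main__":
    # Hypothetical numbers: three sampled tasks, 1000 tasks in the bag,
    # 10 cents per machine-hour, a budget of 260 cents, at most 10 machines.
    print(fastest_within_budget([60, 120, 90], 1000, 10, 260, 10))
```

Under these assumptions, adding machines shortens the run but can raise the bill, because partially used hours are still paid for. This is the trade-off between faster execution and less money spent that the user gets to steer.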

Adaptive multicast for multi-cluster environments

Dynamic application interoperation in cluster/grid/cloud environments

(more to come…)

Publications

The results of these efforts can be found in my list of publications.