SUPERMON

LOW PERTURBATION

Perturbation is the change in a system's behaviour caused by observing it: the state a monitor reports reflects the actual system state combined with the effects of the monitoring tools themselves. For example, if a compute node is running a workload, the workload produces characteristic network, CPU, and memory access patterns. When one requests a monitoring sample from that node, the monitor consumes a small but non-zero portion of many shared resources. This can alter the cache behaviour of the application workload, cause blocking on resources held by the monitoring system, and so on. The monitoring data will therefore always contain some contribution from the monitor itself, and it is this contribution that we aim to minimize.



Why does peak sampling rate matter?

Ideally, the monitoring system should be constructed so that a single sample of data consumes a minimal amount of resources on the compute node. Consider two monitoring systems, A and B, with peak sampling rates for a given data set of 5 Hz and 50 Hz respectively. Roughly speaking, this means that some resource in the system becomes saturated at 5 Hz for monitor A and at 50 Hz for monitor B. A bit of arithmetic tells us that each sample for system A consumes roughly 20% of the saturated resource, while each sample for system B consumes around 2%.
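
The arithmetic is just the reciprocal of the peak rate. The short Python sketch below makes that explicit; the monitor names and function are illustrative only, not part of Supermon:

    # Per-sample share of the saturated resource, as in the A/B comparison above.
    def per_sample_fraction(peak_rate_hz):
        # At the peak rate the saturated resource is fully consumed, so one
        # sample accounts for 1/peak of that resource.
        return 1.0 / peak_rate_hz

    for name, peak in [("A", 5.0), ("B", 50.0)]:
        print(f"monitor {name}: peak {peak:g} Hz -> "
              f"{per_sample_fraction(peak):.0%} of the resource per sample")
    # monitor A: peak 5 Hz -> 20% of the resource per sample
    # monitor B: peak 50 Hz -> 2% of the resource per sample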

Though the illustration above is hand-waving at best, glossing over whether the same resource is saturated by each system, it shows that SOME resource is saturated at the peak rate, and that a single sample consumes some fraction of that resource inversely proportional to the peak. It is hard to argue against the conclusion that a higher peak implies lower overhead per sample: if a system with a lower peak used fewer resources per sample, it would be able to sustain a higher peak, contradicting the fact that its peak is lower.

This per-sample resource consumption represents the perturbation caused by the monitor's demands on shared resources. Since we want to minimize perturbation, a high peak sampling rate should be the goal of any monitoring system that dislikes the idea of monitoring itself.

That's all nice, but what about practical sampling rates?

Obviously, peak rates saturate some part of the system and would be a very bad idea to use in practice, since saturation nearly maximizes perturbation - the opposite of what monitors wish to achieve. The practical sampling rate should be much lower than the peak, and should be tuned to the data resolution that consumers actually find useful. Rarely will a user need real-time data, so sample intervals of a second or less are impractical from both a perturbation and a usability perspective. Frequently, monitoring systems are used to alert administrators and users to health problems and to track utilization on an hourly or daily basis. In either case, one sample per 10, 30, or 60 minutes is sufficient for the end goal of the data consumer. These are sampling rates of 1/600, 1/1800, and 1/3600 Hz - two to three orders of magnitude lower than the peak rates quoted for Supermon on systems of up to ~2000 processors!
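
A quick sanity check on those numbers in Python, with the values taken directly from the intervals above:

    # Convert "one sample per N minutes" into a sampling rate in Hz.
    for minutes in (10, 30, 60):
        seconds = minutes * 60
        print(f"one sample per {minutes} min = 1/{seconds} Hz = {1.0 / seconds:.2e} Hz")
    # one sample per 10 min = 1/600 Hz = 1.67e-03 Hz
    # one sample per 30 min = 1/1800 Hz = 5.56e-04 Hz
    # one sample per 60 min = 1/3600 Hz = 2.78e-04 Hz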

There is no magic equation or table to determine the ideal sampling rate for any one application, user, or site. Good judgement should be used in choosing one, based on the consumers' needs and the system impact the sampling might cause.

Is zero perturbation possible? And if so, is it a good idea?

Yes and no. Many hardware vendors sell out-of-band monitoring systems with their own processors and interconnects, so that the monitor shares no resources with the computational hardware. Unfortunately, this has a drawback. In any system, the probability that some hardware component fails increases with the number of components. Adding monitoring hardware in proportion to the number of processing elements and network interfaces roughly doubles the component count, which means the mean time to the first component failure in the system decreases! So, although you may be avoiding perturbation, you are making the system less reliable.
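
To make the reliability argument concrete, here is a sketch under the usual simplifying assumption of independent components with exponentially distributed lifetimes (an assumption of this example, not a claim made above): with n components each failing at rate lambda, the expected time to the first failure anywhere in the system is 1/(n*lambda), so doubling the component count halves it. The component counts and per-component MTTF below are made up for illustration:

    # Mean time to the FIRST failure among n independent components, each with
    # an exponentially distributed lifetime of mean component_mttf_hours.
    def system_mttf_hours(n_components, component_mttf_hours):
        lam = 1.0 / component_mttf_hours      # per-component failure rate
        return 1.0 / (n_components * lam)     # mean of the min of n exponentials

    base = system_mttf_hours(2000, 1_000_000)     # hypothetical base system
    doubled = system_mttf_hours(4000, 1_000_000)  # with out-of-band monitor hardware
    print(f"base: {base:.0f} h, with monitor hardware: {doubled:.0f} h")
    # base: 500 h, with monitor hardware: 250 h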


Updated 08-13-2008