In May 2002 I got involved in the LOFAR project. LOFAR, in short, is a project to implement a very large (virtual) radio telescope. It will consist of internetworked fields with large quantities of small networked antennas. LOFAR will in effect implement a virtual antenna with a diameter of 400 km that can be used to listen to signals in the range 20-200 MHz.
This LOFAR telescope is being developed as a giant data-processing machine. A total of 13,000 antennas each produce a data stream of 2 Gbit/s. This data is processed by a massive cluster computer and converted into astronomical images and other data products using distributed applications. I wrote a whitepaper that contains some thoughts on the management of the LOFAR (Linux) computer cluster from a system administrator's standpoint, which you can find below.
The LOFAR Data Processor will be designed as a heterogeneous system containing digital signal processors, programmable logic and general-purpose microcomputers with a total processing power of 40 TFlops. To allow for iterative calibration, 600 TByte of on-line storage is envisaged. LOFAR will be developed in a collaboration between the Netherlands (ASTRON) and the USA (NRL and MIT Haystack Observatory). Initial operations are planned for 2004, full capacity science operations for 2007.
$Id: mgmt.txt,v 1.6 2002/05/13 19:34:10 henk Exp $ DRAFT
This document contains some thoughts on the management of the LOFAR Linux computer cluster from a system administrator's standpoint. It outlines additional fields of attention that may need to be researched during the feasibility stage of the LOFAR project. Most of the issues concerning the actual workload have been addressed by Astron, e.g., there are fairly concrete plans for possible implementation of the programming environment, message passing interfaces, specialized hardware and control of the programming environment. This document does not (re)address these issues.
Kjeld van der Schaaf, scientific project manager research & development Astron. This document will also be published on the Internet under the FDL.
LOFAR CEP aims at providing central processing power, storage capacity and data transport bandwidth sufficient to allow for continuous operation of the LOFAR observation facility and additional signal processing of astronomical data.
LOFAR CEP's high-level design is described briefly in [1,2]. In [2] some 200 pipelines, each containing 6-10 cluster nodes, are envisioned. Each cluster node consists of a microcomputer (i.e., a COTS Intel-based PC) and may contain special hardware (SCI, RAID and/or DSP cards). Nodes are optimized for specific needs, e.g., data storage, data routing or (various types of) data processing.
All nodes need to run an operating system (e.g., Linux), a number of programs (processes) related to cluster control (e.g., SCALI) and programs (processes) related to cluster programming and message passing (ScaMPI/MPI, CORBA).
The LOFAR CEP hardware structure needs to support a broad range of applications, each with its specific hardware and software requirements. The data stream needs to be picked up and processed by the nodes most suited to fulfill these needs, and the nodes need to be configured to perform their duties adequately. Hence a system needs to be created to redefine the available pool of resources dynamically.
To be able to verify the conditions under which certain measurements were taken, it additionally should be possible to reinstall any previous LOFAR CEP configuration more or less 'on the fly'. This may require some kind of central database in which all configuration data is stored and some kind of management software that is capable of reconfiguring the nodes using the input from that central database.
This system should work with and enhance the cluster hardware management and monitoring system that comes with the cluster. It will provide monitoring information for the cluster nodes along with performance metrics for both the cluster nodes and the interconnect system. These metrics are the basis for further analysis, for example to allow for performance optimization. Based on monitoring information, system administration tasks may be initiated (automatically).
The cluster needs to be maintained on a more basic level too -- classic system administration and maintenance work. For example, systems may break down, which needs to be detected and acted upon. Bugs will occur and need to be solved and meta-data needs to be distilled and analyzed. Log files need to be truncated and analyzed. Hardware needs to be replaced when faulty or outdated.
So, from a system administrator's standpoint, the LOFAR CEP is a complex set of individual interconnected Unix hosts that requires massive, specialized (expensive) system administration attention.
The total cost of ownership of any computer system is many times higher than the initial investment needed to install it. The more complex a system is to maintain, the higher the total cost of ownership will be. The building of the LOFAR CEP and its subsequent administration and maintenance need to be outsourced to an external party, somewhere in 2003 (given successful completion of the Preliminary Design Review, March 2003). But the number of parties that can manage such complex systems is limited, which, by the well-known economic principle of supply and demand, raises the maintenance costs even further.
A project like SETI@home[15] copes with this elegantly by spreading these costs amongst hundreds of thousands of voluntary and mostly unskilled "system administrators". But LOFAR aims to be an on-line system with close to real-time data processing requirements, hence the SETI@home approach to cheap system administration is not feasible.
Studying how to eliminate the costs and complexity involved in maintaining a cluster with thousands of nodes by relatively inexperienced system maintainers is a key issue for this project's feasibility.
Hence proper study of the (costs of) maintenance aspects is required during the feasibility phase and solutions need to be found to minimize these costs, e.g., by providing a framework that simplifies and automates system administration.
The need to outsource maintenance at low cost and the inherent complexity of the LOFAR CEP cluster make it essential to design a fairly procedural and well-documented system maintenance/administration philosophy and to select (and/or write) the proper tools to enable a small group of averagely experienced<b> Unix engineers to operate it reliably and at low cost. That way it will become possible to negotiate a profitable Service Level Agreement with a number of parties.
It is of vital importance for the feasibility study that these issues be researched and resolved before the PDR.
Delivering the necessary set of procedures and supporting software to be able to outsource LOFAR CEP system management to a third party. More specifically, creating the correct conditions to enable a small group of (averagely experienced<b>) Unix system administrators to manage the LOFAR CEP supporting software and hardware and to generate reports for end-users and management. Ergo, to create a management framework on a prototype cluster that enables the cluster to work well at a low total cost of ownership. The proposed name for this set of procedures and software is HINEAR<a>: procedures and software to support "Highly Networked Arrays of Hosts".
This list may not be complete. It is based on my personal experience and on discussions with other experienced (cluster) system administrators.
Initially, software on 1000+ Linux machines needs to be installed. Every now and then bug fixes need to be applied, both on the application and operating system levels. All this while the cluster is up and running (though not necessarily at full capacity at all times). This requires a proper set of procedures and software (see: WEAR and TEAR). Nodes could fetch their software (and configuration data) from special software servers (NFS, bootp/dhcp etc.) or, alternatively, the software could be pushed by dedicated systems. (Re)installation also requires proper testing. A centralized administration needs to be in place (see: configuration management); as much administrative work as possible needs to be centralized. Standard Unix tools like rsync, scp, cfengine[3] and PIKT[4] might be employed. Also a number of distribution-specific tool-sets are available, e.g., Debian FAI, Red Hat Kickstart and SuSE AutoYaST. LUI[16], OSCAR[18] and SIS[17] aim at creating distribution-independent methods for software distribution and configuration and are well worth studying too.
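As an illustration of the pull approach, the sketch below lets a node mirror the files belonging to its configuration from a software server with rsync. The server name, rsync module and target directory are made-up examples, not part of any existing LOFAR design.

    #!/usr/bin/env python
    """Sketch: a node pulls its software/configuration from a software server
    with rsync. Host name, module and paths are hypothetical."""

    import subprocess
    import sys

    SOFTWARE_SERVER = "nodec-master.lofar.local"   # assumed server name
    CONFIG_MODULE = "nodec"                        # assumed rsync module

    def pull_configuration(nodec_id, target_dir="/opt/lofar"):
        """Mirror the files belonging to one node configuration locally."""
        source = "rsync://%s/%s/%s/" % (SOFTWARE_SERVER, CONFIG_MODULE, nodec_id)
        result = subprocess.run(
            ["rsync", "--archive", "--delete", source, target_dir],
            capture_output=True, text=True)
        if result.returncode != 0:
            sys.stderr.write("rsync failed: %s\n" % result.stderr)
            return False
        return True

    if __name__ == "__main__":
        # install the configuration named on the command line
        sys.exit(0 if pull_configuration(sys.argv[1]) else 1)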
To enable swift and carefree installation and updating of new hardware and software some kind of testing array may be necessary. Given plenty of resources we could simply build two identical clusters: WEAR (the Work Environment Array) and TEAR (the Test Environment Array). Implementing new software and/or hardware would be done on TEAR, while WEAR would be used to do the actual work. By routing the input data to both WEAR and TEAR and comparing their behavior and output some assurance of the correctness and workability of the new component(s) could be obtained. Cost considerations may force other approaches, e.g., TEAR could be (a temporarily disconnected) part of WEAR -- however, it is essential that at least a separate minimal development/testing network is available at all times. Currently, the idea is to realize WEAR and TEAR by implementing subclusters.
To (re)build a cluster and guard its consistency, a number of philosophies can be adopted. On one side of the spectrum we find systems that depend on checking and updating individual (configuration) files (e.g., using tools like PIKT/cfengine). On the other side, a somewhat blunt but very scalable approach can be used in which any configuration change results in a complete reinstall of the entire node[20].
High demands on reproducibility and configurability of the LOFAR CEP nodes strongly suggest that some kind of independent central configuration database be part of HINEAR. This database would contain the necessary information to enable (re)building (parts of) the cluster more or less on the fly.
A node configuration definition (nodec) consists of (lists of names of (sets of)) configuration files to be loaded by a node (comparable to Red Hat Kickstart). Alternatively, it could be just a pointer to a disk image. In either case, a simple and fast method needs to be available to update a node using such an instance and to verify that the configuration actually succeeded.
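To make the nodec idea more concrete, here is a minimal sketch of what a nodec could look like as a data structure, including a simple checksum-based verification step. Field names and the verification method are illustrative assumptions, not a fixed format.

    """Sketch of a nodec (node configuration definition) as a data structure."""

    from dataclasses import dataclass, field
    from typing import List, Optional
    import hashlib
    import os

    @dataclass
    class Nodec:
        nodec_id: str                  # unique nodec id (number or string)
        config_files: List[str] = field(default_factory=list)  # files to load
        disk_image: Optional[str] = None   # alternatively: pointer to an image
        checksum: Optional[str] = None     # expected checksum for verification

    def verify(nodec, installed_root="/"):
        """Verify that the configuration succeeded by comparing a checksum
        over the installed files against the expected value."""
        digest = hashlib.sha1()
        for path in sorted(nodec.config_files):
            with open(os.path.join(installed_root, path.lstrip("/")), "rb") as fh:
                digest.update(fh.read())
        return digest.hexdigest() == nodec.checksum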
The central database would contain a list of all possible nodecs, both historical and current.
The software that belongs to a nodec would be stored on (hierarchically grouped) dedicated configuration object servers (nodec servers). The software could be stored unencrypted and hence be publicly available for any node, or it may be encrypted and only accessible to nodes that obtained the proper key from the database server first.
Any nodec would be identified by a unique number or string -- its nodec id.
The central database would contain tables with information about the specific configuration of the cluster on any given date, to be specific: the combination of nodec id, node server name, and begin and end time.
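A minimal sketch of such tables, using SQLite purely for illustration (the actual database product is an open choice):

    """Sketch of the central configuration database tables (illustrative)."""

    import sqlite3

    schema = """
    CREATE TABLE nodec (
        nodec_id       TEXT PRIMARY KEY,  -- unique id of a node configuration
        description    TEXT,
        image_or_files TEXT               -- pointer to disk image or file list
    );

    CREATE TABLE node_assignment (
        node_name  TEXT NOT NULL,         -- node server name
        nodec_id   TEXT NOT NULL REFERENCES nodec(nodec_id),
        begin_time TEXT NOT NULL,         -- ISO timestamps
        end_time   TEXT                   -- NULL while the assignment is active
    );
    """

    conn = sqlite3.connect("hinear.db")
    conn.executescript(schema)
    conn.commit()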
Any change to the cluster would be initiated by the cluster maintainer. When the new configuration uses existing nodecs, the cluster administrator just needs to enter the new configuration data into the central database. After a short period the affected nodes will detect the change of their nodec id and will (voluntarily) start updating themselves. When a node completes its update it will reboot and inform/update the central database. When all nodes have been updated, the cluster manager is informed and the (sub)cluster can be used for its new purpose.
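The node-side behaviour could then be a small daemon that polls the central database, compares the assigned nodec id with the one currently installed and triggers an update when they differ. The sketch below assumes the illustrative tables from the previous example; the actual install/reboot step is only a placeholder.

    """Sketch of the node-side update loop (illustrative schema and hook)."""

    import sqlite3
    import socket
    import time

    POLL_INTERVAL = 60  # seconds between checks of the central database

    def assigned_nodec(conn, node_name):
        row = conn.execute(
            "SELECT nodec_id FROM node_assignment "
            "WHERE node_name = ? AND end_time IS NULL", (node_name,)).fetchone()
        return row[0] if row else None

    def apply_nodec(nodec_id):
        """Placeholder for the actual reinstall/update and reboot step."""
        print("would now install nodec", nodec_id, "and reboot")

    def main():
        node_name = socket.gethostname()
        current = None
        conn = sqlite3.connect("hinear.db")
        while True:
            wanted = assigned_nodec(conn, node_name)
            if wanted and wanted != current:
                apply_nodec(wanted)
                current = wanted
                # after a successful update the node would report back here
            time.sleep(POLL_INTERVAL)

    if __name__ == "__main__":
        main()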
To create a new node configuration the cluster manager simply designates a node to be "inactive". Any inactive node can be used to create a new nodec. Once the new configuration is found to be working properly, its configuration data (or the image) is stored on the master nodec server and a new nodec id is created and attached to it.
Going back to any given configuration from the past -- given that the actual hardware did not change -- is simply a matter of retrieving the configuration records that were valid that day and re-entering new records with matching node/nodec pairs, but with the current date, into the database.
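Continuing with the same illustrative tables, restoring a historical configuration could look roughly like this: select the node/nodec pairs that were valid at the requested moment, close the currently active assignments and re-enter the old pairs with the current date.

    """Sketch of restoring an earlier cluster configuration (illustrative)."""

    import sqlite3

    def restore_configuration(conn, as_of):
        """as_of: ISO timestamp of the configuration to restore."""
        rows = conn.execute(
            "SELECT node_name, nodec_id FROM node_assignment "
            "WHERE begin_time <= ? AND (end_time IS NULL OR end_time > ?)",
            (as_of, as_of)).fetchall()
        # close the currently active assignments ...
        conn.execute("UPDATE node_assignment SET end_time = datetime('now') "
                     "WHERE end_time IS NULL")
        # ... and re-enter the historical node/nodec pairs with the current date
        for node_name, nodec_id in rows:
            conn.execute(
                "INSERT INTO node_assignment (node_name, nodec_id, begin_time) "
                "VALUES (?, ?, datetime('now'))", (node_name, nodec_id))
        conn.commit()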
The cluster maintainers do not need to configure the cluster itself, they just need to configure the central database for example by means of a web based user interface.
The cluster should contain a number of nodes that are exclusively used for cluster management. These nodes will also be used as log hosts and performance measuring stations.
To be able to correlate events on the nodes, all nodes need to have the same system time. Hence, all clocks need to be synchronized. NTP[7] based solutions are available to tackle this problem. Possibly local GPS or radio-controlled NTP time servers need to be in place; they might be based on high-quality reference clocks (e.g., rubidium or cesium).
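A simple way to watch for clock drift is to let each node periodically compare itself against the reference time server. The sketch below shells out to 'ntpdate -q' and flags offsets above a threshold; the server name and the threshold value are illustrative.

    """Sketch of a clock-skew check against a reference NTP server."""

    import re
    import subprocess

    NTP_SERVER = "ntp.lofar.local"   # assumed local (GPS-disciplined) server
    MAX_OFFSET = 0.010               # seconds; illustrative tolerance

    def clock_offset(ntp_server):
        """Return the measured offset in seconds, or None on failure."""
        out = subprocess.run(["ntpdate", "-q", ntp_server],
                             capture_output=True, text=True).stdout
        match = re.search(r"offset ([-+]?\d+\.\d+)", out)
        return float(match.group(1)) if match else None

    if __name__ == "__main__":
        offset = clock_offset(NTP_SERVER)
        if offset is None or abs(offset) > MAX_OFFSET:
            print("warning: clock offset", offset, "exceeds", MAX_OFFSET)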
1000+ intensely used Linux hosts will generate massive log file data. To prevent loss of valuable log data and clobbering of local disk space the log data should be sent to a (number of) dedicated log host(s)[8]. Log file analysis would be done on the log host(s): individual analysis (scanning for patterns known to indicate problems) and group analysis (detection of correlations between patterns in log messages from multiple nodes), combined with trend analysis. Nodes need to log their data to at least 2 log hosts. Log hosts are also responsible for archiving (compressed) log file data to more permanent media (e.g., CD/DVD). Also, a system to access (historical) log file data needs to be in place (e.g., web based), probably based on a multi-level scheme where newer data is kept on faster storage and older data may only be available on slower storage.
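A very small example of the 'individual analysis' step: scan a (centralized) syslog file for patterns known to indicate trouble and count hits per node. The file location and the patterns are only illustrative.

    """Sketch of pattern-based log analysis on a log host (illustrative)."""

    import re
    from collections import Counter

    PROBLEM_PATTERNS = [re.compile(p) for p in (
        r"I/O error",            # failing disks
        r"Out of [Mm]emory",     # memory pressure
        r"link (down|is down)",  # interconnect trouble
    )]

    def scan(logfile="/var/log/cluster/syslog"):
        hits = Counter()
        with open(logfile, errors="replace") as fh:
            for line in fh:
                # classic syslog: "Mmm dd hh:mm:ss hostname daemon[pid]: message"
                fields = line.split(None, 4)
                if len(fields) < 5:
                    continue
                host, message = fields[3], fields[4]
                if any(p.search(message) for p in PROBLEM_PATTERNS):
                    hits[host] += 1
        return hits

    if __name__ == "__main__":
        for host, count in scan().most_common():
            print(host, count)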
Additionally, environmental data should be gathered, stored and analyzed, and trend analysis should be performed on it. Things that could be monitored are: processor temperature, environmental temperature (closet, system, room), air humidity and pressure, changes in sound and light patterns, airflow, opening and closing of (closet) doors etc. With the cluster hardware a cluster HW management/monitoring system will be delivered. Such a system will provide monitoring information for the cluster nodes (temperature etc.) along with performance metrics for both the cluster nodes and the interconnect system. Additionally, a number of statistics may be obtained using the LMSENSORS[19] kernel extension.
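A sketch of how per-node temperature readings could be collected from the lm_sensors 'sensors' command; since the output format depends on chip and driver, the parsing below is only indicative.

    """Sketch of gathering node temperatures via the 'sensors' command."""

    import re
    import subprocess

    TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<celsius>-?\d+(\.\d+)?)")

    def read_temperatures():
        """Return {label: degrees Celsius} for lines that look like readings."""
        out = subprocess.run(["sensors"], capture_output=True, text=True).stdout
        readings = {}
        for line in out.splitlines():
            if "C" not in line:          # crude filter for temperature lines
                continue
            match = TEMP_LINE.match(line)
            if match:
                readings[match.group("label").strip()] = float(match.group("celsius"))
        return readings

    if __name__ == "__main__":
        for label, celsius in read_temperatures().items():
            print(label, celsius)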
Another environmental parameter is electrical power: fluctuations should be monitored, as should brown-outs and black-outs. The cluster should preferably be hooked up to an uninterruptible power supply, with proper software available to shut down the cluster; however, this is currently considered too expensive. This generates extra requirements on software installation, configuration management and testing protocols.
Software systems like cfengine[3] and PIKT[4] are widely used to ease configuration and system administration problems for highly networked arrays of computers. For example, the Beowulf clusters[5] built at SARA employ(ed) cfengine, but PIKT is now under consideration. These software systems clear the way for real computer immune systems: computer systems that guard themselves against interruptions and reconfigure themselves as needed. There are many more[6] software tools available that may help to build HINEAR and perform tasks like:
To ensure that the array is operational as much of the time as possible, some redundancy in hardware and software probably needs to be introduced, along with software to detect and route around failures.
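One elementary building block for failure detection is a liveness check from a management node, for example by probing a well-known TCP port on every node. The node names below are illustrative, and the failover action is only hinted at in a comment.

    """Sketch of a simple liveness check from a management node."""

    import socket

    NODES = ["node%03d" % n for n in range(1, 11)]   # illustrative node names
    PORT, TIMEOUT = 22, 2.0                          # probe sshd, 2 s timeout

    def is_alive(host):
        try:
            with socket.create_connection((host, PORT), timeout=TIMEOUT):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for node in NODES:
            if not is_alive(node):
                print("node unreachable:", node)
                # here a failover action could be triggered, e.g. reassigning
                # the node's nodec to a spare node in the central database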
Amongst the topics to be studied are:
A cluster is often used to leverage volume production for substantial cost savings -- hence, we should monitor the total cost of ownership and report on it regularly.
Though LOFAR CEP probably will not be connected to any outside networks, it still is necessary to employ at least basic security measures. An analysis of the risks needs to be made, countermeasures need to be determined and installed, and the remaining risks need to be charted and formally accepted[9].
Some topics:
Amongst the topics to study are:
The LOFAR CEP needs to provide sufficient performance. Therefore, 'sufficient' needs to be defined in terms of necessary clock cycles, network bandwidth, acceptable costs etc., and methods to measure this need to be installed.
Most of the performance analysis can be done using log file data (see: central logging and log file analysis). Probably some automated analysis of the data needs to be done and a human-readable extraction (report) needs to be made. Tools like lire[8] may prove sufficient in providing extractions from log file data and can be adapted to fit more specific needs. Graphical representations of data can be generated using, e.g., rrd[11]. Highly networked arrays need analysis of their networks; e.g., a tool like MRTG[12] helps to monitor traffic load on network links. The SCALI software may also generate valuable (densified) data, hence the interrelations between SCALI tools and other tools need to be studied too. Also, mechanisms to enable automatic trend analysis must be considered.
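As an example of collecting a performance metric for later graphing, the sketch below stores the 1-minute load average in a round robin database with rrdtool[11]; the file name and archive layout are arbitrary choices.

    """Sketch of feeding the 1-minute load average into an RRD (illustrative)."""

    import os
    import subprocess

    RRD = "load.rrd"

    def ensure_rrd():
        if os.path.exists(RRD):
            return
        subprocess.run(["rrdtool", "create", RRD, "--step", "300",
                        "DS:load1:GAUGE:600:0:U",     # one gauge data source
                        "RRA:AVERAGE:0.5:1:288"],     # one day of 5-minute averages
                       check=True)

    def record_load():
        load1 = open("/proc/loadavg").read().split()[0]
        subprocess.run(["rrdtool", "update", RRD, "N:" + load1], check=True)

    if __name__ == "__main__":
        ensure_rrd()
        record_load()   # typically run every 5 minutes from cron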
Benchmarking[13] (new) hardware and software is a related field of study, e.g., to study performance of new disks, DSP cards etc. For proper benchmarking, a test environment needs to be available (see: WEAR and TEAR).
--
<a> HINEAR: High Near -- complements Low Far (LOFAR)
<b> System administrators on LPIC-2 level with at least 3 years of system administration experience in a networked environment, comparable with SAGE intermediate/advanced[10].
--
[1] LOFAR Central Processor pre-design - rev. 1.0 - K. v.d. Schaaf, 7.7.2001
[2] Hybrid Cluster Computing Hardware and Software in the LOFAR Radio Telescope - rev. 1.0 - K. v.d. Schaaf, J.D. Bregman, C.M. de Vos - 9.1.2002
[3] http://www.cfengine.org
[4] http://pikt.org
[5] http://www.sara.nl/beowulf/
[6] http://linas.org/linux/NMS.html
[7] http://www.ntp.org
[8] http://logreport.org
[9] Site Security Handbook (RFC1244): http://www.faqs.org/rfcs/rfc1244.html
[10] http://www.usenix.org/sage/jobs/jobs-descriptions.html
[11] http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/
[12] http://people.ee.ethz.ch/~oetiker/webtools/mrtg
[13] http://www.spec.org
[14] Using SANs and NAS - W. Curtis Preston - O'Reilly - ISBN: 0596001533
[15] http://setiathome.ssl.berkeley.edu/
[16] http://oss.software.ibm.com/developerworks/projects/lui/
[17] http://sisuite.org
[18] http://www.csm.ornl.gov/oscar/home.html
[19] http://www2.lm-sensors.nu/~lm78/
[20] http://rocks.npaci.edu/papers/rocks-documentation/f18.html