About (maintaining) clusters

In May 2002 I got involved in the LOFAR project. LOFAR, in short, is a project to implement a very large (virtual) radio telescope. It will consist of internetworked fields containing large quantities of small networked antennas. LOFAR will in effect implement a virtual antenna with a diameter of 400 km that can be used to listen to signals in the range 20-200 MHz.

This LOFAR telescope is being developed as a giant data-processing machine. A total of 13,000 antennas each produce a datastream of 2 Gbit/s. This data is processed by a massive cluster computer and converted to astronomical images and other data products using distributed applications. I wrote a whitepaper that contains some thoughts on the management of the LOFAR (Linux) computer cluster from a system administrator's standpoint, which you can find below.

The LOFAR Data Processor will be designed as a heterogeneous system containing digital signal processors, programmable logic and general-purpose microcomputers with a total processing power of 40 TFlops. To allow for iterative calibration, over 600 TByte of on-line storage is envisaged. LOFAR will be developed in a collaboration between the Netherlands (ASTRON) and the USA (NRL and MIT Haystack Observatory). Initial operations are planned for 2004, full-capacity science operations for 2007.


HINEAR<a>

Author: H.W. Kloepping, Snow B.V.
Peer review: B. Knowles, B. Mesman (Snow B.V.), K. van der Schaaf (Astron Foundation).

$Id: mgmt.txt,v 1.6 2002/05/13 19:34:10 henk Exp $ DRAFT

This document contains some thoughts on the management of the LOFAR Linux computer cluster from a system administrator's standpoint. It outlines additional fields of attention that may need to be researched during the feasibility stage of the LOFAR project. Most of the issues concerning the actual workload have been addressed by Astron: there are, for example, fairly concrete plans for a possible implementation of the programming environment, message-passing interfaces, specialized hardware and the control of that environment. This document does not (re)address these issues.

Audience

Kjeld van der Schaaf, scientific project manager research & development, Astron. This document will also be published on the Internet under the FDL.

Introduction

LOFAR CEP aims at providing central processing power, storage capacity and data transport bandwidth sufficient to allow for continuous operation of the LOFAR observation facility and additional signal processing of astronomical data.

LOFAR CEP's high-level design is described briefly in [1,2]. In [2] some 200 pipelines, each containing 6-10 cluster nodes, are envisioned. Each cluster node consists of a microcomputer (i.e., a COTS Intel-based PC) and may contain special hardware (SCI, RAID and/or DSP cards). Nodes are optimized for specific needs, e.g., data storage, data routing or (various types of) data processing.

All nodes need to run an operating system (e.g., Linux), a number of programs (processes) related to cluster control (e.g., SCALI) and programs (processes) related to cluster programming and message passing (ScaMPI/MPI, CORBA).

A system administrator's view on LOFAR CEP

The LOFAR CEP hardware structure needs to support a broad range of applications, each with its specific hardware and software requirements. The data stream needs to be picked up and processed by the nodes most suited to fulfill these needs, and the nodes need to be configured to perform their duties adequately. Hence a system needs to be created to redefine the available pool of resources dynamically.

To be able to verify the conditions under which certain measurements were taken, it should additionally be possible to reinstall any previous LOFAR CEP configuration more or less 'on the fly'. This may require some kind of central database in which all configuration data is stored, and some kind of management software that is capable of reconfiguring the nodes using the input from that central database.

This system should work with and enhance the cluster hardware management and monitoring system that comes with the cluster. It will provide monitoring information for the cluster nodes along with performance metrics for both the cluster nodes and the interconnect system. These metrics are the basis for further analysis, for example to allow for performance optimization. Based on monitoring information, system administration tasks may be initiated (automatically).
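As an illustration of what such automatically initiated administration tasks could look like, the Python sketch below acts on a single made-up metric: when disk usage on a node crosses a threshold it forces a log rotation. The threshold, the path and the logrotate invocation are assumptions for illustration only; a real HINEAR implementation would hook into the cluster's own monitoring system (e.g., SCALI) rather than probe the filesystem directly.

    # Hedged sketch: turning one monitoring metric into an automated admin task.
    import shutil
    import subprocess

    DISK_USAGE_LIMIT = 0.90   # assumed threshold: act at 90% usage

    def disk_usage_fraction(path="/var/log"):
        """Return the fraction of the filesystem holding `path` that is in use."""
        usage = shutil.disk_usage(path)
        return usage.used / usage.total

    def rotate_logs():
        """Example automated action: force a log rotation on this node."""
        subprocess.run(["logrotate", "--force", "/etc/logrotate.conf"], check=False)

    if __name__ == "__main__":
        if disk_usage_fraction() > DISK_USAGE_LIMIT:
            rotate_logs()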

The cluster needs to be maintained on a more basic level too -- classic system administration and maintenance work. For example, systems may break down, which needs to be detected and acted upon. Bugs will occur and need to be solved, and meta-data needs to be distilled and analyzed. Log files need to be truncated and analyzed. Hardware needs to be replaced when faulty or outdated.

So, from a system administrator's standpoint, the LOFAR CEP is a complex set of individual, interconnected Unix hosts that requires massive, specialized (and expensive) system administration attention.

The total cost of ownership of any computer system is many times higher than the initial investment needed to install it. The more complex a system is to maintain, the higher the total cost of ownership will be. The building of the LOFAR CEP and its subsequent administration and maintenance need to be outsourced to an external party somewhere in 2003 (given successful completion of the Preliminary Design Review, March 2003). But the number of parties that can manage such complex systems is limited, which, by the well-known economic principle of supply and demand, raises the maintenance costs even further.

A project like SETI@home[15] copes with this elegantly by spreading these costs amongst hundreds of thousands of volunteer and mostly unskilled "system administrators". But LOFAR aims to be an on-line system with close to real-time data-processing requirements, so the SETI@home approach to cheap system administration is not feasible.

Studying how to eliminate the costs and complexity involved in having a cluster with thousands of nodes maintained by relatively inexperienced system maintainers is a key issue for this project's feasibility.

Hence a proper study of the (costs of the) maintenance aspects is required during the feasibility phase, and solutions need to be found to minimize these costs, e.g., by providing a framework that simplifies and automates system administration.

Goals of the extra research

Delivering the necessary set of procedures and supporting software to be able to outsource LOFAR CEP system management to a third party. More specifically, creating the correct conditions to enable a small group of (averagely experienced<b>) Unix system administrators to manage the LOFAR CEP supporting software and hardware and to generate reports for end-users and management. Ergo, to create a management framework on a prototype cluster that enables the cluster to work well at a low total cost of ownership. The proposed name for this set of procedures and software is HINEAR: procedures and software to support "Highly Networked Arrays of Hosts".

Issues to be researched

This list may not be complete. It is based on my personal experience and on discussions with other experienced (cluster) system administrators.

CONFIGURATION AND SOFTWARE DISTRIBUTION

An outline of a central configuration management database

The high demands on reproducibility and configurability of the LOFAR CEP nodes strongly suggest that some kind of independent central configuration database be part of HINEAR. This database would contain the necessary information to enable (re)building (parts of) the cluster more or less on the fly.

A node configuration definition (nodec) consists of (lists of names of (sets of)) configuration files to be loaded by a node (comparable to Red Hat kickstart). Alternatively, it could simply be a pointer to a disk image. In either case, a simple and fast method needs to be available to update a node using such an instance and to verify that the configuration actually succeeded.

The central database would contain a list of all possible nodecs, both historical and current.

The software that belongs to a nodec would be stored on (hierarchically grouped) dedicated configuration object servers (nodec servers). The software could be stored unencrypted and hence be publicly available to any node, or it may be encrypted and only accessible to nodes that have first obtained the proper key from the database server.

Any nodec would be identified by a unique number or string -- its nodec id.

The central database would contain tables with information about the specific configuration of the cluster on any given date, to be specific: the combination of nodec id, node server name and the begin and end time.
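As a minimal sketch of what such tables could look like, the fragment below models a nodec table and a node-assignment table with begin and end times. The table and column names are illustrative assumptions, not part of any agreed LOFAR/HINEAR design, and sqlite3 merely stands in for whatever database engine would actually be chosen.

    # Hedged sketch of a possible central configuration database layout.
    import sqlite3

    SCHEMA = """
    CREATE TABLE nodec (
        nodec_id    TEXT PRIMARY KEY,   -- unique id of a node configuration
        description TEXT,               -- human-readable summary
        location    TEXT                -- nodec server (or image) holding the files
    );

    CREATE TABLE node_assignment (
        node_name   TEXT NOT NULL,      -- server name of the cluster node
        nodec_id    TEXT NOT NULL REFERENCES nodec(nodec_id),
        begin_time  TEXT NOT NULL,      -- start of validity (ISO 8601)
        end_time    TEXT                -- NULL while the assignment is current
    );
    """

    def create_database(path="hinear.db"):
        """Create the (illustrative) central configuration database."""
        con = sqlite3.connect(path)
        con.executescript(SCHEMA)
        con.commit()
        return con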

Any change to the cluster would be initiated by the cluster maintainer. When the new configuration uses existing nodecs, the cluster administrator just needs to enter the new configuration data into the central database. After a short period the affected nodes will detect that their nodec id has changed and will (voluntarily) start updating themselves. When a node completes its update, it will reboot and inform/update the central database. When all nodes have been updated, the cluster manager is informed and the (sub)cluster can be used for its new purpose.
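The update cycle described above could be driven by a small agent running on every node, sketched below. The helper functions (fetch_assigned_nodec, apply_nodec, report_completion) are hypothetical placeholders; how a node talks to the central database and to its nodec server is exactly what needs to be designed during the feasibility phase.

    # Hedged sketch of a per-node update agent; the protocol helpers are placeholders.
    import os
    import time

    POLL_INTERVAL = 60                              # seconds between polls of the central database
    STATE_FILE = "/var/lib/hinear/current_nodec"    # assumed location of the local state record

    def fetch_assigned_nodec(node_name):
        """Placeholder: ask the central database which nodec id this node should run."""
        raise NotImplementedError

    def apply_nodec(nodec_id):
        """Placeholder: fetch the configuration (or image) from a nodec server and install it."""
        raise NotImplementedError

    def report_completion(node_name, nodec_id):
        """Placeholder: record in the central database that the update has finished."""
        raise NotImplementedError

    def current_nodec_id():
        """Return the nodec id this node last applied, or None."""
        try:
            with open(STATE_FILE) as fh:
                return fh.read().strip()
        except FileNotFoundError:
            return None

    def agent_loop(node_name):
        while True:
            assigned = fetch_assigned_nodec(node_name)
            if assigned is not None and assigned != current_nodec_id():
                apply_nodec(assigned)
                with open(STATE_FILE, "w") as fh:
                    fh.write(assigned)
                report_completion(node_name, assigned)
                os.system("shutdown -r now")        # reboot into the new configuration
                return
            time.sleep(POLL_INTERVAL)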

To create a new node configuration, the cluster manager simply designates a node to be 'inactive'. Any inactive node can be used to create a new nodec. If the new configuration is found to work properly, its configuration data (or the image) will be stored on the master nodec server and a new nodec id will be created and attached to it.

Going back to any given configuration from the past -- given that the actual hardware did not change -- is simply a matter of retrieving the configuration records that were valid that day and re-entering new records into the database with matching node/nodec pairs but the current date.
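Using the illustrative schema sketched earlier, re-applying a historical configuration could then look as follows; the queries are an assumption, not a specified interface.

    # Hedged sketch: roll the cluster back to the node/nodec pairs valid at `as_of`.
    import sqlite3
    from datetime import datetime, timezone

    def reapply_configuration(con, as_of):
        """Re-enter, with the current date, the node/nodec pairs valid at `as_of`."""
        now = datetime.now(timezone.utc).isoformat()
        rows = con.execute(
            "SELECT node_name, nodec_id FROM node_assignment "
            "WHERE begin_time <= ? AND (end_time IS NULL OR end_time > ?)",
            (as_of, as_of),
        ).fetchall()
        # Close the currently open assignments...
        con.execute("UPDATE node_assignment SET end_time = ? WHERE end_time IS NULL", (now,))
        # ...and record the historical pairs again with the current begin time.
        con.executemany(
            "INSERT INTO node_assignment (node_name, nodec_id, begin_time) VALUES (?, ?, ?)",
            [(node, nodec, now) for node, nodec in rows],
        )
        con.commit()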

The cluster maintainers do not need to configure the cluster itself; they just need to configure the central database, for example by means of a web-based user interface.

PROACTIVE MANAGEMENT

COST MONITORING

A cluster is often used to leverage volume production for substantial cost savings; hence, the total cost of ownership should be monitored and reported on regularly.

SECURITY ISSUES

Though LOFAR CEP will probably not be connected to any outside networks, it is still necessary to employ at least basic security measures. An analysis of the risks needs to be made, countermeasures need to be determined and installed, and the remaining risks need to be charted and formally accepted[9].

Some topics:

HARDWARE MANAGEMENT

Amongst the topics to study are:

PERFORMANCE TUNING/ANALYSIS

The LOFAR CEP needs to provide sufficient performance. Therefore, 'sufficient' needs to be defined in terms of necessary clock cycles, network bandwidth, acceptable costs etc., and methods to measure this need to be put in place.

Most of the performance analysis can be done using log file data (see: central logging and file analysis). Probably some automated analysis of the data needs to be done and a human-readable extraction (report) needs to be made. Tools like lire[8] may prove to be sufficient in providing extractions from log file data and can be adapted to fit more specific needs. Graphical representations of data can be generated using, e.g., rrd[11]. Highly networked arrays need analysis of their networks, e.g., using a tool like MRTG[12] to help monitor the traffic load on network links. The SCALI software may also generate valuable (densified) data, hence the interrelations between the SCALI tools and other tools need to be studied too. Also, mechanisms to enable automatic trend analysis must be considered.
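As a sketch of such automatic trend analysis, the fragment below fits a least-squares line through a metric series extracted from log files. The whitespace-separated "timestamp value" input format and the file name are made up for illustration; real input would come from lire/rrd-style extractions.

    # Hedged sketch: estimate a linear trend in a log-derived metric series.
    def read_series(path):
        """Read (timestamp, value) pairs from a whitespace-separated file."""
        series = []
        with open(path) as fh:
            for line in fh:
                t, v = line.split()
                series.append((float(t), float(v)))
        return series

    def linear_trend(series):
        """Return (slope, intercept) of an ordinary least-squares fit."""
        n = len(series)
        sum_t = sum(t for t, _ in series)
        sum_v = sum(v for _, v in series)
        sum_tt = sum(t * t for t, _ in series)
        sum_tv = sum(t * v for t, v in series)
        slope = (n * sum_tv - sum_t * sum_v) / (n * sum_tt - sum_t ** 2)
        intercept = (sum_v - slope * sum_t) / n
        return slope, intercept

    if __name__ == "__main__":
        slope, _ = linear_trend(read_series("cpu_load.dat"))
        print("load is", "rising" if slope > 0 else "stable or falling")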

Benchmarking[13] (new) hardware and software is a related field of study, e.g., studying the performance of new disks, DSP cards etc. For proper benchmarking, a test environment needs to be available (see: WEAR and TEAR).
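As a trivial illustration of the benchmarking theme, the sketch below times a bulk sequential write to estimate raw disk throughput. The file name and sizes are arbitrary assumptions; real benchmarking of LOFAR CEP hardware would rely on standardized suites such as those from SPEC[13].

    # Hedged sketch: measure sequential write throughput of the local disk.
    import os
    import time

    def write_throughput(path="bench.tmp", size_mb=256, block_kb=1024):
        """Write `size_mb` MB in `block_kb` KB blocks and return MB/s."""
        block = b"\0" * (block_kb * 1024)
        start = time.time()
        with open(path, "wb") as fh:
            for _ in range(size_mb * 1024 // block_kb):
                fh.write(block)
            fh.flush()
            os.fsync(fh.fileno())       # make sure the data actually hit the disk
        elapsed = time.time() - start
        os.remove(path)
        return size_mb / elapsed

    if __name__ == "__main__":
        print("sequential write: %.1f MB/s" % write_throughput())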

--

Notes:

<a> HINEAR: High Near -- complements Low Far (LOFAR)

<b> System administrators on LPIC-2 level with at least 3 years of 
    system administration experience in a networked environment, 
    comparable with SAGE intermediate/advanced[10].  
--

References:

[1] LOFAR Central Processor pre-design - rev. 1.0 - K. v.d. Schaaf,
    7.7.2001

[2] Hybrid Cluster Computing Hardware and Software in the
    LOFAR Radio Telescope - rev. 1.0 - K. v.d. Schaaf, J.D. 
    Bregman, C.M.  de Vos - 9.1.2002

[3] http://www.cfengine.org

[4] http://pikt.org

[5] http://www.sara.nl/beowulf/

[6] http://linas.org/linux/NMS.html

[7] http://www.ntp.org

[8] http://logreport.org

[9] Site Security Handbook (RFC1244): http://www.faqs.org/rfcs/rfc1244.html

[10] http://www.usenix.org/sage/jobs/jobs-descriptions.html

[11] http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/

[12] http://people.ee.ethz.ch/~oetiker/webtools/mrtg

[13] http://www.spec.org

[14] Using SANs and NAS - W. Curtis Preston - O'Reilly - ISBN: 0596001533.

[15] http://setiathome.ssl.berkeley.edu/

[16] http://oss.software.ibm.com/developerworks/projects/lui/

[17] http://sisuite.org

[18] http://www.csm.ornl.gov/oscar/home.html

[19] http://www2.lm-sensors.nu/~lm78/

[20] http://rocks.npaci.edu/papers/rocks-documentation/f18.html