A global operating system for HPC clusters

CESATI, MARCO;
2009-01-01

Abstract

Modern supercomputers consist of clusters of thousands of independent nodes interconnected through fast networks. These nodes run independent operating system kernels, so synchronization among them is left to user-mode programs, which makes temporal synchronization of the nodes a daunting task. On the other hand, HPC cluster applications often require rather strict temporal synchronization for activities such as performance analysis, application debugging, or data checkpointing. The performance of an HPC parallel application may therefore be severely impaired by the lack of temporal synchronization among the activities of the nodes of the cluster, which severely limits the scalability of such architectures. In this paper we introduce CAOS, an extension of the Linux kernel that aims to address the temporal synchronization problems of modern HPC clusters. We describe the general ideas behind CAOS and discuss some details of a possible implementation. We also present experiments performed on a prototype implementation of CAOS that includes a centralized network time tick, which allows a master node to synchronize the activities of all other nodes in the cluster, and a task scheduler tailored for HPC applications. These experiments, performed on a modern HPC cluster, show that this new component has no measurable impact on the efficiency of the nodes while reducing OS noise and making performance more predictable. An implementation of CAOS based on this component can achieve a significant gain in synchronization, global control, and scalability of the cluster.
IEEE International conference on cluster computing
Louisiana
2009
International relevance
contribution
2 Sep 2009
2009
Sector ING-INF/05 - INFORMATION PROCESSING SYSTEMS
English
cluster computing; operating system noise; synchronization
Conference contribution
Betti, E., Cesati, M., Gioiosa, R., Piermaria, F. (2009). A global operating system for HPC clusters. In Proceedings of the 2009 IEEE International Conference on Cluster Computing (pp. 1-10). Institute of Electrical and Electronics Engineers, Inc. [10.1109/CLUSTR.2009.5289191].
Betti, E; Cesati, M; Gioiosa, R; Piermaria, F
Files in this item:
978-1-4244-5012-1.pdf

Open access

Type: Publisher's version (PDF)
License: Creative Commons
Size: 808.78 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/32891