Architecture Verification – Case Studies
1. Simulation of Next-Generation Supercomputing Systems at IBM Research
Dr. Denzel and his colleagues at the IBM Zurich and Austin Research Laboratories have built an OMNEST-based simulation framework capable of simulating next-generation HPC (High-Performance Computing) systems with hundreds of thousands of interconnected processors. Simulating such systems is indispensable for evaluating system design options and for optimizing the performance of the processors, the interconnection network, and eventually the entire system, including software and HPC applications.
"Large-scale end-to-end simulation of HPC systems running benchmark applications on hundreds of thousands of processors communicating across a large interconnection network is a challenge at the same level of detail as is required for system design and development," they write. The system, codenamed MARS, was built on an earlier, also OMNEST-based simulation framework that they originally developed for switch and network simulations in telecom applications. This system allowed the simulation of multistage fat-tree or mesh-type packet-switching networks driven by statistical traffic.
They extended this tool to support end-to-end coverage by replacing the existing statistical packet generators with a new abstract computing node model that is driven by real-world application traces. Because the Message Passing Interface (MPI) standard is pervasive in HPC applications, they used MPI traces to drive the model. These traces were collected from applications in various fields: ocean modeling (HYCOM, POP), weather research and forecasting (WRF), shock-wave physics (CTH), and molecular dynamics, fusion and transport physics (AMBER, CPMD, LAMMPS, GYRO, SPPM, SWEEP3D, UMT2K). For another set of applications in the field of weather forecasting (DWD, ECMWF T639, ECMWF 4DVar), they used synthetic traces generated by application experts.
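The idea of a trace-driven node model can be illustrated with a minimal Python sketch. It replays one rank's trace of compute and send events and reports when the rank would finish. The record format, the contention-free latency-plus-bandwidth cost of a send, and all names here are assumptions made for illustration; none of them is taken from MARS or from an actual MPI tracing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    kind: str              # "compute" or "send" (hypothetical record types)
    seconds: float = 0.0   # computation time, used by "compute" events
    size_bytes: int = 0    # message size, used by "send" events


def replay_rank(trace, bandwidth_bytes_per_s, latency_s):
    """Return the finish time of one rank replaying its trace.

    Assumption: a send blocks the rank for latency + size / bandwidth,
    with no contention. A real network model would replace this term.
    """
    now = 0.0
    for event in trace:
        if event.kind == "compute":
            now += event.seconds
        elif event.kind == "send":
            now += latency_s + event.size_bytes / bandwidth_bytes_per_s
        else:
            raise ValueError(f"unknown trace event kind: {event.kind!r}")
    return now


# Made-up trace: 1 s of computation followed by one 1000-byte message.
example_trace = [
    TraceEvent("compute", seconds=1.0),
    TraceEvent("send", size_bytes=1000),
]
finish_time = replay_rank(example_trace, bandwidth_bytes_per_s=1000.0, latency_s=0.5)
print(finish_time)
```

In a full simulator the fixed send cost would be replaced by the delay the network model actually reports for each message, which is what couples the application trace to the interconnect under study.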
The framework owes part of its power to OMNEST facilities such as parameterization. For example, the switch module can be configured to model the most popular switch architectures purely through parameters. Switch parameters include the size of the switch, the number and arrangement of logical queues, buffer sizes, scheduling options, the number of virtual circuits and priority classes, port speeds, and the internal speedup and delays.
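As a rough illustration of how a single parameterized switch model can cover several architectures, the sketch below groups a few of the parameters listed above into one configuration object. The class, its field names, and the default values are hypothetical and do not correspond to the actual MARS or OMNEST parameter names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchConfig:
    """Hypothetical parameter set for a generic packet-switch model."""
    num_ports: int                 # switch size (N x N)
    voq: bool                      # True: one queue per output at every input
    buffer_cells_per_queue: int    # buffer size of each logical queue
    num_priorities: int = 1        # number of priority classes
    port_speed_gbps: float = 10.0  # line rate per port (illustrative default)
    internal_speedup: float = 1.0  # fabric rate relative to line rate

    def __post_init__(self):
        if min(self.num_ports, self.buffer_cells_per_queue, self.num_priorities) < 1:
            raise ValueError("ports, buffer size and priorities must be positive")
        if self.internal_speedup < 1.0:
            # A speedup below 1 is treated as invalid in this sketch.
            raise ValueError("internal speedup must be at least 1")

    def total_queues(self) -> int:
        """Number of logical queues implied by the queue arrangement."""
        queues_per_port = self.num_ports if self.voq else 1
        return self.num_ports * queues_per_port * self.num_priorities

    def total_buffer_cells(self) -> int:
        """Total buffer capacity across all logical queues."""
        return self.total_queues() * self.buffer_cells_per_queue


# Two architectures obtained from the same model by changing one parameter.
voq_switch = SwitchConfig(num_ports=8, voq=True, buffer_cells_per_queue=16, num_priorities=2)
fifo_switch = SwitchConfig(num_ports=8, voq=False, buffer_cells_per_queue=16, num_priorities=2)
print(voq_switch.total_queues(), fifo_switch.total_queues())  # prints "128 16"
```

Keeping the architecture choice in data rather than in code is what lets one module stand in for many switch designs during design-space exploration.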
In simulations accompanying the development of new switches, it is frequently necessary to add new functions or change existing ones. As they wrote, "[this was] flexibly possible because in OMNEST the lowest-level module functions are programmed in C++."
For the simulation of larger HPC systems, the team exploited OMNEST's parallel distributed simulation capability to speed up simulations and to spread memory requirements across multiple computers. A cluster of SMP machines with the Parallel Operating Environment (POE) of the AIX® operating system was used for parallel simulations. (The simulator can also be run on x86 machines running either Linux or Windows.)
Using the system, the team was able to choose the optimal interconnection network (including the network topology, switch architecture and buffer sizes); evaluate the trade-offs involving the use of indirect routes and adaptive routing; perform GUPS (Giga Updates Per Second) benchmarks on the model; and project the performance of selected existing MPI applications onto the future supercomputer system.
In a follow-up paper, the team reports how the MARS simulator was used together with other tools to optimize the interconnection network of the Mare Nostrum supercomputer at the Barcelona Supercomputing Center. The goal was to optimize end-to-end performance for the actual application programs running on the system. To model the applications accurately, the team collected application traces at the message-passing interface level using an instrumentation package, and connected the OMNEST-based network simulator MARS to an MPI task simulator that replays the MPI traces. Both simulators generate output that can be evaluated with a visualization tool. The paper also presents several example results that provide insights which would not have been possible without this integrated environment.
Wolfgang E. Denzel (IBM Zurich Research Laboratory), Jian Li (IBM Austin Research Laboratory), Peter Walker (Open Grid Computing Inc.) and Yuho Jin (Texas A&M University), 2008. "A framework for end-to-end simulation of high-performance computing systems." Simutools '08: Proceedings of the 1st International Conference on Simulation Tools and Techniques for Communications, Networks and Systems & Workshops: 1-10. March 7, 2008, Marseille, France.
Cyriel Minkenberg and German Rodriguez Herrera (IBM Zurich Research Laboratory), 2009. "Trace-driven Co-simulation of High-Performance Computing Systems using OMNeT++." OMNeT++ 2009: Proceedings of the 2nd International Workshop on OMNeT++ (hosted by SIMUTools 2009). March 6, 2009, Rome, Italy.
2. Architectural Exploration of Chip-Scale Photonic Interconnection Networks
nanophotonic network (illustration; source: IBM)
Researchers at the Lightwave Research Laboratory, Columbia University, have been exploring architectural aspects of future on-chip photonic (as opposed to electronic) networks. Recent advances in silicon nanophotonic technology have opened the possibility of integrating photonics into chip-scale interconnection networks. Compared to electronics, photonics has the potential to offer higher-bandwidth connections by leveraging the data parallelism offered by wavelength-division multiplexing (WDM). Other researchers have proposed many photonic topologies in an effort to improve computing performance, but so far less emphasis has been placed on understanding whether such designs are feasible from a physical-layer standpoint.
The group has used OMNeT++ simulations to perform detailed physical-layer analysis of chip-scale photonic interconnection networks; to the authors' best knowledge, this is the first such detailed physical-layer analysis. They used simulation because it is not currently practical to test full network topologies in a laboratory environment. The simulation framework they have developed has been published as open-source software.
The PhoenixSim simulator is based on the OMNeT++ simulation environment, and it incorporates detailed physical models of basic photonic building blocks such as waveguides, modulators, photodetectors, and switches. More complex photonic circuits and full topologies can be created by properly arranging these building blocks. These composite structures can then be analyzed within the simulator to determine the overall performance characteristics.
In the quoted study, the group evaluated a previously proposed interconnection topology (Torus) and two newly introduced ones, TorusNX and Square Root, and explored the impact of three physical-layer metrics on system scalability, performance, and efficiency. This is only a first result. Because the PhoenixSim simulator allows researchers to analyze the overall scalability and performance of various network designs in terms of physical-layer metrics such as insertion loss, crosstalk, and energy, many more results can be expected, contributing to the eventual realization of practical on-chip photonic networks.
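To make one of these metrics concrete: insertion loss expressed in decibels adds up along an optical path, so a path built from basic building blocks can be checked against an optical power budget. The following Python sketch shows the arithmetic only. The component names, the per-component loss values, and the power levels are made-up placeholders, not figures from PhoenixSim or from the study.

```python
def path_insertion_loss_db(path):
    """Sum the per-component insertion losses (in dB) along one optical path.

    `path` is a list of (component_name, loss_db) pairs; losses in dB add.
    """
    return sum(loss_db for _, loss_db in path)


def fits_power_budget(path, injected_dbm, sensitivity_dbm):
    """True if the signal still reaches the detector above its sensitivity."""
    received_dbm = injected_dbm - path_insertion_loss_db(path)
    return received_dbm >= sensitivity_dbm


# Hypothetical path: all loss values below are illustrative placeholders.
example_path = [
    ("modulator", 1.0),
    ("waveguide", 0.5),
    ("switch", 0.5),
    ("switch", 0.5),
]
print(path_insertion_loss_db(example_path))                              # 2.5
print(fits_power_budget(example_path, injected_dbm=0.0, sensitivity_dbm=-20.0))  # True
```

Because every additional switch or waveguide segment on the longest path adds to the total, a bound of this kind limits how large a topology can grow for a given laser power and detector sensitivity.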
Johnnie Chan, Gilbert Hendry, Aleksandr Biberman and Keren Bergman (Dept. of Electrical Engineering, Columbia University), 2010. "Architectural design exploration of chip-scale photonic interconnection networks using physical-layer analysis." OFC/NFOEC 2010: Optical Fiber Communication (OFC), collocated National Fiber Optic Engineers Conference, San Diego, 21-25 March 2010.
See also a related paper:
Gilbert Hendry, Shoaib Kamil, Aleksandr Biberman, Johnnie Chan et al. (Lightwave Research Laboratory, Columbia University; Computer Science Dept, Columbia University; CRD/NERSC, Lawrence Berkeley National Laboratory), 2009. "Analysis of photonic networks for a chip multiprocessor using scientific applications." In Proceedings of the 2009 3rd ACM/IEEE international Symposium on Networks-on-Chip (May 10 - 13, 2009). NOCS. IEEE Computer Society, Washington, DC, 104-113.
3. Improving the Performance of InfiniBand in a Supercomputing Cluster
Dr. Birk and his colleague at the Parallel Systems Laboratory of the Technion (Israel Institute of Technology) investigated congestion in high-performance computing (HPC) clusters that use the InfiniBand® interconnection network, with the help of Eitan Zahavi of Mellanox Technologies, a leading provider of InfiniBand equipment. As of 2009, InfiniBand (24%) was one of the most prevalent interconnects in Top500 supercomputers, alongside Gigabit Ethernet (58%). Congestion arises in cluster-based supercomputers due to contention for links, and spreads due to oversubscription of communication resources.
The researchers used OMNeT++ simulations to explore and evaluate various options for mitigating congestion and thereby improving the performance of the system. Since the goal was to simulate large networks with thousands of nodes, they created special InfiniBand models that operate at the functional, rather than cycle-accurate, level. Although the congestion-reduction methods under study are topology agnostic, the team examined them on a k-ary n-tree topology, which is a variant of a practical fat tree. This topology is popular in modern clusters.
Based on simulation experiments, the team proposed novel adaptive routing and rate calculation algorithms. On a slightly augmented 16-ary 3-tree implementing a 4096-node fat tree (which is highly representative of current computer clusters), adaptive routing alone proved effective at mitigating "topological" congestion, reducing it by about 50%. The slight topological extension this required entailed only a 10% increase in the number of switch ports. The study contributes to the understanding of supercomputer architectures and helps build more powerful supercomputers in a cost-effective way.
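The 4096-node figure follows from the usual definition of a k-ary n-tree, which connects k^n end nodes through n levels of k^(n-1) switches each. The short Python sketch below computes these sizes under that standard definition. It does not model the "slight augmentation" used in the study, whose details are not given here, and the function name is hypothetical.

```python
def kary_ntree_size(k, n):
    """Sizes of a plain k-ary n-tree under the standard definition.

    Assumption: k**n end nodes and n levels of k**(n - 1) switches each.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive integers")
    return {
        "end_nodes": k ** n,
        "levels": n,
        "switches": n * k ** (n - 1),
    }


sizes = kary_ntree_size(16, 3)
print(sizes)  # {'end_nodes': 4096, 'levels': 3, 'switches': 768}
```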
Yitzhak Birk and Vladimir Zdornov (Technion, Israel Institute of Technology), 2009. "Improving communication-phase completion times in HPC clusters through congestion mitigation." SYSTOR '09: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference: 1-11.
