The Role of SINET in the Management and Operation of the Supercomputer Fugaku and the HPCI Shared Storage System
RIKEN Center for Computational Science (hereinafter “R-CCS”) manages and operates the world’s fastest supercomputer Fugaku and the HPCI shared storage system. To find out about the role of SINET in high performance computing, we spoke to Unit Leader Keiji Yamamoto and Technical Scientist Shinichi Miura of the Advanced Operation Technologies Unit and Technical Scientist Hiroshi Harada and Technical Staff Hidetomo Kaneyama of the HPC Usability Development Unit within the Operations and Computer Technologies Division of R-CCS.
(Interview date: December 7, 2020)
Could you start by giving an overview of R-CCS?
Yamamoto： R-CCS engages in activities in three areas: science of computing, science by computing, and science for computing. Science of computing is research into computing technology itself, and includes the development of supercomputers and software as well as computer operations and technologies. Science by computing applies the high performance computing technologies produced through such activities to research areas such as the life sciences, meteorology and particle physics in order to solve scientific and societal problems. Science for computing is the development of materials and devices to support new computing concepts in collaboration with various scientific areas.
Within this framework, what are the responsibilities of the Advanced Operation Technologies Unit of the Operations and Computer Technologies Division?
Yamamoto： The Operations and Computer Technologies Division is responsible for keeping the supercomputer Fugaku, which will be made available for shared use from 2021, and other facilities such as the HPCI (High Performance Computing Infrastructure) shared storage system up and running, and for providing related services to users. However, we do not simply operate these facilities; we also conduct research into the operations and computer technologies themselves. For example, automation and AI have recently become important themes in operation and management, as in other areas, and the utilization of new technologies such as virtualization and containerization is also required. Our unit’s role is to research and develop such cutting-edge operations and technologies through demonstration using Fugaku and the HPCI shared storage system.
Miura： With the supercomputers built so far, highly skilled experts write special programs to tune performance, and access is via a batch processing system. Simply put, supercomputers have been designed for specialists. However, in the future, we want to go a step further and make supercomputers even easier to use. We therefore incorporated commonly used public cloud computing technology into Fugaku and also trialed Fugaku Cloud Platform, a cloud-like service that uses the computational resources of Fugaku.
Tell us about the role played by SINET.
Yamamoto： SINET is essential infrastructure for our activities. For example, the HPCI shared storage system enables supercomputer input/output data to be shared across organizations, which gives rise to a massive amount of traffic. A network that can support this amount of traffic is therefore essential. The virtual university LAN service is also important for us. This service is extremely helpful as it allows us to connect with RIKEN centers at other locations simply by creating a VLAN.
So the HPCI shared storage system gets such a large amount of traffic, does it?
Harada： Originally, the HPCI shared storage system was built to enable supercomputer research results to be shared efficiently across organizations. If there is an enormous file system that can be accessed from any organization using the same user account, there is no need to move huge amounts of data between supercomputers, and researchers can conduct research efficiently using multiple supercomputers. Currently, we have storage at two sites, R-CCS and the University of Tokyo Kashiwa Campus, and are constantly performing data replication between them. Storing data at two geographically dispersed sites allows us to protect valuable research results. Meanwhile, users can access data quickly by using the replica that is close to them on the network. Depending on the research theme, a single simulation can generate multiple terabytes of data, and such data must also be synchronized quickly across both sites. Consequently, there is always a large amount of traffic flowing between R-CCS and the University of Tokyo Kashiwa Campus.
Kaneyama： Storing data at two storage sites allows us to continue providing services without a problem even in the event of a fault at either R-CCS or the University of Tokyo Kashiwa Campus. The replication of data which has built up following a fault or maintenance means that a large amount of data is transferred all at once but, thanks to SINET, even at times like this, data can be resynchronized quickly. There are also huge benefits of using SINET when it comes to ensuring network reliability and availability. We currently have a 100G line to handle the large amounts of traffic but the network is designed so that regular fully meshed redundant data paths are also available.
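To give a rough sense of scale for the resynchronization described above, the following is a minimal back-of-envelope sketch, not a description of how the HPCI shared storage system actually schedules replication. It assumes that "100G" means 100 gigabits per second and that a terabyte is 10^12 bytes; the function name `transfer_seconds`, the `efficiency` parameter and the data sizes in the loop are all hypothetical and chosen only for illustration.

```python
def transfer_seconds(data_terabytes: float,
                     link_gbps: float = 100.0,
                     efficiency: float = 1.0) -> float:
    """Estimate the time needed to move a backlog of data over a link.

    data_terabytes: backlog size in terabytes (assumed 1 TB = 10**12 bytes).
    link_gbps:      nominal link speed in gigabits per second
                    (assumption: "100G" means 100 Gbit/s).
    efficiency:     hypothetical fraction of the nominal speed actually
                    achieved, between 0 (exclusive) and 1 (inclusive).
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must not be negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")

    bits_to_send = data_terabytes * 1e12 * 8          # bytes -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_send / usable_bits_per_second


if __name__ == "__main__":
    # Illustrative backlog sizes only; real figures are not given in the interview.
    for backlog_tb in (1, 10, 100):
        seconds = transfer_seconds(backlog_tb)
        print(f"{backlog_tb} TB at an ideal 100 Gbit/s: about {seconds / 60:.1f} minutes")
```

Under these assumptions, a backlog of tens of terabytes moves in minutes rather than hours, which is consistent with the point that data can be resynchronized quickly after a fault or maintenance. Actual throughput depends on protocol overhead, storage speed and competing traffic, none of which this estimate models.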
So you’re saying that a highly reliable wide-area network like SINET is essential for providing stable services to users, right?
Harada： When it comes to the HPCI shared storage system, we regard service continuity as extremely important. Our entire staff makes a concerted effort to achieve this goal, but the network is not something we can control on our own. NII also cooperates generously with us, for example by contacting us in advance of urgent network construction work, and we are extremely grateful for this.
Connection of Fugaku with the Oracle Cloud infrastructure was recently announced. What is the aim of this?
Miura： Firstly, there was concern about whether having R-CCS assume responsibility for storing the data generated by Fugaku was the best option. Since there are undeniably risks in terms of security and system failure, an arrangement whereby users can themselves assume responsibility for storing the data in cloud storage is required. Secondly, since Fugaku and regular PCs have slightly different processors, the pre-processing of data is sometimes necessary. In such cases, R-CCS cannot always provide the environment for this, and the data must therefore be prepared somewhere else. In light of such considerations, it is extremely important for R-CCS to have an environment that enables Fugaku users to connect to cloud resources. Fortunately, SINET offers the SINET cloud connection service, and use of this service enables the seamless integration of Fugaku and Oracle Cloud resources.
Yamamoto： Now that Fugaku users have access to the Oracle Cloud environment, Fugaku can be expected to be deployed in a wider range of applications in the future. In research using supercomputers, the question of what to do about preprocessing and postprocessing always comes up. However, once users are able to freely set up new instances in the cloud to operate the program and perform preprocessing and postprocessing, this will further increase the efficiency of research and development.
Kaneyama： The ability to seamlessly integrate Fugaku and cloud resources is also likely to be hugely beneficial for research and development by enterprises. Our decision to connect Fugaku with Oracle Cloud first of all was based on the fact that the Oracle Cloud service is a flat-rate service. In the future, we intend to offer similar environments using other public clouds. We also received scrupulous support from NII with respect to operations to connect with Oracle Cloud.
Finally, what is your outlook for the future and what are your expectations for SINET?
Harada： We plan to further accelerate integration and data sharing with research organizations not only in Japan but also overseas. R-CCS is also pursuing a range of initiatives, and we will definitely be asking SINET about further development and expansion of overseas connections.
Yamamoto： In the days of the K computer, SINET connection went as far as the login node and it was impossible to do things like view storage directly. However, if integration with cloud resources is possible, users will be able to access Fugaku resources directly from anywhere at all. As a result, traffic will inevitably increase further and we are therefore hoping for a next-generation SINET network to be able to accommodate this.