AI Bridging Cloud Infrastructure (ABCI) and its Communication Performance
Shinichiro Takizawa, National Institute of Advanced Industrial Science and Technology (AIST)


Outline
• Introduction of AIST and ABCI
• ABCI in detail: architecture, software stack, network topology
• MPI performance on ABCI
• A recent application on ABCI

Introduction of AIST
AIST is a research institute under the Ministry of Economy, Trade and Industry (METI) of Japan. Its mission is to integrate scientific and engineering knowledge to address the needs of industry and society, and to bridge the gap between innovative technological seeds and commercialization.
About ABCI
We built ABCI to popularize AI technologies in Japan. ABCI is installed at the University of Tokyo Kashiwa II Campus.
ABCI has been in operation since August 1st, 2018.

Some First-Year Achievements
• 100 projects and 1,000 users from academia, research institutes, and companies.
• Large Japanese companies have started to use ABCI as their R&D platform.
• ABCI supports NGC containers (https://blogs.nvidia.com/blog/2019/06/17/abci-adopts-ngc).
• SONY and Fujitsu Laboratories achieved strong ImageNet-1k classification training performance on ABCI.
• Two research papers that use ABCI were accepted at SC19.
World's Highest Speed in ImageNet-1k Training
[Chart: relative speedup and accuracy of ImageNet ResNet-50 training results from MSRA, Facebook, Google Brain, Preferred Networks, Tencent, Sony (on ABCI), Google, Fujitsu Lab (on ABCI), and NVIDIA, 2015-2019.]
• SONY's work: https://arxiv.org/abs/1811.05233
• Fujitsu Lab's work (MLPerf Training v0.6): https://arxiv.org/abs/1903.12650

ABCI Hardware Overview

High Performance Computing System
• 550 PFlops (FP16), 37.2 PFlops (FP64), 476 TiB memory, 1.74 PB NVMe SSD
• Computing Nodes (with GPU) x1088
  GPU: NVIDIA Tesla V100 (SXM2) x4
  CPU: Intel Xeon Gold 6148 (2.4 GHz, 20 cores) x2
  Memory: 384 GiB
  Local storage: Intel SSD DC P4600 (NVMe, 1.6 TB) x1
  Interconnect: InfiniBand EDR x2
• Multi-platform Nodes (without GPU) x10
  CPU: Intel Xeon Gold 6132 (2.6 GHz, 14 cores) x2
  Memory: 768 GiB; 3.8 TB NVMe SSD; 1.5 TB Intel Optane x2
• Interactive Nodes x4, Management and Gateway Nodes x15; 3.2 TB SSD x24
• Interconnect: InfiniBand EDR (Mellanox CS7500 x2, Mellanox SB7890 x229)
• Gateway and firewall: Nexus 3232C x2, FortiGate 1500D x2, FortiAnalyzer 400E x1
• Service network: 10 GbE, connected to SINET5

Large-scale Storage System
• 1 PB Lustre home directory: DDN SFA14KX with SS9012 enclosure x10 (x1 set); 7.68 TB SAS SSD x185 for data, 960 GB SAS SSD x13 for metadata
• 22 PB GPFS group shared directory etc.: DDN SFA14K with SS8462 enclosure x10 (x3 sets); 12 TB 7.2 krpm NL-SAS HDD x2400, 3.84 TB SAS SSD x216
• 17 PB object storage (in preparation): HPE Apollo 4510 Gen10 x24; 12 TB SATA HDD x1440
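As a rough cross-check, the quoted system peaks follow from per-device peak figures. The sketch below assumes standard published numbers for the V100 SXM2 (about 7.8 TFLOPS FP64, 125 TFLOPS FP16 tensor) and the Xeon Gold 6148 (about 1.5 TFLOPS FP64 per socket); these per-device values are assumptions, not taken from the slides.

# Rough sanity check of the quoted system peaks. The per-device figures
# below are assumptions based on public V100 SXM2 and Xeon Gold 6148
# specifications, not values taken from the slides.
nodes = 1088
gpus_per_node = 4
cpus_per_node = 2
gpu_fp64_tflops = 7.8     # V100 SXM2 double-precision peak (assumed)
gpu_fp16_tflops = 125.0   # V100 SXM2 tensor-core FP16 peak (assumed)
cpu_fp64_tflops = 1.5     # Xeon Gold 6148 per-socket AVX-512 peak (assumed)

fp64_pflops = nodes * (gpus_per_node * gpu_fp64_tflops +
                       cpus_per_node * cpu_fp64_tflops) / 1000
fp16_pflops = nodes * gpus_per_node * gpu_fp16_tflops / 1000

print(f"FP64 peak ~ {fp64_pflops:.1f} PFLOPS")   # ~37.2, matching the slide
print(f"FP16 peak ~ {fp16_pflops:.0f} PFLOPS")   # ~544, close to the quoted 550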
ABCI Software Stack
• Operating system: RHEL / CentOS 7.6
• Job scheduler: Univa Grid Engine 8.6.3
• Container engines:
  Docker 17.12.0 (users can use only supported container images)
  Singularity 2.6.1 (users can use any container images)
• MPI:
  Intel MPI 2018.2.199
  MVAPICH2 2.3rc2, 2.3; MVAPICH2-GDR 2.3a, 2.3rc1, 2.3, 2.3.1
  OpenMPI 1.10.7, 2.1.3, 2.1.5, 2.1.6, 3.0.3, 3.1.0, 3.1.2, 3.1.3
• Development tools:
  Intel Parallel Studio XE Cluster Edition 2017.8, 2018.2, 2018.3, 2019.3
  PGI Professional Edition 17.10, 18.5, 18.10, 19.3
  NVIDIA CUDA SDK 8.0, 9.0, 9.1, 9.2, 10.0
  cuDNN 5.1, 6.0, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
  NCCL 1.3.5, 2.1, 2.2, 2.3, 2.4
  Intel MKL 2017.8, 2018.2, 2018.3, 2019.3
  GCC, Python, Ruby, R, OpenJDK, Go, Perl
• Deep learning: Caffe, Caffe2, TensorFlow, Theano, Torch, PyTorch, CNTK, MXNet, Chainer, Keras, etc.; frameworks provided by NVIDIA GPU Cloud (NGC) can also be deployed
• Big data processing: Hadoop, Spark
Computing Node
FUJITSU PRIMERGY server (2 servers in 2U):
• CPU: Intel Xeon Gold 6148 (27.5 MB cache, 2.40 GHz, 20 cores) x2
• GPU: NVIDIA Tesla V100 (SXM2) x4
• Memory: 384 GiB DDR4-2666 RDIMM
• Local storage: 1.6 TB NVMe SSD (Intel SSD DC P4600, U.2) x1
• Interconnect: InfiniBand EDR x2
[Node block diagram: two Xeon Gold 6148 (Skylake) sockets connected by UPI x3 (10.4 GT/s), each with 128 GB/s of DDR4-2666 bandwidth (32 GB x6 per socket); each socket drives a 100 Gbps InfiniBand EDR HCA over PCIe gen3 x16 and a PCIe switch (one x48, one x64) attached to two Tesla V100 SXM2 GPUs each, with the NVMe SSD on one of the switches; the four GPUs are interconnected by NVLink2 x2 links.]
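The NVLink2 connectivity among the four V100s can be observed from user code. The sketch below assumes PyTorch is available on a compute node; GPU pairs reachable over NVLink (or a common PCIe switch) normally report peer access as available.

# Minimal sketch: check GPU-to-GPU peer accessibility inside one node.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'yes' if ok else 'no'}")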
ABCI Network Topology
[Network diagram: InfiniBand EDR fat tree. Spine layer: Mellanox CS7500 x2. Each rack contains full-bisection building blocks (FBB 1-3) and leaf switches (LEAF 1-4), all Mellanox SB7890. Within a rack the network provides full bisection bandwidth (IB EDR x72); each rack connects to the spines with IB EDR x24, giving 1:3 over-subscription bandwidth between racks.]
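The 1:3 figure follows directly from the link counts shown on this slide (72 EDR links of full-bisection bandwidth inside a rack versus 24 EDR uplinks toward the spines). A small sketch of that arithmetic, with the link counts read off the diagram:

# Over-subscription ratio implied by the slide's link counts.
EDR_GBPS = 100          # InfiniBand EDR per-link data rate

intra_rack_links = 72   # full-bisection EDR links within a rack
uplinks_to_spine = 24   # EDR links from the rack toward the spines

ratio = intra_rack_links / uplinks_to_spine
print(f"over-subscription between racks: 1:{ratio:.0f}")   # -> 1:3
print(f"uplink capacity per rack: {uplinks_to_spine * EDR_GBPS / 1000:.1f} Tbps")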
[Diagram continued: each rack houses 17 CX400 chassis (two nodes per chassis); link bundles of InfiniBand EDR x4, x6, and x1 connect the chassis and switch layers.]

Grounds for the Interconnect Design
• 1:3 over-subscription network: a cost-effective solution, since many DL training applications do not use a large number of nodes and high network bandwidth is required by only a small set of nodes.
• Without adaptive routing: InfiniBand is shared by compute and I/O communication, and the delivery vendor suggested not using adaptive routing in such a configuration.
• Without Mellanox SHARP: ABCI uses EDR and switches that do not support large message sizes in SHARP.

Distribution of the Number of Used GPUs/Nodes in DL Jobs
[Charts: share of single-GPU, single-node, and multi-node jobs, and the number of nodes used in multi-node jobs.]
The workload was collected from the pre-ABCI system AAIC, a system dedicated to AI research. Single-GPU jobs are dominant and the degree of parallelism is low.
ABCI is therefore designed as a capacity computing system for AI.
AAIC system scale: nodes 32 / 38, GPUs per node 8, GPUs 256 / 304. The workload data is published at https://github.com/aistairc/aaic-workload.
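As an illustration of how the GPUs-per-job distribution above could be reproduced from the published trace, the sketch below assumes the trace can be loaded as a CSV with a num_gpus column; the file name and column name are hypothetical, so the actual schema in the repository should be checked first.

# Hypothetical sketch: tabulate the share of jobs by number of GPUs used.
import pandas as pd

jobs = pd.read_csv("aaic_workload.csv")                  # hypothetical file name
counts = jobs["num_gpus"].value_counts().sort_index()    # hypothetical column name
share = 100.0 * counts / counts.sum()

for gpus, pct in share.items():
    print(f"{gpus:>4} GPU(s): {pct:5.1f}% of jobs")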
Performance Impact on MPI under 1:3 Over-subscription without Adaptive Routing
We measured point-to-point (host memory to host memory) transfer performance using OpenMPI 2.1.6, increasing the number of concurrently communicating node pairs:
• Intra-rack: 17 x 17 node pairs (Group 0 within Rack 0, Group 1 within Rack 1)
• Inter-rack: 34 x 34 node pairs (nodes in Rack 0 paired with nodes in Rack 1)

Intra-Rack Performance
[Charts: per-node average throughput (GB/s) and aggregated throughput (GB/s) versus the number of intra-group connections (0-20), ideal vs. measured.]
More than 80% of the theoretical performance is achieved.
Inter-Rack Performance
[Charts: per-node average throughput (GB/s) and total rack throughput (GB/s) versus the number of inter-rack connections (0-40), ideal vs. measured.]
There is large performance degradation at 6 and 20 connections, and the measured throughput is far from the ideal.
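A minimal sketch of the kind of pairwise host-memory bandwidth measurement described in this section, assuming mpi4py on top of one of the MPI libraries in the software stack (the actual measurement used OpenMPI 2.1.6); the message size, iteration count, and rank-pairing scheme here are illustrative assumptions, not the benchmark that produced the plots above.

# Pairwise host-to-host bandwidth sketch: rank i sends to rank i + size/2,
# so half the ranks act as senders and half as receivers.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
assert size % 2 == 0, "launch an even number of ranks, one per node"

msg = np.zeros(64 * 1024 * 1024, dtype=np.uint8)     # 64 MiB buffer (assumed size)
iters = 20
half = size // 2
peer = rank + half if rank < half else rank - half   # fixed sender/receiver pairing

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank < half:
        comm.Send([msg, MPI.BYTE], dest=peer, tag=0)
    else:
        comm.Recv([msg, MPI.BYTE], source=peer, tag=0)
comm.Barrier()
elapsed = MPI.Wtime() - t0

if rank == 0:
    total_bytes = msg.nbytes * iters * half
    print(f"aggregate one-way throughput: {total_bytes / elapsed / 2**30:.2f} GiB/s "
          f"over {half} sender/receiver pairs")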
