Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis


... failures, live site incidents happen. A live site incident is any event that results in an impact to the customers, partners, or revenue. Live site incidents need to be detected, mitigated, and resolved as soon as possible. But data center networks have hundreds of thousands to millions of servers, hundreds of thousands of switches, and millions of cables and fibers. Detecting where a problem is located is therefore hard.

To address the above challenges, we have designed and implemented Pingmesh, a large-scale system for data center network latency measurement and analysis. Pingmesh leverages all the servers to launch TCP or HTTP pings, to provide the maximum latency measurement coverage. Pingmesh forms multiple levels of complete graphs. Within a data center, Pingmesh lets the servers within a rack form a complete graph, and also uses the top-of-rack (ToR) switches as virtual nodes and lets them form a second complete graph. Across data centers, Pingmesh forms a third complete graph by treating each data center as a virtual node. The calculation of the complete graphs and related ping parameters is controlled by a central Pingmesh Controller.

The measured latency data are collected, stored, aggregated, and analyzed by a data storage and analysis pipeline. From the latency data, network SLAs are defined and tracked at both the macro level (i.e., data center level) and the micro level (e.g., per-server and per-rack levels). The network SLAs for all the services and applications are calculated by mapping the services and applications to the servers they use.

Pingmesh has been running in tens of globally distributed data centers of Microsoft for four years. It produces 24 terabytes of data and more than 200 billion probes per day. Because of the universal availability of the Pingmesh data, answering whether a live site incident is caused by the network becomes easier: if the Pingmesh data does not indicate a network problem, then the live site incident is not caused by the network.

Pingmesh is heavily used for network troubleshooting to locate where a problem is. Through visualization and automatic pattern detection, we are able to answer when and where packet drops and/or latency increases happen, and to identify silent switch packet drops and black holes in the network. The results produced by Pingmesh are also used by application developers and service operators for better server selection, by taking network latency and packet drop rate into account.

This paper makes the following contributions. We show the feasibility of building a large-scale network latency measurement and analysis system by designing and implementing Pingmesh. By letting every server participate, we provide latency data for all the servers, all the time. We show that Pingmesh helps us better understand data center networks by defining and tracking network SLAs at both macro and micro scopes, and that Pingmesh helps reveal and locate switch packet drops, including packet black holes and silent random packet drops, which were less understood previously.

2 BACKGROUND

2.1 Data center networks

Data center networks connect servers with high speed and provide high server-to-server bandwidth. Today's large data center networks are built from commodity Ethernet switches and routers [1, 12, 2].

Figure 1 shows a typical data center network structure. The network has two parts: the intra data center (intra-DC) network and the inter data center (inter-DC) network. The intra-DC network is typically a Clos network of several tiers, similar to the networks described in [1, 12, 2]. At the first tier, tens of servers (e.g., 40) use 10GbE or 40GbE Ethernet NICs to connect to a top-of-rack (ToR) switch and form a Pod. Tens of ToR switches (e.g., 20) are then connected to a second tier of Leaf switches (e.g., 2-8). These servers and ToR and Leaf switches form a Podset. Multiple Podsets then connect to a third tier of Spine switches (tens to hundreds). Using existing Ethernet switches, an intra-DC network can connect tens of thousands of servers or more, with high network capacity.

Figure 1: Data center network structure

One nice property of the intra-DC network is that the multiple Leaf and Spine switches provide a multi-path network with redundancy. ECMP (equal-cost multi-path) is used to load-balance traffic across all the paths. ECMP uses the hash value of the TCP/UDP five-tuple for next-hop selection. As a result, the exact path of a TCP connection is unknown at the server side, even if the five-tuple of the connection is known. For this reason, locating a faulty Spine switch is not easy.
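To make the opacity of path selection concrete, the sketch below models ECMP next-hop choice: hash the five-tuple, then index into the list of equal-cost next hops. This is a minimal illustration with assumed names; real switching ASICs use vendor-specific hash functions and seeds, which is precisely why a server cannot predict which path its connection takes.

```python
import hashlib

def ecmp_next_hop(src_ip: str, dst_ip: str, src_port: int,
                  dst_port: int, protocol: int, next_hops: list) -> str:
    """Model of ECMP next-hop selection: hash the five-tuple, then
    index into the list of equal-cost next hops. MD5 here is an
    illustrative stand-in; ASICs use vendor-specific hashes/seeds."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{protocol}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

# Example: which Spine a flow would traverse under this model.
spines = ["spine-1", "spine-2", "spine-3", "spine-4"]
print(ecmp_next_hop("10.0.0.1", "10.0.1.2", 52113, 80, 6, spines))
```

Because the hash is deterministic for a given five-tuple, all packets of a connection follow one path; only the switches know which one.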
The inter-DC network interconnects the intra-DC networks and connects them to the Internet. The inter-DC network uses high-speed, long-haul fibers to connect data center networks at different geolocations. Software-defined networking (SWAN [13], B4 [16]) has further been introduced for better wide-area network traffic engineering.

Our data center network is a large, sophisticated distributed system. It is composed of hundreds of thousands of servers, tens of thousands of switches and routers, and millions of cables and fibers. It is managed by Autopilot [20], our home-grown data center management software stack, and the switches and NICs run software and firmware provided by different switch and NIC providers. The applications run on top of the network and may introduce complex traffic patterns.

2.2 Network latency and packet drops

In this paper we use the term network latency from the application's point of view. When an application A at a server sends a message to an application B at a peer server, the network latency is defined as the time interval from the time A sends the message to the time B receives the message. In practice, we measure round-trip time (RTT), since RTT measurement does not need to synchronize the server clocks.
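As a concrete illustration of this definition, the sketch below measures RTT from user space by timing a TCP connection setup, using only the local clock (hence no cross-server clock synchronization). It is a minimal sketch of the measurement idea, not the actual Pingmesh Agent implementation; the target host and port are placeholders.

```python
import socket
import time

def tcp_ping(host: str, port: int, timeout: float = 1.0) -> float:
    """Measure RTT as the wall-clock time of a TCP connection setup
    (SYN -> SYN-ACK -> ACK). Only the local clock is read, so no
    clock synchronization with the peer is needed.
    Returns the RTT in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        rtt = (time.monotonic() - start) * 1000.0
    return rtt

# Usage sketch (placeholder target):
# print(f"{tcp_ping('peer.example.com', 80):.2f} ms")
```

Note that the measured interval includes everything from the kernel stack to switch queuing, which is exactly the decomposition discussed next.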
RTT is composed of application processing latency, OS kernel TCP/IP stack and driver processing latency, NIC-introduced latency (e.g., DMA operations, interrupt modulation [22]), packet transmission delay, propagation delay, and the queuing delay introduced by packet buffering at the switches along the path.

One may argue that the latencies introduced by applications and the kernel stack are not really from the network. In practice, our experiences have taught us that our customers and service developers do not care: once a latency problem is observed, it is usually called a network problem. It is the responsibility of the network team to show whether the problem is indeed a network problem, and if it is, to mitigate and root-cause it.

User-perceived latency may increase for various reasons, e.g., queuing delay due to network congestion, busy server CPUs, application bugs, network routing issues, etc. We also note that packet drops increase user-perceived latency, since dropped packets need to be retransmitted. Packet drops may happen at different places for various reasons, e.g., fiber FCS (frame check sequence) errors, switching ASIC defects, switch fabric flaws, switch software bugs, NIC configuration issues, network congestion, etc. We have seen all of these types of issues in our production networks.

2.3 Data center management and data processing systems

Next we introduce Autopilot [20] and Cosmos/SCOPE [15]. Data centers are managed by centralized data center management systems, e.g., Autopilot [20] or Borg [23]. These management systems provide frameworks for how resources, including physical servers, are managed, and for how services are deployed, scheduled, monitored, and managed. Pingmesh is built within the framework of Autopilot.

Autopilot is Microsoft's software stack for automatic data center management. Its philosophy is to run software to automate all data center management tasks, including failure recovery, with as minimal human involvement as possible. In Autopilot terminology, a cluster, which is a set of servers connected by a local data center network, is managed by an Autopilot environment. An Autopilot environment has a set of Autopilot services, including the Device Manager (DM), which manages the machine state; the Deployment Service (DS), which does service deployment for both Autopilot and various applications; the Provisioning Service (PS), which installs Server OS images; the Watchdog Service (WS), which monitors and reports the health status of various hardware and software; and the Repair Service (RS), which performs repair actions by taking commands from DM.

Autopilot provides a shared service mode. A shared service is a piece of code that runs on every Autopilot-managed server. For example, a Service Manager is a shared service that manages the life cycle and resource usage of other applications, and a Perfcounter Collector is a shared service that collects the local perf counters and then uploads them to Autopilot. Shared services must be lightweight, with low CPU, memory, and bandwidth resource usage, and they need to be reliable, without resource leakage or crashes.

Pingmesh uses our home-grown data storage and analysis system, Cosmos/SCOPE, for latency data storage and analysis. Cosmos is Microsoft's BigData system, similar to Hadoop [3], which provides a distributed file system like GFS [17] and MapReduce [11]. Files in Cosmos are append-only; a file is split into multiple extents, and an extent is stored on multiple servers to provide high reliability. A Cosmos cluster may have tens of thousands of servers or more, and gives users almost infinite storage space.

SCOPE [15] is a declarative and extensible scripting language built on top of Cosmos to analyze massive data sets. SCOPE is designed to be easy to use: it enables users to focus on their data instead of the underlying storage and network infrastructure. Users only need to write scripts similar to SQL, without worrying about parallel execution, data partitioning, and failure handling. All these complexities are handled by SCOPE and Cosmos.
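To give a flavor of the SQL-like aggregation such jobs perform, here is a hedged sketch that uses Python's built-in sqlite3 as a stand-in for SCOPE/Cosmos. The table layout and column names are invented for illustration and are not the actual Pingmesh schema.

```python
import sqlite3

# Sketch of the kind of SQL-like aggregation a latency-analysis job
# might run over probe records (schema is an illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE probes (
    src_rack TEXT, dst_rack TEXT, rtt_us INTEGER, success INTEGER)""")
conn.executemany(
    "INSERT INTO probes VALUES (?, ?, ?, ?)",
    [("rack-1", "rack-2", 230, 1), ("rack-1", "rack-2", 3100, 1),
     ("rack-1", "rack-3", 260, 1), ("rack-1", "rack-3", 0, 0)])

# Per rack-pair: probe count, average RTT of successful probes, and
# packet drop rate -- the raw ingredients of a per-rack SLA.
for row in conn.execute("""
    SELECT src_rack, dst_rack,
           COUNT(*) AS probes,
           AVG(CASE WHEN success = 1 THEN rtt_us END) AS avg_rtt_us,
           1.0 - AVG(success) AS drop_rate
    FROM probes
    GROUP BY src_rack, dst_rack"""):
    print(row)
```

A real SCOPE job would express a similar GROUP BY, with Cosmos handling data partitioning and failure recovery transparently.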
3 DESIGN AND IMPLEMENTATION

3.1 Design goal

The goal of Pingmesh is to build a network latency measurement and analysis system that addresses the challenges we described in Section 1. Pingmesh needs to be always on, and it needs to be able to provide network latency data for all the servers. It needs to be always on because we need to track the network status all the time. It needs to produce network latency data for all the servers because the maximum possible network latency data coverage is essential for us to better understand, manage, and troubleshoot our network infrastructure.

From the beginning, we differentiated Pingmesh from various public and proprietary network tools, e.g., traceroute, TcpPing, etc. We realized that such network tools do not work for us, for the following reasons. First, these tools are not always on, and they only produce data when we run them. Second, the data they produce does not have the needed coverage. Because these tools are not always on, we cannot count on them to track the network status. These tools are usually used for network troubleshooting when a source-destination pair is known. This, however, does not work well for large-scale data center networks: when a network incident happens, we may not even know the source-destination pair. Furthermore, for transient network issues, the problem may be gone before we run the tools.
3.2 Pingmesh architecture

Based on its design goal, Pingmesh needs to meet the following requirements. First, because Pingmesh aims to provide the largest possible coverage and to measure network latency from the applications' point of view, a Pingmesh Agent is needed on every server. This has to be done carefully, so that the CPU, memory, and bandwidth overhead introduced by the Pingmesh Agent is small and affordable.

Figure 2: Pingmesh architecture

Figure 2 shows the Pingmesh architecture, with three components: the Pingmesh Controller, the Pingmesh Agent (PA), and the Data Storage and Analysis (DSA) pipeline. The Pingmesh Agent stores the ping results in local memory. Once a timer times out, or the size of the measurement results exceeds a threshold, the Pingmesh Agent uploads the results to Cosmos for data storage and analysis. The Pingmesh Agent also exposes a set of performance counters, which are periodically collected by a Perfcounter Aggregator service of Autopilot.

Data Storage and Analysis (DSA). The latency data from the Pingmesh Agents are stored and processed in a data storage and analysis (DSA) pipeline. Latency data is stored in Cosmos, and SCOPE jobs are developed to analyze it; SCOPE jobs are written in a declarative language similar to SQL. The analyzed results are then stored in an SQL database. Visualization, reports, and alerts are generated based on the data in this database and on the PA counters.
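Returning to the Pingmesh Agent's upload path described above, here is a minimal sketch of that discipline: results accumulate in local memory and are flushed when a timer expires or the buffer crosses a size threshold. The class name, thresholds, and upload callback are assumptions for illustration; in Pingmesh the upload target is Cosmos.

```python
import time

class ResultBuffer:
    """Sketch of an agent's upload discipline: keep ping results in
    local memory and flush when a timer expires or the buffered
    results exceed a size threshold. Names and defaults are
    illustrative assumptions, not the actual agent's values."""

    def __init__(self, upload, max_results=10_000, max_age_s=60.0):
        self.upload = upload            # e.g., a write to the DSA pipeline
        self.max_results = max_results  # size threshold
        self.max_age_s = max_age_s      # timer threshold
        self.results = []
        self.last_flush = time.monotonic()

    def add(self, result):
        self.results.append(result)
        timed_out = time.monotonic() - self.last_flush >= self.max_age_s
        if timed_out or len(self.results) >= self.max_results:
            self.flush()

    def flush(self):
        if self.results:
            self.upload(self.results)   # in Pingmesh, results go to Cosmos
        self.results = []
        self.last_flush = time.monotonic()

# Usage sketch: buf = ResultBuffer(upload=print); buf.add({"rtt_us": 230})
```

Bounding the buffer this way keeps the agent's memory footprint small while ensuring results reach the pipeline within a fixed delay.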
3.3 Pingmesh Controller

3.3.1 The pinglist generation algorithm

The core of the Pingmesh Controller is its Pingmesh Generator. The Pingmesh Generator runs an algorithm to decide which server should ping which set of servers. As aforementioned, we would like Pingmesh to have as large a coverage as possible. The largest possible coverage is a server-level complete graph, in which every server probes all of the other servers. A server-level complete graph, however, is not feasible, because each server would need to probe n-1 servers, where n is the number of servers; in a data center, n can be as large as hundreds of thousands. A server-level complete graph is also unnecessary, since tens of servers connect to the rest of the world through the same ToR switch.

We then came up with a design of multiple levels of complete graphs. Within a Pod, we let all the servers under the same ToR switch form a complete graph.
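A minimal sketch of this multi-level complete-graph construction, assuming servers are described by (data center, ToR, name) tuples: level 1 pairs servers under the same ToR; level 2 approximates the ToR-as-virtual-node graph by pairing one representative server per ToR within a data center; level 3 pairs one representative per data center. Picking the first server as representative is an illustrative assumption; the actual generator also assigns ping parameters and must handle a much larger scale.

```python
from itertools import combinations

def pinglists(servers):
    """Build probe pairs from multiple levels of complete graphs.
    servers: iterable of (dc, tor, name) tuples.
    Level 1: complete graph among servers under the same ToR.
    Level 2: complete graph across ToRs inside a DC, one representative
             server standing in for each ToR as its 'virtual node'.
    Level 3: complete graph across DCs, one representative per DC.
    Representative choice (first server) is an illustrative assumption."""
    pairs = set()
    by_tor, by_dc = {}, {}
    for dc, tor, name in servers:
        by_tor.setdefault((dc, tor), []).append(name)
        by_dc.setdefault(dc, []).append(name)
    # Level 1: intra-ToR complete graph.
    for group in by_tor.values():
        pairs.update(combinations(sorted(group), 2))
    # Level 2: inter-ToR complete graph within each DC.
    for dc in by_dc:
        reps = sorted(g[0] for (d, _), g in by_tor.items() if d == dc)
        pairs.update(combinations(reps, 2))
    # Level 3: inter-DC complete graph.
    dc_reps = sorted(group[0] for group in by_dc.values())
    pairs.update(combinations(dc_reps, 2))
    return pairs

servers = [("dc1", "tor1", "s1"), ("dc1", "tor1", "s2"),
           ("dc1", "tor2", "s3"), ("dc2", "tor1", "s4")]
for src, dst in sorted(pinglists(servers)):
    print(src, "<->", dst)
```

At hundreds of thousands of servers, level 1 keeps per-server probe counts bounded by the rack size, while levels 2 and 3 add only a handful of probes per representative; this is what makes the design feasible where a full server-level complete graph is not.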
