Hi, in the last video we took a look at some of the key requirements for building a multi-tenant data center: in particular agility, and, to achieve agility, location-independent addressing, performance uniformity, security, and preserving the semantics of traditional layer-two networks. Today we're going to talk about a case study of a system called VL2, which was one of the earliest virtualized data center designs to be publicly described, at SIGCOMM 2009 from Microsoft Research. This design influenced the architecture of Microsoft Azure, and you can check out a bit more about that in a talk by Albert Greenberg that's listed on the slide here.

Now this is a case study, so we're actually going to see characteristics come together from both the previous weeks' ideas of physical topology and routing and what we're focusing on this week, network virtualization. Some of those ideas come together in one system, and that's why this is a pretty nice paper to read.

All right, so the paper begins with a measurement study that leads to some motivating characteristics of data centers. The first is that increasing internal traffic is a bottleneck. The paper mentions that the traffic volume between servers in the data center, in their measurements of a cluster of about 1,500 servers, is about four times larger than the external traffic to or from servers outside the cluster. They also found that the traffic patterns within the data center were unpredictable. What the authors did was take traffic matrices in 100-second buckets and classify them heuristically into about 40 categories of similar traffic matrices. Then, over time, they plotted which of these clusters appear in the measurements. What you can see is that the cluster in use changes very rapidly, and there is no apparent pattern to what the particular traffic matrix is.

The result of these measurements points toward a good design, which is a non-blocking fabric. Again, this is an idea we've seen before: we want to achieve high throughput for any traffic matrix we see, as long as it respects the line-card rates, the NIC rates of the servers. Those will be, let's say, 10 gigabits uplink and downlink from each server, and subject to those constraints, the network itself, the fabric joining together all the servers, should not be a bottleneck. That's what a non-blocking fabric means.

Another interesting characteristic in this paper is its look at failures in the data center. The authors analyzed about 36 million error events from the cluster. A few things they found: 0.4% of those failures were resolved in longer than one day, so they took a while to be fixed, and 0.3% of those failures eliminated all redundancy in a device group. That is a correlated failure that could, for example, take out all of the uplinks of a switch. What this points to is that individual failures can be serious, and it's going to be hard to make an individual device extremely reliable. These numbers look pretty small, but when we're trying to achieve three, four, or five nines of availability they become significant.

So the direction this points the design toward is a Clos topology, a particular kind of non-blocking topology that we can build that is scale-out instead of scale-up. That is, we're going to have a larger number of components forming the backbone of the network rather than a small number of beefier components.
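To make the scale-out idea concrete, here is a minimal Python sketch that enumerates the switches and links of a small VL2-style Clos fabric. The layer names follow the lecture, but the port counts and sizes are made up for illustration; they are not the configuration from the paper.

```python
# Illustrative sketch of a VL2-style Clos fabric: ToR switches connect up to a
# pair of aggregation switches, and every aggregation switch connects to every
# intermediate switch, so there are many equal-cost paths between any two ToRs.
# All counts below are assumptions for illustration, not values from the paper.

NUM_INTERMEDIATE = 4      # top-layer "intermediate" switches
NUM_AGGREGATION = 8       # aggregation switches
TORS_PER_AGG_PAIR = 4     # ToRs dual-homed to each pair of aggregation switches

def build_fabric():
    links = []
    # Every aggregation switch connects to every intermediate switch.
    for a in range(NUM_AGGREGATION):
        for i in range(NUM_INTERMEDIATE):
            links.append((f"agg{a}", f"int{i}"))
    # Each ToR dual-homes to a consecutive pair of aggregation switches.
    tor_id = 0
    for a in range(0, NUM_AGGREGATION, 2):
        for _ in range(TORS_PER_AGG_PAIR):
            links.append((f"tor{tor_id}", f"agg{a}"))
            links.append((f"tor{tor_id}", f"agg{a + 1}"))
            tor_id += 1
    return links

if __name__ == "__main__":
    links = build_fabric()
    print(f"{len(links)} links in the fabric")
    # Paths between two ToRs in different pods: 2 uplink choices,
    # NUM_INTERMEDIATE intermediate choices, 2 downlink choices.
    print("equal-cost paths between two ToRs:", 2 * NUM_INTERMEDIATE * 2)
```

The point of the sketch is simply that capacity grows by adding more modest switches and links in parallel, rather than by buying a bigger switch.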
Okay, so putting together these ideas brings us to the VL2 physical topology. We're moving from the traditional network design over to a Clos network, which will look pretty familiar. This particular design is a bit different from the fat-tree paper that you read because of different line speeds, port counts, and switch configurations.

So how do we route in this physical topology? Well, another conclusion of the measurements was that the traffic is unpredictable, which means it's difficult to adapt to it. This leads us to a design that is what's called oblivious. Oblivious routing means that the path along which we send a particular flow does not depend on the current traffic matrix. So we're going to be able to decide locally at each switch where the traffic should go, without larger-scale global coordination.

Now, Valiant load balancing is an idea that comes from a paper by Leslie Valiant on routing on hypercubes. It's a theoretical paper, and we're not going to go into the algorithm and proof there. But the key idea is that you take an arbitrary traffic matrix and make it look like a completely uniform, even traffic matrix. The way we do that is by taking the flows and spreading them evenly over all the available paths, which we can do by sending the flows through the top-layer switches. Now, we do want to keep each individual flow on a single path, so we're not going to achieve perfect Valiant load balancing; the flows are going to be a little chunkier rather than each being split evenly over all of the possible paths. But this is the intuition: we want to spread traffic as much as possible.

So what does the implementation of that look like in this particular design? Let's take a look. The first thing we want to do is spread that traffic over the top-level switches, which are called intermediate switches in this design. To do that, VL2 assigns those intermediate switches an anycast address, the same anycast address for all of the switches. Then a top-of-rack switch can send to a random one just by using that single address. If we are using ECMP, we will use a random one of the paths that are shortest. Well, all of the paths are shortest, because all of those intermediate switches are the same distance from the top-of-rack switches. So what ECMP effectively does is give us the full breadth of possible paths to any one of those switches, just by naming the single anycast address shared by all the intermediates.

Now, we could have named them differently. We could have named the intermediate switches with different IP addresses, but by using a single anycast address, the top-of-rack switches don't have to know about any changes in the membership, like a failure of one of the intermediate switches. So ECMP lets us select one of those paths from the universe of paths; one will be picked for any particular flow, and we send it to that intermediate switch. Now, in this design that outer anycast address wraps an inner header that carries the actual destination address, so from there we'll forward the packet on to the destination.
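A rough way to picture that forwarding logic is the sketch below: the sender wraps the packet in an outer header addressed to the shared anycast address, and a hash of the flow's 5-tuple pins the flow to one of the equal-cost uplinks. The header fields, addresses, and hash function here are placeholders assumed for illustration, not the actual VL2 wire format or switch behavior.

```python
# Sketch of oblivious, Valiant-style spreading with ECMP and an anycast
# intermediate address. Only the structure mirrors the idea from the lecture:
# each flow is pinned to one randomly chosen path, independent of the current
# traffic matrix. Names and formats are illustrative assumptions.

import hashlib
from dataclasses import dataclass

INTERMEDIATE_ANYCAST = "10.0.0.1"            # one address shared by all intermediates
INTERMEDIATES = ["int0", "int1", "int2", "int3"]

@dataclass
class Packet:
    flow: tuple          # (src_ip, dst_ip, src_port, dst_port, proto)
    outer_dst: str = ""  # encapsulation header added at the source
    inner_dst: str = ""  # the real destination (its ToR's address)

def ecmp_choice(flow, next_hops):
    """Hash the 5-tuple so every packet of a flow follows the same path."""
    digest = hashlib.md5(repr(flow).encode()).digest()
    return next_hops[digest[0] % len(next_hops)]

def encapsulate(flow, dst_tor_addr):
    """At the source: wrap the packet with the intermediates' anycast address."""
    return Packet(flow=flow, outer_dst=INTERMEDIATE_ANYCAST, inner_dst=dst_tor_addr)

def forward_up(pkt):
    """ToR/aggregation view: every intermediate is equally far, so ECMP over all."""
    assert pkt.outer_dst == INTERMEDIATE_ANYCAST
    return ecmp_choice(pkt.flow, INTERMEDIATES)

if __name__ == "__main__":
    flow = ("192.168.1.5", "192.168.9.7", 49152, 80, "tcp")
    pkt = encapsulate(flow, dst_tor_addr="20.0.3.1")
    print("flow pinned to intermediate:", forward_up(pkt))
```

Notice that the sender only ever names the one anycast address; if an intermediate switch fails, the set of usable next hops shrinks, but nothing about the sender's logic has to change.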
Now it's useful to note how this compares with the simpler design of just having a single address for each destination top-of-rack switch and using ECMP to go along a random shortest path to one of those. How does that compare to this design with the intermediate step of the anycast address? Well, it has a similar effect: we're going to choose a random path to the destination top-of-rack switch, and the paths selected are going to be similar. The design with the anycast address, though, results in smaller forwarding tables at most of the switches, which may be a consideration for some networks depending on the scale and the hardware used.
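As a back-of-the-envelope illustration of that last point (the numbers below are made up, not measurements from the paper): if every destination top-of-rack switch has its own routable address, each core switch needs roughly one entry per ToR, whereas with the anycast-plus-tunneling scheme the core switches mostly need routes to the other switches and the one shared anycast address.

```python
# Rough forwarding-table comparison under assumed, illustrative sizes.
num_tors = 1000          # assumed number of top-of-rack switches
num_switches = 150       # assumed aggregation + intermediate switches

# Simpler design: a route per destination ToR address at every switch.
per_tor_entries = num_tors

# Anycast design: core switches mostly carry routes to other switches plus the
# single shared anycast address; per-destination state stays at the edge.
anycast_entries = num_switches + 1

print(f"per-ToR addressing : ~{per_tor_entries} entries per core switch")
print(f"anycast + tunneling: ~{anycast_entries} entries per core switch")
```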