28 NOVEMBER 2023
BRIAN NISBET: Good morning, folks. Hope you have been reinvigorated by your coffee. And we have installed additional breakers and a small nuclear power plant outside the room to hopefully have a nice smooth session. I'm Brian, and along with Antonio we will be chairing this session. Three very interesting talks.
So, first up we have Christian Petrash from DE‑CIX.
CHRISTIAN PETRASH: Good morning, ladies and gentlemen. Thanks for having me. My name is Christian. And I want to bring you on a journey for a big data platform.
At first, let's talk about why we want to do that. We want to provide our customers, and especially our smaller customers, a tool that gives them deeper insights into their ports. A lot of them are blind because they are too small to run their own collectors or telemetry metrics as a company. We also want to be technically state of the art and have better observability possibilities than we had at that point in time. And, in the end, it gives us better possibilities for network planning, capacity planning and so on.
And, for sure, as the graph shows you, traffic has been increasing over the last five years: it doubled. So we had to build something faster than the old system.
The first challenge we figured out was that we have more or less 5 gigabytes a second of data, which consists of 300,000 packets a second of IPFIX flow data coming from our edge routers on the DE‑CIX platform, and 10,000 values a minute of telemetry data, which is the statistics of port counters, error counters, discards and so on. Doing analysis, enrichment and big data stuff with that amount of data is a challenge.
So we invented a three‑step solution: data ingest and storage, front end, and enrichment. For that, we went on a shopping trip, and it should be fun, we heard.
And the first step was getting a data hose and a data lake. For the data hose, a lot of companies are relying on Kafka: it's scalable, it's fast, it looked like a good idea. And for data storage, we remembered the last RIPE meetings, where Cloudflare gave some talks about ClickHouse; a lot of people are using ClickHouse, it's a really fast database and it fits perfectly together with Kafka, so we chose ClickHouse. So we have our IXP platform and the data is coming in, we need some collector miracle, I will come to that later, and then we put the stuff into Kafka and ClickHouse.
The next step was which tool to use for the collector, and at that point there were more or less two big players in the market, so we had a look at the transport performance. At first, we used JSON, but JSON packets are very big, which means slow parsing. And that was the first lesson we learned: it's too slow for that big an amount of IPFIX data. So we thought, oh, binary performance could be nice.
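As a rough illustration of why a binary encoding beats JSON for flow records (the field names and fixed layout below are purely illustrative, not DE‑CIX's actual schema; real Protobuf adds small field tags and varint encoding on top of this idea):

```python
import json
import struct

# A hypothetical flow record; field names are made up for illustration.
flow = {"src_port": 443, "dst_port": 51234, "bytes": 1500, "packets": 3, "proto": 6}

# Text encoding: JSON repeats every key name in every single record.
as_json = json.dumps(flow).encode()

# Binary encoding: a fixed struct layout, roughly what Protobuf/Avro achieve.
# "!HHQQB" = two 16-bit ports, two 64-bit counters, one 8-bit protocol number.
as_binary = struct.pack("!HHQQB", flow["src_port"], flow["dst_port"],
                        flow["bytes"], flow["packets"], flow["proto"])

print(len(as_json), len(as_binary))  # the binary form is several times smaller
```

At 300,000 records a second, that size difference translates directly into parsing and transport cost.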
So the battle of binary performance started, and we had a look at Protobuf and Avro, and more or less Protobuf won, because in a later step we knew we wanted to use Rust, and the Avro implementation for Rust was, at that point, not really good. So, we said cool, let's use Protobuf.
And at that point, only goflow2 was able to use Protobuf. Thanks for developing and supporting that, Louis; he is at the RIPE meeting if somebody wants to talk with him. For example, we found that our platform consists of Nokia routers, and Nokia uses a different IPFIX template, so we had to change something in cooperation with Louis, and that was really fast. I really appreciate it.
So, our platform is running in the Cloud, but my message is not that you have to run it in the Cloud; it's possible wherever you want to run it. But we run it in the Cloud. And we thought, ah, cool, let's run goflow in the Cloud. We tried that, we sent our IPFIX packets, and we saw, okay, we are losing around 75% of them. Why?
After some investigation, we saw that the network of the Cloud provider has a maximum MTU of 1,400 bytes. That is complicated, because a typical IPFIX packet has 1,460 bytes. So the UDP gets fragmented and, because of fragmentation attacks like FragmentSmack, our Cloud provider says: okay, we are dropping fragmented UDP. Which means 75% or more of our IPFIX packets were dropped. Damn!
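A quick back-of-the-envelope check of the problem (deliberately simplified: it ignores the IP/UDP header overhead and the 8-byte fragment-offset alignment rule of real IP fragmentation):

```python
def fragments_needed(packet_len: int, mtu: int) -> int:
    """Number of IP fragments for a packet of packet_len bytes on a link
    with the given MTU (simplified ceiling-division model)."""
    return -(-packet_len // mtu)  # ceiling division without importing math

# A typical 1,460-byte IPFIX export packet:
print(fragments_needed(1460, 1500))  # 1 -> fine on a standard Ethernet MTU
print(fragments_needed(1460, 1400))  # 2 -> fragmented, and fragments get dropped
```

With every export packet needing two fragments, a provider policy of dropping fragmented UDP discards the flow data wholesale, which matches the observed 75%+ loss.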
So, at that point: Nokia people, if you are in the room, maybe it could be a good idea to have a configuration option for the MTU of the IPFIX export.
So, our second lesson learned was we have to run the goflow collector on premise.
The next step of the three‑step solution journey was to search for a nice front end to present the data and the visualisations. We had a look at Grafana, we had a look at Superset, all the typical players in the market. In the end, we chose Superset, because Superset has a really nice embedding framework where you can embed your metrics in an iframe on your own page without the need for a separate login. It creates a guest token and then renders an iframe on your home page, using the login of your own page's user.
At that point in time, Grafana couldn't provide that. So we said: oh, cool, Superset.
On the left side, you can see what Superset graphs can look like. This is in our portal, but it doesn't matter on which website; the graphs on the normal Superset page look the same.
So we have Kafka, ClickHouse, and the dashboard requests coming via Superset from our customers, or from the people who want to look at the data.
After that, we started to cook with the data, and the first step of cooking was enrichment. We started with the flow data, and flow data has no real connection to the business data, in our case, of our customers. So we wanted to enrich it with additional data to get better metrics.
So, we had a look at what ClickHouse provides, and ClickHouse provides a technique known as dictionaries. It's really fast, because these are tables in memory, and you can combine that with a lot of sources: other databases, APIs, anything you like, because you can program a web API if you want. We did that, to add an API to our database. And you have the possibility to automatically renew that data after several hours, or seconds, whatever you want.
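A minimal Python sketch of the idea behind such a dictionary: an in-memory lookup table that renews itself from its source after a TTL. The loader, keys and TTL here are invented for illustration; the real ClickHouse feature is configured in SQL or XML, not Python:

```python
import time

class EnrichmentDict:
    """In-memory lookup refreshed from a source after a TTL, loosely
    mimicking a ClickHouse dictionary (a sketch, not the real engine)."""

    def __init__(self, loader, ttl_seconds):
        self.loader = loader              # callable returning {key: business data}
        self.ttl = ttl_seconds
        self.data = loader()              # initial load into memory
        self.loaded_at = time.monotonic()

    def get(self, key, default=None):
        # Automatic renewal once the data is older than the TTL.
        if time.monotonic() - self.loaded_at > self.ttl:
            self.data = self.loader()
            self.loaded_at = time.monotonic()
        return self.data.get(key, default)

# Hypothetical customer metadata keyed by ASN (documentation ASNs):
customers = EnrichmentDict(lambda: {64500: "Customer A", 64501: "Customer B"},
                           ttl_seconds=3600)
print(customers.get(64500))            # Customer A
print(customers.get(1, "unknown"))     # unknown
```

Because the table lives in memory, joining it against 300,000 records a second stays cheap.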
That was the next step, the enrichment. But now we figured out we had a large pile of data, 300,000 records a second, and we said: okay, we can't store that at full granularity. So we chose five-minute, one-hour and four-hour granularities, putting the records into buckets for aggregation. And at that point a technique came into the discussion which is called materialised views. But first, what is a materialised view? It's a database technique: a materialised view is more or less a permanently running query which is listening for insert triggers on the table it is configured for. It's a little bit like inotify on Linux, it's really fast, and you can combine, manipulate or join data, so it looked really promising.
But I want to mention that it never knows the data itself. It only knows the insert triggers. Remember that, because now our explosion, our bomb, begins.
We configured that and said, okay, great: we have an enrichment materialised view, we have another materialised view for aggregation, we stacked them together, and then we switched on replication for data redundancy. And at that point our load was tremendous, the queries were slow, and nobody knew what was really happening. After some investigation, together with ClickHouse support, we figured out that stacking materialised views under a high database ingest load is not a good idea.
On the other hand, we had the problem that our aggregation, our five-minute buckets (we started with five minutes), was strange, because those buckets did not contain the data you would expect. Sometimes it was more than five minutes, sometimes it was less. And the reason is that the materialised view only knows the insert trigger, and it only inserts data when the internal ClickHouse insert block is full.
So you will never get the typical five-minute bucket you are expecting.
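For contrast, the explicit time-based bucketing you actually want can be sketched like this (illustrative Python, not the production code; a materialised view instead fires per insert block, so its "buckets" follow block boundaries, not the clock):

```python
from collections import defaultdict

def bucket_5min(records):
    """Aggregate (timestamp_seconds, bytes) flow records into strict
    5-minute buckets, keyed by the bucket's start timestamp."""
    buckets = defaultdict(int)
    for ts, nbytes in records:
        buckets[ts - ts % 300] += nbytes  # floor to the 5-minute boundary
    return dict(buckets)

# Records at t=0s, 120s, 300s and 899s:
records = [(0, 100), (120, 50), (300, 70), (899, 30)]
print(bucket_5min(records))  # {0: 150, 300: 70, 600: 30}
```

Running this kind of aggregation as an explicit batch job is what the Airflow setup described next provides.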
So, what to do? Airflow to the rescue. Airflow is from the Apache tool stack and is more or less a scheduler with a graphical user interface: it can catch up on missed jobs, you can see what is failing, you have dependent jobs, and you can write everything in Python, so it's really, really powerful. Batch processing was a better idea, to have better control. We did it; you can see a screenshot of a typical Airflow job with dependencies. And after getting the aggregation under control, one of the last steps was: yeah, we have to take care of GDPR, we have to anonymise. The anonymisation we combined with our Flux tool. It's a proprietary tool; the algorithm was presented at RIPE 84 by our research team lead, Matthias Wichtlhuber, who is also here if you want to talk to him. It's written in Rust, it does the DDoS mitigation and the anonymisation (we anonymise the last octet), and then it writes the data to the ClickHouse database. With Rust it's very fast. And maybe the anonymisation part is open source somewhere.
And the last step was to implement telemetry, which was more or less: get the data from the telemetry into the stack we developed. And then we said, oh, it's a lot of stuff, so we wrote a telemetry filter. It's just a Python script, where we filter out the relevant data and put it into our database.
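A hypothetical sketch of such a filter; the counter names here are invented, since the real ones depend on the Nokia YANG models in use:

```python
# Counters we care about; real paths depend on the vendor YANG models.
RELEVANT = {"in-octets", "out-octets", "in-errors", "out-errors", "in-discards"}

def filter_telemetry(update: dict) -> dict:
    """Keep only the relevant counters from one telemetry update."""
    return {k: v for k, v in update.items() if k in RELEVANT}

update = {"in-octets": 10, "out-octets": 20, "oper-state": "up", "in-errors": 0}
print(filter_telemetry(update))  # {'in-octets': 10, 'out-octets': 20, 'in-errors': 0}
```

Discarding the irrelevant paths before the database keeps the 10,000-values-a-minute stream manageable.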
So, we have everything together, and this is the final cluster structure. We get the data from the IXPs and put it directly into Kafka. Then we have the telemetry filter docking onto Kafka, and the Flux anonymisation tool, and then we write the raw data to ClickHouse, do some enrichment and aggregation, and provide it via Superset to the users of the data.
And this is how it can look, even on your side as well. It's Superset only. You have plug-ins, you have a lot of metric possibilities, as known from typical big data systems. It's really, really nice, so I can recommend it.
And what insights do we have? Something most people, I guess, want to see: metrics about ingress and egress, metrics about top talkers and ports, pie charts and so on. We also created a chart for DDoS traffic, which also consists of pie charts, and we have it for one of our products, a Cloud product.
I'm happy to answer any questions if you have them.
BRIAN NISBET: Do we have any questions? We do.
SPEAKER: Hi there. Thank you very much for your talk. This is very relevant to what I do for a living. I am Tom Hill, I work for British Telecom. I was curious whether you were using gNMI to export streaming telemetry. Which models, YANG models presumably, are you utilising for this?
CHRISTIAN PETRASH: The YANG models we are using are the ones Nokia provides. At Nokia, you have the prefixes for the statistics, you have them for the VPLS, for the services themselves; those are the YANG models we are using. Does that answer your question a little bit?
SPEAKER: It does. I mean, I'd refer to them as the proprietary ones.
CHRISTIAN PETRASH: It's more or less the ones you can get from Nokia, and there is a Nokia gNMI module for Python which can subscribe to the channel.
SPEAKER: Sure. So a follow‑up question. Have you considered providing telemetry streams to your connected members?
CHRISTIAN PETRASH: Not yet.
SPEAKER: Okay. Thanks.
SPEAKER: Rinse Kloek, speaking for Delta Fiber from the Netherlands. This is a very nice presentation. I am currently doing an implementation with Nokia routers and I still had an open question about how I would do streaming telemetry, so I'm interested in your project. Two questions. You mentioned some enrichment API. What kind of data did you enrich your telemetry with? And second question: do you plan to release parts of the project, or the whole project, on GitHub?
CHRISTIAN PETRASH: For the first question: we are enriching more or less the business data of our customers onto the telemetry and the flow data, so everything we have. And with that embedding framework solution, we have row-level security, so that you only see your own data. For the second question, I'm not sure. I have to talk with my boss.
SPEAKER: It would be very nice if you could release a part of it, or the whole thing.
CHRISTIAN PETRASH: I think the biggest issue we have with that at the moment is that we are only two people building this, and I think we would have problems supporting it on GitHub.
SPEAKER: Thank you.
SPEAKER: Michael Richardson, Sandelman. I worked on a bunch of graph-y things with data mostly from DNS servers, and one of the problems I had was that, in order to have a set of test data big enough to do things like your five-minute thing (that sounds exactly like what I was fighting ten years ago with MongoDB, don't ask), I would end up with too much personal data in it. And I did not feel confident enough in the anonymisation to turn it into test cases that I could actually put on GitHub or somewhere. I am wondering if you are dealing with that, or where you are with that?
CHRISTIAN PETRASH: Yeah. For the anonymisation, we anonymise the last octet of the IP addresses, and for all other data we are fine with putting it in the database. So, we are allowed to do that.
SPEAKER: But would you be fine with it being a regression test in a public repo?
CHRISTIAN PETRASH: Probably not.
SPEAKER: That's the conflict that I had. Imagine a person submits a patch and you say: did you run the 10 gig test data through it? And they are like: what 10 gig test case? Oh, the one that I can't give you. Right? So I think we hit this over and over again with big data and PII, and it's like we need an alternate Internet with, I don't know, all the octets above 255 or something and...
CHRISTIAN PETRASH: Yeah, you are right. I guess a worldwide solution would be really appreciated for that problem.
SPEAKER: Maria, developer of BIRD. I'd like to ask whether you are thinking about using the data basically online, feeding it back into the platform and stopping the DDoSes you have found right away?
CHRISTIAN PETRASH: No, not yet. What you can have is: if you are a user or a customer of our platform, you can have a look at it and you see it. And if you see the DDoS, and you can also see, for example, the protocol and the rule which matched to detect it, then you can announce the prefix and switch on blackholing yourself. That is the only possibility at the moment. Maybe later at some point, but not yet.
BRIAN NISBET: Okay.
CHRISTIAN PETRASH: If you want to come and talk to me or want to have a meeting during the coffee break, feel free, I am happy to do that.
BRIAN NISBET: Okay, thank you very much. So, up next we have Jayasree Sengupta, who will be talking about evaluating DNS resilience with truncation, fragmentation and DoTCP fallback.
JAYASREE SENGUPTA: Hello, and welcome everyone. Today's talk is evaluating DNS resilience and responsiveness with truncation, fragmentation and DoTCP fallback. The main authors couldn't be here today to speak, so I'm talking on their behalf, and I will do my best.
So, we'll begin.
So, this is a general overview, which everybody here probably already knows, just to get us into the topic. We all know that DNS on the Internet today mostly uses UDP because it is faster in terms of latency, but there are certain issues with it: because of the limited message size, it is not always able to carry the DNS responses, and hence this can create trouble, leading to truncation and fragmentation, and in certain cases, when truncation happens, it may also lead to DoTCP fallback. As we know, on the DNS flag day of 2020, it was said that, in order to cope with these challenges, a buffer size of 1,232 bytes is recommended as the standard. Therefore, the experiment that we conducted was to find the failure rate of both DoUDP and DoTCP in the current Internet traffic across the globe, and also to test what kind of buffer sizes are currently in use in global DNS traffic today.
This leads to our motivation, which I already spoke about: the limited payload size of DoUDP, 512 bytes without EDNS, might lead to truncation, which is not the case for DoTCP. With the introduction of the extension mechanisms for DNS, that is, the EDNS buffer sizes, this can be somewhat dealt with, and we are also trying to find out whether the DNS traffic today follows the DNS flag day recommendation or not.
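The truncation and fallback behaviour being measured can be sketched as a simplified decision function. This is an illustration of the protocol semantics under stated assumptions (classic 512-byte limit without EDNS; TC=1 plus TCP retry above the advertised buffer), not the paper's measurement code:

```python
def delivery(response_len, edns_bufsize=None):
    """How a DNS response over UDP is expected to be delivered (simplified).

    Without EDNS, the classic 512-byte UDP limit applies; above the
    advertised EDNS buffer size, the server sets TC=1 and the client
    retries over TCP (DoTCP fallback)."""
    limit = edns_bufsize if edns_bufsize is not None else 512
    if response_len <= limit:
        return "udp"
    return "truncated -> tcp fallback"

print(delivery(400, None))    # udp
print(delivery(1400, 1232))   # truncated -> tcp fallback
print(delivery(1400, 4096))   # udp (though possibly IP-fragmented on path)
```

The 1,232-byte flag-day value sits deliberately between these extremes: large enough to avoid most fallbacks, small enough to avoid IP fragmentation.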
So, this takes us to the major research questions we tried to focus on in this talk: how resilient are DoTCP and DoUDP over the IPv4 and IPv6 address space; what is the scale of usage and performance of both of these protocols in general, again over both IPv4 and IPv6; and, the last question: what buffer sizes are currently in use?
And so, we will go ahead with the findings through the talk and find out what we see.
So, to talk about the methodology here.
So, the experiments have been conducted with RIPE Atlas probes, which are deployed all over the world. We have mainly used the version 3 and version 4 probes, the latest ones, and these probes support IPv4 or IPv6, either of them or both. We actually worked with 2,527 globally distributed RIPE Atlas probes, whose density is somewhat like this: more dense in the European region and North America, and much less dense elsewhere.
The figure on the right side shows our major experimental setup. We tried to analyse the performance both from the edge and from the core of the network: we captured the interactions between the resolvers and the DNS client at the edge, and on the core side we captured the interactions between the resolver and the authoritative nameservers.
We do it for both the probe resolvers and the public resolvers. We'll just go to the next slide.
For the evaluation from the edge, we used cached DNS requests, for both kinds of resolvers, and the experiment was performed with ten blocks every day and ten requests per block per resolver.
So, this is the map on the top, which shows the DoTCP failure rates over the public resolvers and probe resolvers, across continents as well as autonomous systems. We used the ten most popular public resolvers, and the heat map at the bottom compares the performance of DoTCP versus DoUDP for the same set of experiments.
The immediate thing we observe here is that, for the public resolvers, DoTCP has lower failure rates than DoUDP, whereas for the probe resolvers the picture is pretty much the opposite: DoUDP actually performs way better, approximately 74% better than DoTCP. When we look at those values across the continents, we see a similar trend. However, there are quite some exceptions: if you look at Asia and Oceania, we see cases where DoTCP actually performs worse than DoUDP. But, in general, the conclusion is that DoTCP performs better.
From the edge, we also analysed the EDNS buffer sizes currently in use, and we see that four out of the ten resolvers we tested follow the DNS flag day recommendation of a 1,232-byte buffer size, whereas four other resolvers, namely Comodo, Neustar, OpenDNS and OpenNIC, use the larger buffer size of 4,096 bytes, which might lead to the problem of fragmentation. The general observation is that, between IPv4 and IPv6, the difference in buffer sizes is very small, except for Quad9: for approximately 24% of the IPv4 cases, Quad9 doesn't use any EDNS options at all, which is pretty interesting to note.
Then we moved to the next set of experiments, from the core of the network. Here we analysed both 2KB and 4KB response sizes, and the experimental setup is still the same: ten blocks every day and ten requests per block per resolver.
We have the same set of experiments here as well: the DoTCP performance on the top, and the comparison between DoTCP and DoUDP performance on the bottom. Here we see similar trends: in the case of public resolvers, DoTCP performs better than DoUDP, that is, the failure rates of DoUDP are higher. But if we compare these results with the ones from the previous slides, the experiments from the edge, we see that the failure rates from the core are actually higher than the failure rates from the edge, and this is true for both DoTCP and DoUDP. Similarly, for the probe resolvers, the results are consistent with what we had seen in the earlier slides: even here, DoUDP performs better than DoTCP, and is way ahead of it. And even here we have some exceptions, where we again see Comodo showing the same kind of results as in the experiments from the edge. There are also some interesting results: when we look into Asia especially, we see pretty high failure rates for certain resolvers, not just Comodo, which is also interesting to note. We see similar results for Oceania, and when we look into the autonomous systems, we also see these higher failure rates for ASes like DTAG, Comcast and Vodanet.
We did the same analysis of the EDNS buffer sizes from the core as well. Here, we see that three out of the ten resolvers show larger buffer sizes of 4,096 bytes: Comodo again, Neustar, and also Yandex, the new one, which was not showing the larger buffer sizes when we were looking from the edge of the network. We also find that four out of ten resolvers follow the recommendation of the DNS flag day, and resolvers in general exhibit their preferred buffer sizes in mostly more than 95% of the cases, but at least in over 90% of the cases.
Next, we also looked at which other EDNS options are being offered, and at what percentage in current DNS traffic. Apart from EDNS(0) itself, there is the EDNS Client Subnet (ECS) option and also the EDNS cookie option. Looking into the data, we find that Google offers the ECS option, and OpenDNS also offers the cookie option, but in general, across all resolvers, we see that in 99% of the cases EDNS(0) is offered, and if we compare IPv4 versus IPv6, there is only a small difference in the usage rate between the two. That's pretty much the setup.
And this pretty much ends my talk here, so we will look at the key takeaways. As for the IPv6 results, they are in the paper; I do not want to bore everybody with them, but in general the IPv6 results follow pretty much the same trends as the IPv4 results shown here. So, summing it up:
We see that the failure rate of DoUDP is greater than that of DoTCP over both IPv4 and IPv6, and this is true both when we evaluate from the edge and when we evaluate from the core. As we have already seen, three out of ten resolvers in the case of the core, and four out of ten in the case of the edge, show larger buffer sizes, which might potentially cause fragmentation issues, and resolvers in general exhibit their preferred buffer sizes in greater than 90% of the cases in both parts of the experiment. I would like to say that we tried DoUDP and DoTCP here because they are the most widely used on the Internet these days, but with DoQ coming up in the traffic, we would also like to look at DoQ at some point. We do have an interesting DoQ work coming up, which I'm presenting right after lunch here. So that pretty much sums it up. And I am open for questions.
SPEAKER: Jen Linkova, Google. I was slightly surprised to see the difference in v4 and v6 behaviour, and that made me think: in some networks, when you talk to, say, 8.8.8.8, you are actually talking to an ISP that intercepted your traffic and is serving it from its own servers. I am wondering if you actually checked whether you were getting responses from Google, or the other DNSes you tested, or from some ISP in the middle?
JAYASREE SENGUPTA: So we only checked for Google, not from the edge.
JEN LINKOVA: I am saying you really do not know if it was 8.8.8.8 responding at this point, or if it was your ISP intercepting your 8.8.8.8 traffic and answering on its behalf.
JAYASREE SENGUPTA: I have to actually check back with the main authors. I'll get back to you.
BRIAN NISBET: Any other questions, folks? No? In which case, thank you very much.
I do wonder sometimes, just with my anti‑abuse hat on, about how much of an entire conference you could get malware through via a QR code on a slide. It's just one of those interesting ideas, which none of you should ever do, obviously.
So, moving on to our third speaker of the session, we have Jen Linkova from Google talking about turning IPv4 off in the enterprise network.
JEN LINKOVA: Hello. I am Jen. I hope your laptops are successfully low on power so I can get your undivided attention now.
So, I'm going to talk about things which I considered impossible a few years ago and I am going to prove they are possible now. So it's basically a talk about unicorns. It's a talk about turning IPv4 off in an enterprise network, Google enterprise network. It's a kind of continuation of the presentation I gave, like, two, three years ago, when we tried it for the first time.
First of all, why? Why would any reasonable human being try to disable IPv4? First of all, because we actually ran out of address space. We ran out of the private address space, which is much worse: you cannot really buy it. And long story short, when you deploy dual stack, it does not solve your problem: if you deploy dual stack, you still need IPv4 everywhere, and you are still running out. So the only way was to start turning off IPv4 in one place and reuse it in some other place.
And also, I am a hands‑on person and I hate dual stack. You basically have two protocols, the failure scenarios are unpredictable, and it's much harder to troubleshoot. So, why bother? Let's try to get rid of the legacy protocol. Who cares, right?
What kind of network are we talking about? We have a large corporate network with a lot of sites, let's say a four‑digit number, and everywhere we are not using DHCPv6 for address assignment, only prefix delegation. So basically we assign addresses through SLAAC. Again, this talk is focusing on offices: places where people come with their laptops or have their desktops. It's not about the management part; it's about end points: desktops, laptops, phones.
They are using SLAAC. Every site has a device which does NAT64, basically the same device which does NAT44, and to access v4‑only destinations we obviously had to deploy NAT64 and DNS64. So, yeah, we are doing NAT64 at the edge, and in router advertisements we provide hosts with DNS64 resolver addresses, and we announce PREF64, so hosts know which PREF64 we are using for NAT64, which is the well‑known prefix, actually. We have a centralised DHCPv4 infrastructure; you'll see why that's important.
And also, wired ports all have 802.1X enabled and we put devices into VLANs.
So, previously, you might remember I gave a talk about the day I broke all the treadmills. When we tried it for the first time, we wanted to do a proof of concept, and for the proof of concept we selected the largest wi‑fi network we have, the guest network. We turned off IPv4 for Google guest wi‑fi, and for wireless guests we created a dedicated IPv4‑enabled SSID instead, because we knew there were going to be some use cases for IPv4.
We did it back in 2020. We did reclaim a lot of address space, which was very good, management was happy because we were able to reuse that address space.
Well, it kind of worked, right? But we found out that having a dedicated v6‑only network segment and a dual stack network segment is not a very good idea. It's confusing for users, because users can't decide between the different SSIDs and pick a random one. As a result, they sometimes connect to the dual stack SSID even if the device could actually work perfectly fine on v6 only. For example, some phones which work very well on v6 only would still connect to dual stack, and they would consume my precious IPv4 addresses. It leads to high IPv4 consumption. And also, I don't have any visibility into problems: why did people join the dual stack network? Is it just random choice, or is it because something does not work, something I might be able to fix? Nobody tells me anything, right? Also, with multiple SSIDs, the wi‑fi team hates it, and I realised that for the wired network I would not be able to do this: we have so many VLANs that if I started doubling the number of VLANs for 802.1X, it would be an operational nightmare. So we wanted something better.
What if, instead of doing dual stack and v6 only, we do what I like to call IPv6‑mostly? What if we let devices co‑exist on the same network segment, where some of them can be v6 only, like new devices which I know can operate in v6‑only mode, and some, like old devices, are still dual stack? How can we do this? Let's say the client can signal v6‑only capability and say: hey, I can do IPv6‑only, does this network support this? If it's just a legacy network which only has v4 or dual stack, fine, I can be dual stack. But if this network supports v6‑only or v6‑mostly mode, I'm happy to be without a v4 address.
As a result, devices can move between different networks and operate in the mode which the network expects them to. Because, yeah, I was considering: can I just turn off IPv4 for corporate devices? But then, as soon as users go outside to public wi‑fi, like Starbucks, it would get broken, right?
How did we do that? We wrote RFC 8925, you probably heard about this, especially if you attended the tutorial on Monday: option 108. We decided that the best way to turn off IPv4 is to use DHCP, because we have that DHCP infrastructure. Here is how it works: in the DHCP discover, the client includes option 108, which basically indicates the client's capability to run v6 only. If the DHCP server is not aware of that option, say, a normal Starbucks network, nothing happens: the normal handshake completes, everything works fine. If that device connects to my enterprise network, on a network segment which I have already converted to v6‑mostly mode, then my DHCP server says: oh, for this network segment, I will return option 108, which means: shut up, don't do DHCPv4 for the specified period of time, be v6 only. Again, it only happens for clients which support this. So, for example, if you take some very old network device, nothing is going to happen; that device will get IPv4 as before.
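The server-side decision can be sketched in a few lines. This is illustrative only: real DHCP packet handling is omitted, and the 1,800-second `v6only_wait` is an assumed example value, not necessarily what the deployment uses:

```python
V6ONLY_PREFERRED = 108  # the DHCPv4 option code from RFC 8925

def server_reply(requested_options, segment_is_v6_mostly, v6only_wait=1800):
    """Decide whether a (simplified) DHCPv4 server returns option 108.

    Returns the option-108 value (seconds the client should stay
    IPv4-quiet) when both the client signalled support and the segment
    is v6-mostly; otherwise None, meaning the normal IPv4 lease
    handshake proceeds."""
    if V6ONLY_PREFERRED in requested_options and segment_is_v6_mostly:
        return v6only_wait
    return None

print(server_reply({108, 1, 3}, True))    # 1800 -> client stays IPv6-only
print(server_reply({108, 1, 3}, False))   # None -> normal dual stack
print(server_reply({1, 3}, True))         # None -> legacy client keeps IPv4
```

The key property is that the signal is opt-in from both sides: a legacy client or a legacy network silently falls back to ordinary DHCPv4.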
Now I can have two types of devices in my network, and only some of them are going to get IPv4.
So, what do I mean when I say a device can operate in IPv6‑only mode? It's a very wide definition. For some devices I know because I tested: my laptop has been on a v6‑only network for five years, but I only use browsers and SSH, so for me everything works. Some people are playing some games, and that stuff breaks for some reason and they are unhappy. Well...
So, practically, the only way you can say that a device is capable of operating in v6‑only mode is if you know that the device runs 464XLAT. Because mobile operators are way ahead in this game; they have been deploying v6‑only networks for a long time. Why? Because most mobile systems use 464XLAT: they run a special translation daemon on the device, and that daemon provides an IPv4 address and an IPv4 default route to applications. Because if an application cannot operate without v4, what happens on a v6‑only network? The application says: I don't have a v4 address, I can't connect to v4, go away, I'm not working. 464XLAT fixes it: here is a v4 address and here is the default route, and then it translates the device's v4 traffic to v6 and sends it over the v6‑only network as a v6 packet, and then it's translated back by the NAT64 device, and everything looks normal. That's what happens on mobile phones at that Starbucks. And it also works on your MacBooks, if you have macOS 13 or newer.
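The translation step described above is mechanical: the CLAT on the device and the NAT64 box embed the IPv4 address into the low 32 bits of a /96 NAT64 prefix (RFC 6052). A small sketch using only the Python standard library and the well‑known prefix 64:ff9b::/96 (a provider could equally use a network‑specific prefix):

```python
import ipaddress

NAT64_PREFIX = ipaddress.ip_network("64:ff9b::/96")  # well-known prefix, RFC 6052

def synthesize(v4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address into the NAT64 /96 prefix
    (what the CLAT / DNS64 side does)."""
    return ipaddress.IPv6Address(
        int(NAT64_PREFIX.network_address) + int(ipaddress.IPv4Address(v4))
    )

def extract(v6: ipaddress.IPv6Address) -> ipaddress.IPv4Address:
    """Recover the original IPv4 address from the low 32 bits
    (what the NAT64 box does on the way out)."""
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)

v6 = synthesize("192.0.2.33")
print(v6)            # 64:ff9b::c000:221
print(extract(v6))   # 192.0.2.33
```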
Obviously, it also means that your device can operate even without DNS64, which is also good for DNSSEC ‑‑ I know there are DNS people in the room, you should love this. So, for me, practically, it means that I want devices which run 464XLAT to send option 108, and obviously I can manually opt in some devices if I want to.
So, project scope. What we did:
We were migrating our network to v6 mostly, and the scope was all user‑facing wi‑fi and wired network infrastructure all across the globe. So, all our offices ‑‑ well, most of them. There are a few which I did not touch because they are very old. But basically I want to cover the vast majority of user‑facing VLANs. Devices in scope: all devices which can send option 108. Currently that's Android, iOS and macOS. They send option 108 unconditionally, you just turn them on. They detect the NAT64 prefix either from the PREF64 router advertisement option or by a DNS lookup of an IPv4‑only name (ipv4only.arpa), they enable 464XLAT, and everything works. I also have selected Linux and ChromeOS devices as opt‑in, where users can enable option 108 selectively, try it, and disable it if they don't like it. But macOS and the phones ‑‑ they are automatically in, with an opt‑out mechanism.
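The DNS‑based detection mentioned above (RFC 7050) boils down to requesting an AAAA record for ipv4only.arpa and locating one of its well‑known IPv4 addresses inside the synthesized answer. A sketch of just the extraction step, handling only the common /96 case; the sample answer below is made up for illustration, not a live lookup:

```python
import ipaddress

# Well-known IPv4 addresses of ipv4only.arpa (RFC 7050)
WKA = {int(ipaddress.IPv4Address(a)) for a in ("192.0.0.170", "192.0.0.171")}

def pref64_from_aaaa(answer: str) -> ipaddress.IPv6Network:
    """Given a synthesized AAAA answer for ipv4only.arpa, recover the
    NAT64 prefix by finding the embedded well-known IPv4 address.
    Only the /96 embedding (the common case) is handled here."""
    v6 = int(ipaddress.IPv6Address(answer))
    if v6 & 0xFFFFFFFF in WKA:
        return ipaddress.IPv6Network((v6 >> 32 << 32, 96))
    raise ValueError("no well-known IPv4 address found at /96 offset")

# A DNS64 behind the well-known prefix would answer something like:
print(pref64_from_aaaa("64:ff9b::c000:aa"))   # 64:ff9b::/96
```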
So, how long did it take?
We are currently at 90% of the rollout ‑‑ 90% of all network segments; I am counting VLANs where this is enabled. Fingers crossed, a hundred percent will be reached next week when I'm back from RIPE. So, we started by doing an extended pilot in three chosen locations. I started in Sydney, where I was. Then we extended it for one month in five big offices, and then we flipped the switch and said: okay, from now on, all new offices are turned up with option 108 by default, in v6‑mostly mode, and we started doing brownfield. And it basically took us four months to do this. So I started in March and really started the large brownfield rollout in May.
I can't believe it ‑‑ fingers crossed ‑‑ it worked surprisingly well. No real showstoppers found. We found a few cosmetic issues. We found a few bugs in macOS, and Apple have been amazing, they have been fixing the bugs I have been reporting. We did find some issues for which we had to implement work‑arounds, and we could talk about this for hours, right, but I only have half an hour. So, my apologies, I am not going to cover all the stories and funny cases we found. I have some backup slides so you can take a look at them. Maybe I'll have time for them, we'll see.
So, yeah, we were able to do this. Well, 90% done. I hate reporting stuff which is not a hundred percent done, but ‑‑ sorry, we should have had the RIPE meeting one week later.
So ‑‑ the average utilisation dropped three or four times on typical networks. There are a few networks which did not show much of a utilisation drop, probably because there are a lot of Windows devices or ChromeOS devices, but on average I basically expect to downsize most of the subnets by between two and eight times. We are basically starting to reclaim address space, and the estimate is about 300,000 addresses to be reclaimed by the end of the year. This is not really a typical network, because here there is a significant utilisation drop ‑‑ from almost 40% utilisation.
If I made it a /20, it would be 80% utilisation, much higher than the acceptable threshold. So, after we did that ‑‑ you see it dropped below 5% ‑‑ I just downsized it to a /22 and got 7K addresses back instantly.
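The right‑sizing arithmetic above, sketched in Python. The host counts and the 50% target threshold are illustrative assumptions chosen to match the /19‑to‑/22 example in the talk, not measured data:

```python
def utilisation(hosts: int, prefix_len: int) -> float:
    """Fraction of an IPv4 subnet consumed by `hosts` addresses."""
    return hosts / 2 ** (32 - prefix_len)

def smallest_fitting_prefix(hosts: int, max_util: float = 0.5) -> int:
    """Longest prefix (smallest subnet) keeping utilisation under max_util,
    leaving headroom below the acceptable threshold."""
    for plen in range(30, 0, -1):
        if utilisation(hosts, plen) <= max_util:
            return plen
    raise ValueError("does not fit")

# ~3,200 hosts in a /19 is ~39% utilisation (the "almost 40%" example)
print(round(utilisation(3200, 19), 2))      # 0.39
# after the rollout, below 5% of the /19, i.e. roughly 400 hosts
print(smallest_fitting_prefix(400))         # 22
# downsizing /19 -> /22 reclaims 8192 - 1024 = 7168 (~7K) addresses
print(2 ** (32 - 19) - 2 ** (32 - 22))      # 7168
```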
Lessons learned. The most important lesson is that, as the saying goes, you can get much further with a kind word and a gun than with a kind word alone. Until you are actually in danger of running out of address space, nothing happens. I have been working on this for I don't know how long ‑‑ we launched Google Public DNS64 whenever it was, 2014 or 2015, because we needed that for this project, right.
But until we actually ran out, in like 2019, it was never a priority. Now it's a very important thing.
Lesson number one is that I got surprised, right. Ten years ago I was very unsatisfied with the quality of v6 in the network, but then I spent a significant amount of time making it better, and I had a false sense of security. I was thinking: we have been operating v6 for so many years, my v6 network is perfect, it works as well as IPv4. Oh my God, how many things I found.
Thanks to Happy Eyeballs, you do not have proper visibility into v6 issues while you have the safety net of IPv4.
For example, your v6 would mostly work ‑‑ mostly, like 90% of the time. Who would notice if your workstation or laptop, when it wakes up after sleep, loses its DNS configuration until the next RA comes? Maybe seconds, maybe minutes. Nobody would notice it as long as you have v4, but when you turn off IPv4, that's when the fun starts. Users do not report the issue, and all those issues on endpoints are really hard to detect and test, right, because sometimes it's a race condition, sometimes it depends on what users do. So it's not something I can easily test.
And as a result, when you have v4, nobody reports anything, nothing gets fixed.
So if you think your network is like v6 ready, well, you might reconsider.
So, some discoveries we made.
If something looks like a host and behaves like a host, is it actually a host? People put some devices in the network ‑‑ like some box with some screen on it, maybe ‑‑ and it gets an IPv4 address over DHCP and an IPv6 address over SLAAC. And then, actually, when we did v6 mostly, with option 108 there, we discovered that it's actually a router. It might be a router; there might be something inside. I actually had to take a screwdriver, open the box, and I found an OpenWRT box inside, with a tablet and a few other things connected. And for me it was a host, right.
No. And as soon as I took IPv4 away from that thing, everything got broken, because this device cannot extend the network any more. It was using NAT44 for IPv4, but it was not able to extend v6 connectivity downstream. Obviously we could use 464XLAT here, but is that a good thing, do I really want these systems to stay forever? We actually want IPv6 there. So what we are doing is using DHCPv6 prefix delegation to give prefixes to endpoints. There was some discussion at the IETF: it's not, strictly speaking, a host any more, it's a router. The line between host and router is blurred now ‑‑ like, is my phone a host or a router? Strictly speaking it's a router, because I have my laptop connected through it. And ChromeOS, which I am going to talk about later, has like 57 VMs inside. Is it a host or a router?
Let's say a node ‑‑ something which looks like a host. Okay, I only have so much time, but we could talk about this one for ages.
This is a slide which actually summarises a one‑hour talk I gave at the IETF. So, we could talk ‑‑ I have some backup slides here.
First of all, surprisingly, the biggest issue we discovered ‑‑ biggest in terms of the number of support tickets opened ‑‑ was people saying: oh my God, I came to the office and I cannot connect to Google wi‑fi any more. Why? Because they had IPv6 disabled on their laptops. Why did they have IPv6 disabled? Because for years, support people were saying: ah, you have a problem, did you try disabling IPv6? Oh, it helped. Ticket closed, problem solved.
I asked many times: how exactly are we going to re‑enable that thing? Nobody cared. Well, until we suddenly stopped providing IPv4. So, yeah, we had to use a script to go to every single corporate device and re‑enable IPv6, and to make sure it stays enabled.
So, fragment and extension headers. I love extension headers. They are actually used ‑‑ I know some people think they aren't, but they are. There are two very useful extension headers in the world. One of them is the fragment header. Okay, let me find the supporting slides for this.
So, fragmentation, obviously. For end users, it's mostly DNS and some other UDP applications; if you are blocking the fragment header, you might be unpleasantly surprised. The other one is the ESP header, which carries VPNs and wi‑fi calling. Right, wi‑fi calling.
Your phone actually establishes an IPsec tunnel over IPv6 to a voice gateway, and if you are using IPv6 you need to remember to permit not just TCP, UDP and ICMPv6 but also ESP.
We found some funny issues on some platforms. Some NAT64 devices were really happy doing NAT64 until an IPv4 packet came in with a zero UDP checksum ‑‑ which is perfectly fine in IPv4. Well, some NAT64 devices then put garbage in the checksum, which made endpoints very unhappy.
Some other devices decided: I am a stateful firewall, but when I permit outgoing ESP traffic I'm not going to create any state for the return flow, because for state you need port numbers; there are no port numbers in ESP, so, no, just drop it on the floor. Well, that was easy to fix ‑‑ fortunately, it was just an explicit rule. But still...
Fragmentation. Surprise, surprise: the IPv6 header is 20 bytes longer than the IPv4 one. So, inevitably, if your NAT64 device receives a 1,500‑byte packet from the Internet with the DF bit set to zero, the NAT64 device will fragment it ‑‑ unless you have more than 1,500 bytes of MTU in your v6 infrastructure, but I know that some hosts don't like packets bigger than 1,500. What's going to happen? Your NAT64 will create two packets, with fragment headers. And by the way, some devices don't even use the 1,500 packet size; they use 1280 as the default fragmentation size in this case. Fortunately, it's configurable. But again, it's something I was not aware of until people started to complain about some packets disappearing in the wild. And again, for endpoints, as I say, it's UDP applications. If you start using v6 for network infrastructure, RADIUS becomes a problem, because RADIUS, like DNS, wants to get an answer in one packet, and if you start using certificates for 802.1X, well, the packet will definitely be bigger than you can fit into the MTU.
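The arithmetic behind this: translating a 20‑byte IPv4 header into a 40‑byte IPv6 header grows the packet, and every resulting fragment then also carries an 8‑byte fragment header. A sketch of the sizes involved ‑‑ the fragmentation strategy of a real NAT64 may differ (e.g. the 1280‑byte default mentioned above), this just shows the common 1500‑byte case:

```python
IPV4_HDR = 20
IPV6_HDR = 40
FRAG_HDR = 8

def nat64_fragments(v4_total_len: int, v6_mtu: int = 1500) -> list:
    """Sizes of the IPv6 packets a NAT64 emits for a DF=0 IPv4 packet.
    If the translated packet exceeds the v6 MTU, the NAT64 must fragment,
    and each fragment carries an extra fragment header."""
    payload = v4_total_len - IPV4_HDR
    if payload + IPV6_HDR <= v6_mtu:
        return [payload + IPV6_HDR]          # fits: one plain v6 packet
    room = (v6_mtu - IPV6_HDR - FRAG_HDR) // 8 * 8  # fragment data, 8-byte aligned
    sizes = []
    while payload > 0:
        chunk = min(room, payload)
        sizes.append(IPV6_HDR + FRAG_HDR + chunk)
        payload -= chunk
    return sizes

print(nat64_fragments(1200))  # [1220] -- still fits in one packet
print(nat64_fragments(1500))  # [1496, 80] -- two fragments instead of one
```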
I'll talk about this more later; let's go back to the summary slide.
So, yeah, let's talk about NAT64. This one probably deserves a much longer discussion. I started getting complaints about ChromeOS devices having connectivity issues, and kind of strange connectivity issues ‑‑ it's not like nothing works, which is normally easy to troubleshoot; some applications were losing connectivity from time to time. What happened is, as I say, ChromeOS is actually a very complicated thing. Inside, it runs a lot of virtual machines and namespaces, a lot of them. Each of them obviously gets an IPv6 address, sometimes more than one. So you can easily see your ChromeOS laptop ‑‑ a Chromebook ‑‑ using like 10 or even 20 IPv6 addresses at the same time. Network devices are trying to be very, very helpful: they try to do neighbour discovery proxy, they try to respond on behalf of other devices, and for that they keep a table of which client is using which IPv6 addresses. The size of that table might unpleasantly surprise you. Some vendors use the magic number 7 ‑‑ and they asked: oh, have you seen more than seven addresses on a device? I'm like: yeah, I have seen 20.
And the problem is that troubleshooting this is a nightmare, because when the N‑plus‑one address appears, it just fails to be installed in the table. So everything works except that one address. Troubleshooting that stuff was real fun. And again, when some address disappears from the table, that address starts working and some other address starts falling off the network. So, job security ‑‑ I was very busy for a while.
Also, let's say you have a VXLAN infrastructure, which probably means every address is a route. Now you suddenly have 20 times more routes in your routing table than you planned for with IPv4. And this translates to money, because it means completely different hardware now. So it does not scale very well, right; people start complaining about this IPv6 thing because it's very expensive.
And this is actually a very similar problem to the one on the slide before, about a device extending connectivity behind itself. So, what we are trying to do ‑‑ this is work in progress ‑‑ is to start delegating prefixes with DHCPv6 PD to such devices, so that instead of maintaining a huge neighbour discovery cache and VXLAN table with 10 or 20 or 30 addresses per device, we only need to maintain a single route, and a single entry for the device's link‑local address, whatever number of devices I have.
So, another interesting thing. Where is Jan? I know he loves this. The renumbering case. You have probably heard about it in the ISP environment: your CPE reboots, gets another prefix via DHCPv6 PD, your device does not realise that the prefix changed, and now you have an old prefix and a new prefix and nothing works.
Surprisingly, we have a similar problem in the enterprise. With 802.1X, a desktop boots up and gets into, let's say, the machine VLAN, and gets an address from that VLAN. Then a user comes to the office, logs in, 802.1X authentication happens, the device moves to another VLAN and gets an address from that VLAN. The old one stays. Theoretically, the machine should be smart enough to realise that when 802.1X authentication happens you need to do something with your network stack. I had this discussion with Microsoft support back in the early 2000s. For IPv6 it's still the same ‑‑ NetworkManager is a total nightmare in this case ‑‑ so basically, in half of the cases, you end up with two subnets on your interface, two addresses, and only one of them really works.
However, there is a solution for this ‑‑ because, again, this is a case which we never noticed until we turned v4 off.
There is an RFC for Detecting Network Attachment, which says: if your device disconnects and reconnects, for example on a wi‑fi network, at Layer 2, the device tries to do the right thing: am I still on the same network? Do I need to completely refresh my network stack, or am I fine using the old network addresses? The first thing they all do is check whether the default router changed ‑‑ whether the link‑local and MAC address of the default router are still the same. Theoretically, after that, they still need to get a router advertisement and compare prefixes, but not all of them do that.
So, basically, a lot of devices kind of assume that when your Layer 3 network changes, your router changes as well. However, if you use VRRP, your virtual MAC address depends only on your group ID ‑‑ and is there any reason to have more than one group ID? I'm just using the same number everywhere, right, why would I have different numbers? And actually there are not so many of them, strictly speaking ‑‑ just 255.
Some devices also violate the RFC and generate the virtual link‑local address from the virtual MAC address, which means every network segment that uses the same VRRP group ID will have the same link‑local address and the same MAC address everywhere.
Well, never been a problem, right.
Until another renumbering case appears. I have two buildings, two different subnets, but they are too close to each other. When you walk from one building to another, you actually move between two subnets. But again: same VRRP group ID everywhere, same link‑local, same MAC address. The device says: I am definitely on the same SSID, the same network, I am keeping the old network config. As a result, nothing works, right, no connectivity.
People start complaining. What can we do? Oh, there is a solution, actually. The default address selection RFC says that when you select ‑‑ you have multiple addresses, it's a fundamental thing in IPv6, right? If you have multiple addresses, you need to select which one to use, and one of the rules ‑‑ rule 5.5 ‑‑ says that if you have multiple routers, and each router advertises a prefix, and you have selected a router, please use a source address from the prefix advertised by that router. It makes sense for multihoming, right: I have two ISPs, I always want to send traffic to ISP A with a source address from ISP A's address space.
So, if the host implements this, then all your renumbering cases are very easily solved. You basically make sure that when you get a new prefix, it comes from what looks like a different router. Which means every VLAN in my network needs its own router link‑local address ‑‑ it sounds crazy, globally unique link‑local addresses. Well, I could probably make them just locally unique, but why bother, I make them globally unique. What I'm doing in my network, because it's easier, is deriving the link‑local address from the global /64 prefix of the VLAN and the VRRP interface ID. Very easy ‑‑ one line of code change. It works like a charm, I can tell you. So many people are now happy.
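To make the fix concrete, here is a sketch of both halves. The standard VRRP virtual MAC (RFC 5798) yields the same EUI‑64 link‑local on every VLAN with the same group ID, while deriving the link‑local from each VLAN's global /64 ‑‑ one possible reading of the approach described above, not the exact vendor configuration ‑‑ makes it unique per VLAN. The prefixes are documentation examples:

```python
import ipaddress

def vrrp_virtual_mac(vrid: int) -> str:
    """RFC 5798: IPv6 VRRP virtual MAC is 00-00-5e-00-02-{VRID}."""
    return f"00:00:5e:00:02:{vrid:02x}"

def eui64_link_local(mac: str) -> ipaddress.IPv6Address:
    """Modified EUI-64 link-local derived from a MAC address."""
    b = bytearray(int(x, 16) for x in mac.split(":"))
    b[0] ^= 0x02                                    # flip universal/local bit
    iid = bytes(b[:3]) + b"\xff\xfe" + bytes(b[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + iid)

# Same group ID everywhere -> identical link-local on every VLAN:
print(eui64_link_local(vrrp_virtual_mac(1)))   # fe80::200:5eff:fe00:201

def unique_link_local(global_prefix: str) -> ipaddress.IPv6Address:
    """Derive the router's link-local from the VLAN's global /64, so every
    VLAN gets a globally unique link-local and hosts can apply rule 5.5."""
    p = ipaddress.ip_network(global_prefix)
    iid = int(p.network_address) >> 64             # top 64 bits of the prefix
    return ipaddress.IPv6Address((0xFE80 << 112) | iid)

print(unique_link_local("2001:db8:0:a::/64"))  # fe80::2001:db8:0:a
print(unique_link_local("2001:db8:0:b::/64"))  # fe80::2001:db8:0:b
```

With per‑VLAN link‑locals, a host walking between the two buildings sees a new default router and refreshes its configuration instead of keeping the stale prefix.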
So, yeah, and I think it basically applies to CPE cases as well, as long as your link‑local address is a function of your prefixes. By the way, rule 5.5 is supported by ‑‑ Jan, you are late ‑‑ I was talking about renumbering cases, I'm not going to repeat it.
Rule 5.5 has been supported by Microsoft for a long time, macOS supports it, and I know there is work in progress for Linux, so we might actually get most operating systems covering it.
What other issues have we discovered?
So, by the way, until recently macOS was doing a very funny thing: it was using the 464XLAT CLAT address ‑‑ a special one ‑‑ and considering it to be a normal address, because it's assigned on the interface, and it was sending DNS packets from that source. Well, it freaked out some wireless devices: oh my God, that's definitely a spoofed address, it's a bad client, I'll just block it and you are not getting on my network. It also makes VXLAN networks very, very unhappy, because they see the same IPv4 address start moving between all ports in the network at the same time. CPU goes through the roof; operational people are also unhappy.
So, strictly speaking, the only issue you would probably see now is a cosmetic issue with traceroute if you do it from your MacBooks. By the way, this network is using the same approach, so most of your MacBooks are probably v6 only right now. You probably would not notice anything, but this is work in progress. Besides that: to make this happen, we had to publish a number of RFCs to document this stuff. And thanks to the open source community, it was implemented in various operating systems even before I was able to deploy it in my network.
There is some ongoing work in v6ops documenting the stuff I was talking about, so feel free to read the drafts at this address. For example, we are talking about how to enable 464XLAT, when to enable it and when to disable it, so all developers at least have some common guidelines.
Next step for me ‑‑ because, as I said, I'm mostly done, famous last words, with Apple devices. ChromeOS, starting from version 114, supports option 108; it's disabled by default, you can go into settings and enable it. For Linux, you can enable that stuff manually, but there is no 464XLAT CLAT implementation in the standard Linux packages, so there's some work to be done there as well. So that's basically my next step.
And we have time for questions.
BRIAN NISBET: Thank you very much. Questions?
SPEAKER: Rinse Kloek speaking for myself. Very nice presentation. Thank you. One simple question: Do you expect Microsoft to support 108 and 464XLAT shortly?
JEN LINKOVA: Can you define "shortly"? There is work in progress on this, let me put it that way. I can't speak for them, but I have asked. So, yeah, very good point actually ‑‑ I understand that enterprise people would be mostly interested in Microsoft. So if you have a support contract, please ask, because it would make everyone's life easier; the more people ask for it, the easier it is for them internally to justify that work. So if you would like to see it, ask your Microsoft representatives, yeah.
SPEAKER: As a person who is looking for the means and the why to implement IPv6, I would like to thank you for your experiences and for sharing them with us.
JEN LINKOVA: My pleasure.
SPEAKER: We have a question from Elvis. "Can I sell your reclaimed IPv4? Just joking."
JEN LINKOVA: Private ones?
SPEAKER: "Seriously now, since you have done most of the work, how hard would it be to replicate this transition in a company offices and infrastructure?"
JEN LINKOVA: I would say it's getting better and better. As I say, I believe that right now ‑‑ it depends on your client base ‑‑ if you have up‑to‑date macOS, Sonoma, and reasonably up‑to‑date phones, you can do it easily, and the beauty is that you do not really care about legacy devices, because legacy devices will just be dual stack, right. So your most problematic part might be if you have, say, macOS older than 13, where you might see some issues. Again, there are no showstoppers; there are work‑arounds for everything we found, and fixes in Sonoma. Network infrastructure ‑‑ I guess it depends. We found some bugs, we got them fixed, right, so I assume your network infrastructure does not have those problems. I would say it should be reasonably easy for people to do this. Cisco and Juniper support PREF64 in recent releases, I just don't remember the version numbers. You can do this now. And again, the great thing about it is that your devices which do not work very well with v6 will just be dual stack. So, yeah, it can be done.
BRIAN NISBET: He asked a follow‑up question: "If this is very simple, how long until I need to find a new job?"
JEN LINKOVA: No, I am not concerned about that. You see, I just have a slide ‑‑ let me tell you a story. When I started at Google in 2009, it was 0.2% of v6 traffic or something, and people were saying it's just you and Lorenzo, nobody else is using this. So I want to get to the point where I look at v4 traffic and I'm like: who cares, nobody is using that. We are not there yet. So, I will be around for a while. I am not retiring yet.
BRIAN NISBET: And Elvis will have a job for a while, which I think is his main concern.
JEN LINKOVA: Nobody is getting out of a job here. Stay in the room, please.
BRIAN NISBET: Okay, I think there are no more questions. Thank you once again for a fascinating talk.
So, just thanks to all our speakers in this session. Before you all rush off to lunch, I would remind you please to rate the talks. There is also still an opportunity to nominate yourself ‑‑ or, with their consent, somebody else ‑‑ for the Programme Committee, and we'll be talking about that later. And the NCC have asked me to tell you that the meeting T‑shirts are available downstairs ‑‑ there is a maze of twisty passageways, do not get eaten by a grue ‑‑ for your fashion plans for the rest of the week.
Thank you all very much.
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC