Tuesday, 28th November 2023
At 9 a.m.:
CLARA WADE: Good morning, everybody. So, welcome to the Tuesday plenary. The first speaker we have today is Robert Beverly who will be presenting remotely about geo‑auditing RIR address registrations.
ROBERT BEVERLY: Hi, everyone. I will go ahead and start. I am going to take that as a yes, I hear my own echo, so hello, everyone. This is joint work with my ‑‑ and I am here to tell you about some of the work that we are doing, geo‑auditing RIR address registration. So let me just start with a bit of background on the what and the why of this work. As I think we all know, regional Internet registries are the organisations that are responsible for receiving large chunks of contiguous address blocks from IANA and then subsequently distributing those addresses. Today there are five RIRs, and the RIRs have regional responsibilities, so of course there's ARIN, APNIC and, today, we are in the room with RIPE.
And if we take a moment to look at some of the documentation that specifies the role of RIRs, there are a couple of documents of interest. The first is of course the ARIN Number Resource Policy Manual; this appears in some of the RIPE documentation and the NRO documentation as well: the primary role of RIRs is to manage and distribute public Internet address space within their respective regions. RFC 7020 talks about the explicit goals of Internet number registries, including being a body that can allocate resources ‑‑ of course IP addresses are a finite resource and they must be unique ‑‑ and it talks about doing hierarchical allocation... and operational needs. What is of interest here is that it says the core requirement is to maintain a registry of allocations, to provide accurate registration information of those allocations in order to meet a variety of operational requirements.
So, a quick overview on our work, what I'm going to talk about this morning. We did the following, first of all we examined IPv4 address registry information across all five RIRs. Then we performed active latency based IP geolocation of allocated IPv4 prefixes, in order to determine where, physically, these allocated prefixes were being used.
We then developed a taxonomy of prefix registration geo‑consistency. What we mean by this is we are comparing the RIR's service region to things like the registration info and to where the prefix is physically located.
And then finally, we used this taxonomy to complete an audit of the prefix registration consistency, to understand how geo‑consistent they are across the different RIRs.
One of the first things you might be thinking, sitting in the room, is: Rob, wait, what are you talking about? Out of region use is allowed!
Let me make a couple of things clear. First of all, we are not looking at inter‑RIR transfers; these are logged, other people have done research on them, and of course they are vetted by the RIRs. Instead, what we are looking for are instances of out of region use that can only be uncovered via measurement.
Secondly, we adopt a rather conservative view of what we mean by out of region use. For instance, if the prefix is used out of region, one of the things we ask is: is it at least consistent with the registered organisation's location?
And of course it's complicated, as with all things on the Internet. Different RIRs have very different policies, so let's take a look at that quickly. Here is a quick snippet from the NRO comparative policy overview that looks at some of the policies for the five RIRs, and you will note these vary quite a bit in how strict they are with respect to out of region use. So for instance, ARIN says "to receive resources, ARIN requests organisations to verify that it plans on using the resources within the ARIN region".
APNIC is one of the more loose organisations, I would say, where they only require someone to have an actual account with APNIC and they explicitly allow out of region use. They say they "permit account holders located within the APNIC service region to use APNIC‑delegated resources out of region."
RIPE says organisations must have an active element located in the RIPE NCC service region, and AFRINIC is one of the most conservative, saying they "require organisations or persons to be legally present and the infrastructure from which the services are originating must be located in AFRINIC's service region."
So, our motivation for doing this work was threefold. We wanted to increase transparency and help the community understand where this scarce resource was being used.
Secondly, we wanted to quantify the extent to which the registry information is accurate and serves the operational needs specified in some of these RFCs, and third, to inform some of the ongoing discussions that have occurred recently about in region address use and policy.
And I threw up a couple of examples of this that have made some headlines recently within the media and across the various RIRs and different NOGs.
I also want to talk about what this talk is not going to be about. We explicitly recognise that there are economics to IP addresses and of course a need for efficient and equitable use of IP addresses, and we also understand that there are operational realities of the real world ‑‑ things that are done just out of the various constraints that are in place or simply for expedience. So our goal is really to shed quantitative light on... we are not claiming violations and we are not advocating for any policy changes; we are really just trying to quantify what's going on.
Okay. Let me talk about how we are doing this, and let me start with an example to make sure everyone is on the same page. I think most of us have looked at some Whois records, but here is a record of a /24 that comes from a /8 that IANA has allocated to RIPE. This is an example where the registered owner is in Hong Kong, so they are outside of RIPE's region, and the question we ask is: where is this /24 physically? There are a couple of different possibilities. First of all, it could be physically in RIPE's region, it could be in APNIC's region, or it could be in neither RIPE's region nor APNIC's region.
Again, we are taking a very conservative view of what it means to be out of region and so we say okay, this means that it's consistent if it's in the RIPE region, it's consistent if it's in APNIC, but it would be called inconsistent by our taxonomy if it's in neither RIPE nor APNIC's region.
I want to just talk briefly about this taxonomy that we have established, which essentially goes from fully geo‑consistent, where all of the different values from the RIR match, to fully inconsistent. What are these different values? For a given prefix we are comparing three different values: first of all, the RIR that's responsible for allocating the prefix; secondly, the RIR that's responsible for the country of the registered organisation; and then finally, this notion of doing geolocation to figure out where the prefix is ‑‑ we figure out which RIR is responsible for the inferred physical geolocation of the prefix. So if all of these match we call it fully geo‑consistent, and if none of them match, i.e. all three are different, it would be fully inconsistent, and then we have a range of different consistencies in the middle.
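The taxonomy described above compares three RIR values per prefix. A minimal sketch of that comparison in Python (the function and category names are ours, not the authors' code, and the paper's exact intermediate categories may differ):

```python
def classify(alloc_rir: str, org_rir: str, geo_rir: str) -> str:
    """Classify a prefix by how many of the three RIR values agree:
    the allocating RIR, the RIR of the registered organisation's country,
    and the RIR of the inferred physical geolocation."""
    if alloc_rir == org_rir == geo_rir:
        return "fully consistent"
    if alloc_rir != org_rir and alloc_rir != geo_rir and org_rir != geo_rir:
        return "fully inconsistent"
    if geo_rir in (alloc_rir, org_rir):
        return "partially consistent"  # physical location matches one of the two
    return "geo-inconsistent"          # registration values agree, location differs
```

For the Hong Kong example above: allocated by RIPE, organisation in APNIC's region, and if the prefix geolocates to neither region, all three values differ and the prefix is fully inconsistent.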
How are we doing this? I am going to go through each of these steps in a bit more detail, but let me give you the high level of what we're doing.
First of all, we parsed the bulk Whois records from each of the RIRs ‑‑ where we have extracted all of the relevant information.
Secondly, we find responsive addresses that are within each of these prefixes, using the hit‑list. This will allow us to drive a probing logic, that is tied to RIPE Atlas where we use it to select RIPE Atlas nodes that are within the country of the registered prefix as well as RIPE Atlas nodes that are located on each continent to represent the different RIRs.
All of these results then get fed back into the database, where we use an algorithm to identify the most likely continent, which allows us to identify which RIR is actually responsible for the geolocation of this prefix, and then that gives us our prefix RIR geo‑consistency.
So, going through each of these steps in a bit more detail, the first part is to parse the bulk Whois records. This is a bit of work because the different RIRs' schemas all have different idiosyncrasies, so we pull all of this out to figure out which records are tied to which prefixes. Of course we parse the prefix and the registered organisation's mailing address, and we are careful to ignore certain records ‑‑ things that are transferred or things that are non‑managed records. Down in the lower right here is an example of something you might see in the RIPE Whois database which actually specifies it's in the RIPE database but it's not maintained or managed by RIPE.
So this first step in our methodology gives us these first two values of our taxonomy. The registered RIR and the RIR responsible for the organisation that has registered the prefix.
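As a rough illustration of this parsing step (the real RIR schemas differ both from each other and from this simplified RPSL‑style sketch; the "NOT-MANAGED" status marker here is hypothetical, not a real registry value):

```python
def parse_record(text: str):
    """Parse a simplified RPSL-style record into key/value pairs,
    keeping the first value per key. Returns None for records we should
    ignore, e.g. placeholders the RIR does not actually manage (the
    'NOT-MANAGED' status here is illustrative only)."""
    rec = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            rec.setdefault(key.strip(), value.strip())
    if rec.get("status", "").upper() == "NOT-MANAGED":
        return None
    return rec
```

A real parser also has to handle continuation lines, comments and per‑RIR attribute names, which is exactly the idiosyncrasy the talk complains about.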
So just looking at that data, before we do any of the actual measurements to find the geolocation, we can get some bulk Whois stats on what's going on and how many of these prefixes actually belong to organisations that are out of region according to their mailing address. First of all, the number of prefixes that we look at in total from all of the bulk Whois is approximately 8 million, so this is much more granular than BGP, for instance.
You will see that out of all of these, in terms of the number of prefixes, the percent of prefixes that have an out of region owner, ARIN has 2.5%, RIPE has .8% and AFRINIC has actually the highest percent as a fraction of prefixes at 14.2%.
If we look at where these are going, we can see that, for instance, most of the ARIN addresses that are registered to organisations outside of the ARIN region are in countries that belong to APNIC or RIPE, whereas for AFRINIC, most of the countries belong to APNIC and ARIN. And again, at this stage, I want to emphasise that this inter‑RIR registration is fairly common and in many cases may be explicitly allowed.
So, the second part of our methodology is to find targets within each of these prefixes, in order to do our geolocation and our latency‑based measurements. We utilised a "hit‑list" of known or likely responsive IPv4 addresses and then we use prefix matching, so we match hit‑list addresses to the RIR prefix to which they belong. Of course we have to ignore prefixes without any responsive addresses, and we have to ignore Anycast prefixes, because geolocating those wouldn't make sense.
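The hit‑list matching step can be sketched with Python's standard ipaddress module (a simplified stand‑in for whatever radix structure the authors actually use; an address is assigned to its most specific covering prefix):

```python
import ipaddress

def match_targets(prefixes, hitlist):
    """Map each registered prefix to the hit-list addresses inside it,
    using longest-prefix match so an address goes to its most specific
    covering prefix."""
    # Sort most-specific first so the first containing network wins.
    nets = sorted((ipaddress.ip_network(p) for p in prefixes),
                  key=lambda n: n.prefixlen, reverse=True)
    matches = {str(n): [] for n in nets}
    for addr in hitlist:
        ip = ipaddress.ip_address(addr)
        for net in nets:
            if ip in net:
                matches[str(net)].append(addr)
                break  # addresses outside every prefix are simply dropped
    return matches
```

Prefixes that end up with an empty list here are the ones without responsive targets, which the methodology then ignores.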
For this work that I am going to describe today, we randomly sampled 10,000 non‑Anycast prefixes with responsive targets from each of the RIRs, so we have a total of 50,000 prefixes for the (inaudible) I am going to talk about today.
I want to take a quick diversion to thank the Atlas folks. Atlas is a huge and valuable resource to the community, and it's really been essential to our work and to doing what you are hearing today. One of the things that is really nice about it: first of all, it has extensive physical coverage, especially in country. This is critical because we really need some of these latency‑based measurements to be in country, otherwise it really doesn't work the way we might want to, especially for poorly connected countries. It also has a very sane and usable RESTful API, and the very nice property of being FAIR: findable, accessible, interoperable and reusable. If you want to look at our measurement results, we have got them tagged with different values so they are easy to find, and you can look at all of them yourselves.
So, the third part of our methodology is to do this delay‑based IP geolocation using RIPE Atlas. We use a total of 20 RIPE Atlas nodes to send probes to a target address in each prefix. We selected the Atlas nodes in the following way: we select three of them within each RIR region, which gives us 15 total vantage points, and we select five nodes within the registered country. There are some more details in a technical paper, but in essence what's going to happen here is that the inferred geolocation is going to be the RIR that's responsible for whichever node returns the minimum round‑trip time.
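In essence, that inference step reduces to taking the RIR region of the fastest vantage point. A toy version (our own simplification of the approach just described, ignoring the refinement round):

```python
def infer_rir(measurements):
    """measurements: (rir_region_of_vantage_point, min_rtt_ms) pairs,
    one per Atlas vantage point. The inferred geolocation is the RIR
    region of the vantage point with the overall minimum RTT."""
    region, _ = min(measurements, key=lambda m: m[1])
    return region
```

With numbers like those in the CDN case study later in the talk ‑‑ 129 ms from the UK, 149 ms from elsewhere in the RIPE region, roughly 250 ms from Africa, 71 ms from Canada ‑‑ the prefix is placed in ARIN's region.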
Of course there's some limitations. So let's run through these quickly. First of all, we are selecting 10,000 prefixes at random from each RIR. The reason we did that in this initial work is that of course there's drastically different numbers of prefixes in each RIR and we wanted to get an equal number of prefixes for these measurements in each RIR so we could do something, comparative between each of the RIRs.
Of course there may be no ICMP responsive target in the prefix, in which case we can't do anything; that's why we are randomly selecting 10,000 that have a responsive target in the prefix. It may be possible that there are no Atlas probes within the prefix's registered country, but that doesn't happen, at least for the prefixes that we consider, because RIPE Atlas has 43,000 nodes and 87% of all countries are actually covered. In terms of the geolocation, of course the Atlas node itself may indicate the wrong physical location; to try to deal with this, we have multiple RIPE Atlas nodes and we also do a refinement round, which means that once we find an inconsistency we do a second round using different Atlas nodes to try to verify that inconsistency.
Again, the registration country may of course be a corporate headquarters while the infrastructure is located somewhere else, but remember, we are using this registered country as a second chance, in essence, to be consistent. All the work that we are doing would stand if we were only looking at the RIRs and the geolocation. And then of course it could be the case that the prefixes themselves are internally inconsistent; I have a quick slide on that, and it's something we are looking at currently.
Now, why are we doing latency‑based geolocation? There are many reasons; the primary reason is that latency‑based geolocation relies on physical signal propagation constraints, and it's hard to beat physics, especially when we have small RTTs. Things like BGP and AS origin can obscure the true location, and the problem with those sources is not only that they are known to contain inaccuracies, but that they often use the Whois themselves, so it wouldn't make any sense to measure something that's already using Whois when we are trying to validate Whois itself.
So to minimise error, again, we are using this latency‑based geolocation, which is generally known to be accurate especially at the continent and country granularity, and it's sound for proving geo‑consistency; again, if any geo‑inconsistency is found, we select new nodes and repeat in this refinement round.
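The "hard to beat physics" argument can be made concrete: an RTT puts a hard upper bound on the distance to the target, assuming signals travel at no more than roughly two thirds of the speed of light in fibre. A back‑of‑the‑envelope sketch (ours, not the paper's exact model):

```python
def max_distance_km(rtt_ms, speed_fraction=2/3):
    """Upper bound on the vantage-point-to-target distance implied by a
    round-trip time, assuming propagation at most `speed_fraction` of c
    (about 2/3 in optical fibre). Half the RTT is the one-way time."""
    C_KM_PER_MS = 299.792458  # speed of light in vacuum, km per millisecond
    return (rtt_ms / 2) * C_KM_PER_MS * speed_fraction
```

A 7 ms RTT bounds the target to within roughly 700 km of the vantage point, whereas 129 ms permits nearly 13,000 km ‑‑ which is why a 129 ms "in‑country" measurement is so suspicious.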
So let's talk about some of the fun stuff, the results.
I like to start with a picture and actually walk through an example. So here is a fun case study: here is a /16 that's registered within RIPE to a big CDN, and it has a corporate headquarters in the UK. What is our system going to do? The first thing it's going to do is select five Atlas nodes in the UK; when it did that, it turns out it got a minimum RTT of 129 milliseconds which, as you might know, is rather surprising for something in country. The system also selects nodes that are in the RIPE service region; in this particular example it selected some nodes that were in Germany and France and elsewhere and, of those, it actually had a minimum RTT that was even higher, 149 milliseconds. The Atlas nodes in Africa yielded a minimum RTT of about a quarter of a second, so even higher, but then when it got to North America we started getting lower RTTs; indeed, the nodes in Canada started giving us 71 milliseconds and, as this refinement proceeded, it turns out that we were able to constrain it to a node in Phoenix, Arizona that had a 7 millisecond RTT. So in this particular example, this is something that has a RIPE registered prefix, it has a RIPE organisation because the CDN's headquarters address is registered in the UK, but it's actually physically located in ARIN's service region. So in our taxonomy, this would be something that's geo‑inconsistent.
Here are our findings, and there is some good news here: 96% of all the prefixes that we examined are fully consistent. The good news about that is that it implies the things that are inconsistent may be possible to look at in more detail, or to do some more manual investigation of what's going on, especially for the things that are fully geo‑inconsistent.
If we look across the different RIRs, it turns out RIPE has the highest consistency with 98.1%, whereas AFRINIC has the lowest consistency, around 81.3%.
AFRINIC has the largest fraction of inconsistencies and these are dominated by prefixes physically located in Europe and in China, and if we look at what's going on in RIPE the primary contributor are North America, so Canada, Mexico and the United States.
I mentioned some of our current work: we are looking at intra‑prefix inference consistency, and we are extending this to IPv6 prefix registrations as we speak. We are doing a bunch of analysis, looking at correlations with the registration age, the prefix length and different status attributes ‑‑ some of this is actually quite difficult using the information that's available in the bulk Whois ‑‑ and looking at the ASes that are responsible for the most inconsistencies, and we are working with some of the RIRs for validation and we would love to work with RIPE to do some validation as well.
Here is an example of intra‑prefix inference consistency. This was an example where we looked at two targets on the hit‑list that are within a single registered prefix; this prefix is a /20 and it's actually a large gaming provider. Here our two different measurements to two different targets inferred different regions: one target was actually in the RIPE service region whereas the second target was in the ARIN service region. When we look at this in some more detail, it turns out that this is explained by the fact that they are subnetting within BGP: this is a /20 that's registered in the Whois record, but BGP is actually subdividing it into different subnets, so our current measurements are taking this into account as well, in order to separate these from the other prefixes that can't be explained by BGP.
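Separating BGP‑explainable cases can be sketched as a simple check for more‑specific announcements inside the registered prefix (illustrative only; the actual analysis presumably works from full BGP table dumps):

```python
import ipaddress

def bgp_subdivides(registered, announced):
    """True if any announced prefix is a strict more-specific of the
    registered prefix, i.e. BGP subdivides the registration -- which can
    explain differing geolocations for targets within one Whois record."""
    reg = ipaddress.ip_network(registered)
    return any(
        ipaddress.ip_network(a).subnet_of(reg)
        and ipaddress.ip_network(a).prefixlen > reg.prefixlen
        for a in announced
    )
```

A registered /20 announced as several /24s, like the gaming provider above, would return True and be set aside from the unexplained inconsistencies.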
All right. So with that, the takeaways ‑‑ the things we want you to take away. First of all, different RIRs have very different out of region address use policies, but today there's really limited visibility on where the resources are used, especially post allocation. So some RIRs are able to do, and do, quite a bit of vetting once the resource is requested, but then don't have a tonne of visibility after the resource has been allocated, and that's where some of our work comes in.
The RIR allocations are largely geo‑consistent, with some notable exceptions. Again, we see this as a good thing, because it implies that most of the system is working really well, and there's a small number of prefixes that are indeed very inconsistent and probably worth looking into in some more detail.
Third, geo‑inconsistencies raise particular operational and security concerns that suggest the registration information should be updated or at least investigated.
And then finally, the RIR Whois records are quite difficult to parse because of their inconsistent schemas; we are hoping that will be fixed.
Thanks. If you are interested in more detail about this, we have a paper on arXiv ‑‑ here is the link. We are hoping this is the first quantitative geo‑audit of RIR registration information. All of our data is open and public for transparency, in RIPE Atlas, and again we are doing future work and we are very interested in your feedback and/or flames.
CLARA WADE: Thank you, Robert, we have some time for questions, if you come up to the mic, please remember to state your name and affiliation.
SANDER STEFFANN: 6connect, and old Address Policy Working Group chair. Thanks for doing this research, it's really interesting. I am actually surprised that RIPE hit 98.1%, so I think congrats to the staff of those RIRs.
And I think taking into account the BGP stuff is going to be really important, because ‑‑ is a US company, we have a block from ARIN but we use the addresses for our US data centres as well as Dutch and Slovenian ones, and we have a couple of prefixes for Anycast, so yeah, we are one of those /20s that's spread all over the world. So I am really interested to see what comes out of your next step of research. Thank you.
ROBERT BEVERLY: Awesome, thank you.
MARCO SCHMIDT: Thank you for this interesting presentation. At the ‑‑ we receive quite a lot of requests from companies that are based outside our region, and we are wondering what happens to those resources. On that part, I want of course to extend our offer to help you with the validation, because you mentioned that you are looking forward to that. And I have also one question: did you already research what the reasons are for these inconsistencies, especially in AFRINIC, and do you have any idea or comment on these reasons?
ROBERT BEVERLY: So, of course, in some regions it's quite politically sensitive. We have looked at some of those reasons but are trying to stay away from the political aspects of it. We have looked at some of the ASes that contribute most of the inconsistencies, things like that, but at this point, before we say anything more, we were really hoping to get some more validation from the RIRs, so that's wonderful and we would be delighted to follow up with you.
CLARA WADE: Thank you, Marco. We have an online question from Elvis Velea.
ELVIS VELEA: For the inconsistencies found, did you further look at whether assignments are registered? That may explain, for example, an assignment of the /23.
ROBERT BEVERLY: Okay, so I'm not positive I understand the question. So maybe take that ‑‑ maybe best to take that off‑line. Of course everything that we are looking at in our view is registered, in the sense that it's in the Whois. And so it has a bona fide record, of course there's different kinds of records in there so there's provider‑independent or provider‑allocated but we are looking at those different features of the records, but for what I wanted today ‑‑ presented today, we are not differentiating between those.
CLARA WADE: All right. Thank you, Robert. Elvis is trying to speak; give us one moment so we can try and put him online. I think we are going to take that one off‑line. Thank you so much, Robert.
Next up we have Massimo Candela and Sasha Romijn, and they will be presenting on the present and future of IRRd.
MASSIMO CANDELA: Good morning, everybody. Thank you for the introduction, also thank you for being here in my city, finally.
So, I am a principal engineer at the Global IP Network of NTT, and today we will see the present and future of IRRd. IRRd stands for Internet Routing Registry daemon ‑‑ daemon because we are going to talk mostly about the software that is basically running many, if not most, of the routing registries that you are familiar with.
And together with me there is Sasha Romijn; she'll join me on stage in a bit. She is the developer and maintainer of this amazing software and she'll do the second part of this presentation.
So, briefly, what are we talking about? Routing registries ‑‑ I am pretty sure all of you know ‑‑ are essentially something like an extension of Whois that allows operators to exchange routing information, like route objects and routing policies, and this dataset needs to be managed and served by software. IRRd is the software that we are talking about, in particular IRRd version 4, and you can also see this logo with the blue elephant, this educated blue elephant. This software has the role of validating, cleaning and storing the data, and it also provides several ways to query this data and several ways to import and do mirroring, including by using a protocol called NRTM. About the ways to query this data and about NRTM, Sasha will tell you more.
Let's start with a bit of history. It was commissioned and funded by NTT in 2018, commissioned to Sasha. We used to have a previous version of IRRd, mostly maintained by NTT, but at some point it reached the end of its maintenance life, let's say, because it was really difficult to add new features ‑‑ it was an old piece of software ‑‑ so we decided to start from scratch: wipe it and start with a new, more modern language, go with a more flexible, future‑proof architecture, and have unit tests to cover the code as much as possible. We also decided that it had to be Open Source. In general we love Open Source: we use it, we support it, we also produce Open Source software ourselves ‑‑ for example BGPalerter for RPKI monitoring ‑‑ so we are not new to Open Source. One of the reasons we released it is that it was going to be more audited, and possibly we were going to essentially improve the routing registry ecosystem. So later on, other organisations started requesting new features directly to Sasha and providing support, and I put names here ‑‑ I hope I included all of them in this list.
So, now, who is using this software today? First of all, we are using it to run the NTT routing registry ‑‑ this was essentially the main reason why we did this ‑‑ but there is also the registry managed by Merit, so two of the largest, I would say, privately hosted routing registries. But there is also ARIN and LACNIC, and there is a conversation going on with APNIC as well, plus there are others. So you can already see that, in addition to the two large ones, NTT and Merit, there are two and potentially three out of the five regional registries, so it's an important piece of software ‑‑ that's why we care.
A few milestones. 2019: first release, feature parity and test coverage. Then 2020: more features, among which a first round of performance improvements ‑‑ we will see why that's important ‑‑ and the RPKI‑aware mode, where essentially data in the routing registry that conflicts with RPKI gets suppressed. Then in 2021: GraphQL to query the data, other APIs to interact with and create the data, and more features again. And then in 2023: version 4.4. This is the latest version, which came out really recently, and the main things that we achieved with this release are improved security of this important software and, again, performance.
Now, there is a total refactoring of authorisation and authentication, but to understand what's going on let me start from the basics.
So in general, the various objects in the routing registry are managed by a maintainer, and if there are various users in an organisation that have access to edit this data, in general they share a password on the maintainer. This is a problem because it means that several users share the same password, so it's not really auditable ‑‑ it's not a good approach.
The second thing is that if one of these users goes rogue, they can change the password and mess things up for everybody else, and the third is that having authentication information directly in the maintainer object is complicated for the software that has to manage it, because it has to be careful not to output any of this data in any query. In this refactoring, the authentication data is split off into another dataset and each user has their own account: even if in the end they edit the same maintainer, each user authenticates with their own account. In addition, there is the possibility for each user to have what we call here a scope, so essentially a user can be a full maintainer that can edit whatever objects, including the maintainer, or a bit below that, which means you can edit everything except for maintainer objects ‑‑ you cannot change any authentication data.
So in addition to this, we introduced a new "superuser" role that is mostly related to the organisation that hosts the routing registry ‑‑ we will see later some of the things that this superuser can do ‑‑ and API keys for programmatic access, so all the modern things that we needed. And it's important to mention that for the migration from 4.3 to 4.4 you don't have to really worry about any of this: nothing is going to break, it's absolutely compatible, and this enhanced security is opt‑in, so you will have to essentially start a migration process, which is a few e‑mails and links, before you can use it; otherwise, nothing is going to break.
So, the second thing in terms of security is safer person, role and maintainer data handling. Historically, if you have an object that refers to a person object and this person object at some point is removed, the reference is essentially broken and points to nothing. Somebody can go there and create a person object again with the same nic‑handle and claim he basically is the admin ‑‑ and it's even worse if somebody does this with a maintainer.
So already since version 4.0 there is a mitigation feature that says you cannot delete an object like a person if there is still a reference coming in; you have to first remove the references. This has been there since 4.0.
However, this can happen anyway and still happens ‑‑ we verified this in the data ‑‑ and it can happen because you are mirroring data from other data sources, or because this is something that occasionally needs to be done. For example, a person doesn't want to be in the registry and contacts the maintainer, the maintainer doesn't act on it, and, for privacy reasons, at some point the organisation that hosts the routing registry gets contacted and has to do something. This is where the superuser comes into play: the superuser is the only one that can actually break the reference and essentially delete the object.
Now, however, when the referring object gets its next update ‑‑ when somebody has to change something ‑‑ the reference must be fixed; it is not allowed to leave a broken reference on any update.
And even more important: nic‑handles and maintainer names cannot ever be reused inside the dataset. We prevent this, and since version 4.4 not only based on the routing data already available, but also based on the history: even if it's currently not referenced any more, if that name was used in the past it is not going to be allowed, and this is to prevent essentially the kind of squatting on references that we were talking about before.
Now, performance ‑‑ again, more performance. When you do queries in routing registries, a lot of these queries are hierarchical or recursive, and this can impact performance quite a bit. There is now in this version a data structure that is pre‑computed and stays in memory, that has bits and pieces of the data, and it's used to answer your queries much faster than before, including complex recursive ones like those on maintainers. We measured it to be 3 to 9 times faster and I have an example here: on the left is 4.3 and on the right is 4.4. There are two different queries, and you can see the same first query goes from 8 seconds to 0.9, so quite a reduction, and the second one, again one of those recursive queries, goes from 30 to 5 seconds. So, overall a great improvement in performance.
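The idea behind the pre‑computed in‑memory structure can be illustrated with a tiny reverse index ‑‑ built once, it answers "which objects does this maintainer maintain?" without a recursive scan per query (purely illustrative; IRRd's real data structure is more involved):

```python
from collections import defaultdict

def build_index(objects):
    """Precompute a reverse index from maintainer name to the names of
    objects it maintains. A one-time pass over the data replaces repeated
    per-query recursive scans (illustrative sketch, not IRRd's code)."""
    index = defaultdict(list)
    for obj in objects:
        for mnt in obj.get("mnt-by", []):
            index[mnt].append(obj["name"])
    return index
```

At query time a lookup in this index is a dictionary access, which is the kind of shift that turns an 8‑second recursive query into sub‑second.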
Now, before giving the mic to Sasha: nothing is written in stone, we are brainstorming about the next possible steps for the IRRd software. We would like to include support to integrate this with SSO solutions like Keycloak, another great project. This would allow organisations to easily integrate IRRd into their systems: a routing registry like RIPE could use its own SSO, for example, or users would be able to use PeeringDB as an identity provider and share basically the same account to log in. And we are also thinking about even more performance improvements, to shift those big queries from five seconds to possibly below one, if we manage.
So, anyway, this is just brainstorming, as I said. Please provide feedback if you want to give guidance, if you have feedback or if you want to contribute; remember, this is the link to the GitHub. And now I give the mic to Sasha, she is the real expert about this:
SASHA ROMIJN: I am going to talk about a few specific features of IRRd, because there is actually a lot more going on than just Port 43, which is our most well known feature. First, one of the things that distinguishes IRRd from, say, the RIPE database is that we have a lot of different users, many running authoritative databases, all with their own policies; NTT has different ideas than others. One of the goals is to include features that suit everyone's different needs but also remain compatible with everyone else's applications, which can be an interesting challenge. Sometimes that means I have implemented features such as asdot query support: if you weren't around 15 years ago, this was a way to write 4‑byte AS numbers, but it was still used extensively enough that it was a migration blocker for a major registry.
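As a quick illustration of the asdot notation mentioned here: an asdot AS number is written as `high.low`, where each half is a 16‑bit value, so the plain ("asplain") number is `high * 65536 + low`. A minimal conversion sketch:

```python
def asdot_to_asplain(asdot: str) -> int:
    """Convert 'AS1.10'-style asdot notation (high16.low16) to a plain 32-bit ASN."""
    body = asdot.upper().removeprefix("AS")
    if "." in body:
        high, low = (int(part) for part in body.split("."))
        if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
            raise ValueError(f"out of range: {asdot}")
        return high * 65536 + low
    return int(body)  # already asplain
```

So `AS1.10` is ASN 65546, and the first 4‑byte ASN, 65536, is `AS1.0` in asdot.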
IRRd is a big project: there are 17,000 lines of code and 15,000 lines of tests, and the tests make sure that, with people like me still here, there is a low rate of bugs released to production. There is also extensive documentation.
And one of the ways to look at the complexity is all the data flows in IRRd. There are a lot of ways that data comes into IRRd; I will not go through all of them in detail, but there is authoritative data, mainly, and various kinds of mirroring, with different validation modes to deal with data that is a little less precise but can maybe still be worked out, while authoritative is very strict. Massimo mentioned RPKI; you can also filter on scopes set by the authoritative operator, and you can do filtering that depends on the presence of objects in other registries. You can do very interesting things, like mirror suspension if you don't pay your bills in certain registries. And there are a number of query mechanisms; what I want to highlight is that there is now also a WebSocket stream where you get a continuous stream of parsed data of everything that happens in your IRRd.
I could go into any single one of these points for about an hour, probably, but I will spare you that, I would not be allowed back, so I am going to focus on some of the more widely interesting parts. First of all, querying beyond plain TCP. Our most well known interface is Port 43: you connect to it, you run some obscurely formatted queries, you have to read the docs to see how they work, and there is no authentication of any kind on this. There are a number of other issues with this: about 25 different queries, limited flexibility, no versioning, no way to know who you are even connected to or whether this data is from where you thought it was. So one thing that was introduced at some point is querying the same API over HTTPS, which has the benefit of authentication, with the exact same query format and the exact same input. That is the URL of my home testing instance; it might die at any point, and if you are too annoying I will block you.
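To make the "same query over HTTPS" concrete, here is a sketch using only the standard library. The `/v1/whois/` path matches IRRd's documented HTTPS whois endpoint at the time of writing, but check your instance's documentation if it 404s; the host name is made up.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def build_whois_url(base_url: str, query: str) -> str:
    """URL for a port-43-style query over HTTPS (IRRd's /v1/whois/ endpoint)."""
    return f"{base_url.rstrip('/')}/v1/whois/?{urlencode({'q': query})}"

def whois_over_https(base_url: str, query: str) -> str:
    """Run the exact same query format as port 43, but over HTTPS."""
    with urlopen(build_whois_url(base_url, query), timeout=10) as resp:
        return resp.read().decode(errors="replace")

if __name__ == "__main__":
    # Hypothetical instance; any IRRd serving the HTTPS API would work.
    print(whois_over_https("https://irrd.example.net", "!iAS-EXAMPLE,1"))
```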
But the most interesting one that we have now is GraphQL, which is by now a widely known format, sort of JSON: an interface that allows you to explore the IRR through inter‑object field relations. This is a query for all objects maintained by a certain maintainer, and one of the interesting things about it is that you get access to the parsed data that we have inside the database already, not just blocks of text, so the output you get is split into individual fields. RPSL has some very weird features that are not widely used, but you don't have to deal with all of that; the parser will pick it up already. We have all this metadata, and you can explore related objects: for certain objects that meet certain filters, look for all maintainer objects, then for all their admin‑c's, and find whether they have any addresses, and it will dig through the graph of IRR data. This has some scalability limitations currently; those can be fixed, but they haven't been yet. Also, especially to help deal with the latency of HTTPS, which is of course larger, you can run a lot of unrelated queries in the same request and get the data back in bulk. There is some limit on scalability, but it stretches pretty far. There is also a little playground included where you can play around with it and see what the fields are; it's a completely self‑documented API. It's actually not a very complex thing, just a little layer on top; there are client libraries that will help you and make it easier, but you can do it pretty easily with plain GraphQL and it will work.
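A maintainer query like the one described could be sent with nothing more than the standard library. The field and argument names below are a sketch from memory; the self‑documenting playground on an IRRd instance shows the real schema, and the endpoint path and host are assumptions.

```python
import json
from urllib.request import Request, urlopen

# Illustrative GraphQL query; verify field names against the instance's
# playground, which documents the actual schema.
QUERY = """
query ObjectsByMaintainer($mnt: [String!]) {
  rpslObjects(mntBy: $mnt) {
    rpslPk
    objectClass
    source
  }
}
"""

def build_payload(maintainer: str) -> bytes:
    """Wrap the query and variables in the standard GraphQL-over-HTTP JSON body."""
    return json.dumps({"query": QUERY, "variables": {"mnt": [maintainer]}}).encode()

def run(endpoint: str, maintainer: str) -> dict:
    req = Request(endpoint, data=build_payload(maintainer),
                  headers={"Content-Type": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)  # structured JSON, already parsed into fields

if __name__ == "__main__":
    # Hypothetical endpoint; IRRd serves GraphQL on its HTTPS interface.
    print(run("https://irrd.example.net/graphql", "MAINT-EXAMPLE"))
```

Because GraphQL batches naturally, many unrelated queries can go into one request body, which is how the latency of HTTPS gets amortised.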
Basically, every time someone asks me "can I query this or that, how would that work", I say GraphQL, because every time the answer is: you just write this query, it's done, and you get structured JSON data, every time. If you happen to know IRR Explorer, it's a frontend on top of this, with some extra data from BGP, so this one is really nice. There are not a lot of public interfaces yet; you can run your own mirror, and NTT is looking into opening up the HTTPS interface on their instance.
Second, I want to talk about NRTM version 4. Massimo mentioned there is mirroring and replication: we have one or two dozen registries and they mirror and replicate data from each other, one of the benefits being that you can query a single source and get a mix of data; some run local mirrors for performance, I don't know how many. This is all based on NRTM version 3; there was another RFC for distributing routing policies, but no one actually uses it, so it's all NRTM version 3, which is not really a protocol, there's a website page, and there's no authenticity check at all. This affects everyone who later queries that source, so it's kind of a big deal that the data is authentic, because we trust it. It's tied to Whois, to Port 43, which also means it scales very poorly; you can't scale it up to too many users. It's bootstrapped from a dump which is published somewhere else, and you have to hope that the dump you retrieve and the source from which you then stream have the same data stream, because if they don't, very strange things happen and you can't detect it. There's no consistent character set at all, and there's basically no error detection. There are just a tonne of very interesting ways in which you can break your replication; then I get e‑mails about it, and the answer is you have got to reset the whole thing, and you do that once in a while.
Version 4 is a replacement for this, which I authored together with Job Snijders and Stavros, and it draws inspiration from RRDP, being JSON‑ish files on an HTTPS endpoint; delivering files over HTTPS is quite a solved problem by now. There are signatures and hashes for authenticity, a path of a signature from the publishing IRRd to the one parsing it, and also a single publication point.
The object format of RPSL is out of scope: there isn't really any uniform standard, just rough outlines, so we don't define what the objects look like. This is already a problem that we tend to work around pretty well.
How does it work? There is a small Update Notification File, which is sort of an index; it's very tiny, so you can request it a lot. It refers to a snapshot, which is a large file, a full dump of the whole database to start from, and to one‑minute batches of changes in delta files. Snapshots and deltas are immutable, so you can cache them very well; the scalability of this is whatever infrastructure you can pay for, really. It's very easy and cheap to scale on commodity services or hardware, and the format is JSON and sequential, so you don't have to load the whole thing into memory, as we discovered while trying to write implementations.
And so the latency is about one minute: every minute you get a new set of changes by polling the update notification file.
The goal here is to improve reliability, security and scalability: secure transport, secure end‑to‑end signature processing, scalability because downloading files over HTTPS is not a difficult problem, but especially also detection of errors and loss of synchronisation. If the source instance loses its history and can't produce delta files any more, it can restart and actually tell all the mirrors "I don't have a coherent history for you, you need to reload", which is currently not possible; you have to e‑mail people and hope you reach all of them.
Maybe more open access if it becomes more scalable, and of course you can have your own NRTM implementation; who here has ever written code for NRTM v3? Yeah, I figured.
So, status. RIPE NCC has a mirror server implementation in production on the RIPE database, IRRd has a mirror client in testing, and they work together; they have been streaming data. It's actually not a very difficult protocol; there are just a few details here and there to deal with some obscure things, but overall it's pretty straightforward, so it works. Key rotation we haven't tested yet in interoperability, and the other direction we haven't implemented yet: the RIPE database as a client and IRRd as a server. The draft has been adopted by the Routing Operations Working Group, we published a new version, and I also want to note that my work on NRTM version 4 is supported by LACNIC and the RIPE NCC Community Projects Fund.
That is all we have for you. I think we have a little time for questions, but also, if you have ideas or questions relating to IRRd, feel free to find us at any time; I think we are both here all week. One of my wheels is a little squeaky, so I shouldn't be too hard to find. Thanks for listening.
CLARA WADE: Thank you. We have some time for questions if you want to come up to the mic or if you are a remote participant you can submit your question in the Q&A.
SPEAKER: Maria, developer of BIRD. We are often telling users that, if they want to check their routes against the IRR, the information from RIRs, they should configure it as an RPKI thingy, loading it via the RTR protocol from an RTR cache. My question is whether you were thinking about adding support for producing an RTR stream, to make it possible to directly connect to this information?
SASHA ROMIJN: You mean to pull the RPKI data into IRRd?
Maria: The other way around.
SASHA ROMIJN: To publish IRRd records as RTR. I have never thought about this before but I think it's interesting.
SPEAKER: If I understood it correctly, you can mirror the data stored in all the five RIRs with this. Do we have an idea how much data that is? If I want to do this on my own, for example, how much data would I need to be able to mirror all five RIRs?
MASSIMO CANDELA: No ‑‑
SASHA ROMIJN: The order is gigabytes; it's on the order of gigabytes, definitely less than 15, it's not that big. Most of the data is actually there to support querying.
SPEAKER: Randy Bush, Arrcus and IIJ. I am an old Unix user of I don't know how many years. Is there any feature you chose not to do?
SASHA ROMIJN: Yes, yes, yeah, absolutely. So, I looked of course at IRRd version 3 back then, and we also actually pulled a few hundred thousand queries from NTT and ran them through and compared the results, and one thing we intentionally did not implement at the time was e‑mail "From:" authentication.
RANDY BUSH: Darren. I used to run IRRd; the current version has more dependencies than the overly attached girlfriend.
SASHA ROMIJN: I always have the philosophy that I prefer not to write things myself if other people have done them better and with more maintenance behind them.
CLARA WADE: All right. There are no questions online, so thank you so much, Massimo and Sasha.
I will introduce our next speaker, Raffaele Sommese, who will discuss amassing country‑code top‑level domain names from public data.
RAFFAELE SOMMESE: I am a postdoctoral researcher at the University of Twente, and today I am going to talk to you about work I did in collaboration with my colleagues on how to learn country‑code top‑level domain names from public data.
Before starting, let me say a special thanks to the RIPE Academic Cooperation Initiative for letting me be here today. It's a great initiative through which academics can participate in RIPE meetings, so thanks for this initiative and thanks for letting me be here today.
Let me introduce this topic: why do we want to learn domain names? The web is a vast and intricate network, and we study the web a lot, from different perspectives: from a technical perspective, from an economical perspective, and ‑‑
(TECHNICAL FAULT ‑ POWER OUTAGE)
Plenary, part 2:
BRIAN NISBET: Folks, we are just trying to give the wonderful team a couple of minutes and one way or the other we are going to proceed with this talk, because we are not even sure if the hotel will be ready for us to have a coffee break at this point in time, give us a couple of minutes, we will be ready to continue with the talk, thank you for your patience and for getting us back online.
RAFFAELE SOMMESE: Thanks a lot. I am sorry, I seriously meant to send you to the coffee break ahead of time; apparently this will take a bit longer. What I was saying: domain names are what we want to learn, because they are the building block for every study that we do on the DNS, on the web, on the Internet. There are two ways that we can get domain names: from zone files and from top lists of domain names. To explain why we want to learn domain names, let me explain what we do in OpenINTEL. It is a research‑oriented, large‑scale DNS measurement platform that measures around 65% of the second‑level domain space every day, more than 252 million domain names, and since 2021 it has collected 9.1 trillion data points about the DNS, out of which we got 60 papers in collaboration with other academic institutions. Our goal is to be the long‑term memory of the DNS ecosystem: to answer questions like, what did the DNS look like ten years ago? What was that domain pointing to ten years ago? This is the kind of question that we are trying to answer with our platform.
And to do this, to measure the domain name ecosystem every day, we rely on lists of domain names. So where do we get these lists? The first source is, for sure, the public top lists: the former Alexa list, now retired, Cisco, Tranco, Cloudflare. These lists represent the domains that are most queried on the Internet, and they have a lot of different definitions of what "most queried" means: for example, Cloudflare and Cisco use the fact that they run an open resolver, so they collect the data and look at the popular domain names, while Tranco uses a different approach. All these lists have a way to identify the top one million domains as they consider them. What's the problem?
One million domains represent just 0.25% of the global DNS second‑level domain name space, so we have a really biased view: a view of just the richest part of the domain name population. These lists are great because they are public, but we are telling the story of the Internet by telling the story of the richest part of the Internet, while claiming that we are telling the story of the entire world; we surely have bias if we use only these domain names.
The second source that we use for feeding our measurements is the so‑called open TLDs, in this case open ccTLDs. There are several ccTLDs that expose their zone to zone transfer, so you can effectively run an AXFR against the zone and get the full zone out: these are Switzerland, Estonia, Liechtenstein and Sweden. I say "intentionally", because there are also some ccTLDs and some gTLDs that misconfigure their zones so that you can do an AXFR even though they are not publicly saying they are sharing their zones. There are other top‑level domains that share through open data: they have a catalogue on their website and you can download the file every day, as in the case of France and Slovakia. These are great, fully representative of the Internet landscape of a country, and they are public. It's really important that this data is public, because the things that we learn in a public way are the things that we can share back with the community and with industry; this is really the boundary of our data sharing. All the measurement data that we derive from public sources we can share back with the whole community: you can go to this website and basically download the measurement data that we collect every day for these sources.
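For the curious, the zone transfer mentioned above is an ordinary DNS query with QTYPE AXFR (252); in practice you would just run `dig AXFR <zone> @<server>` against the server a registry publishes, or use a DNS library. The sketch below only illustrates what that request contains on the wire, using the standard library.

```python
import struct

def build_axfr_query(zone: str, query_id: int = 0x1234) -> bytes:
    """Build the DNS wire-format question for a zone transfer (QTYPE AXFR=252).

    Over TCP the message must additionally be prefixed with its 2-byte length;
    this sketch stops at the message itself.
    """
    # Header: ID, flags=0, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack(">HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 252, 1)  # QTYPE=AXFR, QCLASS=IN
    return header + question
```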
Then we get to the data for which we need to sign some sort of agreement, and the first one is ICANN CZDS. When ICANN introduced the wave of new gTLDs, dot‑whatever, they decided that, to prevent abuse and malicious registrations, operators of these gTLDs should share their zone files, so they created this centralised platform where people can register, put in their personal information and request access to these zones. This is great, but the data that you get is not reshareable: you get it for yourself, for your organisation, but you cannot reshare it. So again, as in the example I made before: this data we measure, but unless the person that comes to us has the same agreement, the same access to the CZDS portal, we cannot reshare the data with them. The next one, the fourth category, is the NDA ccTLDs, or the "beer ccTLDs" as I usually call them. For some it's easy: you just create an account, you go there, and somehow they approve your access; it's not the end of the world, you don't need to provide a lot of data. But obtaining access can be a complex task: sometimes it involves negotiations, sometimes contracts, you need to go through NDAs, and sometimes it involves beers, because you need a direct contact with people, you need to know people to get access to the zone. In this way, we managed to get access to 13 ccTLDs. What's the problem? All these contracts that we get are under NDA, hence they are not reshareable.
And then there is, on the joking side, the "no way" ccTLDs, where we have Germany: nein, nein, nein, privacy. And the funniest was the one where we asked for access to the zone and were told: in principle we can give you access to the zone, there is just one minor requirement, we should ask all the Italians to grant us approval to give you access to the zone. It seems like when a lawyer tries to stop you from filing a case; it's exactly that, it's Italian bureaucracy at its finest, I would argue.
Then there are the "should share" ccTLDs. The European Union created a regulation under which .eu should share information, the content of the zone, the data that they collect, with public and private bodies of the European Union that work in cybersecurity and information security. We are a public body of the European Union, we work in this field ‑‑ and we don't get access to .eu.
What is the problem? Barriers in data sharing hurt the community, because we cannot perform our job and let others use our data, data that is really precious for combining different datasets together.
And the other problem is that the Internet extends beyond gTLDs: if we just look at gTLDs, we miss a lot of interesting elements that represent a significant portion of a country's Internet landscape. All this lack of transparency in our Internet measurement and Internet research leads to an under‑representation of ccTLDs, and therefore an under‑representation of the global and regional web ecosystem.
And on the other side, I understand the point of view of the ccTLDs. They are not subject to the ICANN policy that says "you should share the zone, we mandate you to share the zone"; they are subject to their local policy, to their government's policy. So they have privacy concerns, they have liability concerns, and on top of that, GDPR really messed things up, because some people think that domain names are personally identifiable information and cannot be shared because of GDPR.
So, "that boy was our last hope"; but I think that there is another hope.
And the other hope is: let's look on the web, let's see if there are other sources of domain names. If we look, we can find two large sources: Certificate Transparency logs and Common Crawl data. So the main question here is: how representative are these sources of the ccTLD landscape? How much of the ccTLD landscape can we tell just by looking at these two sources?
So what we did is start to collect these data sources: CT logs since 2017, and all of the Common Crawl data. For those who don't know what Certificate Transparency is: it's a mechanism by which every CA that operates and wants to be recognised by modern browsers needs to report every certificate it issues, so every certificate issued on the Internet nowadays is reported to Certificate Transparency logs. It's a large source of domain names. And Common Crawl is a large‑scale crawl of web data. So we consolidated and amassed these datasets, and our methodology relies on crossing them with ground truth. We don't have all the ccTLDs, but we have some of them: 19 out of the roughly 300 that are out there, of which 12 ccTLDs, as I said before, were obtained through agreements and 7 are public. The dataset goes from 2018 to 2023, which means we can do a nice longitudinal analysis. And the nice result that we found is that half of the zones of the ccTLDs that we have are already public, and the coverage that we get from public sources already spans from 43% to 80% in 2023. This coverage is increasing, because more people are adopting TLS and registering certificates, so we can learn more domain names: it went from 37% on average in 2018 to 59% on average in 2023.
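Collecting from CT logs at this scale means paging through each log's entries via the RFC 6962 `get-entries` endpoint. A small sketch of the paging arithmetic, with a made‑up log URL (real logs cap the batch size, often well below 256, and turning the returned base64 MerkleTreeLeaf structures into domain names additionally needs an X.509 parser, which is out of scope here):

```python
from urllib.parse import urlencode

def get_entries_urls(log_url: str, tree_size: int, batch: int = 256) -> list[str]:
    """URLs for paging through a CT log's entries via RFC 6962 get-entries.

    'end' is inclusive in the protocol, hence the -1.
    """
    base = log_url.rstrip("/")
    return [
        f"{base}/ct/v1/get-entries?"
        + urlencode({"start": s, "end": min(s + batch, tree_size) - 1})
        for s in range(0, tree_size, batch)
    ]
```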
And if you think about it, this already negates the privacy and PII concerns that some ccTLDs have, because the information is already out there on the public Internet. To show you the power of this way of collecting domain names, I want to draw your attention to one specific ccTLD, .ru, one of the most challenging ccTLDs to get access to: with public sources we can learn 70% of the domains that are in .ru. If we break down the contribution from CT logs and Common Crawl, on average we have 59% coverage for the ccTLDs we have access to: 24% of the domain names appear both in Common Crawl and in CT logs, 28% only in CT logs and 7% only in Common Crawl. Common Crawl is a source that appears every two weeks, while CT logs are a continuous source, but still, there is value in amassing Common Crawl data as well, because it adds 7% to our coverage. The coverage contributed by Common Crawl is decreasing over time, not because it is scanning less over time, but because certificate adoption is increasing, so the number of domains that show up in Certificate Transparency logs is increasing.
Now, you may think: okay, you learn domain names who knows how long after they are registered. The truth is that we are pretty quick if you look at Certificate Transparency logs. We did this analysis for one ccTLD, .sk (Slovakia): many domains appear already in the first 24 hours, and within the first five days 80% appear. That means that CT logs can provide us timely data about newly registered domain names; it's not just data about historically registered domain names, we can learn about something that has just been registered.
Now, the sample, 19 out of 300, I agree, doesn't sound like a huge sample. So how do we do on the gTLDs? Let's try to generalise this result by looking at gTLDs, let's see how much we get for gTLDs by looking at public sources, and, surprisingly, we get similar numbers: from 38% to 80%, and the average percentage is more or less the same. In general, this proves our argument that by looking at public sources you can already get substantial coverage across different top‑level domains.
But what is the part that we don't see? What is in that 40% of domain names that we don't see? Our initial hypothesis was that these domains are not used for web purposes: they are just registered and left lying around, or used for mail or some other purposes. That's not really the case. We saw that domain names showing up in CT logs and Common Crawl have larger web adoption, but of the part we don't see, 70.5% are still used for the web; they are just not using HTTPS, which is why we see more in CT logs and Common Crawl. So assuming they are not used for the web is not a great assumption.
So, our appeal to registries: given this data that we collected, registries of closed zones should consider opening their zones, and should discuss what the challenges in opening their zones are, including the legal challenges. There are security concerns, sure: for example, you may want to prevent abusive registrations that sneak in a name similar to a popular one that has just been registered. This can be solved by introducing a delay, by publishing the zone several days later; that still works. And if there are legal concerns, I argue that these things should be discussed at the European level: how we can handle this data sharing, how we can really handle data sharing for cybersecurity. NIS 2 is really going in this direction of "we need more data sharing to do cybersecurity".
So, to conclude my presentation: with CT logs and Common Crawl you can see more than 50% of the domains in closed ccTLDs. And in the OpenINTEL project, one thing we will surely do very quickly, we are working on it right now, is to measure these domain names and make more measurement data available to the community, because the community really needs it.
And we hope that this work will spark a discussion with ccTLDs for more transparency and data‑sharing efforts.
So, with this, I thank you for your attention, I am sorry for the technical inconvenience, and I am happy to take your questions on data sharing. Thanks a lot.
CLARA WADE: A big round of applause for Raffaele. And we have a couple of minutes for questions.
SPEAKER: Marco Davids, SIDN. Thank you, Raffaele, for this presentation. Maybe a little bit off topic, but I have a question nevertheless: have you ever considered looking at web3 domains, such as Ethereum and all the other ones, to include them in your dataset?
RAFFAELE SOMMESE: We never actually measured them. We considered it at a certain point, but we never investigated how we should integrate that into our measurement platform.
SPEAKER: Jim Reid, freelance consultant and random member of the RIPE community. I think your idea about trying to get ccTLDs to become more open is nice in theory, but in practice it's not really going to work, because lots of ccTLDs are subject to national laws and locally developed policies, and beyond the privacy considerations there are other considerations too. I know that many of these ccTLDs assert compilation copyright over the registration database, and the TLD zone file is therefore an instance of that compilation copyright which they have to protect; if they made the zone file available to anybody, they would lose that intellectual property, which in some cases is the only real asset the registry has. So it's very difficult to change that kind of mindset when that's the local policy. I think it's all very well for you to say ccTLDs should do this or that, but what I hear is pretty much the same as the argument we have been having in ICANN for a long, long, long time over Whois: some people say Whois data must be published, and other people say no, it can't, because we have these privacy considerations and GDPR considerations and all the rest of it. So you will have a screaming match between two groups of people with fundamentally mutually exclusive positions and no hope of a compromise. What you should try to do is figure out other ways to get the data you are after, through the certificates, and another suggestion might be to try to get data from some of the open resolver services, though that might not be considered public from your point of view and therefore invalid. Those are the approaches you should take, rather than telling ccTLDs how to run their business, because frankly that's never going to get anywhere.
RAFFAELE SOMMESE: Thanks for the comment. Regarding public resolvers: there are some providers that claim to have 97% coverage of the entire global DNS space. The problem there, again, is that this data is not public, so we cannot reuse it. I agree with your argument that some ccTLDs may have economic incentives in not sharing their zone; there are ccTLDs that provide, for example, trademark‑protection services on top of their zone. However, I think there is still a way in which they can share that zone: if they share it a few days later, a trademark‑protection service on the immediate side may still work.
SPEAKER: Sebastian Castro. .IE registry. A couple of observations. One is, you are missing one category on your slides, which is lawyers got in the way.
RAFFAELE SOMMESE: Yes, definitely.
SPEAKER: When you have the handshake agreements, it gets messy, because lawyers. The other thing is: I predict that from public sources you only get to 85% coverage, because there is a sort of underlying 15% of domains that are registered but never activated, so their DNS doesn't work, their web doesn't work, their e‑mail doesn't work, and they won't show up in any public source; they are for portfolio building and intellectual‑property protection.
RAFFAELE SOMMESE: I agree with you; there are actually some domain names that never show up even in zones, because they don't publish DNS records. To what extent they are relevant to the academic community, I would argue very little, so we are not really interested in those, but if we can get a good coverage otherwise, that's still an interesting perspective.
CLARA WADE: All right. That concludes the questions. We have a few online.
"Dmitry Serbulov: I didn't understand why we need an open domain registry to help spammers."
RAFFAELE SOMMESE: I didn't fully get the question ‑‑ you mean that if someone starts to crawl the zone and sees, for example, that there are new domains, they start to send spam to the new domains? Again, they can learn that in other ways, using other data sources, for example the ones I showed. And my argument is also that the benefit to the academic community, the benefit that we can provide to the DNS ecosystem, stability, adoption of protocols like DNSSEC, how these are implemented and how stable they are, goes beyond the risk of basically helping spammers.
CLARA WADE: Thank you. Next is from Elvis Velea: "Have you thought about opening a public DNS server and providing it as free service with the conditions that you will use the data?"
RAFFAELE SOMMESE: Not really.
CLARA WADE: Okay. And the last one is from Vadim Mikhaylov from ccTLD .ru: "By what principle were the 19 ccTLDs selected? I see 3 ccTLDs from the Russian region, but I'm not seeing many EU ones in the slide with the graphical results."
RAFFAELE SOMMESE: Oh, they came out of agreements that we had in the past with ccTLDs, through which we got access to the zones. So we didn't really choose which ccTLDs we could access; it's mostly that someone gained access to a ccTLD in the past, and we managed to do this analysis.
CLARA WADE: Thank you, and that concludes the questions. Thank you so much, Raffaele, for your patience and thanks to the Ops team for the quick recovery, they only took a few minutes so thank you for that. We are going to head over to the coffee break and be back for more content at 11 a.m.
LIVE CAPTIONING BY AOIFE DOWNES, RPR