<feed xmlns="http://www.w3.org/2005/Atom"><title>Routing knowledge</title><id>https://phabricator.wikimedia.org/phame/blog/feed/17/</id><link rel="self" type="application/atom+xml" href="https://phabricator.wikimedia.org/phame/blog/feed/17/" /><updated>2025-12-10T11:14:00+00:00</updated><subtitle>Header picture: https://commons.wikimedia.org/wiki/File:SunsetTracksCrop.JPG</subtitle><entry><title>Ganeti on modern network design</title><link href="/phame/live/17/post/312/ganeti_on_modern_network_design/" /><id>https://phabricator.wikimedia.org/phame/post/view/312/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2023-12-15T09:57:40+00:00</published><updated>2024-06-19T12:47:42+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><h2 class="remarkup-header">Context</h2>

<p>For reasons already mentioned in other docs (e.g. Eqiad Expansion Network Design), we’re moving towards a network architecture where the servers’ layer 3 domains (subnets) are constrained to each rack. Currently (and in most of our core DCs), those layer 3 domains are stretched across all the racks of a given row. In that setting, a Ganeti cluster of a given row (with its hypervisors spread across the row) leverages this <tt class="remarkup-monospaced">L2</tt> adjacency to live migrate VMs between hypervisors.<br />
In other words, if work is going to be done on hypervisor1, all the VMs it hosts can be temporarily and transparently distributed across the other hypervisorX to prevent any disruptions. Having the same vlan trunked to all the hypervisors of the same row allows the VMs to move to a different hypervisor without requiring any IP renumbering and thus downtime.</p>

<p>There are multiple ways to have Ganeti fully operational on the new network design; each of course has its own set of tradeoffs (cost, implementation or migration complexity, uptime).</p>

<h2 class="remarkup-header">Per rack clusters</h2>

<p>This is the easiest to implement, as we already have all the tooling and automation; it is also what we’re doing in the new POPs design. As each rack is its own L3 domain, we can have one or more bridged hypervisors per rack. We currently have 6 to 7 hypervisors per row, with 24 to 38 VMs per row.<br />
At one end of the spectrum there is 1 hypervisor per rack: this fully prevents any kind of live migration (automatic or manual), and all the VMs have the same constraints as physical servers. For example, if the ToR or hypervisor needs any kind of maintenance, the VMs will go down during the maintenance window. 1 per rack also means a large number of “micro-clusters”, making VM allocation more difficult.<br />
At the other end, we could have all 6 or 7 hypervisors in the same rack. This option makes live migration and hypervisor maintenance easy, but a ToR maintenance or failure means losing all 24 to 38 VMs. This could also be problematic in terms of overall server placement between racks (for both rack space and network usage).<br />
In between, clusters of 2 to 3 hypervisors per rack mitigate the downsides of both extremes, but only mitigate them. Unless we run hypervisors at 50% capacity, which is not economically viable, not all VMs could be drained. Similarly, when maintenance needs to happen on a ToR, many VMs will go down.</p>
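<p>A back-of-the-envelope sketch of the drain problem, with hypothetical capacity figures (not our actual numbers):</p>

```python
def can_drain_one_hypervisor(utilization, cluster_size=2, capacity=6):
    """Within a per-rack cluster, can one hypervisor's VMs be absorbed
    by the remaining cluster members? Figures are hypothetical."""
    vms_on_one_hv = capacity * utilization
    spare = (cluster_size - 1) * capacity * (1 - utilization)
    return spare >= vms_on_one_hv

print(can_drain_one_hypervisor(0.5))  # True: fits exactly at 50% utilization
print(can_drain_one_hypervisor(0.7))  # False: a 2-node cluster cannot drain above 50%
```

In other words, a 2-hypervisor rack cluster can only be fully drained when utilization stays at or below 50%, which is the economic problem mentioned above.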

<p>Even though we’re designing systems to be redundant between racks, rows and sites, some services don’t or can’t follow those principles, for example active/passive services with no automatic failover. Not being able to migrate VMs would increase the workload, especially during planned maintenance.</p>

<h2 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> abstraction at the ToR</h2>

<p>This option in some ways mimics the current situation, but instead of using a proprietary Juniper technology to bridge the same vlans across rows, we use a more standardized technology: VXLAN.<br />
The main downsides to this solution are the increased license cost (Juniper and Dell SONiC require a special license to handle VXLAN), the lower interoperability between network vendors, as well as the configuration and operational <a href="https://blog.ipspace.net/2023/09/l2-bad.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">complexity</a>, when it’s usually preferred to keep the network layer as lean as possible. This could be a temporary solution, for example during a migration phase, but not a long-term one.</p>

<h2 class="remarkup-header">Routed Ganeti</h2>

<p>This consists of having each Ganeti host behave as a basic router. This allows each VM to be independent from a networking point of view, as the hypervisors take care of propagating reachability information (IP routes) to the rest of the infrastructure.</p>

<p>Going that way, the requirements are:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Functional live migration</li>
<li class="remarkup-list-item">Minimal modification to our automation (eg. makevm cookbook)</li>
<li class="remarkup-list-item">Minimal modification to our Debian installer and guest OS</li>
<li class="remarkup-list-item">Existing VMs can be re-imaged into this new mode</li>
</ul>

<p>Moving away from <tt class="remarkup-monospaced">L2</tt> adjacency also means that LVS in their current form won’t be able to forward traffic to those VMs, the solution is IPIP support in LVS (<a href="https://phabricator.wikimedia.org/T348837" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_0"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T348837</span></span></a>).</p>

<h3 class="remarkup-header">Setup and investigation</h3>

<p>To get this working, I followed the current <a href="https://wikitech.wikimedia.org/wiki/Ganeti#Building_a_new_cluster" class="remarkup-link remarkup-link-ext" rel="noreferrer">Building a new cluster</a> step-by-step instructions, with a couple of adjustments.</p>

<p>First adjustment is that I used the following cluster init command:<br />
<tt class="remarkup-monospaced">sudo gnt-cluster init --no-ssh-init --enabled-hypervisors=kvm --vg-name=ganeti --master-netdev=eno1 --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= --nic-parameters=mode=routed,link=main ganeti-test01.svc.eqiad.wmnet</tt></p>

<p>In the <tt class="remarkup-monospaced">--master-netdev</tt> which I bind on the hypervisor’s primary (and only) NIC, in the <tt class="remarkup-monospaced">--nic-parameters link=main</tt> means use the default routing table.</p>

<p>I also manually applied the few commands from the <a href="https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/ganeti/addnode.py#L150" class="remarkup-link remarkup-link-ext" rel="noreferrer">sre.ganeti.add_node</a> cookbook, bypassing the checks specific to <tt class="remarkup-monospaced">L2</tt> Ganeti (<tt class="remarkup-monospaced">def is_valid_bridge()</tt>).</p>

<p>Then creating a VM (for example in the private range) requires just:<br />
<tt class="remarkup-monospaced">sudo gnt-instance add -t drbd -I hail --net 0:ip=10.66.2.10 --hypervisor-parameters=kvm:boot_order=network,spice_bind=127.0.0.1 -o debootstrap+default --no-install --no-wait-for-sync -g eqiad-test -B vcpus=1,memory=1024m --disk 0:size=10g testvm1001.eqiad.wmnet</tt></p>

<p>The VM is not yet ready to be started, but we can already see that the VM’s IP needs to be present for the init script to set up the static route. <tt class="remarkup-monospaced">spice_bind=127.0.0.1</tt> is only necessary to access the UI of my test VMs using <a href="https://wikitech.wikimedia.org/wiki/Ganeti#SPICE" class="remarkup-link remarkup-link-ext" rel="noreferrer">SPICE</a>.</p>
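<p>Conceptually, the effect of that route setup on the hypervisor is a host route of the following form (an illustration of the result, not the script’s literal commands):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># installed on the hypervisor for each VM, in the table given by link=
ip route replace 10.66.2.10/32 dev tap0 table main</pre></div>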

<h4 class="remarkup-header">Guest VM IPv4 connectivity</h4>

<p>When a VM is started, Ganeti calls the <a href="https://github.com/ganeti/ganeti/blob/master/tools/kvm-ifup.in" class="remarkup-link remarkup-link-ext" rel="noreferrer">kvm-ifup</a> bash script, and in turn the <a href="https://github.com/ganeti/ganeti/blob/master/tools/net-common.in#L179" class="remarkup-link remarkup-link-ext" rel="noreferrer">setup_route</a> function. This takes care of attaching the VM interface to the proper routing table, as well as adding a static route to that routing table (“to reach IP X, use interface Y”). As it doesn’t seem possible to pass a custom script to Ganeti, modifying <tt class="remarkup-monospaced">/usr/lib/ganeti/3.0/usr/lib/ganeti/net-common</tt> (using Puppet, for example) seems like the best approach to perform additional post-VM-startup actions.</p>

<p>So far that script needs the following modifications:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Disabling proxy_arp (enabled by default) by commenting out that command</li>
<li class="remarkup-list-item">Add an IP on the VM facing interface (with scope link)</li>
<li class="remarkup-list-item">Send a gratuitous ARP, for faster recovery after live migration</li>
</ul>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span>ip addr add <span class="m">10</span>.66.1.1/32 dev <span class="nv">$INTERFACE</span> scope link
arping -c1 -A -I <span class="nv">$INTERFACE</span> <span class="m">10</span>.66.1.1
<span class="c1">#echo 1 &gt; /proc/sys/net/ipv4/conf/$INTERFACE/proxy_arp (commented out)</span></pre></div>

<p><span class="remarkup-highlight">TODO</span> investigate improvements in addition to those commands such as:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">net.ipv4.conf.&lt;int&gt;.arp_ignore=3 
net.ipv4.conf.&lt;int&gt;.arp_notify=1</pre></div>

<p>Starting that test VM with a basic Debian installer in “rescue mode” to get a prompt:<br />
<tt class="remarkup-monospaced">sudo gnt-instance start -H boot_order=cdrom,cdrom_image_path=/tmp/debian.iso testvm1001.eqiad.wmnet</tt></p>

<p>In the VM, set up its IP and routing configuration:</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="n">ip</span> <span class="n">addr</span> <span class="n">add</span> <span class="mf">10.66.2.10</span><span class="o">/</span><span class="mi">32</span> <span class="n">dev</span> <span class="n">ens13</span>
<span class="n">ip</span> <span class="n">route</span> <span class="n">add</span> <span class="mf">10.66.1.1</span> <span class="n">dev</span> <span class="n">ens13</span> <span class="n">scope</span> <span class="n">link</span>
<span class="n">ip</span> <span class="n">route</span> <span class="n">add</span> <span class="k">default</span> <span class="n">via</span> <span class="mf">10.66.1.1</span></pre></div>

<p>This can of course look odd: we’re setting a /32 NIC IP as well as a static route pointing to an interface. But that’s how the Linux kernel expects it to be configured.</p>

<p>We can then ping the VM on 10.66.2.10 from the hypervisor, as well as ping the VM from a different host than the hypervisor (as long as that 3rd party host has a route to the VM pointing at the hypervisor). Pings from the VM to that 3rd party host work as well, after enabling forwarding on the hypervisor’s main NIC.</p>

<p><tt class="remarkup-monospaced">sysctl -w net.ipv4.conf.eno1.forwarding=1</tt><br />
This gives hope, as we don’t need to rely on proxy_arp, and we don’t need to change the guest OSes much for them to work with IPv4, as long as DHCP behaves.</p>
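<p>For completeness, the route on that 3rd party host is just a static route pointing at the hypervisor (the address below is a placeholder):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># on the 3rd party host; &lt;hypervisor-ip&gt; is a placeholder
ip route add 10.66.2.10/32 via &lt;hypervisor-ip&gt;</pre></div>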

<p>As Ganeti configures the static route pointing to the VM, and supports only 1 v4 IP per instance (the parameter ip=10.66.2.10), it’s not possible to manually configure multiple IPs on a guest VM without relying on a dynamic routing protocol (e.g. BGP) between the VM and the hypervisor. In our infra, only 3 existing VMs are set up that way: <tt class="remarkup-monospaced">lists1001.wikimedia.org,mx[1001,2001].wikimedia.org</tt>.</p>

<p>Live migration, albeit between hypervisors in the same VLAN, shows continuous reachability between the VM and its gateway.<br />
<tt class="remarkup-monospaced">sudo gnt-instance migrate testvm1001.eqiad.wmnet</tt></p>

<p>Similarly, two VMs on the same hypervisor can reach each other. Their interfaces always have the same “router” IP:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">10: tap0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap0
11: tap1: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap1</pre></div>

<p>Going one step further, I set up a 3rd Ganeti node in Dallas (the first two are adjacent in Ashburn) as a proof of concept. Live migrating a VM from Dallas to Ashburn worked perfectly: only 2 pings of the constant ping running from the VM to its gateway (10.66.1.1) were lost, which is more than acceptable for a live migration across a ~31ms link. It’s not that we will want to do this with production VMs, but if it works stretched that far, it will work within the same datacenter.</p>

<h4 class="remarkup-header">Guest VM DHCP</h4>

<p>When starting a VM, Debootstrap will initialize and ask for an IP using DHCP. The previously configured firewall rule permits the DHCP request to reach the hypervisor. We now have 2 options:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Run a DHCP server on the hypervisor and directly reply to the VM</li>
<li class="remarkup-list-item">Run a DHCP relay on the hypervisor to… relay the request to our DHCP server</li>
</ul>

<p>We could imagine, for example, packaging and automating <a href="https://github.com/grnet/nfdhcpd" class="remarkup-link remarkup-link-ext" rel="noreferrer">nfdhcpd</a> using spicerack or the makevm cookbooks. However, I preferred the 2nd option as it leverages our current DHCP server and makes the hypervisor a “dumb” relay.<br />
For that I installed <a href="https://packages.debian.org/search?keywords=isc-dhcp-relay" class="remarkup-link remarkup-link-ext" rel="noreferrer">isc-dhcp-relay</a> and modified <tt class="remarkup-monospaced">/etc/default/isc-dhcp-relay</tt> so it points to our local DHCP server. The first limitation is that it binds only to the interfaces existing at daemon startup time. To work around this, I added <tt class="remarkup-monospaced">service isc-dhcp-relay restart</tt> to the <tt class="remarkup-monospaced">net-common</tt> script so the daemon gets restarted after the VM interface is created.<br />
The 2nd limitation is due to the way we do DHCP relaying on our core routers: those routers intercept the relayed packets and drop them.<br />
Using the Juniper configuration <tt class="remarkup-monospaced">forwarding-options dhcp-relay forward-only</tt> fixes the issue, but breaks DHCP relaying for regular “non routed” hosts. The path forward here is most likely to move away from DHCP option 82 and instead use DHCP option 97 (see T304677).</p>
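<p>For reference, the relay configuration boils down to a couple of lines in <tt class="remarkup-monospaced">/etc/default/isc-dhcp-relay</tt> (the server name and interface list below are placeholders, not our actual values):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># What servers should the DHCP relay forward requests to?
SERVERS="dhcp-server.example.wmnet"
# On what interfaces should the DHCP relay listen?
# Must include the tap interfaces, hence the restart from net-common.
INTERFACES="eno1 tap0"
# Additional options for the daemon
OPTIONS=""</pre></div>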

<p><span class="remarkup-highlight">TODO</span> deploy a specific DHCP config snippet on the DHCP server to deliver the proper IP and route info to the VM (see <a href="https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/</a> ) but blocked by the issue above.</p>

<h4 class="remarkup-header">Guest VM IPv6 connectivity</h4>

<p>First we can start with a little security housekeeping by modifying the net-common script:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Disable learning any router-advertisements from the guest VMs</li>
<li class="remarkup-list-item">Disable NDP proxying (like ARP proxying but for IPv6)</li>
</ul>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="nb">echo</span> <span class="m">0</span>  &gt; /proc/sys/net/ipv6/conf/<span class="nv">$INTERFACE</span>/accept_ra
<span class="nb">echo</span> <span class="m">0</span>  &gt; /proc/sys/net/ipv6/conf/<span class="nv">$INTERFACE</span>/proxy_ndp</pre></div>

<p>While Ganeti is v4-aware (remember the <tt class="remarkup-monospaced">ip=10.66.2.10</tt> parameter for <tt class="remarkup-monospaced">gnt-instance add</tt>), this is not the case for IPv6, which is, so far, a blocker. Note that “attaching” an IP to a Ganeti instance object is especially needed for migration, as the target hypervisor needs to know which static routes to create.</p>

<p>There are multiple possible workarounds here, none of which are great:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Implement the missing feature (not really a workaround per-se, and significant work, so this is the least preferred option)</li>
<li class="remarkup-list-item">Rely on the “<a href="https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html?highlight=tags#tags-handling" class="remarkup-link remarkup-link-ext" rel="noreferrer">tags</a>” feature, which supports arbitrary key-value pairs, but this needs to be tested, especially to know if they are passed on to the <tt class="remarkup-monospaced">net-common</tt> script</li>
<li class="remarkup-list-item">Advertise the v6 prefixes over IPv4, with the major downside being that all guest VMs would need to run BGP…</li>
<li class="remarkup-list-item">Write a demon that inserts static route based on the NDB table (with safeguards)</li>
<li class="remarkup-list-item">Leverage our current mechanism of deriving the v6 IP from the v4 one</li>
</ul>

<p>Setting this blocker aside, v6 is easier to work with than v4 as it has plenty of IPs. This leads to the question: should I assign a /128 or a /64 per VM?</p>

<h5 class="remarkup-header">/128</h5>

<p>This can be configured dynamically using router advertisement. Testing <a href="https://github.com/radvd-project/radvd" class="remarkup-link remarkup-link-ext" rel="noreferrer">radvd</a> with this <tt class="remarkup-monospaced">/etc/radvd.conf</tt></p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::10/128 {
	AdvRouterAddr on;
  };
};</pre></div>

<p>On the VM side this configures the default gateway, leveraging link-local IPs. It only requires setting the NIC IP on the VM side (<tt class="remarkup-monospaced">2001:db8:cb00:7100::10/128</tt>) as well as the static route on the hypervisor side.</p>

<p>If we have to use a 3rd party stateful (config state) tool like radvd, we could just as well use <a href="https://github.com/grnet/nfdhcpd" class="remarkup-link remarkup-link-ext" rel="noreferrer">nfdhcpd</a>, but the latter seems abandoned.</p>

<p>To stay stateless on that side, the alternative is to modify the Debian <a href="https://github.com/wikimedia/operations-puppet/blob/7d8c67c621de889f32db8ad09fca32cd25f74e03/modules/install_server/files/autoinstall/scripts/late_command.sh#L97" class="remarkup-link remarkup-link-ext" rel="noreferrer">autoinstall</a> script so it sets the NIC v6 IP “manually”. In that case, radvd is only used to advertise the next hop IP:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7000::/52 {
	AdvOnLink off;
	AdvAutonomous off;
	AdvRouterAddr on;
  };
};</pre></div>

<p>At this point, maybe it’s easier to treat it fully like IPv4: ditch radvd and manually configure fe80::1 on tap0, with the matching routes on the VM side. As the v6 IP is generated from the v4 IP and the v6 prefix, it’s then also possible to add the hypervisor-side route from “net-common”.</p>
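<p>That radvd-free variant would look roughly like this (a sketch of the idea, using the addresses from the examples above; not tested commands):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># hypervisor side, per VM interface (e.g. from net-common):
ip addr add fe80::1/64 dev tap0
ip route add 2001:db8:cb00:7100::10/128 dev tap0

# VM side:
ip addr add 2001:db8:cb00:7100::10/128 dev ens13
ip -6 route add default via fe80::1 dev ens13</pre></div>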

<h5 class="remarkup-header">/64</h5>

<p>More efficient in some ways, the following allows the guest OS to perform automatic IP configuration using SLAAC. Only the static route on the Ganeti side is needed.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::/64 {
	AdvRouterAddr on;
  };
};</pre></div>

<p>This is also compatible with our Debian <a href="https://github.com/wikimedia/operations-puppet/blob/7d8c67c621de889f32db8ad09fca32cd25f74e03/modules/install_server/files/autoinstall/scripts/late_command.sh#L97" class="remarkup-link remarkup-link-ext" rel="noreferrer">autoinstall</a> script, as it would still apply the v4-to-v6 mapping, in our case configuring the IP 2001:db8:cb00:7100:10:66:0:10/64 on the VM’s primary interface. It also allows for additional guest IPs (even though those are not widely used). Assigning a prefix to a VM, however, means changing the way we do IP allocation within our automation.</p>
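<p>The v4-to-v6 derivation can be sketched as follows (a simplified reconstruction of the mapping, not the autoinstall script’s actual code): each decimal octet of the v4 address becomes one hextet under the v6 /64 prefix.</p>

```python
import ipaddress

def derive_v6(prefix: str, v4: str) -> str:
    """Embed each decimal IPv4 octet as one hextet under a /64 prefix.
    Sketch of the v4-to-v6 mapping, not the production implementation."""
    hextets = [str(int(octet)) for octet in v4.split(".")]
    full = prefix.rstrip(":") + ":" + ":".join(hextets)
    return str(ipaddress.IPv6Address(full))  # validate and canonicalize

print(derive_v6("2001:db8:cb00:7100::", "10.66.0.10"))
# -> 2001:db8:cb00:7100:10:66:0:10
```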

<p>This however raises the question: how do we update that radvd.conf file (which needs an entry for every VM) at each VM movement? This seems tedious for the “net-common” bash script. Upstream started looking into <a href="https://github.com/radvd-project/radvd/issues/208" class="remarkup-link remarkup-link-ext" rel="noreferrer">including config files</a>, which would help. That’s why my preference is to use a static /128 IP per VM.</p>

<p>Overall there is significant work to be done regarding IPv6 that is outside the scope of this project, tracked in <a href="/T102099" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_1"><span class="phui-tag-core phui-tag-color-object">T102099: Fix IPv6 autoconf issues once and for all, across the fleet.</span></a> and more globally in <a href="/T234207" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_2"><span class="phui-tag-core phui-tag-color-object">T234207: Investigate improvements to how puppet manages network interfaces</span></a>. We could for example use DHCPv6. However, this Ganeti work should, as much as possible, go in the same general direction as what’s planned for those tasks.</p>

<h4 class="remarkup-header">Hypervisor firewalling</h4>

<p>The test Ganeti hosts have been freshly migrated to nftables (see <a href="/T336497" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_3"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T336497: Add support for nftables in profile::firewall</span></span></a>). For the testing phase a single line in a newly created <tt class="remarkup-monospaced">/etc/nftables/input/10_ganeti_guestvm.nft</tt> was enough to permit traffic from the guest VMs to the hypervisor.<br />
<tt class="remarkup-monospaced">iifname &quot;tap*&quot; accept</tt></p>

<p>Run <tt class="remarkup-monospaced">sudo systemctl reload nftables.service</tt> so it’s taken into account, then <tt class="remarkup-monospaced">sudo nft list ruleset</tt> to confirm it.<br />
Before making it production ready (through Puppet) this needs to be tightened up by allowing only a few ports and protocols:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">DHCP (see Guest VM DHCP above)</li>
<li class="remarkup-list-item">BGP &amp; BFD (see Guest VM BGP below)</li>
</ul>
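<p>A tightened-up version of that rule could look something like this (a sketch only; the exact port list would need to be confirmed before production):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">iifname "tap*" udp dport 67 accept    # DHCP to the relay on the hypervisor
iifname "tap*" tcp dport 179 accept   # BGP from guest VMs
iifname "tap*" udp dport 3784 accept  # single-hop BFD (RFC 5881)</pre></div>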

<p>The forwarding chain already allows all traffic by default, and a rule is already present to permit IPv6 neighbor discovery.</p>

<h4 class="remarkup-header">Guest VM routes redistribution</h4>

<p>At this point everything local to the hypervisor works. It’s time to make the rest of the infra aware of the VMs.</p>

<p>As we already use Bird on multiple systems across the infra, it makes sense to use it here as well.<br />
The snippet below instructs Bird to import static v4 and v6 routes from the Linux routing table, checking every second for any change.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">protocol kernel kernel_v4 {
	learn;
	scan time 1;
	ipv4 {
    	import where krt_source = 4; # statics
	};
}
protocol kernel kernel_v6 {
	learn;
	scan time 1;
	ipv6 {
    	import where krt_source = 3; # statics
   };
}
[...]</pre></div>

<p>The other side of the same coin is the “regular” BGP configuration on the routers and the additional safeguard filtering needed.</p>

<div class="remarkup-code-block" data-code-lang="diff" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span>[edit policy-options]
<span class="gi">+   prefix-list ganeti4 {</span>
<span class="gi">+   	10.66.2.0/24;</span>
<span class="gi">+   }</span>
[edit policy-options]
<span class="gi">+   policy-statement ganeti_import {</span>
<span class="gi">+   	term ganeti4 {</span>
<span class="gi">+       	from {</span>
<span class="gi">+           	prefix-list-filter ganeti4 longer;</span>
<span class="gi">+       	}</span>
<span class="gi">+       	then accept;</span>
<span class="gi">+   	}</span>
<span class="gi">+   	then reject;</span>
<span class="gi">+   }</span>
[edit protocols bgp]
<span class="gi">+	group Ganeti4 {</span>
<span class="gi">+    	type external;</span>
<span class="gi">+    	multihop {</span>
<span class="gi">+        	ttl 193;</span>
<span class="gi">+    	}</span>
<span class="gi">+    	local-address 208.80.153.192;</span>
<span class="gi">+    	import ganeti_import;</span>
<span class="gi">+    	family inet {</span>
<span class="gi">+        	unicast {</span>
<span class="gi">+            	prefix-limit {</span>
<span class="gi">+                	maximum 5;</span>
<span class="gi">+                	teardown 80;</span>
<span class="gi">+            	}</span>
<span class="gi">+        	}</span>
<span class="gi">+    	}</span>
<span class="gi">+    	export NONE;</span>
<span class="gi">+    	peer-as 64650;</span>
<span class="gi">+    	neighbor 10.192.48.73 {</span>
<span class="gi">+        	description ganeti-test2004;</span>
<span class="gi">+    	}</span>
<span class="gi">+	}</span></pre></div>

<p>Here BFD isn’t strictly needed as these are unicast prefixes (at least for now): if the hypervisor goes down, there is no need for faster failover, as there is no alternative host anyway.</p>

<p>Testing it with IPv4 only (the IPv6 behavior is expected to be similar): running pings to the VM IP from bast4005 in ulsfo, with an interval of 0.5s, VM migration downtime and full convergence were achieved in less than 2 seconds.</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">30</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">74.3</span> <span class="n">ms</span>   <span class="o">&lt;-</span> <span class="n">VM</span> <span class="n">in</span> <span class="n">Ashburn</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">31</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">73.5</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">32</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">73.9</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">36</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">42.3</span> <span class="n">ms</span>   <span class="o">&lt;-</span> <span class="n">VM</span> <span class="n">in</span> <span class="n">Dallas</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">37</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">41.9</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">38</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">41.9</span> <span class="n">ms</span></pre></div>



<h4 class="remarkup-header">Guest VMs BGP</h4>

<p>As mentioned in the previous section, we rely on BGP on the end hosts to advertise Anycast prefixes for high availability and improved service latency. Some of those services are running in VMs, for example <a href="https://wikitech.wikimedia.org/wiki/Wikimedia_DNS" class="remarkup-link remarkup-link-ext" rel="noreferrer">Wikimedia DNS</a>.</p>

<p>For those services (which are likely to grow in number), the BGP sessions need to be established with the hypervisor, or in other words with the VM&#039;s next-hop gateway. This is how they&#039;re currently configured on hosts behind L3 switches.</p>

<p>Adding an extra hop (the hypervisor) in the AS-path (router &gt; switch &gt; hypervisor &gt; VM) means that an additional prepend is needed on the non-Ganeti Anycast prefixes, like we did when we introduced the new switching fabric. This is in order to maintain a constant AS-path length wherever the end host is located, and thus offer proper balancing (otherwise traffic won’t reach the longer AS paths in normal operations).</p>
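<p>The AS-path bookkeeping can be illustrated with hypothetical ASNs (placeholders, not our real ASNs):</p>

```python
# AS paths as seen from the router for the same anycast prefix
path_via_hypervisor = ["vm_asn", "hypervisor_asn", "switch_asn"]  # routed Ganeti VM
path_bare_metal = ["host_asn", "switch_asn"]                      # host behind L3 switch

# Prepends needed on the bare-metal announcements so both origins
# present equal-length AS paths (and thus balance properly):
prepends = len(path_via_hypervisor) - len(path_bare_metal)
print(prepends)  # 1
```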

<p>Additional configuration needs to be added to Bird for this to work on the Ganeti side. The VM side&#039;s config can be left untouched.</p>

<p>First, BFD becomes necessary for faster anycast failover between the hypervisor and the network, but not between the hypervisor and the VMs, as Bird will track the VM-facing interface (tapX) and withdraw the prefixes if it goes down.<br />
To keep the system dynamic, we should keep as little state on the hypervisor as possible. That’s why Bird is passive here and waits for the VM to initiate the session. This could make re-establishing the session a bit slower after a migration: the time for the VM to notice it’s no longer speaking with the same hypervisor, shut down the session and re-create it. If this is deemed too slow, BFD could be introduced at this layer too for faster recovery.</p>

<p>Security wise, this could permit a rogue user with root on a VM (or a misconfiguration) to pretend to be any allowed AS and advertise any IP permitted in the “VMs_import” filter. This risk is quite low, but additional security mechanisms could be used, like MD5 (at least until <a href="https://gitlab.nic.cz/labs/bird/-/blob/master/doc/roadmap.md#tcp-ao-if-it-appears-in-linux-and-bsd-upstream" class="remarkup-link remarkup-link-ext" rel="noreferrer">TCP-AO</a> is implemented); this wouldn’t prevent misconfigurations though. Another option is to pre-populate the full list of BGP peers with their respective AS using Puppet, with the significant downside of causing config/alerting fatigue and slower provisioning time.</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="n">protocol</span> <span class="n">bgp</span> <span class="n">bgp_v4</span> <span class="p">{</span>
	<span class="n">ipv4</span> <span class="p">{</span>
    	<span class="n">import</span> <span class="n">filter</span> <span class="n">VMs_import</span><span class="p">;</span>
    	<span class="n">export</span> <span class="n">none</span><span class="p">;</span>
	<span class="p">};</span>
	<span class="n">local</span>  <span class="n">as</span> <span class="mi">64650</span><span class="p">;</span>
	<span class="n">neighbor</span> <span class="n">range</span> <span class="mf">10.66.2.0</span><span class="o">/</span><span class="mi">24</span>  <span class="n">external</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">protocol</span> <span class="n">bgp</span> <span class="n">bgp_v6</span> <span class="p">{</span>
	<span class="n">ipv6</span> <span class="p">{</span>
    	<span class="n">import</span> <span class="n">filter</span> <span class="n">VMs6_import</span><span class="p">;</span>
    	<span class="n">export</span> <span class="n">none</span><span class="p">;</span>
	<span class="p">};</span>
	<span class="n">local</span>  <span class="n">as</span> <span class="mi">64650</span><span class="p">;</span>
	<span class="n">neighbor</span> <span class="n">fe80</span><span class="o">::/</span><span class="mi">10</span> <span class="n">external</span><span class="p">;</span>
<span class="p">}</span></pre></div>

<p>This will require thorough testing before using it in production.</p>

<h4 class="remarkup-header">v4 and v6 prefixes allocations</h4>

<p>In addition to not painting ourselves into a corner with a bad addressing plan, this is important as prefix allocation defines the scope of each Ganeti cluster. As each VM is routable, it can technically live anywhere in our network.</p>

<p>There are 2 options here:<br />
The first is to use per-DC prefixes, mimicking our current way of doing things. For example, use 10.66.2.0/23 for eqiad v4 private (a total of 512 IPs, with the possibility to grow it). Unfortunately 208.80.154.0/23 is fully allocated, so any new v4 public IPs will need to come from a subnet re-sizing or a new, larger prefix.<br />
V6 is much easier: the allocation only depends on whether we allocate a /128 or /64 per VM. Private vs. public IPv6 could even be enforced at the hypervisor and come from the same pool of IPs. Probably not the best option as it goes against our current way of operating hosts, but it’s a possibility.<br />
A variant of this is to group IPs (using sub-allocations) by Ganeti cluster. This allows aggregating prefixes to reduce the size of routing tables across the infra, but is not necessary given the small number of VMs we’re running.</p>
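<p>As a quick sanity check on the sizing above, Python’s <tt class="remarkup-monospaced">ipaddress</tt> module can do the arithmetic (the v6 sub-allocation prefix is purely illustrative):</p>

```python
import ipaddress

# Per-DC private v4 pool, as discussed above.
pool = ipaddress.ip_network("10.66.2.0/23")
print(pool.num_addresses)  # 512 addresses in a /23

# In routed mode each VM gets a host route, i.e. a /32.
vm_routes = list(pool.subnets(new_prefix=32))
print(len(vm_routes))  # 512: one /32 per address

# v6: a /64 per VM means carving /64s out of a larger sub-allocation.
v6_pool = ipaddress.ip_network("2001:db8:100::/56")  # illustrative prefix
print(sum(1 for _ in v6_pool.subnets(new_prefix=64)))  # 256 VMs per /56
```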

<p>The other option is to use a global pool of IPs. For example, start naming the VMs testvm1.global.wmnet and assign them an IP from a prefix outside of any of our POPs, like the 10.3.0.0/24 we have for internal anycast. The major advantage is that the VM can be moved anywhere in our infra without having to be renumbered. The major inconvenience is that it becomes quite confusing and would require significant changes to our infrastructure, while providing a false sense of security and increasing the blast radius of a single Ganeti cluster. It’s better to design a service with multiple VMs per site than to rely on being able to move a VM from site to site.<br />
Just because we CAN do it doesn’t mean we SHOULD, but we could for example have a special long-distance cluster for problematic applications that can’t be active/passive (if there are any).</p>

<h4 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> to <tt class="remarkup-monospaced">L3</tt> cluster migration</h4>

<p>Going down that path will require a multi-step migration: first focusing on simple VMs (eg. not running BGP) in the private subnet, then extending the scope. Tooling will need to be adjusted first.<br />
A hard requirement is that VMs will need to be re-IPed. This means re-imaged, like we’re planning on doing for bare metal servers.<br />
I haven’t tested whether a cluster can run both routed and bridged VMs at the same time. Even if it can, this sounds like a risky move, which is why it is preferable to spin up a new cluster.</p>

<p>This cluster can start with two nodes, then progressively receive migrated VMs, freeing up space on other clusters and allowing us to re-purpose hypervisors, etc.</p>

<h2 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> abstraction at the hypervisor</h2>

<p>This solution differs from the previous one by using VXLAN (or any tunneling technology) to provide a <tt class="remarkup-monospaced">L2</tt> domain to the VMs. Instead of relying on Linux&#039;s ability to use a /32 prefix or /128 IPv6 on their virtual NIC, they will be assigned a regular /27-ish or /64 v6. The abstraction takes care of propagating reachability information between VMs. Other than that, it reuses most of the building blocks from the previous solution: DHCP relay, BGP, hypervisor firewalling, router advertisement. It also offers shorter downtime during switchover as even if the VM is now live on a different hypervisor, traffic can still be bridged from the previous hypervisor until BGP converges (we’re talking about milliseconds to seconds here). This would have been the preferred option if simulating a <tt class="remarkup-monospaced">L2</tt> adjacency was required, but in the current state of things it only adds an extra layer, making management and troubleshooting more complex.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>Within Wikimedia’s infrastructure (Debian based, all but 3 VMs having a single IP, BGP on the host needed), migrating the Ganeti clusters to work in a “routed” mode is a viable option to permit VM live migration between hypervisors spread over any number of L3 domains. The main downside is that this solution requires more preparation and deployment work compared to a <tt class="remarkup-monospaced">L2</tt>-only solution and possibly a tunnel based one. On the other hand, using only standards and open source components makes it a sustainable and low maintenance cost option as well.</p>

<h3 class="remarkup-header">Next steps</h3>

<p>The first next step is to get this document reviewed for any pitfall or oversight I may have made. Then, shortly after, reach a common agreement, including on the few open questions if we stick to routed Ganeti:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">/128 or /64 for IPv6 VMs?</li>
<li class="remarkup-list-item">Prefix allocations?</li>
<li class="remarkup-list-item">How far should a Ganeti cluster and Ganeti group spread?</li>
</ul>

<p>Once this is decided we need to start allocating hardware resources, as my test devices are decommissioned servers that need to be returned to DCops. Then list, and start working on, the prerequisites needed to make it happen: our automation, (host and network) BGP, Puppet, DHCP, etc. (a few are already listed throughout the document).<br />
Timeline-wise this needs to be prioritized to match the core DCs network refreshes, which means ideally being fully ready in Dallas within 6 months at most.</p>

<h2 class="remarkup-header">Other considerations</h2>

<h3 class="remarkup-header">Not going that way</h3>

<p>Creating distinct routing tables for public vs. private zones.</p>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="nb">echo</span> <span class="s2">&quot;100   private&quot;</span> &gt;&gt; /etc/iproute2/rt_tables
<span class="nb">echo</span> <span class="s2">&quot;200   public&quot;</span> &gt;&gt; /etc/iproute2/rt_tables</pre></div>

<p>This initially sounded like a good idea as we separate public vs. private vlans in our infrastructure, but the only reason we actually separate them is to be able to provide different IPs. All the firewalling is done on the hosts (and a bit on the routers).</p>

<h4 class="remarkup-header">Use Linux VRF to separate hypervisor from VM traffic</h4>

<p>A bit similar to the one above but with stricter separation between VM and hypervisor. This would have added extra security if we were providing VMs for untrusted customers (like a cloud provider).</p>

<h4 class="remarkup-header">Mixed clusters</h4>

<p>Test if a cluster can have both routed and bridged VMs in parallel. I didn’t spend time testing this option as even if it works, there is a risk of impacting production VMs.</p>

<h3 class="remarkup-header">Possible future work</h3>

<h4 class="remarkup-header">Dynamic Ganeti cluster VIP</h4>

<p>The master-netdev is the interface on which the cluster’s dedicated management IP lives. It’s currently using a row-specific IP while being the management IP for a site-wide cluster. If that IP becomes unreachable the cluster keeps functioning, but operations (create/delete/modify) can’t be performed. We might benefit from assigning the --master-netdev to the loopback interface and the cluster’s FQDN to a VIP, advertised by BGP like the VMs’ IPs. That would allow for seamless VIP migration and thus easier hypervisor maintenance.</p>

<h4 class="remarkup-header">Apply the same mechanism to bare-metal servers</h4>

<p>This could for example help us save on public IPs instead of having a dedicated public rack or multiple racks with a public IPv4 prefix.</p>

<h4 class="remarkup-header">Add support for multiple IPs per host</h4>

<p>This would require patching Ganeti and thus might be a complex operation to support only 3 hosts (<tt class="remarkup-monospaced">lists1001.wikimedia.org,mx[1001,2001].wikimedia.org</tt>). It can however be done if no alternative exists. Leveraging BGP here again might be the easiest way to go.</p>

<h4 class="remarkup-header">Use fixed MAC address for tap* interfaces</h4>

<p>This has been suggested by Cathal “Unsure of whether it&#039;s an option but potentially all TAP interfaces could also be forced to the same MAC address?  Thus making no ARP update for this on the VM side required (similar to anycast gw idea in evpn).”.<br />
If this works as expected, it would make live migration even faster and not require the “arping” mentioned earlier in this doc, as from the VM’s point of view its gateway would look exactly the same.</p>

<h4 class="remarkup-header">Use iBGP between hypervisor and VMs</h4>

<p>This path would allow us to not add an extra BGP hop between the hypervisor and the end VM. Exact tradeoffs remain to be investigated.</p>

<h3 class="remarkup-header">Possible limitations</h3>

<h4 class="remarkup-header">Applications or OSes improperly handling /32 or /128 interface IPs</h4>

<p>Legacy software or specialized OSes might choke on the seemingly odd NIC IP. If it happens, it will need to be tackled on a case by case basis. As the migration to routed Ganeti will take time, those specific cases could keep running on the “former” Ganeti until a solution is found.</p>

<h2 class="remarkup-header">Resources</h2>

<p><a href="https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/</a><br />
<a href="https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor</a><br />
<a href="https://docs.ganeti.org/docs/ganeti/3.0/html/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://docs.ganeti.org/docs/ganeti/3.0/html/</a><br />
<a href="https://linux.die.net/man/5/radvd.conf" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://linux.die.net/man/5/radvd.conf</a><br />
<a href="https://bird.network.cz/?get_doc&amp;v=20&amp;f=bird-6.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://bird.network.cz/?get_doc&amp;v=20&amp;f=bird-6.html</a><br />
<a href="https://www.netfilter.org/projects/nftables/manpage.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://www.netfilter.org/projects/nftables/manpage.html</a><br />
<a href="https://docs.ganeti.org/docs/ganeti/2.2/html/design-2.1.html?highlight=routed#non-bridged-instances-support" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://docs.ganeti.org/docs/ganeti/2.2/html/design-2.1.html?highlight=routed#non-bridged-instances-support</a><br />
<a href="https://github.com/grnet/snf-network/blob/develop/docs/routed.rst" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/grnet/snf-network/blob/develop/docs/routed.rst</a><br />
<a href="http://blkperl.github.io/split-brain-ganeti.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">http://blkperl.github.io/split-brain-ganeti.html</a><br />
<a href="https://blog.cloudflare.com/virtual-networking-101-understanding-tap/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.cloudflare.com/virtual-networking-101-understanding-tap/</a></p></div></content></entry><entry><title>Multi-platform network configuration</title><link href="/phame/live/17/post/304/multi-platform_network_configuration/" /><id>https://phabricator.wikimedia.org/phame/post/view/304/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2023-07-13T14:31:52+00:00</published><updated>2025-12-10T11:14:00+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Network configuration is a quite rapidly evolving area which went through multiple phases. It’s also surprisingly tied to monitoring. Below is some historical context from the industry as well as what we’re doing in SRE.</p>

<h2 class="remarkup-header">Some context</h2>

<h3 class="remarkup-header">The past</h3>

<p>As there was no standardized or programmatic way to get data in or out of network devices, engineers had to be creative. The early days of network automation consisted of scripts pretending to be a human operator.<br />
Those scripts would connect to devices (using ssh or telnet), send commands, <a href="https://github.com/willcurtis/ciscoexpectscript" class="remarkup-link remarkup-link-ext" rel="noreferrer">expecting</a> some prompts, and scraping whatever was sent back.<br />
You can imagine that this was extremely slow and error prone: output layout would change from one version to the next, unexpected output would break scripts, CLIs would get overwhelmed by too much information entered at once, and data would not get validated beforehand.</p>

<h4 class="remarkup-header">SNMP</h4>

<p><a href="https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol" class="remarkup-link remarkup-link-ext" rel="noreferrer">SNMP</a> is a protocol aimed at improving this situation. Despite the limitations it’s widely disparaged for, SNMP is still widely used to monitor devices, especially as it’s supported on virtually all of them.</p>

<p>It got popular for monitoring as none of its limitations are hard blockers for simply pulling counters and states from a device. Security is not critical as it&#039;s read only, and if a packet doesn’t arrive, it’s not a big deal: the data will show up at the next poll.<br />
It’s another story for the configuration aspect. Security is critical (v3 got implemented quite late), as is being sure that a change got applied. These factors, as well as a syntax that is difficult to interact with, meant it never got wide adoption for configuration.</p>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Get a devices description over SNMP</div><pre class="remarkup-code"><span class="gp">$ /usr/bin/snmpget -v2c -c &lt;secret&gt; -OUQn -m SNMPv2-MIB &lt;hostname&gt; sysDescr.0</span>
<span class="go">.1.3.6.1.2.1.1.1.0 = Juniper Networks, Inc. qfx5100-48s-6q (...)</span></pre></div>

<p>SNMP, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: UDP (packet size limited and subject to packet loss)</li>
<li class="remarkup-list-item">Atomicity: None (each request is independent and can change 1 configuration option)</li>
<li class="remarkup-list-item">Message encoding: ASN.1 (quite complex to manually craft)</li>
<li class="remarkup-list-item">Data model: <a href="https://en.wikipedia.org/wiki/Structure_of_Management_Information" class="remarkup-link remarkup-link-ext" rel="noreferrer">SMI</a> (standard or vendor specific)</li>
<li class="remarkup-list-item">Mostly pull based (SNMP traps are less used, especially as it’s UDP)</li>
<li class="remarkup-list-item">Encryption:<ul class="remarkup-list">
<li class="remarkup-list-item">v2c: clear text “community” (not secure, most common for “get”)</li>
<li class="remarkup-list-item">v3: authentication and payload encryption (mostly used for “set” if used at all)</li>
</ul></li>
</ul>

<h3 class="remarkup-header">The present</h3>

<h4 class="remarkup-header">Monitoring</h4>

<p>We actively use SNMP for monitoring: mostly through <a href="https://wikitech.wikimedia.org/wiki/LibreNMS" class="remarkup-link remarkup-link-ext" rel="noreferrer">LibreNMS</a>, which is fully built around SNMP and provides a great UI that solves one difficult part of monitoring: how to display relevant information. And, to a lesser extent, with ad-hoc Icinga scripts pulling specific SNMP OIDs (values) for alerting only.<br />
The current system is working fine (and on all our platforms, even <a href="https://en.wikipedia.org/wiki/Power_distribution_unit" class="remarkup-link remarkup-link-ext" rel="noreferrer">PDUs</a>).<br />
Newer protocols have been implemented with features answering modern needs, like more frequent polling (some SNMP implementations discourage pulling data too often) or the ability to get virtually any metric from the devices, in addition to being more reliable.<br />
So if we have to spend engineering time (for example to monitor QoS accurately - <a href="https://phabricator.wikimedia.org/T326322" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_5"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T326322</span></span></a>), SNMP based tools might not be the best investment.</p>

<p>Regarding configuration, two elements combined in our favor. First, using exclusively Juniper equipment allowed us to focus our efforts. Second, Juniper was a pioneer in NETCONF based device configuration, as well as being “API first”.</p>

<h4 class="remarkup-header">NETCONF</h4>

<p><a href="https://en.wikipedia.org/wiki/NETCONF" class="remarkup-link remarkup-link-ext" rel="noreferrer">NETCONF</a> is in some way the 2nd attempt of the industry for a standardized way of configuring devices. It works in a more layered approach leveraging proven protocols. It is now the industry standard and has been extended in multiple ways.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: SSH (most common), HTTPS (more recent)</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer</li>
<li class="remarkup-list-item">Message encoding: XML-RPC (most common), JSON-RPC (more recent)</li>
<li class="remarkup-list-item">Data model: YANG* or proprietary via an abstraction layer (eg. Junos set)</li>
<li class="remarkup-list-item">Supports locks, atomic changes (apply a set of changes or nothing), full configuration changes</li>
</ul>

<p>(*More on YANG later)</p>

<div class="remarkup-code-block" data-code-lang="xml" data-sigil="remarkup-code-block"><div class="remarkup-code-header">sending a configuration to a device then discarding it</div><pre class="remarkup-code" style=" max-height: 20em; overflow: auto;"><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="nt">&lt;nc:rpc</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:xxx&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;load-configuration</span> <span class="na">format=</span><span class="s">&quot;text&quot;</span> <span class="na">action=</span><span class="s">&quot;replace&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;configuration-text&gt;</span>
My configuration
<span class="nt">&lt;/configuration-text&gt;</span>
<span class="nt">&lt;/load-configuration&gt;</span>
<span class="nt">&lt;/nc:rpc&gt;</span>
]]&gt;]]&gt;
<span class="nt">&lt;rpc-reply</span> <span class="na">xmlns=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">xmlns:junos=</span><span class="s">&quot;http://xml.juniper.net/junos/21.2R0/junos&quot;</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:f9ab59d9-746f-423f-b534-67197941f3df&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;load-configuration-results&gt;</span>
<span class="nt">&lt;rpc-error&gt;</span>
<span class="nt">&lt;error-severity&gt;</span>warning<span class="nt">&lt;/error-severity&gt;</span>
<span class="nt">&lt;error-path&gt;</span>[edit routing-options validation]<span class="nt">&lt;/error-path&gt;</span>
<span class="nt">&lt;error-message&gt;</span>mgd: statement has no contents; ignored<span class="nt">&lt;/error-message&gt;</span>
<span class="nt">&lt;error-info&gt;</span>
<span class="nt">&lt;bad-element&gt;</span>static<span class="nt">&lt;/bad-element&gt;</span>
<span class="nt">&lt;/error-info&gt;</span>
<span class="nt">&lt;/rpc-error&gt;</span>
<span class="nt">&lt;ok/&gt;</span>
<span class="nt">&lt;/load-configuration-results&gt;</span>
<span class="nt">&lt;/rpc-reply&gt;</span>
<span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span><span class="nt">&lt;nc:rpc</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:cc664f8c-8815-41f8-82a8-c655bc6eda10&quot;</span><span class="nt">&gt;&lt;get-configuration</span> <span class="na">compare=</span><span class="s">&quot;rollback&quot;</span> <span class="na">rollback=</span><span class="s">&quot;0&quot;</span> <span class="na">format=</span><span class="s">&quot;text&quot;</span><span class="nt">/&gt;&lt;/nc:rpc&gt;</span>]]&gt;]]&gt;
<span class="nt">&lt;rpc-reply</span> <span class="na">xmlns=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">xmlns:junos=</span><span class="s">&quot;http://xml.juniper.net/junos/21.2R0/junos&quot;</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:cc664f8c-8815-41f8-82a8-c655bc6eda10&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;configuration-information&gt;</span>
<span class="nt">&lt;configuration-output&gt;</span>
<span class="nt">&lt;/configuration-output&gt;</span>
<span class="nt">&lt;/configuration-information&gt;</span>
<span class="nt">&lt;/rpc-reply&gt;</span></pre></div>

<p>Our current network automation (<a href="https://wikitech.wikimedia.org/wiki/Homer" class="remarkup-link remarkup-link-ext" rel="noreferrer">Homer</a>) leverages NETCONF through multiple abstraction layers. It fetches data from a couple sources of truth (<a href="https://wikitech.wikimedia.org/wiki/Netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox</a> and <a href="https://gerrit.wikimedia.org/g/operations/homer-public" class="remarkup-link remarkup-link-ext" rel="noreferrer">YAML</a> files). Using Jinja templates it formats this data in the most user readable structure (the standard Junos syntax). Last, Juniper’s <a href="https://github.com/Juniper/py-junos-eznc" class="remarkup-link remarkup-link-ext" rel="noreferrer">py-junos-eznc</a> library wraps the generated configuration and overall instructions in XML to send it “over” NETCONF (using <a href="https://github.com/ncclient/ncclient" class="remarkup-link remarkup-link-ext" rel="noreferrer">ncclient</a>) to the device. Then the device&#039;s configuration engine takes care of showing us diffs of what changed, handling rollback, etc.</p>
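<p>As a rough illustration of the kind of envelope those layers produce, here is a minimal stdlib-only sketch (the helper name is made up) of wrapping a Junos-style text configuration in a NETCONF <tt class="remarkup-monospaced">&lt;rpc&gt;</tt>, mirroring the <tt class="remarkup-monospaced">load-configuration</tt> exchange shown earlier; in practice py-junos-eznc and ncclient assemble and transport this for us:</p>

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_load_config_rpc(config_text: str, message_id: str) -> str:
    """Wrap a text configuration in a NETCONF <rpc> element, similar to the
    <load-configuration> request shown in the XML example above.
    (Sketch only: a real client also handles framing, commit, rollback.)"""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": message_id})
    load = ET.SubElement(rpc, "load-configuration",
                         {"format": "text", "action": "replace"})
    ET.SubElement(load, "configuration-text").text = config_text
    return ET.tostring(rpc, encoding="unicode")

payload = build_load_config_rpc("system { host-name test1; }", "urn:uuid:xxx")
print(payload)
```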

<p>We’ve opted to do full configuration replacement instead of specific sections only, to prevent drift between the devices’ configurations and the wanted state (eg. manual configuration). A Homer “run” will normalize any changes to match the state from our sources of truth. The downside of this approach is its slowness.</p>

<p>To remediate this slowness, we have implemented a more ad-hoc way of doing scope-limited changes, for example for server port configuration. Using Cumin, Cookbooks can send Netbox based configuration changes as well as get states by issuing Junos CLI commands over SSH. This path is made possible by the Junos CLI being able to return any command output as JSON or XML.<br />
Those tools allowed us to clean up and streamline our network configuration, remove toil, iterate faster on provisioning and troubleshooting, fix misconfigurations, and react efficiently to attacks, at the cost of duplicating some of our automation tooling.</p>
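<p>Structured CLI output is what makes this path robust: instead of scraping text, a cookbook can walk a JSON document. A small sketch, with a trimmed, assumed sample shape loosely mimicking <tt class="remarkup-monospaced">show interfaces terse | display json</tt> (real Junos output has many more fields):</p>

```python
import json

# Illustrative sample only: an assumed, trimmed shape of Junos JSON output,
# where each leaf is a list of {"data": value} objects.
sample = json.loads("""
{"interface-information": [{"physical-interface": [
  {"name": [{"data": "xe-0/0/0"}], "oper-status": [{"data": "up"}]},
  {"name": [{"data": "xe-0/0/1"}], "oper-status": [{"data": "down"}]}
]}]}
""")

def down_interfaces(doc: dict) -> list:
    """Return names of physical interfaces whose oper-status is not 'up'."""
    phys = doc["interface-information"][0]["physical-interface"]
    return [i["name"][0]["data"] for i in phys
            if i["oper-status"][0]["data"] != "up"]

print(down_interfaces(sample))  # ['xe-0/0/1']
```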

<p>Nevertheless there is still plenty of room for improvement: in the tools themselves (<a href="/T250415" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_6"><span class="phui-tag-core phui-tag-color-object">T250415: Homer: add parallelization support</span></a>, performance improvements), the workflows, configuration standardization and verification, and integration with other tools or platforms (<a href="/T328747" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_7"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T328747: Improve Homer output when Juniper device rejects config</span></span></a>, <a href="/T253194" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_8"><span class="phui-tag-core phui-tag-color-object">T253194: Homer CI: verify Junos syntax</span></a>).</p>

<h3 class="remarkup-header">The future</h3>

<p>We recently started looking at alternate vendors for our network switches. After thorough <a href="https://wikitech.wikimedia.org/wiki/Dell_Enterprise_Sonic_Evaluation" class="remarkup-link remarkup-link-ext" rel="noreferrer">evaluation</a> we have decided to start rolling out some SONiC based network devices in our production environment (we could call it phase 2 testing) to make sure larger deployments would be doable.</p>

<p>Even though it was designed with some level of multi-vendor support in mind, only one vendor was implemented in Homer on day one, which makes it a mostly Junos tool.<br />
During the evaluation we had to make sure that any alternate vendor could be automated either in a similar way to what we’re currently doing (full configuration replacement + ad-hoc CLI commands), or in a way that goes in the same direction as the whole industry (including Juniper). The former is a quick way to get started but risky in the long run; the latter has a larger upfront engineering cost but is an investment in the future. Despite the drawback of not supporting NETCONF, SONiC falls in the 2nd category.<br />
Extensive documentation on its <a href="https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/Management%20Framework.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">management framework</a> is available online (being open source probably helps).</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/shf6ikoin4634kshxlzc/PHID-FILE-lvcgigqpl4qndyh3n42k/Mgmt_Frmk_Arch.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_4"><img src="https://phab.wmfusercontent.org/file/data/shf6ikoin4634kshxlzc/PHID-FILE-lvcgigqpl4qndyh3n42k/Mgmt_Frmk_Arch.jpg" height="917" width="1726" loading="lazy" alt="Mgmt_Frmk_Arch.jpg (917×1 px, 273 KB)" /></a></div><br />
Source: <a href="https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/images/Mgmt_Frmk_Arch.jpg" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/images/Mgmt_Frmk_Arch.jpg</a><br />
On this diagram we can see that there are 2 programmatic ways to interact with a device: REST(CONF) and gNMI. <strong>The CLI becomes a regular REST client</strong>, the same goes for Junos.</p>

<h4 class="remarkup-header">Quick digression on Datastores</h4>

<p>Before they were formally called datastores, most network devices had the concept of “2 or more configuration versions”. We can think of Cisco’s historic startup vs. running configuration, or Juniper’s more modern configuration history. In 2018, <a href="https://datatracker.ietf.org/doc/rfc8342/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC8342</a> standardized and extended those 3 base datastores:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Startup: optional, rw, persistent</li>
<li class="remarkup-list-item">Candidate: optional, rw, volatile, can be messed with, no production impact</li>
<li class="remarkup-list-item">Running: required, rw, persistent</li>
</ul>

<p>With:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Intended: required, ro, volatile, can be same as running, apply any needed transformations</li>
<li class="remarkup-list-item">Operational: ro, all configuration and states data</li>
</ul>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><div class="remarkup-code-header">datastores from RFC8342</div><pre class="remarkup-code" style=" max-height: 40em; overflow: auto;">+-------------+                 +-----------+
| &lt;candidate&gt; |                 | &lt;startup&gt; |
|  (ct, rw)   |&lt;---+       +---&gt;| (ct, rw)  |
+-------------+    |       |    +-----------+
       |           |       |           |
       |         +-----------+         |
       +--------&gt;| &lt;running&gt; |&lt;--------+
                 | (ct, rw)  |
                 +-----------+
                       |
                       |        // configuration transformations,
                       |        // e.g., removal of nodes marked as
                       |        // &quot;inactive&quot;, expansion of
                       |        // templates
                       v
                 +------------+
                 | &lt;intended&gt; | // subject to validation
                 | (ct, ro)   |
                 +------------+
                       |        // changes applied, subject to
                       |        // local factors, e.g., missing
                       |        // resources, delays
                       |
  dynamic              |   +-------- learned configuration
  configuration        |   +-------- system configuration
  datastores -----+    |   +-------- default configuration
                  |    |   |
                  v    v   v
               +---------------+
               | &lt;operational&gt; | &lt;-- system state
               | (ct + cf, ro) |
               +---------------+

ct = config true; cf = config false
rw = read-write; ro = read-only
boxes denote named datastores</pre></div>

<p>It’s up to the upper protocols (RESTCONF, NETCONF, etc) to define ways to expose or copy between those datastores (commit, rollback, etc).</p>

<h4 class="remarkup-header">RESTCONF</h4>

<p>This more recent protocol, <a href="https://datatracker.ietf.org/doc/rfc8040/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8040</a> (2017), can be imagined as NETCONF over HTTPS, leveraging the now popular REST architecture. Due to its younger age it has fewer features, but it is regularly extended. For example <a href="https://datatracker.ietf.org/doc/rfc8527/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8527</a> (March 2019) paves the way for config rollback style operations (there is only a non-standard <a href="https://support.yumaworks.com/support/solutions/articles/1000244321-how-to-use-the-confirmed-commit-procedure-in-restconf" class="remarkup-link remarkup-link-ext" rel="noreferrer">implementation</a> so far) as well as subscriptions (themselves in <a href="https://datatracker.ietf.org/doc/rfc8650/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8650</a>) by bringing the concept of datastores to RESTCONF.</p>

<p>RESTCONF, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: HTTPS</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer (TLS)</li>
<li class="remarkup-list-item">Message encoding: XML or JSON</li>
<li class="remarkup-list-item">Data model: YANG*</li>
<li class="remarkup-list-item">Python libraries: many options, starting with the well-known <a href="https://requests.readthedocs.io/en/latest/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Requests</a></li>
</ul>

<p>(*More on YANG later)</p>
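<p>Since RESTCONF is just HTTPS, a query can be sketched with plain Requests. A minimal illustration; the hostname, credentials, YANG path and helper names below are made up for the example, not real endpoints:</p>

```python
# Sketch of a RESTCONF (RFC 8040) GET.  The helper names, device
# hostname, credentials and YANG paths are illustrative only.


def restconf_url(host: str, path: str, datastore: str = "data") -> str:
    """Build a RESTCONF URL for a YANG path.

    With RFC 8527, 'datastore' can also address NMDA datastores,
    e.g. "ds/ietf-datastores:operational".
    """
    return f"https://{host}/restconf/{datastore}/{path}"


def get_yang_data(host: str, path: str, user: str, password: str) -> dict:
    """GET a YANG-modelled subtree, JSON-encoded (RFC 7951 style)."""
    import requests  # third-party; the HTTP library mentioned above

    resp = requests.get(
        restconf_url(host, path),
        auth=(user, password),
        headers={"Accept": "application/yang-data+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```

<p>For example, <tt class="remarkup-monospaced">get_yang_data(host, "openconfig-interfaces:interfaces", user, password)</tt> would return the interfaces subtree as a Python dictionary.</p>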

<h4 class="remarkup-header">gNMI</h4>

<p>While NETCONF and RESTCONF were getting popular for network configuration, SNMP was still the only real option to get monitoring data from network devices. Despite NETCONF (<a href="https://www.rfc-editor.org/rfc/rfc5277.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 5277</a>) and RESTCONF (<a href="https://datatracker.ietf.org/doc/rfc8650/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8650</a>) being extended to support push/notifications, the former never took off, and the latter is still too young to tell.</p>

<p>Streaming telemetry (also called push, subscribe, or notification) is a feature aimed at replacing SNMP on the monitoring side. Instead of regularly polling a device like SNMP-based monitoring does, a client subscribes to specific items/data on the device. Through a long-lived client/server session, the device then sends updated metrics when needed. This NANOG <a href="https://www.youtube.com/watch?v=McNm_WfQTHw" class="remarkup-link remarkup-link-ext" rel="noreferrer">presentation</a> gives some examples of why it’s better than SNMP.</p>

<p>gNMI (gRPC Network Management Interface), built by Google on top of <a href="https://en.wikipedia.org/wiki/GRPC" class="remarkup-link remarkup-link-ext" rel="noreferrer">gRPC</a>, is getting some traction in the industry for that purpose mostly due to its speed compared to the alternatives. That said, gNMI also supports regular “get” and “set” methods as defined in its <a href="https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">specification document</a> (there is no RFC). It also got further extended to support operational commands (restart, clear neighbors, manage certs, etc) through what’s called <a href="https://github.com/openconfig/gnoi" class="remarkup-link remarkup-link-ext" rel="noreferrer">gNOI</a> (gRPC Network Operations Interface).</p>

<p>gNMI, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: gRPC (HTTP/2)</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer (TLS)</li>
<li class="remarkup-list-item">Message encoding: JSON, Protobuf</li>
<li class="remarkup-list-item">Data model: YANG*</li>
<li class="remarkup-list-item">Supports atomic changes (apply a set of changes or nothing, also called transactions)</li>
</ul>

<p>(*More on YANG later)</p>

<p>gNMI being a Google creation, most libraries and tools revolving around this protocol are written in Go. <a href="https://github.com/openconfig/gnmi" class="remarkup-link remarkup-link-ext" rel="noreferrer">The client library</a>, the standalone client <a href="https://github.com/google/gnxi" class="remarkup-link remarkup-link-ext" rel="noreferrer">tools</a> and the <a href="https://github.com/openconfig/gnmic" class="remarkup-link remarkup-link-ext" rel="noreferrer">swiss army knife (gNMIc)</a>. If there is a single tool to learn, it&#039;s gNMIc as it can do pretty much everything gNMI related and <a href="https://www.youtube.com/watch?v=v3CL2vrGD_8" class="remarkup-link remarkup-link-ext" rel="noreferrer">much more</a>.<br />
The Python ecosystem, on the other hand, is fairly dire. First, any project needs to convert the gNMI Protobuf specs into Python libraries, or use the pre-made ones available from <a href="https://github.com/openconfig/gnmi/tree/master/proto" class="remarkup-link remarkup-link-ext" rel="noreferrer">upstream</a>; managing multiple versions of the spec could also be a challenge.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://github.com/akarneliuk/pygnmi/" class="remarkup-link remarkup-link-ext" rel="noreferrer">pygnmi</a> seems the most interesting as it’s feature rich and actively-ish developed. gNOI is however <a href="https://github.com/akarneliuk/pygnmi/issues/109" class="remarkup-link remarkup-link-ext" rel="noreferrer">not</a> supported.</li>
<li class="remarkup-list-item"><a href="https://github.com/python-gnxi/python-gnmi-proto" class="remarkup-link remarkup-link-ext" rel="noreferrer">python-gnmi-proto</a> (hasn&#039;t been updated since 2021) takes a different approach: a different gRPC library and less abstraction (the user handles the gRPC calls directly)</li>
</ul>

<p>As gNMI is multi-platform, we can also look at various vendors’ PoCs, with the risk of them becoming platform-specific in the future.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Arista’s <a href="https://github.com/arista-northwest/gnmi-py" class="remarkup-link remarkup-link-ext" rel="noreferrer">gnmi-py</a> doesn’t support the “set” operation</li>
<li class="remarkup-list-item">Cisco’s <a href="https://github.com/cisco-ie/cisco-gnmi-python" class="remarkup-link remarkup-link-ext" rel="noreferrer">cisco-gnmi-python</a> seems feature rich as well.</li>
</ul>



<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Querying a gNMI device using gNMIc</div><pre class="remarkup-code"><span class="gp">$ gnmic -a lsw1-e8-eqiad.mgmt.eqiad.wmnet:8080 --username admin --password Wikimedia capabilities</span>
<span class="go">gNMI version: 0.7.0</span>
<span class="go">supported models:</span>
<span class="go">[...] # Mix of OpenConfig and SONiC specific YANG models</span>
<span class="go">supported encodings:</span>
<span class="go">  - JSON</span>
<span class="go">  - JSON_IETF</span>
<span class="go">  - PROTO</span></pre></div>



<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Querying a gNMI device using pyGNMI</div><pre class="remarkup-code"><span class="kn">from</span> <span class="nn">pygnmi.client</span> <span class="kn">import</span> <span class="n">gNMIclient</span>
<span class="n">host</span> <span class="o">=</span> <span class="p">(</span><span class="s">&#039;lsw1-e8-eqiad.mgmt.eqiad.wmnet&#039;</span><span class="p">,</span> <span class="s">&#039;8080&#039;</span><span class="p">)</span>
<span class="k">with</span> <span class="n">gNMIclient</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">username</span><span class="o">=</span><span class="s">&#039;admin&#039;</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="s">&#039;Wikimedia&#039;</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">gc</span><span class="p">:</span>
 	<span class="n">result</span> <span class="o">=</span> <span class="n">gc</span><span class="o">.</span><span class="n">capabilities</span><span class="p">()</span>
 	<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span></pre></div>



<h4 class="remarkup-header">YANG</h4>

<p><a href="https://en.wikipedia.org/wiki/YANG" class="remarkup-link remarkup-link-ext" rel="noreferrer">YANG</a> is a standardized (<a href="https://datatracker.ietf.org/doc/html/rfc7950" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 7950</a>) way to structure both configuration and operational (metrics) information, a bit like a DB schema. As with SMI (SNMP MIBs), it’s possible to define dependencies between modules, forming a tree. Even though the data model structure is standardized, there are both vendor-specific and vendor-agnostic modules. Many of those models are available on the <a href="https://github.com/YangModels/yang" class="remarkup-link remarkup-link-ext" rel="noreferrer">YangModels</a> GitHub. The <a href="https://github.com/openconfig/public" class="remarkup-link remarkup-link-ext" rel="noreferrer">OpenConfig</a> models are the most notable vendor-neutral ones, but it’s of course up to the vendors to support them. <br />
At least <strong>Dell&#039;s SONiC aims at supporting OpenConfig</strong>. However, using a mix of vendor-agnostic and vendor-specific models is often required to fully manage a given device, as OpenConfig only covers common features.</p>

<div class="remarkup-code-block" data-code-lang="devicetree" data-sigil="remarkup-code-block"><div class="remarkup-code-header">example yang module in tree view (filtered on config elements for NTP only)</div><pre class="remarkup-code" style=" max-height: 20em; overflow: auto;"><span></span><span class="o">+--</span><span class="na">rw</span> <span class="na">system</span>
   <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">enabled</span><span class="o">?</span>              <span class="na">boolean</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">source</span><span class="o">-</span><span class="na">address</span><span class="o">?</span>   <span class="nl">oc-inet</span><span class="p">:</span><span class="na">ip</span><span class="o">-</span><span class="na">address</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">enable</span><span class="o">-</span><span class="na">ntp</span><span class="o">-</span><span class="na">auth</span><span class="o">?</span>      <span class="na">boolean</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">trusted</span><span class="o">-</span><span class="na">key</span><span class="o">*</span>          <span class="na">uint16</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">source</span><span class="o">-</span><span class="na">interface</span><span class="o">*</span>     <span class="nl">oc-if</span><span class="p">:</span><span class="na">base</span><span class="o">-</span><span class="na">interface</span><span class="o">-</span><span class="na">ref</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">network</span><span class="o">-</span><span class="na">instance</span><span class="o">?</span>     <span class="o">-&gt;</span> <span class="o">/</span><span class="nl">oc-ni</span><span class="p">:</span><span class="na">network</span><span class="o">-</span><span class="na">instances</span><span class="o">/</span><span class="na">network</span><span class="o">-</span><span class="na">instance</span><span class="o">/</span><span class="kr">name</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">keys</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">key</span><span class="o">*</span> <span class="p">[</span><span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="p">]</span>
      <span class="o">|</span>     <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span>    <span class="o">-&gt;</span> <span class="p">..</span><span class="o">/</span><span class="na">config</span><span class="o">/</span><span class="na">key</span><span class="o">-</span><span class="na">id</span>
      <span class="o">|</span>     <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="o">?</span>      <span class="na">uint16</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">type</span><span class="o">?</span>    <span class="na">identityref</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">value</span><span class="o">?</span>   <span class="na">string</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">encrypted</span><span class="o">?</span>   <span class="na">boolean</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">servers</span>
         <span class="o">+--</span><span class="na">rw</span> <span class="na">server</span><span class="o">*</span> <span class="p">[</span><span class="na">address</span><span class="p">]</span>
            <span class="o">+--</span><span class="na">rw</span> <span class="na">address</span>    <span class="o">-&gt;</span> <span class="p">..</span><span class="o">/</span><span class="na">config</span><span class="o">/</span><span class="na">address</span>
            <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">address</span><span class="o">?</span>            <span class="nl">oc-inet</span><span class="p">:</span><span class="na">host</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">port</span><span class="o">?</span>               <span class="nl">oc-inet</span><span class="p">:</span><span class="na">port</span><span class="o">-</span><span class="na">number</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">version</span><span class="o">?</span>            <span class="na">uint8</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">association</span><span class="o">-</span><span class="na">type</span><span class="o">?</span>   <span class="na">enumeration</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">iburst</span><span class="o">?</span>             <span class="na">boolean</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">prefer</span><span class="o">?</span>             <span class="na">boolean</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="o">?</span>             <span class="na">uint16</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">minpoll</span><span class="o">?</span>            <span class="na">uint8</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">maxpoll</span><span class="o">?</span>            <span class="na">uint8</span></pre></div>
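<p>Concretely, an instance of that NTP subtree is just structured data. A sketch of the Python dictionary a client could carry in a gNMI Set or a RESTCONF request (the server address and options are made up for the example):</p>

```python
# An illustrative instance of the openconfig-system NTP subtree shown
# above, as the Python dictionary a gNMI Set (or RESTCONF PUT) would
# carry.  The server address and options are made up for the example.
import json

ntp_config = {
    "openconfig-system:ntp": {
        "config": {"enabled": True},
        "servers": {
            "server": [
                {
                    "address": "ntp1.example.wmnet",  # hypothetical server
                    "config": {
                        "address": "ntp1.example.wmnet",
                        "version": 4,
                        "iburst": True,
                        "prefer": False,
                    },
                }
            ]
        },
    }
}

# JSON-encode the payload, as the protocols expect (RFC 7951 style).
payload = json.dumps(ntp_config)
```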

<p>One important aspect is how to efficiently use those data models in our Python-based automation world. Manually crafting Python structures to match the expected formats didn&#039;t seem appealing to me, even though that&#039;s how at least some other projects (<a href="https://github.com/aerleon/aerleon/blob/main/aerleon/lib/openconfig.py" class="remarkup-link remarkup-link-ext" rel="noreferrer">Aerleon</a>, <a href="https://github.com/saltstack/salt/blob/4f027308f87e8f8350775a2553e8cd4adab193a0/salt/modules/netbox.py#L556" class="remarkup-link remarkup-link-ext" rel="noreferrer">Salt</a>, <a href="https://github.com/ansible-collections/community.network/blob/312b5d87cbb73f17acd4d254889145228d35fb19/plugins/module_utils/network/exos/config/vlans/vlans.py#L25" class="remarkup-link remarkup-link-ext" rel="noreferrer">Ansible</a>) do it.<br />
One direction I investigated is the possibility of generating Python bindings from YANG models, which would allow us to manipulate the data as Python objects. The hope is, for example, to get type checking, dataset comparison and IDE auto-completion for free.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://github.com/robshakir/pyangbind" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pyangbind</a>, plugin for pyang, <del><a href="https://github.com/robshakir/pyangbind/issues/292" class="remarkup-link remarkup-link-ext" rel="noreferrer">abandoned</a></del> (maybe coming back to life?)</li>
<li class="remarkup-list-item">Cisco’s <a href="https://github.com/CiscoDevNet/ydk-gen" class="remarkup-link remarkup-link-ext" rel="noreferrer">YDK</a> is actively maintained but complex to set up; furthermore, it requires the whole SDK to be included in any application that wants to use those bindings</li>
<li class="remarkup-list-item">For RESTCONF OpenAPI servers (built based on YANG data) it’s possible to use <a href="https://github.com/openapi-generators/openapi-python-client" class="remarkup-link remarkup-link-ext" rel="noreferrer">openapi-python-client</a> and in some way reverse-engineer the YANG models… not optimal</li>
<li class="remarkup-list-item"><a href="https://github.com/karlnewell/pyang-pydantic" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pyang-pydantic</a>, another pyang plugin to generate pydantic models from YANG models.</li>
<li class="remarkup-list-item"><a href="https://pydantify.github.io/pydantify/" class="remarkup-link remarkup-link-ext" rel="noreferrer">pydantify</a> (relevant <a href="https://eprints.ost.ch/id/eprint/1089/" class="remarkup-link remarkup-link-ext" rel="noreferrer">paper</a>), more recent but experimental. Still the most promising option yet too young for my use-cases<ul class="remarkup-list">
<li class="remarkup-list-item">In addition to using it as a configuration builder, it could potentially also be used as a syntax checker or a way of showing differences between two configurations (eg. candidate and running)</li>
</ul></li>
</ul>
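<p>To make the appeal of such bindings concrete, here is a hand-written sketch (using stdlib dataclasses, with hypothetical class names) of what generated YANG-to-Python bindings could look like for the NTP model above. Real generators like pyangbind or pydantify emit much richer classes:</p>

```python
# Hand-written sketch of what YANG-to-Python bindings could provide:
# typed fields, validation and IDE auto-completion instead of
# free-form dictionaries.  Class names are hypothetical; real
# generators (pyangbind, pydantify) emit richer classes.
from dataclasses import asdict, dataclass, field


@dataclass
class NtpServerConfig:
    address: str
    version: int = 4
    iburst: bool = False
    prefer: bool = False

    def __post_init__(self) -> None:
        # The YANG model types 'version' as a uint8.
        if not 0 <= self.version <= 255:
            raise ValueError("version must fit in a uint8")


@dataclass
class Ntp:
    enabled: bool = True
    servers: list = field(default_factory=list)

    def to_payload(self) -> dict:
        """Serialize back to the dict shape the device expects."""
        return {
            "config": {"enabled": self.enabled},
            "servers": {
                "server": [
                    {"address": s.address, "config": asdict(s)}
                    for s in self.servers
                ]
            },
        }
```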

<p>Much easier in Go, where <a href="https://github.com/openconfig/ygot" class="remarkup-link remarkup-link-ext" rel="noreferrer">tools</a> exist to convert such models directly to Go bindings or Protobuf schemas. But wait! It’s <a href="https://karneliuk.com/2020/05/gnmi-part-3-using-grpc-to-collect-data-in-openconfig-yang-from-arista-eos-and-nokia-sr-os/" class="remarkup-link remarkup-link-ext" rel="noreferrer">possible</a> to convert gNMI Protobuf schemas to Python objects, a path I didn&#039;t explore as it risked being too convoluted for our use cases.</p>

<h2 class="remarkup-header">The plan</h2>

<p>Now that we have an overview of the modern and less modern ways of interacting with network devices, here is the current plan.</p>

<h3 class="remarkup-header">NETCONF vs. gNMI</h3>

<p>To start with, we need something that works with both Juniper and SONiC if we hope for &quot;one protocol to rule them all&quot;. Thus, NETCONF is out of the game, as it is not supported by SONiC.</p>

<p>RESTCONF&#039;s main advantage is its easier handling (being HTTP-based). On the other hand, gNMI is faster and a “two in one” solution, as it handles both monitoring and configuration.</p>

<p><span class="remarkup-highlight">After testing both, gNMI seems the best bet forward to me as its only downside (apparent opacity) is counterbalanced by good tools and libraries (gnmic, grpcio).</span></p>

<p>It is unlikely that one or the other becomes obsolete anytime soon, and as long as the data models (YANG) don&#039;t, moving from one transport to the other would only mean switching tools (easier said than done, but much better than changing data models). gNMI is also supported by all major vendors (Cisco, Arista, etc.), so we don&#039;t get vendor lock-in by using it.</p>

<h3 class="remarkup-header">Authentication</h3>

<p>One thing that is sure though, is that <strong>both RESTCONF and gNMI require a good PKI infrastructure</strong> as they both require TLS. <a href="/T334594" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_9"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T334594: TLS certificates for network devices</span></span></a> is the first building block for our next-gen automation.</p>

<p>Then comes authentication. Until now we have done everything over SSH, both for CLI and NETCONF. Unfortunately, RESTCONF and gNMI both require a TLS-based authentication mechanism.<br />
While SONiC supports both basic HTTP authentication (regular username/password) and client certificates (username in the certificate&#039;s CN field), Junos only supports basic HTTP (the certificate is only an additional layer of security, but <a href="https://www.juniper.net/documentation/us/en/software/junos/grpc-network-services/topics/topic-map/grpc-services-configuring.html#task-configure-mutual-authentication-for-grpc" class="remarkup-link remarkup-link-ext" rel="noreferrer">doesn&#039;t authenticate the user itself</a>). This means that we have to implement a mechanism to define passwords for <em>at least</em> the users that will use the API (eg. Homer).<br />
On top of that, SONiC&#039;s support of users through the API is still limited (it doesn&#039;t handle SSH keys and doesn&#039;t expose hashed passwords).<br />
<strong>For both of those reasons, <a href="/T338028" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_10"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T338028: Users management on SONiC</span></span></a> is the next critical stepping stone to tackle</strong>; multiple paths and proposals are being discussed in the task.</p>

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> The points above are not hard blockers for tests and to start working on the automation itself, but are required for any production deployment.</div>



<h3 class="remarkup-header">Automation</h3>

<p>Only now comes the important part of the topic, our automation.</p>

<p>Even though gNMI might make Homer fast enough to obsolete some of the network-related cookbooks, the workflows of the two tools are distinct enough not to make this a short-term goal. Keeping both tools separate will also make the transition easier. This will of course bring the risk of duplicated code, like we currently have for Juniper.</p>

<p>The low number of options makes the choice of Python library easier: <span class="remarkup-highlight">pyGNMI</span> does the job well.</p>

<p>Unfortunately the various YANG-to-Python bindings libraries are not ripe enough for prime time, which means <span class="remarkup-highlight">we will have to rely on Python dictionary structures</span>. Those are not <em>that</em> bad once we&#039;re familiar with them, but we should keep an eye on Pydantify (especially the Pydantic 2 <a href="https://github.com/pydantify/pydantify/issues/14" class="remarkup-link remarkup-link-ext" rel="noreferrer">upgrade</a>).</p>

<p>Once we have the data structures, we need to be able to compare them. So far this process was offloaded to Juniper&#039;s OS: send the new data, ask for a diff, commit if fine. This is not strictly needed for simple actions, like configuring a single switch interface, but the more complex the change, the more important it is to catch mistakes before it&#039;s too late. A basic implementation could rely on existing libraries such as dictdiffer or deepdiff, but also on the pyGNMI diff_openconfig feature once some of its bugs have been fixed (see my <a href="https://github.com/akarneliuk/pygnmi/pull/122" class="remarkup-link remarkup-link-ext" rel="noreferrer">couple</a> <a href="https://github.com/akarneliuk/pygnmi/pull/123" class="remarkup-link remarkup-link-ext" rel="noreferrer">PRs</a> and <a href="https://github.com/akarneliuk/pygnmi/issues/124" class="remarkup-link remarkup-link-ext" rel="noreferrer">some</a> <a href="https://github.com/akarneliuk/pygnmi/issues/125" class="remarkup-link remarkup-link-ext" rel="noreferrer">issues</a>).</p>
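<p>As a minimal, hand-rolled sketch of what such a comparison involves (an illustration of the idea, not the dictdiffer or deepdiff API): a recursive walk over the running and candidate configuration dictionaries, reporting what was added, removed or changed:</p>

```python
# Minimal, hand-rolled sketch of a nested-dict comparison, the kind
# of diff dictdiffer or deepdiff provide out of the box: walk the
# running and candidate configs and report additions, removals and
# changed values along with their paths.
def diff_config(running: dict, candidate: dict, path: str = "") -> list:
    changes = []
    for key in sorted(running.keys() | candidate.keys()):
        here = f"{path}/{key}"
        if key not in candidate:
            changes.append(("remove", here, running[key]))
        elif key not in running:
            changes.append(("add", here, candidate[key]))
        elif isinstance(running[key], dict) and isinstance(candidate[key], dict):
            changes.extend(diff_config(running[key], candidate[key], here))
        elif running[key] != candidate[key]:
            changes.append(("change", here, (running[key], candidate[key])))
    return changes
```

<p>For example, diffing a running config where NTP is disabled against a candidate where it is enabled reports a single change at <tt class="remarkup-monospaced">/ntp/config/enabled</tt>, which can then be shown to the operator before committing.</p>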

<h4 class="remarkup-header">Cookbooks</h4>

<p>An initial proof-of-concept approach is available on Gerrit (<a href="https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/924896" class="remarkup-link remarkup-link-ext" rel="noreferrer">CR924896</a>). It shows that the approach works, but a few things are needed to make it production-grade:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="/T340045" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_11"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T340045: Package pyGNMI and dictdiffer to be used by cookbooks</span></span></a></li>
<li class="remarkup-list-item">Migrate the OpenConfig/gNMI cookbook functions to Spicerack (not needed on day 1)</li>
<li class="remarkup-list-item">Implement a diff feature as mentioned above</li>
</ul>

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> At this point we could also look at migrating some of the existing Juniper &quot;read only&quot; cookbook functions to gNMI. Especially if they follow the OpenConfig model.</div>



<h4 class="remarkup-header">Homer</h4>

<p>The initial scaffolding to support gNMI has been done (see <a href="https://gerrit.wikimedia.org/r/c/operations/software/homer/+/927736" class="remarkup-link remarkup-link-ext" rel="noreferrer">CR927736</a>). Some adjustments are needed but its logic has been validated.<br />
One point not yet cleared up is how to run it from our own laptops. The current Homer/Junos/NETCONF setup leverages SSH and is thus able to automatically use our jump-hosts. HTTP&#039;s equivalent is SOCKS5 (see for example <a href="/T319426" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_12"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T319426: [cookbooks] Add ssh socks5 proxy support</span></span></a>), but <span class="remarkup-highlight">its support in gRPC is unlikely to happen anytime soon</span> (see <a href="https://github.com/grpc/grpc/issues/30347" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/grpc/grpc/issues/30347</a>).</p>

<p>The next steps are to iteratively add support for the various SONiC configuration elements, one after the other. The easiest approach is to manually configure a device, fetch its configuration in the OpenConfig format, then &quot;templatize&quot; it, starting with the easy bits and re-using the diff feature. This is the approach we took when working on Juniper devices.</p>
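<p>A toy sketch of that &quot;fetch, then templatize&quot; workflow; the OpenConfig fragment and the template variables are made up for the example (a real implementation would use Jinja2 and the device&#039;s actual output):</p>

```python
# Toy sketch of the "fetch the device config, then templatize it"
# workflow.  The OpenConfig fragment and template variables are made
# up for the example; a real implementation would use Jinja2 and the
# device's actual output.
import json
from string import Template

# Step 1: a fragment as fetched from a manually configured device.
fetched = {
    "openconfig-interfaces:interface": [
        {"name": "Ethernet0", "config": {"description": "uplink", "mtu": 9192}}
    ]
}

# Step 2: the same fragment, with the site-specific values turned
# into template variables by hand.
template = Template(json.dumps({
    "openconfig-interfaces:interface": [
        {"name": "$ifname", "config": {"description": "$descr", "mtu": 9192}}
    ]
}))

# Step 3: render the template for another interface.
rendered = json.loads(template.substitute(ifname="Ethernet4", descr="server-port"))
```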

<p>While this goal progresses, <span class="remarkup-highlight">we will benefit from transitioning more data from YAML files to Netbox</span>, thanks to <a href="/T336275" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_13"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T336275: Upgrade Netbox to 4.x</span></span></a> and <a href="/T305126" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_14"><span class="phui-tag-core phui-tag-color-object">T305126: Make more extensive use of Netbox custom fields</span></a>.</p>

<p>The last special bit is <del>Capirca</del> <a href="https://github.com/aerleon/aerleon" class="remarkup-link remarkup-link-ext" rel="noreferrer">Aerleon</a>, the ACL-generation library. Despite claiming to support OpenConfig ACLs, some features are missing and the output format doesn&#039;t fit the OpenConfig YANG model. <span class="remarkup-highlight">My pull requests to fix that are either merged or being reviewed</span> (<a href="https://github.com/aerleon/aerleon/pull/311" class="remarkup-link remarkup-link-ext" rel="noreferrer">#311</a>, <a href="https://github.com/aerleon/aerleon/pull/312" class="remarkup-link remarkup-link-ext" rel="noreferrer">#312</a>, <a href="https://github.com/aerleon/aerleon/pull/313" class="remarkup-link remarkup-link-ext" rel="noreferrer">#313</a>).</p>

<h3 class="remarkup-header">Monitoring</h3>

<p>Still an area to explore, and less urgent for us. gNMIc as a Prometheus <a href="https://gnmic.openconfig.net/user_guide/outputs/prometheus_output/" class="remarkup-link remarkup-link-ext" rel="noreferrer">intermediary</a> is a promising option.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>This is not an easy path, but the way forward is relatively clear at this point. More issues will undoubtedly show up as we progress. Achieving this goal will bring three key benefits:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Unified (and modern!) protocol for configuration and monitoring</li>
<li class="remarkup-list-item">Multi-vendor support</li>
<li class="remarkup-list-item">Faster operations</li>
</ul>

<p><em>Illustration photo by Benjamin Elliott on <a href="https://unsplash.com/fr/photos/vc9u77c0LO4" class="remarkup-link remarkup-link-ext" rel="noreferrer">Unsplash</a></em></p></div></content></entry><entry><title>Netbox news</title><link href="/phame/live/17/post/289/netbox_news/" /><id>https://phabricator.wikimedia.org/phame/post/view/289/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2022-06-28T11:27:43+00:00</published><updated>2022-06-28T13:32:06+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p><a href="https://github.com/netbox-community/netbox/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox</a> is a tool used by all SREs, either directly or abstracted through cookbooks and various scripts. Managed by <a href="/tag/infrastructure-foundations/" class="phui-tag-view phui-tag-type-shade phui-tag-violet phui-tag-shade phui-tag-icon-view " data-sigil="hovercard" data-meta="0_26"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-users" data-meta="0_25" aria-hidden="true"></span>Infrastructure-Foundations</span></a>, it went through a major (and much needed!) upgrade this past quarter, led by John Bond, myself and with the help of Riccardo.</p>

<p>For historical context, around the release of Netbox 2.10 the project was forked, creating a new project, Nautobot. Netbox 2.10.4 was the last version compatible with both Netbox and the new fork, so we remained on it until we could evaluate our needs going forward. After discussions, we decided to stay on Netbox (see why).</p>

<p>Here is a rundown of the current and future improvements. This work was tracked in <a href="/T296452" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_15"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T296452: Upgrade Netbox to 3.2</span></span></a></p>

<h3 class="remarkup-header">Infrastructure</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">100% on bullseye servers (it used to be on buster)</li>
<li class="remarkup-list-item">Behind our CDNs</li>
<li class="remarkup-list-item">Active/passive frontends</li>
<li class="remarkup-list-item">Separate internal vs. external endpoints (creation of netbox.discovery.wmnet)</li>
<li class="remarkup-list-item">Documentation fully refactored and cleaned up: have a look at <a href="https://wikitech.wikimedia.org/wiki/Netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox - Wikitech</a></li>
<li class="remarkup-list-item">New “single pane of glass” Grafana dashboard for Netbox health monitoring <a href="https://grafana.wikimedia.org/d/DvXT6LCnk/netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://grafana.wikimedia.org/d/DvXT6LCnk/netbox</a><ul class="remarkup-list">
<li class="remarkup-list-item">With the addition of django monitoring - <a href="https://phabricator.wikimedia.org/T243928" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_16"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T243928</span></span></a></li>
<li class="remarkup-list-item">This led to the improvement of the existing Postgres dashboard - <a href="https://grafana.wikimedia.org/d/000000469/postgres" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://grafana.wikimedia.org/d/000000469/postgres</a></li>
<li class="remarkup-list-item">Leverages the new prometheus::blackbox::check::http created by Filippo and John</li>
</ul></li>
</ul>

<p>All the above improvements will contribute to having a rock solid source of truth as we come to rely on it more and more across SRE.</p>

<h3 class="remarkup-header">New changes already visible and used in prod</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">Refreshed UI<ul class="remarkup-list">
<li class="remarkup-list-item">Including a dark mode</li>
<li class="remarkup-list-item">Better filtering on pretty much all the pages</li>
</ul></li>
<li class="remarkup-list-item">Group support for Ganeti clusters - <a href="https://phabricator.wikimedia.org/T262446" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_17"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T262446</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Helps model our infrastructure better,</li>
<li class="remarkup-list-item">Ties in John’s work to expose Netbox data in Puppet - see <a href="https://phabricator.wikimedia.org/T229397" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_18"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T229397</span></span></a>,</li>
<li class="remarkup-list-item">Will allow hosts that rely on row/rack data to not need hardcoded values anymore (eg. kubernetes)</li>
</ul></li>
<li class="remarkup-list-item">Improved reports (reports are automated functions that alert on data inconsistencies based on our own rules and conventions):<ul class="remarkup-list">
<li class="remarkup-list-item">Network interface MTU misconfigurations</li>
<li class="remarkup-list-item">Better error logging for reports</li>
</ul></li>
<li class="remarkup-list-item">End-to-end path tracing now traverses circuits<ul class="remarkup-list">
<li class="remarkup-list-item">This allows us to see exactly where a given network interface leads, for example <a href="https://netbox.wikimedia.org/dcim/interfaces/21226/trace/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://netbox.wikimedia.org/dcim/interfaces/21226/trace/</a></li>
</ul></li>
<li class="remarkup-list-item">Improved contact management<ul class="remarkup-list">
<li class="remarkup-list-item">It is now possible to clearly document the NOC escalation order, for example <a href="https://netbox.wikimedia.org/circuits/providers/64/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://netbox.wikimedia.org/circuits/providers/64/</a></li>
</ul></li>
<li class="remarkup-list-item">AS numbers model<ul class="remarkup-list">
<li class="remarkup-list-item">Helps centralize data in our single source of truth (before, we had to maintain a dedicated wiki page <a href="https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations</a>)</li>
<li class="remarkup-list-item">This will help in future network automation efforts, see <a href="https://phabricator.wikimedia.org/T305126#7941476" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_19"><span class="phui-tag-core phui-tag-color-object">T305126#7941476</span></a></li>
</ul></li>
<li class="remarkup-list-item">Improved custom fields support (custom fields are a way to extend the default models to store WMF-specific data, for example the procurement task on servers)<ul class="remarkup-list">
<li class="remarkup-list-item">Extended to most of the models (for example, adding a purchase date to an inventory item)</li>
<li class="remarkup-list-item">Objects can now be used as custom “fields” (for example, linking a row to a Ganeti cluster)</li>
</ul></li>
</ul>

<hr class="remarkup-hr" />

<p>While we’ve been busy with the above, there is much more to evaluate and possibly implement. A complete list is available at <a href="https://wikitech.wikimedia.org/wiki/Netbox#Future_improvements" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/Netbox#Future_improvements</a>; many of them are good first bugs if you’re interested in learning more and contributing to our setup.</p>

<p>Here are some highlights:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">GraphQL API - T310577<ul class="remarkup-list">
<li class="remarkup-list-item">Some scripts are well known for their slowness (eg. the DNS cookbook); GraphQL should speed them up</li>
</ul></li>
<li class="remarkup-list-item">Basic change rollback - <a href="https://phabricator.wikimedia.org/T310589" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_20"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T310589</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">If deemed viable, this will help quickly rollback accidental edits and deletions</li>
</ul></li>
<li class="remarkup-list-item">Use Custom Model Validation - <a href="https://phabricator.wikimedia.org/T310590" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_21"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T310590</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Long requested by DC-Ops, this will help catch entry mistakes (eg. typos) before they happen (rather than waiting for a report to trigger)</li>
</ul></li>
<li class="remarkup-list-item">Make more extensive use of Netbox custom fields - <a href="https://phabricator.wikimedia.org/T305126" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_22"><span class="phui-tag-core phui-tag-color-object">T305126</span></a><ul class="remarkup-list">
<li class="remarkup-list-item">This will allow us to move data from YAML files and free-form “description” fields to structured Netbox fields</li>
</ul></li>
<li class="remarkup-list-item">Represent sub-interface and bridge device associations in Netbox - <a href="https://phabricator.wikimedia.org/T296832" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_23"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T296832</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">This will allow us to document some edge cases in server configuration,</li>
<li class="remarkup-list-item">And to model our network devices better,</li>
<li class="remarkup-list-item">Both of which will help us improve our automation</li>
</ul></li>
<li class="remarkup-list-item">Using a central Redis instance - <a href="https://phabricator.wikimedia.org/T311385" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_24"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T311385</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Prerequisite for active/active frontends (faster and more reliable)</li>
</ul></li>
</ul>
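<p>To give a taste of the GraphQL API mentioned above: instead of several REST round trips, a script could fetch a device together with its related objects in a single query. The sketch below is illustrative only; the device name is made up and the exact field names depend on the Netbox version’s GraphQL schema:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">query {
  device_list(name: "ganeti1001") {   # hypothetical device name
    name
    rack { name }
    interfaces { name }
  }
}</pre></div>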

<p>We hope that this background work will improve your SRE experience by making this abstraction of our real-life infrastructure faster, more complete and more reliable.</p>

<p>As usual, don’t hesitate to reach out to <a href="/tag/infrastructure-foundations/" class="phui-tag-view phui-tag-type-shade phui-tag-violet phui-tag-shade phui-tag-icon-view " data-sigil="hovercard" data-meta="0_28"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-users" data-meta="0_27" aria-hidden="true"></span>Infrastructure-Foundations</span></a> for any issues, requests or suggestions.</p>

<hr class="remarkup-hr" />

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> The following text was sent to ops-private@, saving it here for public archival</div>

<p><em>Header image from <a href="https://unsplash.com/photos/mTkXSSScrzw" class="remarkup-link remarkup-link-ext" rel="noreferrer">Unsplash</a></em></p></div></content></entry><entry><title>Internal anycast</title><link href="/phame/live/17/post/190/internal_anycast/" /><id>https://phabricator.wikimedia.org/phame/post/view/190/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2020-08-07T09:48:06+00:00</published><updated>2020-08-07T18:25:31+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>This project brought two major changes to our infrastructure. Firstly, servers that used to be fronted by LVS for load balancing are now peering directly with our routers. Secondly, we have started using IP anycast for a highly critical service: recursive DNS.</p>

<h2 class="remarkup-header">Load balancing</h2>

<p>At the infrastructure level, load balancing means sending client requests to more than one backend server. There are many different ways to achieve this, each with its own advantages and drawbacks.</p>

<p>Any user accessing Wikimedia’s websites will go through the following two layers.</p>

<h3 class="remarkup-header">GeoDNS</h3>

<ol class="remarkup-list">
<li class="remarkup-list-item">A client asks our DNS for the IP of a given service (eg. <a href="https://www.wikipedia.org" class="remarkup-link remarkup-link-ext" rel="noreferrer">www.wikipedia.org</a>)</li>
<li class="remarkup-list-item">Our authoritative DNS server looks up the client IP using an IP to geolocation database (in our case MaxMind), which in turn gives a rough idea of this IP’s location (country or state)</li>
<li class="remarkup-list-item">Finally, our DNS server checks our <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/master/geo-maps" class="remarkup-link remarkup-link-ext" rel="noreferrer">manually curated list</a> of “location-POP mapping”, and replies with the IP of the nearest POP</li>
</ol>

<p>As a result, depending on their estimated location, users are balanced across different caching POPs.</p>
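<p>The per-location decision described in steps 2 and 3 can be sketched in a few lines of Python (a toy illustration, not our actual DNS server; the location keys and the default POP are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">GEO_MAP = {
    "US-CA": "ulsfo",     # San Francisco area is mapped to the San Francisco POP
    "SG": "eqsin",        # Singapore is mapped to the Singapore POP
}
POP_IPS = {
    "ulsfo": "198.35.26.96",
    "eqsin": "103.102.166.224",
}

def resolve(client_location, default_pop="ulsfo"):
    """Return the service IP of the POP mapped to the client's coarse location."""
    pop = GEO_MAP.get(client_location, default_pop)
    return POP_IPS[pop]

print(resolve("SG"))   # prints 103.102.166.224</pre></div>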

<p>See, for example, in San Francisco and Singapore:</p>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="gp">$ host www.wikipedia.org</span>
<span class="go">www.wikipedia.org is an alias for dyna.wikimedia.org.</span>
<span class="go">dyna.wikimedia.org has address 198.35.26.96</span>
<span class="go">dyna.wikimedia.org has IPv6 address 2620:0:863:ed1a::1</span></pre></div>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="gp">$ host www.wikipedia.org</span>
<span class="go">www.wikipedia.org is an alias for dyna.wikimedia.org.</span>
<span class="go">dyna.wikimedia.org has address 103.102.166.224</span>
<span class="go">dyna.wikimedia.org has IPv6 address 2001:df2:e500:ed1a::1</span></pre></div>



<h3 class="remarkup-header">LVS</h3>

<p>To reach those IPs (eg. 2620:0:863:ed1a::1 or 2001:df2:e500:ed1a::1), users’ requests will cross the Internet and eventually hit our routers and our Linux load-balancer: the <a href="https://wikitech.wikimedia.org/wiki/LVS" class="remarkup-link remarkup-link-ext" rel="noreferrer">Linux Virtual Server</a>.</p>

<p>LVS peers with our routers using BGP to advertise (“claim”) those specific IPs, and forwards inbound traffic towards them to a pool of backend servers (called “origin servers”). Decisions about which server to forward the traffic to are made based on:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Their administrative state: pooled/de-pooled, which is set in an etcd-backed store. See <a href="https://config-master.wikimedia.org/pybal/eqiad/text-https" class="remarkup-link remarkup-link-ext" rel="noreferrer">eqiad/text-https</a> for example</li>
<li class="remarkup-list-item">The health of the service, using regular health checks (active probing)</li>
<li class="remarkup-list-item">The source and destination IP and port (hashing)</li>
</ul>

<p>The first two are handled by <a href="https://wikitech.wikimedia.org/wiki/PyBal" class="remarkup-link remarkup-link-ext" rel="noreferrer">PyBal</a>, our homemade LVS manager and “battle-tested” tool.</p>

<p>The last one is to ensure that packets from a user’s session are always forwarded to the same backend server. If they were randomly balanced, backend servers would not know what the packets are about, as they don’t share session state with each other (doing so would be very costly). There are <a href="https://phabricator.wikimedia.org/T86651" class="remarkup-link" rel="noreferrer">thoughts</a> about replacing the scheduler.</p>
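<p>The idea behind that hashing step can be sketched as follows (a minimal illustration, not PyBal’s or IPVS’s actual scheduler; the server names are made up):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Toy flow-hashing scheduler: every packet of a session picks the same backend.
import hashlib

BACKENDS = ["cp1075", "cp1077", "cp1079"]   # hypothetical pooled origin servers

def pick_backend(src_ip, src_port, dst_ip, dst_port):
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the digest to an index into the pool
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

# The same 4-tuple always maps to the same server:
first = pick_backend("203.0.113.7", 54321, "198.35.26.96", 443)
again = pick_backend("203.0.113.7", 54321, "198.35.26.96", 443)
assert first == again</pre></div>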

<h3 class="remarkup-header">Bypassing the LVS</h3>

<p>Our routers (in our case Juniper MXs, but it’s similar across all the major vendors) support multiple types of load balancing. The one we’re interested in now is called <a href="https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-bgp-session.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGP multipath</a>. To achieve this, every end server maintains a BGP session with the routers and advertises the same load-balanced IP.</p>

<p>BGP’s default behavior is to pick only one path (one backend server in our case) and keep the other ones as backups. After flipping the multipath knob, the router will start doing what’s called ECMP (Equal-Cost Multi-Path) and, similarly to LVS, will decide which server to forward the packets to based on:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">The servers advertising the IPs (passive)</li>
<li class="remarkup-list-item">The source and destination IP and port (<a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hash-key-edit-forwarding-options.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">configurable</a>)</li>
</ul>
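<p>For illustration, the router-side change boils down to a few lines of Junos configuration. This is a sketch with a hypothetical group name; the exact load-balancing statements vary per platform:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">protocols {
    bgp {
        group anycast4 {               /* hypothetical group name */
            multipath;                 /* install all equal-cost BGP paths */
        }
    }
}
routing-options {
    forwarding-table {
        export load-balance;           /* policy enabling per-flow balancing */
    }
}</pre></div>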

<p>This has the obvious advantage of being a more lightweight solution: getting rid of a middle layer means less hardware, less software, and a simpler configuration.</p>

<p>On the other hand, there are some limitations:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Service health-check probing is replaced by self-monitoring: a daemon on the end server stops advertising the IP if it detects an issue at the service level. This leaves open some failure scenarios where the end service is locally healthy but can’t be reached remotely (eg. firewalling issues)</li>
<li class="remarkup-list-item">Less control over the <a href="https://www.juniper.net/documentation/en_US/junos/topics/concept/hash-computation-mpcs-understanding.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">hashing algorithm</a>, since it is controlled by proprietary software (the router’s OS). Not a big deal until we hit bugs</li>
<li class="remarkup-list-item">No etcd integration (yet?), thus depooling a backend server is only possible by disabling the BGP session or by stopping the self-monitored service</li>
</ul>

<p>The goal here is not to get rid of those LVS, but instead to find a better load balancing solution for those &quot;small&quot; services on which LVS itself may depend, for example DNS.</p>

<p>Let’s check out anycast before looking at the end result.</p>

<h2 class="remarkup-header">What is anycast?</h2>

<p><a href="https://en.wikipedia.org/wiki/Anycast" class="remarkup-link remarkup-link-ext" rel="noreferrer">Anycast</a> is one of the few ways the IP stack can route traffic from a source to a destination. In a good old unicast setup, the destination IP is unique on the network. But, what happens if there are several of them? You can imagine it as a larger scale version of BGP multipath (mentioned previously).<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/zfp6twbdj4a2maphejoi/PHID-FILE-olkdymy6q7spmvdw77xm/image1.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_29"><img src="https://phab.wmfusercontent.org/file/data/zfp6twbdj4a2maphejoi/PHID-FILE-olkdymy6q7spmvdw77xm/image1.png" height="618" width="577" loading="lazy" alt="image1.png (618×577 px, 25 KB)" /></a></div><br />
When a router receives a packet destined for an IP for which multiple paths exist, it will go through a list of criteria (known as <a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/general/routing-protocols-address-representation.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">path selection</a>) in order to decide which next hop is the best. For BGP, the main criterion is the “distance”, measured in AS PATH length.</p>

<p>The way our infrastructure is designed, each service is assigned an AS number (eg. 64605 for anycast, 64600 for LVS); the same goes for each site (eg. 65001 for Ashburn, 65004 for San Francisco).</p>

<p>For example, traffic going from a host to a service hosted in the same site will have a distance (AS PATH length) of 1 (SITE-&gt;SERVICE). The distance for the same host to the same service in another site would be 2 (SITE-&gt;SITE-&gt;SERVICE).</p>

<p>In the example below (edited for readability), cr3-ulsfo has 3 options to reach 10.3.0.1; the first one, having the shortest distance, is the preferred route.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">cr3-ulsfo&gt; show route 10.3.0.1 terse
A Destination  Next hop      AS path
* 10.3.0.1/32  198.35.26.7    64605 I
 10.3.0.1/32  198.35.26.197   (65002) 64605 I
 10.3.0.1/32  198.35.26.197   (65001 65002) 64605 I</pre></div>
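<p>The selection logic shown above can be sketched as follows (toy code; real BGP path selection has many more tie-breakers than AS path length):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># The routes from the router output above, as (next hop, AS path) pairs
routes = [
    {"next_hop": "198.35.26.7",   "as_path": [64605]},
    {"next_hop": "198.35.26.197", "as_path": [65002, 64605]},
    {"next_hop": "198.35.26.197", "as_path": [65001, 65002, 64605]},
]

# Prefer the route with the shortest AS path
best = min(routes, key=lambda route: len(route["as_path"]))
print(best["next_hop"])   # prints 198.35.26.7</pre></div>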



<h3 class="remarkup-header">Reasons for using anycast</h3>

<p>If for some reason the preferred path becomes unavailable, it will transparently fail over to the next one within milliseconds, making the service significantly more resilient. One must obviously keep those traffic pattern changes in mind while designing the service. In our infrastructure, edge (caching) sites will fall back to core sites if the local service is down, but not the other way around.</p>

<p>To give a more concrete example, using an internal service:</p>

<p>If an anycast endpoint is in Ashburn, all clients in Ashburn will prefer it. If that endpoint goes down, and we have a similar endpoint in Dallas, Ashburn clients will automatically &quot;reroute&quot; to Dallas.</p>

<p>It is easy to see the reliability improvements of the above solution compared to more traditional ones, like, for instance, when servers had two nameserver entries in their resolv.conf file. Unfortunately, resolv.conf is configured to try the nameservers sequentially and has a default timeout of 5 seconds, with a minimum possible value of one second. This means that an outage can lead to servers being unable to resolve DNS for a number of seconds before they failover to the second nameserver. Some services are more sensitive to these failures than others and we have observed real issues with such outages. More details on task <a href="https://phabricator.wikimedia.org/T162818" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_34"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T162818</span></span></a>.</p>
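<p>To make the contrast concrete, here is what the two approaches look like in resolv.conf. This is illustrative: the unicast resolver IPs are made up, while 10.3.0.1 is our anycast recursor:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Traditional setup: sequential failover, seconds of delay during an outage
nameserver 10.64.0.11     # hypothetical primary recursor
nameserver 10.64.16.11    # hypothetical secondary, only tried after a timeout
options timeout:1 attempts:2

# Anycast setup: a single IP, failover handled by the network
nameserver 10.3.0.1</pre></div>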

<p>In addition to resilience, Anycast does a good job of keeping latency to a minimum, as a shorter AS PATH usually means lower latency. While this is true within a controlled network, the Internet is another can of worms, but it is still the best option for services which can&#039;t do GeoIP (eg. authoritative DNS servers).</p>

<p>Internally, we don&#039;t have to maintain a mapping of which server is the best one for a given POP. In our case, all hosts use 10.3.0.1 as DNS. Set it, and forget it.</p>

<p>Of course it&#039;s not all upside: Anycast comes with one major risk, especially for a stateful protocol such as TCP: flapping. External factors (topology changes, incorrect load balancing) can cause packets of a given session to get redirected to a different backend server. As the new server did not take part in the initial TCP handshake, it will have no local state and will reject (RST) the connection. Fixing this requires keeping state on the routers or sharing it between backend servers. Both are incredibly costly solutions. Remember, we’re trying to keep that step as lightweight as possible. Thankfully, experience and <a href="https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">studies</a> have shown that even on a network as dynamic as the Internet, those situations are uncommon.</p>

<p>Another limitation is monitoring: as a source is not able to target a specific destination host (the network decides), monitoring needs to run from at least as many vantage points as there are end nodes.</p>

<h2 class="remarkup-header">Our implementation</h2>

<p>Tracked in: <a href="https://phabricator.wikimedia.org/T186550" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_35"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T186550</span></span></a><br />
Documented in: <a href="https://wikitech.wikimedia.org/wiki/Anycast#Internal" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/Anycast#Internal</a></p>

<p>Firstly, we need a daemon that runs on our Debian servers and talks BGP to our routers. We chose the BIRD Internet Routing Daemon because it is well tested and supports BFD out of the box.</p>

<p><a href="https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection" class="remarkup-link remarkup-link-ext" rel="noreferrer">BFD</a> (Bidirectional Forwarding Detection) is a very fast and lightweight failure detection tool. As BGP&#039;s keepalive timers are not designed to be quick (<a href="https://tools.ietf.org/html/rfc4271" class="remarkup-link remarkup-link-ext" rel="noreferrer">90s by default</a>), we need something to ensure the routers will notice the server going down fast enough, in our case after 3×300ms.</p>
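<p>In Bird’s configuration, those timers look roughly like this. This is a hand-written sketch, not our Puppet-generated config; the neighbor address is hypothetical, while the AS numbers match the ones mentioned earlier:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Declare a peer dead after 3 missed packets at 300 ms intervals (~900 ms)
protocol bfd {
    interface "*" {
        min rx interval 300 ms;
        min tx interval 300 ms;
        multiplier 3;
    };
}

protocol bgp anycast {
    local as 64605;
    neighbor 10.64.0.2 as 65001;   # hypothetical router address
    bfd on;                        # tie the BGP session to BFD liveness
}</pre></div>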

<p>At this point, we could already call it a day. We have the server advertising the Anycast IP via Bird to the router and a failover mechanism if the server fails.</p>

<p>But, what if the server stays healthy while the service itself dies?</p>

<p>To cover that failure scenario we found a useful and lightweight tool on GitHub called <a href="https://github.com/unixsurfer/anycast_healthchecker" class="remarkup-link remarkup-link-ext" rel="noreferrer">anycast_healthchecker</a>.</p>

<p>Every second, the health-checker monitors the health of the anycasted service using a custom script. If any issue is detected, it will instruct Bird to stop advertising the relevant IP.</p>
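<p>Each monitored service gets a small ini-style stanza in anycast_healthchecker’s configuration; a hypothetical check for the DNS recursor could look like the following (the service name, script path and thresholds are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">[recdns.anycast.wmnet]
check_cmd      = /usr/local/bin/check_dns_query   # custom health-check script
check_interval = 1                                # run the check every second
check_fail     = 1                                # withdraw the IP after 1 failure
check_rise     = 1                                # re-advertise after 1 success
ip_prefix      = 10.3.0.1/32                      # the anycast IP to (un)advertise</pre></div>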

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/26lgt53yfjby64ngry6u/PHID-FILE-wf66ckqwh2hzj3acxa7m/image2.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_30"><img src="https://phab.wmfusercontent.org/file/data/26lgt53yfjby64ngry6u/PHID-FILE-wf66ckqwh2hzj3acxa7m/image2.png" height="537" width="561" loading="lazy" alt="image2.png (537×561 px, 19 KB)" /></a></div></p>

<p>Covering another failure scenario, the Bird process is linked (at the systemd level) to anycast_healthchecker, so that if the latter dies, the former will stop, BFD will detect a failure, and the router will stop advertising the IP as well.</p>
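<p>That systemd-level link can be expressed with a drop-in like the following (a sketch; the unit and file names are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># /etc/systemd/system/bird.service.d/healthchecker.conf (hypothetical path)
[Unit]
# If anycast-healthchecker stops or crashes, stop bird too, so the routers
# detect the failure and withdraw the routes
BindsTo=anycast-healthchecker.service
After=anycast-healthchecker.service</pre></div>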

<p>On the monitoring front we have Icinga checking for the Bird and anycast_healthchecker processes, the router’s BGP sessions as well as the Anycasted IP.<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/bvelx4hs6tqpfckp266p/PHID-FILE-frlqmk3m3g3w7jwflvoe/image5.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_31"><img src="https://phab.wmfusercontent.org/file/data/bvelx4hs6tqpfckp266p/PHID-FILE-frlqmk3m3g3w7jwflvoe/image5.png" height="76" width="398" loading="lazy" alt="image5.png (76×398 px, 6 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/a3l4atmrz5arzqs5rn4g/PHID-FILE-63i4ppwhqd2wrxmmqpym/image4.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_32"><img src="https://phab.wmfusercontent.org/file/data/a3l4atmrz5arzqs5rn4g/PHID-FILE-63i4ppwhqd2wrxmmqpym/image4.png" height="77" width="653" loading="lazy" alt="image4.png (77×653 px, 10 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/66hbdayrqnp2n6mk5fdu/PHID-FILE-nrdxycagzo5nxhvcg4yw/image3.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_33"><img src="https://phab.wmfusercontent.org/file/data/66hbdayrqnp2n6mk5fdu/PHID-FILE-nrdxycagzo5nxhvcg4yw/image3.png" height="55" width="673" loading="lazy" alt="image3.png (55×673 px, 6 KB)" /></a></div></p>

<p>As mentioned previously, this check will only fail if all of the possible Anycast endpoints are down (from a monitoring host’s point of view), which is why it is a paging alert.</p>

<p>All of the above is deployed via <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/bird/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Puppet</a>, so only a few lines of Puppet/Hiera configuration are needed, <a href="https://wikitech.wikimedia.org/wiki/Anycast#How_to_deploy_a_new_service?" class="remarkup-link remarkup-link-ext" rel="noreferrer">see here</a>. If you&#039;re curious about the router side, it&#039;s over <a href="https://wikitech.wikimedia.org/wiki/Anycast#How_are_the_routers_configured?" class="remarkup-link remarkup-link-ext" rel="noreferrer">there</a>.</p>

<h2 class="remarkup-header">What&#039;s next?</h2>

<p>This setup has been working flawlessly for a few months now, and is going to grow progressively.</p>

<p>On the &quot;small&quot; improvements side, or wishlist, we want to be able to monitor the Anycast endpoints from more vantage points, and to make the setup v6-ready.</p>

<p>On the larger side, the next big step is to roll Anycast for our authoritative DNS servers (the ones answering for all the *.wikipedia.org hostnames). The outline of the plan can be seen on the <a href="https://phabricator.wikimedia.org/T98006#5416434" class="remarkup-link" rel="noreferrer">tracking task</a>.</p>

<p>Our goal is to do externally what we have been doing internally. Each datacenter will advertise the same IPs to their transit and peering neighbors. The internet will take care of routing users to the optimal site. Add some safety mechanisms (eg. automatic IP withdrawal), proper monitoring, and voila!</p>

<hr class="remarkup-hr" />

<p>Photo by Clint Adair on Unsplash</p></div></content></entry><entry><title>RPKI Origin Validation</title><link href="/phame/live/17/post/186/rpki_origin_validation/" /><id>https://phabricator.wikimedia.org/phame/post/view/186/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2020-08-10T13:02:48+00:00</published><updated>2020-09-05T12:08:22+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><h2 class="remarkup-header">The problem</h2>

<p><a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGP</a> binds the internet together by standardizing a way for all networks to tell their neighbors “If you want to reach IP X, send the packets to network Y”. BGP is great for its resiliency and scalability, but less so for its security.<br />
How can we know which network (<a href="https://en.wikipedia.org/wiki/Autonomous_system_(Internet)" class="remarkup-link remarkup-link-ext" rel="noreferrer">Autonomous System</a>) is the legitimate owner of an IP? Without that information, IPs can easily get <a href="https://en.wikipedia.org/wiki/BGP_hijacking" class="remarkup-link remarkup-link-ext" rel="noreferrer">hijacked</a>, either <a href="https://arstechnica.com/information-technology/2019/06/bgp-mishap-sends-european-mobile-traffic-through-china-telecom-for-2-hours/" class="remarkup-link remarkup-link-ext" rel="noreferrer">accidentally</a> or <a href="https://btc-hijack.ethz.ch/files/btc_hijack.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">maliciously</a>.</p>

<p>Since the late 90s, databases named <a href="https://en.wikipedia.org/wiki/Internet_Routing_Registry" class="remarkup-link remarkup-link-ext" rel="noreferrer">Internet Routing Registries</a> (IRR) have been trying to <a href="http://www.irr.net/docs/faq.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">fulfill that (single) source of truth role</a>. Unfortunately, they are subject to a lot of issues: fragmentation (many existing databases, not all equally  well-maintained), security (some databases allow anyone to “claim” a prefix) and complexity (for the network operators). They also contain a lot of <a href="https://bgpmon.net/how-accurate-are-the-internet-route-registries-irr/" class="remarkup-link remarkup-link-ext" rel="noreferrer">inaccurate data</a> that have accumulated over time.</p>

<p>The first question that comes to mind is “why not fix what already exists instead of re-inventing the wheel?” Some efforts are <a href="https://2019.apricot.net/assets/files/APKS756/apricot2019_snijders_routing_security_roadmap_1551228895%20(2).pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">being made</a> on that, especially since IRR have a broader scope than just associating IPs to operators. Reciprocally, the Resource Public Key Infrastructure&#039;s (<a href="https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI</a>) scope is focused on enforcing IP/AS ownership, <a href="https://conference.apnic.net/44/assets/files/APCS549/Global-IRR-and-RPKI-a-problem-statement.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">not replacing IRRs</a>.<br />
The second question is: how do we make sure RPKI data doesn’t become similarly inaccurate? I believe that IRRs became stale/outdated because only a few providers were rejecting prefixes based on this information. Hopefully the documentation, simplicity and existing tooling for RPKI will democratize its adoption and make inaccuracies easy to spot and quick to remedy.</p>

<h2 class="remarkup-header">The solution</h2>

<p>From an operator perspective, RPKI works with 2 interdependent parts:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Signing: tell the world which prefixes have been delegated to one&#039;s AS</li>
<li class="remarkup-list-item">Validation: prevent one&#039;s network from routing traffic to hijacked networks</li>
</ul>

<h3 class="remarkup-header">Signing</h3>

<p>Just a brief summary as there are a lot of resources <a href="https://ripe78.ripe.net/wp-content/uploads/presentations/43-Running-Your-Own-CA-RIPE78.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">available online</a>.</p>

<p>Some points to highlight though:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">It’s the one step that makes your own prefixes safer.</li>
<li class="remarkup-list-item">It’s very easy to do through <a href="https://rpki.readthedocs.io/en/latest/rpki/implementation-models.html#hosted-rpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">RIR’s online signing tools</a><ul class="remarkup-list">
<li class="remarkup-list-item">although these tools are of <a href="https://www.lacnic.net/innovaportal/file/3635/1/lacnic-cloudflares-rpki-validator.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">varying quality</a></li>
</ul></li>
<li class="remarkup-list-item">One should set up monitoring for their ROAs (especially expiration). Some RIRs offer that option (e.g. RIPE).</li>
<li class="remarkup-list-item">One should not create a ROA with prefixes more specific than one’s routing policy (e.g. if you only ever advertise a /22, don&#039;t add the more specific /24 to your ROA)</li>
</ul>

<h3 class="remarkup-header">Validation</h3>

<h4 class="remarkup-header">How does it work?</h4>

<p>Everything starts with a validator, also called RPKI Relying Party software. Many open source implementations exist, in various languages with relatively similar features: <a href="https://github.com/RIPE-NCC/rpki-validator-3" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI Validator</a>, <a href="https://github.com/cloudflare/cfrpki#octorpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">OctoRPKI</a>, <a href="https://github.com/NLnetLabs/routinator" class="remarkup-link remarkup-link-ext" rel="noreferrer">Routinator</a>, <a href="https://github.com/dragonresearch/rpki.net" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI Toolkit</a>, <a href="https://nicmx.github.io/FORT-validator/" class="remarkup-link remarkup-link-ext" rel="noreferrer">FORT Validator</a>, <a href="https://www.rpki-client.org/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rpki-client</a>.<br />
RPKI works as a chain of trust, and the first level of that chain is the RIRs. To know how to reach that first level (the Trust Anchors), the validator needs a file called a Trust Anchor Locator (TAL), which is a pointer to each RIR’s RPKI repository (or any repository you trust), along with its public key.<br />
TALs are present on each RIR’s website and validators <a href="https://github.com/cloudflare/cfrpki/tree/master/cmd/octorpki/tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">include</a> <a href="https://github.com/RIPE-NCC/rpki-validator-3/tree/master/rpki-validator/src/main/resources/packaging/generic/workdirs/preconfigured-tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">most</a> <a href="https://github.com/NLnetLabs/routinator/tree/master/tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">of</a> <a href="https://github.com/dragonresearch/rpki.net/tree/master/rp/rcynic/sample-trust-anchors" class="remarkup-link remarkup-link-ext" rel="noreferrer">them</a>. ARIN’s is an exception as users are required to agree to the ARIN <a href="https://www.arin.net/resources/manage/rpki/tal/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Relying Party Agreement</a>.<br />
The RPKI repository contains either ROAs (Route Origin Authorization) or pointers to other repositories, themselves containing ROAs trusted by the upstream repository.</p>

<p>ROAs are signed database objects saying “Prefix X is allowed to be advertised by AS Z”. Each also has a creation and an expiration date. There is no limit on how many prefixes an AS can be authorized to advertise, nor on how many ASes can be authorized for a given prefix.</p>
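<p>To make the data model concrete, here is a minimal Python sketch of a ROA as a record. The field names and example prefix/ASN are illustrative only; real ROAs are CMS-signed objects, not Python objects:</p>

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Roa:
    """Illustrative model of a Route Origin Authorization."""
    prefix: str           # e.g. "198.35.26.0/23"
    max_length: int       # most specific length the AS may announce
    origin_asn: int       # AS authorized to originate the prefix
    not_before: datetime  # start of the validity window
    not_after: datetime   # expiration: past this, validators ignore it

    def is_current(self, now=None):
        """True while the ROA is inside its validity window."""
        now = now or datetime.now(timezone.utc)
        return self.not_before <= now <= self.not_after

# Several ROAs may cover the same prefix, and one AS may appear in many
# ROAs: RPKI puts no limit on either direction.
roa = Roa("198.35.26.0/23", 23, 14907,
          datetime(2019, 1, 1, tzinfo=timezone.utc),
          datetime(2020, 1, 1, tzinfo=timezone.utc))
```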

<p>Validators will fetch the ROAs from all the available repositories using rsync or <a href="https://datatracker.ietf.org/doc/rfc8182/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RRDP</a> (over HTTPS).<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/34wkbymopeuectlqn3ke/PHID-FILE-5wbhio56pxsfprfzrip7/image3.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_36"><img src="https://phab.wmfusercontent.org/file/data/34wkbymopeuectlqn3ke/PHID-FILE-5wbhio56pxsfprfzrip7/image3.png" height="319" width="396" loading="lazy" alt="image3.png (319×396 px, 13 KB)" /></a></div></p>

<p>Validators will then verify the ROAs (i.e. make sure the format and the signature are correct). Everything that can’t be verified at this level will be ignored. For example, an expired ROA will be ignored (as if it were not in the repository), so the covered prefix simply falls back to the “Unknown” state described below. This prevents any risk of a prefix becoming unreachable if its owner forgets to “renew” its ROA.<br />
The validator is decoupled from the router for performance reasons: routers usually have high forwarding performance, but very few spare resources for any other task.</p>

<p>Now that we have a curated and verified list of prefixes/ASNs pairs, we have to communicate it to the router. For that the Validator uses the <a href="https://tools.ietf.org/html/rfc6810" class="remarkup-link remarkup-link-ext" rel="noreferrer">RTR</a> (RPKI-To-Router) protocol. Most of the time this is embedded in the validator, but standalone applications like <a href="https://github.com/cloudflare/gortr" class="remarkup-link remarkup-link-ext" rel="noreferrer">GoRTR</a> also exist.<br />
The RFC recommends an encrypted transport such as SSH or TLS, but does not mandate encryption. The risk is mitigated by the requirement that &quot;If unprotected TCP is the transport, the cache and routers MUST be on the same trusted and controlled network&quot;.<br />
As with any critical service, it is recommended to run more than one validator, to ensure prefix validation is never interrupted. Here as well, the risk of unreachable prefixes is mitigated by a timeout period (for example <a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/record-lifetime-edit-routing-options-validation.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">1 hour on Junos</a>): if the router cannot reach a validator for that long, it stops enforcing validation.</p>
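<p>The RTR protocol itself is deliberately simple: every RFC 6810 PDU starts with the same fixed 8-byte header. A minimal Python sketch of parsing it (the PDU type table is taken from the RFC; the example PDU is a router’s Reset Query, asking the cache for its full validated set):</p>

```python
import struct

# Every RTR PDU (RFC 6810) starts with the same 8-byte header:
# version (1 byte), PDU type (1 byte), session id (2 bytes),
# total length including the header (4 bytes), all big-endian.
RTR_HEADER = struct.Struct("!BBHI")

PDU_TYPES = {
    0: "Serial Notify", 1: "Serial Query", 2: "Reset Query",
    3: "Cache Response", 4: "IPv4 Prefix", 6: "IPv6 Prefix",
    7: "End of Data", 8: "Cache Reset", 10: "Error Report",
}

def parse_rtr_header(data: bytes) -> dict:
    """Decode the common header of an RTR PDU."""
    version, pdu_type, session_id, length = RTR_HEADER.unpack_from(data)
    return {"version": version,
            "type": PDU_TYPES.get(pdu_type, "Unknown"),
            "session_id": session_id,
            "length": length}

# A Reset Query: version 0, type 2, reserved field zero, length 8.
pdu = struct.pack("!BBHI", 0, 2, 0, 8)
```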

<p>The router will check every learned prefix (asynchronously, so BGP convergence time is not impacted) against its internal RPKI database, which is periodically synced with the validator. Each BGP prefix then gets one of four states:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Valid: the BGP prefix and its origin AS match a ROA on file</li>
<li class="remarkup-list-item">Invalid: the BGP prefix is covered by a ROA, but the origin AS doesn’t match it (or the prefix is more specific than the ROA’s maxLength allows)</li>
<li class="remarkup-list-item">Unknown: the BGP prefix isn’t covered by any ROA (most of the Internet so far)</li>
<li class="remarkup-list-item">Unverified: the router hasn’t checked that prefix against its database yet</li>
</ul>
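<p>The first three states can be reproduced in a few lines. A simplified Python sketch of RFC 6811 origin validation, operating on (prefix, maxLength, origin AS) tuples as a router would receive them over RTR (the example prefix and ASN are illustrative):</p>

```python
import ipaddress

def origin_validation(prefix: str, origin_asn: int, roas) -> str:
    """Route origin validation, roughly following RFC 6811.

    `roas` is an iterable of (roa_prefix, max_length, asn) tuples,
    i.e. the validated payloads a router learns over RTR.
    """
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if asn == origin_asn and announced.prefixlen <= max_length:
            return "Valid"
    # Covered but never matched -> Invalid; not covered at all -> Unknown.
    return "Invalid" if covered else "Unknown"

roas = [("198.35.26.0/23", 23, 14907)]
```

<p>A production implementation would index the ROAs by prefix (e.g. in a radix tree) rather than scanning the full list for every route.</p>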

<p>The most important and useful state here is Invalid, which indicates a misconfiguration at best, a malicious hijack at worst.</p>

<p>The last step is for the router to do something useful with this new information.<br />
For the Invalid prefixes received from your peers:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Lower the local preference. This can be useful as a proof of concept, but only protects against a very narrow situation: where a prefix of similar size but different origin AS is learned from multiple peers. This could potentially save the day after a misconfiguration in the <a href="https://en.wikipedia.org/wiki/Default-free_zone" class="remarkup-link remarkup-link-ext" rel="noreferrer">DFZ</a> (Default-Free Zone), but would not protect from a malicious actor advertising a more specific prefix.</li>
<li class="remarkup-list-item">Discard. A more difficult decision, especially as long as <a href="https://nusenu.github.io/RPKI-Observatory/unreachable-networks.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI unreachable</a> prefixes exist: IPs on the Internet covered only by Invalid prefixes, with no larger or smaller covering prefix. Discarding makes them “invisible” to your network. Operators have to decide whether the extra security is worth that risk. To help make that call, <a href="https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pmacct can now</a> show how much traffic, if any, is being exchanged between your network and those IPs.</li>
</ul>

<p>A good first move is to discard invalids on your <a href="https://en.wikipedia.org/wiki/Internet_exchange_point" class="remarkup-link remarkup-link-ext" rel="noreferrer">IXP</a> facing links, for two reasons:</p>

<ol class="remarkup-list">
<li class="remarkup-list-item">It eliminates the risk of unreachable prefixes, as traffic will reroute through transit links; the worst-case scenario is now sub-optimal routing. It’s also easier to reach out to peers and ask them to fix their ROAs.</li>
<li class="remarkup-list-item">Prefixes learned from IXPs usually have a very short AS path. A rogue prefix originating from there would most likely be preferred over one learned through a transit.</li>
</ol>

<p>For the Invalid prefixes advertised to your downstreams:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Add a BGP community to the valid and invalid prefixes you’re forwarding down. This allows customers to make their own routing decisions without having to deploy a full RPKI infrastructure. It also means your customers are placing their trust in you, so you must not blindly forward communities learned from untrusted peers.</li>
<li class="remarkup-list-item">Discard. This can be done progressively, customer after customer before a global discard.</li>
</ul>

<p>Tadam, the Internet is a bit more secure, thank you!<br />
(Check your NOC mailbox just in case.)</p>

<h4 class="remarkup-header">Monitoring</h4>

<p>Before you start dropping prefixes, better make sure everything is healthy.<br />
Additionally, people who will respond to alerts and “I can’t reach X” emails should be trained on how to react.</p>

<p>Some validators (such as OctoRPKI or Routinator) provide Prometheus endpoints exposing various metrics.<br />
For the router side, a <a href="https://tools.ietf.org/html/draft-ymbk-rpki-rtr-protocol-mib-00" class="remarkup-link remarkup-link-ext" rel="noreferrer">draft RFC</a> for an RTR MIB exists, but I’m not aware of any implementation. Syslog is more or less <a href="https://apps.juniper.net/syslog-explorer/#msg=RPD_RV_SESSIONDOWN&amp;sw=Junos%20OS&amp;rel=19.1R1" class="remarkup-link remarkup-link-ext" rel="noreferrer">an option</a> as well. Tools like <a href="https://github.com/Juniper/py-junos-eznc" class="remarkup-link remarkup-link-ext" rel="noreferrer">junos-pyez</a> or <a href="http://napalm.readthedocs.io/" class="remarkup-link remarkup-link-ext" rel="noreferrer">NAPALM</a> with some custom parsing seem to be the most complete options so far.</p>
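<p>For a quick health check, scraping those endpoints doesn’t require the full Prometheus stack. A hedged sketch of parsing the text exposition format; the sample metrics and their names are made up, as each validator exposes its own set:</p>

```python
def parse_prom_metrics(text: str) -> dict:
    """Parse a Prometheus text exposition into {metric{labels}: value}.

    Good enough for alerting on a validator's scrape output; the real
    format is richer (see the Prometheus exposition format spec).
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative scrape output; actual metric names differ per validator.
sample = """
# HELP valid_roas Number of valid ROAs seen
valid_roas{tal="ripe"} 12000
valid_roas{tal="arin"} 4000
last_update_seconds 1579000000
"""
```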

<h4 class="remarkup-header">What doesn’t it protect against?</h4>

<h5 class="remarkup-header">Your own prefixes</h5>

<p>First of all, doing validation doesn’t protect your own prefixes, as it only impacts outbound traffic. The two things you can do for them are signing your prefixes and advocating for more people to deploy RPKI validation.</p>

<h5 class="remarkup-header">Transit (or any intermediary AS) not doing validation</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/njzekbcwtblzhu2bze24/PHID-FILE-bpkxn6dgptrrfqtiqyd3/image4.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_37"><img src="https://phab.wmfusercontent.org/file/data/njzekbcwtblzhu2bze24/PHID-FILE-bpkxn6dgptrrfqtiqyd3/image4.png" height="391" width="325" loading="lazy" alt="image4.png (391×325 px, 20 KB)" /></a></div><br />
In the above diagram, even if Wikimedia discards the malicious /24, it would still send traffic for 192.0.2.1 to its transit provider (as it’s the best next-hop for the /23). The transit, not doing any validation, would naturally forward that traffic to the malicious AS, as it is advertising a more specific prefix.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
<li class="remarkup-list-item">Peer with networks doing RPKI validation (maybe we should keep a hall of fame somewhere?)</li>
<li class="remarkup-list-item">Advertise only /24s (and /48s v6)</li>
</ul>

<blockquote><p>Disaggregating has the adverse effect of increasing the size of the global routing table, which in many cases is frowned upon by the operator community. The decision to do so should not be taken lightly.</p></blockquote>



<h5 class="remarkup-header">AS forgery</h5>

<p>RPKI only ensures that a prefix is advertised by the proper origin AS. A malicious network could either change the origin AS of the prefixes it’s advertising (to impersonate the valid source AS) or forge a longer AS path (pretending to be a transit for the target network). This is very unlikely to be the result of a misconfiguration.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
<li class="remarkup-list-item">Advertise only /24s (and /48s v6), with the same warning as above</li>
<li class="remarkup-list-item">Monitor prefixes AS PATHS and contact the rogue network’s upstream</li>
</ul>

<h5 class="remarkup-header">Man in the middle</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/w5dxyjhivijopizowtdh/PHID-FILE-gjdwbk7kwj4jfs3lpv4y/image1.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_38"><img src="https://phab.wmfusercontent.org/file/data/w5dxyjhivijopizowtdh/PHID-FILE-gjdwbk7kwj4jfs3lpv4y/image1.png" height="247" width="871" loading="lazy" alt="image1.png (247×871 px, 35 KB)" /></a></div><br />
Slightly related to the point above, RPKI is vulnerable to all kinds of MitM attacks, as it only validates the source AS, not the whole path.<br />
Take the example above: a malicious network could advertise the same prefix while keeping an AS path that ends at the legit AS (and forwarding the traffic along). This is sneakier and more complex than AS forgery, as the target network still receives its traffic.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
</ul>

<h5 class="remarkup-header">More specific ROA</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/vldwrhrzvbopaicfnjf7/PHID-FILE-koyuvt3etxdkutmcxudv/image2.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_39"><img src="https://phab.wmfusercontent.org/file/data/vldwrhrzvbopaicfnjf7/PHID-FILE-koyuvt3etxdkutmcxudv/image2.png" height="247" width="818" loading="lazy" alt="image2.png (247×818 px, 34 KB)" /></a></div><br />
A variation of the previous point: if your ROA allows your AS to advertise a more specific prefix (e.g. via maxLength) that you don’t actually advertise, a malicious AS doesn’t even need a shorter AS path to MitM it.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Ensure the ROA strictly matches what you’re advertising in BGP</li>
</ul>
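<p>That check is easy to automate by comparing your ROAs against the prefixes you actually announce. A hedged Python sketch; the data shapes are illustrative, not any particular tool’s format:</p>

```python
import ipaddress

def over_permissive_roas(roas, announced):
    """Flag ROAs whose maxLength exceeds every announced prefix length.

    `roas`: iterable of (prefix, max_length) tuples from your ROAs.
    `announced`: iterable of prefix strings you really advertise in BGP.
    Any flagged ROA leaves room for the 'more specific ROA' MitM above.
    """
    announced = [ipaddress.ip_network(p) for p in announced]
    flagged = []
    for roa_prefix, max_length in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        advertised_lens = [n.prefixlen for n in announced
                           if n.version == roa_net.version
                           and n.subnet_of(roa_net)]
        if advertised_lens and max_length > max(advertised_lens):
            flagged.append((roa_prefix, max_length))
    return flagged
```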

<p>Some efforts (such as <a href="https://en.wikipedia.org/wiki/BGPsec" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGPsec</a>) are being made to perform full path validation, but nothing is production ready yet.</p>

<h4 class="remarkup-header">TL;DR</h4>

<ul class="remarkup-list">
<li class="remarkup-list-item">Deploy more than one validator</li>
<li class="remarkup-list-item">Keep them on the same trusted network as your routers (or use encryption)</li>
<li class="remarkup-list-item">Monitor validators and routers</li>
<li class="remarkup-list-item">Write documentation and train your staff</li>
<li class="remarkup-list-item">Check if any traffic would be null-routed (e.g. with pmacct)</li>
<li class="remarkup-list-item">Peer with many networks (short AS Paths)</li>
<li class="remarkup-list-item">Discard Invalids starting with IXPs</li>
<li class="remarkup-list-item">Share your experience</li>
</ul>

<h2 class="remarkup-header">Where are we now?</h2>

<p>From <a href="https://rpki-monitor.antd.nist.gov/" class="remarkup-link remarkup-link-ext" rel="noreferrer">NIST</a> we see that Invalids represent 0.23% of equivalent /24s, with RPKI-unreachable prefixes being an even smaller subset.<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/fjnzlf4eoqu4oaogpvwm/PHID-FILE-veqeg5po67jgqqfc2fny/global.bgp_prefix_space.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_40"><img src="https://phab.wmfusercontent.org/file/data/fjnzlf4eoqu4oaogpvwm/PHID-FILE-veqeg5po67jgqqfc2fny/global.bgp_prefix_space.png" height="500" width="700" loading="lazy" alt="global.bgp_prefix_space.png (500×700 px, 18 KB)" /></a></div></p>

<h3 class="remarkup-header">As seen on the Internet</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">AMS-IX route servers <a href="https://www.ams-ix.net/ams/documentation/ams-ix-route-servers#section-29115" class="remarkup-link remarkup-link-ext" rel="noreferrer">reject by default</a> prefixes with invalid origins.</li>
<li class="remarkup-list-item">Telia <a href="https://blog.teliacarrier.com/2020/02/05/dropping-rpki-invalid-prefixes/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">AT&amp;T <a href="https://www.youtube.com/watch?v=DkUZvlj1wCk" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">Netnod <a href="https://www.netnod.se/blog/netnods-new-route-server-platform" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">SEACOM and Workonline <a href="https://seclists.org/nanog/2019/Apr/113" class="remarkup-link remarkup-link-ext" rel="noreferrer">reject invalids</a> (but don’t use ARIN’s TAL)</li>
<li class="remarkup-list-item">Google <a href="https://ripe78.ripe.net/presentations/54-Route-Filtering-at-the-Edge-AS15169-Connect-%40RIPE.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">is planning</a> to reject invalids on its peering links</li>
<li class="remarkup-list-item">NTT <a href="https://www.gin.ntt.net/ntt-improves-security-of-the-internet-with-rpki-origin-validation-deployment/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">A <a href="https://blog.benjojo.co.uk/post/state-of-rpki-in-2018" class="remarkup-link remarkup-link-ext" rel="noreferrer">more comprehensive list</a> of networks rejecting invalids (methodology not detailed)</li>
<li class="remarkup-list-item">And the now famous <a href="https://isbgpsafeyet.com/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://isbgpsafeyet.com/</a></li>
</ul>

<h3 class="remarkup-header">Wikimedia</h3>

<p>All the work done at the Foundation is public by default, RPKI is no exception. You can find the main tracking task on our <a href="https://phabricator.wikimedia.org/T220669" class="remarkup-link" rel="noreferrer">Phabricator</a> instance, all related code changes on <a href="https://gerrit.wikimedia.org/r/q/topic:%22rpki%22" class="remarkup-link remarkup-link-ext" rel="noreferrer">Gerrit</a> the doc obviously on a <a href="https://wikitech.wikimedia.org/wiki/RPKI" class="remarkup-link remarkup-link-ext" rel="noreferrer">wiki</a>, and graphs on <a href="https://grafana.wikimedia.org/d/UwUa77GZk/rpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grafana</a>.</p>

<p>Back in April, after looking at all the available validators, we decided to use Routinator for two main reasons: its RTR daemon is embedded into the validator (no need to maintain several tools) and its development was active with an explicit roadmap.<br />
In parallel to the implementation side, we wrote a <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/rpkicounter/files/rpkicounter.py" class="remarkup-link remarkup-link-ext" rel="noreferrer">Prometheus exporter</a> comparing our real-time webrequests to a <a href="https://as286.net/data/ana-invalids.txt" class="remarkup-link remarkup-link-ext" rel="noreferrer">list</a> of invalid prefixes, giving us the percentage of requests coming from unreachable prefixes. This was then exposed in real time on our <a href="https://grafana.wikimedia.org/d/UwUa77GZk/rpki?orgId=1" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grafana dashboard</a> and used to hover around 0.01%.</p>
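<p>The core idea of that exporter fits in a few lines. A heavily simplified sketch (the real rpkicounter.py linked above streams live webrequests and is optimized accordingly; the sample IPs below are documentation addresses):</p>

```python
import ipaddress

def invalid_traffic_ratio(client_ips, invalid_prefixes):
    """Fraction of requests whose client IP falls in an invalid prefix.

    `client_ips`: iterable of client IP strings (e.g. from webrequests).
    `invalid_prefixes`: the published list of RPKI-invalid prefixes.
    """
    nets = [ipaddress.ip_network(p) for p in invalid_prefixes]
    hits = sum(1 for ip in client_ips
               if any(ipaddress.ip_address(ip) in net for net in nets))
    return hits / len(client_ips) if client_ips else 0.0
```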

<p>Last July, we started to reject invalids on IXP links, where our AS paths are the shortest. In addition to having an <a href="https://www.peeringdb.com/net/1365" class="remarkup-link remarkup-link-ext" rel="noreferrer">open peering policy</a>, this contributed to making more than half of our traffic “safer”.</p>

<p>On January 15th, we flipped the switch to <strong>discard invalid prefixes</strong> on transit links as well. One of our concerns was legitimate providers suddenly being unable to reach Wikipedia after a misconfiguration on their side. Not our fault, but their users would still be widely impacted.<br />
On the other hand, our hope is that having such a popular website enforcing RPKI adds a considerable amount of trust in the system, accelerating the ongoing adoption of RPKI validation.</p>

<h4 class="remarkup-header">Aftermath</h4>

<p>As of the time of publishing this article, we have been contacted by 13 providers reporting Wikipedia being unreachable for them or one of their customers. Each case was quickly fixed once we explained the issue to them.</p>

<hr class="remarkup-hr" />

<p>Edit: August 11, added mentions of FORT Validator and rpki-client.</p>

<p>Header: <a href="https://commons.wikimedia.org/wiki/File:DublinAirport31mar2007-03.jpg" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://commons.wikimedia.org/wiki/File:DublinAirport31mar2007-03.jpg</a></p></div></content></entry></feed>