<feed xmlns="http://www.w3.org/2005/Atom"><title>Routing knowledge</title><id>https://phabricator.wikimedia.org/phame/blog/feed/17/</id><link rel="self" type="application/atom+xml" href="https://phabricator.wikimedia.org/phame/blog/feed/17/" /><updated>2025-12-10T11:14:00+00:00</updated><subtitle>Header picture: https://commons.wikimedia.org/wiki/File:SunsetTracksCrop.JPG</subtitle><entry><title>Ganeti on modern network design</title><link href="/phame/live/17/post/312/ganeti_on_modern_network_design/" /><id>https://phabricator.wikimedia.org/phame/post/view/312/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2023-12-15T09:57:40+00:00</published><updated>2024-06-19T12:47:42+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><h2 class="remarkup-header">Context</h2>

<p>For reasons already mentioned in other docs (e.g. Eqiad Expansion Network Design), we’re moving towards a network architecture where the servers’ layer 3 domains (subnets) are constrained to each rack. Currently (and in most of our core DCs), those layer 3 domains are stretched across all the racks of a given row. In that setting, a Ganeti cluster of a given row (with its hypervisors spread across the row) leverages this <tt class="remarkup-monospaced">L2</tt> adjacency to live migrate VMs between hypervisors.<br />
In other words, if work is going to be done on hypervisor1, all the VMs it hosts can be temporarily and transparently distributed across the other hypervisorX to prevent any disruptions. Having the same vlan trunked to all the hypervisors of the same row allows the VMs to move to a different hypervisor without requiring any IP renumbering and thus downtime.</p>

<p>There are multiple ways to have Ganeti fully operational on the new network design; each of course has its own set of tradeoffs (cost, implementation or migration complexity, uptime).</p>

<h2 class="remarkup-header">Per rack clusters</h2>

<p>This is the easiest to implement, as we already have all the tooling and automation; it is also what we’re doing in the new POPs design. As each rack is its own L3 domain, we can have one or more bridged hypervisors per rack. We currently have 6 to 7 hypervisors per row, with 24 to 38 VMs per row.<br />
At one end of the spectrum there is 1 hypervisor per rack: this fully prevents any kind of live migration (automatic or manual), and all the VMs have the same constraints as physical servers. For example, if the ToR or hypervisor needs any kind of maintenance, the VMs will go down during the maintenance window. 1 per rack also means a large number of “micro-clusters”, making VM allocation more difficult.<br />
At the other end, we could have all 6 or 7 hypervisors in the same rack. This option makes live migration and hypervisor maintenance easy, but a ToR maintenance or failure means losing all 24 to 38 VMs. This could also be problematic in terms of overall server placement between racks (for both rack space and network usage).<br />
In between, clusters of 2 to 3 hypervisors per rack mitigate the downsides of both extremes, but only mitigate them. Unless we run hypervisors at 50% capacity, which is not economically viable, not all VMs could be drained. Similarly, when maintenance needs to happen on a ToR, many VMs will go down.</p>
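<p>A back-of-the-envelope sketch of the drain problem, with hypothetical capacity figures (not our actual numbers):</p>

```python
def can_drain_one_hypervisor(utilization, cluster_size=2, capacity=6):
    """Within a per-rack cluster, can one hypervisor's VMs be absorbed
    by the remaining cluster members? Figures are hypothetical."""
    vms_on_one_hv = capacity * utilization
    spare = (cluster_size - 1) * capacity * (1 - utilization)
    return spare >= vms_on_one_hv

print(can_drain_one_hypervisor(0.5))  # True: fits exactly at 50% utilization
print(can_drain_one_hypervisor(0.7))  # False: a 2-node cluster cannot drain above 50%
```

In other words, a 2-hypervisor rack cluster can only be fully drained when utilization stays at or below 50%, which is the economic problem mentioned above.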

<p>Even though we’re designing systems to be redundant between racks, rows and sites, some services don’t or can’t follow those principles, for example active/passive services with no automatic failover. Not being able to migrate VMs would increase the workload, especially during planned maintenance.</p>

<h2 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> abstraction at the ToR</h2>

<p>This option in some ways mimics the current situation, but instead of using a proprietary Juniper technology to bridge the same vlans across rows, we use a more standardized technology: VXLAN.<br />
The main downsides to this solution are the increased license cost (Juniper and Dell SONiC require a special license to handle VXLAN), the lower interoperability between network vendors, as well as the configuration and operational <a href="https://blog.ipspace.net/2023/09/l2-bad.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">complexity</a>, when it’s usually preferred to keep the network layer as lean as possible. This could be a temporary solution, for example during a migration phase, but not a long-term one.</p>

<h2 class="remarkup-header">Routed Ganeti</h2>

<p>This consists of having each Ganeti host behave as a basic router. This allows each VM to be independent from a networking point of view, as the hypervisors take care of propagating reachability information (IP routes) to the rest of the infrastructure.</p>

<p>Going that way, the requirements are:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Functional live migration</li>
<li class="remarkup-list-item">Minimal modification to our automation (eg. makevm cookbook)</li>
<li class="remarkup-list-item">Minimal modification to our Debian installer and guest OS</li>
<li class="remarkup-list-item">Existing VMs can be re-imaged into this new mode</li>
</ul>

<p>Moving away from <tt class="remarkup-monospaced">L2</tt> adjacency also means that LVS in their current form won’t be able to forward traffic to those VMs, the solution is IPIP support in LVS (<a href="https://phabricator.wikimedia.org/T348837" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_0"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T348837</span></span></a>).</p>

<h3 class="remarkup-header">Setup and investigation</h3>

<p>To get this working, I followed the current <a href="https://wikitech.wikimedia.org/wiki/Ganeti#Building_a_new_cluster" class="remarkup-link remarkup-link-ext" rel="noreferrer">Building a new cluster</a> step-by-step instructions, with a couple of adjustments.</p>

<p>First adjustment is that I used the following cluster init command:<br />
<tt class="remarkup-monospaced">sudo gnt-cluster init --no-ssh-init --enabled-hypervisors=kvm --vg-name=ganeti --master-netdev=eno1 --hypervisor-parameters kvm:kvm_path=/usr/bin/qemu-system-x86_64,kvm_flag=enabled,serial_speed=115200,migration_bandwidth=64,migration_downtime=500,kernel_path= --nic-parameters=mode=routed,link=main ganeti-test01.svc.eqiad.wmnet</tt></p>

<p>In the <tt class="remarkup-monospaced">--master-netdev</tt> which I bind on the hypervisor’s primary (and only) NIC, in the <tt class="remarkup-monospaced">--nic-parameters link=main</tt> means use the default routing table.</p>

<p>I also manually applied the few commands from the <a href="https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/ganeti/addnode.py#L150" class="remarkup-link remarkup-link-ext" rel="noreferrer">sre.ganeti.add_node</a> cookbook, bypassing the checks specific to <tt class="remarkup-monospaced">L2</tt> Ganeti (<tt class="remarkup-monospaced">def is_valid_bridge()</tt>).</p>

<p>Then creating a VM (for example in the private range) requires just:<br />
<tt class="remarkup-monospaced">sudo gnt-instance add -t drbd -I hail --net 0:ip=10.66.2.10 --hypervisor-parameters=kvm:boot_order=network,spice_bind=127.0.0.1 -o debootstrap+default --no-install --no-wait-for-sync -g eqiad-test -B vcpus=1,memory=1024m --disk 0:size=10g testvm1001.eqiad.wmnet</tt></p>

<p>The VM is not yet ready to be started, but we can already see that the VM’s IP needs to be present for the init script to set up the static route. <tt class="remarkup-monospaced">spice_bind=127.0.0.1</tt> is only necessary to access the UI of my test VMs using <a href="https://wikitech.wikimedia.org/wiki/Ganeti#SPICE" class="remarkup-link remarkup-link-ext" rel="noreferrer">SPICE</a>.</p>
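<p>Conceptually, the effect of that route setup on the hypervisor is a host route of the following form (an illustration of the result, not the script’s literal commands):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># installed on the hypervisor for each VM, in the table given by link=
ip route replace 10.66.2.10/32 dev tap0 table main</pre></div>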

<h4 class="remarkup-header">Guest VM IPv4 connectivity</h4>

<p>When a VM is started, Ganeti calls the <a href="https://github.com/ganeti/ganeti/blob/master/tools/kvm-ifup.in" class="remarkup-link remarkup-link-ext" rel="noreferrer">kvm-ifup</a> bash script, and in turn the <a href="https://github.com/ganeti/ganeti/blob/master/tools/net-common.in#L179" class="remarkup-link remarkup-link-ext" rel="noreferrer">setup_route</a> function. This takes care of attaching the VM interface to the proper routing table, as well as adding a static route to that routing table (“to reach IP X, use interface Y”). As it doesn’t seem possible to pass a custom script to Ganeti, modifying <tt class="remarkup-monospaced">/usr/lib/ganeti/3.0/usr/lib/ganeti/net-common</tt> (using Puppet, for example) seems like the best approach to perform additional post-VM-startup actions.</p>

<p>So far that script needs the following modifications:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Disabling proxy_arp (enabled by default) by commenting out that command</li>
<li class="remarkup-list-item">Add an IP on the VM facing interface (with scope link)</li>
<li class="remarkup-list-item">Send a gratuitous ARP, for faster recovery after live migration</li>
</ul>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span>ip addr add <span class="m">10</span>.66.1.1/32 dev <span class="nv">$INTERFACE</span> scope link
arping -c1 -A -I <span class="nv">$INTERFACE</span> <span class="m">10</span>.66.1.1
<span class="c1">#echo 1 &gt; /proc/sys/net/ipv4/conf/$INTERFACE/proxy_arp (commented out)</span></pre></div>

<p><span class="remarkup-highlight">TODO</span> investigate improvements in addition to those commands such as:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">net.ipv4.conf.&lt;int&gt;.arp_ignore=3 
net.ipv4.conf.&lt;int&gt;.arp_notify=1</pre></div>

<p>Starting that test VM with a basic Debian installer in “rescue mode” to get a prompt:<br />
<tt class="remarkup-monospaced">sudo gnt-instance start -H boot_order=cdrom,cdrom_image_path=/tmp/debian.iso testvm1001.eqiad.wmnet</tt></p>

<p>In the VM, set up its IP and routing configuration:</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="n">ip</span> <span class="n">addr</span> <span class="n">add</span> <span class="mf">10.66.2.10</span><span class="o">/</span><span class="mi">32</span> <span class="n">dev</span> <span class="n">ens13</span>
<span class="n">ip</span> <span class="n">route</span> <span class="n">add</span> <span class="mf">10.66.1.1</span> <span class="n">dev</span> <span class="n">ens13</span> <span class="n">scope</span> <span class="n">link</span>
<span class="n">ip</span> <span class="n">route</span> <span class="n">add</span> <span class="k">default</span> <span class="n">via</span> <span class="mf">10.66.1.1</span></pre></div>

<p>This can of course look odd: we’re setting a /32 NIC IP as well as a static route pointing to an interface. But that’s how the Linux kernel expects it to be configured.</p>

<p>We can then ping the VM on 10.66.2.10 from the hypervisor, as well as ping the VM from a different host than the hypervisor (as long as that 3rd party host has a route to the VM pointing at the hypervisor). Pings from the VM to that 3rd party host work as well, after enabling forwarding on the hypervisor’s main NIC.</p>

<p><tt class="remarkup-monospaced">sysctl -w net.ipv4.conf.eno1.forwarding=1</tt><br />
This gives hope, as we don’t need to rely on proxy_arp, and we don’t need to change the guest OSes much for them to work with IPv4, as long as DHCP behaves.</p>
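<p>For completeness, the route on that 3rd party host is just a static route pointing at the hypervisor (the address below is a placeholder):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># on the 3rd party host; &lt;hypervisor-ip&gt; is a placeholder
ip route add 10.66.2.10/32 via &lt;hypervisor-ip&gt;</pre></div>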

<p>As Ganeti configures the static route pointing to the VM, and supports only 1 v4 IP per instance (the parameter ip=10.66.2.10), it’s not possible to manually configure multiple IPs on a guest VM without relying on a dynamic routing protocol (e.g. BGP) between the VM and the hypervisor. In our infra, only 3 existing VMs are set up that way: <tt class="remarkup-monospaced">lists1001.wikimedia.org,mx[1001,2001].wikimedia.org</tt>.</p>

<p>Live migration, albeit between hypervisors in the same VLAN, shows continuous reachability between the VM and its gateway.<br />
<tt class="remarkup-monospaced">sudo gnt-instance migrate testvm1001.eqiad.wmnet</tt></p>

<p>Similarly, two VMs on the same hypervisor can reach each other. Their interfaces always have the same “router” IP:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">10: tap0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap0
11: tap1: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
[...]
	inet 10.66.1.1/32 scope link tap1</pre></div>

<p>Going one step further, I set up a 3rd Ganeti node in Dallas (the first two are adjacent in Ashburn) as a proof of concept. Live migrating a VM from Dallas to Ashburn worked perfectly: only 2 pings of the constant ping running from the VM to its gateway (10.66.1.1) were lost, which is more than acceptable for a live migration across a ~31ms link. It’s not that we will want to do this with production VMs, but if it works stretched that far, it will work within the same datacenter.</p>

<h4 class="remarkup-header">Guest VM DHCP</h4>

<p>When starting a VM, Debootstrap will initialize and ask for an IP using DHCP. The previously configured firewall rule permits the DHCP request to reach the hypervisor. We now have 2 options:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Run a DHCP server on the hypervisor and directly reply to the VM</li>
<li class="remarkup-list-item">Run a DHCP relay on the hypervisor to… relay the request to our DHCP server</li>
</ul>

<p>We could imagine, for example, packaging and automating <a href="https://github.com/grnet/nfdhcpd" class="remarkup-link remarkup-link-ext" rel="noreferrer">nfdhcpd</a> using spicerack or the makevm cookbooks. However, I preferred the 2nd option as it leverages our current DHCP server and makes the hypervisor a “dumb” relay.<br />
For that I installed <a href="https://packages.debian.org/search?keywords=isc-dhcp-relay" class="remarkup-link remarkup-link-ext" rel="noreferrer">isc-dhcp-relay</a> and modified <tt class="remarkup-monospaced">/etc/default/isc-dhcp-relay</tt> so it points to our local DHCP server. The first limitation is that it binds only to the interfaces existing at daemon startup time. To work around this, I added <tt class="remarkup-monospaced">service isc-dhcp-relay restart</tt> to the <tt class="remarkup-monospaced">net-common</tt> script so the daemon gets restarted after the VM interface is created.<br />
The 2nd limitation is due to the way we do DHCP relaying on our core routers: those routers intercept the relayed packets and drop them.<br />
Using the Juniper configuration <tt class="remarkup-monospaced">forwarding-options dhcp-relay forward-only</tt> fixes the issue, but breaks DHCP relaying for regular “non routed” hosts. The path forward here is most likely to move away from DHCP option 82 and instead use DHCP option 97 (see T304677).</p>
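<p>For reference, the relay configuration boils down to a couple of lines in <tt class="remarkup-monospaced">/etc/default/isc-dhcp-relay</tt> (the server name and interface list below are placeholders, not our actual values):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># What servers should the DHCP relay forward requests to?
SERVERS="dhcp-server.example.wmnet"
# On what interfaces should the DHCP relay listen?
# Must include the tap interfaces, hence the restart from net-common.
INTERFACES="eno1 tap0"
# Additional options for the daemon
OPTIONS=""</pre></div>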

<p><span class="remarkup-highlight">TODO</span> deploy a specific DHCP config snippet on the DHCP server to deliver the proper IP and route info to the VM (see <a href="https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/</a> ) but blocked by the issue above.</p>

<h4 class="remarkup-header">Guest VM IPv6 connectivity</h4>

<p>First we can start with a little security housekeeping by modifying the net-common script:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Disable learning any router-advertisements from the guest VMs</li>
<li class="remarkup-list-item">Disable NDP proxying (like ARP proxying but for IPv6)</li>
</ul>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="nb">echo</span> <span class="m">0</span>  &gt; /proc/sys/net/ipv6/conf/<span class="nv">$INTERFACE</span>/accept_ra
<span class="nb">echo</span> <span class="m">0</span>  &gt; /proc/sys/net/ipv6/conf/<span class="nv">$INTERFACE</span>/proxy_ndp</pre></div>

<p>While Ganeti is v4-aware (remember the <tt class="remarkup-monospaced">ip=10.66.2.10</tt> parameter for <tt class="remarkup-monospaced">gnt-instance add</tt>), this is not the case for IPv6, which is, so far, a blocker. Note that “attaching” an IP to a Ganeti instance object is especially needed for migration, as the target hypervisor needs to know which static routes to create.</p>

<p>There are multiple possible workarounds here, none of which are great:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Implement the missing feature (not really a workaround per-se, and significant work, so this is the least preferred option)</li>
<li class="remarkup-list-item">Rely on the “<a href="https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html?highlight=tags#tags-handling" class="remarkup-link remarkup-link-ext" rel="noreferrer">tags</a>” feature, which supports arbitrary key-value pairs, but this needs to be tested, especially to know if they are passed on to the <tt class="remarkup-monospaced">net-common</tt> script</li>
<li class="remarkup-list-item">Advertise the v6 prefixes over IPv4, with the major downside being that all guest VMs would need to run BGP…</li>
<li class="remarkup-list-item">Write a demon that inserts static route based on the NDB table (with safeguards)</li>
<li class="remarkup-list-item">Leverage our current mechanism of deriving the v6 IP from the v4 one</li>
</ul>

<p>Setting this blocker aside, v6 is easier to work with than v4 as it has plenty of IPs. This leads to the question: should I assign a /128 or a /64 per VM?</p>

<h5 class="remarkup-header">/128</h5>

<p>This can be configured dynamically using router advertisement. Testing <a href="https://github.com/radvd-project/radvd" class="remarkup-link remarkup-link-ext" rel="noreferrer">radvd</a> with this <tt class="remarkup-monospaced">/etc/radvd.conf</tt></p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::10/128 {
	AdvRouterAddr on;
  };
};</pre></div>

<p>On the VM side this configures the default gateway, leveraging link-local IPs. It only requires setting the NIC IP on the VM side (<tt class="remarkup-monospaced">2001:db8:cb00:7100::10/128</tt>) as well as the static route on the hypervisor side.</p>

<p>If we have to use a 3rd party stateful (config state) tool like radvd, we could just as well use <a href="https://github.com/grnet/nfdhcpd" class="remarkup-link remarkup-link-ext" rel="noreferrer">nfdhcpd</a>, but the latter seems abandoned.</p>

<p>To stay stateless on that side, the alternative is to modify the Debian <a href="https://github.com/wikimedia/operations-puppet/blob/7d8c67c621de889f32db8ad09fca32cd25f74e03/modules/install_server/files/autoinstall/scripts/late_command.sh#L97" class="remarkup-link remarkup-link-ext" rel="noreferrer">autoinstall</a> script so it sets the NIC v6 IP “manually”. In that case, radvd is only used to advertise the next hop IP:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7000::/52 {
	AdvOnLink off;
	AdvAutonomous off;
	AdvRouterAddr on;
  };
};</pre></div>

<p>At this point, maybe it’s easier to treat it fully like IPv4: ditch radvd and manually configure fe80::1 on tap0, with the matching routes on the VM side. As the v6 IP is generated from the v4 IP and the v6 prefix, it’s then also possible to add the hypervisor-side route from “net-common”.</p>
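<p>That radvd-free variant would look roughly like this (a sketch of the idea, using the addresses from the examples above; not tested commands):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># hypervisor side, per VM interface (e.g. from net-common):
ip addr add fe80::1/64 dev tap0
ip route add 2001:db8:cb00:7100::10/128 dev tap0

# VM side:
ip addr add 2001:db8:cb00:7100::10/128 dev ens13
ip -6 route add default via fe80::1 dev ens13</pre></div>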

<h5 class="remarkup-header">/64</h5>

<p>More efficient in some ways, the following allows the guest OS to perform automatic IP configuration using SLAAC. Only the static route on the Ganeti side is needed.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">interface tap0 {
  IgnoreIfMissing on;
  AdvSendAdvert on;
  AdvDefaultPreference high;
  prefix 2001:db8:cb00:7100::/64 {
	AdvRouterAddr on;
  };
};</pre></div>

<p>This is also compatible with our Debian <a href="https://github.com/wikimedia/operations-puppet/blob/7d8c67c621de889f32db8ad09fca32cd25f74e03/modules/install_server/files/autoinstall/scripts/late_command.sh#L97" class="remarkup-link remarkup-link-ext" rel="noreferrer">autoinstall</a> script, as it would still apply the v4-to-v6 mapping, in our case configuring the IP 2001:db8:cb00:7100:10:66:0:10/64 on the VM’s primary interface. It also allows for additional guest IPs (even though those are not widely used). Assigning a prefix to a VM, however, means changing the way we do IP allocation within our automation.</p>
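<p>The v4-to-v6 derivation can be sketched as follows (a simplified reconstruction of the mapping, not the autoinstall script’s actual code): each decimal octet of the v4 address becomes one hextet under the v6 /64 prefix.</p>

```python
import ipaddress

def derive_v6(prefix: str, v4: str) -> str:
    """Embed each decimal IPv4 octet as one hextet under a /64 prefix.
    Sketch of the v4-to-v6 mapping, not the production implementation."""
    hextets = [str(int(octet)) for octet in v4.split(".")]
    full = prefix.rstrip(":") + ":" + ":".join(hextets)
    return str(ipaddress.IPv6Address(full))  # validate and canonicalize

print(derive_v6("2001:db8:cb00:7100::", "10.66.0.10"))
# -> 2001:db8:cb00:7100:10:66:0:10
```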

<p>This however raises the question: how do we update that radvd.conf file (which needs an entry for every VM) at each VM movement? This seems tedious for the “net-common” bash script. Upstream started looking into <a href="https://github.com/radvd-project/radvd/issues/208" class="remarkup-link remarkup-link-ext" rel="noreferrer">including config files</a>, which would help. That’s why my preference is to use a static /128 IP per VM.</p>

<p>Overall there is significant work to be done regarding IPv6 that is outside the scope of this project, tracked in <a href="/T102099" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_1"><span class="phui-tag-core phui-tag-color-object">T102099: Fix IPv6 autoconf issues once and for all, across the fleet.</span></a> and more globally in <a href="/T234207" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_2"><span class="phui-tag-core phui-tag-color-object">T234207: Investigate improvements to how puppet manages network interfaces</span></a>. We could for example use DHCPv6. However, this Ganeti work should, as much as possible, go in the same general direction as what’s planned for those tasks.</p>

<h4 class="remarkup-header">Hypervisor firewalling</h4>

<p>The test Ganeti hosts have been freshly migrated to nftables (see <a href="/T336497" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_3"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T336497: Add support for nftables in profile::firewall</span></span></a>). For the testing phase a single line in a newly created <tt class="remarkup-monospaced">/etc/nftables/input/10_ganeti_guestvm.nft</tt> was enough to permit traffic from the guest VMs to the hypervisor.<br />
<tt class="remarkup-monospaced">iifname &quot;tap*&quot; accept</tt></p>

<p>Run <tt class="remarkup-monospaced">sudo systemctl reload nftables.service</tt> so it’s taken into account, then <tt class="remarkup-monospaced">sudo nft list ruleset</tt> to confirm it.<br />
Before making it production ready (through Puppet) this needs to be tightened up by allowing only a few ports and protocols:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">DHCP (see Guest VM DHCP above)</li>
<li class="remarkup-list-item">BGP &amp; BFD (see Guest VM BGP below)</li>
</ul>
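<p>A tightened-up version of that rule could look something like this (a sketch only; the exact port list would need to be confirmed before production):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">iifname "tap*" udp dport 67 accept    # DHCP to the relay on the hypervisor
iifname "tap*" tcp dport 179 accept   # BGP from guest VMs
iifname "tap*" udp dport 3784 accept  # single-hop BFD (RFC 5881)</pre></div>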

<p>The forwarding chain already allows all traffic by default, and a rule is already present to permit IPv6 neighbor discovery.</p>

<h4 class="remarkup-header">Guest VM routes redistribution</h4>

<p>At this point everything local to the hypervisor works. It’s time to make the rest of the infra aware of the VMs.</p>

<p>As we already use Bird on multiple systems across the infra, it makes sense to use it here as well.<br />
The snippet below instructs Bird to import static v4 and v6 routes from the Linux routing table, checking every second for any change.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">protocol kernel kernel_v4 {
	learn;
	scan time 1;
	ipv4 {
    	import where krt_source = 4; # statics
	};
}
protocol kernel kernel_v6 {
	learn;
	scan time 1;
	ipv6 {
    	import where krt_source = 3; # statics
   };
}
[...]</pre></div>

<p>The other side of the same coin is the “regular” BGP configuration on the routers and the additional safeguard filtering needed.</p>

<div class="remarkup-code-block" data-code-lang="diff" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span>[edit policy-options]
<span class="gi">+   prefix-list ganeti4 {</span>
<span class="gi">+   	10.66.2.0/24;</span>
<span class="gi">+   }</span>
[edit policy-options]
<span class="gi">+   policy-statement ganeti_import {</span>
<span class="gi">+   	term ganeti4 {</span>
<span class="gi">+       	from {</span>
<span class="gi">+           	prefix-list-filter ganeti4 longer;</span>
<span class="gi">+       	}</span>
<span class="gi">+       	then accept;</span>
<span class="gi">+   	}</span>
<span class="gi">+   	then reject;</span>
<span class="gi">+   }</span>
[edit protocols bgp]
<span class="gi">+	group Ganeti4 {</span>
<span class="gi">+    	type external;</span>
<span class="gi">+    	multihop {</span>
<span class="gi">+        	ttl 193;</span>
<span class="gi">+    	}</span>
<span class="gi">+    	local-address 208.80.153.192;</span>
<span class="gi">+    	import ganeti_import;</span>
<span class="gi">+    	family inet {</span>
<span class="gi">+        	unicast {</span>
<span class="gi">+            	prefix-limit {</span>
<span class="gi">+                	maximum 5;</span>
<span class="gi">+                	teardown 80;</span>
<span class="gi">+            	}</span>
<span class="gi">+        	}</span>
<span class="gi">+    	}</span>
<span class="gi">+    	export NONE;</span>
<span class="gi">+    	peer-as 64650;</span>
<span class="gi">+    	neighbor 10.192.48.73 {</span>
<span class="gi">+        	description ganeti-test2004;</span>
<span class="gi">+    	}</span>
<span class="gi">+	}</span></pre></div>

<p>Here BFD isn’t strictly needed as these are unicast prefixes (at least for now): if the hypervisor goes down, there is no need for faster failover, as there is no alternative host anyway.</p>

<p>Testing it with IPv4 only (the IPv6 behavior is expected to be similar): running pings to the VM IP from bast4005 in ulsfo, with an interval of 0.5s, VM migration downtime and full convergence were achieved in less than 2 seconds.</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">30</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">74.3</span> <span class="n">ms</span>   <span class="o">&lt;-</span> <span class="n">VM</span> <span class="n">in</span> <span class="n">Ashburn</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">31</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">73.5</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">32</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">73.9</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">36</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">42.3</span> <span class="n">ms</span>   <span class="o">&lt;-</span> <span class="n">VM</span> <span class="n">in</span> <span class="n">Dallas</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">37</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">41.9</span> <span class="n">ms</span>
<span class="mi">64</span> <span class="n">bytes</span> <span class="n">from</span> <span class="mf">10.66.2.15</span><span class="o">:</span> <span class="n">icmp_seq</span><span class="o">=</span><span class="mi">38</span> <span class="n">ttl</span><span class="o">=</span><span class="mi">60</span> <span class="n">time</span><span class="o">=</span><span class="mf">41.9</span> <span class="n">ms</span></pre></div>



<h4 class="remarkup-header">Guest VMs BGP</h4>

<p>As mentioned in the previous section, we rely on BGP on the end hosts to advertise Anycast prefixes for high availability and improved service latency. Some of those services are running in VMs, for example <a href="https://wikitech.wikimedia.org/wiki/Wikimedia_DNS" class="remarkup-link remarkup-link-ext" rel="noreferrer">Wikimedia DNS</a>.</p>

<p>For those services (which are likely to grow in number), the BGP sessions need to be established with the hypervisor, or in other words with the VM&#039;s next-hop gateway. This is how they&#039;re currently configured on hosts behind L3 switches.</p>

<p>Adding an extra hop (the hypervisor) in the AS-path (router &gt; switch &gt; hypervisor &gt; VM) means that an additional prepend is needed on the non-Ganeti Anycast prefixes, like we did when we introduced the new switching fabric. This is in order to maintain a constant AS-path length wherever the end host is located, and thus offer proper balancing (otherwise traffic won’t reach the longer AS paths in normal operations).</p>
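<p>The AS-path bookkeeping can be illustrated with hypothetical ASNs (placeholders, not our real ASNs):</p>

```python
# AS paths as seen from the router for the same anycast prefix
path_via_hypervisor = ["vm_asn", "hypervisor_asn", "switch_asn"]  # routed Ganeti VM
path_bare_metal = ["host_asn", "switch_asn"]                      # host behind L3 switch

# Prepends needed on the bare-metal announcements so both origins
# present equal-length AS paths (and thus balance properly):
prepends = len(path_via_hypervisor) - len(path_bare_metal)
print(prepends)  # 1
```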

<p>Additional configuration needs to be added to Bird for this to work on the Ganeti side. The VM side&#039;s config can be left untouched.</p>

<p>First, BFD becomes necessary for faster anycast failover between the hypervisor and the network, but not between the hypervisor and the VMs, as Bird will track the VM-facing interface (tapX) and withdraw the prefixes if it goes down.<br />
To keep the system dynamic, we should keep as little state on the hypervisor as possible. That’s why Bird is passive here and waits for the VM to initiate the session. This could make re-establishing the session a bit slower after a migration: the time for the VM to notice it’s no longer speaking with the same hypervisor, shut down the session and re-create it. If this is deemed too slow, BFD could be introduced at this layer too for faster recovery.</p>

<p>Security wise, this could permit a rogue user with root on a VM (or a misconfiguration) to pretend to be any allowed AS and advertise any IP permitted in the “VMs_import” filter. This risk is quite low, but additional security mechanisms could be used, like MD5 (at least until <a href="https://gitlab.nic.cz/labs/bird/-/blob/master/doc/roadmap.md#tcp-ao-if-it-appears-in-linux-and-bsd-upstream" class="remarkup-link remarkup-link-ext" rel="noreferrer">TCP-AO</a> is implemented); this wouldn’t prevent misconfigurations though. Another option is to pre-populate the full list of BGP peers with their respective AS using Puppet, with the significant downside of causing config/alerting fatigue and slower provisioning time.</p>

<div class="remarkup-code-block" data-code-lang="c" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="n">protocol</span> <span class="n">bgp</span> <span class="n">bgp_v4</span> <span class="p">{</span>
	<span class="n">ipv4</span> <span class="p">{</span>
    	<span class="n">import</span> <span class="n">filter</span> <span class="n">VMs_import</span><span class="p">;</span>
    	<span class="n">export</span> <span class="n">none</span><span class="p">;</span>
	<span class="p">};</span>
	<span class="n">local</span>  <span class="n">as</span> <span class="mi">64650</span><span class="p">;</span>
	<span class="n">neighbor</span> <span class="n">range</span> <span class="mf">10.66.2.0</span><span class="o">/</span><span class="mi">24</span>  <span class="n">external</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">protocol</span> <span class="n">bgp</span> <span class="n">bgp_v6</span> <span class="p">{</span>
	<span class="n">ipv6</span> <span class="p">{</span>
    	<span class="n">import</span> <span class="n">filter</span> <span class="n">VMs6_import</span><span class="p">;</span>
    	<span class="n">export</span> <span class="n">none</span><span class="p">;</span>
	<span class="p">};</span>
	<span class="n">local</span>  <span class="n">as</span> <span class="mi">64650</span><span class="p">;</span>
	<span class="n">neighbor</span> <span class="n">fe80</span><span class="o">::/</span><span class="mi">10</span> <span class="n">external</span><span class="p">;</span>
<span class="p">}</span></pre></div>

<p>This will require thorough testing before using it in production.</p>

<h4 class="remarkup-header">v4 and v6 prefixes allocations</h4>

<p>In addition to not painting ourselves into a corner with a bad addressing plan, this is important as prefix allocation defines the scope of each Ganeti cluster. As each VM is routable, it can technically live anywhere in our network.</p>

<p>There are 2 options here:<br />
The first is to use per-DC prefixes, mimicking our current way of doing things. For example, use 10.66.2.0/23 for eqiad v4 private (a total of 512 IPs, with the possibility to grow it). Unfortunately 208.80.154.0/23 is fully allocated, so any new v4 public IPs will need to come from a subnet re-sizing or a new, larger prefix.<br />
V6 is much easier: the allocation only depends on whether we allocate a /128 or /64 per VM. Private vs. public IPv6 could even be enforced at the hypervisor and come from the same pool of IPs. Probably not the best option as it goes against our current way of operating hosts, but it’s a possibility.<br />
A variant of this is to group IPs (using sub-allocations) by Ganeti cluster. This allows aggregating prefixes to reduce the size of routing tables across the infra, but is not necessary given the small number of VMs we’re running.</p>
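<p>As a quick sanity check on the sizing above, Python’s <tt class="remarkup-monospaced">ipaddress</tt> module can do the arithmetic (the v6 sub-allocation prefix is purely illustrative):</p>

```python
import ipaddress

# Per-DC private v4 pool, as discussed above.
pool = ipaddress.ip_network("10.66.2.0/23")
print(pool.num_addresses)  # 512 addresses in a /23

# In routed mode each VM gets a host route, i.e. a /32.
vm_routes = list(pool.subnets(new_prefix=32))
print(len(vm_routes))  # 512: one /32 per address

# v6: a /64 per VM means carving /64s out of a larger sub-allocation.
v6_pool = ipaddress.ip_network("2001:db8:100::/56")  # illustrative prefix
print(sum(1 for _ in v6_pool.subnets(new_prefix=64)))  # 256 VMs per /56
```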

<p>The other option is to use a global pool of IPs. For example, start naming the VMs testvm1.global.wmnet and assign them an IP from a prefix outside of any of our POPs, like the 10.3.0.0/24 we have for internal anycast. The major advantage is that the VM can be moved anywhere in our infra without having to be renumbered. The major inconvenience is that it becomes quite confusing and would require significant changes to our infrastructure, while providing a false sense of security and increasing the blast radius of a single Ganeti cluster. It’s better to design a service with multiple VMs per site than to rely on being able to move a VM from site to site.<br />
Just because we CAN do it doesn’t mean we SHOULD, but we could for example have a special long-distance cluster for problematic applications that can’t be active/passive (if there are any).</p>

<h4 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> to <tt class="remarkup-monospaced">L3</tt> cluster migration</h4>

<p>Going down that path will require a multi-step migration: first focusing on simple VMs (eg. not running BGP) in the private subnet, then extending the scope. Tooling will need to be adjusted first.<br />
A hard requirement is that VMs will need to be re-IPed. This means re-imaged, like we’re planning on doing for bare metal servers.<br />
I haven’t tested whether a cluster can run both routed and bridged VMs at the same time. Even if it can, this sounds like a risky move, which is why it is preferable to spin up a new cluster.</p>

<p>This cluster can start with two nodes, then progressively receive migrated VMs, freeing up space on other clusters and allowing us to re-purpose hypervisors, etc.</p>

<h2 class="remarkup-header"><tt class="remarkup-monospaced">L2</tt> abstraction at the hypervisor</h2>

<p>This solution differs from the previous one by using VXLAN (or any tunneling technology) to provide a <tt class="remarkup-monospaced">L2</tt> domain to the VMs. Instead of relying on Linux&#039;s ability to use a /32 prefix or /128 IPv6 on their virtual NIC, they will be assigned a regular /27-ish or /64 v6. The abstraction takes care of propagating reachability information between VMs. Other than that, it reuses most of the building blocks from the previous solution: DHCP relay, BGP, hypervisor firewalling, router advertisement. It also offers shorter downtime during switchover as even if the VM is now live on a different hypervisor, traffic can still be bridged from the previous hypervisor until BGP converges (we’re talking about milliseconds to seconds here). This would have been the preferred option if simulating a <tt class="remarkup-monospaced">L2</tt> adjacency was required, but in the current state of things it only adds an extra layer, making management and troubleshooting more complex.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>Within Wikimedia’s infrastructure (Debian based, all but 3 VMs having a single IP, BGP on the host needed), migrating the Ganeti clusters to work in a “routed” mode is a viable option to permit VM live migration between hypervisors spread over any number of L3 domains. The main downside is that this solution requires more preparation and deployment work compared to a <tt class="remarkup-monospaced">L2</tt>-only solution and possibly a tunnel based one. On the other hand, using only standards and open source components makes it a sustainable and low maintenance cost option as well.</p>

<h3 class="remarkup-header">Next steps</h3>

<p>The first next step is to get this document reviewed for any pitfall or oversight I may have made. Then, shortly after, reach a common agreement, including on the few open questions if we stick to routed Ganeti:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">/128 or /64 for IPv6 VMs?</li>
<li class="remarkup-list-item">Prefix allocations?</li>
<li class="remarkup-list-item">How far should a Ganeti cluster and Ganeti group spread?</li>
</ul>

<p>Once this is decided we need to start allocating hardware resources, as my test devices are decommissioned servers that need to be returned to DCops. Then list, and start working on, the prerequisites needed to make it happen: our automation, (host and network) BGP, Puppet, DHCP, etc. (a few are already listed throughout the document).<br />
Timeline-wise this needs to be prioritized to match the core DCs network refreshes, which means ideally being fully ready in Dallas within 6 months at most.</p>

<h2 class="remarkup-header">Other considerations</h2>

<h3 class="remarkup-header">Not going that way</h3>

<p>Creating distinct routing tables for public vs. private zones.</p>

<div class="remarkup-code-block" data-code-lang="bash" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="nb">echo</span> <span class="s2">&quot;100   private&quot;</span> &gt;&gt; /etc/iproute2/rt_tables
<span class="nb">echo</span> <span class="s2">&quot;200   public&quot;</span> &gt;&gt; /etc/iproute2/rt_tables</pre></div>

<p>This initially sounded like a good idea as we separate public vs. private vlans in our infrastructure, but the only reason we actually separate them is to be able to provide different IPs. All the firewalling is done on the hosts (and a bit on the routers).</p>

<h4 class="remarkup-header">Use Linux VRF to separate hypervisor from VM traffic</h4>

<p>A bit similar to the one above but with stricter separation between VM and hypervisor. This would have added extra security if we were providing VMs for untrusted customers (like a cloud provider).</p>

<h4 class="remarkup-header">Mixed clusters</h4>

<p>Test if a cluster can have both routed and bridged VMs in parallel. I didn’t spend time testing this option as even if it works, there is a risk of impacting production VMs.</p>

<h3 class="remarkup-header">Possible future work</h3>

<h4 class="remarkup-header">Dynamic Ganeti cluster VIP</h4>

<p>The master-netdev is the interface on which the cluster’s dedicated management IP lives. It’s currently using a row-specific IP while being the management IP for a site-wide cluster. If that IP becomes unreachable the cluster keeps functioning, but operations (create/delete/modify) can’t be performed. We might benefit from assigning the --master-netdev to the loopback interface and the cluster’s FQDN to a VIP, advertised by BGP like the VMs’ IPs. That would allow for seamless VIP migration and thus easier hypervisor maintenance.</p>

<h4 class="remarkup-header">Apply the same mechanism to bare-metal servers</h4>

<p>This could for example help us save on public IPs instead of having a dedicated public rack or multiple racks with a public IPv4 prefix.</p>

<h4 class="remarkup-header">Add support for multiple IPs per host</h4>

<p>This would require patching Ganeti and thus might be a complex operation to support only 3 hosts (<tt class="remarkup-monospaced">lists1001.wikimedia.org,mx[1001,2001].wikimedia.org</tt>). It can however be done if no alternative exists. Leveraging BGP here again might be the easiest way to go.</p>

<h4 class="remarkup-header">Use fixed MAC address for tap* interfaces</h4>

<p>This has been suggested by Cathal “Unsure of whether it&#039;s an option but potentially all TAP interfaces could also be forced to the same MAC address?  Thus making no ARP update for this on the VM side required (similar to anycast gw idea in evpn).”.<br />
If this works as expected, it would make live migration even faster and not require the “arping” mentioned earlier in this doc, as from the VM’s point of view its gateway would look exactly the same.</p>

<h4 class="remarkup-header">Use iBGP between hypervisor and VMs</h4>

<p>This path would allow us to not add an extra BGP hop between the hypervisor and the end VM. Exact tradeoffs remain to be investigated.</p>

<h3 class="remarkup-header">Possible limitations</h3>

<h4 class="remarkup-header">Applications or OSes improperly handling /32 or /128 interface IPs</h4>

<p>Legacy software or specialized OSes might choke on the seemingly odd NIC IP. If it happens, it will need to be tackled on a case by case basis. As the migration to routed Ganeti will take time, those specific cases could keep running on the “former” Ganeti until a solution is found.</p>

<h2 class="remarkup-header">Resources</h2>

<p><a href="https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.fhrnet.eu/2020/03/07/dhcp-server-on-a-32-subnet/</a><br />
<a href="https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor</a><br />
<a href="https://docs.ganeti.org/docs/ganeti/3.0/html/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://docs.ganeti.org/docs/ganeti/3.0/html/</a><br />
<a href="https://linux.die.net/man/5/radvd.conf" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://linux.die.net/man/5/radvd.conf</a><br />
<a href="https://bird.network.cz/?get_doc&amp;v=20&amp;f=bird-6.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://bird.network.cz/?get_doc&amp;v=20&amp;f=bird-6.html</a><br />
<a href="https://www.netfilter.org/projects/nftables/manpage.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://www.netfilter.org/projects/nftables/manpage.html</a><br />
<a href="https://docs.ganeti.org/docs/ganeti/2.2/html/design-2.1.html?highlight=routed#non-bridged-instances-support" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://docs.ganeti.org/docs/ganeti/2.2/html/design-2.1.html?highlight=routed#non-bridged-instances-support</a><br />
<a href="https://github.com/grnet/snf-network/blob/develop/docs/routed.rst" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/grnet/snf-network/blob/develop/docs/routed.rst</a><br />
<a href="http://blkperl.github.io/split-brain-ganeti.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">http://blkperl.github.io/split-brain-ganeti.html</a><br />
<a href="https://blog.cloudflare.com/virtual-networking-101-understanding-tap/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://blog.cloudflare.com/virtual-networking-101-understanding-tap/</a></p></div></content></entry><entry><title>Multi-platform network configuration</title><link href="/phame/live/17/post/304/multi-platform_network_configuration/" /><id>https://phabricator.wikimedia.org/phame/post/view/304/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2023-07-13T14:31:52+00:00</published><updated>2025-12-10T11:14:00+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Network configuration is a quite rapidly evolving area which went through multiple phases. It’s also surprisingly tied to monitoring. Below is some historical context from the industry as well as what we’re doing in SRE.</p>

<h2 class="remarkup-header">Some context</h2>

<h3 class="remarkup-header">The past</h3>

<p>As there was no standardized or programmatic way to get data in or out of network devices, engineers had to be creative. The early days of network automation consisted of scripts pretending to be a human operator.<br />
Those scripts would connect to devices (using ssh or telnet), send commands, <a href="https://github.com/willcurtis/ciscoexpectscript" class="remarkup-link remarkup-link-ext" rel="noreferrer">expecting</a> some prompts, and scraping whatever was sent back.<br />
You can imagine that this was extremely slow and error prone: output layout would change from one version to the next, unexpected output would break scripts, CLIs would get overwhelmed by too much information entered at once, and data would not get validated beforehand.</p>

<h4 class="remarkup-header">SNMP</h4>

<p><a href="https://en.wikipedia.org/wiki/Simple_Network_Management_Protocol" class="remarkup-link remarkup-link-ext" rel="noreferrer">SNMP</a> is a protocol aimed at improving this situation. Despite the limitations it’s widely disparaged for, SNMP is still widely used to monitor devices, especially as it’s supported on virtually all of them.</p>

<p>It got popular for monitoring as none of its limitations are hard blockers for simply pulling counters and states from a device. Security is not critical as it&#039;s read only, and if a packet doesn’t arrive, it’s not a big deal: the data will show up at the next poll.<br />
It’s another story for the configuration aspect. Security is critical (v3 got implemented quite late), as is being sure that a change got applied. These factors, as well as a syntax that is difficult to interact with, meant it never got wide adoption for configuration.</p>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Get a devices description over SNMP</div><pre class="remarkup-code"><span class="gp">$ /usr/bin/snmpget -v2c -c &lt;secret&gt; -OUQn -m SNMPv2-MIB &lt;hostname&gt; sysDescr.0</span>
<span class="go">.1.3.6.1.2.1.1.1.0 = Juniper Networks, Inc. qfx5100-48s-6q (...)</span></pre></div>

<p>SNMP, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: UDP (packet size limited and subject to packet loss)</li>
<li class="remarkup-list-item">Atomicity: None (each request is independent and can change 1 configuration option)</li>
<li class="remarkup-list-item">Message encoding: ASN.1 (quite complex to manually craft)</li>
<li class="remarkup-list-item">Data model: <a href="https://en.wikipedia.org/wiki/Structure_of_Management_Information" class="remarkup-link remarkup-link-ext" rel="noreferrer">SMI</a> (standard or vendor specific)</li>
<li class="remarkup-list-item">Mostly pull based (SNMP traps are less used, especially as it’s UDP)</li>
<li class="remarkup-list-item">Encryption:<ul class="remarkup-list">
<li class="remarkup-list-item">v2c: clear text “community” (not secure, most common for “get”)</li>
<li class="remarkup-list-item">v3: authentication and payload encryption (mostly used for “set” if used at all)</li>
</ul></li>
</ul>

<h3 class="remarkup-header">The present</h3>

<h4 class="remarkup-header">Monitoring</h4>

<p>We actively use SNMP for monitoring: mostly through <a href="https://wikitech.wikimedia.org/wiki/LibreNMS" class="remarkup-link remarkup-link-ext" rel="noreferrer">LibreNMS</a>, which is fully built around SNMP and provides a great UI that solves one difficult part of monitoring: how to display relevant information. And, to a lesser extent, with ad-hoc Icinga scripts pulling specific SNMP OIDs (values) for alerting only.<br />
The current system is working fine (and on all our platforms, even <a href="https://en.wikipedia.org/wiki/Power_distribution_unit" class="remarkup-link remarkup-link-ext" rel="noreferrer">PDUs</a>).<br />
Newer protocols have been implemented with features answering modern needs, like more frequent polling (some SNMP implementations discourage pulling data too often) or the ability to get virtually any metric from the devices, in addition to being more reliable.<br />
So if we have to spend engineering time (for example to monitor QoS accurately - <a href="https://phabricator.wikimedia.org/T326322" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_5"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T326322</span></span></a>), SNMP based tools might not be the best investment.</p>

<p>Regarding configuration, two elements combined in our favor. First, using exclusively Juniper equipment allowed us to focus our efforts. Second, Juniper was a pioneer in NETCONF based device configuration, as well as being “API first”.</p>

<h4 class="remarkup-header">NETCONF</h4>

<p><a href="https://en.wikipedia.org/wiki/NETCONF" class="remarkup-link remarkup-link-ext" rel="noreferrer">NETCONF</a> is in some way the 2nd attempt of the industry for a standardized way of configuring devices. It works in a more layered approach leveraging proven protocols. It is now the industry standard and has been extended in multiple ways.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: SSH (most common), HTTPS (more recent)</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer</li>
<li class="remarkup-list-item">Message encoding: XML-RPC (most common), JSON-RPC (more recent)</li>
<li class="remarkup-list-item">Data model: YANG* or proprietary via an abstraction layer (eg. Junos set)</li>
<li class="remarkup-list-item">Supports locks, atomic changes (apply a set of changes or nothing), full configuration changes</li>
</ul>

<p>(*More on YANG later)</p>

<div class="remarkup-code-block" data-code-lang="xml" data-sigil="remarkup-code-block"><div class="remarkup-code-header">sending a configuration to a device then discarding it</div><pre class="remarkup-code" style=" max-height: 20em; overflow: auto;"><span></span><span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span>
<span class="nt">&lt;nc:rpc</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:xxx&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;load-configuration</span> <span class="na">format=</span><span class="s">&quot;text&quot;</span> <span class="na">action=</span><span class="s">&quot;replace&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;configuration-text&gt;</span>
My configuration
<span class="nt">&lt;/configuration-text&gt;</span>
<span class="nt">&lt;/load-configuration&gt;</span>
<span class="nt">&lt;/nc:rpc&gt;</span>
]]&gt;]]&gt;
<span class="nt">&lt;rpc-reply</span> <span class="na">xmlns=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">xmlns:junos=</span><span class="s">&quot;http://xml.juniper.net/junos/21.2R0/junos&quot;</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:f9ab59d9-746f-423f-b534-67197941f3df&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;load-configuration-results&gt;</span>
<span class="nt">&lt;rpc-error&gt;</span>
<span class="nt">&lt;error-severity&gt;</span>warning<span class="nt">&lt;/error-severity&gt;</span>
<span class="nt">&lt;error-path&gt;</span>[edit routing-options validation]<span class="nt">&lt;/error-path&gt;</span>
<span class="nt">&lt;error-message&gt;</span>mgd: statement has no contents; ignored<span class="nt">&lt;/error-message&gt;</span>
<span class="nt">&lt;error-info&gt;</span>
<span class="nt">&lt;bad-element&gt;</span>static<span class="nt">&lt;/bad-element&gt;</span>
<span class="nt">&lt;/error-info&gt;</span>
<span class="nt">&lt;/rpc-error&gt;</span>
<span class="nt">&lt;ok/&gt;</span>
<span class="nt">&lt;/load-configuration-results&gt;</span>
<span class="nt">&lt;/rpc-reply&gt;</span>
<span class="cp">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span><span class="nt">&lt;nc:rpc</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:cc664f8c-8815-41f8-82a8-c655bc6eda10&quot;</span><span class="nt">&gt;&lt;get-configuration</span> <span class="na">compare=</span><span class="s">&quot;rollback&quot;</span> <span class="na">rollback=</span><span class="s">&quot;0&quot;</span> <span class="na">format=</span><span class="s">&quot;text&quot;</span><span class="nt">/&gt;&lt;/nc:rpc&gt;</span>]]&gt;]]&gt;
<span class="nt">&lt;rpc-reply</span> <span class="na">xmlns=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">xmlns:junos=</span><span class="s">&quot;http://xml.juniper.net/junos/21.2R0/junos&quot;</span> <span class="na">xmlns:nc=</span><span class="s">&quot;urn:ietf:params:xml:ns:netconf:base:1.0&quot;</span> <span class="na">message-id=</span><span class="s">&quot;urn:uuid:cc664f8c-8815-41f8-82a8-c655bc6eda10&quot;</span><span class="nt">&gt;</span>
<span class="nt">&lt;configuration-information&gt;</span>
<span class="nt">&lt;configuration-output&gt;</span>
<span class="nt">&lt;/configuration-output&gt;</span>
<span class="nt">&lt;/configuration-information&gt;</span>
<span class="nt">&lt;/rpc-reply&gt;</span></pre></div>

<p>Our current network automation (<a href="https://wikitech.wikimedia.org/wiki/Homer" class="remarkup-link remarkup-link-ext" rel="noreferrer">Homer</a>) leverages NETCONF through multiple abstraction layers. It fetches data from a couple sources of truth (<a href="https://wikitech.wikimedia.org/wiki/Netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox</a> and <a href="https://gerrit.wikimedia.org/g/operations/homer-public" class="remarkup-link remarkup-link-ext" rel="noreferrer">YAML</a> files). Using Jinja templates it formats this data in the most user readable structure (the standard Junos syntax). Last, Juniper’s <a href="https://github.com/Juniper/py-junos-eznc" class="remarkup-link remarkup-link-ext" rel="noreferrer">py-junos-eznc</a> library wraps the generated configuration and overall instructions in XML to send it “over” NETCONF (using <a href="https://github.com/ncclient/ncclient" class="remarkup-link remarkup-link-ext" rel="noreferrer">ncclient</a>) to the device. Then the device&#039;s configuration engine takes care of showing us diffs of what changed, handling rollback, etc.</p>
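<p>As a rough illustration of the kind of envelope those layers produce, here is a minimal stdlib-only sketch (the helper name is made up) of wrapping a Junos-style text configuration in a NETCONF <tt class="remarkup-monospaced">&lt;rpc&gt;</tt>, mirroring the <tt class="remarkup-monospaced">load-configuration</tt> exchange shown earlier; in practice py-junos-eznc and ncclient assemble and transport this for us:</p>

```python
import xml.etree.ElementTree as ET

NETCONF_NS = "urn:ietf:params:xml:ns:netconf:base:1.0"

def build_load_config_rpc(config_text: str, message_id: str) -> str:
    """Wrap a text configuration in a NETCONF <rpc> element, similar to the
    <load-configuration> request shown in the XML example above.
    (Sketch only: a real client also handles framing, commit, rollback.)"""
    rpc = ET.Element(f"{{{NETCONF_NS}}}rpc", {"message-id": message_id})
    load = ET.SubElement(rpc, "load-configuration",
                         {"format": "text", "action": "replace"})
    ET.SubElement(load, "configuration-text").text = config_text
    return ET.tostring(rpc, encoding="unicode")

payload = build_load_config_rpc("system { host-name test1; }", "urn:uuid:xxx")
print(payload)
```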

<p>We’ve opted to do full configuration replacement instead of specific sections only, to prevent drift between the devices’ configurations and the wanted state (eg. manual configuration). A Homer “run” will normalize any changes to match the state from our sources of truth. The downside of this approach is its slowness.</p>

<p>To remediate this slowness, we have implemented a more ad-hoc way of doing scope-limited changes, for example for server port configuration. Using Cumin, Cookbooks can send Netbox based configuration changes as well as get states by issuing Junos CLI commands over SSH. This path is made possible by the Junos CLI being able to return any command output as JSON or XML.<br />
Those tools allowed us to clean up and streamline our network configuration, remove toil, iterate faster on provisioning and troubleshooting, fix misconfigurations, and react efficiently to attacks, at the cost of duplicating some of our automation tooling.</p>
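<p>Structured CLI output is what makes this path robust: instead of scraping text, a cookbook can walk a JSON document. A small sketch, with a trimmed, assumed sample shape loosely mimicking <tt class="remarkup-monospaced">show interfaces terse | display json</tt> (real Junos output has many more fields):</p>

```python
import json

# Illustrative sample only: an assumed, trimmed shape of Junos JSON output,
# where each leaf is a list of {"data": value} objects.
sample = json.loads("""
{"interface-information": [{"physical-interface": [
  {"name": [{"data": "xe-0/0/0"}], "oper-status": [{"data": "up"}]},
  {"name": [{"data": "xe-0/0/1"}], "oper-status": [{"data": "down"}]}
]}]}
""")

def down_interfaces(doc: dict) -> list:
    """Return names of physical interfaces whose oper-status is not 'up'."""
    phys = doc["interface-information"][0]["physical-interface"]
    return [i["name"][0]["data"] for i in phys
            if i["oper-status"][0]["data"] != "up"]

print(down_interfaces(sample))  # ['xe-0/0/1']
```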

<p>Nevertheless there is still plenty of room for improvement: in the tools themselves (<a href="/T250415" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_6"><span class="phui-tag-core phui-tag-color-object">T250415: Homer: add parallelization support</span></a>, performance improvements), the workflows, configuration standardization and verification, and integration with other tools or platforms (<a href="/T328747" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_7"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T328747: Improve Homer output when Juniper device rejects config</span></span></a>, <a href="/T253194" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_8"><span class="phui-tag-core phui-tag-color-object">T253194: Homer CI: verify Junos syntax</span></a>).</p>

<h3 class="remarkup-header">The future</h3>

<p>We recently started looking at alternate vendors for our network switches. After thorough <a href="https://wikitech.wikimedia.org/wiki/Dell_Enterprise_Sonic_Evaluation" class="remarkup-link remarkup-link-ext" rel="noreferrer">evaluation</a> we have decided to start rolling out some SONiC based network devices in our production environment (we could call it phase 2 testing) to make sure larger deployments would be doable.</p>

<p>Even though it was designed with some level of multi-vendor support in mind, only one vendor was implemented in Homer on day one, which makes it a mostly Junos tool.<br />
During the evaluation we had to make sure that any alternate vendor could be automated either in a similar way to what we’re currently doing (full configuration replacement + ad-hoc CLI commands), or in a way that goes in the same direction as the whole industry (including Juniper). The former is a quick way to get started but risky in the long run; the latter has a larger upfront engineering cost but is an investment in the future. Despite the drawback of not supporting NETCONF, SONiC falls in the 2nd category.<br />
Extensive documentation on its <a href="https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/Management%20Framework.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">management framework</a> is available online (being open source probably helps).</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/shf6ikoin4634kshxlzc/PHID-FILE-lvcgigqpl4qndyh3n42k/Mgmt_Frmk_Arch.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_4"><img src="https://phab.wmfusercontent.org/file/data/shf6ikoin4634kshxlzc/PHID-FILE-lvcgigqpl4qndyh3n42k/Mgmt_Frmk_Arch.jpg" height="917" width="1726" loading="lazy" alt="Mgmt_Frmk_Arch.jpg (917×1 px, 273 KB)" /></a></div><br />
Source: <a href="https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/images/Mgmt_Frmk_Arch.jpg" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/sonic-net/SONiC/blob/master/doc/mgmt/images/Mgmt_Frmk_Arch.jpg</a><br />
On this diagram we can see that there are 2 programmatic ways to interact with a device: REST(CONF) and gNMI. <strong>The CLI becomes a regular REST client</strong>, the same goes for Junos.</p>

<h4 class="remarkup-header">Quick digression on Datastores</h4>

<p>Before they were formally called datastores, most network devices had the concept of “2 or more configuration versions”. We can think of Cisco’s historic startup vs. running configuration, or Juniper’s more modern configuration history. In 2018, <a href="https://datatracker.ietf.org/doc/rfc8342/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC8342</a> standardized and extended those 3 base datastores:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Startup: optional, rw, persistent</li>
<li class="remarkup-list-item">Candidate: optional, rw, volatile, can be messed with, no production impact</li>
<li class="remarkup-list-item">Running: required, rw, persistent</li>
</ul>

<p>With:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Intended: required, ro, volatile, can be same as running, apply any needed transformations</li>
<li class="remarkup-list-item">Operational: ro, all configuration and states data</li>
</ul>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><div class="remarkup-code-header">datastores from RFC8342</div><pre class="remarkup-code" style=" max-height: 40em; overflow: auto;">+-------------+                 +-----------+
| &lt;candidate&gt; |                 | &lt;startup&gt; |
|  (ct, rw)   |&lt;---+       +---&gt;| (ct, rw)  |
+-------------+    |       |    +-----------+
       |           |       |           |
       |         +-----------+         |
       +--------&gt;| &lt;running&gt; |&lt;--------+
                 | (ct, rw)  |
                 +-----------+
                       |
                       |        // configuration transformations,
                       |        // e.g., removal of nodes marked as
                       |        // &quot;inactive&quot;, expansion of
                       |        // templates
                       v
                 +------------+
                 | &lt;intended&gt; | // subject to validation
                 | (ct, ro)   |
                 +------------+
                       |        // changes applied, subject to
                       |        // local factors, e.g., missing
                       |        // resources, delays
                       |
  dynamic              |   +-------- learned configuration
  configuration        |   +-------- system configuration
  datastores -----+    |   +-------- default configuration
                  |    |   |
                  v    v   v
               +---------------+
               | &lt;operational&gt; | &lt;-- system state
               | (ct + cf, ro) |
               +---------------+

ct = config true; cf = config false
rw = read-write; ro = read-only
boxes denote named datastores</pre></div>

<p>It’s up to the upper protocols (RESTCONF, NETCONF, etc) to define ways to expose or copy between those datastores (commit, rollback, etc).</p>

<h4 class="remarkup-header">RESTCONF</h4>

<p>This more recent protocol, <a href="https://datatracker.ietf.org/doc/rfc8040/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8040</a> (2017), can be imagined as NETCONF over HTTPS, leveraging the now popular REST architecture. Due to its younger age it has fewer features, but it is regularly extended. For example <a href="https://datatracker.ietf.org/doc/rfc8527/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8527</a> (March 2019) paves the way for config rollback style operations (there is only a non-standard <a href="https://support.yumaworks.com/support/solutions/articles/1000244321-how-to-use-the-confirmed-commit-procedure-in-restconf" class="remarkup-link remarkup-link-ext" rel="noreferrer">implementation</a> so far) as well as subscriptions (themselves in <a href="https://datatracker.ietf.org/doc/rfc8650/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8650</a>) by bringing the concept of datastores to RESTCONF.</p>

<p>RESTCONF, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: HTTPS</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer (TLS)</li>
<li class="remarkup-list-item">Message encoding: XML or JSON</li>
<li class="remarkup-list-item">Data model: YANG*</li>
<li class="remarkup-list-item">Python libraries: many options, starting with the well-known <a href="https://requests.readthedocs.io/en/latest/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Requests</a></li>
</ul>

<p>(*More on YANG later)</p>
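<p>Since RESTCONF is just HTTPS, a query can be sketched with plain Requests. A minimal illustration; the hostname, credentials, YANG path and helper names below are made up for the example, not real endpoints:</p>

```python
# Sketch of a RESTCONF (RFC 8040) GET.  The helper names, device
# hostname, credentials and YANG paths are illustrative only.


def restconf_url(host: str, path: str, datastore: str = "data") -> str:
    """Build a RESTCONF URL for a YANG path.

    With RFC 8527, 'datastore' can also address NMDA datastores,
    e.g. "ds/ietf-datastores:operational".
    """
    return f"https://{host}/restconf/{datastore}/{path}"


def get_yang_data(host: str, path: str, user: str, password: str) -> dict:
    """GET a YANG-modelled subtree, JSON-encoded (RFC 7951 style)."""
    import requests  # third-party; the HTTP library mentioned above

    resp = requests.get(
        restconf_url(host, path),
        auth=(user, password),
        headers={"Accept": "application/yang-data+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```

<p>For example, <tt class="remarkup-monospaced">get_yang_data(host, "openconfig-interfaces:interfaces", user, password)</tt> would return the interfaces subtree as a Python dictionary.</p>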

<h4 class="remarkup-header">gNMI</h4>

<p>While NETCONF and RESTCONF were getting popular for network configuration, SNMP was still the only real option to get monitoring data from network devices. Despite NETCONF (<a href="https://www.rfc-editor.org/rfc/rfc5277.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 5277</a>) and RESTCONF (<a href="https://datatracker.ietf.org/doc/rfc8650/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 8650</a>) being extended to support push/notifications, the former never took off, and the latter is still too young to tell.</p>

<p>Streaming telemetry (also called push, subscribe, or notification) is a feature aimed at replacing SNMP on the monitoring side. Instead of regularly polling a device like SNMP-based monitoring does, a client subscribes to specific items/data on the device. Through a long-lived client/server session, the device then sends updated metrics when needed. This NANOG <a href="https://www.youtube.com/watch?v=McNm_WfQTHw" class="remarkup-link remarkup-link-ext" rel="noreferrer">presentation</a> gives some examples of why it’s better than SNMP.</p>

<p>gNMI (gRPC Network Management Interface), built by Google on top of <a href="https://en.wikipedia.org/wiki/GRPC" class="remarkup-link remarkup-link-ext" rel="noreferrer">gRPC</a>, is getting some traction in the industry for that purpose mostly due to its speed compared to the alternatives. That said, gNMI also supports regular “get” and “set” methods as defined in its <a href="https://github.com/openconfig/reference/blob/master/rpc/gnmi/gnmi-specification.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">specification document</a> (there is no RFC). It also got further extended to support operational commands (restart, clear neighbors, manage certs, etc) through what’s called <a href="https://github.com/openconfig/gnoi" class="remarkup-link remarkup-link-ext" rel="noreferrer">gNOI</a> (gRPC Network Operations Interface).</p>

<p>gNMI, in short:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Transport: gRPC (HTTP/2)</li>
<li class="remarkup-list-item">Encryption: handled by the transport layer (TLS)</li>
<li class="remarkup-list-item">Message encoding: JSON, Protobuf</li>
<li class="remarkup-list-item">Data model: YANG*</li>
<li class="remarkup-list-item">Supports atomic changes (apply a set of changes or nothing, also called transactions)</li>
</ul>

<p>(*More on YANG later)</p>

<p>gNMI being a Google creation, most libraries and tools revolving around this protocol are written in Go. <a href="https://github.com/openconfig/gnmi" class="remarkup-link remarkup-link-ext" rel="noreferrer">The client library</a>, the standalone client <a href="https://github.com/google/gnxi" class="remarkup-link remarkup-link-ext" rel="noreferrer">tools</a> and the <a href="https://github.com/openconfig/gnmic" class="remarkup-link remarkup-link-ext" rel="noreferrer">swiss army knife (gNMIc)</a>. If there is a single tool to learn, it&#039;s gNMIc as it can do pretty much everything gNMI related and <a href="https://www.youtube.com/watch?v=v3CL2vrGD_8" class="remarkup-link remarkup-link-ext" rel="noreferrer">much more</a>.<br />
The Python ecosystem, on the other hand, is fairly dire. First, any project needs to convert the gNMI Protobuf specs into Python libraries, or use the pre-made ones available from <a href="https://github.com/openconfig/gnmi/tree/master/proto" class="remarkup-link remarkup-link-ext" rel="noreferrer">upstream</a>; managing multiple versions of the spec could also be a challenge.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://github.com/akarneliuk/pygnmi/" class="remarkup-link remarkup-link-ext" rel="noreferrer">pygnmi</a> seems the most interesting as it’s feature rich and actively-ish developed. gNOI is however <a href="https://github.com/akarneliuk/pygnmi/issues/109" class="remarkup-link remarkup-link-ext" rel="noreferrer">not</a> supported.</li>
<li class="remarkup-list-item"><a href="https://github.com/python-gnxi/python-gnmi-proto" class="remarkup-link remarkup-link-ext" rel="noreferrer">python-gnmi-proto</a> (hasn&#039;t been updated since 2021) takes a different approach: a different gRPC library and less abstraction (the user handles the gRPC calls directly)</li>
</ul>

<p>As gNMI is multi-platform, we can also look at various vendors’ PoCs, with the risk of them becoming platform-specific in the future.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Arista’s <a href="https://github.com/arista-northwest/gnmi-py" class="remarkup-link remarkup-link-ext" rel="noreferrer">gnmi-py</a> doesn’t support the “set” operation</li>
<li class="remarkup-list-item">Cisco’s <a href="https://github.com/cisco-ie/cisco-gnmi-python" class="remarkup-link remarkup-link-ext" rel="noreferrer">cisco-gnmi-python</a> seems feature rich as well.</li>
</ul>



<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Querying a gNMI device using gNMIc</div><pre class="remarkup-code"><span class="gp">$ gnmic -a lsw1-e8-eqiad.mgmt.eqiad.wmnet:8080 --username admin --password Wikimedia capabilities</span>
<span class="go">gNMI version: 0.7.0</span>
<span class="go">supported models:</span>
<span class="go">[...] # Mix of OpenConfig and SONiC specific YANG models</span>
<span class="go">supported encodings:</span>
<span class="go">  - JSON</span>
<span class="go">  - JSON_IETF</span>
<span class="go">  - PROTO</span></pre></div>



<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Querying a gNMI device using pyGNMI</div><pre class="remarkup-code"><span class="kn">from</span> <span class="nn">pygnmi.client</span> <span class="kn">import</span> <span class="n">gNMIclient</span>
<span class="n">host</span> <span class="o">=</span> <span class="p">(</span><span class="s">&#039;lsw1-e8-eqiad.mgmt.eqiad.wmnet&#039;</span><span class="p">,</span> <span class="s">&#039;8080&#039;</span><span class="p">)</span>
<span class="k">with</span> <span class="n">gNMIclient</span><span class="p">(</span><span class="n">target</span><span class="o">=</span><span class="n">host</span><span class="p">,</span> <span class="n">username</span><span class="o">=</span><span class="s">&#039;admin&#039;</span><span class="p">,</span> <span class="n">password</span><span class="o">=</span><span class="s">&#039;Wikimedia&#039;</span><span class="p">,</span> <span class="n">debug</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="k">as</span> <span class="n">gc</span><span class="p">:</span>
 	<span class="n">result</span> <span class="o">=</span> <span class="n">gc</span><span class="o">.</span><span class="n">capabilities</span><span class="p">()</span>
 	<span class="k">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span></pre></div>



<h4 class="remarkup-header">YANG</h4>

<p><a href="https://en.wikipedia.org/wiki/YANG" class="remarkup-link remarkup-link-ext" rel="noreferrer">YANG</a> is a standardized (<a href="https://datatracker.ietf.org/doc/html/rfc7950" class="remarkup-link remarkup-link-ext" rel="noreferrer">RFC 7950</a>) way to structure both configuration and operational (metrics) information, a bit like a DB schema. As with SMI (SNMP MIBs), it’s possible to define dependencies between modules, forming a tree. Even though the data model structure is standardized, there are both vendor-specific and vendor-agnostic modules. Many of those models are available on the <a href="https://github.com/YangModels/yang" class="remarkup-link remarkup-link-ext" rel="noreferrer">YangModels</a> GitHub. The <a href="https://github.com/openconfig/public" class="remarkup-link remarkup-link-ext" rel="noreferrer">OpenConfig</a> models are the most notable vendor-neutral ones, but it’s of course up to the vendors to support them. <br />
At least <strong>Dell&#039;s SONiC aims at supporting OpenConfig</strong>. However, using a mix of vendor-agnostic and vendor-specific models is often required to fully manage a given device, as OpenConfig only covers common features.</p>

<div class="remarkup-code-block" data-code-lang="devicetree" data-sigil="remarkup-code-block"><div class="remarkup-code-header">example yang module in tree view (filtered on config elements for NTP only)</div><pre class="remarkup-code" style=" max-height: 20em; overflow: auto;"><span></span><span class="o">+--</span><span class="na">rw</span> <span class="na">system</span>
   <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">enabled</span><span class="o">?</span>              <span class="na">boolean</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">source</span><span class="o">-</span><span class="na">address</span><span class="o">?</span>   <span class="nl">oc-inet</span><span class="p">:</span><span class="na">ip</span><span class="o">-</span><span class="na">address</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">enable</span><span class="o">-</span><span class="na">ntp</span><span class="o">-</span><span class="na">auth</span><span class="o">?</span>      <span class="na">boolean</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">trusted</span><span class="o">-</span><span class="na">key</span><span class="o">*</span>          <span class="na">uint16</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">source</span><span class="o">-</span><span class="na">interface</span><span class="o">*</span>     <span class="nl">oc-if</span><span class="p">:</span><span class="na">base</span><span class="o">-</span><span class="na">interface</span><span class="o">-</span><span class="na">ref</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">network</span><span class="o">-</span><span class="na">instance</span><span class="o">?</span>     <span class="o">-&gt;</span> <span class="o">/</span><span class="nl">oc-ni</span><span class="p">:</span><span class="na">network</span><span class="o">-</span><span class="na">instances</span><span class="o">/</span><span class="na">network</span><span class="o">-</span><span class="na">instance</span><span class="o">/</span><span class="kr">name</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">keys</span>
      <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">ntp</span><span class="o">-</span><span class="na">key</span><span class="o">*</span> <span class="p">[</span><span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="p">]</span>
      <span class="o">|</span>     <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span>    <span class="o">-&gt;</span> <span class="p">..</span><span class="o">/</span><span class="na">config</span><span class="o">/</span><span class="na">key</span><span class="o">-</span><span class="na">id</span>
      <span class="o">|</span>     <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="o">?</span>      <span class="na">uint16</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">type</span><span class="o">?</span>    <span class="na">identityref</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">value</span><span class="o">?</span>   <span class="na">string</span>
      <span class="o">|</span>     <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">encrypted</span><span class="o">?</span>   <span class="na">boolean</span>
      <span class="o">+--</span><span class="na">rw</span> <span class="na">servers</span>
         <span class="o">+--</span><span class="na">rw</span> <span class="na">server</span><span class="o">*</span> <span class="p">[</span><span class="na">address</span><span class="p">]</span>
            <span class="o">+--</span><span class="na">rw</span> <span class="na">address</span>    <span class="o">-&gt;</span> <span class="p">..</span><span class="o">/</span><span class="na">config</span><span class="o">/</span><span class="na">address</span>
            <span class="o">+--</span><span class="na">rw</span> <span class="na">config</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">address</span><span class="o">?</span>            <span class="nl">oc-inet</span><span class="p">:</span><span class="na">host</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">port</span><span class="o">?</span>               <span class="nl">oc-inet</span><span class="p">:</span><span class="na">port</span><span class="o">-</span><span class="na">number</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">version</span><span class="o">?</span>            <span class="na">uint8</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">association</span><span class="o">-</span><span class="na">type</span><span class="o">?</span>   <span class="na">enumeration</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">iburst</span><span class="o">?</span>             <span class="na">boolean</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">prefer</span><span class="o">?</span>             <span class="na">boolean</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">key</span><span class="o">-</span><span class="na">id</span><span class="o">?</span>             <span class="na">uint16</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">minpoll</span><span class="o">?</span>            <span class="na">uint8</span>
            <span class="o">|</span>  <span class="o">+--</span><span class="na">rw</span> <span class="na">maxpoll</span><span class="o">?</span>            <span class="na">uint8</span></pre></div>
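<p>Concretely, an instance of that NTP subtree is just structured data. A sketch of the Python dictionary a client could carry in a gNMI Set or a RESTCONF request (the server address and options are made up for the example):</p>

```python
# An illustrative instance of the openconfig-system NTP subtree shown
# above, as the Python dictionary a gNMI Set (or RESTCONF PUT) would
# carry.  The server address and options are made up for the example.
import json

ntp_config = {
    "openconfig-system:ntp": {
        "config": {"enabled": True},
        "servers": {
            "server": [
                {
                    "address": "ntp1.example.wmnet",  # hypothetical server
                    "config": {
                        "address": "ntp1.example.wmnet",
                        "version": 4,
                        "iburst": True,
                        "prefer": False,
                    },
                }
            ]
        },
    }
}

# JSON-encode the payload, as the protocols expect (RFC 7951 style).
payload = json.dumps(ntp_config)
```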

<p>One important aspect is how to efficiently use those data models in our Python-based automation world. Manually crafting Python structures to match the expected formats didn&#039;t seem appealing to me, even though that&#039;s how at least some other projects (<a href="https://github.com/aerleon/aerleon/blob/main/aerleon/lib/openconfig.py" class="remarkup-link remarkup-link-ext" rel="noreferrer">Aerleon</a>, <a href="https://github.com/saltstack/salt/blob/4f027308f87e8f8350775a2553e8cd4adab193a0/salt/modules/netbox.py#L556" class="remarkup-link remarkup-link-ext" rel="noreferrer">Salt</a>, <a href="https://github.com/ansible-collections/community.network/blob/312b5d87cbb73f17acd4d254889145228d35fb19/plugins/module_utils/network/exos/config/vlans/vlans.py#L25" class="remarkup-link remarkup-link-ext" rel="noreferrer">Ansible</a>) do it.<br />
One direction I investigated is the possibility of generating Python bindings from YANG models, which would allow us to manipulate the data as Python objects. The hope is, for example, to get type checking, dataset comparison and IDE auto-completion for free.</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://github.com/robshakir/pyangbind" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pyangbind</a>, plugin for pyang, <del><a href="https://github.com/robshakir/pyangbind/issues/292" class="remarkup-link remarkup-link-ext" rel="noreferrer">abandoned</a></del> (maybe coming back to life?)</li>
<li class="remarkup-list-item">Cisco’s <a href="https://github.com/CiscoDevNet/ydk-gen" class="remarkup-link remarkup-link-ext" rel="noreferrer">YDK</a> is actively maintained but complex to set up; furthermore, it requires the whole SDK to be included in any application that wants to use those bindings</li>
<li class="remarkup-list-item">For RESTCONF OpenAPI servers (built based on YANG data) it’s possible to use <a href="https://github.com/openapi-generators/openapi-python-client" class="remarkup-link remarkup-link-ext" rel="noreferrer">openapi-python-client</a> and in some way reverse-engineer the YANG models… not optimal</li>
<li class="remarkup-list-item"><a href="https://github.com/karlnewell/pyang-pydantic" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pyang-pydantic</a>, another pyang plugin to generate pydantic models from YANG models.</li>
<li class="remarkup-list-item"><a href="https://pydantify.github.io/pydantify/" class="remarkup-link remarkup-link-ext" rel="noreferrer">pydantify</a> (relevant <a href="https://eprints.ost.ch/id/eprint/1089/" class="remarkup-link remarkup-link-ext" rel="noreferrer">paper</a>), more recent but experimental. Still the most promising option yet too young for my use-cases<ul class="remarkup-list">
<li class="remarkup-list-item">In addition to using it as a configuration builder, it could potentially also be used as a syntax checker or a way of showing differences between two configurations (eg. candidate and running)</li>
</ul></li>
</ul>
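<p>To make the appeal of such bindings concrete, here is a hand-written sketch (using stdlib dataclasses, with hypothetical class names) of what generated YANG-to-Python bindings could look like for the NTP model above. Real generators like pyangbind or pydantify emit much richer classes:</p>

```python
# Hand-written sketch of what YANG-to-Python bindings could provide:
# typed fields, validation and IDE auto-completion instead of
# free-form dictionaries.  Class names are hypothetical; real
# generators (pyangbind, pydantify) emit richer classes.
from dataclasses import asdict, dataclass, field


@dataclass
class NtpServerConfig:
    address: str
    version: int = 4
    iburst: bool = False
    prefer: bool = False

    def __post_init__(self) -> None:
        # The YANG model types 'version' as a uint8.
        if not 0 <= self.version <= 255:
            raise ValueError("version must fit in a uint8")


@dataclass
class Ntp:
    enabled: bool = True
    servers: list = field(default_factory=list)

    def to_payload(self) -> dict:
        """Serialize back to the dict shape the device expects."""
        return {
            "config": {"enabled": self.enabled},
            "servers": {
                "server": [
                    {"address": s.address, "config": asdict(s)}
                    for s in self.servers
                ]
            },
        }
```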

<p>Much easier in Go, where <a href="https://github.com/openconfig/ygot" class="remarkup-link remarkup-link-ext" rel="noreferrer">tools</a> exist to convert such models directly to Go bindings or Protobuf schemas. But wait! It’s <a href="https://karneliuk.com/2020/05/gnmi-part-3-using-grpc-to-collect-data-in-openconfig-yang-from-arista-eos-and-nokia-sr-os/" class="remarkup-link remarkup-link-ext" rel="noreferrer">possible</a> to convert gNMI Protobuf schemas to Python objects, a path I didn&#039;t explore as it risked being too convoluted for our use cases.</p>

<h2 class="remarkup-header">The plan</h2>

<p>Now that we have an overview of the modern and less modern ways of interacting with network devices, here is the current plan.</p>

<h3 class="remarkup-header">NETCONF vs. gNMI</h3>

<p>To start with, we need something that works with both Juniper and SONiC if we hope for &quot;one protocol to rule them all&quot;. Thus, NETCONF is out of the game, as it is not supported by SONiC.</p>

<p>RESTCONF&#039;s main advantage is its easier handling (being HTTP-based). On the other hand, gNMI is faster and a “two in one” solution, as it handles both monitoring and configuration.</p>

<p><span class="remarkup-highlight">After testing both, gNMI seems the best bet forward to me as its only downside (apparent opacity) is counterbalanced by good tools and libraries (gnmic, grpcio).</span></p>

<p>It is unlikely that one or the other becomes obsolete anytime soon, and as long as the data models (YANG) don&#039;t, moving from one transport to the other would only mean switching tools (easier said than done, but much better than changing data models). gNMI is also supported by all major vendors (Cisco, Arista, etc.), so we don&#039;t get vendor lock-in by using it.</p>

<h3 class="remarkup-header">Authentication</h3>

<p>One thing that is sure though, is that <strong>both RESTCONF and gNMI require a good PKI infrastructure</strong> as they both require TLS. <a href="/T334594" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_9"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T334594: TLS certificates for network devices</span></span></a> is the first building block for our next-gen automation.</p>

<p>Then comes authentication. Until now we have done everything over SSH, both for CLI and NETCONF. Unfortunately, RESTCONF and gNMI both require a TLS-based authentication mechanism.<br />
While SONiC supports both basic HTTP authentication (regular username/password) and client certificates (username in the certificate&#039;s CN field), Junos only supports basic HTTP (the certificate is only an additional layer of security, but <a href="https://www.juniper.net/documentation/us/en/software/junos/grpc-network-services/topics/topic-map/grpc-services-configuring.html#task-configure-mutual-authentication-for-grpc" class="remarkup-link remarkup-link-ext" rel="noreferrer">doesn&#039;t authenticate the user itself</a>). This means that we have to implement a mechanism to define passwords for <em>at least</em> the users that will use the API (eg. Homer).<br />
On top of that, SONiC&#039;s support of users through the API is still limited (it doesn&#039;t handle SSH keys and doesn&#039;t expose hashed passwords).<br />
<strong>For both of those reasons, <a href="/T338028" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_10"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T338028: Users management on SONiC</span></span></a> is the next critical stepping stone to tackle</strong>; multiple paths and proposals are being discussed in the task.</p>

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> The points above are not hard blockers for tests and to start working on the automation itself, but are required for any production deployment.</div>



<h3 class="remarkup-header">Automation</h3>

<p>Only now comes the important part of the topic, our automation.</p>

<p>Even though gNMI might make Homer fast enough to obsolete some of the network-related cookbooks, the workflows of the two tools are distinct enough not to make this a short-term goal. Keeping both tools separate will also make the transition easier. This will of course bring the risk of duplicated code, like we currently have for Juniper.</p>

<p>The low number of options makes the choice of Python library easier: <span class="remarkup-highlight">pyGNMI</span> does the job well.</p>

<p>Unfortunately the various YANG-to-Python bindings libraries are not ripe enough for prime time, which means <span class="remarkup-highlight">we will have to rely on Python dictionary structures</span>. Those are not <em>that</em> bad once we&#039;re familiar with them, but we should keep an eye on Pydantify (especially the Pydantic 2 <a href="https://github.com/pydantify/pydantify/issues/14" class="remarkup-link remarkup-link-ext" rel="noreferrer">upgrade</a>).</p>

<p>Once we have the data structures, we need to be able to compare them. So far this process was offloaded to Juniper&#039;s OS: send the new data, ask for a diff, commit if fine. This is not strictly needed for simple actions, like configuring a single switch interface, but the more complex the change, the more important it is to catch mistakes before it&#039;s too late. A basic implementation could rely on existing libraries such as dictdiffer or deepdiff, but also on the pyGNMI diff_openconfig feature once some of its bugs have been fixed (see my <a href="https://github.com/akarneliuk/pygnmi/pull/122" class="remarkup-link remarkup-link-ext" rel="noreferrer">couple</a> <a href="https://github.com/akarneliuk/pygnmi/pull/123" class="remarkup-link remarkup-link-ext" rel="noreferrer">PRs</a> and <a href="https://github.com/akarneliuk/pygnmi/issues/124" class="remarkup-link remarkup-link-ext" rel="noreferrer">some</a> <a href="https://github.com/akarneliuk/pygnmi/issues/125" class="remarkup-link remarkup-link-ext" rel="noreferrer">issues</a>).</p>
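<p>As a minimal, hand-rolled sketch of what such a comparison involves (an illustration of the idea, not the dictdiffer or deepdiff API): a recursive walk over the running and candidate configuration dictionaries, reporting what was added, removed or changed:</p>

```python
# Minimal, hand-rolled sketch of a nested-dict comparison, the kind
# of diff dictdiffer or deepdiff provide out of the box: walk the
# running and candidate configs and report additions, removals and
# changed values along with their paths.
def diff_config(running: dict, candidate: dict, path: str = "") -> list:
    changes = []
    for key in sorted(running.keys() | candidate.keys()):
        here = f"{path}/{key}"
        if key not in candidate:
            changes.append(("remove", here, running[key]))
        elif key not in running:
            changes.append(("add", here, candidate[key]))
        elif isinstance(running[key], dict) and isinstance(candidate[key], dict):
            changes.extend(diff_config(running[key], candidate[key], here))
        elif running[key] != candidate[key]:
            changes.append(("change", here, (running[key], candidate[key])))
    return changes
```

<p>For example, diffing a running config where NTP is disabled against a candidate where it is enabled reports a single change at <tt class="remarkup-monospaced">/ntp/config/enabled</tt>, which can then be shown to the operator before committing.</p>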

<h4 class="remarkup-header">Cookbooks</h4>

<p>An initial proof-of-concept approach is available on Gerrit (<a href="https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/924896" class="remarkup-link remarkup-link-ext" rel="noreferrer">CR924896</a>). It shows that the approach works, but a few things are needed to make it production-grade:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="/T340045" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_11"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T340045: Package pyGNMI and dictdiffer to be used by cookbooks</span></span></a></li>
<li class="remarkup-list-item">Migrate the OpenConfig/gNMI cookbook functions to Spicerack (not needed on day 1)</li>
<li class="remarkup-list-item">Implement a diff feature as mentioned above</li>
</ul>

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> At this point we could also look at migrating some of the existing Juniper &quot;read only&quot; cookbook functions to gNMI. Especially if they follow the OpenConfig model.</div>



<h4 class="remarkup-header">Homer</h4>

<p>The initial scaffolding to support gNMI has been done (see <a href="https://gerrit.wikimedia.org/r/c/operations/software/homer/+/927736" class="remarkup-link remarkup-link-ext" rel="noreferrer">CR927736</a>). Some adjustments are needed but its logic has been validated.<br />
One point not yet cleared up is how to run it from our own laptops. The current Homer/Junos/NETCONF setup leverages SSH and is thus able to automatically use our jump-hosts. HTTP&#039;s equivalent is SOCKS5 (see for example <a href="/T319426" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_12"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T319426: [cookbooks] Add ssh socks5 proxy support</span></span></a>), but <span class="remarkup-highlight">its support in gRPC is unlikely to happen anytime soon</span> (see <a href="https://github.com/grpc/grpc/issues/30347" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://github.com/grpc/grpc/issues/30347</a>).</p>

<p>The next steps are to iteratively add support for the various SONiC configuration elements, one after the other. The easiest approach is to manually configure a device, fetch its configuration in the OpenConfig format, then &quot;templatize&quot; it, starting with the easy bits and re-using the diff feature. This is the approach we took when working on Juniper devices.</p>
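<p>A toy sketch of that &quot;fetch, then templatize&quot; workflow; the OpenConfig fragment and the template variables are made up for the example (a real implementation would use Jinja2 and the device&#039;s actual output):</p>

```python
# Toy sketch of the "fetch the device config, then templatize it"
# workflow.  The OpenConfig fragment and template variables are made
# up for the example; a real implementation would use Jinja2 and the
# device's actual output.
import json
from string import Template

# Step 1: a fragment as fetched from a manually configured device.
fetched = {
    "openconfig-interfaces:interface": [
        {"name": "Ethernet0", "config": {"description": "uplink", "mtu": 9192}}
    ]
}

# Step 2: the same fragment, with the site-specific values turned
# into template variables by hand.
template = Template(json.dumps({
    "openconfig-interfaces:interface": [
        {"name": "$ifname", "config": {"description": "$descr", "mtu": 9192}}
    ]
}))

# Step 3: render the template for another interface.
rendered = json.loads(template.substitute(ifname="Ethernet4", descr="server-port"))
```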

<p>While this goal progresses, <span class="remarkup-highlight">we will benefit from transitioning more data from YAML files to Netbox</span>, thanks to <a href="/T336275" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_13"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T336275: Upgrade Netbox to 4.x</span></span></a> and <a href="/T305126" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_14"><span class="phui-tag-core phui-tag-color-object">T305126: Make more extensive use of Netbox custom fields</span></a>.</p>

<p>The last special bit is <del>Capirca</del> <a href="https://github.com/aerleon/aerleon" class="remarkup-link remarkup-link-ext" rel="noreferrer">Aerleon</a>, the ACL-generation library. Despite claiming to support OpenConfig ACLs, some features are missing and the output format doesn&#039;t fit the OpenConfig YANG model. <span class="remarkup-highlight">My pull requests to fix that are either merged or being reviewed</span> (<a href="https://github.com/aerleon/aerleon/pull/311" class="remarkup-link remarkup-link-ext" rel="noreferrer">#311</a>, <a href="https://github.com/aerleon/aerleon/pull/312" class="remarkup-link remarkup-link-ext" rel="noreferrer">#312</a>, <a href="https://github.com/aerleon/aerleon/pull/313" class="remarkup-link remarkup-link-ext" rel="noreferrer">#313</a>).</p>

<h3 class="remarkup-header">Monitoring</h3>

<p>Still an area to explore, and less urgent for us. gNMIc as a Prometheus <a href="https://gnmic.openconfig.net/user_guide/outputs/prometheus_output/" class="remarkup-link remarkup-link-ext" rel="noreferrer">intermediary</a> is a promising option.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>This is not an easy path, but the way forward is relatively clear at this point. More issues will undoubtedly show up as we progress. Achieving this goal will bring three key benefits:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Unified (and modern!) protocol for configuration and monitoring</li>
<li class="remarkup-list-item">Multi-vendor support</li>
<li class="remarkup-list-item">Faster operations</li>
</ul>

<p><em>Illustration photo by Benjamin Elliott on <a href="https://unsplash.com/fr/photos/vc9u77c0LO4" class="remarkup-link remarkup-link-ext" rel="noreferrer">Unsplash</a></em></p></div></content></entry><entry><title>Netbox news</title><link href="/phame/live/17/post/289/netbox_news/" /><id>https://phabricator.wikimedia.org/phame/post/view/289/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2022-06-28T11:27:43+00:00</published><updated>2022-06-28T13:32:06+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p><a href="https://github.com/netbox-community/netbox/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox</a> is a tool used by all SREs, either directly or abstracted through cookbooks and various scripts. Managed by <a href="/tag/infrastructure-foundations/" class="phui-tag-view phui-tag-type-shade phui-tag-violet phui-tag-shade phui-tag-icon-view " data-sigil="hovercard" data-meta="0_26"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-users" data-meta="0_25" aria-hidden="true"></span>Infrastructure-Foundations</span></a>, it went through a major (and much needed!) upgrade this past quarter, led by John Bond, myself and with the help of Riccardo.</p>

<p>For historical context, around the release of Netbox 2.10 the project was forked, creating a new project, Nautobot. Netbox 2.10.4 was the last version compatible with both Netbox and the new fork, so we remained on it until we could evaluate our needs going forward. After discussions, we decided to stay on Netbox (see why).</p>

<p>Here is a rundown of the current and future improvements. This work was tracked in <a href="/T296452" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_15"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T296452: Upgrade Netbox to 3.2</span></span></a></p>

<h3 class="remarkup-header">Infrastructure</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">100% on bullseye servers (it used to be on buster)</li>
<li class="remarkup-list-item">Behind our CDNs</li>
<li class="remarkup-list-item">Active/passive frontends</li>
<li class="remarkup-list-item">Separate internal vs. external endpoints (creation of netbox.discovery.wmnet)</li>
<li class="remarkup-list-item">Documentation fully refactored and cleaned up: have a look at <a href="https://wikitech.wikimedia.org/wiki/Netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">Netbox - Wikitech</a></li>
<li class="remarkup-list-item">New “single pane of glass” Grafana dashboard for Netbox health monitoring <a href="https://grafana.wikimedia.org/d/DvXT6LCnk/netbox" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://grafana.wikimedia.org/d/DvXT6LCnk/netbox</a><ul class="remarkup-list">
<li class="remarkup-list-item">With the addition of django monitoring - <a href="https://phabricator.wikimedia.org/T243928" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_16"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T243928</span></span></a></li>
<li class="remarkup-list-item">This led to the improvement of the existing Postgres dashboard - <a href="https://grafana.wikimedia.org/d/000000469/postgres" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://grafana.wikimedia.org/d/000000469/postgres</a></li>
<li class="remarkup-list-item">Leverages the new prometheus::blackbox::check::http created by Filippo and John</li>
</ul></li>
</ul>

<p>All the above improvements will contribute to having a rock solid source of truth as we come to rely on it more and more across SRE.</p>

<h3 class="remarkup-header">New changes already visible and used in prod</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">Refreshed UI<ul class="remarkup-list">
<li class="remarkup-list-item">Including a dark mode</li>
<li class="remarkup-list-item">Better filtering on pretty much all the pages</li>
</ul></li>
<li class="remarkup-list-item">Group support for Ganeti clusters - <a href="https://phabricator.wikimedia.org/T262446" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_17"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T262446</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Helps model our infrastructure better,</li>
<li class="remarkup-list-item">Ties in John’s work to expose Netbox data in Puppet - see <a href="https://phabricator.wikimedia.org/T229397" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_18"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T229397</span></span></a>,</li>
<li class="remarkup-list-item">Will allow hosts that rely on row/rack data to not need hardcoded values anymore (eg. kubernetes)</li>
</ul></li>
<li class="remarkup-list-item">Improved reports (reports are automated functions that alert on data inconsistencies based on our own rules and conventions):<ul class="remarkup-list">
<li class="remarkup-list-item">Network interface MTU misconfigurations</li>
<li class="remarkup-list-item">Better error logging for reports</li>
</ul></li>
<li class="remarkup-list-item">End-to-end path tracing now traverses circuits<ul class="remarkup-list">
<li class="remarkup-list-item">This allows us to see exactly where a given network interface leads, for example <a href="https://netbox.wikimedia.org/dcim/interfaces/21226/trace/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://netbox.wikimedia.org/dcim/interfaces/21226/trace/</a></li>
</ul></li>
<li class="remarkup-list-item">Improved contact management<ul class="remarkup-list">
<li class="remarkup-list-item">It is now possible to clearly document the NOC escalation order, for example <a href="https://netbox.wikimedia.org/circuits/providers/64/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://netbox.wikimedia.org/circuits/providers/64/</a></li>
</ul></li>
<li class="remarkup-list-item">AS numbers model<ul class="remarkup-list">
<li class="remarkup-list-item">Helps centralize data in our single source of truth (before, we had to maintain a dedicated wiki page <a href="https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations</a>)</li>
<li class="remarkup-list-item">This will help in future network automation efforts, see <a href="https://phabricator.wikimedia.org/T305126#7941476" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_19"><span class="phui-tag-core phui-tag-color-object">T305126#7941476</span></a></li>
</ul></li>
<li class="remarkup-list-item">Improved custom fields support (custom fields are a way to extend the default models to store WMF-specific data, for example the procurement task on servers)<ul class="remarkup-list">
<li class="remarkup-list-item">Extended to most of the models (for example, adding a purchase date to an inventory item)</li>
<li class="remarkup-list-item">Objects can now be used as custom “fields” (for example, linking a row to a Ganeti cluster)</li>
</ul></li>
</ul>

<hr class="remarkup-hr" />

<p>While we’ve been busy with the above, there is much more to evaluate and possibly implement. A complete list is available at <a href="https://wikitech.wikimedia.org/wiki/Netbox#Future_improvements" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/Netbox#Future_improvements</a>; many of them are good first bugs if you’re interested in learning more and contributing to our setup.</p>

<p>Here are some highlights:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">GraphQL API - T310577<ul class="remarkup-list">
<li class="remarkup-list-item">Some scripts are well known for their slowness (eg. the DNS cookbook); GraphQL should speed them up</li>
</ul></li>
<li class="remarkup-list-item">Basic change rollback - <a href="https://phabricator.wikimedia.org/T310589" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_20"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T310589</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">If deemed viable, this will help quickly rollback accidental edits and deletions</li>
</ul></li>
<li class="remarkup-list-item">Use Custom Model Validation - <a href="https://phabricator.wikimedia.org/T310590" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_21"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T310590</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Long requested by DC-Ops, this will help catch entry mistakes (eg. typos) before they happen (rather than waiting for a report to trigger)</li>
</ul></li>
<li class="remarkup-list-item">Make more extensive use of Netbox custom fields - <a href="https://phabricator.wikimedia.org/T305126" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_22"><span class="phui-tag-core phui-tag-color-object">T305126</span></a><ul class="remarkup-list">
<li class="remarkup-list-item">This will allow us to move data from YAML files and free-form “description” fields to structured Netbox fields</li>
</ul></li>
<li class="remarkup-list-item">Represent sub-interface and bridge device associations in Netbox - <a href="https://phabricator.wikimedia.org/T296832" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_23"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T296832</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">This will allow us to document some edge cases in server configuration,</li>
<li class="remarkup-list-item">And to model our network devices better,</li>
<li class="remarkup-list-item">Both of which will help us improve our automation</li>
</ul></li>
<li class="remarkup-list-item">Using a central Redis instance - <a href="https://phabricator.wikimedia.org/T311385" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_24"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T311385</span></span></a><ul class="remarkup-list">
<li class="remarkup-list-item">Prerequisite for active/active frontends (faster and more reliable)</li>
</ul></li>
</ul>
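<p>To give a taste of the GraphQL API mentioned above: instead of several REST round trips, a script could fetch a device together with its related objects in a single query. The sketch below is illustrative only; the device name is made up and the exact field names depend on the Netbox version’s GraphQL schema:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">query {
  device_list(name: "ganeti1001") {   # hypothetical device name
    name
    rack { name }
    interfaces { name }
  }
}</pre></div>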

<p>We hope that this background work will improve your SRE experience by making this abstraction of our real-life infrastructure faster, more complete and more reliable.</p>

<p>As usual, don’t hesitate to reach out to <a href="/tag/infrastructure-foundations/" class="phui-tag-view phui-tag-type-shade phui-tag-violet phui-tag-shade phui-tag-icon-view " data-sigil="hovercard" data-meta="0_28"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-users" data-meta="0_27" aria-hidden="true"></span>Infrastructure-Foundations</span></a> for any issues, requests or suggestions.</p>

<hr class="remarkup-hr" />

<div class="remarkup-note"><span class="remarkup-note-word">NOTE:</span> The following text was sent to ops-private@, saving it here for public archival</div>

<p><em>Header image from <a href="https://unsplash.com/photos/mTkXSSScrzw" class="remarkup-link remarkup-link-ext" rel="noreferrer">Unsplash</a></em></p></div></content></entry><entry><title>Internal anycast</title><link href="/phame/live/17/post/190/internal_anycast/" /><id>https://phabricator.wikimedia.org/phame/post/view/190/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2020-08-07T09:48:06+00:00</published><updated>2020-08-07T18:25:31+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>This project brought two major changes to our infrastructure. Firstly, servers that used to be fronted by LVS for load balancing are now peering directly with our routers. Secondly, we have started using IP anycast for a highly critical service: recursive DNS.</p>

<h2 class="remarkup-header">Load balancing</h2>

<p>At the infrastructure level, load balancing means sending client requests to more than one backend server. There are many different ways to achieve this, each with its own advantages and drawbacks.</p>

<p>Any user accessing Wikimedia’s websites will go through the following two layers.</p>

<h3 class="remarkup-header">GeoDNS</h3>

<ol class="remarkup-list">
<li class="remarkup-list-item">A client asks our DNS for the IP of a given service (eg. <a href="https://www.wikipedia.org" class="remarkup-link remarkup-link-ext" rel="noreferrer">www.wikipedia.org</a>)</li>
<li class="remarkup-list-item">Our authoritative DNS server looks up the client IP using an IP to geolocation database (in our case MaxMind), which in turn gives a rough idea of this IP’s location (country or state)</li>
<li class="remarkup-list-item">Finally, our DNS server checks our <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/master/geo-maps" class="remarkup-link remarkup-link-ext" rel="noreferrer">manually curated list</a> of “location-POP mapping”, and replies with the IP of the nearest POP</li>
</ol>

<p>As a result, depending on their estimated location, users are balanced across different caching POPs.</p>
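<p>The per-location decision described in steps 2 and 3 can be sketched in a few lines of Python (a toy illustration, not our actual DNS server; the location keys and the default POP are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">GEO_MAP = {
    "US-CA": "ulsfo",     # San Francisco area is mapped to the San Francisco POP
    "SG": "eqsin",        # Singapore is mapped to the Singapore POP
}
POP_IPS = {
    "ulsfo": "198.35.26.96",
    "eqsin": "103.102.166.224",
}

def resolve(client_location, default_pop="ulsfo"):
    """Return the service IP of the POP mapped to the client's coarse location."""
    pop = GEO_MAP.get(client_location, default_pop)
    return POP_IPS[pop]

print(resolve("SG"))   # prints 103.102.166.224</pre></div>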

<p>See, for example, in San Francisco and Singapore:</p>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="gp">$ host www.wikipedia.org</span>
<span class="go">www.wikipedia.org is an alias for dyna.wikimedia.org.</span>
<span class="go">dyna.wikimedia.org has address 198.35.26.96</span>
<span class="go">dyna.wikimedia.org has IPv6 address 2620:0:863:ed1a::1</span></pre></div>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="gp">$ host www.wikipedia.org</span>
<span class="go">www.wikipedia.org is an alias for dyna.wikimedia.org.</span>
<span class="go">dyna.wikimedia.org has address 103.102.166.224</span>
<span class="go">dyna.wikimedia.org has IPv6 address 2001:df2:e500:ed1a::1</span></pre></div>



<h3 class="remarkup-header">LVS</h3>

<p>To reach those IPs (eg. 2620:0:863:ed1a::1 or 2001:df2:e500:ed1a::1), users’ requests will cross the Internet and eventually hit our routers and our Linux load-balancer: the <a href="https://wikitech.wikimedia.org/wiki/LVS" class="remarkup-link remarkup-link-ext" rel="noreferrer">Linux Virtual Server</a>.</p>

<p>LVS peers with our routers using BGP to advertise (“claim”) those specific IPs, and forwards inbound traffic towards them to a pool of backend servers (called “origin servers”). Decisions about which server to forward the traffic to are made based on:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Their administrative state: pooled/de-pooled, which is set in an etcd-backed store. See <a href="https://config-master.wikimedia.org/pybal/eqiad/text-https" class="remarkup-link remarkup-link-ext" rel="noreferrer">eqiad/text-https</a> for example</li>
<li class="remarkup-list-item">The health of the service, using regular health checks (active probing)</li>
<li class="remarkup-list-item">The source and destination IP and port (hashing)</li>
</ul>

<p>The first two are handled by <a href="https://wikitech.wikimedia.org/wiki/PyBal" class="remarkup-link remarkup-link-ext" rel="noreferrer">PyBal</a>, our homemade LVS manager and “battle-tested” tool.</p>

<p>The last one is to ensure that packets from a user’s session are always forwarded to the same backend server. If they were randomly balanced, backend servers would not know what the packets are about, as they don’t share session state with each other (doing so would be very costly). There are <a href="https://phabricator.wikimedia.org/T86651" class="remarkup-link" rel="noreferrer">thoughts</a> about replacing the scheduler.</p>
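<p>The idea behind that hashing step can be sketched as follows (a minimal illustration, not PyBal’s or IPVS’s actual scheduler; the server names are made up):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Toy flow-hashing scheduler: every packet of a session picks the same backend.
import hashlib

BACKENDS = ["cp1075", "cp1077", "cp1079"]   # hypothetical pooled origin servers

def pick_backend(src_ip, src_port, dst_ip, dst_port):
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the digest to an index into the pool
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

# The same 4-tuple always maps to the same server:
first = pick_backend("203.0.113.7", 54321, "198.35.26.96", 443)
again = pick_backend("203.0.113.7", 54321, "198.35.26.96", 443)
assert first == again</pre></div>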

<h3 class="remarkup-header">Bypassing the LVS</h3>

<p>Our routers (in our case Juniper MXs, but it’s similar across all the major vendors) support multiple types of load balancing. The one we’re interested in now is called <a href="https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-bgp-session.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGP multipath</a>. To achieve this, every end server maintains a BGP session with the routers and advertises the same load-balanced IP.</p>

<p>BGP’s default behavior is to pick only one path (one backend server in our case) and keep the other ones as backups. After flipping the multipath knob, the router will start doing what’s called ECMP (Equal-Cost Multi-Path) and, similarly to LVS, will decide which server to forward the packets to based on:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">The servers advertising the IPs (passive)</li>
<li class="remarkup-list-item">The source and destination IP and port (<a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/hash-key-edit-forwarding-options.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">configurable</a>)</li>
</ul>
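<p>For illustration, the router-side change boils down to a few lines of Junos configuration. This is a sketch with a hypothetical group name; the exact load-balancing statements vary per platform:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">protocols {
    bgp {
        group anycast4 {               /* hypothetical group name */
            multipath;                 /* install all equal-cost BGP paths */
        }
    }
}
routing-options {
    forwarding-table {
        export load-balance;           /* policy enabling per-flow balancing */
    }
}</pre></div>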

<p>This has the obvious advantage of being a more lightweight solution: getting rid of a middle layer means less hardware, less software, and a simpler configuration.</p>

<p>On the other hand, there are some limitations:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Service health-check probing is replaced by self-monitoring: a daemon on the end server stops advertising the IP if it detects an issue at the service level. This leaves open some failure scenarios where the end service is locally healthy but can’t be reached remotely (eg. firewalling issues)</li>
<li class="remarkup-list-item">Less control over the <a href="https://www.juniper.net/documentation/en_US/junos/topics/concept/hash-computation-mpcs-understanding.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">hashing algorithm</a>, since it is controlled by proprietary software (the router’s OS). Not a big deal until we hit bugs</li>
<li class="remarkup-list-item">No etcd integration (yet?), thus depooling a backend server is only possible by disabling the BGP session or by stopping the self-monitored service</li>
</ul>

<p>The goal here is not to get rid of those LVS, but instead to find a better load balancing solution for those &quot;small&quot; services on which LVS itself may depend, for example DNS.</p>

<p>Let’s check out anycast before looking at the end result.</p>

<h2 class="remarkup-header">What is anycast?</h2>

<p><a href="https://en.wikipedia.org/wiki/Anycast" class="remarkup-link remarkup-link-ext" rel="noreferrer">Anycast</a> is one of the few ways the IP stack can route traffic from a source to a destination. In a good old unicast setup, the destination IP is unique on the network. But, what happens if there are several of them? You can imagine it as a larger scale version of BGP multipath (mentioned previously).<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/zfp6twbdj4a2maphejoi/PHID-FILE-olkdymy6q7spmvdw77xm/image1.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_29"><img src="https://phab.wmfusercontent.org/file/data/zfp6twbdj4a2maphejoi/PHID-FILE-olkdymy6q7spmvdw77xm/image1.png" height="618" width="577" loading="lazy" alt="image1.png (618×577 px, 25 KB)" /></a></div><br />
When a router receives a packet destined for an IP for which multiple paths exist, it will go through a list of criteria (known as <a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/general/routing-protocols-address-representation.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">path selection</a>) in order to decide which next hop is the best. For BGP, the main criterion is the “distance”, measured in AS PATH length.</p>

<p>The way our infrastructure is designed, each service is assigned an AS number (eg. 64605 for anycast, 64600 for LVS); the same goes for each site (eg. 65001 for Ashburn, 65004 for San Francisco).</p>

<p>For example, traffic going from a host to a service hosted in the same site will have a distance (AS PATH length) of 1 (SITE-&gt;SERVICE). The distance for the same host to the same service in another site would be 2 (SITE-&gt;SITE-&gt;SERVICE).</p>

<p>In the example below (edited for readability), cr3-ulsfo has 3 options to reach 10.3.0.1; the first one, having the shortest distance, is the preferred route.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">cr3-ulsfo&gt; show route 10.3.0.1 terse
A Destination  Next hop      AS path
* 10.3.0.1/32  198.35.26.7    64605 I
 10.3.0.1/32  198.35.26.197   (65002) 64605 I
 10.3.0.1/32  198.35.26.197   (65001 65002) 64605 I</pre></div>
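<p>The selection logic shown above can be sketched as follows (toy code; real BGP path selection has many more tie-breakers than AS path length):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># The routes from the router output above, as (next hop, AS path) pairs
routes = [
    {"next_hop": "198.35.26.7",   "as_path": [64605]},
    {"next_hop": "198.35.26.197", "as_path": [65002, 64605]},
    {"next_hop": "198.35.26.197", "as_path": [65001, 65002, 64605]},
]

# Prefer the route with the shortest AS path
best = min(routes, key=lambda route: len(route["as_path"]))
print(best["next_hop"])   # prints 198.35.26.7</pre></div>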



<h3 class="remarkup-header">Reasons for using anycast</h3>

<p>If for some reason the preferred path becomes unavailable, it will transparently fail over to the next one within milliseconds, making the service significantly more resilient. One must obviously keep those traffic pattern changes in mind while designing the service. In our infrastructure, edge (caching) sites will fall back to core sites if the local service is down, but not the other way around.</p>

<p>To give a more concrete example, using an internal service:</p>

<p>If an anycast endpoint is in Ashburn, all clients in Ashburn will prefer it. If that endpoint goes down, and we have a similar endpoint in Dallas, Ashburn clients will automatically &quot;reroute&quot; to Dallas.</p>

<p>It is easy to see the reliability improvements of the above solution compared to more traditional ones, like, for instance, when servers had two nameserver entries in their resolv.conf file. Unfortunately, resolv.conf is configured to try the nameservers sequentially and has a default timeout of 5 seconds, with a minimum possible value of one second. This means that an outage can lead to servers being unable to resolve DNS for a number of seconds before they failover to the second nameserver. Some services are more sensitive to these failures than others and we have observed real issues with such outages. More details on task <a href="https://phabricator.wikimedia.org/T162818" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_34"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T162818</span></span></a>.</p>
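<p>To make the contrast concrete, here is what the two approaches look like in resolv.conf. This is illustrative: the unicast resolver IPs are made up, while 10.3.0.1 is our anycast recursor:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Traditional setup: sequential failover, seconds of delay during an outage
nameserver 10.64.0.11     # hypothetical primary recursor
nameserver 10.64.16.11    # hypothetical secondary, only tried after a timeout
options timeout:1 attempts:2

# Anycast setup: a single IP, failover handled by the network
nameserver 10.3.0.1</pre></div>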

<p>In addition to resilience, Anycast does a good job of keeping latency to a minimum, as a shorter AS PATH usually means lower latency. While this is true within a controlled network, the Internet is another can of worms, but it is still the best option for services which can&#039;t do GeoIP (eg. authoritative DNS servers).</p>

<p>Internally, we don&#039;t have to maintain a mapping of which server is the best one for a given POP. In our case, all hosts use 10.3.0.1 as DNS. Set it, and forget it.</p>

<p>Of course it&#039;s not all upside: Anycast comes with one major risk, especially for a stateful protocol such as TCP: flapping. External factors (topology changes, incorrect load balancing) can cause packets of a given session to get redirected to a different backend server. As the new server did not take part in the initial TCP handshake, it will have no local state and will reject (RST) the connection. Fixing this requires keeping state on the routers or sharing it between backend servers. Both are incredibly costly solutions. Remember, we’re trying to keep that step as lightweight as possible. Thankfully, experience and <a href="https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">studies</a> have shown that even on a network as dynamic as the Internet, those situations are uncommon.</p>

<p>Another limitation is monitoring: as a source is not able to target a specific destination host (the network decides), monitoring needs to run from at least as many vantage points as there are end nodes.</p>

<h2 class="remarkup-header">Our implementation</h2>

<p>Tracked in: <a href="https://phabricator.wikimedia.org/T186550" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_35"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T186550</span></span></a><br />
Documented in: <a href="https://wikitech.wikimedia.org/wiki/Anycast#Internal" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://wikitech.wikimedia.org/wiki/Anycast#Internal</a></p>

<p>Firstly, we need a daemon that runs on our Debian servers and talks BGP to our routers. We chose the BIRD Internet Routing Daemon because it is well tested and supports BFD out of the box.</p>

<p><a href="https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection" class="remarkup-link remarkup-link-ext" rel="noreferrer">BFD</a> (Bidirectional Forwarding Detection) is a very fast and lightweight failure detection tool. As BGP&#039;s keepalive timers are not designed to be quick (<a href="https://tools.ietf.org/html/rfc4271" class="remarkup-link remarkup-link-ext" rel="noreferrer">90s by default</a>), we need something to ensure the routers will notice the server going down fast enough, in our case after 3×300ms.</p>
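<p>In Bird’s configuration, those timers look roughly like this. This is a hand-written sketch, not our Puppet-generated config; the neighbor address is hypothetical, while the AS numbers match the ones mentioned earlier:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Declare a peer dead after 3 missed packets at 300 ms intervals (~900 ms)
protocol bfd {
    interface "*" {
        min rx interval 300 ms;
        min tx interval 300 ms;
        multiplier 3;
    };
}

protocol bgp anycast {
    local as 64605;
    neighbor 10.64.0.2 as 65001;   # hypothetical router address
    bfd on;                        # tie the BGP session to BFD liveness
}</pre></div>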

<p>At this point, we could already call it a day. We have the server advertising the Anycast IP via Bird to the router and a failover mechanism if the server fails.</p>

<p>But, what if the server stays healthy while the service itself dies?</p>

<p>To cover that failure scenario we found a useful and lightweight tool on GitHub called <a href="https://github.com/unixsurfer/anycast_healthchecker" class="remarkup-link remarkup-link-ext" rel="noreferrer">anycast_healthchecker</a>.</p>

<p>Every second, the health-checker monitors the health of the anycasted service using a custom script. If any issue is detected, it will instruct Bird to stop advertising the relevant IP.</p>
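<p>Each monitored service gets a small ini-style stanza in anycast_healthchecker’s configuration; a hypothetical check for the DNS recursor could look like the following (the service name, script path and thresholds are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">[recdns.anycast.wmnet]
check_cmd      = /usr/local/bin/check_dns_query   # custom health-check script
check_interval = 1                                # run the check every second
check_fail     = 1                                # withdraw the IP after 1 failure
check_rise     = 1                                # re-advertise after 1 success
ip_prefix      = 10.3.0.1/32                      # the anycast IP to (un)advertise</pre></div>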

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/26lgt53yfjby64ngry6u/PHID-FILE-wf66ckqwh2hzj3acxa7m/image2.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_30"><img src="https://phab.wmfusercontent.org/file/data/26lgt53yfjby64ngry6u/PHID-FILE-wf66ckqwh2hzj3acxa7m/image2.png" height="537" width="561" loading="lazy" alt="image2.png (537×561 px, 19 KB)" /></a></div></p>

<p>Covering another failure scenario, the Bird process is linked (at the systemd level) to anycast_healthchecker, so that if the latter dies, the former will stop, BFD will detect a failure, and the router will stop advertising the IP as well.</p>
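<p>That systemd-level link can be expressed with a drop-in like the following (a sketch; the unit and file names are assumptions):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code"># /etc/systemd/system/bird.service.d/healthchecker.conf (hypothetical path)
[Unit]
# If anycast-healthchecker stops or crashes, stop bird too, so the routers
# detect the failure and withdraw the routes
BindsTo=anycast-healthchecker.service
After=anycast-healthchecker.service</pre></div>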

<p>On the monitoring front we have Icinga checking for the Bird and anycast_healthchecker processes, the router’s BGP sessions as well as the Anycasted IP.<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/bvelx4hs6tqpfckp266p/PHID-FILE-frlqmk3m3g3w7jwflvoe/image5.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_31"><img src="https://phab.wmfusercontent.org/file/data/bvelx4hs6tqpfckp266p/PHID-FILE-frlqmk3m3g3w7jwflvoe/image5.png" height="76" width="398" loading="lazy" alt="image5.png (76×398 px, 6 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/a3l4atmrz5arzqs5rn4g/PHID-FILE-63i4ppwhqd2wrxmmqpym/image4.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_32"><img src="https://phab.wmfusercontent.org/file/data/a3l4atmrz5arzqs5rn4g/PHID-FILE-63i4ppwhqd2wrxmmqpym/image4.png" height="77" width="653" loading="lazy" alt="image4.png (77×653 px, 10 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/66hbdayrqnp2n6mk5fdu/PHID-FILE-nrdxycagzo5nxhvcg4yw/image3.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_33"><img src="https://phab.wmfusercontent.org/file/data/66hbdayrqnp2n6mk5fdu/PHID-FILE-nrdxycagzo5nxhvcg4yw/image3.png" height="55" width="673" loading="lazy" alt="image3.png (55×673 px, 6 KB)" /></a></div></p>

<p>As mentioned previously, this check will only fail if all of the possible Anycast endpoints are down (from a monitoring host’s point of view), which is why it is a paging alert.</p>

<p>All of the above is deployed via <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/profile/manifests/bird/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Puppet</a>, so only a few lines of Puppet/Hiera configuration are needed, <a href="https://wikitech.wikimedia.org/wiki/Anycast#How_to_deploy_a_new_service?" class="remarkup-link remarkup-link-ext" rel="noreferrer">see here</a>. If you&#039;re curious about the router side, it&#039;s over <a href="https://wikitech.wikimedia.org/wiki/Anycast#How_are_the_routers_configured?" class="remarkup-link remarkup-link-ext" rel="noreferrer">there</a>.</p>

<h2 class="remarkup-header">What&#039;s next?</h2>

<p>This setup has been working flawlessly for a few months now, and is going to grow progressively.</p>

<p>On the &quot;small&quot; improvements side, or wishlist, we want to be able to monitor the Anycast endpoints from more vantage points, and to make the setup v6-ready.</p>

<p>On the larger side, the next big step is to roll Anycast for our authoritative DNS servers (the ones answering for all the *.wikipedia.org hostnames). The outline of the plan can be seen on the <a href="https://phabricator.wikimedia.org/T98006#5416434" class="remarkup-link" rel="noreferrer">tracking task</a>.</p>

<p>Our goal is to do externally what we have been doing internally. Each datacenter will advertise the same IPs to their transit and peering neighbors. The internet will take care of routing users to the optimal site. Add some safety mechanisms (eg. automatic IP withdrawal), proper monitoring, and voila!</p>

<hr class="remarkup-hr" />

<p>Photo by Clint Adair on Unsplash</p></div></content></entry><entry><title>RPKI Origin Validation</title><link href="/phame/live/17/post/186/rpki_origin_validation/" /><id>https://phabricator.wikimedia.org/phame/post/view/186/</id><author><name>ayounsi (Arzhel Younsi)</name></author><published>2020-08-10T13:02:48+00:00</published><updated>2020-09-05T12:08:22+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><h2 class="remarkup-header">The problem</h2>

<p><a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGP</a> binds the internet together by standardizing a way for all networks to tell their neighbors “If you want to reach IP X, send the packets to network Y”. BGP is great for its resiliency and scalability, but less so for its security.<br />
How can we know which network (<a href="https://en.wikipedia.org/wiki/Autonomous_system_(Internet)" class="remarkup-link remarkup-link-ext" rel="noreferrer">Autonomous System</a>) is the legitimate owner of an IP? Without that information, IPs can easily get <a href="https://en.wikipedia.org/wiki/BGP_hijacking" class="remarkup-link remarkup-link-ext" rel="noreferrer">hijacked</a>, either <a href="https://arstechnica.com/information-technology/2019/06/bgp-mishap-sends-european-mobile-traffic-through-china-telecom-for-2-hours/" class="remarkup-link remarkup-link-ext" rel="noreferrer">accidentally</a> or <a href="https://btc-hijack.ethz.ch/files/btc_hijack.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">maliciously</a>.</p>

<p>Since the late 90s, databases named <a href="https://en.wikipedia.org/wiki/Internet_Routing_Registry" class="remarkup-link remarkup-link-ext" rel="noreferrer">Internet Routing Registries</a> (IRR) have been trying to <a href="http://www.irr.net/docs/faq.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">fulfill that (single) source of truth role</a>. Unfortunately, they are subject to a lot of issues: fragmentation (many existing databases, not all equally  well-maintained), security (some databases allow anyone to “claim” a prefix) and complexity (for the network operators). They also contain a lot of <a href="https://bgpmon.net/how-accurate-are-the-internet-route-registries-irr/" class="remarkup-link remarkup-link-ext" rel="noreferrer">inaccurate data</a> that have accumulated over time.</p>

<p>The first question that comes to mind is “why not fix what already exists instead of re-inventing the wheel?” Some efforts are <a href="https://2019.apricot.net/assets/files/APKS756/apricot2019_snijders_routing_security_roadmap_1551228895%20(2).pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">being made</a> on that, especially since IRR have a broader scope than just associating IPs to operators. Reciprocally, the Resource Public Key Infrastructure&#039;s (<a href="https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI</a>) scope is focused on enforcing IP/AS ownership, <a href="https://conference.apnic.net/44/assets/files/APCS549/Global-IRR-and-RPKI-a-problem-statement.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">not replacing IRRs</a>.<br />
The second question is: how do we make sure RPKI data doesn’t become similarly inaccurate? I believe that IRRs became stale/outdated because only a few providers were rejecting prefixes based on this information. Hopefully the documentation, simplicity and existing tooling for RPKI will democratize its adoption and make inaccuracies easy to spot and quick to remedy.</p>

<h2 class="remarkup-header">The solution</h2>

<p>From an operator perspective, RPKI works with 2 interdependent parts:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Signing: tell the world which prefixes have been delegated to one&#039;s AS</li>
<li class="remarkup-list-item">Validation: prevent one&#039;s network from routing traffic to hijacked networks</li>
</ul>

<h3 class="remarkup-header">Signing</h3>

<p>Just a brief summary as there are a lot of resources <a href="https://ripe78.ripe.net/wp-content/uploads/presentations/43-Running-Your-Own-CA-RIPE78.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">available online</a>.</p>

<p>Some points to highlight though:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">It’s the one step that makes your own prefixes safer.</li>
<li class="remarkup-list-item">It’s very easy to do through <a href="https://rpki.readthedocs.io/en/latest/rpki/implementation-models.html#hosted-rpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">RIR’s online signing tools</a><ul class="remarkup-list">
<li class="remarkup-list-item">although these tools are of <a href="https://www.lacnic.net/innovaportal/file/3635/1/lacnic-cloudflares-rpki-validator.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">varying quality</a></li>
</ul></li>
<li class="remarkup-list-item">One should set up monitoring for their ROAs (especially expiration). Some RIRs offer that option (e.g. RIPE).</li>
<li class="remarkup-list-item">One should not create a ROA with prefixes more specific than one’s routing policy (e.g. if you only ever advertise a /22, don&#039;t add the more specific /24 to your ROA)</li>
</ul>

<h3 class="remarkup-header">Validation</h3>

<h4 class="remarkup-header">How does it work?</h4>

<p>Everything starts with a validator, also called RPKI Relying Party software. Many open source implementations exist, in various languages with relatively similar features: <a href="https://github.com/RIPE-NCC/rpki-validator-3" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI Validator</a>, <a href="https://github.com/cloudflare/cfrpki#octorpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">OctoRPKI</a>, <a href="https://github.com/NLnetLabs/routinator" class="remarkup-link remarkup-link-ext" rel="noreferrer">Routinator</a>, <a href="https://github.com/dragonresearch/rpki.net" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI Toolkit</a>, <a href="https://nicmx.github.io/FORT-validator/" class="remarkup-link remarkup-link-ext" rel="noreferrer">FORT Validator</a>, <a href="https://www.rpki-client.org/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rpki-client</a>.<br />
RPKI works as a chain of trust, and the first level of that chain is the RIRs. To know how to reach that first level (the Trust Anchors), the validator needs a file called a Trust Anchor Locator (TAL), which is a pointer to each RIR’s RPKI repository (or any repository you trust), along with its public key.<br />
TALs are present on each RIR’s website and validators <a href="https://github.com/cloudflare/cfrpki/tree/master/cmd/octorpki/tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">include</a> <a href="https://github.com/RIPE-NCC/rpki-validator-3/tree/master/rpki-validator/src/main/resources/packaging/generic/workdirs/preconfigured-tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">most</a> <a href="https://github.com/NLnetLabs/routinator/tree/master/tals" class="remarkup-link remarkup-link-ext" rel="noreferrer">of</a> <a href="https://github.com/dragonresearch/rpki.net/tree/master/rp/rcynic/sample-trust-anchors" class="remarkup-link remarkup-link-ext" rel="noreferrer">them</a>. ARIN’s is an exception as users are required to agree to the ARIN <a href="https://www.arin.net/resources/manage/rpki/tal/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Relying Party Agreement</a>.<br />
The RPKI repository contains either ROAs (Route Origin Authorization) or pointers to other repositories, themselves containing ROAs trusted by the upstream repository.</p>

<p>ROAs are signed database objects saying “Prefix X is allowed to be advertised by AS Z”. Each also has a creation and an expiration date. There is no limit on how many prefixes an AS can be authorized to advertise, nor on how many ASes can be authorized for a given prefix.</p>
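<p>To make the data model concrete, here is a minimal Python sketch of a ROA as a record. The field names and example prefix/ASN are illustrative only; real ROAs are CMS-signed objects, not Python objects:</p>

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Roa:
    """Illustrative model of a Route Origin Authorization."""
    prefix: str           # e.g. "198.35.26.0/23"
    max_length: int       # most specific length the AS may announce
    origin_asn: int       # AS authorized to originate the prefix
    not_before: datetime  # start of the validity window
    not_after: datetime   # expiration: past this, validators ignore it

    def is_current(self, now=None):
        """True while the ROA is inside its validity window."""
        now = now or datetime.now(timezone.utc)
        return self.not_before <= now <= self.not_after

# Several ROAs may cover the same prefix, and one AS may appear in many
# ROAs: RPKI puts no limit on either direction.
roa = Roa("198.35.26.0/23", 23, 14907,
          datetime(2019, 1, 1, tzinfo=timezone.utc),
          datetime(2020, 1, 1, tzinfo=timezone.utc))
```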

<p>Validators will fetch the ROAs from all the available repositories using rsync or <a href="https://datatracker.ietf.org/doc/rfc8182/" class="remarkup-link remarkup-link-ext" rel="noreferrer">RRDP</a> (over HTTPS).<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/34wkbymopeuectlqn3ke/PHID-FILE-5wbhio56pxsfprfzrip7/image3.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_36"><img src="https://phab.wmfusercontent.org/file/data/34wkbymopeuectlqn3ke/PHID-FILE-5wbhio56pxsfprfzrip7/image3.png" height="319" width="396" loading="lazy" alt="image3.png (319×396 px, 13 KB)" /></a></div></p>

<p>Validators will then verify the ROAs (i.e. make sure the format and the signature are correct). Everything that can’t be verified at this level will be ignored. For example, an expired ROA will be ignored (as if it were not in the repository), so the covered prefix simply falls back to the “Unknown” state described below. This prevents any risk of a prefix becoming unreachable if its owner forgets to “renew” its ROA.<br />
The validator is decoupled from the router for performance reasons: routers usually have high forwarding performance, but very few spare resources for any other task.</p>

<p>Now that we have a curated and verified list of prefixes/ASNs pairs, we have to communicate it to the router. For that the Validator uses the <a href="https://tools.ietf.org/html/rfc6810" class="remarkup-link remarkup-link-ext" rel="noreferrer">RTR</a> (RPKI-To-Router) protocol. Most of the time this is embedded in the validator, but standalone applications like <a href="https://github.com/cloudflare/gortr" class="remarkup-link remarkup-link-ext" rel="noreferrer">GoRTR</a> also exist.<br />
The RFC recommends an encrypted transport such as SSH or TLS, but does not mandate encryption. The risk is mitigated by the requirement that &quot;If unprotected TCP is the transport, the cache and routers MUST be on the same trusted and controlled network&quot;.<br />
As with any critical service, it is recommended to run more than one validator, to ensure prefix validation is never interrupted. Here as well, the risk of unreachable prefixes is mitigated by a timeout period (for example <a href="https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/record-lifetime-edit-routing-options-validation.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">1 hour on Junos</a>): if the router cannot reach a validator for that long, it stops enforcing validation.</p>
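<p>The RTR protocol itself is deliberately simple: every RFC 6810 PDU starts with the same fixed 8-byte header. A minimal Python sketch of parsing it (the PDU type table is taken from the RFC; the example PDU is a router’s Reset Query, asking the cache for its full validated set):</p>

```python
import struct

# Every RTR PDU (RFC 6810) starts with the same 8-byte header:
# version (1 byte), PDU type (1 byte), session id (2 bytes),
# total length including the header (4 bytes), all big-endian.
RTR_HEADER = struct.Struct("!BBHI")

PDU_TYPES = {
    0: "Serial Notify", 1: "Serial Query", 2: "Reset Query",
    3: "Cache Response", 4: "IPv4 Prefix", 6: "IPv6 Prefix",
    7: "End of Data", 8: "Cache Reset", 10: "Error Report",
}

def parse_rtr_header(data: bytes) -> dict:
    """Decode the common header of an RTR PDU."""
    version, pdu_type, session_id, length = RTR_HEADER.unpack_from(data)
    return {"version": version,
            "type": PDU_TYPES.get(pdu_type, "Unknown"),
            "session_id": session_id,
            "length": length}

# A Reset Query: version 0, type 2, reserved field zero, length 8.
pdu = struct.pack("!BBHI", 0, 2, 0, 8)
```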

<p>The router will check every learned prefix (asynchronously, so BGP convergence time is not impacted) against its internal RPKI database, which is periodically synced with the validator. Each BGP prefix then gets one of four states:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Valid: the BGP prefix and its origin AS match a ROA on file</li>
<li class="remarkup-list-item">Invalid: the BGP prefix is covered by a ROA, but the origin AS doesn’t match it (or the prefix is more specific than the ROA’s maxLength allows)</li>
<li class="remarkup-list-item">Unknown: the BGP prefix isn’t covered by any ROA (most of the Internet so far)</li>
<li class="remarkup-list-item">Unverified: the router hasn’t checked that prefix against its database yet</li>
</ul>
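<p>The first three states can be reproduced in a few lines. A simplified Python sketch of RFC 6811 origin validation, operating on (prefix, maxLength, origin AS) tuples as a router would receive them over RTR (the example prefix and ASN are illustrative):</p>

```python
import ipaddress

def origin_validation(prefix: str, origin_asn: int, roas) -> str:
    """Route origin validation, roughly following RFC 6811.

    `roas` is an iterable of (roa_prefix, max_length, asn) tuples,
    i.e. the validated payloads a router learns over RTR.
    """
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if asn == origin_asn and announced.prefixlen <= max_length:
            return "Valid"
    # Covered but never matched -> Invalid; not covered at all -> Unknown.
    return "Invalid" if covered else "Unknown"

roas = [("198.35.26.0/23", 23, 14907)]
```

<p>A production implementation would index the ROAs by prefix (e.g. in a radix tree) rather than scanning the full list for every route.</p>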

<p>The most important and useful state here is Invalid, which indicates a misconfiguration at best, a malicious hijack at worst.</p>

<p>The last step is for the router to do something useful with this new information.<br />
For the Invalid prefixes received from your peers:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Lower the local preference. This can be useful as a proof of concept, but only protects against a very narrow situation: where a prefix of similar size but different origin AS is learned from multiple peers. This could potentially save the day after a misconfiguration in the <a href="https://en.wikipedia.org/wiki/Default-free_zone" class="remarkup-link remarkup-link-ext" rel="noreferrer">DFZ</a> (Default-Free Zone), but would not protect from a malicious actor advertising a more specific prefix.</li>
<li class="remarkup-list-item">Discard. A more difficult decision, especially as long as <a href="https://nusenu.github.io/RPKI-Observatory/unreachable-networks.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">RPKI unreachable</a> prefixes exist: IPs on the Internet covered only by Invalid prefixes, with no larger or smaller covering prefix. Discarding makes them “invisible” to your network. Operators have to decide whether the extra security is worth that risk. To help make that call, <a href="https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pmacct can now</a> show how much traffic, if any, is being exchanged between your network and those IPs.</li>
</ul>

<p>A good first move is to discard invalids on your <a href="https://en.wikipedia.org/wiki/Internet_exchange_point" class="remarkup-link remarkup-link-ext" rel="noreferrer">IXP</a> facing links, for two reasons:</p>

<ol class="remarkup-list">
<li class="remarkup-list-item">It eliminates the risk of unreachable prefixes, as traffic will reroute through transit links; the worst-case scenario is now sub-optimal routing. It’s also easier to reach out to peers and ask them to fix their ROAs.</li>
<li class="remarkup-list-item">Prefixes learned from IXPs usually have a very short AS path. A rogue prefix originating from there would most likely be preferred over one learned through a transit.</li>
</ol>

<p>For the Invalid prefixes advertised to your downstreams:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Add a BGP community to the valid and invalid prefixes you’re forwarding down. This allows customers to make their own routing decisions without having to deploy a full RPKI infrastructure. It also means your customers are placing their trust in you, so you must not blindly forward communities learned from untrusted peers.</li>
<li class="remarkup-list-item">Discard. This can be done progressively, customer after customer before a global discard.</li>
</ul>

<p>Tadam, the Internet is a bit more secure, thank you!<br />
(Check your NOC mailbox just in case.)</p>

<h4 class="remarkup-header">Monitoring</h4>

<p>Before you start dropping prefixes, better make sure everything is healthy.<br />
Additionally, people who will respond to alerts and “I can’t reach X” emails should be trained on how to react.</p>

<p>Some validators (such as OctoRPKI or Routinator) provide Prometheus endpoints exposing various metrics.<br />
For the router side, a <a href="https://tools.ietf.org/html/draft-ymbk-rpki-rtr-protocol-mib-00" class="remarkup-link remarkup-link-ext" rel="noreferrer">draft RFC</a> for an RTR MIB exists, but I’m not aware of any implementation. Syslog is more or less <a href="https://apps.juniper.net/syslog-explorer/#msg=RPD_RV_SESSIONDOWN&amp;sw=Junos%20OS&amp;rel=19.1R1" class="remarkup-link remarkup-link-ext" rel="noreferrer">an option</a> as well. Tools like <a href="https://github.com/Juniper/py-junos-eznc" class="remarkup-link remarkup-link-ext" rel="noreferrer">junos-pyez</a> or <a href="http://napalm.readthedocs.io/" class="remarkup-link remarkup-link-ext" rel="noreferrer">NAPALM</a> with some custom parsing seem to be the most complete options so far.</p>
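<p>For a quick health check, scraping those endpoints doesn’t require the full Prometheus stack. A hedged sketch of parsing the text exposition format; the sample metrics and their names are made up, as each validator exposes its own set:</p>

```python
def parse_prom_metrics(text: str) -> dict:
    """Parse a Prometheus text exposition into {metric{labels}: value}.

    Good enough for alerting on a validator's scrape output; the real
    format is richer (see the Prometheus exposition format spec).
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative scrape output; actual metric names differ per validator.
sample = """
# HELP valid_roas Number of valid ROAs seen
valid_roas{tal="ripe"} 12000
valid_roas{tal="arin"} 4000
last_update_seconds 1579000000
"""
```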

<h4 class="remarkup-header">What doesn’t it protect against?</h4>

<h5 class="remarkup-header">Your own prefixes</h5>

<p>First of all, doing validation doesn’t protect your own prefixes, as it only impacts outbound traffic. The two things you can do for them are signing your prefixes and advocating for more people to deploy RPKI validation.</p>

<h5 class="remarkup-header">Transit (or any intermediary AS) not doing validation</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/njzekbcwtblzhu2bze24/PHID-FILE-bpkxn6dgptrrfqtiqyd3/image4.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_37"><img src="https://phab.wmfusercontent.org/file/data/njzekbcwtblzhu2bze24/PHID-FILE-bpkxn6dgptrrfqtiqyd3/image4.png" height="391" width="325" loading="lazy" alt="image4.png (391×325 px, 20 KB)" /></a></div><br />
In the above diagram, even if Wikimedia discards the malicious /24, it would still send traffic for 192.0.2.1 to its transit provider (as it’s the best next-hop for the /23). The transit, not doing any validation, would naturally forward that traffic to the malicious AS, as it is advertising a more specific prefix.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
<li class="remarkup-list-item">Peer with networks doing RPKI validation (maybe we should keep a hall of fame somewhere?)</li>
<li class="remarkup-list-item">Advertise only /24s (and /48s v6)</li>
</ul>

<blockquote><p>Disaggregating has the adverse effect of increasing the size of the global routing table, which in many cases is frowned upon by the operator community. The decision to do so should not be taken lightly.</p></blockquote>



<h5 class="remarkup-header">AS forgery</h5>

<p>RPKI only ensures that a prefix is advertised by the proper origin AS. A malicious network could either change the origin AS of the prefixes it’s advertising (to impersonate the valid source AS) or forge a longer AS path (pretending to be a transit for the target network). This is very unlikely to be the result of a misconfiguration.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
<li class="remarkup-list-item">Advertise only /24s (and /48s v6), with the same warning as above</li>
<li class="remarkup-list-item">Monitor prefixes AS PATHS and contact the rogue network’s upstream</li>
</ul>

<h5 class="remarkup-header">Man in the middle</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/w5dxyjhivijopizowtdh/PHID-FILE-gjdwbk7kwj4jfs3lpv4y/image1.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_38"><img src="https://phab.wmfusercontent.org/file/data/w5dxyjhivijopizowtdh/PHID-FILE-gjdwbk7kwj4jfs3lpv4y/image1.png" height="247" width="871" loading="lazy" alt="image1.png (247×871 px, 35 KB)" /></a></div><br />
Slightly related to the point above, RPKI is vulnerable to all kinds of MitM attacks, as it only validates the source AS, not the whole path.<br />
Take the example above: a malicious network could advertise the same prefix while keeping an AS path that ends at the legit AS (and forwarding the traffic along). This is sneakier and more complex than AS forgery, as the target network still receives its traffic.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Peer with as many networks as possible (the shorter the AS path, the better)</li>
</ul>

<h5 class="remarkup-header">More specific ROA</h5>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/vldwrhrzvbopaicfnjf7/PHID-FILE-koyuvt3etxdkutmcxudv/image2.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_39"><img src="https://phab.wmfusercontent.org/file/data/vldwrhrzvbopaicfnjf7/PHID-FILE-koyuvt3etxdkutmcxudv/image2.png" height="247" width="818" loading="lazy" alt="image2.png (247×818 px, 34 KB)" /></a></div><br />
A variation of the previous point: if your ROA allows your AS to advertise a more specific prefix (e.g. via maxLength) that you don’t actually advertise, a malicious AS doesn’t even need a shorter AS path to MitM it.<br />
How to mitigate it?</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Ensure the ROA strictly matches what you’re advertising in BGP</li>
</ul>
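<p>That check is easy to automate by comparing your ROAs against the prefixes you actually announce. A hedged Python sketch; the data shapes are illustrative, not any particular tool’s format:</p>

```python
import ipaddress

def over_permissive_roas(roas, announced):
    """Flag ROAs whose maxLength exceeds every announced prefix length.

    `roas`: iterable of (prefix, max_length) tuples from your ROAs.
    `announced`: iterable of prefix strings you really advertise in BGP.
    Any flagged ROA leaves room for the 'more specific ROA' MitM above.
    """
    announced = [ipaddress.ip_network(p) for p in announced]
    flagged = []
    for roa_prefix, max_length in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        advertised_lens = [n.prefixlen for n in announced
                           if n.version == roa_net.version
                           and n.subnet_of(roa_net)]
        if advertised_lens and max_length > max(advertised_lens):
            flagged.append((roa_prefix, max_length))
    return flagged
```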

<p>Some efforts (such as <a href="https://en.wikipedia.org/wiki/BGPsec" class="remarkup-link remarkup-link-ext" rel="noreferrer">BGPsec</a>) are being made to perform full path validation, but nothing is production ready yet.</p>

<h4 class="remarkup-header">TL;DR</h4>

<ul class="remarkup-list">
<li class="remarkup-list-item">Deploy more than one validator</li>
<li class="remarkup-list-item">Keep them on the same trusted network as your routers (or use encryption)</li>
<li class="remarkup-list-item">Monitor validators and routers</li>
<li class="remarkup-list-item">Write documentation and train your staff</li>
<li class="remarkup-list-item">Check if any traffic would be null-routed (e.g. with pmacct)</li>
<li class="remarkup-list-item">Peer with many networks (short AS Paths)</li>
<li class="remarkup-list-item">Discard Invalids starting with IXPs</li>
<li class="remarkup-list-item">Share your experience</li>
</ul>

<h2 class="remarkup-header">Where are we now?</h2>

<p>From <a href="https://rpki-monitor.antd.nist.gov/" class="remarkup-link remarkup-link-ext" rel="noreferrer">NIST</a> we see that Invalids represent 0.23% of equivalent /24s, with RPKI-unreachable prefixes being an even smaller subset.<br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/fjnzlf4eoqu4oaogpvwm/PHID-FILE-veqeg5po67jgqqfc2fny/global.bgp_prefix_space.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_40"><img src="https://phab.wmfusercontent.org/file/data/fjnzlf4eoqu4oaogpvwm/PHID-FILE-veqeg5po67jgqqfc2fny/global.bgp_prefix_space.png" height="500" width="700" loading="lazy" alt="global.bgp_prefix_space.png (500×700 px, 18 KB)" /></a></div></p>

<h3 class="remarkup-header">As seen on the Internet</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item">AMS-IX route servers <a href="https://www.ams-ix.net/ams/documentation/ams-ix-route-servers#section-29115" class="remarkup-link remarkup-link-ext" rel="noreferrer">reject by default</a> prefixes with invalid origins.</li>
<li class="remarkup-list-item">Telia <a href="https://blog.teliacarrier.com/2020/02/05/dropping-rpki-invalid-prefixes/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">AT&amp;T <a href="https://www.youtube.com/watch?v=DkUZvlj1wCk" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">Netnod <a href="https://www.netnod.se/blog/netnods-new-route-server-platform" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">SEACOM and Workonline <a href="https://seclists.org/nanog/2019/Apr/113" class="remarkup-link remarkup-link-ext" rel="noreferrer">reject invalids</a> (but don’t use ARIN’s TAL)</li>
<li class="remarkup-list-item">Google <a href="https://ripe78.ripe.net/presentations/54-Route-Filtering-at-the-Edge-AS15169-Connect-%40RIPE.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">is planning</a> to reject invalids on its peering links</li>
<li class="remarkup-list-item">NTT <a href="https://www.gin.ntt.net/ntt-improves-security-of-the-internet-with-rpki-origin-validation-deployment/" class="remarkup-link remarkup-link-ext" rel="noreferrer">rejects invalids</a></li>
<li class="remarkup-list-item">A <a href="https://blog.benjojo.co.uk/post/state-of-rpki-in-2018" class="remarkup-link remarkup-link-ext" rel="noreferrer">more comprehensive list</a> of networks rejecting invalids (methodology not detailed)</li>
<li class="remarkup-list-item">And the now famous <a href="https://isbgpsafeyet.com/" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://isbgpsafeyet.com/</a></li>
</ul>

<h3 class="remarkup-header">Wikimedia</h3>

<p>All the work done at the Foundation is public by default, RPKI is no exception. You can find the main tracking task on our <a href="https://phabricator.wikimedia.org/T220669" class="remarkup-link" rel="noreferrer">Phabricator</a> instance, all related code changes on <a href="https://gerrit.wikimedia.org/r/q/topic:%22rpki%22" class="remarkup-link remarkup-link-ext" rel="noreferrer">Gerrit</a> the doc obviously on a <a href="https://wikitech.wikimedia.org/wiki/RPKI" class="remarkup-link remarkup-link-ext" rel="noreferrer">wiki</a>, and graphs on <a href="https://grafana.wikimedia.org/d/UwUa77GZk/rpki" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grafana</a>.</p>

<p>Back in April, after looking at all the available validators, we decided to use Routinator for two main reasons: its RTR daemon is embedded into the validator (no need to maintain several tools) and its development was active with an explicit roadmap.<br />
In parallel to the implementation side, we wrote a <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/rpkicounter/files/rpkicounter.py" class="remarkup-link remarkup-link-ext" rel="noreferrer">Prometheus exporter</a> comparing our real-time webrequests to a <a href="https://as286.net/data/ana-invalids.txt" class="remarkup-link remarkup-link-ext" rel="noreferrer">list</a> of invalid prefixes, giving us the percentage of requests coming from unreachable prefixes. This was then exposed in real time on our <a href="https://grafana.wikimedia.org/d/UwUa77GZk/rpki?orgId=1" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grafana dashboard</a> and used to hover around 0.01%.</p>
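<p>The core idea of that exporter fits in a few lines. A heavily simplified sketch (the real rpkicounter.py linked above streams live webrequests and is optimized accordingly; the sample IPs below are documentation addresses):</p>

```python
import ipaddress

def invalid_traffic_ratio(client_ips, invalid_prefixes):
    """Fraction of requests whose client IP falls in an invalid prefix.

    `client_ips`: iterable of client IP strings (e.g. from webrequests).
    `invalid_prefixes`: the published list of RPKI-invalid prefixes.
    """
    nets = [ipaddress.ip_network(p) for p in invalid_prefixes]
    hits = sum(1 for ip in client_ips
               if any(ipaddress.ip_address(ip) in net for net in nets))
    return hits / len(client_ips) if client_ips else 0.0
```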

<p>Last July, we started to reject invalids on IXP links, where our AS paths are the shortest. In addition to having an <a href="https://www.peeringdb.com/net/1365" class="remarkup-link remarkup-link-ext" rel="noreferrer">open peering policy</a>, this contributed to making more than half of our traffic “safer”.</p>

<p>On January 15th, we flipped the switch to <strong>discard invalid prefixes</strong> on transit links as well. One of our concerns was legitimate providers suddenly being unable to reach Wikipedia after a misconfiguration on their side. Not our fault, but their users would still be widely impacted.<br />
On the other hand, our hope is that having such a popular website enforcing RPKI adds a considerable amount of trust in the system, accelerating the ongoing adoption of RPKI validation.</p>

<h4 class="remarkup-header">Aftermath</h4>

<p>As of the time of publishing this article, we have been contacted by 13 providers reporting Wikipedia being unreachable for them or one of their customers. Each case was quickly fixed once we explained the issue to them.</p>

<hr class="remarkup-hr" />

<p>Edit: August 11, added mentions of FORT Validator and rpki-client.</p>

<p>Header: <a href="https://commons.wikimedia.org/wiki/File:DublinAirport31mar2007-03.jpg" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://commons.wikimedia.org/wiki/File:DublinAirport31mar2007-03.jpg</a></p></div></content></entry></feed>