Been a while since my last update, been quite busy! but I thought I’d do a post on something BGP related, as everyone loves BGP!
There’s an interesting addition to BGP route-reflection that’s found it’s way into a few trains of code on Juniper and Cisco, (I assume it’s on others too) that attempts to solve one of the annoying issues that occurs when centralised route-reflectors are used.
It all boils down to the basics of path selection, in networks where the setup is relatively simple and identical routes are received, at different edge routers within the network – similar to anycast routing.
Consider the below lab topology;
The core of the network is super simple, basic ISIS, basic LDP/MPLS with ASR-9kv as an out-of-path route-reflector, with iBGP adjacencies configured along the green arrows, the red arrows signify the eBGP sessions, between AS 100-200 and AS 100-300, where IOSv-7 and IOSv-8 advertise an identical 184.108.40.206/32 route. IOSv-3 and IOSv-4 are just P routers running ISIS/LDP only, for the sake of adding a few hops.
With everything configured as defaults, lets look at the path selection;
If we take a brief look at the situation, specifically IOSv-2 and IOSv-5 it’s pretty easy to see what’s happening, the network has basically converged to prefer the path via AS-300 to get to 220.127.116.11/32
For many networks, this sort of thing isn’t a problem – there’s a functional, working path to 18.104.22.168/32, if the edge router connected to AS-300 fails, the path through AS-200 via IOSv-1 will be used to get to the same prefix — everybody is happy because we can ping stuff.
The problem though, is that even a layman with no knowledge of networks or routing would look at this situation and think ‘that seems a bit rubbish’ especially considering that the basic cost of each of those routers (in a large scale environment) might cost as much as $1million – it seems a bit lame how they can’t make better use of paths.
Surely there has to be a simple way to make better use of paths? First – lets look at why the network has converged in such a way, starting with the route-reflector (ASR-9kv)
So it’s pretty easy to see the reason why the path through AS-300 has been selected, with two competing routes, the BGP path selection process works through each of the attributes of the routes before it finds a difference, to select a winner;
1; Weight (no weight configured anywhere on the network other than defaults)
2: Local-preference (both routes have the default of 100)
3: Prefer locally originated routes (both identical, neither are locally originated – they are received from the RR)
4: Prefer shortest AS-Path (both paths lengths are identical)
5: Prefer lowest origin code (both routes have the same origin of IGP)
6: Prefer the lowest MED (both MED values are unconfigured and 0)
7: Prefer eBGP paths over iBGP (IOSv-2 and IOSv-5 receive both paths as iBGP from the RR)
8: Prefer the path with the lowest IGP metric (Bingo! the path via IOSv-6 on AS-300 has a IGP next-hop metric of 21, vs the path via IOSv-1 with it’s IGP next-hop metric of 22)
The problem here, is that once the route-reflector has made this decision – other alternate paths can’t be used in any way at all, because as everyone knows – BGP only normally advertises best-paths, so any other routes received by the route-reflector go no further and aren’t advertised to the network.
In the case of this lab, the only reason this has happened is because one edge router is only slightly closer than another to the route-reflector, so the route-reflector has gone ahead and made the decision for everyone, despite the obvious fact that from a packet forwarding and latency perspective – IOSv-2 has a suboptimal path, it would be much better if IOSv-2 went via IOSv-1 rather than all the way through IOSv-6 to get to 22.214.171.124/32
The diagram with the ISIS metrics imposed shows the simplicity of the problem;
If we had 1000 edge routers on this network, every single one of them would select the path through IOSv-6 in AS-300 – where IOSv-1 wouldn’t receive a single packet of egress traffic., apart from anything it sends locally, (because eBGP routes are preferred over iBGP)
The problem with IGPs in service-provider networks, is that they’re difficult to tweak at the best of times, even if we made them the same – the RR would still only advertise a single route, based on the next decision in the BGP path selection process (oldest path followed by RID) <yes we know add-paths exists, but that’s not without issues 🙂 >
If we start to manipulate the metrics, that normally has the undesirable result of moving lots of traffic from one link to another – which makes management and planning difficult.
My personal approach would normally be to try and stick to good design, in order to prevent this sort of behaviour. An obvious and simple method and one that’s normally employed in larger ISPs is to have route-reflectors that are pop based in a hierarchy, that is route-reflector clients are always served by a route-reflector that’s closest to them – that way the IGP next-hop costs will always be lower, than relying on a centralised route-reflector that’s buried in the middle of the network, somewhere behind 20 P routers.
For example in the below change to the design, IOSv-1 and IOSv-6 each have their own local route-reflector (RR1 and RR2), in this case each RR is metrically closer to the edge-router it serves, meaning that if the BGP tiebreaker happens to fall on the IGP next-hop cost, the closest value will always be chosen.
The problem with the above design, is that whilst it’s simpler from a protocols perspective – it ends up being much more expensive and eventually more complex in the long run. If I have 500x POPs that’s a lot of route-reflectors and a more complex hierarchy, along with longer convergence times – but then again with 500x POPs, I’d also have many other issues to contend with.
In smaller networks with perhaps a pair of centralised route-reflectors, we can use BGP-ORR (optimal route reflection) to employ some of the information held inside the IGP LSA database to assist BGP in making a better routing decision.
This is possible because as we all know – with link-state IGPs such as ISIS or OSPF, they each hold a full live state of all links and all paths in the network, so it makes sense to hook into this information, rather than having BGP act in isolation and compute a suboptimal path.
More information on the draft is given below;
So – I’ll go ahead with the existing topology and configure BGP-ORR on the route-reflector only, and we’ll look at how the routing has changed;
A reminder of the topology;
A quick look at the BGP configuration on ARS9kv;
Before we go over the configuration, lets look at the results on IOSv-1 and IOSv-5 (recall from a few pages up, that previously both routers had picked the route via IOSv-6 (AS-300)
Notice how IOSv-2 and IOSv-5 have each selected their closest peering router (IOSv-1 and IOSv-6) respectively, to get to 126.96.36.199/32, instead of everything going via IOSv-6, as illustrated below;
For ye of little faith – a traceroute confirms the newer optimised best path from IOSv-2 and IOSv-5 – both routers choose their closest exit (1 hop away)
So with the configuration applied, how does BGP-ORR actually work?
It all boils down to perspective, that is – rather than the route-reflector making a decision based purely on it’s own information such as it’s own IGP cost to the next-hop. Using BGP-ORR the route-reflector can ‘hook’ into the LSA database and check the IGP next-hop cost from the perspective of the RR client, rather than the RR itself.
This is possible with IGPs because IGPs generally contain a full database of link states that are distributed to all devices running the IGP, which ultimately means we can put the route-reflector anywhere in the network using BGP-ORR. Because we can ‘hack’ the protocol to make a calculation from the perspective of wherever we choose, rather than the current location.
The below diagram illustrates it as simply as possible in the current topology for IOSv-2 only;
In the above diagram, ASR-9kv decides the best path using IOSv-2’s cost to IOSv-1, by looking at the ISIS database in the same way that IOSv-2 looks at it, or from the perspective of IOSv-2.
If we look at the ISIS routes on IOSv-2, followed by the BGP-ORR policy on the route-reflector, we can see that the route-reflector uses the very same costs.
Essentially, the ISIS costs are copied and pasted from the IGP database into the BGP-ORR database, so that the route-reflector can use this information in it’s path selection process.
Lets have a quick review of the route-reflector config;
The first portion of the configuration, we specify the root device that we want to use to compute the IGP cost, in the case of which in the case of IOSv-2 is R2 – 192.168.0.2 in the config. We can also specify secondary and tertiary devices.
Under the RR neighbour configuration, we apply ‘optimal-route-reflection’ along with the policy we configured previously, in the case of IOSv-2 for neighbour 192.168.0.2
Lastly, under ISIS we need to configure ‘distribute BGP-LS‘ this essentially tells ISIS to distribute it’s information into BGP, more information on BGP-LS (BGP Link-State) see my previous blog post on Segment-routing and ODL
As a conclusion, I think BGP-ORR is a useful addition to the protocol – I’ve certainly worked on networks where it would make sense to implement, unfortunately it only seems to exist in a few trains of code on certain devices. In this lab example – I was using Cisco VIRL spun up under Vagrant on packet.net, where the XR9kv router is the only one that supports BGP-ORR, but I have seen it recently on JUNOS.
As for potential downsides of BGP-ORR, in larger networks it could become quite complicated to design, where you have lots of different routers that all need to be balanced correctly, and in larger networks having centralised route-reflectors can be a big downside and distributed RR designs may work better.
I’d also be interested to see how well BGP-ORR converges in networks with larger LSA databases and BGP tables.
Bye for now! 🙂