BGP Flowspec redirect with ExaBGP

I’ve been busy as hell since the summer and haven’t had much time to work on blog posts – but it’s all been good work! I also got a new job working for Riot Games (makers of the world’s largest online multiplayer game, League of Legends), which has been totally fantastic.

This post is about BGP Flowspec – specifically, how we can now more easily redirect traffic to a scrubbing appliance. It’s common for a device such as an Arbor TMS, or some other type of filtering box, to be installed close to the network edge; it could be a Linux box full of filters, a DPI box, anything that might be useful for traffic verification or enforcement.

In the event of a DDoS attack, it’s possible to redirect suspect traffic, or traffic to a specific victim host, through an appliance where it can be dropped or permitted.

Traditionally this has been done with layer-3 VPNs: ingress traffic from the internet is punted into a “Dirty VRF” and forced through a mitigation appliance, where it’s either dropped or permitted – permitted traffic returns back into the same router, but in a new “Clean VRF”.

It looks something like this;

[Diagram: traffic punted through Dirty and Clean VRFs via the mitigation appliance]

  • DDoS traffic from Lizardsquad ingresses through the edge router, aimed at the victim 1.2.3.4/32
  • A BGP host route of 1.2.3.4/32 is injected into the edge router via the Dirty VRF with a next-hop of the mitigation appliance
  • RIB Groups or route leaking is used to punt traffic aimed at 1.2.3.4 from GRT into the Dirty VRF
  • Suspect traffic is either dropped or forwarded back into the same edge router via the “Clean VRF”
  • It flows towards the destination, where it’s then leaked back into GRT ahead of reaching the final destination

There are many different permutations of this design, but the main flaw is the reliance on having to provision a Clean VRF everywhere, along with the route-leaking.

The only real reason for the existence of the Clean VRF in this scenario is to prevent a routing loop forming on the edge router – if traffic is returned back into GRT, or back through the Dirty VRF, it’ll encounter the 1.2.3.4/32 mitigation route and be looped back into the mitigation appliance for infinity;

[Diagram: routing loop through the mitigation appliance]

It’s always seemed a bit of a waste to me to have to provision VPNs everywhere, simply because we can’t put clean traffic back into the same router without causing a huge routing loop – unless we resort to horrific things like policy-routing, we’re stuck with regular forwarding logic.

The only other alternative is to route clean traffic back into a physically different router that doesn’t hold the mitigation route; that way the returned clean traffic can just follow the regular path to the destination;

[Diagram: clean traffic returned via a separate router]

Obviously, the above scenario only really applies if you absolutely can’t run L3VPNs in your network, but still need DDoS mitigation.

Thankfully, now that BGP Flowspec is widely available, we can simplify everything and have a much more streamlined design.

Flowspec is standardised under RFC 5575 https://tools.ietf.org/html/rfc5575

The main principle of Flowspec is actually based on policy-routing, in that we can apply match criteria to ingress traffic; packets that match any of the specified criteria can then be subject to specific actions – in pretty much exactly the same way as with policy-routing, with one major difference: we can program it through BGP rather than through the CLI.

There are some obvious benefits to using BGP for this task;

  • Most networks and their operators already run and understand BGP – turning on an additional AFI/SAFI to support flow routes is pretty easy
  • It’s far easier to automate – programmatic networking using APIs to inject routes is far less laborious than policy-routing, or creating gigantic clunky scripts (see the sketch after this list)
  • Flowspec supports new extended “redirect” communities that allow a router to automatically forward traffic directly to a different IP next-hop, or directly to a VRF, without requiring much configuration
  • There are a number of open source BGP daemons that are fully programmable and support Flowspec – such as ExaBGP and GoBGP, and they’re also free!
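ExaBGP makes that automation point very concrete: you reference a script from a “process” stanza in the ExaBGP config, and anything the script prints to stdout is treated as an API command. Below is a minimal sketch of a helper that injects the same flow route we’ll build later in this post – the timings are made up, and the API syntax mirrors the config syntax, so treat it as illustrative rather than definitive;

#!/usr/bin/env python3
# Illustrative ExaBGP API helper: ExaBGP launches this via a "process"
# stanza in its config, and treats anything printed to stdout as an
# API command.
import sys
import time

time.sleep(5)  # crude: give the BGP session a moment to establish

# Announce a flow route matching the suspect source, redirecting it
# into the "Dirty VRF" (route-target 666:666 in the lab further down)
sys.stdout.write(
    'announce flow route { match { source 30.30.30.100/32; } '
    'then { redirect 666:666; } }\n'
)
sys.stdout.flush()

# Stay alive - if the process exits, ExaBGP withdraws its routes
while True:
    time.sleep(60)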

Like with policy-routing, and indeed QoS, there’s a whole host of specific criteria that we can use to match packets;

  • Type 1 – IPv4 / IPv6 Destination prefix
  • Type 2 – IPv4 / IPv6 Source Prefix
  • Type 3 – IP protocol
  • Type 4 – Source / Destination Port
  • Type 5 – Destination Port
  • Type 6 – Source Port
  • Type 7 – ICMP Type
  • Type 8 – ICMP Code
  • Type 9 – TCP Flags
  • Type 10 – Packet Length
  • Type 11 – DSCP
  • Type 12 – Fragment encoding

Likewise, once we’ve matched our packet – there’s a number of highly useful things that we can do to it using new BGP extended communities;

  • Type 0x8006 – Drop, or police; (traffic-rate 0, or traffic-rate <rate>)
  • Type 0x8007 – Traffic action; (apply sampling)
  • Type 0x8008 – Redirect to VRF; (punt traffic into a VRF based on route-target)
  • Type 0x8009 – Traffic marking; (Set a DSCP value)
  • Type 0x0800 – Redirect to IP NH; (creates a policy that forces traffic towards the specified next-hop) currently supported on Cisco and Nokia – but not Juniper 😦

The beauty of Flowspec is that all of this is done directly in hardware – if you’re using modern silicon there’s practically no hit to performance; even with small packets you should be able to run Flowspec rules at line rate. But as always, make sure you read the docs, test it AND speak to your vendor 😉

Furthermore, the configuration is really quite basic – once you’ve enabled the BGP Flowspec AFI/SAFI and have a working session, simply go ahead and inject your mitigation routes – most of the work and config takes place on the controller.

Let’s take a quick look at the lab topology;

Cisco / Juniper mix, assume basic ISIS/MPLS internal connectivity with iBGP between loopbacks, all other basic settings at default.

[Diagram: lab topology]

Let’s take a look at the basic ExaBGP config;

(ExaBGP can be installed from git; https://github.com/Exa-Networks/exabgp)

[Screenshot: ExaBGP configuration]
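It boils down to something like the following sketch (ExaBGP 3.x syntax – the session addresses and AS number are assumptions based on the lab);

neighbor 10.10.10.1 {                    # Edge1 (assumed loopback)
    router-id 10.10.10.100;
    local-address 10.10.10.100;          # the ExaBGP controller (assumed)
    local-as 65001;
    peer-as 65001;

    family {
        ipv4 flow;                       # enable the Flowspec AFI/SAFI
    }

    flow {
        route mitigate-lizardsquad {
            match {
                source 30.30.30.100/32;
            }
            then {
                redirect 666:666;        # the Dirty VRF route-target
            }
        }
    }
}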

Pretty simple stuff;

  • The neighbour statements take care of basic BGP session establishment
  • The family “ipv4 flow” statement enables the Flowspec AFI/SAFI
  • The match block holds the match criteria;
    • Match anything from 30.30.30.100/32
  • The then block attaches an extended community to the flow route advertisement;
    • redirect 666:666 corresponds to the “Dirty VRF” route-target on the edge router

Now let’s take a look at the relevant config snippets on “edge1”, the router which will receive and install the flow route from ExaBGP;

[Screenshot: edge1 BGP and Flowspec configuration]
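Roughly speaking, it looks something like this (the neighbour addresses and the ON-RAMP community value are assumptions – the EXABGP group is the interesting part);

protocols {
    bgp {
        group IBGP {                         # standard internal peering
            type internal;
            local-address 10.10.10.1;
            neighbor 10.10.10.2;
        }
        group UPSTREAM {                     # eBGP to the upstream in AS1
            type external;
            peer-as 1;
            neighbor 10.1.1.1;
        }
        group EXABGP {                       # session to the ExaBGP controller
            type internal;
            local-address 10.10.10.1;
            family inet {
                unicast;
                flow {
                    no-validate FSPEC;       # skip validation for matching routes
                }
            }
            neighbor 10.10.10.100;
        }
    }
}
policy-options {
    policy-statement FSPEC {
        term 1 {
            from community ON-RAMP;
            then accept;
        }
    }
    community ON-RAMP members 65001:666;     # value assumed
}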

  • The internal group is the standard iBGP internal peering
  • The upstream group is the eBGP peering to the upstream router in AS1
  • The final group runs an internal iBGP session between the edge1 router and the ExaBGP controller;
    • Family inet “flow” is enabled
    • It references a policy-statement called FSPEC, which matches the extended community carried by the flow routes we wish to redirect into the “Dirty VRF”
    • The “no-validate” knob disables the flow-route validation procedure for anything matching that policy
    • The community “ON-RAMP” is used to match the flow routes coming from ExaBGP, tying them to the policy

The “Dirty VRF” configuration is shown below; essentially it’s just a VRF with the same route-target that ExaBGP is advertising flow routes for, plus a default route that punts all traffic directly at the mitigation appliance;

[Screenshot: Dirty VRF routing-instance configuration]
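As a sketch, the instance looks something like this (the interface and appliance next-hop are assumptions);

routing-instances {
    DIRTY-VRF {
        instance-type vrf;
        interface ge-0/0/0.0;                          # towards the mitigation appliance
        route-distinguisher 666:666;
        vrf-target target:666:666;                     # matches the redirect community
        routing-options {
            static {
                route 0.0.0.0/0 next-hop 10.66.66.2;   # the appliance (assumed)
            }
        }
    }
}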

There’s also one really important behaviour that should be understood when employing Flowspec, as it differs between vendors.

Remember near the start of the post, where I talked about the problem of routing loops – where traffic matching a mitigation route will loop for infinity if it’s routed back into the same device?

This occurs on a Juniper router because enabling Flowspec implicitly applies the Flowspec filter to every single interface on the router – so if your packet re-enters the router on any interface, it’ll match the flow route every time and have the same action applied to it.

To get around this problem, Juniper added the ability to exclude interfaces from Flowspec processing; the config looks like this;

[Screenshot: Flowspec interface exclusion configuration]
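In sketch form (the group number is arbitrary);

routing-options {
    flow {
        interface-group 10 exclude;          # group 10 is exempt from Flowspec
    }
}
interfaces {
    ge-0/0/1 {
        unit 0 {
            family inet {
                filter {
                    group 10;                # put the clean-return interface in group 10
                }
            }
        }
    }
}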

In the above snippet, we’re excluding interface ge-0/0/1 from any form of further Flowspec processing or filtering – this allows return traffic to flow naturally southbound towards Edge2 inside GRT.

Note: this is not needed on some other platforms, such as the Nokia 7750, where Flowspec is embedded inside a packet filter – so Flowspec is only ever applied to whichever interfaces that filter is applied to, rather than to every single interface on the router, as is the case with Juniper. Always read the documentation – especially Nokia’s, as they have a tendency to completely change things from one release to the next 😀

Let’s see it in action;

I’m using the Ostinato traffic generator inside Eve-NG to send a small amount of traffic from the external generator in AS1, behind the Cisco router “peering”, sourced from the IP address 30.30.30.100, to the internal endpoint in AS65001 behind the Cisco router “Edge2”, targeted at 192.168.100.2.

Traffic flows normally from north to south;

[Screenshot: Ostinato traffic generator]

If we look at the mitigation interface (ge-0/0/0) on Edge1, we can see that nothing is being punted to the mitigation device – traffic is just flowing normally out of the southbound ge-0/0/2 interface towards Edge2;

[Screenshot: interface statistics – traffic flowing normally north to south]

So let’s go ahead and turn on the Flowspec advertisement – firstly by switching on the ExaBGP process and advertising the flow route to Edge1;

[Screenshot: ExaBGP starting up and connecting to Edge1]

So we can see some relevant information, such as the connection parameters and the successful connection to Edge1; let’s look on Edge1 to see what’s being received;

[Screenshot: flow route received on Edge1]

The flow route received from ExaBGP contains some interesting information;

  • Line 15 shows the prefix as *,30.30.30.100 – this means anything “from” 30.30.30.100, as opposed to normal destination-based routing; if you remember from the ExaBGP config, we’re matching on the source of the traffic for inspection
  • Line 26 specifies the Announcement bits as 0-Flow
  • Line 28 includes the special Flowspec “redirect community” of 666:666

Flowspec on Juniper uses the firewall filter architecture; it doesn’t add any configuration to the device, instead it automatically constructs a firewall filter from the flow route advertisement;

[Screenshot: dynamically created Flowspec firewall filter]

We can see that the firewall filter has been added and is matching packets, so those packets should now be flowing out of the “Dirty VRF” towards the mitigation appliance (remember, before they were flowing straight down from north to south);

[Screenshot: interface statistics – traffic redirected via the mitigation appliance]

We can see that the traffic rate on ge-0/0/0 has gone up to 98pps, meaning we’re sending traffic towards the mitigation appliance. That very same traffic returns clean on ge-0/0/1.

In the case of the lab, the mitigation appliance is just a Cisco CSR with a default route pointing back at the ge-0/0/1 interface on the Juniper – but whether it’s an Arbor TMS or a Linux box full of filters, the principle remains the same.
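In config terms the appliance side really is that trivial – something like this on the CSR, where the next-hop is the assumed address of Edge1’s ge-0/0/1;

ip route 0.0.0.0 0.0.0.0 10.66.67.1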

In many cases, vendor-supported DDoS mitigation appliances such as Arbor SP/TMS have built-in support for Flowspec, so you can trigger mitigation flow routes automatically if certain things get detected.

Previously, such appliances had no way of redirecting traffic other than advertising hundreds or thousands of /32 victim host routes in order to break regular best-path routing. Thanks to Flowspec, we can now specify traffic sources, ports, protocols – you name it.

It’s also pretty easy to rate-limit: using the “rate-limit” extended community in the ExaBGP config, we can create a packet policer directly in the forwarding plane, all built from a BGP advertisement;

[Screenshot: ExaBGP rate-limit configuration]
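The “then” block of the flow route simply changes to something like this (same caveats as before – a sketch in ExaBGP 3.x syntax);

flow {
    route mitigate-lizardsquad {
        match {
            source 30.30.30.100/32;
        }
        then {
            rate-limit 1000;        # 1000 bytes per second, per RFC 5575
        }
    }
}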

In the above config, I’ve simply removed the redirect community and replaced it with “rate-limit”; this instead encodes the rate-limit action into the flow route advertisement – in this case 1000 bytes per second.

If we go back to the router and look at the Flowspec filter;

[Screenshot: flow route and firewall filter showing the traffic-rate action]

We can see the Flowspec BGP flow route being received, carrying the “traffic-rate:0:1000” community.

We can also see that the firewall filter now has two entries, one for matching the source and a second for rate-limiting traffic that exceeds the configured speed, but there’s a mismatch – can you see it?

If you look closely at the Firewall filter – it’s converted the rate to “8K” rather than the ExaBGP configured value of 1k.

The reason for this is that there appears to be a mismatch between the RFC and the Juniper implementation: RFC 5575 specifies that the rate should be expressed in bytes per second, however Juniper convert that value to bps (bits per second) inside their firewall filter;

From the RFC 5575;

The remaining 4 octets carry the rate information in IEEE floating point [IEEE.754.1985] format, units being bytes per second. A traffic-rate of 0 should result on all traffic for the particular flow to be discarded.

The fact that Juniper convert the value to bits per second isn’t a problem – 1000 bytes per second x 8 is 8000 bits per second, hence the “8K” – it’s just something to be aware of, and it explains the differences in the show commands.

Hope you found this useful 🙂

BGP Optimal-route-reflection (BGP-ORR)

It’s been a while since my last update – I’ve been quite busy! But I thought I’d do a post on something BGP related, as everyone loves BGP!

There’s an interesting addition to BGP route-reflection that’s found its way into a few trains of code on Juniper and Cisco (I assume it’s on others too), which attempts to solve one of the annoying issues that occur when centralised route-reflectors are used.

It all boils down to the basics of path selection, in networks where the setup is relatively simple and identical routes are received at different edge routers within the network – similar to anycast routing.

Consider the below lab topology;

[Diagram: lab topology]

The core of the network is super simple: basic ISIS, basic LDP/MPLS, with ASR-9kv as an out-of-path route-reflector and iBGP adjacencies configured along the green arrows. The red arrows signify the eBGP sessions between AS 100–200 and AS 100–300, where IOSv-7 and IOSv-8 advertise an identical 6.6.6.6/32 route. IOSv-3 and IOSv-4 are just P routers running ISIS/LDP only, for the sake of adding a few hops.

With everything configured as defaults, let’s look at the path selection;

 iosv-1#show ip bgp
BGP table version is 6, local router ID is 192.168.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
* i 6.6.6.6/32 192.168.0.4 0 100 0 300 i
*>                    10.0.0.2 0 0 200 i

 

iosv-2#show ip bgp
BGP table version is 29, local router ID is 192.168.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i

iosv-5#sh ip bgp
BGP table version is 27, local router ID is 192.168.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i

iosv-6#sh ip bgp
BGP table version is 6, local router ID is 192.168.0.4
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*> 6.6.6.6/32 10.1.0.2 0 0 300 i

 

 

If we take a brief look at the situation, specifically at IOSv-2 and IOSv-5, it’s pretty easy to see what’s happening: the network has basically converged to prefer the path via AS-300 to get to 6.6.6.6/32.

For many networks this sort of thing isn’t a problem – there’s a functional, working path to 6.6.6.6/32, and if the edge router connected to AS-300 fails, the path through AS-200 via IOSv-1 will be used to get to the same prefix – everybody is happy because we can ping stuff.

[Diagram: all traffic exiting via AS-300]

The problem, though, is that even a layman with no knowledge of networks or routing would look at this situation and think ‘that seems a bit rubbish’ – especially considering that each of those routers (in a large-scale environment) might cost as much as $1 million; it seems a bit lame that they can’t make better use of the available paths.

Surely there has to be a simple way to make better use of paths? First, let’s look at why the network has converged in such a way, starting with the route-reflector (ASR-9kv);

RP/0/RP0/CPU0:iosxrv9000-1#sh bgp
Thu Jun 1 21:46:12.042 UTC
BGP router identifier 192.168.0.5, local AS number 100
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0xe0000000 RD version: 41
BGP main routing table version 41
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i – internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i – IGP, e – EGP, ? – incomplete
Network Next Hop Metric LocPrf Weight Path
* i6.6.6.6/32 192.168.0.1 0 100 0 200 i
*>i                 192.168.0.4 0 100 0 300 i

Processed 1 prefixes, 2 paths
RP/0/RP0/CPU0:iosxrv9000-1#show bgp 6.6.6.6/32
Thu Jun 1 21:47:19.015 UTC
BGP routing table entry for 6.6.6.6/32
Versions:
Process bRIB/RIB SendTblVer
Speaker 41 41
Last Modified: Jun 1 20:28:41.601 for 01:18:38
Paths: (2 available, best #2)
Advertised to update-groups (with more than one peer):
0.2
Path #1: Received by speaker 0
Not advertised to any peer
200, (Received from a RR-client)
192.168.0.1 (metric 22) from 192.168.0.1 (192.168.0.1)
Origin IGP, metric 0, localpref 100, valid, internal, group-best
Received Path ID 0, Local Path ID 0, version 0
Path #2: Received by speaker 0
Advertised to update-groups (with more than one peer):
0.2
300, (Received from a RR-client)
192.168.0.4 (metric 21) from 192.168.0.4 (192.168.0.4)
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 1, version 23
RP/0/RP0/CPU0:iosxrv9000-1#

 

So it’s pretty easy to see why the path through AS-300 has been selected: with two competing routes, the BGP path selection process works through each attribute of the routes until it finds a difference, in order to select a winner;

1: Weight (no weight configured anywhere on the network other than defaults)

2: Local-preference (both routes have the default of 100)

3: Prefer locally originated routes (neither route is locally originated – both are received from the RR)

4: Prefer shortest AS-Path (both path lengths are identical)

5: Prefer lowest origin code (both routes have the same origin of IGP)

6: Prefer the lowest MED (both MED values are unconfigured and 0)

7: Prefer eBGP paths over iBGP (IOSv-2 and IOSv-5 receive both paths as iBGP from the RR)

8: Prefer the path with the lowest IGP metric (Bingo! The path via IOSv-6 in AS-300 has an IGP next-hop metric of 21, vs the path via IOSv-1 with its IGP next-hop metric of 22)

The problem here is that once the route-reflector has made this decision, other alternate paths can’t be used in any way at all – as everyone knows, BGP normally only advertises best paths, so any other routes received by the route-reflector go no further and aren’t advertised to the network.

In the case of this lab, the only reason this has happened is that one edge router is slightly closer than the other to the route-reflector, so the route-reflector has gone ahead and made the decision for everyone – despite the obvious fact that, from a packet forwarding and latency perspective, IOSv-2 has a suboptimal path; it would be much better if IOSv-2 went via IOSv-1, rather than all the way through IOSv-6, to get to 6.6.6.6/32.

The diagram with the ISIS metrics imposed shows the simplicity of the problem;

[Diagram: topology with ISIS metrics]

If we had 1000 edge routers on this network, every single one of them would select the path through IOSv-6 in AS-300 – and IOSv-1 wouldn’t receive a single packet of egress traffic, apart from anything it sends locally (because eBGP routes are preferred over iBGP).

The problem with IGPs in service-provider networks is that they’re difficult to tweak at the best of times, and even if we made the metrics identical, the RR would still only advertise a single route, based on the next decision in the BGP path selection process (oldest path, followed by RID). Yes, we know add-paths exists, but that’s not without its own issues 🙂

If we start to manipulate the metrics, that normally has the undesirable result of moving lots of traffic from one link to another – which makes management and planning difficult.

My personal approach would normally be to try and stick to good design in order to prevent this sort of behaviour. An obvious and simple method, and one that’s normally employed in larger ISPs, is to have POP-based route-reflectors in a hierarchy – route-reflector clients are always served by a route-reflector that’s closest to them. That way the IGP next-hop costs will always be lower than relying on a centralised route-reflector that’s buried in the middle of the network, somewhere behind 20 P routers.

For example, in the below change to the design, IOSv-1 and IOSv-6 each have their own local route-reflector (RR1 and RR2). In this case each RR is metrically closer to the edge router it serves, meaning that if the BGP tiebreaker happens to fall on the IGP next-hop cost, the closest exit will always be chosen.

[Diagram: per-POP route-reflector hierarchy]

The problem with the above design, is that whilst it’s simpler from a protocols perspective – it ends up being much more expensive and eventually more complex in the long run. If I have 500x POPs that’s a lot of route-reflectors and a more complex hierarchy, along with longer convergence times – but then again with 500x POPs, I’d also have many other issues to contend with.

In smaller networks with perhaps a pair of centralised route-reflectors, we can use BGP-ORR (optimal route reflection) to employ some of the information held inside the IGP LSA database to assist BGP in making a better routing decision.

This is possible because, as we all know, link-state IGPs such as ISIS or OSPF each hold a full live state of all links and all paths in the network – so it makes sense to hook into this information, rather than having BGP act in isolation and compute a suboptimal path.

More information on the draft is given below;

https://tools.ietf.org/html/draft-ietf-idr-bgp-optimal-route-reflection-13

So – I’ll go ahead with the existing topology and configure BGP-ORR on the route-reflector only, and we’ll look at how the routing has changed;

A reminder of the topology;

[Diagram: lab topology (reminder)]

A quick look at the BGP configuration on ASR-9kv;

RP/0/RP0/CPU0:iosxrv9000-1#show run router bgp
Thu Jun 1 22:57:50.574 UTC
router bgp 100
bgp router-id 192.168.0.5
address-family ipv4 unicast
optimal-route-reflection r2 192.168.0.2
optimal-route-reflection r5 192.168.0.3
optimal-route-reflection r7 192.168.0.1
optimal-route-reflection r8 192.168.0.4

!
! iBGP
! iBGP clients
neighbor 192.168.0.1
remote-as 100
description RR client iosv-1
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r7
route-reflector-client
!
!
neighbor 192.168.0.2
remote-as 100
description RR client iosv-2
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r2
route-reflector-client
!
!
neighbor 192.168.0.3
remote-as 100
description RR client iosv-5
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r5
route-reflector-client
!
!
neighbor 192.168.0.4
remote-as 100
description RR client iosv-6
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r8
route-reflector-client
!
!
!

RP/0/RP0/CPU0:iosxrv9000-1# sh run router isis
Thu Jun 1 22:57:56.223 UTC
router isis 100
is-type level-2-only
net 49.1921.6800.0005.00
distribute bgp-ls
address-family ipv4 unicast
metric-style wide
!
interface Loopback0
passive
circuit-type level-2-only
address-family ipv4 unicast
!
!
interface GigabitEthernet0/0/0/0
point-to-point
address-family ipv4 unicast
!
!
!
RP/0/RP0/CPU0:iosxrv9000-1#

 

Before we go over the configuration, let’s look at the results on IOSv-2 and IOSv-5 (recall from a few pages up that both routers had previously picked the route via IOSv-6 in AS-300);

iosv-2#sh ip bgp
BGP table version is 32, local router ID is 192.168.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.1 0 100 0 200 i
iosv-2#

iosv-5#sh ip bgp
BGP table version is 29, local router ID is 192.168.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i
iosv-5#

 

Notice how IOSv-2 and IOSv-5 have each selected their closest peering router (IOSv-1 and IOSv-6 respectively) to get to 6.6.6.6/32, instead of everything going via IOSv-6, as illustrated below;

[Diagram: optimised paths after enabling BGP-ORR]

For ye of little faith – a traceroute confirms the new optimised best path from IOSv-2 and IOSv-5; both routers choose their closest exit (one hop away);

iosv-2#traceroute 6.6.6.6 source lo0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
1 10.0.128.1 1 msec 0 msec 0 msec
2 10.0.0.2 1 msec * 0 msec

iosv-2#

iosv-5#trace 6.6.6.6 source lo0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
1 10.0.0.14 1 msec 0 msec 0 msec
2 10.1.0.2 1 msec * 0 msec

iosv-5#

 

So with the configuration applied, how does BGP-ORR actually work?

It all boils down to perspective: rather than the route-reflector making a decision based purely on its own information, such as its own IGP cost to the next-hop, with BGP-ORR the route-reflector can ‘hook’ into the LSA database and check the IGP next-hop cost from the perspective of the RR client, rather than from the RR itself.

This is possible because link-state IGPs contain a full database of link states, distributed to all devices running the IGP – which ultimately means that with BGP-ORR we can put the route-reflector anywhere in the network, because we can ‘hack’ the protocol to make the calculation from the perspective of wherever we choose, rather than from the RR’s own location.

The below diagram illustrates it as simply as possible in the current topology for IOSv-2 only;

[Diagram: BGP-ORR path computation from IOSv-2’s perspective]

In the above diagram, ASR-9kv decides the best path using IOSv-2’s cost to IOSv-1, by looking at the ISIS database in the same way that IOSv-2 looks at it – that is, from the perspective of IOSv-2.

If we look at the ISIS routes on IOSv-2, followed by the BGP-ORR policy on the route-reflector, we can see that the route-reflector uses the very same costs.

iosv-2#show ip route isis
Codes: L – local, C – connected, S – static, R – RIP, M – mobile, B – BGP
D – EIGRP, EX – EIGRP external, O – OSPF, IA – OSPF inter area
N1 – OSPF NSSA external type 1, N2 – OSPF NSSA external type 2
E1 – OSPF external type 1, E2 – OSPF external type 2
i – IS-IS, su – IS-IS summary, L1 – IS-IS level-1, L2 – IS-IS level-2
ia – IS-IS inter area, * – candidate default, U – per-user static route
o – ODR, P – periodic downloaded static route, H – NHRP, l – LISP
a – application route
+ – replicated route, % – next hop override, p – overrides from PfR

Gateway of last resort is not set

10.0.0.0/8 is variably subnetted, 8 subnets, 3 masks
i L2 10.2.128.0/30 [115/12] via 10.2.0.1, 01:25:02, GigabitEthernet0/2
192.168.0.0/32 is subnetted, 7 subnets
i L2 192.168.0.1 [115/11] via 10.0.128.1, 01:35:19, GigabitEthernet0/1
i L2 192.168.0.3 [115/13] via 10.2.0.1, 01:34:49, GigabitEthernet0/2
[115/13] via 10.0.128.1, 01:34:49, GigabitEthernet0/1
i L2 192.168.0.4 [115/12] via 10.2.0.1, 01:34:49, GigabitEthernet0/2
i L2 192.168.0.5 [115/2] via 10.2.0.1, 01:25:02, GigabitEthernet0/2
i L2 192.168.0.9 [115/12] via 10.0.128.1, 01:35:09, GigabitEthernet0/1
i L2 192.168.0.10 [115/11] via 10.2.0.1, 01:35:09, GigabitEthernet0/2
iosv-2#

RP/0/RP0/CPU0:iosxrv9000-1#show orrspf database r2
Fri Jun 2 10:20:25.187 UTC

ORR policy: r2, IPv4, RIB tableid: 0xe0000002
Configured root: primary: 192.168.0.2, secondary: NULL, tertiary: NULL
Actual Root: 192.168.0.2, Root node: 1921.6800.0002.0000

Prefix                                  Cost
192.168.0.1                           11
192.168.0.2                           10
192.168.0.3                           13
192.168.0.4                           12
192.168.0.5                            2
192.168.0.9                           12
192.168.0.10                         11

Number of mapping entries: 8
RP/0/RP0/CPU0:iosxrv9000-1#

 

Essentially, the ISIS costs are copied from the IGP database into the BGP-ORR database, so that the route-reflector can use this information in its path selection process.

Let’s have a quick review of the route-reflector config;

router bgp 100
bgp router-id 192.168.0.5
address-family ipv4 unicast
optimal-route-reflection r2 192.168.0.2
optimal-route-reflection r5 192.168.0.3
optimal-route-reflection r7 192.168.0.1
optimal-route-reflection r8 192.168.0.4

!
! iBGP
! iBGP clients
neighbor 192.168.0.1
remote-as 100
description RR client iosv-1
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r7
route-reflector-client
!
!
neighbor 192.168.0.2
remote-as 100
description RR client iosv-2
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r2
route-reflector-client
!
!
neighbor 192.168.0.3
remote-as 100
description RR client iosv-5
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r5
route-reflector-client
!
!
neighbor 192.168.0.4
remote-as 100
description RR client iosv-6
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r8
route-reflector-client
!
!
!

RP/0/RP0/CPU0:iosxrv9000-1#sh run router isis
Fri Jun 2 10:35:00.027 UTC
router isis 100
is-type level-2-only
net 49.1921.6800.0005.00
distribute bgp-ls
address-family ipv4 unicast
metric-style wide
!
interface Loopback0
passive
circuit-type level-2-only
address-family ipv4 unicast
!
!
interface GigabitEthernet0/0/0/0
point-to-point
address-family ipv4 unicast
!
!
!

 

In the first portion of the configuration, we specify the root device we want to use to compute the IGP cost – in the case of IOSv-2, that’s the policy “r2”, rooted at 192.168.0.2. We can also specify secondary and tertiary roots.

Under the RR neighbour configuration, we apply ‘optimal-route-reflection’ along with the policy configured previously – in the case of IOSv-2, that’s policy “r2” under neighbour 192.168.0.2.

Lastly, under ISIS we need to configure ‘distribute bgp-ls’; this essentially tells ISIS to distribute its information into BGP. For more information on BGP-LS (BGP Link-State), see my previous blog post on Segment-routing and ODL.

In conclusion, I think BGP-ORR is a useful addition to the protocol – I’ve certainly worked on networks where it would make sense to implement, though unfortunately it only seems to exist in a few trains of code on certain devices. In this lab example I was using Cisco VIRL spun up under Vagrant on packet.net, where the XRv9k router is the only one that supports BGP-ORR, but I have seen it recently on JUNOS.
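For reference, the JUNOS flavour hangs off the BGP group configuration – something along these lines (a sketch from memory, so check the knob names against your release; the addresses reuse this lab);

protocols {
    bgp {
        group RR-CLIENTS {
            type internal;
            cluster 192.168.0.5;
            optimal-route-reflection {
                igp-primary 192.168.0.2;    # compute best paths from this root
                igp-backup 192.168.0.3;     # assumed fallback root
            }
            neighbor 192.168.0.2;
        }
    }
}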

As for potential downsides of BGP-ORR: in larger networks it could become quite complicated to design, where you have lots of different routers that all need to be balanced correctly – and in larger networks centralised route-reflectors can themselves be a big downside, where distributed RR designs may work better.

I’d also be interested to see how well BGP-ORR converges in networks with larger LSA databases and BGP tables.

Bye for now! 🙂

Segment-routing + Opendaylight SDN + Pathman-SR + PCEP


This is a second technical post related to segment-routing; I did a basic introduction to this technology on Juniper MX here;

https://tgregory.org/2016/08/13/segment-routing-on-junos-the-basics/

For this post I’m looking at something a bit more advanced and fun – performing Segment-routing traffic-engineering using an SDN controller, in this case OpenDaylight Beryllium – an open source SDN controller with some very powerful functionality.

This post will use Cisco ASR9kv virtual routers running on a Cisco UCS chassis, mostly because Cisco currently have the leading-edge support for segment-routing – Juniper seem to be lagging behind a bit on that front!

Let’s check out the topology;

[Diagram: lab topology]

It’s a pretty simple scenario – all of the routers in the topology are configured in the following way;

  • XRV-1 and XRV-8; PE routers (BGP IPv4)
  • XRV-2 to XRV-7; P routers (ISIS segment-routing)
  • XRV-4 is an in-path RR connecting to the ODL controller

[Diagram: BGP-LS feed from XRV-4 to the ODL controller]

The first thing to look at here is BGP-LS “BGP Link-state” which is an extension of BGP that allows IGP information (OSPF/ISIS) to be injected into BGP, this falls conveniently into the world of centralised path computation – where we can use a controller of some sort to look at the network’s link-state information, then compute a path through the network. The controller can then communicate that path back down to a device within the network using a different method, ultimately resulting in an action of some sort – for example, signalling an LSP.

Some older platforms, such as HP Route Analytics, enabled you to discover the live IGP topology by running ISIS or OSPF directly with a network device; however, IGPs tend to be very intense protocols and require additional effort to support within an application, rather than a traditional router. IGPs are also usually limited to the domain within which they operate – for example, if we have a large network with many different IGP domains or inter-domain MPLS, the IGP’s view becomes much more limited. BGP, on the other hand, can bridge many of these gaps, and when programmed with the ability to carry IGP information it can be quite useful.

The next element is PCE, or Path Computation Element, which generally involves two components;

  • PCC – Path computation client – In the case of this lab network, a PCC would be a PE router
  • PCE – Path computation element – In the case of this lab network, the PCE would be the ODL controller

These elements communicate using PCEP (Path computation element protocol) which allows a central controller (in this case ODL) to essentially program the PCC with a path – for example, by signalling the actual LSP;

Basic components;

[Diagram: basic PCE/PCC components]

Basic components plus an application (in this case Pathman-SR) which can compute and signal an LSP from ODL to the PCC (XRV-1);

[Diagram: PCE components plus Pathman-SR signalling an LSP]

In the above example, an opensource application (in this case Pathman-SR) is using the information about the network topology obtained via BGP-LS and PCE, stored inside ODL – to compute and signal a Segment-routing LSP from XRV-1 to XRV-8, via XRV3, XRV5 and XRV7.

Before we look at the routers, let’s take a quick look at OpenDaylight – general information can be found here; https://www.opendaylight.org I’m running Beryllium 0.4.3, which is the same version as Cisco’s dCloud demo – it’s a relatively straightforward install process, and I’m running my copy on top of a standard Ubuntu install.

[Screenshot: ODL YANG UI]

From inside ODL you can use the YANG UI to query information held inside the controller, which is essentially a much easier way of querying the data, using presets – for example, I can view the link-state topology learnt via BGP-LS pretty easily;

[Screenshot: link-state topology learnt via BGP-LS]

There’s a whole load of functionality possible with ODL, from BGP-Flowspec to Openflow to LSP provisioning; for now we’re just going to keep it basic – all of this is open source and requires quite a bit of “playing” to get working.

Let’s take a look at provisioning some segment-routing TE tunnels – first a reminder of the diagram;

[Diagram: lab topology (reminder)]

And an example of some configuration – XRv-1

ISIS;

  1. router isis CORE-SR
  2.  is-type level-2-only
  3.  net 49.0001.0001.0001.00
  4.  address-family ipv4 unicast
  5.   metric-style wide
  6.   mpls traffic-eng level-2-only
  7.   mpls traffic-eng router-id Loopback0
  8.   redistribute static
  9.   segment-routing mpls
  10.  !
  11.  interface Loopback0
  12.   address-family ipv4 unicast
  13.    prefix-sid index 10
  14.   !
  15.  !
  16.  interface GigabitEthernet0/0/0/0.12
  17.   point-to-point
  18.   address-family ipv4 unicast
  19.   !
  20.  !
  21.  interface GigabitEthernet0/0/0/1.13
  22.   point-to-point
  23.   address-family ipv4 unicast
  24.   !
  25.  !
  26. !

 

A relatively simple ISIS configuration, with nothing remarkable going on;

  • Line 9 enables Segment-Routing for ISIS
  • Line 13 injects a SID (Segment Identifier) index of 10 into ISIS for Loopback0 – with the default SRGB starting at 16000, index 10 yields label 16010 (the same pattern shows up later, where indexes 20, 40, 70 and 80 become labels 16020, 16040, 16070 and 16080)

The other aspect of the configuration which generates a bit of interest is the PCE and mpls traffic-eng configuration;

  1. mpls traffic-eng
  2.  pce
  3.   peer source ipv4 49.1.1.1
  4.   peer ipv4 192.168.3.250
  5.   !
  6.   segment-routing
  7.   logging events peer-status
  8.   stateful-client
  9.    instantiation
  10.   !
  11.  !
  12.  logging events all
  13.  auto-tunnel pcc
  14.   tunnel-id min 1 max 99
  15.  !
  16.  reoptimize timers delay installation 0
  17. !

 

  • Line 1 enables basic traffic-engineering. An important point to note: to do MPLS-TE for segment-routing you don’t need to turn on TE on every single interface like you would with RSVP – so long as ISIS TE is enabled and the TE router-id is set (as in the ISIS config above), that’s enough
  • Lines 2, 3 and 4 connect the router, from its loopback address, to the OpenDaylight controller and enable PCE
  • Lines 6 through 9 specify the segment-routing parameters for TE
  • Line 14 specifies the tunnel-ID range for automatically generated tunnels – tunnels spawned by the controller

Going back to the diagram, XRv-4 was also configured for BGP-LS;

  1. router bgp 65535
  2.  bgp router-id 49.1.1.4
  3.  bgp cluster-id 49.1.1.4
  4.  address-family ipv4 unicast
  5.  !
  6.  address-family link-state link-state
  7.  !
  8.  neighbor 49.1.1.1
  9.   remote-as 65535
  10.   update-source Loopback0
  11.   address-family ipv4 unicast
  12.    route-reflector-client
  13.   !
  14.  !
  15.  neighbor 49.1.1.8
  16.   remote-as 65535
  17.   update-source Loopback0
  18.   address-family ipv4 unicast
  19.    route-reflector-client
  20.   !
  21.  !
  22.  neighbor 192.168.3.250
  23.   remote-as 65535
  24.   update-source GigabitEthernet0/0/0/5
  25.   address-family ipv4 unicast
  26.    route-reflector-client
  27.   !
  28.   address-family link-state link-state
  29.    route-reflector-client
  30.   !
  31.  !
  32. !

 

  • Line 6 enables the BGP Link-state AFI/SAFI
  • Lines 8 through 19 are standard BGP RR config for IPv4
  • Line 22 is the BGP peer for the Opendaylight controller
  • Line 28 turns on the link-state AFI/SAFI for Opendaylight

Also of interest on XRv-4 is the ISIS configuration;

  1. router isis CORE-SR
  2.  is-type level-2-only
  3.  net 49.0001.0001.0004.00
  4.  distribute bgp-ls
  5.  address-family ipv4 unicast
  6.   metric-style wide
  7.   mpls traffic-eng level-2-only
  8.   mpls traffic-eng router-id Loopback0
  9.   redistribute static
  10.   segment-routing mpls
  11.  !
  12.  interface Loopback0
  13.   address-family ipv4 unicast
  14.    prefix-sid index 40
  15.   !
  16.  !

 

  • Line 4 copies the ISIS link-state information into BGP link-state

If we do a “show bgp link-state link-state” we can see the information taken from ISIS, injected into BGP – and subsequently advertised to Opendaylight;

  1. RP/0/RP0/CPU0:XRV9k-4#show bgp link-state link-state
  2. Thu Dec  1 21:40:44.032 UTC
  3. BGP router identifier 49.1.1.4, local AS number 65535
  4. BGP generic scan interval 60 secs
  5. Non-stop routing is enabled
  6. BGP table state: Active
  7. Table ID: 0x0   RD version: 78
  8. BGP main routing table version 78
  9. BGP NSR Initial initsync version 78 (Reached)
  10. BGP NSR/ISSU Sync-Group versions 0/0
  11. BGP scan interval 60 secs
  12. Status codes: s suppressed, d damped, h history, * valid, > best
  13.               i – internal, r RIB-failure, S stale, N Nexthop-discard
  14. Origin codes: i – IGP, e – EGP, ? – incomplete
  15. Prefix codes: E link, V node, T IP reacheable route, u/U unknown
  16.               I Identifier, N local node, R remote node, L link, P prefix
  17.               L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node
  18.               a area-ID, l link-ID, t topology-ID, s ISO-ID,
  19.               c confed-ID/ASN, b bgp-identifier, r router-ID,
  20.               i if-address, n nbr-address, o OSPF Route-type, p IP-prefix
  21.               d designated router address
  22.    Network            Next Hop            Metric LocPrf Weight Path
  23. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]]/328
  24.                       0.0.0.0                                0 i
  25. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0002.00]]/328
  26.                       0.0.0.0                                0 i
  27. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0003.00]]/328
  28.                       0.0.0.0                                0 i
  29. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0004.00]]/328
  30.                       0.0.0.0                                0 i
  31. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0005.00]]/328
  32.                       0.0.0.0                                0 i
  33. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0006.00]]/328
  34.                       0.0.0.0                                0 i
  35. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0007.00]]/328
  36.                       0.0.0.0                                0 i
  37. *> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0008.00]]/328
  38.                       0.0.0.0                                0 i
  39. *> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]][R[c65535][b0.0.0.0][s0001.0001.0002.00]][L[i10.10.12.0][n10.10.12.1]]/696
  40.                       0.0.0.0                                0 i
  41. *> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]][R[c65535][b0.0.0.0][s0001.0001.0003.00]][L[i10.10.13.0][n10.10.13.1]]/696
  42.                       0.0.0.0                                0 i
  43. *> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0002.00]][R[c65535][b0.0.0.0][s0001.0001.0001.00]][L[i10.10.12.1][n10.10.12.0]]/696

 

With this information, we can use an additional app on top of OpenDaylight to provision some segment-routing LSPs. In this case I’m going to use something from Cisco DevNet called Pathman-SR – it essentially connects to ODL using REST to program the network. Pathman can be found here; https://github.com/CiscoDevNet/pathman-sr
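Under the hood, Pathman is simply driving the PCEP RPCs that ODL’s BGPCEP project exposes over RESTCONF. As a rough sketch (Beryllium-era API, default RESTCONF port and credentials assumed, and the ERO/SID arguments deliberately omitted), the equivalent raw call looks something like this;

import requests

# Hypothetical sketch of ODL's add-lsp RPC - Pathman builds the full
# "arguments" stanza (the ERO with SR SID sub-objects) for us.
url = ("http://192.168.3.250:8181/restconf/operations/"
       "network-topology-pcep:add-lsp")
body = {
    "input": {
        "node": "pcc://49.1.1.1",
        "name": "XRV9k-1 -> XRV9k-8",
        "network-topology-ref": (
            "/network-topology:network-topology/"
            "topology[topology-id=\"pcep-topology\"]"
        ),
        # "arguments": {...}  # ERO / SID details omitted here
    }
}
resp = requests.post(url, json=body, auth=("admin", "admin"))
print(resp.status_code, resp.text)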

Once it’s installed and running, simply browse to its URL (http://192.168.3.250:8020/cisco-ctao/apps/pathman_sr/index.html) and you’re presented with a nice view of the network;

[Screenshot: Pathman-SR network view]

From here, it’s possible to compute a path from one point to another, then signal that LSP on the network using PCEP. In this case, let’s program a path from XRv9k-1 to XRv9k-8, via XRV9k-2, 4 and 7;

[Screenshot: Pathman-SR computed path]

Once Pathman has calculated the path, hit deploy – Pathman sends the path to ODL, which then connects via PCEP to XRV9kv-1 and provisions the LSP;

[Screenshot: Pathman-SR deployed LSP]

Once this is done, let’s check XRV9k-1 to see the SR-TE tunnel;

  1. RP/0/RP0/CPU0:XRV9k-1#sh ip int bri
  2. Thu Dec  1 22:05:38.799 UTC
  3. Interface                      IP-Address      Status          Protocol Vrf-Name
  4. Loopback0                      49.1.1.1        Up              Up       default
  5. tunnel-te1                     49.1.1.1        Up              Up       default
  6. GigabitEthernet0/0/0/0         unassigned      Up              Up       default
  7. GigabitEthernet0/0/0/0.12      10.10.12.0      Up              Up       default
  8. GigabitEthernet0/0/0/1         unassigned      Up              Up       default
  9. GigabitEthernet0/0/0/1.13      10.10.13.0      Up              Up       default
  10. GigabitEthernet0/0/0/2         100.1.0.1       Up              Up       default
  11. GigabitEthernet0/0/0/3         192.168.3.248   Up              Up       default
  12. GigabitEthernet0/0/0/4         unassigned      Shutdown        Down     default
  13. GigabitEthernet0/0/0/5         unassigned      Shutdown        Down     default
  14. GigabitEthernet0/0/0/6         unassigned      Shutdown        Down     default
  15. MgmtEth0/RP0/CPU0/0            unassigned      Shutdown        Down     default

 

We can see from the output of “show ip int brief” on line 5 that interface tunnel-te1 has been created – but it’s nowhere in the config;

  1. RP/0/RP0/CPU0:XRV9k-1#sh run interface tunnel-te1
  2. Thu Dec  1 22:07:41.409 UTC
  3. % No such configuration item(s)
  4. RP/0/RP0/CPU0:XRV9k-1#

 

PCE signalled LSPs never appear in the configuration – they’re created, managed and deleted by the controller. It is possible to manually add an LSP and then delegate it to the controller, but that’s beyond the scope here (that’s technical speak for “I couldn’t make it work” 🙂 )

Let’s check out the details of the SR-TE tunnel;

  1. RP/0/RP0/CPU0:XRV9k-1#show mpls traffic-eng tunnels
  2. Thu Dec  1 22:09:56.983 UTC
  3. Name: tunnel-te1  Destination: 49.1.1.8  Ifhandle:0x8000064 (auto-tunnel pcc)
  4.   Signalled-Name: XRV9k-1 -> XRV9k-8
  5.   Status:
  6.     Admin:    up Oper:   up   Path:  valid   Signalling: connected
  7.     path option 10, (Segment-Routing) type explicit (autopcc_te1) (Basis for Setup)
  8.     G-PID: 0x0800 (derived from egress interface properties)
  9.     Bandwidth Requested: 0 kbps  CT0
  10.     Creation Time: Thu Dec  1 22:01:21 2016 (00:08:37 ago)
  11.   Config Parameters:
  12.     Bandwidth:        0 kbps (CT0) Priority:  7  7 Affinity: 0x0/0xffff
  13.     Metric Type: TE (global)
  14.     Path Selection:
  15.       Tiebreaker: Min-fill (default)
  16.       Protection: any (default)
  17.     Hop-limit: disabled
  18.     Cost-limit: disabled
  19.     Path-invalidation timeout: 10000 msec (default), Action: Tear (default)
  20.     AutoRoute: disabled  LockDown: disabled   Policy class: not set
  21.     Forward class: 0 (default)
  22.     Forwarding-Adjacency: disabled
  23.     Autoroute Destinations: 0
  24.     Loadshare:          0 equal loadshares
  25.     Auto-bw: disabled
  26.     Path Protection: Not Enabled
  27.     BFD Fast Detection: Disabled
  28.     Reoptimization after affinity failure: Enabled
  29.     SRLG discovery: Disabled
  30.   Auto PCC:
  31.     Symbolic name: XRV9k-1 -> XRV9k-8
  32.     PCEP ID: 2
  33.     Delegated to: 192.168.3.250
  34.     Created by: 192.168.3.250
  35.   History:
  36.     Tunnel has been up for: 00:08:37 (since Thu Dec 01 22:01:21 UTC 2016)
  37.     Current LSP:
  38.       Uptime: 00:08:37 (since Thu Dec 01 22:01:21 UTC 2016)
  39.   Segment-Routing Path Info (PCE controlled)
  40.     Segment0[Node]: 49.1.1.2, Label: 16020
  41.     Segment1[Node]: 49.1.1.4, Label: 16040
  42.     Segment2[Node]: 49.1.1.7, Label: 16070
  43.     Segment3[Node]: 49.1.1.8, Label: 16080
  44. Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
  45. Displayed 1 up, 0 down, 0 recovering, 0 recovered heads
  46. RP/0/RP0/CPU0:XRV9k-1#

 

Points of interest;

  • Line 4 shows the name of the LSP as configured by Pathman
  • Line 7 shows that the signalling is Segment-routing via autoPCC
  • Lines 33 and 34 show the tunnel was generated by the Opendaylight controller
  • Line 39 shows the LSP is PCE controlled
  • Lines 40 through 43 show the programmed path
  • Line 44 shows XRV9k-1 as the SR-TE headend

Lines 40-43 show some of the main benefits of segment-routing: we have a programmed traffic-engineered path through the network, but with far less control-plane overhead than if we’d done this with RSVP-TE. For example, let’s look at the routers in the path (XRV-2, XRV-4 and XRV-7);

  1. RP/0/RP0/CPU0:XRV9k-2#show mpls traffic-eng tunnels
  2. Thu Dec  1 22:14:38.855 UTC
  3. RP/0/RP0/CPU0:XRV9k-2#
  4. RP/0/RP0/CPU0:XRV9k-4#show mpls traffic-eng tunnels
  5. Thu Dec  1 22:14:45.915 UTC
  6. RP/0/RP0/CPU0:XRV9k-4#
  7. RP/0/RP0/CPU0:XRV9k-7#show mpls traffic-eng tunnels
  8. Thu Dec  1 22:15:17.873 UTC
  9. RP/0/RP0/CPU0:XRV9k-7#

 

Essentially – the path that the SR-TE tunnel takes contains no real control-plane state, this is a real advantage for large networks as the whole thing is much more efficient.

The only pitfall here is that whilst we’ve generated a segment-routed LSP, like all MPLS-TE tunnels we need to tell the router to put traffic into it – normally we do this with autoroute-announce or a static route. At this time OpenDaylight doesn’t support the PCEP extensions to actually configure a static route, so we still need to manually put traffic into the tunnel – this is fixed in Cisco’s Open SDN Controller and WAE (WAN Automation Engine);

  1. router static
  2.  address-family ipv4 unicast
  3.   49.1.1.8/32 tunnel-te1
  4.  !
  5. !

 

I regularly do testing and development work with some of the largest ISPs in the UK, and something that regularly comes up is customers running a traditional full mesh of RSVP LSPs. If you have 500 edge routers, that’s around 250k LSPs being signalled end to end, and the “P” routers in the network need to signal and maintain all of that state. When I do testing in these sorts of environments, it’s not uncommon to see nasty problems with routing-engine CPUs when links fail, as those 250k LSPs end up having to be re-signalled – indeed this very subject came up in a conversation at LINX95 last week.

With Segment-routing, the traffic-engineered path is basically encoded into the packet with MPLS labels – the only real difficulty is that it requires the use of more labels in the packet, but once the hardware can deal with the label-depth, I think it’s a much better solution than RSVP, it’s more efficient and it’s far simpler.

From my perspective, all I’ve really shown here is a basic LSP provisioning tool, but it’s nice to be able to get the basics working. In the future I hope to get my hands on a segment-routing enabled version of Northstar, or Cisco’s Open SDN Controller (which is Cisco’s productised version of ODL) 🙂

 

EVPN vs PBB-EVPN

This is the next in a series of technical posts relating to EVPN – in particular PBB-EVPN (Provider Backbone Bridging Ethernet VPN) – and attempts to explain the basic setup, application and problems solved within a very large layer-2 environment. Readers new to EVPN may wish to start with my first post, which gives examples of the most basic implementation of regular EVPN;

https://tgregory.org/2016/06/04/evpn-in-action-1/

Regular EVPN is without a doubt the future of MPLS-based multipoint layer-2 VPN connectivity; it adds the highly scalable BGP-based control-plane that’s been used to good effect in layer-3 VPNs for over a decade. It has much better mechanisms for handling BUM (broadcast, unknown-unicast and multicast) traffic, can properly do active-active layer-2 forwarding, and because EVPN PEs all synchronise their ARP tables with one another, you can design large layer-2/layer-3 networks that stretch across numerous data centres or POPs and move machines around at layer-2 or layer-3 without having to re-address or re-provision – you can learn how to do this here;

https://tgregory.org/2016/06/11/inter-vlan-routing-mobility/

But like any technology it can never be perfect from day one. EVPN contains more layer-2 and layer-3 functionality than just about any single protocol developed so far, but it comes at a cost – control-plane resources. Consider the following scenario;

[Diagram: three data centres with 1k hosts each, MX PEs running EVPN over an ISIS/LDP core]

The above is an extremely simple example of a network with 3x data centres, each data centre with 1k hosts sat behind it. The 3x “P” routers in the centre of the network are running ISIS and LDP only; each edge router (MX-1 through MX-3) is running basic EVPN with all hosts in a single VLAN.

A quick recap of the basic config (configs are identical on all 3x PE routers, with the exception of IP addresses);

  1. interfaces {
  2.     ge-1/0/0 {
  3.         flexible-vlan-tagging;
  4.         encapsulation flexible-ethernet-services;
  5.         unit 100 {
  6.             encapsulation vlan-bridge;
  7.             vlan-id 100;
  8.         }
  9.     }
  10. }
  11. routing-instances {
  12.     EVPN-100 {
  13.         instance-type virtual-switch;
  14.         route-distinguisher 1.1.1.1:100;
  15.         vrf-target target:100:100;
  16.         protocols {
  17.             evpn {
  18.                 extended-vlan-list 100;
  19.             }
  20.         }
  21.         bridge-domains {
  22.             VL-100 {
  23.                 vlan-id 100;
  24.                 interface ge-1/0/0.100;
  25.             }
  26.         }
  27.     }
  28. }
  29. protocols {
  30.     bgp {
  31.         group fullmesh {
  32.             type internal;
  33.             local-address 10.10.10.1;
  34.             family evpn {
  35.                 signaling;
  36.             }
  37.             neighbor 10.10.10.2;
  38.             neighbor 10.10.10.3;
  39.         }
  40.     }
  41. }
This is only a small-scale setup using MX-5 routers, but it’s easy to use this example to project the problem – that is, the EVPN control-plane is quite resource intensive.

With 1k hosts per site, that equates to 3k EVPN BGP routes that need to be advertised – which isn’t that bad. However, if you’re a large service-provider spanning many data-centres across a whole country, or even multiple countries, 3k routes is a tiny amount – you may well have hundreds of thousands or millions of EVPN routes spread across hundreds or thousands of edge routers.

Having hundreds of thousands, or millions, of routes is a problem that can normally be dealt with easily if things are IPv4 or IPv6, in that we can rely on summarising these routes down into blocks or aggregates in BGP to make things much more sensible.

However, in the layer-2 world it’s not possible to summarise MAC addresses, as they’re mostly completely random. In regular EVPN, if I have 1 million hosts, that’s going to equate to 1 million EVPN MAC routes which get advertised everywhere – which isn’t going to run very smoothly at all once we start moving hosts around, or have any large-scale failures that require huge numbers of hosts to move from one place to another.

If I spin up the 3x 1k hosts in IXIA, spread across all 3x sites, we can clearly see the amount of EVPN control-plane state being generated and advertised across the network;

  1. tim@MX5-1> show evpn instance extensive    
  2. Instance: EVPN-100
  3.   Route Distinguisher: 1.1.1.1:100
  4.   Per-instance MAC route label: 300640
  5.   MAC database status                Local  Remote
  6. Total MAC addresses:              1000    2000
  7.     Default gateway MAC addresses:       0       0
  8.   Number of local interfaces: 1 (1 up)
  9.     Interface name  ESI                            Mode             Status
  10.     ge-1/0/0.100    00:00:00:00:00:00:00:00:00:00  single-homed     Up
  11.   Number of IRB interfaces: 0 (0 up)
  12.   Number of bridge domains: 1
  13.     VLAN ID  Intfs / up    Mode             MAC sync  IM route label
  14.     100          1   1     Extended         Enabled   300704
  15.   Number of neighbors: 2
  16.     10.10.10.2
  17.       Received routes
  18.         MAC address advertisement:              1000
  19.         MAC+IP address advertisement:           0
  20.         Inclusive multicast:                    1
  21.         Ethernet auto-discovery:                0
  22.     10.10.10.3
  23.       Received routes
  24.         MAC address advertisement:              1000
  25.         MAC+IP address advertisement:           0
  26.         Inclusive multicast:                    1
  27.         Ethernet auto-discovery:                0
  28.   Number of ethernet segments: 0

 

And obviously all of this information is injected into BGP – all of which needs to be advertised and distributed;

  1. tim@MX5-1> show bgp summary
  2. Groups: 1 Peers: 2 Down peers: 0
  3. Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
  4. bgp.evpn.0
  5.                     2002       2002          0          0          0          0
  6. Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped…
  7. 10.10.10.2              100       1246       1410       0       1     5:51:04 Establ
  8.   bgp.evpn.0: 1001/1001/1001/0
  9.   EVPN-100.evpn.0: 1001/1001/1001/0
  10.   __default_evpn__.evpn.0: 0/0/0/0
  11. 10.10.10.3              100       1243       1393       0       0     5:50:51 Establ
  12.   bgp.evpn.0: 1001/1001/1001/0
  13.   EVPN-100.evpn.0: 1001/1001/1001/0
  14.   __default_evpn__.evpn.0: 0/0/0/0
  15. tim@MX5-1>

 

With our 3k host setup it's obvious that things will be just fine – and I imagine a good 20-30k hosts in a well designed network, running on routers with big CPUs and memory, would probably be ok. However, I suspect that in a large network already carrying the full BGP table, plus L3VPNs and everything else, adding an additional 90-100k EVPN routes might not be such a good idea.

So what’s available for large-scale layer-2 networks?

Normally in large layer-2 networks, QinQ is enough to provide sufficient scale for most large enterprises. With QinQ (802.1ad) we simply multiplex VLANs by adding a second VLAN service-tag (S-TAG), which allows us to represent many different customer tags, or C-TAGs – because the size of the dot1q header allows for 4096 different VLAN IDs, adding a second tag gives us 4096 x 4096 possible combinations, which equals nearly 16.8 million;

As a quick recap on QinQ – in the below example, all frames from Customer 1 for VLANs 1-4094 are encapsulated with an S-TAG of 100, whilst all frames from Customer 2, for the same VLAN range of 1-4094, are encapsulated in S-TAG 200;

qinq

The problem with QinQ (PBN – provider bridged networks) is that it's essentially limited to those 16.8 million tag combinations, and to only 4094 distinct S-TAGs within any one switching domain – which sounds like a lot, however if you're a large service provider with tens of millions of consumers, businesses and big-data products, it isn't very much.
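
To put the numbers side by side, a quick back-of-the-envelope sketch (the 4094 figure accounts for the two reserved VLAN IDs);

  1. # Tag-space arithmetic: dot1q vs QinQ vs the PBB I-SID (covered below)
  2. DOT1Q_BITS, ISID_BITS = 12, 24
  3. 
  4. usable_vlans = 2 ** DOT1Q_BITS - 2     # 4094: IDs 0x000 and 0xFFF are reserved
  5. qinq_combos  = usable_vlans ** 2       # S-TAG x C-TAG combinations
  6. pbb_isids    = 2 ** ISID_BITS          # 24-bit service identifiers
  7. 
  8. print(f"dot1q VLANs / QinQ S-TAGs : {usable_vlans:,}")
  9. print(f"QinQ S+C combinations     : {qinq_combos:,}")   # ~16.76 million
  10. print(f"PBB I-SIDs                : {pbb_isids:,}")      # ~16.78 million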

Whether you run QinQ across a large switched enterprise network, or break out individual high-bandwidth interfaces into a switch and sell hundreds of leased-line services in a multi-tenant design using VPLS – you're still always going to be limited to those 16.8 million combinations in total, and that's before we mention things like active-active multi-homing, which doesn't work with VPLS.

Another disadvantage is that with QinQ, every device in the network is required to learn all the customer mac addresses – so it’s quite easy to see the scaling problems from the beginning.

For truly massive scale, Big-Data, DCI, provider-transport etc – we need to make the leap to PBB (provider backbone bridging) but what exactly is PBB?

Before we look at PBB-EVPN, we should first take some time to understand what basic PBB is and what problems it solves;

PBB was originally defined as 802.1ah or "mac-in-mac" and is layer-2 end to end. Instead of merely adding another VLAN tag, PBB actually duplicates the mac layer of the customer frame and separates it from the provider domain by encapsulating it in a new set of headers. This allows for complete transparency between the customer network and the provider network.

Instead of multiplexing VLANs, PBB uses a 24-bit I-SID (service ID) – the fact that it's 24 bits gives us an immediate idea of the scale we're talking about here: around 16 million possible services. PBB also introduces the concept of the B-TAG, or backbone tag – this essentially hides the customer source/destination mac addresses behind a backbone entity, removing the requirement for every device in the network to learn every single customer mac-address – analogous in some ways to IP address aggregation, because of the reduction in network state.

Check the below diagram for a basic topology and list of basic PBB terms;

diag1

  • PB = Provider bridge (802.1ad)
  • PEB = Provider edge bridge (802.1ad)
  • BEB = Backbone edge bridge (802.1ah)
  • BCB = Backbone core bridge (802.1ah)

The “BEB” backbone-edge-bridge is the first immediate point of interest within PBB, essentially it forms the boundary between the access-network and the core network and introduces 2x new concepts;

  • I-Component
  • B-Component

The I-Component essentially forms the customer or access facing interface or routing instance, while the B-Component is the backbone facing PBB core instance. The B-Component uses B-MAC (backbone MAC) addressing in order to forward customer frames into the core, based on the newly imposed B-MAC instead of the original S or C VLAN tags and C-MAC (customer MAC) – which would have been the case in a regular PBN QinQ setup.

In this case the "BEB" or backbone edge bridge forms the connectivity between the access and core: on one side it maintains full layer-2 state with the access-network, whilst on the other side it operates only on B-MACs, where enormous services and huge numbers of C-MACs (customer MACs) on the access side can be represented by individual B-MAC addresses on the core side. This obviously drastically reduces the amount of control-plane processing – especially in the core on the "BCB" backbone core bridges, where forwarding is performed using B-MACs only.

In terms of C-MAC and B-MAC it makes it easier if you break the network up into two distinct sets of space, ideally “C-MAC space” and “B-MAC space”

diag2

It’s pretty easy to talk about C-MAC or customer mac addressing as that’s something we’ve all been dealing with since we sent our first ping packet – however B-MACs are a new concept.

Within the PBB network, each "BEB" backbone-edge-bridge has one or more B-MAC identifiers, which are unique across the entire network and can be assigned automatically or statically by design;

diag3

The interesting part starts when we begin looking at the packet flow from one side of the network to the other – in this case we’ll use the above diagram to send a packet from left to right – note how the composition of the frame changes at each section of the network, including the new PBB encapsulated frame;

diag4

If we look at the packet flow from left to right – a series of interesting things happen;

  1. The user on the left hand side sends a regular single-tagged frame with a destination MAC address of "C2", on the far right hand side of the network.
  2. The ingress PEB (provider edge bridge) is performing regular QinQ, where it pushes a new "S-VLAN" onto the frame to create a brand new QinQ frame.
  3. The double-tagged frame traverses the network, where it lands on the first BEB router – here the frame is received and the BEB maps the S-VLAN to its configured I-SID.
  4. The BEB encapsulates the original frame, with its C-VLAN and S-VLAN intact, and adds the new PBB encapsulation – complete with the PBB source and destination B-MACs (B4 and B2) and the I-SID – and forwards it into the core of the network.
  5. The BCBs in the core forward the frame based only on the source and destination B-MAC, and take no notice of the original frame information inside.
  6. The egress BEB strips the PBB header away and forwards the original QinQ frame onto the access network.
  7. Eventually the egress PEB switch pops the S-VLAN and forwards the original frame, with the destination mac of C2, to the user interface.

So that's vanilla PBB in a nutshell – essentially, it's a way of hiding a gigantic number of customer mac-addresses behind a drastically smaller number of backbone mac-addresses, without the devices in the core having to learn and process all of the individual customer state. Combined with the new I-SID service identifier, we get an encapsulation that allows for a huge number of services.
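
Since the encapsulation is the heart of PBB, here's a rough pure-Python sketch of the 802.1ah header stack described above – the B-MACs, B-VID and I-SID values are borrowed from the lab later in this post, the PCP/DEI bits are left at zero, and the FCS is omitted; purely illustrative, not production code;

  1. import struct
  2. 
  3. def mac(s):
  4.     """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
  5.     return bytes(int(octet, 16) for octet in s.split(":"))
  6. 
  7. def pbb_encapsulate(customer_frame, b_da, b_sa, b_vid, isid):
  8.     """Wrap an existing customer Ethernet frame in 802.1ah mac-in-mac headers."""
  9.     hdr = mac(b_da) + mac(b_sa)                         # backbone B-DA + B-SA
  10.     hdr += struct.pack("!HH", 0x88A8, b_vid & 0xFFF)    # B-TAG: S-TAG ethertype + B-VID
  11.     hdr += struct.pack("!HI", 0x88E7, isid & 0xFFFFFF)  # I-TAG: ethertype + 24-bit I-SID
  12.     return hdr + customer_frame                         # original C-MACs + tags untouched
  13. 
  14. # Minimal single-tagged customer frame: C-DA, C-SA, C-TAG (vlan 100), type, dummy payload
  15. customer = (mac("00:00:00:bc:25:2f") + mac("00:00:00:bc:2f:67")
  16.             + struct.pack("!HH", 0x8100, 100) + struct.pack("!H", 0x0800) + b"payload")
  17. 
  18. frame = pbb_encapsulate(customer, b_da="a8:d0:e5:5b:94:60",
  19.                         b_sa="a8:d0:e5:5b:75:c8", b_vid=999, isid=100100)
  20. print(frame.hex())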

But like most things – it’s not perfect.

In 2016 (and for literally the last decade) most modern networks have a simple MPLS core comprising PE and P routers. When it comes to PBB, we need the devices in the core to act as switches (the BCB backbone-core-bridge), performing forwarding decisions based on B-MAC addresses – which is obviously incompatible with a modern MPLS network, where we're switching packets between edge loopback addresses using MPLS labels and recursion.

So the obvious question is – can we replace the BCB element in the middle with MPLS – whilst stealing the huge service scaling properties of the BEB PBB edge?

The answer is yes! By combining PBB with EVPN, we can replace the BCB element of the core, signal the "B-Component" using EVPN BGP signalling, and encapsulate the whole thing inside MPLS using PE and P routers – so that the PBB-EVPN architecture now reflects something we're all a little more used to;

diag5

We now have the vast scale of PBB, combined with the simplicity and elegance of a traditional basic MPLS core network, where the amount of network-wide state information has been drastically reduced. As opposed to regular PBB, which is layer-2 over layer-2, we've moved to a model which is much more like layer-2 over layer-2 over layer-3.

The next question is – how do we configure it? Whilst PBB-EVPN simplifies the control-plane across the core and allows huge numbers of layer-2 services to transit the network in a much simpler manner, it is a little more complex to configure on Juniper MX series routers – but we'll go through it step by step 🙂

Before we look at the configuration – it’s easier to understand if we visualise what’s happening inside the router itself, by breaking the whole thing up into blocks;

diag6

Basically, we break the router up into several blocks – in Juniper, both the customer facing I-Component and the backbone facing B-Component are configured as two separate routing-instances, with each routing-instance containing a bridge-domain. Each bridge-domain is different – the I-Component bridge-domain (BR-I-100) contains the physical tagged interface facing the customer, along with some service-options: the service-type, which is "ELAN" (a multipoint MEF carrier Ethernet standard), and the I-SID that we're going to use to identify the service – in this case "100100" for VLAN 100.

The B-Component also contains a bridge-domain, "BR-B-100100", which forms the backbone facing bridge where the B-MAC is sourced from; it also defines the EVPN PBB options used to signal the core.

These routing-instances are connected together by a pair of special interfaces;

  • PIP – Provider instance port
  • CBP – Customer backbone port

These interfaces join the I-Component and B-Component routing-instances together, and are a bit like the other logical pseudo-interfaces normally found inside Juniper routers, used to connect certain logical elements together.

Let's take a look at the configuration of the routing-instances on the Juniper MX;

  • Note, PBB-EVPN seems to have been supported only in very recent versions of Junos, these devices are MX-5 routers running Junos 16.1R2.11
  • All physical connectivity is done on TRIO via a “MIC-3D-20GE-SFP” card
  1. PBB-EVPN-B-COMP {
  2.     instance-type virtual-switch;
  3.     interface cbp0.1000;
  4.     route-distinguisher 1.1.1.1:100;
  5.     vrf-target target:100:100;
  6.     protocols {
  7.         evpn {
  8.             control-word;
  9.             pbb-evpn-core;
  10.             extended-isid-list 100100;
  11.         }
  12.     }
  13.     bridge-domains {
  14.         BR-B-100100 {
  15.             vlan-id 999;
  16.             isid-list 100100;
  17.             vlan-id-scope-local;
  18.         }
  19.     }
  20. }
  21. PBB-EVPN-I-COMP {
  22.     instance-type virtual-switch;
  23.     interface pip0.1000;
  24.     bridge-domains {
  25.         BR-I-100 {
  26.             vlan-id 100;
  27.             interface ge-1/0/0.100;
  28.         }
  29.     }
  30.     pbb-options {
  31.         peer-instance PBB-EVPN-B-COMP;
  32.     }
  33.     service-groups {
  34.         CUST1 {
  35.             service-type elan;
  36.             pbb-service-options {
  37.                 isid 100100 vlan-id-list 100;
  38.             }
  39.         }
  40.     }
  41. }

 

If we look at the configuration line by line, it works out as follows – for the B-Component

  • Line 2 sets the instance-type, while lines 4 and 5 are the normal EVPN route-distribution properties (RD/RT etc.)
  • Line 3 brings the customer backbone port into the B-Component routing-instance – this logically links the B-Component to the I-Component
  • Lines 8 and 9 specify the control-word and switch on the PBB-EVPN-CORE feature
  • Line 10 allows only an I-Component service with an I-SID of 100100 to be processed by the B-Component
  • Lines 13-17 are bridge-domain options
  • Line 15 references VLAN-999 – this is currently unused, but needs to be configured; any value can be used here
  • Line 16 specifies the I-SID mapping

For the I-Component;

  • Line 23 adds the PIP (provider instance port) to the I-Component routing-instance
  • Lines 24-29 are standard bridge-domain settings, which add the physical customer-facing interface (ge-1/0/0.100) for VLAN-ID 100 to the bridge-domain and routing-instance
  • Lines 30 and 31 activate the PBB service and reference the "PBB-EVPN-B-COMP" routing-instance as the peer-instance for the service; this is how the I-Component is linked to the B-Component
  • Lines 33-37 reference the service group, in this case "CUST1", with the service-type set as ELAN (the MEF standard for multipoint layer-2 connectivity); the I-Component I-SID for this service, for VLAN-100, is 100100 as defined on line 37

Let's examine the PIP and CBP interfaces;

  1. interfaces {
  2.     ge-1/0/0 {
  3.         flexible-vlan-tagging;
  4.         encapsulation flexible-ethernet-services;
  5.         unit 100 {
  6.             encapsulation vlan-bridge;
  7.             vlan-id 100;
  8.         }
  9.     }
  10. cbp0 {
  11.         unit 1000 {
  12.             family bridge {
  13.                 interface-mode trunk;
  14.                 bridge-domain-type bvlan;
  15.                 isid-list all;
  16.             }
  17.         }
  18.     }
  19. pip0 {
  20.         unit 1000 {
  21.             family bridge {
  22.                 interface-mode trunk;
  23.                 bridge-domain-type svlan;
  24.                 isid-list all-service-groups;
  25.             }
  26.         }
  27.     }
  28. }

 

  • Lines 2 through 9 represent a standard gigabit Ethernet interface configured with vlan-bridge encapsulation for VLAN 100 – standard stuff we're all used to seeing;
  • Lines 10 through 15 represent the CBP (customer backbone port) interface for unit 1000, where the bridge-domain-type is set to bvlan (backbone vlan) and any I-SID is accepted; this connects the B-Component to the I-Component
  • Lines 19 through 24 represent the PIP (provider instance port) for the same unit 1000, as an svlan bridge – using an I-SID list covering any service-group
  • The pip0 interface connects the I-Component to the B-Component

A lot to remember so far! Another point worth mentioning is that PBB-EVPN doesn't work unless the router is set to "enhanced-ip" mode;

  1. chassis {
  2.     network-services enhanced-ip;
  3. }

 

And – like our regular EVPN configuration from previous blog posts – we just have basic EVPN signalling turned on inside BGP;

  1. protocols {
  2.     bgp {
  3.         group mesh {
  4.             type internal;
  5.             local-address 10.10.10.1;
  6.             family evpn {
  7.                 signaling;
  8.             }
  9.             neighbor 10.10.10.2;
  10.             neighbor 10.10.10.3;
  11.         }
  12.     }
  13. }

 

The fact that we only need basic BGP EVPN signalling turned on is a real advantage, as it keeps the service in line with modern MPLS based offerings – a simple LDP/IGP core, with all the edge services (L3VPN, multicast, IPv4, IPv6, L2VPN) controlled by a single protocol: BGP, which we all know and love.

So I have this configuration running on 3x MX5 routers – the configurations from the above snippets are identical across all 3x routers, with the obvious exception of IP addresses. Let's recap the diagram;

diag7

With the configuration applied, I'll go ahead and spawn 3000 hosts using IXIA; each host is an emulated machine sat on the same layer-2 /16 subnet, with VLAN-100 spanned across all three sites in the PBB-EVPN – basically, just imagine MX-1, MX-2 and MX-3 as switches with 1000 laptops plugged into each one 🙂 To keep things simple, I'm only going to use single-tagged frames directly from IXIA and send a full mesh of traffic to all sites, with each stream being 300Mbps – with 3x sites, that's 900Mbps of traffic in total;

diag8

Traffic appears to be successfully forwarded end to end without delay – let's check some of the show commands on MX-1;

We can clearly see 3000 hosts inside the bridge-mac table on MX5-1;

  1. tim@MX5-1> show bridge mac-table count
  2. 3 MAC address learned in routing instance PBB-EVPN-B-COMP bridge domain BR-B-100100
  3.   MAC address count per learn VLAN within routing instance:
  4.     Learn VLAN ID            MAC count
  5.               999                    3
  6. 3000 MAC address learned in routing instance PBB-EVPN-I-COMP bridge domain BR-I-100
  7.   MAC address count per interface within routing instance:
  8.     Logical interface        MAC count
  9.     ge-1/0/0.100:100              1000
  10.     rbeb.32768                    1000
  11.     rbeb.32769                    1000
  12.   MAC address count per learn VLAN within routing instance:
  13.     Learn VLAN ID            MAC count
  14.               100                 3000
  15. tim@MX5-1>

 

  • Line 2 shows the B-Component, with only 3x B-MAC entries present
  • Line 6 shows the I-Component and all 3000 mac-addresses live on the network – 1000 learnt locally via the directly connected interface, and 2000 learnt via rbeb.32768 and rbeb.32769. The RBEB is the remote backbone-edge-bridge – once frames come in from the EVPN and the PBB headers are popped, the original C-MACs are learnt against the remote B-MAC (sketched below), which is why we see 3000 C-MACs in the I-Component, whilst only 3x B-MACs appear in the B-Component.
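
As a minimal sketch of that learning behaviour (the dict stands in for the real mac-table, with MAC values taken from the outputs in this post) – when a PBB frame arrives from the core, the inner customer source MAC is learnt against the outer backbone source MAC;

  1. # Toy PBB C-MAC learning: the inner C-SA is learnt against the outer B-SA,
  2. # so the core only ever needs to know about a handful of B-MACs
  3. c_mac_table = {}
  4. 
  5. def learn(b_sa, c_sa):
  6.     """Record which remote BEB (B-MAC) a customer MAC lives behind."""
  7.     c_mac_table[c_sa] = b_sa
  8. 
  9. learn("a8:d0:e5:5b:75:c8", "00:00:00:bc:2f:67")   # host behind MX-2's B-MAC
  10. learn("a8:d0:e5:5b:94:60", "00:00:00:bc:3c:65")   # host behind MX-3's B-MAC
  11. 
  12. print(c_mac_table)   # C-MAC -> remote BEB B-MAC, as in 'show bridge mac-table'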

Let's look at the BGP table;

  1. tim@MX5-1> show bgp summary
  2. Groups: 1 Peers: 2 Down peers: 0
  3. Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
  4. bgp.evpn.0
  5.                        4          4          0          0          0          0
  6. Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped…
  7. 10.10.10.2              100       1183       1177       0       0     8:50:22 Establ
  8.   bgp.evpn.0: 2/2/2/0
  9.   PBB-EVPN-B-COMP.evpn.0: 2/2/2/0
  10.   __default_evpn__.evpn.0: 0/0/0/0
  11. 10.10.10.3              100       1183       1179       0       0     8:50:18 Establ
  12.   bgp.evpn.0: 2/2/2/0
  13.   PBB-EVPN-B-COMP.evpn.0: 2/2/2/0
  14.   __default_evpn__.evpn.0: 0/0/0/0
  15. tim@MX5-1> show route protocol bgp table PBB-EVPN-B-COMP.evpn.0
  16. PBB-EVPN-B-COMP.evpn.0: 6 destinations, 6 routes (6 active, 0 holddown, 0 hidden)
  17. + = Active Route, – = Last Active, * = Both
  18. 2:1.1.1.1:100::100100::a8:d0:e5:5b:75:c8/304 MAC/IP
  19.                    *[BGP/170] 08:25:43, localpref 100, from 10.10.10.2
  20.                       AS path: I, validation-state: unverified
  21.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  22. 2:1.1.1.1:100::100100::a8:d0:e5:5b:94:60/304 MAC/IP  
  23.                    *[BGP/170] 08:24:56, localpref 100, from 10.10.10.3
  24.                       AS path: I, validation-state: unverified
  25.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299920
  26. 3:1.1.1.1:100::100100::10.10.10.2/304 IM
  27.                    *[BGP/170] 08:25:47, localpref 100, from 10.10.10.2
  28.                       AS path: I, validation-state: unverified
  29.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  30. 3:1.1.1.1:100::100100::10.10.10.3/304 IM
  31.                    *[BGP/170] 08:24:57, localpref 100, from 10.10.10.3
  32.                       AS path: I, validation-state: unverified
  33.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299920
  34. tim@MX5-1>

 

Here we can quickly see the savings made in memory, CPU and control-plane processing. We have a stretched layer-2 network with 3000 hosts; with regular EVPN we'd by now have 3000 EVPN MAC routes being advertised and received across the network, despite only 3x sites being in play. With PBB-EVPN we instead have just the B-MACs in the BGP table – the 2x remotely learnt ones are visible on lines 18 and 22 above.

Technically we could have a million mac-addresses at each site and – provided the switches could handle that many mac-addresses – we'd still only be advertising the 3x B-MACs across the core from the B-Component, so PBB-EVPN does provide massive scale. It's true that locally we'd still need to learn 1 million C-MACs, but the difference is we don't need to advertise them all back and forth across the network – that state remains local, and is represented by a B-MAC and I-SID for that specific customer or service.
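
To put rough numbers on that claim – a trivial back-of-the-envelope sketch (the million-host figure is the hypothetical worst case from the paragraph above);

  1. # BGP control-plane state: classic EVPN vs PBB-EVPN (hypothetical figures)
  2. sites = 3
  3. hosts_per_site = 1_000_000
  4. 
  5. evpn_mac_routes = sites * hosts_per_site   # one type-2 route per C-MAC
  6. pbb_evpn_routes = sites                    # roughly one B-MAC route per BEB
  7. 
  8. print(f"classic EVPN : {evpn_mac_routes:,} MAC routes in BGP")
  9. print(f"PBB-EVPN     : {pbb_evpn_routes:,} B-MAC routes in BGP")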

We can take a look at the bridge mac-table to see the different mac-addresses in play, for both the B-Component and the I-Component;

  1. tim@MX5-1> show bridge mac-table
  2. MAC flags       (S -static MAC, D -dynamic MAC, L -locally learned, C -Control MAC
  3.     O -OVSDB MAC, SE -Statistics enabled, NM -Non configured MAC, R -Remote PE MAC)
  4. Routing instance : PBB-EVPN-B-COMP
  5.  Bridging domain : BR-B-100100, VLAN : 999
  6.    MAC                 MAC      Logical          NH     RTR
  7.    address              flags    interface        Index  ID
  8.    01:1e:83:01:87:04   DC                        1048575 0      
  9.    a8:d0:e5:5b:75:c8   DC                        1048576 1048576
  10.    a8:d0:e5:5b:94:60   DC                        1048578 1048578
  11. MAC flags (S -static MAC, D -dynamic MAC,
  12.            SE -Statistics enabled, NM -Non configured MAC)
  13. Routing instance : PBB-EVPN-I-COMP
  14.  Bridging domain : BR-I-100, ISID : 100100, VLAN : 100
  15.    MAC                 MAC      Logical                 Remote
  16.    address             flags    interface               BEB address
  17.    00:00:00:bc:25:2f   D        ge-1/0/0.100        
  18.    00:00:00:bc:25:31   D        ge-1/0/0.100        
  19.    00:00:00:bc:25:33   D        ge-1/0/0.100        
  20.    00:00:00:bc:25:35   D        ge-1/0/0.100        
  21.    00:00:00:bc:25:37   D        ge-1/0/0.100        
  22.    00:00:00:bc:25:39   D        ge-1/0/0.100        
  23.    00:00:00:bc:25:3b   D        ge-1/0/0.100        
  24.    00:00:00:bc:25:3d   D        ge-1/0/0.100    
  25.         <omitted>
  26.    00:00:00:bc:2f:67   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  27.    00:00:00:bc:2f:69   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  28.    00:00:00:bc:2f:6b   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  29.    00:00:00:bc:2f:6d   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  30.    00:00:00:bc:2f:6f   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  31.    00:00:00:bc:2f:71   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  32.    00:00:00:bc:2f:73   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  33.         <omitted>
  34.    00:00:00:bc:3c:65   D        rbeb.32769              a8:d0:e5:5b:94:60  
  35.    00:00:00:bc:3c:67   D        rbeb.32769              a8:d0:e5:5b:94:60  
  36.    00:00:00:bc:3c:69   D        rbeb.32769              a8:d0:e5:5b:94:60  
  37.    00:00:00:bc:3c:6b   D        rbeb.32769              a8:d0:e5:5b:94:60  
  38.    00:00:00:bc:3c:6d   D        rbeb.32769              a8:d0:e5:5b:94:60  
  39.    00:00:00:bc:3c:6f   D        rbeb.32769              a8:d0:e5:5b:94:60  
  40.    00:00:00:bc:3c:71   D        rbeb.32769              a8:d0:e5:5b:94:60

 

Because there are 3000x MAC addresses currently on the network, I've omitted most of them so you can see the important differences;

  • Lines 8 through 10 show the B-MAC entries – line 8 is the backbone group address for I-SID 100100 (note the 01:1e:83 prefix), whilst lines 9 and 10 show the B-MACs learnt from MX-2 and MX-3 via BGP (we can see these in the BGP routing table on lines 18 and 22 in the previous example)
  • Lines 17 through 24 give a small sample of the locally learnt mac-addresses connected to ge-1/0/0.100 in the I-Component
  • Lines 26 through 32 give a small sample of the remotely learnt mac-addresses behind MX-2's B-MAC (a8:d0:e5:5b:75:c8)
  • Lines 34 through 40 give a small sample of the remotely learnt mac-addresses behind MX-3's B-MAC (a8:d0:e5:5b:94:60)
  • Essentially, the EVPN control-plane is only present for B-MACs, whilst C-MAC forwarding is handled in the forwarding plane in the same way as VPLS – the big advantage being that all of this C-MAC information isn't thrown into BGP, it's kept local

Finally, with traffic running, I have the connection between MX-1 and P1 tapped so I can capture packets into Wireshark at line-rate. Let's look at a packet in the middle of the network to see what it looks like;

diag9

We can see the MPLS labels (2x labels – one for the IGP transport and one for the EVPN); below that we have our backbone Ethernet header with source and destination B-MAC (802.1ah provider backbone bridging); below that we have the 802.1ah I-TAG carrying the I-SID, followed by the customer's original C-MACs; and last we have the original dot1q frame (single-tagged in this case).
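
As a quick aside – the 4-byte MPLS label entries at the top of that capture are simple enough to pick apart by hand. Here's a minimal Python sketch that walks a label stack until the bottom-of-stack bit is set (the two label values are borrowed from outputs earlier in this post, purely for illustration);

  1. import struct
  2. 
  3. def parse_mpls_stack(pkt):
  4.     """Walk 4-byte MPLS entries: label(20) | EXP(3) | bottom-of-stack(1) | TTL(8)."""
  5.     offset, labels = 0, []
  6.     while True:
  7.         (entry,) = struct.unpack_from("!I", pkt, offset)
  8.         labels.append({"label": entry >> 12, "exp": (entry >> 9) & 0x7,
  9.                        "bos": (entry >> 8) & 0x1, "ttl": entry & 0xFF})
  10.         offset += 4
  11.         if labels[-1]["bos"]:
  12.             return labels
  13. 
  14. # Two illustrative entries: outer IGP transport label, inner EVPN label (BoS set)
  15. stack  = struct.pack("!I", (299904 << 12) | (0 << 8) | 255)
  16. stack += struct.pack("!I", (300640 << 12) | (1 << 8) | 255)
  17. print(parse_mpls_stack(stack))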

So that's pretty much it as far as the basics of PBB-EVPN are concerned – a few closing points;

  • PBB-EVPNs are considerably more complicated to configure than regular EVPN, however if you need massive scale and the ability to handle hundreds of thousands or millions of mac-addresses on the network – it’s currently one of the best technologies to look at
  • Unfortunately PBB-EVPN is pretty much layer-2 only; most of the fancy layer-3 hooks built into regular EVPN, which I demonstrated in previous blog posts, aren't supported for PBB-EVPN – it is essentially a layer-2 solution
  • PBB-EVPN does support layer-2 multi-homing which I might look into with a later blog-post

I hope anyone reading this found it useful 🙂

Segment Routing on JUNOS – The basics

Anybody who's been to a seminar associated with any major networking systems manufacturer, or bought any recent study material, will almost certainly have come across something new called "Segment Routing". It sounds pretty cool – but what is it, and why has it been created?

To understand this, we first need to rewind to what most of us are used to doing on a daily basis – designing/building/maintaining/troubleshooting networks that are built mostly around LDP or RSVP-TE. But what's wrong with these protocols? Why has Segment-Routing been invented, and what problems does it solve?

Before we delve into the depths of Segment-Routing, let's first remind ourselves of what basic LDP based MPLS is. LDP or "Label Distribution Protocol" first appeared around 1999, superseding the now defunct "TDP" or "Tag Distribution Protocol", in order to solve the problems of traditional IPv4 based routing. In a world where control-plane resources were finite, MPLS enabled routers to forward packets based solely on labels rather than destination IP addresses, allowing for a much simpler design. The fact that the "M" in MPLS stands for "Multiprotocol" allowed engineers to support a whole range of different services and encapsulations that could be tunnelled between devices in a network running nothing other than traditional IPv4; the role of LDP was to generate and distribute MPLS label bindings to the other devices in the network, alongside a common IGP such as ISIS or OSPF.

Back in the late 1990s and early 2000s, routers were much smaller and far less powerful – especially where relatively resource-intensive protocols like OSPF or ISIS were concerned. There was also the problem that protocols like OSPF, which runs directly over IP with fairly rigid packet formats, were difficult to extend. As a result, rather than modify the IGPs to support MPLS natively, the decision was made to invent a totally separate protocol (LDP) to run alongside the IGPs, simply to provide the MPLS label distribution and binding capability – many people today regard LDP as a "sticking plaster"; I myself prefer the phrase "gaffer tape" 🙂

A quick refresher on how LDP works, using a pile of MX routers – consider the following basic topology;

seg3

All routers have an identical configuration, the only difference is the ISIS ISO address and the IP addressing;

  1. tim@MX-1> show configuration protocols
  2. isis {
  3.     level 1 disable;
  4.     interface xe-2/0/0.0 {
  5.         point-to-point;
  6.     }
  7.     interface lo0.0 {
  8.         passive;
  9.     }
  10. }
  11. ldp {
  12.     interface xe-2/0/0.0;
  13. }

 

Assuming LDP adjacencies are established between all devices, the following sequence of events occurs;

  • MX-4 injects its local loopback 4.4.4.4/32 into ISIS, and this is advertised throughout the network – LDP also creates an MPLS label-binding of label-value 3 (the implicit-null label), which is advertised towards MX-3;

seg4

  • MX-3 receives the prefix with the label-binding of 3 (implicit-null) and creates an entry in its forwarding table with a "pop" action for any traffic destined for 4.4.4.4 out of interface xe-0/0/0 (essentially sending the packet unlabelled); at the same time it generates a new outgoing label of "299780" for 4.4.4.4, which is advertised towards MX-2;

seg5

  • When MX-2 receives 4.4.4.4 with a label binding of 299780, it adds the entry to its forwarding table out of interface xe-0/0/1, whilst at the same time advertising the prefix towards MX-1 with a different label of "299781". MX-2 is now aware of 2x MPLS labels for 4.4.4.4 – the label of 299780 it received from MX-3, and the new label of 299781 it generated and sent to MX-1. This essentially means any packets coming from MX-1 towards 4.4.4.4, tagged with label 299781 on xe-0/0/0, will be swapped to 299780 and forwarded out of xe-0/0/1 – hence the "hop by hop" forwarding paradigm (sketched in the snippet after the diagram below);

seg6
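
Here's that hop-by-hop logic as a toy Python sketch – the dictionary is simply a stand-in for each router's LFIB, using the label values from the figures above;

  1. # Toy LFIB walk for 4.4.4.4/32: MX-1 pushes, MX-2 swaps, MX-3 pops (PHP)
  2. lfib = {
  3.     "MX-1": ("push", 299781, "MX-2"),
  4.     "MX-2": ("swap", 299780, "MX-3"),   # in-label 299781 -> out-label 299780
  5.     "MX-3": ("pop",  None,   "MX-4"),   # implicit-null learnt from MX-4
  6. }
  7. 
  8. stack, node = [], "MX-1"
  9. while node in lfib:
  10.     action, label, nxt = lfib[node]
  11.     if action == "push":
  12.         stack.append(label)
  13.     elif action == "swap":
  14.         stack[-1] = label
  15.     else:                               # pop - the packet leaves unlabelled
  16.         stack.pop()
  17.     print(f"{node} -> {nxt}: {action}, label stack now {stack}")
  18.     node = nxt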

With such a small network involving only 4x routers, it's difficult to imagine running into problems with LDP, because it's so simple and easy. However, the moment you go from 4x routers to 1000x routers or beyond, it starts to become far less efficient;

  • Because LSRs generate labels for remote FECs on a hop-by-hop basis, you end up with a large number of MPLS labels contained in the LFIB, all of which have to be distributed alongside the IGP – resulting in a large amount of overhead. In the above example we have multiple labels for a single prefix with only 3 routers (and the fourth performing PHP)
  • We have to run LDP alongside the IGP everywhere, simply for MPLS to work – it's true that we've all been doing this for years, so why complain now when it works just fine? Because a simple solution is always the best solution – larger networks would be much simpler if the IGP itself could be made to accommodate the MPLS label advertisement functionality.
  • There's no traffic-engineering functionality; at the end of the day, in 99% of networks LDP simply "follows" the IGP best-path mechanism. If you change the IGP metrics you end up shifting large amounts of traffic around, which is often undesirable – as such, LDP tends to be a pain in the neck if you have more complex traffic requirements, for example making sure that 40Gbps of streaming video avoids a certain link in the network – with LDP that can't be done easily without resorting to endless hacks and tactical tweaks.

So LDP is far from perfect when we get into more complicated scenarios; if we have a larger network where we want to do any sort of traffic-engineering, the only real alternative is RSVP-TE.

RSVP-TE is essentially an extension of the original "RSVP" (Resource Reservation Protocol) that allows it to generate MPLS labels for prefixes, whilst at the same time using its resource-reservation capabilities to signal specific LSPs through the network that require a certain amount of bandwidth – or simply to reserve a path that's determined by the network designer, rather than by the IGP and its lowest-path-cost mentality.

The rather obvious cost with RSVP-TE is that it's a lot more complex. I've lost count of the number of times I've suggested a relatively simple RSVP-TE solution to a traffic-engineering problem, only for the people in the room to rule it out simply because it's too complex in nature – I've worked with a small number of global carrier/mobile networks who almost exclusively use RSVP-TE along with its fancy features, such as "auto-bandwidth", but the vast majority of smaller networks tend to stay away from it.

A further problem with RSVP-TE is that in large networks with numerous "P" routers and "PE" routers, the LSP state between the ingress and egress LSRs must be maintained – in a network with 1000s of routers, all of that information needs to be signalled, including bandwidth reservations, path reservations and so on and so forth, as opposed to LDP where we simply bind an MPLS label. The end result is that in some networks, control-plane processing on the routing engines can be extremely intense if the network encounters a significant failure – imagine a P router with 5k signalled LSPs traversing it; if it drops a link or a card, those 5k LSPs need to be recalculated and re-signalled throughout the entire network.

To make matters worse, many networks run LDP and RSVP-TE at the same time: LDP for traditional basic MPLS connectivity, with RSVP-TE LSPs running over the top to provide the traffic-engineering capability that might be needed in certain niche parts of the network – like keeping sensitive VOIP traffic separate from bulk internet traffic. The complexity ramps up pretty quickly in these environments, and you end up with a lot of different protocols stacked on top of each other – when all we really want to do is just forward packets between routers in a network………. 😀

 

Which brings me finally to Segment routing!

 

Segment routing is essentially proposed as a replacement for LDP or RSVP-TE, where the IGP (currently ISIS or OSPF) has been extended to incorporate the MPLS labelling and segment-routing functions internally – leading to the immediate and obvious benefit of not having to run an additional protocol alongside the IGP to provide the MPLS functionality. We can do everything inside ISIS or OSPF.

To make things even cooler, Segment-routing can operate over an IPv4 or IPv6 data-plane, supports ECMP, and also has extensions built into it which allow it to cater for things like L3VPNs or VPLS running over the top. The only thing it can't do is reserve bandwidth in the same way that RSVP-TE can, but this can be accomplished via the use of an external controller (SDN).

Segment routing support was released on Juniper MX routers under 15.1F6

For now, let's look at a basic topology along with some of the basic concepts and configurations – consider the below expanded topology from the LDP examples above;

seg7

Everything is the same, except that I've gone and added an additional link between MX-2 and MX-4. The first step is to enable segment-routing; for this network I'm using ISIS as the IGP. Turning segment-routing on is pretty simple – I just need to have MPLS and ISIS enabled on the correct interfaces, and switch on "source-packet-routing" under ISIS;

  1. tim@MX-1# show protocols
  2. mpls {
  3.     interface xe-2/0/0.0;
  4. }
  5. isis {
  6.     source-packet-routing;
  7.     level 1 disable;
  8.     interface xe-2/0/0.0 {
  9.         point-to-point;
  10.     }
  11.     interface lo0.0 {
  12.         passive;
  13.     }
  14. }

 

Notice how it's called "source-packet-routing" – essentially, segment-routing uses a source-routing paradigm, where the ingress PE determines the path through the network based on a set of instructions, or "segments".

Take this in contrast with RSVP-TE, where the control-plane is source-routed (the head-end LSR computes the path through the network to the tail-end) but the packets are sent with only a single RSVP MPLS label – so the control-plane is source-routed, but the data-plane is not.

With "segment-routing" enabled on all the routers in the network, let's take a look and see what's what;

We have a normal ISIS adjacency on MX-1;

  1. tim@MX-1> show isis adjacency
  2. Interface             System         L State        Hold (secs) SNPA
  3. xe-2/0/0.0            MX-2           2  Up                   21
  4. {master}
  5. tim@MX-1>

 

Let’s check out the ISIS database and see if anything new is present;

  1. tim@MX-1> show isis database extensive MX-2.00
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-2.00-00 Sequence: 0x28, Checksum: 0x4cff, Lifetime: 616 secs
  5.    IS neighbor: MX-1.00                       Metric:       10
  6.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  7.    IS neighbor: MX-3.00                       Metric:       10
  8.      Two-way fragment: MX-3.00-00, Two-way first fragment: MX-3.00-00
  9.    IS neighbor: MX-4.00                       Metric:       10
  10.      Two-way fragment: MX-4.00-00, Two-way first fragment: MX-4.00-00
  11.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  12.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  13.    IP prefix: 10.10.10.2/31                   Metric:       10 Internal Up
  14.    IP prefix: 10.10.10.4/31                   Metric:       10 Internal Up
  15.   Header: LSP ID: MX-2.00-00, Length: 315 bytes
  16.     Allocated length: 335 bytes, Router ID: 2.2.2.2
  17.     Remaining lifetime: 616 secs, Level: 2, Interface: 327
  18.     Estimated free bytes: 81, Actual free bytes: 20
  19.     Aging timer expires in: 616 secs
  20.     Protocols: IP, IPv6
  21.   Packet: LSP ID: MX-2.00-00, Length: 315 bytes, Lifetime : 1198 secs
  22.     Checksum: 0x4cff, Sequence: 0x28, Attributes: 0x3 <L1 L2>
  23.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  24.     Packet type: 20, Packet version: 1, Max area: 0
  25.   TLVs:
  26.     Area address: 49.0001 (3)
  27.     LSP Buffer Size: 1492
  28.     Speaks: IP
  29.     Speaks: IPV6
  30.     IP router id: 2.2.2.2
  31.     IP address: 2.2.2.2
  32.     Hostname: MX-2
  33.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  34. SPRING Algorithm – Algo: 0
  35.     IS neighbor: MX-1.00, Internal, Metric: default 10
  36.     IS neighbor: MX-3.00, Internal, Metric: default 10
  37.     IS neighbor: MX-4.00, Internal, Metric: default 10
  38.     IS extended neighbor: MX-1.00, Metric: default 10
  39.       IP address: 10.10.10.1
  40.       Neighbor’s IP address: 10.10.10.0
  41.       Local interface index: 328, Remote interface index: 327
  42. P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  43.     IS extended neighbor: MX-3.00, Metric: default 10
  44.       IP address: 10.10.10.2
  45.       Neighbor’s IP address: 10.10.10.3
  46.       Local interface index: 329, Remote interface index: 333
  47. P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299783
  48.     IS extended neighbor: MX-4.00, Metric: default 10
  49.       IP address: 10.10.10.4
  50.       Neighbor’s IP address: 10.10.10.5
  51.       Local interface index: 331, Remote interface index: 333
  52. P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299785
  53.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  54.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  55.     IP prefix: 10.10.10.2/31, Internal, Metric: default 10, Up
  56.     IP prefix: 10.10.10.4/31, Internal, Metric: default 10, Up
  57.     IP extended prefix: 2.2.2.2/32 metric 0 up
  58.     IP extended prefix: 10.10.10.0/31 metric 10 up
  59.     IP extended prefix: 10.10.10.2/31 metric 10 up
  60.     IP extended prefix: 10.10.10.4/31 metric 10 up
  61.   No queued transmissions
  62. {master}
  63. tim@MX-1>

 

So if we look at the ISIS database against MX-1’s neighbour (MX-2) we can see some additional things happening in ISIS;

  • We can see that SPRING (Segment-routing) is turned on and is a known TLV
  • We can see something called a “P2P IPv4 Adj-SID” with an associated MPLS label

The "IPv4 Adj-SID" is known as the IGP adjacency segment, and is essentially a segment attached to a directly connected IGP adjacency; it's injected locally by the router at either side of the adjacency – this can easily be demonstrated if we simply look at a single link between MX-1 and MX-2;

seg8

We take another look at the ISIS database on MX-1;

  1. tim@MX-1> show isis database extensive
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-1.00-00 Sequence: 0x2, Checksum: 0xf229, Lifetime: 827 secs
  5.    IS neighbor: MX-2.00                       Metric:       10
  6.      Two-way fragment: MX-2.00-00, Two-way first fragment: MX-2.00-00
  7.    IP prefix: 1.1.1.1/32                      Metric:        0 Internal Up
  8.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  9.   Header: LSP ID: MX-1.00-00, Length: 171 bytes
  10.     Allocated length: 1492 bytes, Router ID: 1.1.1.1
  11.     Remaining lifetime: 827 secs, Level: 2, Interface: 0
  12.     Estimated free bytes: 1273, Actual free bytes: 1321
  13.     Aging timer expires in: 827 secs
  14.     Protocols: IP, IPv6
  15.   Packet: LSP ID: MX-1.00-00, Length: 171 bytes, Lifetime : 1198 secs
  16.     Checksum: 0xf229, Sequence: 0x2, Attributes: 0x3 <L1 L2>
  17.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  18.     Packet type: 20, Packet version: 1, Max area: 0
  19.   TLVs:
  20.     Area address: 49.0001 (3)
  21.     LSP Buffer Size: 1492
  22.     Speaks: IP
  23.     Speaks: IPV6
  24.     IP router id: 1.1.1.1
  25.     IP address: 1.1.1.1
  26.     Hostname: MX-1
  27.     Router Capability:  Router ID 1.1.1.1, Flags: 0x01
  28.       SPRING Algorithm – Algo: 0
  29.     IP prefix: 1.1.1.1/32, Internal, Metric: default 0, Up
  30.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  31.     IP extended prefix: 1.1.1.1/32 metric 0 up
  32.     IP extended prefix: 10.10.10.0/31 metric 10 up
  33.     IS neighbor: MX-2.00, Internal, Metric: default 10
  34.     IS extended neighbor: MX-2.00, Metric: default 10
  35.       IP address: 10.10.10.0
  36.       Neighbor’s IP address: 10.10.10.1
  37.       Local interface index: 327, Remote interface index: 328
  38. P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299856
  39.   No queued transmissions
  40. MX-2.00-00 Sequence: 0x2, Checksum: 0x90bf, Lifetime: 825 secs
  41.    IS neighbor: MX-1.00                       Metric:       10
  42.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  43.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  44.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  45.   Header: LSP ID: MX-2.00-00, Length: 171 bytes
  46.     Allocated length: 284 bytes, Router ID: 2.2.2.2
  47.     Remaining lifetime: 825 secs, Level: 2, Interface: 327
  48.     Estimated free bytes: 113, Actual free bytes: 113
  49.     Aging timer expires in: 825 secs
  50.     Protocols: IP, IPv6
  51.   Packet: LSP ID: MX-2.00-00, Length: 171 bytes, Lifetime : 1198 secs
  52.     Checksum: 0x90bf, Sequence: 0x2, Attributes: 0x3 <L1 L2>
  53.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  54.     Packet type: 20, Packet version: 1, Max area: 0
  55.   TLVs:
  56.     Area address: 49.0001 (3)
  57.     LSP Buffer Size: 1492
  58.     Speaks: IP
  59.     Speaks: IPV6
  60.     IP router id: 2.2.2.2
  61.     IP address: 2.2.2.2
  62.     Hostname: MX-2
  63.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  64.       SPRING Algorithm – Algo: 0
  65.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  66.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  67.     IP extended prefix: 2.2.2.2/32 metric 0 up
  68.     IP extended prefix: 10.10.10.0/31 metric 10 up
  69.     IS neighbor: MX-1.00, Internal, Metric: default 10
  70.     IS extended neighbor: MX-1.00, Metric: default 10
  71.       IP address: 10.10.10.1
  72.       Neighbor’s IP address: 10.10.10.0
  73.       Local interface index: 328, Remote interface index: 327
  74. P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  75.   No queued transmissions
  76. {master}
  77. tim@MX-1>

 

So we can see from the ISIS database that each router on either side of the adjacency has locally generated a label for its own side of the link. Consider that this information is injected into the ISIS database, and the ISIS database is flooded throughout the entire network – this gives any ingress LSR the required knowledge to perform traffic-engineering, by simply imposing whichever adjacency-segment instructions it needs for a packet to take a specific path through the network.

Take the below example: if MX-1 sends packets containing MX-2's Adj-SID of 10 for its link to MX-3 (ADJ-SID = 10), traffic can be steered via MX-3 as soon as it lands on MX-2. Note that whilst MX-2 will allocate its ADJ-SID of 10 and distribute it via the IGP, only MX-2 will install that label in the forwarding-table – because it's locally significant.

seg9

The adjacency segment is one of the two main building blocks of segment-routing, and is generally known as a local segment, simply because it's designed to have local significance – if a packet arrives on an interface with a specific local-segment instruction in the stack, the device will act on that instruction and forward the packet in a particular way for that segment, or part of the network.

The next type of segment is known as the "nodal segment" or "global segment", and is globally significant; it generally represents the loopback address of each router in the network and is configured as an index. Let's go ahead and look at the configuration;

  1. tim@MX-1> show configuration protocols isis
  2. source-packet-routing {
  3.     node-segment ipv4-index 10;
  4. }
  5. level 1 disable;
  6. interface xe-2/0/0.0 {
  7.     point-to-point;
  8. }
  9. interface lo0.0 {
  10.     passive;
  11. }

 

So, a relatively straightforward configuration – I'll go ahead and configure the rest of the network as above, but with the following indexes;

  • MX-1 = node-segment index-10
  • MX-2 = node-segment index-20
  • MX-3 = node-segment index-30
  • MX-4 = node-segment index-40

seg10

So with the node-segment index configured on each router, let's check what's changed inside the ISIS database on MX-1 – looking just at the LSP received from MX-2, to keep things simple for now;

  1. tim@MX-1> show isis database extensive MX-2
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-2.00-00 Sequence: 0x73, Checksum: 0xd32e, Lifetime: 479 secs
  5.   IPV4 Index: 20
  6.   Node Segment Blocks Advertised:
  7.     Start Index : 0, Size : 4096, Label-Range: [ 800000, 804095 ]
  8.    IS neighbor: MX-1.00                       Metric:       10
  9.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  10.    IS neighbor: MX-3.00                       Metric:       10
  11.      Two-way fragment: MX-3.00-00, Two-way first fragment: MX-3.00-00
  12.    IS neighbor: MX-4.00                       Metric:       10
  13.      Two-way fragment: MX-4.00-00, Two-way first fragment: MX-4.00-00
  14.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  15.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  16.    IP prefix: 10.10.10.2/31                   Metric:       10 Internal Up
  17.    IP prefix: 10.10.10.4/31                   Metric:       10 Internal Up
  18.   Header: LSP ID: MX-2.00-00, Length: 335 bytes
  19.     Allocated length: 335 bytes, Router ID: 2.2.2.2
  20.     Remaining lifetime: 479 secs, Level: 2, Interface: 327
  21.     Estimated free bytes: 113, Actual free bytes: 0
  22.     Aging timer expires in: 479 secs
  23.     Protocols: IP, IPv6
  24.   Packet: LSP ID: MX-2.00-00, Length: 335 bytes, Lifetime : 1198 secs
  25.     Checksum: 0xd32e, Sequence: 0x73, Attributes: 0x3 <L1 L2>
  26.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  27.     Packet type: 20, Packet version: 1, Max area: 0
  28.   TLVs:
  29.     Area address: 49.0001 (3)
  30.     LSP Buffer Size: 1492
  31.     Speaks: IP
  32.     Speaks: IPV6
  33.     IP router id: 2.2.2.2
  34.     IP address: 2.2.2.2
  35.     Hostname: MX-2
  36.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  37.       SPRING Capability – Flags: 0xc0(I:1,V:1), Range: 4096, SID-Label: 800000
  38.       SPRING Algorithm – Algo: 0
  39.     IS neighbor: MX-1.00, Internal, Metric: default 10
  40.     IS neighbor: MX-3.00, Internal, Metric: default 10
  41.     IS neighbor: MX-4.00, Internal, Metric: default 10
  42.     IS extended neighbor: MX-1.00, Metric: default 10
  43.       IP address: 10.10.10.1
  44.       Neighbor’s IP address: 10.10.10.0
  45.       Local interface index: 328, Remote interface index: 0
  46.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  47.     IS extended neighbor: MX-3.00, Metric: default 10
  48.       IP address: 10.10.10.2
  49.       Neighbor’s IP address: 10.10.10.3
  50.       Local interface index: 329, Remote interface index: 0
  51.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299789
  52.     IS extended neighbor: MX-4.00, Metric: default 10
  53.       IP address: 10.10.10.4
  54.       Neighbor’s IP address: 10.10.10.5
  55.       Local interface index: 331, Remote interface index: 0
  56.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299788
  57.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  58.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  59.     IP prefix: 10.10.10.2/31, Internal, Metric: default 10, Up
  60.     IP prefix: 10.10.10.4/31, Internal, Metric: default 10, Up
  61.     IP extended prefix: 2.2.2.2/32 metric 0 up
  62.       8 bytes of subtlvs
  63.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 20
  64.     IP extended prefix: 10.10.10.0/31 metric 10 up
  65.     IP extended prefix: 10.10.10.2/31 metric 10 up
  66.     IP extended prefix: 10.10.10.4/31 metric 10 up
  67.   No queued transmissions
  68. {master}
  69. tim@MX-1>

 

Some explanations;

  • Lines 6 and 7 signify that MX-2 is advertising a nodal-segment block, or SRGB ("segment routing global block") – the label range from which nodal-segment labels are allocated; here it starts at value 800000 and has a size of 4096
  • Lines 46, 51 and 56 show the IGP adjacency segments we've already talked about (for the links to MX-2's neighbours)
  • Line 63 is the important one – here we can see a node SID with a value of 20, which is the index I configured under MX-2;
  1. tim@MX-2> show configuration protocols isis
  2. source-packet-routing {
  3.     node-segment ipv4-index 20;
  4. }
  5. level 1 disable;
  6. interface xe-0/0/0.0 {
  7.     point-to-point;
  8. }
  9. interface xe-0/0/1.0 {
  10.     point-to-point;
  11. }
  12. interface xe-0/0/2.0 {
  13.     point-to-point;
  14. }
  15. interface lo0.0 {
  16.     passive;
  17. }


So if I go back onto MX-1 and look at the mpls.0 routing-table – I should see an egress label of 20 for 2.2.2.2?

  1. tim@MX-1> show route table mpls.0
  2. mpls.0: 12 destinations, 12 routes (12 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 0                  *[MPLS/0] 16:12:24, metric 1
  5.                       to table inet.0
  6. 0(S=0)             *[MPLS/0] 16:12:24, metric 1
  7.                       to table mpls.0
  8. 1                  *[MPLS/0] 16:12:24, metric 1
  9.                       Receive
  10. 2                  *[MPLS/0] 16:12:24, metric 1
  11.                       to table inet6.0
  12. 2(S=0)             *[MPLS/0] 16:12:24, metric 1
  13.                       to table mpls.0
  14. 13                 *[MPLS/0] 16:12:24, metric 1
  15.                       Receive
  16. 299856             *[L-ISIS/14] 15:24:07, metric 0
  17.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  18. 299856(S=0)        *[L-ISIS/14] 00:07:46, metric 0
  19.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  20. 800020             *[L-ISIS/14] 00:22:52, metric 10
  21.                     > to 10.10.10.1 via xe-2/0/0.0, Pop  
  22. 800020(S=0)        *[L-ISIS/14] 00:07:46, metric 10
  23.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  24. 800030             *[L-ISIS/14] 00:22:47, metric 20
  25.                     > to 10.10.10.1 via xe-2/0/0.0, Swap 800030
  26. 800040             *[L-ISIS/14] 00:22:40, metric 20
  27.                     > to 10.10.10.1 via xe-2/0/0.0, Swap 800040
  28. {master}
  29. tim@MX-1>


Wrong! Label 20 doesn't seem to be anywhere – instead I have 800020…

Remember from the previous example above, on line 37, we have the "SRGB" base starting at 800000. Because global-segments are network-wide, all routers here use the same SRGB block starting at 800000, and each router's configured loopback index is simply added to that SRGB base value. If I configured an index of "666" on MX-4, then its global-segment label would be 800666, and so on.
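
That arithmetic is trivial, but worth sketching (the "MX-4 alt" index of 666 is the hypothetical example above; the rest match the lab);

  1. # Node-SID label = SRGB base + configured node index
  2. SRGB_BASE, SRGB_SIZE = 800000, 4096
  3. 
  4. def node_label(index):
  5.     assert 0 <= index < SRGB_SIZE, "index must fall inside the SRGB"
  6.     return SRGB_BASE + index
  7. 
  8. for router, index in [("MX-1", 10), ("MX-2", 20), ("MX-3", 30),
  9.                       ("MX-4", 40), ("MX-4 alt", 666)]:
  10.     print(f"{router:9} index {index:>3} -> label {node_label(index)}")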

If we look at the entire ISIS Database on MX-1 for all routers – we can see all the node segments, and their configured values;

  1. tim@MX-1> show isis database extensive | match node
  2.   Node Segment Blocks Advertised:
  3.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 10
  4.   Node Segment Blocks Advertised:
  5.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 20
  6.   Node Segment Blocks Advertised:
  7.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 30
  8.   Node Segment Blocks Advertised:
  9.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 40
  10. {master}
  11. tim@MX-1>

 

We can look at the inet.3 table to see the loopback prefixes of all the routers in the network, being resolved down to their nodal-segment labels;

  1. tim@MX-1> show route table inet.3
  2. inet.3: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 2.2.2.2/32         *[L-ISIS/14] 00:31:02, metric 10
  5.                     > to 10.10.10.1 via xe-2/0/0.0
  6. 3.3.3.3/32         *[L-ISIS/14] 00:30:57, metric 20
  7.                     > to 10.10.10.1 via xe-2/0/0.0, Push 800030
  8. 4.4.4.4/32         *[L-ISIS/14] 00:30:50, metric 20
  9.                     > to 10.10.10.1 via xe-2/0/0.0, Push 800040

 

We see the node-segments for MX-3 and MX-4, but not for MX-2, simply because of PHP – but nevertheless, we can see how it all fits together quite nicely.

It must be pointed out that in a network where packets are simply being forwarded using the global-segment label of the destination – for example, if we wanted to send packets from MX-1 to MX-4 without any traffic-engineering – the same label will be used end to end (the SRGB base of 800000 + the index 40 = 800040), as opposed to LDP, where labels for a single destination or FEC are generated on a hop-by-hop basis and get swapped to different values at every hop. Routers will also perform the same IGP-based ECMP hashing for equal-cost paths; essentially the packet forwarding behaves the same as with LDP, but with much less state information in the network.

 

The whole aim of basic segment-routing is to use global "nodal-segments" alongside local "adjacency-segments", allowing an ingress LSR to impose an exact path through the network – with much less state than was previously possible with protocols such as RSVP-TE.

For example, if we wanted to perform basic traffic-engineering, and send packets from MX-1 to MX-4, but via the longer path through MX-3, the following things would occur;

seg11

MX-1 imposes 2x labels: label 299784, the Adj-SID for MX-2's link towards MX-3, and label 800040 (the node-index of 40 configured on MX-4, plus the SRGB base value of 800000), and forwards the packet to MX-2;

seg12

MX-2 receives the packet and, due to the presence of the ADJ-SID 299784 label, follows the instruction and forwards the packet out of that link towards MX-3 – popping the ADJ-SID label in the process;

seg13

MX-3 receives the packet with label 800040 (the node-SID of MX-4), performs PHP in the standard way, and forwards the packet directly to MX-4, completing the process. (It's entirely acceptable to use explicit-null to preserve the MPLS label on egress towards MX-4 for the purposes of EXP-based QoS, if you're running pipe mode.)
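
The whole sequence condenses into a tiny sketch, using the labels from the steps above (the top of the label stack is written first);

  1. # Label-stack walk for the MX-1 -> MX-2 -> MX-3 -> MX-4 TE path above
  2. stack = [299784, 800040]   # [MX-2's Adj-SID towards MX-3, MX-4's node-SID]
  3. 
  4. print(f"MX-1 pushes {stack} and forwards to MX-2")
  5. stack.pop(0)               # MX-2 acts on the Adj-SID: pop + steer out the MX-3 link
  6. print(f"MX-2 pops the Adj-SID and forwards {stack} to MX-3")
  7. stack.pop(0)               # MX-3 performs PHP on the node segment
  8. print("MX-3 performs PHP and delivers the packet unlabelled to MX-4")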

 

Clever readers will notice that segment-routing basically boils down to a head-end LSR programming its own path through the network by imposing a number of MPLS labels which are treated as instructions – which leads to the obvious question of hardware support. Even high-end routers have a limit on the number of MPLS labels that can be handled by an ASIC; the maximum label-depth tends to be 3-5, depending on which model of router or chipset you're using, so it might be a while until more hardware vendors accommodate larger numbers of labels in the label stack.

Consider the fact that with segment-routing it's possible to provide VPN connectivity along with traffic-engineering purely on top of ISIS or OSPF, simply by using a much deeper label stack – we could quite quickly end up with 3-5 labels in the stack and hit the limits of our already very expensive linecards.

In terms of providing VPN services and performing things like traffic-engineering, as far as I can tell it's not possible to do this manually on a Juniper router inside the CLI at this time – you need a centralised controller, or "PCE" (path computation element), which is generally a server running the controller software. This connects to a "PCC" (path computation client) – the head-end LSR performing the signalling, as directed by the server (PCE). This communication generally takes place via a protocol known as PCEP (the path computation element protocol).

Essentially, the difference between a PCE that's provisioning RSVP-TE tunnels and a PCE that's signalling segments is this: both tell the head-end LSR how to forward traffic, except that with segment-routing no LSPs are provisioned – the head-end simply imposes a set of instructions (labels), as opposed to constructing an actual LSP through a chain of devices – again saving on state in the network.

At this time there are a few different controllers on the market: Juniper's NorthStar, Cisco's Open SDN Controller, and the open-source "OpenDaylight". One of my colleagues has managed to get OpenDaylight working with IOS-XR to good effect; I may try and get hold of a demo NorthStar licence so I can show this technology in action with IXIA – but that'll be for next time.

Thanks for reading 🙂