BGP Flowspec redirect with ExaBGP

I’ve been busy as hell since the summer and haven’t had much time to work on blog posts, but it’s all been good work! I also got a new job working for Riot Games (makers of the world’s largest online multiplayer game, League of Legends), which has been totally fantastic.

This post is about BGP Flowspec, specifically how we can now more easily redirect traffic to a scrubbing appliance. It’s common for a device such as an Arbor TMS, or some other type of filtering box, to be installed close to the network edge. It could be a Linux box full of filters, a DPI box, or anything else that’s useful for traffic verification or enforcement.

In the event that a DDOS event occurs, it’s possible to redirect suspect traffic, or traffic to a specific victim host, through an appliance where it can be dropped or permitted.

Traditionally this has been done with Layer-3 VPNs, where ingress traffic from the internet is punted into a “Dirty VRF”. It’s then forced through a mitigation appliance where it’s either dropped or permitted, in which case it returns to the same router but in a new “Clean VRF”.

It looks something like this;

[Diagram: Dirty VRF / Clean VRF mitigation design]

  • DDOS Traffic from Lizardsquad ingresses through the edge router aimed at the victim 1.2.3.4/32
  • A BGP host route of 1.2.3.4/32 is injected into the edge router via the Dirty VRF with a next-hop of the mitigation appliance
  • RIB Groups or route leaking is used to punt traffic aimed at 1.2.3.4 from GRT into the Dirty VRF
  • Suspect traffic is either dropped or forwarded back into the same edge router via the “Clean VRF”
  • It flows towards the destination, where it’s then leaked back into GRT ahead of reaching the final destination

There are many different permutations of this design, but the main flaw with it is the reliance on having to provision a clean VRF everywhere, along with route-leaking.

The only real reason for the Clean VRF in this scenario is to prevent a routing loop forming on the edge router. If traffic is returned into GRT, or back through the Dirty VRF, it’ll encounter the 1.2.3.4/32 mitigation route and be looped back into the mitigation appliance indefinitely;

[Diagram: routing loop through the mitigation appliance]

It’s always seemed a bit of a waste to me to have to provision VPNs everywhere, simply because we can’t put clean traffic back into the same router without causing a huge routing loop. Unless we resort to horrific things like policy-routing, we’re stuck with regular forwarding logic.

The only other alternative to this is to route clean traffic back into a physically different router that doesn’t maintain a mitigation route, that way the returned clean traffic can just follow the regular path to the destination;

[Diagram: clean traffic returned via a physically different router]

Obviously, the above scenario only really applies if you absolutely can’t run L3VPNs in your network, but still need DDOS mitigation.

Thankfully, now that BGP Flowspec is widely available, we can simplify everything and have a much more streamlined design.

Flowspec is standardised under RFC 5575 https://tools.ietf.org/html/rfc5575

The main principle of Flowspec is based on policy routing: we apply match criteria to ingress traffic, and packets that match can then be subject to specific actions. It works in pretty much exactly the same way as policy-routing, with one major difference: we can program it through BGP rather than through the CLI.

There are some obvious benefits to using BGP for this task;

  • Most networks and their operators already run and understand BGP – to turn on an additional AFI/SAFI to support flow routes, is pretty easy
  • It’s far easier to automate – programmatic networking using APIs to inject routes is far less laborious than having to do things like policy-routing, or creating gigantic clunky scripts.
  • Flowspec supports new extended “redirect” communities that allow a router to automatically forward traffic directly to a different IP next-hop, or directly to a VRF without requiring much configuration
  • There are a number of open source BGP daemons that are fully programmable and support Flowspec – such as ExaBGP and GoBGP, they’re also free!

Like with policy-routing and indeed QoS there’s a whole host of specific criteria that we can use to match packets;

  • Type 1 – IPv4 / IPv6 Destination prefix
  • Type 2 – IPv4 / IPv6 Source Prefix
  • Type 3 – IP protocol
  • Type 4 – Port (matches either the source or the destination port)
  • Type 5 – Destination Port
  • Type 6 – Source Port
  • Type 7 – ICMP Type
  • Type 8 – ICMP Code
  • Type 9 – TCP Flags
  • Type 10 – Packet Length
  • Type 11 – DSCP
  • Type 12 – Fragment encoding
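To make the component types a bit more concrete, here is a minimal Python sketch of how a Type 2 (source prefix) match could be packed into a Flowspec NLRI, following the RFC 5575 layout of type, prefix length, then prefix bytes. The function names are my own, and only the two prefix types are handled, so treat it as an illustration rather than a full encoder.

```python
import ipaddress
import struct

# Component type codes from the list above (only the prefix types are handled)
DEST_PREFIX = 1
SOURCE_PREFIX = 2


def encode_prefix_component(component_type, prefix):
    """Encode an IPv4 prefix component as <type, prefix length, prefix bytes>."""
    network = ipaddress.ip_network(prefix)
    # Only the bytes needed to cover the prefix length are carried
    byte_count = (network.prefixlen + 7) // 8
    prefix_bytes = network.network_address.packed[:byte_count]
    return bytes([component_type, network.prefixlen]) + prefix_bytes


def encode_flowspec_nlri(components):
    """Prepend the NLRI length to the concatenated components."""
    body = b"".join(components)
    if len(body) < 240:
        # Short form: a single length octet
        return bytes([len(body)]) + body
    # Long form: two octets with the top nibble set to 0xf
    return struct.pack("!H", 0xF000 | len(body)) + body


nlri = encode_flowspec_nlri(
    [encode_prefix_component(SOURCE_PREFIX, "30.30.30.100/32")]
)
print(nlri.hex())  # 0602201e1e1e64
```

Reading the output left to right: a length of 6, component type 2, a /32 prefix length (0x20), then the four address bytes. When several components are present they need to be ordered by type, which this sketch leaves to the caller.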

Likewise, once we’ve matched our packet – there’s a number of highly useful things that we can do to it using new BGP extended communities;

  • Type 0x8006 – Drop, or police; (traffic-rate 0 , or traffic-rate <rate> )
  • Type 0x8007 – Traffic action; (apply sampling)
  • Type 0x8008 – Redirect to VRF; (punt traffic into a VRF based on route-target)
  • Type 0x8009 – Traffic marking; (Set a DSCP value)
  • Type 0x0800 – Redirect to IP NH; (creates a policy that forces traffic towards the specified next-hop) currently supported on Cisco and Nokia – but not Juniper 😦
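As a rough illustration of what one of these actions looks like on the wire, the sketch below packs the “redirect to VRF” community (0x8008) carrying route-target 666:666. It assumes the 2-octet-AS route-target form (2-octet AS, 4-octet value); the function names are hypothetical.

```python
import struct

REDIRECT_TYPE = 0x8008  # "redirect to VRF" type from the list above


def encode_redirect(asn, value):
    """Pack type (2 octets), AS number (2 octets) and value (4 octets)."""
    return struct.pack("!HHI", REDIRECT_TYPE, asn, value)


def decode_redirect(community):
    """Return the (AS, value) route-target carried in a redirect community."""
    community_type, asn, value = struct.unpack("!HHI", community)
    if community_type != REDIRECT_TYPE:
        raise ValueError("not a redirect community")
    return asn, value


community = encode_redirect(666, 666)
print(community.hex())             # 8008029a0000029a
print(decode_redirect(community))  # (666, 666)
```

The receiving router matches the route-target in those last six octets against the import targets of its VRFs, which is why the redirect needs so little configuration on the router itself.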

The beauty of Flowspec, is that all of this is done directly in hardware – if you’re using modern silicon there’s practically no hit to performance, even with small packets – you should be able to run Flowspec rules at line rate – but as always, make sure you read the docs, test it AND speak to your vendor, 😉

Furthermore, the configuration is really quite basic – once you’ve enabled the BGP Flowspec AFI/SAFI and have a working session, simply go ahead and inject your mitigation routes – most of the work and config takes place on the controller.

Let’s take a quick look at the lab topology;

Cisco / Juniper mix, assume basic ISIS/MPLS internal connectivity with iBGP between loopbacks, all other basic settings at default.

[Diagram: lab topology]

Let’s take a look at the basic ExaBGP config;

(ExaBGP can be installed from git; https://github.com/Exa-Networks/exabgp)

[Screenshot: ExaBGP configuration]

Pretty simple stuff;

  • Lines 1 through 6 take care of basic BGP neighbour establishment
  • Line 7 specifies the AFI/SAFI as “Flow” enabling Flowspec
  • Lines 9 and 10 signify the match criteria;
    • Match anything from 30.30.30.100/32
  • Lines 12 and 13 attach an extended community to the Flow route advertisement
    • redirect 666:666 corresponds to the “Dirty VRF” route-target on the edge router
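The static config above is fine for a lab, but most of the appeal is in driving this programmatically. ExaBGP can run a helper process and read routing commands from its standard output, so a controller script mostly boils down to printing text. The sketch below builds such a command for the same match and action; the exact `announce flow route` syntax is written from memory of ExaBGP’s API examples and should be treated as an assumption to check against your version, and the helper name is my own.

```python
def flow_route_command(action, source=None, destination=None):
    """Build a one-line flow route announcement for ExaBGP's text API.

    The command grammar used here is an assumption, not taken from the post.
    """
    match = []
    if destination:
        match.append(f"destination {destination};")
    if source:
        match.append(f"source {source};")
    if not match:
        raise ValueError("at least one match criterion is required")
    return "announce flow route { match { %s } then { %s; } }" % (
        " ".join(match),
        action,
    )


# Same intent as the config above: anything from 30.30.30.100/32
# is redirected into the VRF importing route-target 666:666
command = flow_route_command("redirect 666:666", source="30.30.30.100/32")
print(command)
```

A real helper would print the command, flush stdout and then stay running, since the announcement only lives as long as the BGP session does.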

Now let’s take a look at the relevant config snippets on “edge1”, the router which will receive and install the Flow route from ExaBGP;

[Screenshot: edge1 BGP and Flowspec configuration]

  • Lines 1 through 11 relate to standard iBGP internal peering
  • Lines 11 through 18 relate to the upstream eBGP peering to the upstream router in AS1
  • Lines 20 through 26 run an iBGP session between the edge1 router and the ExaBGP controller;
    • Family inet for “flow” is enabled
    • It references a policy-statement called FSPEC which contains the extended community and the VRF we wish to use for redirecting to the “Dirty VRF”
    • The “no-validate” statement skips the flow-route validation procedure for flow routes that match the referenced policy
    • The community “ON-RAMP” is used to match the Flow route coming from ExaBGP, tying it to the policy

The “Dirty VRF” configuration is shown below. Essentially it’s just a VRF with the same route-target that ExaBGP is advertising flow routes for, plus a default route that punts all traffic directly into the mitigation appliance;

[Screenshot: Dirty VRF routing-instance configuration]

There’s also one really important behaviour that should be understood when employing Flowspec, as it differs between vendors.

Remember nearer the start of the post, where I talked about the routing-loop problem: traffic matching a mitigation route will loop indefinitely if it’s routed back into the same device.

This occurs on a Juniper router because when you enable Flowspec, it implicitly applies the Flowspec filter to every single interface on the router. If your packet re-enters the router on any interface, it’ll match the flow route every time and have the same action applied to it.

To get around this problem, Juniper added the ability to exclude interfaces from Flowspec processing, the config looks like this;

[Screenshot: Flowspec interface exclusion configuration]

In the above snippet, we’re excluding interface ge-0/0/1 from any form of further Flowspec processing or filtering, this allows return traffic to flow naturally southbound towards Edge2 inside GRT
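The effect of the exclusion is easy to model. The toy Python sketch below follows a single packet through a very simplified Edge1: the Flowspec filter is evaluated on every interface unless that interface is excluded, and the appliance always hands traffic back on ge-0/0/1. The “upstream” interface name and the pass limit are assumptions made for the illustration, not anything Junos does.

```python
def trace_packet(source, flow_sources, excluded=(), max_passes=5):
    """Follow one packet through a toy model of Edge1 and log each decision."""
    log = []
    interface = "upstream"  # hypothetical name for the internet-facing port
    for _ in range(max_passes):
        flowspec_active = interface not in excluded
        if flowspec_active and source in flow_sources:
            log.append(f"{interface}: redirect to mitigation appliance")
            # The appliance hands clean traffic back on ge-0/0/1
            interface = "ge-0/0/1"
            continue
        log.append(f"{interface}: forward normally")
        return log
    log.append("gave up: routing loop")
    return log


flow_sources = {"30.30.30.100"}

# Without the exclusion the returned packet matches the flow route again
print(trace_packet("30.30.30.100", flow_sources)[-1])

# With ge-0/0/1 excluded the clean traffic is forwarded as normal
print(trace_packet("30.30.30.100", flow_sources, excluded={"ge-0/0/1"}))
```

The first call ends in the loop message, while the second shows one redirect followed by normal forwarding, which is the behaviour we want from the return interface.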

Note; this is not needed on some other platforms such as a Nokia 7750 – where Flowspec is embedded inside a packet filter, and so Flowspec is only ever applied to whichever interfaces the packet filter is applied to, rather than to every single interface on the router – as is the case with Juniper. Always read the documentation – especially Nokia as they have a tendency to completely change things from one release to the next 😀 

Let’s see it in action;

I’m using the Ostinato traffic generator inside Eve-NG to send a small amount of traffic from the external generator in AS1 behind the Cisco router “peering” from the IP address 30.30.30.100, to the internal endpoint in AS65001, behind the Cisco router “Edge2”, targeted at 192.168.100.2.

Traffic flows normally from north to south;

[Screenshot: Ostinato traffic generator]

If we look at the mitigation interface (ge-0/0/0) on Edge1, we can see that nothing is being punted to the mitigation device, traffic is just flowing normally, out of the southbound ge-0/0/2 interface towards Edge2;

[Screenshot: Edge1 interface traffic rates before mitigation]

So let’s go ahead and turn on the Flowspec advertisement, firstly by switching on the ExaBGP process and advertising the Flow route to Edge1;

[Screenshot: ExaBGP start-up output]

So we can see some relevant information, such as the connection parameters and the successful connection to Edge1. Let’s look on Edge1 to see what’s being received;

[Screenshot: flow route received on Edge1]

The flow route received from ExaBGP contains some interesting information;

  • Line 15 shows the flow route key as *,30.30.30.100, which reads as destination then source – any destination, “from” 30.30.30.100. Compared to normal destination-based routing, and as you’ll remember from the ExaBGP config, we’re matching the source of the traffic for inspection.
  • Line 26 specifies the Announcement bits as 0-Flow
  • Line 28 includes the special Flowspec “redirect community” of 666:666

Flowspec on Juniper uses the firewall filter architecture. It doesn’t add any configuration to the device; instead it automatically constructs a firewall filter from the flow route advertisement;

[Screenshot: Flowspec firewall filter on Edge1]

We can see that the firewall filter has been added, it’s matching packets so hopefully those packets should be flowing out of the “Dirty-VRF” towards the mitigation appliance, (remember before, they were flowing straight down from north to south)

[Screenshot: Edge1 interface traffic rates with mitigation active]

We can see that the traffic rate on ge-0/0/0 has gone up to 98pps, meaning we’re sending traffic towards the mitigation appliance. That very same traffic returns clean on ge-0/0/1.

In the case of the lab, the mitigation appliance is just a Cisco CSR with a default route pointing back at the ge-0/0/1 interface on the Juniper, but whether it’s an Arbor TMS or a Linux box full of filters, the principle remains the same.

In many cases, vendor supported DDOS Mitigation appliances such as Arbor SP/TMS have built in support for Flowspec, so you can trigger mitigation flow routes automatically if certain things get detected.

Previously such appliances had no way of redirecting traffic other than advertising hundreds, or thousands, of /32 victim host routes in order to break regular best-path routing. Thanks to Flowspec, we can now specify traffic sources, ports, protocols, you name it.

It’s also pretty easy to rate limit, if we look at the ExaBGP config, we can use the “rate-limit” extended community, to create a packet policer directly in the forwarding plane, all built from a BGP advertisement;

[Screenshot: ExaBGP configuration with rate-limit]

In the above config, I’ve simply removed the redirect community and replaced it with “rate-limit”. This instead encodes the rate-limit action into the flow route advertisement, in this case 1000Bps (bytes per second).

If we go back to the router and see what’s happening and look at the Flowspec filter;

[Screenshot: flow route and Flowspec firewall filter with rate-limit]

We can see the Flowspec BGP flow route being received, along with the “traffic-rate:0:1000” community.

We can also see that the firewall filter now has two entries, one for matching the source and a second for rate-limiting traffic that exceeds the configured speed, but there’s a mismatch – can you see it?

If you look closely at the Firewall filter – it’s converted the rate to “8K” rather than the ExaBGP configured value of 1k.

The reason for this, is that there appears to be a mismatch between the RFC and the Juniper implementation, RFC 5575 specifies that the rate should be specified in Bytes per second, however Juniper convert that value to bps (bits per second) inside their firewall filter;

From the RFC 5575;

The remaining 4 octets carry the rate information in IEEE floating point [IEEE.754.1985] format, units being bytes per second. A traffic-rate of 0 should result on all traffic for the particular flow to be discarded.

The fact that Juniper convert the Value to bps isn’t a problem, it’s just something to be aware of and explains the differences in the show commands.
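The arithmetic is simple enough to check by hand. The Python sketch below packs the traffic-rate community the way the RFC text describes it (type 0x8006, a 2-octet AS, then a 4-octet IEEE float in bytes per second) and converts the rate to bits per second, which is the unit the firewall filter displays. The function names are my own.

```python
import struct

TRAFFIC_RATE_TYPE = 0x8006  # traffic-rate extended community type


def encode_traffic_rate(bytes_per_second, asn=0):
    """Pack type (2 octets), AS (2 octets) and the rate as a 4-octet float."""
    return struct.pack("!HHf", TRAFFIC_RATE_TYPE, asn, float(bytes_per_second))


def decode_traffic_rate(community):
    """Return (AS, rate in bytes per second) from a traffic-rate community."""
    community_type, asn, rate = struct.unpack("!HHf", community)
    if community_type != TRAFFIC_RATE_TYPE:
        raise ValueError("not a traffic-rate community")
    return asn, rate


community = encode_traffic_rate(1000)
asn, rate = decode_traffic_rate(community)
print(community.hex())  # 80060000447a0000
print(rate * 8)         # 8000.0
```

So the 1000 advertised by ExaBGP and the “8K” in the filter are the same rate, just expressed in bytes and bits respectively.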

Hope you found this useful 🙂

BGP Optimal-route-reflection (BGP-ORR)

Been a while since my last update, been quite busy! but I thought I’d do a post on something BGP related, as everyone loves BGP!

There’s an interesting addition to BGP route-reflection that’s found its way into a few trains of code on Juniper and Cisco (I assume it’s on others too), which attempts to solve one of the annoying issues that occurs when centralised route-reflectors are used.

It all boils down to the basics of path selection in networks where the setup is relatively simple and identical routes are received at different edge routers within the network, similar to anycast routing.

Consider the below lab topology;

[Diagram: lab topology]

The core of the network is super simple, basic ISIS, basic LDP/MPLS with ASR-9kv as an out-of-path route-reflector, with iBGP adjacencies configured along the green arrows, the red arrows signify the eBGP sessions, between AS 100-200 and AS 100-300, where IOSv-7 and IOSv-8 advertise an identical 6.6.6.6/32 route. IOSv-3 and IOSv-4 are just P routers running ISIS/LDP only, for the sake of adding a few hops.

With everything configured as defaults, let’s look at the path selection;

 iosv-1#show ip bgp
BGP table version is 6, local router ID is 192.168.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
* i 6.6.6.6/32 192.168.0.4 0 100 0 300 i
*>                    10.0.0.2 0 0 200 i

 

iosv-2#show ip bgp
BGP table version is 29, local router ID is 192.168.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i

iosv-5#sh ip bgp
BGP table version is 27, local router ID is 192.168.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i

iosv-6#sh ip bgp
BGP table version is 6, local router ID is 192.168.0.4
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*> 6.6.6.6/32 10.1.0.2 0 0 300 i

 

 

If we take a brief look at the situation, specifically IOSv-2 and IOSv-5, it’s pretty easy to see what’s happening: the network has basically converged to prefer the path via AS-300 to get to 6.6.6.6/32.

For many networks, this sort of thing isn’t a problem – there’s a functional, working path to 6.6.6.6/32, if the edge router connected to AS-300 fails, the path through AS-200 via IOSv-1 will be used to get to the same prefix — everybody is happy because we can ping stuff.

[Diagram: all edge routers using the path via AS-300]

The problem though, is that even a layman with no knowledge of networks or routing would look at this situation and think ‘that seems a bit rubbish’, especially considering that each of those routers (in a large scale environment) might cost as much as $1million. It seems a bit lame that they can’t make better use of paths.

Surely there has to be a simple way to make better use of paths? First, let’s look at why the network has converged in such a way, starting with the route-reflector (ASR-9kv).

RP/0/RP0/CPU0:iosxrv9000-1#sh bgp
Thu Jun 1 21:46:12.042 UTC
BGP router identifier 192.168.0.5, local AS number 100
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0xe0000000 RD version: 41
BGP main routing table version 41
BGP NSR Initial initsync version 2 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i – internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i – IGP, e – EGP, ? – incomplete
Network Next Hop Metric LocPrf Weight Path
* i6.6.6.6/32 192.168.0.1 0 100 0 200 i
*>i                 192.168.0.4 0 100 0 300 i

Processed 1 prefixes, 2 paths
RP/0/RP0/CPU0:iosxrv9000-1#show bgp 6.6.6.6/32
Thu Jun 1 21:47:19.015 UTC
BGP routing table entry for 6.6.6.6/32
Versions:
Process bRIB/RIB SendTblVer
Speaker 41 41
Last Modified: Jun 1 20:28:41.601 for 01:18:38
Paths: (2 available, best #2)
Advertised to update-groups (with more than one peer):
0.2
Path #1: Received by speaker 0
Not advertised to any peer
200, (Received from a RR-client)
192.168.0.1 (metric 22) from 192.168.0.1 (192.168.0.1)
Origin IGP, metric 0, localpref 100, valid, internal, group-best
Received Path ID 0, Local Path ID 0, version 0
Path #2: Received by speaker 0
Advertised to update-groups (with more than one peer):
0.2
300, (Received from a RR-client)
192.168.0.4 (metric 21) from 192.168.0.4 (192.168.0.4)
Origin IGP, metric 0, localpref 100, valid, internal, best, group-best
Received Path ID 0, Local Path ID 1, version 23
RP/0/RP0/CPU0:iosxrv9000-1#

 

So it’s pretty easy to see the reason why the path through AS-300 has been selected. With two competing routes, the BGP path selection process works through each of the attributes of the routes until it finds a difference, to select a winner;

1: Weight (no weight configured anywhere on the network other than defaults)

2: Local-preference (both routes have the default of 100)

3: Prefer locally originated routes (both identical, neither is locally originated – the RR receives them from its clients)

4: Prefer shortest AS-Path (both paths lengths are identical)

5: Prefer lowest origin code (both routes have the same origin of IGP)

6: Prefer the lowest MED (both MED values are unconfigured and 0)

7: Prefer eBGP paths over iBGP (the RR receives both paths via iBGP from its clients)

8: Prefer the path with the lowest IGP metric (Bingo! the path via IOSv-6 on AS-300 has an IGP next-hop metric of 21, vs the path via IOSv-1 with its IGP next-hop metric of 22)
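The eight steps above can be sketched as a single sort key in Python. This is a simplified model rather than any vendor’s implementation: the class and field names are my own, MED is compared unconditionally (real implementations normally only compare it between paths from the same neighbouring AS), and the later tie-breakers are left out. The two paths use the values from the route-reflector output above.

```python
from dataclasses import dataclass

STEPS = ("weight", "local-preference", "locally originated", "AS-path length",
         "origin", "MED", "eBGP over iBGP", "IGP metric")


@dataclass
class Path:
    next_hop: str
    as_path: tuple
    igp_metric: int
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    origin: int = 0  # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0
    ebgp: bool = False


def preference_key(path):
    """Build a tuple where the smaller value wins at every step."""
    return (
        -path.weight,                 # higher weight wins
        -path.local_pref,             # higher local-preference wins
        not path.locally_originated,  # locally originated wins
        len(path.as_path),            # shorter AS-path wins
        path.origin,                  # lower origin code wins
        path.med,                     # lower MED wins (simplified)
        not path.ebgp,                # eBGP wins over iBGP
        path.igp_metric,              # lower IGP metric to the next-hop wins
    )


def deciding_step(a, b):
    """Name the first step at which two paths differ."""
    for name, x, y in zip(STEPS, preference_key(a), preference_key(b)):
        if x != y:
            return name
    return "tie"


via_as200 = Path("192.168.0.1", (200,), igp_metric=22)
via_as300 = Path("192.168.0.4", (300,), igp_metric=21)
best = min([via_as200, via_as300], key=preference_key)
print(best.next_hop, "-", deciding_step(via_as200, via_as300))
```

Running it picks 192.168.0.4 and reports the IGP metric as the deciding step, matching what the route-reflector did.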

The problem here, is that once the route-reflector has made this decision – other alternate paths can’t be used in any way at all, because as everyone knows – BGP only normally advertises best-paths, so any other routes received by the route-reflector go no further and aren’t advertised to the network.

In the case of this lab, the only reason this has happened is because one edge router is only slightly closer than another to the route-reflector, so the route-reflector has gone ahead and made the decision for everyone, despite the obvious fact that from a packet forwarding and latency perspective – IOSv-2 has a suboptimal path, it would be much better if IOSv-2 went via IOSv-1 rather than all the way through IOSv-6 to get to 6.6.6.6/32

The diagram with the ISIS metrics imposed shows the simplicity of the problem;

[Diagram: lab topology with ISIS metrics]

If we had 1000 edge routers on this network, every single one of them would select the path through IOSv-6 in AS-300, and IOSv-1 wouldn’t receive a single packet of egress traffic apart from anything it sends locally (because eBGP routes are preferred over iBGP).

The problem with IGPs in service-provider networks, is that they’re difficult to tweak at the best of times, even if we made them the same – the RR would still only advertise a single route, based on the next decision in the BGP path selection process (oldest path followed by RID) <yes we know add-paths exists, but that’s not without issues 🙂 >

If we start to manipulate the metrics, that normally has the undesirable result of moving lots of traffic from one link to another – which makes management and planning difficult.

My personal approach would normally be to try and stick to good design, in order to prevent this sort of behaviour. An obvious and simple method, and one that’s normally employed in larger ISPs, is to have POP-based route-reflectors in a hierarchy. That is, route-reflector clients are always served by the route-reflector that’s closest to them. That way the IGP next-hop costs will always be lower than when relying on a centralised route-reflector that’s buried in the middle of the network, somewhere behind 20 P routers.

For example in the below change to the design, IOSv-1 and IOSv-6 each have their own local route-reflector (RR1 and RR2), in this case each RR is metrically closer to the edge-router it serves, meaning that if the BGP tiebreaker happens to fall on the IGP next-hop cost, the closest value will always be chosen.

[Diagram: design with local route-reflectors RR1 and RR2]

The problem with the above design, is that whilst it’s simpler from a protocols perspective – it ends up being much more expensive and eventually more complex in the long run. If I have 500x POPs that’s a lot of route-reflectors and a more complex hierarchy, along with longer convergence times – but then again with 500x POPs, I’d also have many other issues to contend with.

In smaller networks with perhaps a pair of centralised route-reflectors, we can use BGP-ORR (optimal route reflection) to employ some of the information held inside the IGP link-state database to assist BGP in making a better routing decision.

This is possible because as we all know – with link-state IGPs such as ISIS or OSPF, they each hold a full live state of all links and all paths in the network, so it makes sense to hook into this information, rather than having BGP act in isolation and compute a suboptimal path.

More information on the draft is given below;

https://tools.ietf.org/html/draft-ietf-idr-bgp-optimal-route-reflection-13

So – I’ll go ahead with the existing topology and configure BGP-ORR on the route-reflector only, and we’ll look at how the routing has changed;

A reminder of the topology;

[Diagram: lab topology]

A quick look at the BGP configuration on ASR-9kv;

RP/0/RP0/CPU0:iosxrv9000-1#show run router bgp
Thu Jun 1 22:57:50.574 UTC
router bgp 100
bgp router-id 192.168.0.5
address-family ipv4 unicast
optimal-route-reflection r2 192.168.0.2
optimal-route-reflection r5 192.168.0.3
optimal-route-reflection r7 192.168.0.1
optimal-route-reflection r8 192.168.0.4

!
! iBGP
! iBGP clients
neighbor 192.168.0.1
remote-as 100
description RR client iosv-1
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r7
route-reflector-client
!
!
neighbor 192.168.0.2
remote-as 100
description RR client iosv-2
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r2
route-reflector-client
!
!
neighbor 192.168.0.3
remote-as 100
description RR client iosv-5
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r5
route-reflector-client
!
!
neighbor 192.168.0.4
remote-as 100
description RR client iosv-6
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r8
route-reflector-client
!
!
!

RP/0/RP0/CPU0:iosxrv9000-1# sh run router isis
Thu Jun 1 22:57:56.223 UTC
router isis 100
is-type level-2-only
net 49.1921.6800.0005.00
distribute bgp-ls
address-family ipv4 unicast
metric-style wide
!
interface Loopback0
passive
circuit-type level-2-only
address-family ipv4 unicast
!
!
interface GigabitEthernet0/0/0/0
point-to-point
address-family ipv4 unicast
!
!
!
RP/0/RP0/CPU0:iosxrv9000-1#

 

Before we go over the configuration, let’s look at the results on IOSv-2 and IOSv-5 (recall from a few pages up that previously both routers had picked the route via IOSv-6 in AS-300)

iosv-2#sh ip bgp
BGP table version is 32, local router ID is 192.168.0.2
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.1 0 100 0 200 i
iosv-2#

iosv-5#sh ip bgp
BGP table version is 29, local router ID is 192.168.0.3
Status codes: s suppressed, d damped, h history, * valid, > best, i – internal,
r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter,
x best-external, a additional-path, c RIB-compressed,
Origin codes: i – IGP, e – EGP, ? – incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path
*>i 6.6.6.6/32 192.168.0.4 0 100 0 300 i
iosv-5#

 

Notice how IOSv-2 and IOSv-5 have each selected their closest peering router (IOSv-1 and IOSv-6 respectively) to get to 6.6.6.6/32, instead of everything going via IOSv-6, as illustrated below;

[Diagram: IOSv-2 and IOSv-5 each using their closest exit]

For ye of little faith – a traceroute confirms the newer optimised best path from IOSv-2 and IOSv-5 – both routers choose their closest exit (1 hop away)

iosv-2#traceroute 6.6.6.6 source lo0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
1 10.0.128.1 1 msec 0 msec 0 msec
2 10.0.0.2 1 msec * 0 msec

iosv-2#

iosv-5#trace 6.6.6.6 source lo0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
1 10.0.0.14 1 msec 0 msec 0 msec
2 10.1.0.2 1 msec * 0 msec

iosv-5#

 

So with the configuration applied, how does BGP-ORR actually work?

It all boils down to perspective. Rather than the route-reflector making a decision based purely on its own information, such as its own IGP cost to the next-hop, with BGP-ORR the route-reflector can ‘hook’ into the link-state database and check the IGP next-hop cost from the perspective of the RR client, rather than the RR itself.

This is possible because link-state IGPs contain a full database of link states that is distributed to all devices running the IGP, which ultimately means that with BGP-ORR we can put the route-reflector anywhere in the network, because we can ‘hack’ the protocol to make a calculation from the perspective of wherever we choose, rather than the current location.
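The idea of running SPF from somebody else’s position is easy to demonstrate. The Python sketch below runs a plain Dijkstra over a small link-state graph, rooted first at the route-reflector and then at a client. The four-node topology and its metrics are made up for the illustration and are not the lab’s values.

```python
import heapq


def spf(graph, root):
    """Dijkstra: lowest cost from root to every reachable node."""
    costs = {root: 0}
    queue = [(0, root)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, metric in graph[node].items():
            new_cost = cost + metric
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return costs


def closest_exit(graph, root, exits):
    """Pick the exit router with the lowest IGP cost as seen from root."""
    costs = spf(graph, root)
    return min(exits, key=lambda exit_router: costs[exit_router])


# Hypothetical topology: every node holds the same link-state database
graph = {
    "rr": {"client": 10, "exit-b": 1},
    "client": {"rr": 10, "exit-a": 1},
    "exit-a": {"client": 1, "exit-b": 20},
    "exit-b": {"rr": 1, "exit-a": 20},
}
exits = ["exit-a", "exit-b"]

print(closest_exit(graph, "rr", exits))      # exit-b
print(closest_exit(graph, "client", exits))  # exit-a
```

Same database, different root, different answer. A route-reflector without ORR effectively always uses the first call, while ORR lets it use the second on behalf of the client.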

The below diagram illustrates it as simply as possible in the current topology for IOSv-2 only;

[Diagram: ASR-9kv computing the best path from the perspective of IOSv-2]

In the above diagram, ASR-9kv decides the best path using IOSv-2’s cost to IOSv-1, by looking at the ISIS database in the same way that IOSv-2 looks at it, or from the perspective of IOSv-2.

If we look at the ISIS routes on IOSv-2, followed by the BGP-ORR policy on the route-reflector, we can see that the route-reflector uses the very same costs.

iosv-2#show ip route isis
Codes: L – local, C – connected, S – static, R – RIP, M – mobile, B – BGP
D – EIGRP, EX – EIGRP external, O – OSPF, IA – OSPF inter area
N1 – OSPF NSSA external type 1, N2 – OSPF NSSA external type 2
E1 – OSPF external type 1, E2 – OSPF external type 2
i – IS-IS, su – IS-IS summary, L1 – IS-IS level-1, L2 – IS-IS level-2
ia – IS-IS inter area, * – candidate default, U – per-user static route
o – ODR, P – periodic downloaded static route, H – NHRP, l – LISP
a – application route
+ – replicated route, % – next hop override, p – overrides from PfR

Gateway of last resort is not set

10.0.0.0/8 is variably subnetted, 8 subnets, 3 masks
i L2 10.2.128.0/30 [115/12] via 10.2.0.1, 01:25:02, GigabitEthernet0/2
192.168.0.0/32 is subnetted, 7 subnets
i L2 192.168.0.1 [115/11] via 10.0.128.1, 01:35:19, GigabitEthernet0/1
i L2 192.168.0.3 [115/13] via 10.2.0.1, 01:34:49, GigabitEthernet0/2
[115/13] via 10.0.128.1, 01:34:49, GigabitEthernet0/1
i L2 192.168.0.4 [115/12] via 10.2.0.1, 01:34:49, GigabitEthernet0/2
i L2 192.168.0.5 [115/2] via 10.2.0.1, 01:25:02, GigabitEthernet0/2
i L2 192.168.0.9 [115/12] via 10.0.128.1, 01:35:09, GigabitEthernet0/1
i L2 192.168.0.10 [115/11] via 10.2.0.1, 01:35:09, GigabitEthernet0/2
iosv-2#

RP/0/RP0/CPU0:iosxrv9000-1#show orrspf database r2
Fri Jun 2 10:20:25.187 UTC

ORR policy: r2, IPv4, RIB tableid: 0xe0000002
Configured root: primary: 192.168.0.2, secondary: NULL, tertiary: NULL
Actual Root: 192.168.0.2, Root node: 1921.6800.0002.0000

Prefix                                  Cost
192.168.0.1                           11
192.168.0.2                           10
192.168.0.3                           13
192.168.0.4                           12
192.168.0.5                            2
192.168.0.9                           12
192.168.0.10                         11

Number of mapping entries: 8
RP/0/RP0/CPU0:iosxrv9000-1#

 

Essentially, the ISIS costs are copied and pasted from the IGP database into the BGP-ORR database, so that the route-reflector can use this information in its path selection process.

Let’s have a quick review of the route-reflector config;
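Plugging the real numbers in makes the difference obvious. The short Python sketch below compares the two next-hops for 6.6.6.6/32 using the route-reflector’s own metrics (22 and 21, from the earlier show bgp output) and then the r2 policy costs (11 and 12, from the ORR database above). The variable and function names are my own.

```python
# Cost to each BGP next-hop as seen from the route-reflector itself
COSTS_FROM_RR = {"192.168.0.1": 22, "192.168.0.4": 21}

# Cost to the same next-hops from the r2 policy root (IOSv-2)
COSTS_FROM_R2 = {"192.168.0.1": 11, "192.168.0.4": 12}


def select_next_hop(next_hops, costs):
    """Return the next-hop with the lowest IGP cost in the given view."""
    return min(next_hops, key=lambda next_hop: costs[next_hop])


next_hops = ["192.168.0.1", "192.168.0.4"]
print(select_next_hop(next_hops, COSTS_FROM_RR))  # 192.168.0.4
print(select_next_hop(next_hops, COSTS_FROM_R2))  # 192.168.0.1
```

With its own costs the route-reflector sends everyone to IOSv-6, and with the r2 costs it advertises IOSv-1 to IOSv-2, which is exactly the change seen in the show ip bgp output.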

router bgp 100
bgp router-id 192.168.0.5
address-family ipv4 unicast
optimal-route-reflection r2 192.168.0.2
optimal-route-reflection r5 192.168.0.3
optimal-route-reflection r7 192.168.0.1
optimal-route-reflection r8 192.168.0.4

!
! iBGP
! iBGP clients
neighbor 192.168.0.1
remote-as 100
description RR client iosv-1
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r7
route-reflector-client
!
!
neighbor 192.168.0.2
remote-as 100
description RR client iosv-2
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r2
route-reflector-client
!
!
neighbor 192.168.0.3
remote-as 100
description RR client iosv-5
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r5
route-reflector-client
!
!
neighbor 192.168.0.4
remote-as 100
description RR client iosv-6
update-source Loopback0
address-family ipv4 unicast
optimal-route-reflection r8
route-reflector-client
!
!
!

RP/0/RP0/CPU0:iosxrv9000-1#sh run router isis
Fri Jun 2 10:35:00.027 UTC
router isis 100
is-type level-2-only
net 49.1921.6800.0005.00
distribute bgp-ls
address-family ipv4 unicast
metric-style wide
!
interface Loopback0
passive
circuit-type level-2-only
address-family ipv4 unicast
!
!
interface GigabitEthernet0/0/0/0
point-to-point
address-family ipv4 unicast
!
!
!

 

In the first portion of the configuration, we specify the root device that we want to use to compute the IGP cost, which in the case of IOSv-2 is policy r2 with a root of 192.168.0.2. We can also specify secondary and tertiary devices.

Under the RR neighbour configuration, we apply ‘optimal-route-reflection’ along with the policy we configured previously, which in the case of IOSv-2 is r2 under neighbour 192.168.0.2.

Lastly, under ISIS we need to configure ‘distribute bgp-ls’. This essentially tells ISIS to distribute its information into BGP. For more information on BGP-LS (BGP Link-State), see my previous blog post on Segment-routing and ODL.

As a conclusion, I think BGP-ORR is a useful addition to the protocol, and I’ve certainly worked on networks where it would make sense to implement. Unfortunately it only seems to exist in a few trains of code on certain devices. In this lab example I was using Cisco VIRL spun up under Vagrant on packet.net, where the XR9kv router is the only one that supports BGP-ORR, but I have seen it recently on JUNOS.

As for potential downsides of BGP-ORR, in larger networks it could become quite complicated to design, where you have lots of different routers that all need to be balanced correctly, and in larger networks having centralised route-reflectors can be a big downside and distributed RR designs may work better.

I’d also be interested to see how well BGP-ORR converges in networks with larger link-state databases and BGP tables.

Bye for now! 🙂

Segment-routing + Opendaylight SDN + Pathman-SR + PCEP


This is a second technical post related to segment-routing. I did a basic introduction to this technology on Juniper MX here;

https://tgregory.org/2016/08/13/segment-routing-on-junos-the-basics/

For this post I’m looking at something a bit more advanced and fun – performing Segment-routing traffic-engineering using an SDN controller, in this case OpenDaylight Beryllium – an open source SDN controller with some very powerful functionality.

This post will use Cisco ASR9kV virtual routers running on a Cisco UCS chassis, mostly because Cisco currently have the leading-edge support for Segment-routing, whereas Juniper seem to be lagging behind a bit on that front!

Let’s check out the topology;

odl1

It’s a pretty simple scenario – all of the routers in the topology are configured in the following way;

  • XRV-1 and XRV-8; PE routers (BGP IPv4)
  • XRV-2 to XRV-7; P routers (ISIS-Segment-routing)
  • XRV-4 is an in-path RR connecting to the ODL controller

odl2

The first thing to look at here is BGP-LS “BGP Link-state” which is an extension of BGP that allows IGP information (OSPF/ISIS) to be injected into BGP, this falls conveniently into the world of centralised path computation – where we can use a controller of some sort to look at the network’s link-state information, then compute a path through the network. The controller can then communicate that path back down to a device within the network using a different method, ultimately resulting in an action of some sort – for example, signalling an LSP.
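As a rough illustration of what "compute a path" means once a controller holds the link-state topology, here is a plain Dijkstra shortest-path sketch. The four-node topology and its metrics are hypothetical (not the lab topology), and a real PCE would also take constraints such as bandwidth or affinity into account.

```python
import heapq

# Hypothetical link-state view: {node: {neighbour: IGP metric}}.
TOPOLOGY = {
    "R1": {"R2": 10, "R3": 10},
    "R2": {"R1": 10, "R4": 10},
    "R3": {"R1": 10, "R4": 5},
    "R4": {"R2": 10, "R3": 5},
}


def shortest_path(graph, src, dst):
    """Dijkstra over the graph; returns (cost, [nodes]) or None if unreachable."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, metric in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + metric, neighbour, path + [neighbour]))
    return None


if __name__ == "__main__":
    print(shortest_path(TOPOLOGY, "R1", "R4"))
```

The returned node list is the kind of result a controller would then hand back to the network, for example as a segment list.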

Some older platforms, such as HP Route Analytics, enabled you to discover the live IGP topology by running ISIS or OSPF directly with a network device. However, IGPs tend to be very intense protocols and also require additional effort to support within an application, rather than a traditional router. IGPs are also usually limited to the domain within which they operate, so if we have a large network with many different IGP domains or inter-domain MPLS, the IGP’s view becomes much more limited. BGP on the other hand can bridge many of these gaps, and when programmed with the ability to carry IGP information, can be quite useful.

The next element is PCE or Path computation element – which generally contains two core elements;

  • PCC – Path computation client – In the case of this lab network, a PCC would be a PE router
  • PCE – Path computation element – In the case of this lab network, the PCE would be the ODL controller

These elements communicate using PCEP (Path computation element protocol) which allows a central controller (in this case ODL) to essentially program the PCC with a path – for example, by signalling the actual LSP;

Basic components;

yeee

Basic components plus an application (in this case Pathman-SR) which can compute and signal an LSP from ODL to the PCC (XRV-1);

pathman

In the above example, an opensource application (in this case Pathman-SR) is using the information about the network topology obtained via BGP-LS and PCE, stored inside ODL – to compute and signal a Segment-routing LSP from XRV-1 to XRV-8, via XRV3, XRV5 and XRV7.

Before we look at the routers, let’s take a quick look at OpenDaylight. General information can be found here; https://www.opendaylight.org I’m running Beryllium 0.4.3, which is the same as Cisco’s dCloud demo. It’s a relatively straightforward install process, and I’m running my copy on top of a standard Ubuntu install.

yang

From inside ODL you can use the YANG UI to query information held inside the controller, which is essentially a much easier way of querying the data, using presets – for example, I can view the link-state topology learnt via BGP-LS pretty easily;

topology

There’s a whole load of functionality possible with ODL, from BGP-Flowspec, to Openflow, to LSP provisioning, for now we’re just going to keep it basic – all of this is opensource and requires quite a bit of “playing” to get working.

Let’s take a look at provisioning some segment-routing TE tunnels, first a reminder of the diagram;

odl1

And an example of some configuration – XRv-1

ISIS;

  1. router isis CORE-SR
  2.  is-type level-2-only
  3.  net 49.0001.0001.0001.00
  4.  address-family ipv4 unicast
  5.   metric-style wide
  6.   mpls traffic-eng level-2-only
  7.   mpls traffic-eng router-id Loopback0
  8.   redistribute static
  9.   segment-routing mpls
  10.  !
  11.  interface Loopback0
  12.   address-family ipv4 unicast
  13.    prefix-sid index 10
  14.   !
  15.  !
  16.  interface GigabitEthernet0/0/0/0.12
  17.   point-to-point
  18.   address-family ipv4 unicast
  19.   !
  20.  !
  21.  interface GigabitEthernet0/0/0/1.13
  22.   point-to-point
  23.   address-family ipv4 unicast
  24.   !
  25.  !
  26. !

 

A relatively simple ISIS configuration, with nothing remarkable going on,

  • Line 9 enables Segment-Routing for ISIS
  • Line 13 injects a SID (Segment-identifier) of 10 into ISIS for loopback 0
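A prefix-SID index only becomes an MPLS label once it’s added to the base of the SRGB (segment-routing global block). The sketch below assumes an SRGB starting at 16000, which is consistent with the labels in the tunnel output later in this post (XRv-4 uses index 40 and shows up as label 16040), but treat the base as an assumption for any other network.

```python
SRGB_BASE = 16000  # assumed SRGB start, consistent with the 160xx labels in this lab


def prefix_sid_label(index, srgb_base=SRGB_BASE):
    """Return the absolute MPLS label for a prefix-SID index."""
    if index < 0:
        raise ValueError("prefix-SID index must not be negative")
    return srgb_base + index


if __name__ == "__main__":
    print(prefix_sid_label(10))  # XRv-1 loopback, index 10
    print(prefix_sid_label(40))  # XRv-4 loopback, index 40
```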

The other aspect of the configuration which generates a bit of interest, is the PCE and mpls traffic-eng configuration;

  1. mpls traffic-eng
  2.  pce
  3.   peer source ipv4 49.1.1.1
  4.   peer ipv4 192.168.3.250
  5.   !
  6.   segment-routing
  7.   logging events peer-status
  8.   stateful-client
  9.    instantiation
  10.   !
  11.  !
  12.  logging events all
  13.  auto-tunnel pcc
  14.   tunnel-id min 1 max 99
  15.  !
  16.  reoptimize timers delay installation 0
  17. !

 

  • Line 1 enables basic traffic-engineering, an important point to note – to do MPLS-TE for Segment-routing, you don’t need to turn on TE on every single interface like you would if you were using RSVP, so long as ISIS TE is enabled and
  • Lines 2, 3 and 4 connect the router from its loopback address to the OpenDaylight controller and enable PCE
  • Lines 6 through 9 specify the segment-routing parameters for TE
  • Line 14 specifies the tunnel ID for automatically generated tunnels – for tunnels spawned by the controller

Going back to the diagram, XRv-4 was also configured for BGP-LS;

  1. router bgp 65535
  2.  bgp router-id 49.1.1.4
  3.  bgp cluster-id 49.1.1.4
  4.  address-family ipv4 unicast
  5.  !
  6.  address-family link-state link-state
  7.  !
  8.  neighbor 49.1.1.1
  9.   remote-as 65535
  10.   update-source Loopback0
  11.   address-family ipv4 unicast
  12.    route-reflector-client
  13.   !
  14.  !
  15.  neighbor 49.1.1.8
  16.   remote-as 65535
  17.   update-source Loopback0
  18.   address-family ipv4 unicast
  19.    route-reflector-client
  20.   !
  21.  !
  22.  neighbor 192.168.3.250
  23.   remote-as 65535
  24.   update-source GigabitEthernet0/0/0/5
  25.   address-family ipv4 unicast
  26.    route-reflector-client
  27.   !
  28.   address-family link-state link-state
  29.    route-reflector-client
  30.   !
  31.  !
  32. !

 

  • Line 6 enables the BGP Link-state AFI/SAFI
  • Lines 8 through 19 are standard BGP RR config for IPv4
  • Line 22 is the BGP peer for the Opendaylight controller
  • Line 28 turns on the link-state AFI/SAFI for Opendaylight

Also of interest on XRv-4 is the ISIS configuration;

  1. router isis CORE-SR
  2.  is-type level-2-only
  3.  net 49.0001.0001.0004.00
  4.  distribute bgp-ls
  5.  address-family ipv4 unicast
  6.   metric-style wide
  7.   mpls traffic-eng level-2-only
  8.   mpls traffic-eng router-id Loopback0
  9.   redistribute static
  10.   segment-routing mpls
  11.  !
  12.  interface Loopback0
  13.   address-family ipv4 unicast
  14.    prefix-sid index 40
  15.   !
  16.  !

 

  • Line 4 copies the ISIS link-state information into BGP-link state

If we do a “show bgp link-state link-state” we can see the information taken from ISIS, injected into BGP – and subsequently advertised to Opendaylight;

RP/0/RP0/CPU0:XRV9k-4#show bgp link-state link-state
Thu Dec  1 21:40:44.032 UTC
BGP router identifier 49.1.1.4, local AS number 65535
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 78
BGP main routing table version 78
BGP NSR Initial initsync version 78 (Reached)
BGP NSR/ISSU Sync-Group versions 0/0
BGP scan interval 60 secs
Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale, N Nexthop-discard
Origin codes: i - IGP, e - EGP, ? - incomplete
Prefix codes: E link, V node, T IP reacheable route, u/U unknown
              I Identifier, N local node, R remote node, L link, P prefix
              L1/L2 ISIS level-1/level-2, O OSPF, D direct, S static/peer-node
              a area-ID, l link-ID, t topology-ID, s ISO-ID,
              c confed-ID/ASN, b bgp-identifier, r router-ID,
              i if-address, n nbr-address, o OSPF Route-type, p IP-prefix
              d designated router address
   Network            Next Hop            Metric LocPrf Weight Path
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0002.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0003.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0004.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0005.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0006.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0007.00]]/328
                      0.0.0.0                                0 i
*> [V][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0008.00]]/328
                      0.0.0.0                                0 i
*> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]][R[c65535][b0.0.0.0][s0001.0001.0002.00]][L[i10.10.12.0][n10.10.12.1]]/696
                      0.0.0.0                                0 i
*> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]][R[c65535][b0.0.0.0][s0001.0001.0003.00]][L[i10.10.13.0][n10.10.13.1]]/696
                      0.0.0.0                                0 i
*> [E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0002.00]][R[c65535][b0.0.0.0][s0001.0001.0001.00]][L[i10.10.12.1][n10.10.12.0]]/696
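The NLRI strings in that output look cryptic, but they decode fairly easily using the "Prefix codes" legend printed at the top. The sketch below is a rough regex-based decoder I’d use for reading the CLI text only; the function name and the returned dictionary layout are my own invention, and it is not a parser for the on-the-wire BGP-LS encoding.

```python
import re

NLRI_TYPES = {"V": "node", "E": "link", "T": "prefix"}


def _first(pattern, text):
    """Return the first regex capture group, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def parse_ls_nlri(nlri):
    """Decode the CLI text form of a BGP-LS NLRI, e.g. '[V][L2][I0x0][N[...]]/328'."""
    result = {
        "type": NLRI_TYPES.get(nlri[1], "unknown"),
        # 's' is the ISO-ID: local node first, then the remote node for links.
        "nodes": re.findall(r"\[s([0-9a-fA-F.]+)\]", nlri),
    }
    if result["type"] == "link":
        result["if_address"] = _first(r"\[i([\d.]+)\]", nlri)    # 'i' if-address
        result["nbr_address"] = _first(r"\[n([\d.]+)\]", nlri)   # 'n' nbr-address
    return result


if __name__ == "__main__":
    link = ("[E][L2][I0x0][N[c65535][b0.0.0.0][s0001.0001.0001.00]]"
            "[R[c65535][b0.0.0.0][s0001.0001.0002.00]][L[i10.10.12.0][n10.10.12.1]]/696")
    print(parse_ls_nlri(link))
```

Run against the link entry above, this gives the two ISIS system IDs at either end plus the interface and neighbour addresses, which is exactly the raw material a controller needs to draw the topology.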

 

With this information we can use an additional app on top of OpenDaylight to provision some Segment-routing LSPs, in this case I’m going to use something from Cisco Devnet called Pathman-SR – it essentially connects to ODL using REST to program the network, Pathman can be found here; https://github.com/CiscoDevNet/pathman-sr

Once it’s installed and running, simply browse to its URL (http://192.168.3.250:8020/cisco-ctao/apps/pathman_sr/index.html) and you’re presented with a nice view of the network;

pathman

From here, it’s possible to compute a path from one point to another, then signal that LSP on the network using PCEP. In this case, let’s program a path from XRv9k-1 to XRv9k-8, going via XRV9k-2, 4 and 7;

pathman3

Once Pathman has calculated the path, hit deploy. Pathman sends the path to ODL, which then connects via PCEP to XRV9k-1 and provisions the LSP;

pathman2

Once this is done, let’s check XRV9k-1 to see the SR-TE tunnel;

  1. RP/0/RP0/CPU0:XRV9k-1#sh ip int bri
  2. Thu Dec  1 22:05:38.799 UTC
  3. Interface                      IP-Address      Status          Protocol Vrf-Name
  4. Loopback0                      49.1.1.1        Up              Up       default
  5. tunnel-te1                     49.1.1.1        Up              Up       default
  6. GigabitEthernet0/0/0/0         unassigned      Up              Up       default
  7. GigabitEthernet0/0/0/0.12      10.10.12.0      Up              Up       default
  8. GigabitEthernet0/0/0/1         unassigned      Up              Up       default
  9. GigabitEthernet0/0/0/1.13      10.10.13.0      Up              Up       default
  10. GigabitEthernet0/0/0/2         100.1.0.1       Up              Up       default
  11. GigabitEthernet0/0/0/3         192.168.3.248   Up              Up       default
  12. GigabitEthernet0/0/0/4         unassigned      Shutdown        Down     default
  13. GigabitEthernet0/0/0/5         unassigned      Shutdown        Down     default
  14. GigabitEthernet0/0/0/6         unassigned      Shutdown        Down     default
  15. MgmtEth0/RP0/CPU0/0            unassigned      Shutdown        Down     default

 

We can see from the output of “show ip int brief” on line 5, that interface tunnel-te1 has been created, but it’s nowhere in the config;

RP/0/RP0/CPU0:XRV9k-1#sh run interface tunnel-te1
Thu Dec  1 22:07:41.409 UTC
% No such configuration item(s)
RP/0/RP0/CPU0:XRV9k-1#

 

PCE signalled LSPs never appear in the configuration, they’re created, managed and deleted by the controller. It is possible to manually add an LSP then delegate it to the controller, but that’s beyond the scope here (that’s technical speak for “I couldn’t make it work” 🙂)

Let’s check out the details of the SR-TE tunnel;

  1. RP/0/RP0/CPU0:XRV9k-1#show mpls traffic-eng tunnels
  2. Thu Dec  1 22:09:56.983 UTC
  3. Name: tunnel-te1  Destination: 49.1.1.8  Ifhandle:0x8000064 (auto-tunnel pcc)
  4.   Signalled-Name: XRV9k-1 -> XRV9k-8
  5.   Status:
  6.     Admin:    up Oper:   up   Path:  valid   Signalling: connected
  7.     path option 10, (Segment-Routing) type explicit (autopcc_te1) (Basis for Setup)
  8.     G-PID: 0x0800 (derived from egress interface properties)
  9.     Bandwidth Requested: 0 kbps  CT0
  10.     Creation Time: Thu Dec  1 22:01:21 2016 (00:08:37 ago)
  11.   Config Parameters:
  12.     Bandwidth:        0 kbps (CT0) Priority:  7  7 Affinity: 0x0/0xffff
  13.     Metric Type: TE (global)
  14.     Path Selection:
  15.       Tiebreaker: Min-fill (default)
  16.       Protection: any (default)
  17.     Hop-limit: disabled
  18.     Cost-limit: disabled
  19.     Path-invalidation timeout: 10000 msec (default), Action: Tear (default)
  20.     AutoRoute: disabled  LockDown: disabled   Policy class: not set
  21.     Forward class: 0 (default)
  22.     Forwarding-Adjacency: disabled
  23.     Autoroute Destinations: 0
  24.     Loadshare:          0 equal loadshares
  25.     Auto-bw: disabled
  26.     Path Protection: Not Enabled
  27.     BFD Fast Detection: Disabled
  28.     Reoptimization after affinity failure: Enabled
  29.     SRLG discovery: Disabled
  30.   Auto PCC:
  31.     Symbolic name: XRV9k-1 -> XRV9k-8
  32.     PCEP ID: 2
  33.     Delegated to: 192.168.3.250
  34.     Created by: 192.168.3.250
  35.   History:
  36.     Tunnel has been up for: 00:08:37 (since Thu Dec 01 22:01:21 UTC 2016)
  37.     Current LSP:
  38.       Uptime: 00:08:37 (since Thu Dec 01 22:01:21 UTC 2016)
  39.   Segment-Routing Path Info (PCE controlled)
  40.     Segment0[Node]: 49.1.1.2, Label: 16020
  41.     Segment1[Node]: 49.1.1.4, Label: 16040
  42.     Segment2[Node]: 49.1.1.7, Label: 16070
  43.     Segment3[Node]: 49.1.1.8, Label: 16080
  44. Displayed 1 (of 1) heads, 0 (of 0) midpoints, 0 (of 0) tails
  45. Displayed 1 up, 0 down, 0 recovering, 0 recovered heads
  46. RP/0/RP0/CPU0:XRV9k-1#

 

Points of interest;

  • Line 4 shows the name of the LSP as configured by Pathman
  • Line 7 shows that the signalling is Segment-routing via autoPCC
  • Lines 33 and 34 show the tunnel was generated by the Opendaylight controller
  • Line 39 shows the LSP is PCE controlled
  • Lines 40 through 43 show the programmed path
  • Line 44 basically shows XRV9k-1 being the SR-TE headend

Lines 40-43 show some of the main benefits of Segment-routing: we have a programmed traffic-engineered path through the network, but with far less control-plane overhead than if we’d done this with RSVP-TE. For example, let’s look at the routers in the path (xrv-2, xrv-4 and xrv-7);

RP/0/RP0/CPU0:XRV9k-2#show mpls traffic-eng tunnels
Thu Dec  1 22:14:38.855 UTC
RP/0/RP0/CPU0:XRV9k-2#

RP/0/RP0/CPU0:XRV9k-4#show mpls traffic-eng tunnels
Thu Dec  1 22:14:45.915 UTC
RP/0/RP0/CPU0:XRV9k-4#

RP/0/RP0/CPU0:XRV9k-7#show mpls traffic-eng tunnels
Thu Dec  1 22:15:17.873 UTC
RP/0/RP0/CPU0:XRV9k-7#

 

Essentially – the path that the SR-TE tunnel takes contains no real control-plane state, this is a real advantage for large networks as the whole thing is much more efficient.
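Here is a toy model of why the transit routers hold no tunnel state: the headend pushes the whole segment list as a label stack, and every other router only needs the network-wide mapping of node-SID label to node, which ISIS already gives it. The label values come from the tunnel output above; the walk itself is heavily simplified (it ignores penultimate-hop popping and any intermediate hops between segments), so treat it as a sketch rather than a forwarding-plane model.

```python
# Node-SID labels as shown in the SR-TE tunnel output on XRV9k-1.
NODE_SID_OWNER = {
    16020: "XRV9k-2",
    16040: "XRV9k-4",
    16070: "XRV9k-7",
    16080: "XRV9k-8",
}

# The label stack imposed by the headend, top of stack first.
LABEL_STACK = [16020, 16040, 16070, 16080]


def walk_segments(label_stack, node_sid_owner):
    """Return the segment endpoints visited as each owner pops its own label."""
    visited = []
    stack = list(label_stack)
    while stack:
        top = stack.pop(0)                   # the owner of the top label removes it
        visited.append(node_sid_owner[top])  # only a shared label-to-node table is needed
    return visited


if __name__ == "__main__":
    print(" -> ".join(walk_segments(LABEL_STACK, NODE_SID_OWNER)))
```

Notice that nothing in the lookup table is specific to tunnel-te1, which is why "show mpls traffic-eng tunnels" comes back empty on the routers in the middle.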

The only pitfall here is that whilst we’ve generated a Segment-routed LSP, like all MPLS-TE tunnels we need to tell the router to put traffic into it. Normally we do this with autoroute-announce or a static route. At this time OpenDaylight doesn’t support the PCEP extensions to actually configure a static route, so we still need to manually put traffic into the tunnel. This is fixed in Cisco’s OpenSDN and WAE (WAN Automation Engine);

router static
 address-family ipv4 unicast
  49.1.1.8/32 tunnel-te1
 !
!

 

I regularly do testing and development work with some of the largest ISPs in the UK – and something that regularly comes up, is where customers are running a traditional full-mesh of RSVP LSPs, if you have 500 edge routers – that’s 250k LSPs being signalled end to end, the “P” routers in the network need to signal and maintain all of that state. When I do testing in these sorts of environments, it’s not uncommon to see nasty problems with route-engine CPUs when links fail, as those 250k LSPs end up having to be re-signalled – indeed this very subject came up in a conversation at LINX95 last week.
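The 250k figure is just the full-mesh arithmetic: RSVP LSPs are unidirectional, so every edge router needs one LSP to every other edge router. A quick sketch (the function name is mine):

```python
def full_mesh_lsps(edge_routers):
    """Number of unidirectional LSPs in a full mesh of edge routers."""
    return edge_routers * (edge_routers - 1)


if __name__ == "__main__":
    for n in (10, 100, 500):
        print(n, "edge routers ->", full_mesh_lsps(n), "LSPs")
```

Because the count grows with the square of the number of edge routers, doubling the edge roughly quadruples the state the core has to signal and maintain.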

With Segment-routing, the traffic-engineered path is basically encoded into the packet with MPLS labels – the only real difficulty is that it requires the use of more labels in the packet, but once the hardware can deal with the label-depth, I think it’s a much better solution than RSVP, it’s more efficient and it’s far simpler.

From my perspective, all I’ve really shown here is a basic LSP provisioning tool, but it’s nice to be able to get the basics working. In the future I hope to get my hands on a segment-routing enabled version of Northstar, or Cisco’s OpenSDN controller (which is Cisco’s productised version of ODL) 🙂

 

EVPN vs PBB-EVPN

This is the next in a series of technical posts relating to EVPN, in particular PBB-EVPN (Provider backbone bridging, Ethernet VPN). It attempts to explain the basic setup, application and problems solved within a very large layer-2 environment. Readers new to EVPN may wish to start with my first post, which gives examples of the most basic implementation of regular EVPN;

https://tgregory.org/2016/06/04/evpn-in-action-1/

Regular EVPN without a doubt is the future of MPLS based multi-point layer-2 VPN connectivity. It adds the highly scalable BGP based control-plane that’s been used to good effect in Layer-3 VPNs for over a decade. It has much better mechanisms for handling BUM (broadcast, unknown unicast, multicast) traffic and can properly do active-active layer-2 forwarding. Because EVPN PEs all synchronise their ARP tables with one another, you can design large layer-2/layer-3 networks that stretch across numerous data centres or POPs, and move machines around at layer-2 or layer-3 without having to re-address or re-provision. You can learn how to do this here;

https://tgregory.org/2016/06/11/inter-vlan-routing-mobility/

But like any technology it can never be perfect from day one, EVPN contains more layer-2 and layer-3 functionality than just about any single protocol developed so far, but it comes at a cost – control-plane resources, consider the following scenario;

capture

The above example is an extremely simple example of a network with 3x data centres, each data centre has 1k hosts sat behind it. The 3x “P” routers in the centre of the network are running ISIS and LDP only, each edge router (MX-1 through MX-3) is running basic EVPN with all hosts in a single VLAN.

A quick recap of the basic config (configs identical on all 3x PE routers, with the exception of IP addresses)

interfaces {
    ge-1/0/0 {
        flexible-vlan-tagging;
        encapsulation flexible-ethernet-services;
        unit 100 {
            encapsulation vlan-bridge;
            vlan-id 100;
        }
    }
}
routing-instances {
    EVPN-100 {
        instance-type virtual-switch;
        route-distinguisher 1.1.1.1:100;
        vrf-target target:100:100;
        protocols {
            evpn {
                extended-vlan-list 100;
            }
        }
        bridge-domains {
            VL-100 {
                vlan-id 100;
                interface ge-1/0/0.100;
            }
        }
    }
}
protocols {
    bgp {
        group fullmesh {
            type internal;
            local-address 10.10.10.1;
            family evpn {
                signaling;
            }
            neighbor 10.10.10.2;
            neighbor 10.10.10.3;
        }
    }
}

This is only a small-scale setup using MX-5 routers but it’s easy to use this example in order to project the problem – that is, the EVPN control-plane is quite resource intensive.

With 1k hosts per site – that equates to 3k EVPN BGP routes that need to be advertised – which isn’t that bad, however if you’re a large service-provider spanning many data-centres across a whole country, or even multiple countries – 3k routes is a tiny amount, it may be the case that you have hundreds of thousands or millions of EVPN routes spread across hundreds or thousands of edge routers.

Having hundreds of thousands, or millions of routes is a problem that can normally be easily dealt with if things are IPv4 or IPv6, in that we can rely on summarising these routes down into blocks or aggregates in BGP – to make things much more sensible.

However in the layer-2 world, it’s not possible to summarise mac-addresses as they’re mostly completely random, in regular EVPN – if I have 1 million hosts, that’s going to equate to 1 million EVPN MAC routes which get advertised everywhere – which isn’t going to run very smoothly at all, once we start moving hosts around, or have any large-scale failures in the network that might require huge numbers of hosts to move from one place to another.

If I spin up the 3x 1k hosts in IXIA, spread across all 3x sites – we can clearly see the amount of EVPN control-plane state being generated and advertised across the network;

tim@MX5-1> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.1:100
  Per-instance MAC route label: 300640
  MAC database status                Local  Remote
    Total MAC addresses:              1000    2000
    Default gateway MAC addresses:       0       0
  Number of local interfaces: 1 (1 up)
    Interface name  ESI                            Mode             Status
    ge-1/0/0.100    00:00:00:00:00:00:00:00:00:00  single-homed     Up
  Number of IRB interfaces: 0 (0 up)
  Number of bridge domains: 1
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   300704
  Number of neighbors: 2
    10.10.10.2
      Received routes
        MAC address advertisement:           1000
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                0
    10.10.10.3
      Received routes
        MAC address advertisement:           1000
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                0
  Number of ethernet segments: 0

 

And obviously all of this information is injected into BGP – all of which needs to be advertised and distributed;

tim@MX5-1> show bgp summary
Groups: 1 Peers: 2 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
bgp.evpn.0
                    2002       2002          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped…
10.10.10.2              100       1246       1410       0       1     5:51:04 Establ
  bgp.evpn.0: 1001/1001/1001/0
  EVPN-100.evpn.0: 1001/1001/1001/0
  __default_evpn__.evpn.0: 0/0/0/0
10.10.10.3              100       1243       1393       0       0     5:50:51 Establ
  bgp.evpn.0: 1001/1001/1001/0
  EVPN-100.evpn.0: 1001/1001/1001/0
  __default_evpn__.evpn.0: 0/0/0/0
tim@MX5-1>

 

With our 3k host setup, it’s obvious that things will be just fine – and I imagine a good 20-30k hosts in a well designed network running on routers with big CPUs and memory would probably be ok, however I suspect that in a large network already carrying the full BGP table, + L3VPNs and everything else, adding an additional 90-100k EVPN routes might not be such a good idea.
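The numbers in the BGP summary can be reproduced with some simple arithmetic: each PE receives one MAC route per remote host plus one inclusive-multicast route from each peer. The sketch below assumes a single VLAN, a full mesh of PEs and no multi-homing (so no auto-discovery routes), which matches this lab but not necessarily a production network.

```python
def evpn_routes_received(sites, hosts_per_site, im_routes_per_peer=1):
    """EVPN routes one PE receives from all of its peers in a full mesh."""
    peers = sites - 1
    return peers * (hosts_per_site + im_routes_per_peer)


if __name__ == "__main__":
    print(evpn_routes_received(3, 1000))     # the lab: 3 sites, 1k hosts each
    print(evpn_routes_received(100, 10000))  # a hypothetical larger provider
```

For the lab this gives 2002, the same figure shown against bgp.evpn.0, and it grows linearly with the total number of hosts because none of it can be summarised.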

So what’s available for large-scale layer-2 networks?

Normally in large layer-2 networks, QinQ is enough to provide sufficient scale for most large enterprises. With QinQ (802.1ad) we simply multiplex VLANs by adding a second VLAN service-tag (S-TAG), which allows us to represent many different customer tags, or C-TAGs. Because the size of the dot1q header allows for 4096 different VLANs, adding a second tag gives us 4096 x 4096 possible combinations, which equals over 16.7 million;

As a quick recap on QinQ – in the below example, all frames from Customer 1 for Vlans 1-4096 are encapsulated with an S-TAG of 100, whilst all frames from Customer 2, for the same VLAN range of 1-4096 are encapsulated in S-TAG 200;

qinq

The problem with QinQ (PBN, provider-bridged networks) is that it’s essentially limited to around 16.7 million possible combinations, which sounds like a lot. However, if you’re a large service provider with tens of millions of consumers, businesses and big-data products, 16.7 million isn’t very much.
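For anyone who wants to check the QinQ arithmetic, the VLAN ID field in a dot1q tag is 12 bits wide, so stacking two tags simply squares the number of values. The sketch below counts raw values; in practice a couple of IDs per tag (0 and 4095) are reserved, so the usable figure is slightly lower.

```python
VLAN_ID_BITS = 12  # width of the VLAN ID field in a dot1q tag


def tag_combinations(tags):
    """Raw number of combinations available with the given number of stacked tags."""
    return (2 ** VLAN_ID_BITS) ** tags


if __name__ == "__main__":
    print(tag_combinations(1))  # single dot1q tag
    print(tag_combinations(2))  # QinQ: S-TAG plus C-TAG
```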

Whether you run QinQ across a large switched enterprise network, or break out individual high-bandwidth interfaces into a switch and sell hundreds of leased-line services in a multi-tenant design using VPLS, you’re still always going to be limited to around 16.7 million in total, and that’s before we mention things like active-active multi-homing which doesn’t work with VPLS.

Another disadvantage is that with QinQ, every device in the network is required to learn all the customer mac addresses – so it’s quite easy to see the scaling problems from the beginning.

For truly massive scale, Big-Data, DCI, provider-transport etc – we need to make the leap to PBB (provider backbone bridging) but what exactly is PBB?

Before we look at PBB-EVPN, we should first take some time to understand what basic PBB is and what problems it solves;

PBB was originally defined as 802.1ah or “mac in mac” and is layer-2 end to end. Instead of merely adding another VLAN tag, PBB actually duplicates the mac layer of the customer frame and separates it from the provider domain by encapsulating it in a new set of headers. This allows for complete transparency between the customer network and the provider network.

Instead of multiplexing VLANs, PBB uses a 24-bit I-SID (service ID). The fact that it’s 24-bit gives us an immediate idea of the scale we’re talking about here: roughly 16.7 million possible services, each of which carries the customer’s original VLAN tags intact inside the encapsulated frame. PBB also introduces the concept of B-TAG or Backbone tag. This essentially hides the customer source/destination mac addresses behind a backbone entity, removing the requirement for every device in the network to learn every single customer mac-address, which is analogous in some ways with IP address aggregation because of the reduction in network state.

Check the below diagram for a basic topology and list of basic PBB terms;

diag1

  • PB = Provider bridge (802.1ad)
  • PEB = Provider edge bridge (802.1ad)
  • BEB = Backbone edge bridge (802.1ah)
  • BCB = Backbone core bridge (802.1ah)

The “BEB” backbone-edge-bridge is the first immediate point of interest within PBB, essentially it forms the boundary between the access-network and the core network and introduces 2x new concepts;

  • I-Component
  • B-Component

The I-Component essentially forms the customer or access facing interface or routing instance, the B-Component is the backbone facing PBB core instance – the B-Component uses B-MAC addressing (backbone MAC) in order to forward customer frames to the core based on the new imposed B-MAC, instead of any of the original S or C VLAN tags and C-MAC (customer-MAC) which would have been the case in a regular PBN QinQ setup.

In this case, the “BEB” or Backbone edge bridge forms the connectivity between the access and core. On one side it maintains a full layer-2 state with the access-network, however on the other side it operates only in B-MAC forwarding, where enormous numbers of services and huge numbers of C-MACs (customer MACs) on the access-side can be represented by individual B-MAC addresses on the core side. This obviously drastically reduces the amount of control-plane processing, especially in the core on the “BCB” Backbone core bridges, where forwarding is performed using B-MACs only.

In terms of C-MAC and B-MAC, it’s easier if you break the network up into two distinct sets of space, namely “C-MAC space” and “B-MAC space”

diag2

It’s pretty easy to talk about C-MAC or customer mac addressing as that’s something we’ve all been dealing with since we sent our first ping packet – however B-MACs are a new concept.

Within the PBB network, each “BEB” Backbone-edge-bridge has one or more B-MAC identifiers which are unique on the entire network, and which can be assigned automatically or statically by design;

diag3

The interesting part starts when we begin looking at the packet flow from one side of the network to the other – in this case we’ll use the above diagram to send a packet from left to right – note how the composition of the frame changes at each section of the network, including the new PBB encapsulated frame;

diag4

If we look at the packet flow from left to right – a series of interesting things happen;

  1. The user on the left hand side, sends a regular single-tagged frame with a destination MAC address of “C2” on the far right hand side of the network.
  2. The ingress PEB (provider edge bridge) is performing regular QinQ, where it pushes a new “S-VLAN” onto the frame to create a brand new QinQ frame
  3. The double-tagged packet traverses the network, where it lands on the first BEB router – here the frame is received and the BEB generates a unique I-SID based on the S-VLAN and the B-MAC
  4. The BEB encapsulates the original frame with its C-VLAN and S-VLAN intact, and adds the new PBB encapsulation, complete with the PBB source and destination B-MACs (B4 and B2) and I-SID – and forwards it into the core of the network
  5. The BCBs in the core forward the frame based only on the source and destination B-MAC and take no notice of the internal original frame information
  6. The egress BEB strips the PBB header away and forwards the original QinQ frame onto the access network
  7. Eventually the egress PEB switch pops the S-VLAN and forwards the original frame with the destination mac of C2, to the user interface.
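
The steps above can be sketched in a few lines of Python. This is only a toy model under some assumptions: the class and function names are made up for illustration, the S-VLAN value of 200 and the source C-MAC "C1" are hypothetical, I've assumed B4 is the source B-MAC and B2 the destination, and the lookup tables are handed in ready-made rather than learnt (so there's no flooding of unknown destinations).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class QinQFrame:
    """Customer frame as it arrives at the ingress BEB (S-VLAN already pushed by the PEB)."""
    c_dst: str      # destination C-MAC
    c_src: str      # source C-MAC
    s_vlan: int     # service tag pushed by the PEB
    c_vlan: int     # original customer tag
    payload: bytes


@dataclass(frozen=True)
class PbbFrame:
    """Backbone frame: B-MACs and I-SID wrapped around the untouched customer frame."""
    b_dst: str
    b_src: str
    isid: int
    inner: QinQFrame


def beb_encapsulate(frame: QinQFrame, local_bmac: str,
                    svlan_to_isid: Dict[int, int],
                    cmac_to_bmac: Dict[str, str]) -> PbbFrame:
    """Ingress BEB: pick the I-SID for the service and the B-MAC hiding the destination C-MAC."""
    return PbbFrame(
        b_dst=cmac_to_bmac[frame.c_dst],
        b_src=local_bmac,
        isid=svlan_to_isid[frame.s_vlan],
        inner=frame,
    )


def bcb_next_hop(frame: PbbFrame, bmac_table: Dict[str, str]) -> str:
    """Core BCB: the lookup uses the backbone destination only, never the inner frame."""
    return bmac_table[frame.b_dst]


def beb_decapsulate(frame: PbbFrame) -> QinQFrame:
    """Egress BEB: strip the PBB header and hand back the original QinQ frame."""
    return frame.inner


# Hypothetical values for illustration only
original = QinQFrame(c_dst="C2", c_src="C1", s_vlan=200, c_vlan=100, payload=b"ping")
pbb = beb_encapsulate(original, local_bmac="B4",
                      svlan_to_isid={200: 100100},
                      cmac_to_bmac={"C2": "B2"})
print(pbb.b_src, pbb.b_dst, pbb.isid)
print(bcb_next_hop(pbb, {"B2": "port-towards-egress-BEB"}))
print(beb_decapsulate(pbb) == original)
```

The point to notice is that `bcb_next_hop` never touches `frame.inner`, which is exactly why the core doesn't need to know about any customer MACs.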

So that’s vanilla PBB in a nutshell – essentially, it’s a way of hiding the gigantic amount of customer mac-addresses behind a drastically smaller number of backbone mac-addresses, without the devices in the core having to learn and process all of the individual customer state. Combined with a new I-SID service identifier we can create an encapsulation that allows for a huge number of services.

But like most things – it’s not perfect.

In 2016 (and for literally the last decade) most modern networks have a simple MPLS core comprising PE and P routers. When it comes to PBB – we need the devices in the core to act as switches (the BCB backbone-core-bridge), performing forwarding decisions based on B-MAC addresses, which is obviously incompatible with a modern MPLS network where we’re switching packets between edge loopback addresses, using MPLS labels and recursion.

So the obvious question is – can we replace the BCB element in the middle with MPLS – whilst stealing the huge service scaling properties of the BEB PBB edge?

The answer is yes! By combining PBB with EVPN – we can replace the BCB element of the core, signal the “B-Component” using EVPN BGP signalling and encapsulate the whole thing inside MPLS using PE and P routers, so that the PBB-EVPN architecture now reflects something we’re all a little more used to;

diag5

We now have the vast scale of PBB – combined with the simplicity and elegance of a traditional basic MPLS core network, where the amount of network-wide state information has been drastically reduced. As opposed to regular PBB, which is layer-2 over layer-2, we’ve moved to a model which is much more like layer-2 over layer-2 over layer-3.

The next question is – how do we configure it? Whilst PBB-EVPN simplifies the control-plane across the core and allows huge numbers of layer-2 services to transit the network in a much simpler manner – it is a little more complex to configure on Juniper MX series routers, but we’ll go through it step by step 🙂

Before we look at the configuration – it’s easier to understand if we visualise what’s happening inside the router itself, by breaking the whole thing up into blocks;

diag6

Basically, we break the router up into several blocks – on Juniper both the customer facing I-Component and the backbone facing B-Component are configured as two separate routing-instances, with each routing-instance containing a bridge-domain. Each bridge-domain is different – the I-Component bridge-domain (BR-I-100) contains the physical tagged interface facing the customer, and sits alongside some service-options, the service-type, which is “ELAN” (a multipoint MEF carrier Ethernet service), and the I-SID that we’re going to use to identify the service – in this case “100100” for VLAN 100.

The B-Component also contains a bridge-domain “BR-B-100100” which forms the backbone facing bridge where the B-MAC is sourced from, it also defines the EVPN PBB options used to signal the core.

These routing-instances are connected together by a pair of special interfaces;

  • PIP – Provider instance port
  • CBP – Customer backbone port

These interfaces join the I-Component and B-Component routing-instances together and are a bit like the logical pseudo-interfaces normally found inside Juniper routers, used to connect certain logical elements together.

Let’s take a look at the configuration of the routing-instances on Juniper MX;

  • Note: PBB-EVPN only seems to be supported in very recent versions of Junos; these devices are MX5 routers running Junos 16.1R2.11
  • All physical connectivity is done on TRIO via a “MIC-3D-20GE-SFP” card
  1. PBB-EVPN-B-COMP {
  2.     instance-type virtual-switch;
  3.     interface cbp0.1000;
  4.     route-distinguisher 1.1.1.1:100;
  5.     vrf-target target:100:100;
  6.     protocols {
  7.         evpn {
  8.             control-word;
  9.             pbb-evpn-core;
  10.             extended-isid-list 100100;
  11.         }
  12.     }
  13.     bridge-domains {
  14.         BR-B-100100 {
  15.             vlan-id 999;
  16.             isid-list 100100;
  17.             vlan-id-scope-local;
  18.         }
  19.     }
  20. }
  21. PBB-EVPN-I-COMP {
  22.     instance-type virtual-switch;
  23.     interface pip0.1000;
  24.     bridge-domains {
  25.         BR-I-100 {
  26.             vlan-id 100;
  27.             interface ge-1/0/0.100;
  28.         }
  29.     }
  30.     pbb-options {
  31.         peer-instance PBB-EVPN-B-COMP;
  32.     }
  33.     service-groups {
  34.         CUST1 {
  35.             service-type elan;
  36.             pbb-service-options {
  37.                 isid 100100 vlan-id-list 100;
  38.             }
  39.         }
  40.     }
  41. }

 

If we look at the configuration line by line, it works out as follows – for the B-Component

  • Lines 2,4 and 5 represent normal EVPN route-distribution properties, (RD/RT etc)
  • Line 3 brings the customer backbone port into the B-Component routing-instance – this logically links the B-Component to the I-Component
  • Lines 8 and 9 specify the control-word and switch on the PBB-EVPN-CORE feature
  • Line 10 allows only an I-Component service with an I-SID of 100100 to be processed by the B-Component
  • Lines 13-17 are bridge-domain options
  • Line 15 references VLAN-999 – this is currently unused but needs to be configured, any value can be added here
  • Line 16 specifies the I-SID mapping

For the I-Component;

  • Line 23 adds the PIP (provider instance port) to the I-Component routing-instance
  • Lines 24 -29 are standard bridge-domain settings which add the physical customer facing interface (ge-1/0/0.100) for VLAN-ID 100, to the bridge-domain and routing-instance
  • Lines 30 and 31 activate the PBB service – and reference the “PBB-EVPN-B-COMP” routing-instance as the peer-instance for the service, this is how the I-Component is linked to the B-Component
  • Lines 33-37 reference the service group, in this case “CUST1” with the service-type set as ELAN (the MEF standard for multipoint layer-2 connectivity) the I-Component I-SID for this service for VLAN-100 is 100100 as defined on line 37

Let’s examine the PIP and CBP interfaces;

  1. interfaces {
  2.     ge-1/0/0 {
  3.         flexible-vlan-tagging;
  4.         encapsulation flexible-ethernet-services;
  5.         unit 100 {
  6.             encapsulation vlan-bridge;
  7.             vlan-id 100;
  8.         }
  9.     }
  10.     cbp0 {
  11.         unit 1000 {
  12.             family bridge {
  13.                 interface-mode trunk;
  14.                 bridge-domain-type bvlan;
  15.                 isid-list all;
  16.             }
  17.         }
  18.     }
  19.     pip0 {
  20.         unit 1000 {
  21.             family bridge {
  22.                 interface-mode trunk;
  23.                 bridge-domain-type svlan;
  24.                 isid-list all-service-groups;
  25.             }
  26.         }
  27.     }
  28. }

 

  • Lines 2 through 9 represent a standard gigabit Ethernet interface configured with vlan-bridge encapsulation for VLAN 100 – standard stuff we’re all used to seeing;
  • Lines 10 through 15 represent the CBP interface (customer backbone port) for unit 1000, where the bridge-domain-type is set to bvlan (backbone vlan) and any I-SID is accepted, this connects the B-Component to the I-Component
  • Lines 19 through 24 represent the PIP (provider instance port) for the same unit 1000, as an svlan bridge – using an I-SID list for any service-group
  • The PIP0 interface, connects the I-Component to the B-Component

A lot to remember so far! – another point worth mentioning is that PBB-EVPN doesn’t work unless you have the router set for “enhanced-ip mode”;

    chassis {
        network-services enhanced-ip;
    }

 

And – like our regular EVPN configuration from previous blog posts – we just have basic EVPN signalling turned on inside BGP;

    protocols {
        bgp {
            group mesh {
                type internal;
                local-address 10.10.10.1;
                family evpn {
                    signaling;
                }
                neighbor 10.10.10.2;
                neighbor 10.10.10.3;
            }
        }
    }

 

The fact that we have basic BGP EVPN signalling turned on is a real advantage – as it keeps the service in line with modern MPLS based offerings – where we have a simple LDP/IGP core with all the edge services (L3VPN, Multicast, IPv4, IPv6, L2VPN) controlled by a single protocol – BGP, which we all know and love.

So I have this configuration running on 3x MX5 routers – the configurations from the above snippets are identical across all 3x MX5 routers, with the obvious exception of IP addresses – let’s recap the diagram;

diag7

With the configuration applied – I’ll go ahead and spawn 3000 hosts using IXIA, each host is an emulated machine sat on the same layer-2 /16 subnet with VLAN-100 spanned across all three sites in the PBB-EVPN – basically just imagine MX-1 MX-2 and MX-3 as switches with 1000 laptops plugged into each one 🙂 to keep things simple – I’m only going to use single-tagged frames directly from IXIA and send a full-mesh of traffic to all sites, with each stream being 300Mbps, with 3x sites – that’s 900Mbps of traffic in total;

diag8

Traffic appears to be successfully forwarded end to end without delay – let’s check some of the show commands on MX-1;

We can clearly see 3000 hosts inside the bridge-mac table on MX5-1;

  1. tim@MX5-1> show bridge mac-table count
  2. 3 MAC address learned in routing instance PBB-EVPN-B-COMP bridge domain BR-B-100100
  3.   MAC address count per learn VLAN within routing instance:
  4.     Learn VLAN ID            MAC count
  5.               999                    3
  6. 3000 MAC address learned in routing instance PBB-EVPN-I-COMP bridge domain BR-I-100
  7.   MAC address count per interface within routing instance:
  8.     Logical interface        MAC count
  9.     ge-1/0/0.100:100              1000
  10.     rbeb.32768                    1000
  11.     rbeb.32769                    1000
  12.   MAC address count per learn VLAN within routing instance:
  13.     Learn VLAN ID            MAC count
  14.               100                 3000
  15. tim@MX5-1>

 

  • Line 2 shows the B-Component and 3x B-MACs being learnt via the PBB-EVPN
  • Line 6 shows the I-component and all 3000 mac-addresses live on the network, 1000 learnt locally via the directly connected interface, and 2000 learnt via RBEB.32768 and RBEB.32769 – the RBEB is the remote-backbone-edge-bridge – once the frames come in from the EVPN and the PBB headers are popped, the original C-MACs are learnt, which is why we see 3000 MACs in the I-Component locally, whilst we see only 3x B-MACs learnt remotely from the B-Component.

Let’s look at the BGP table;

  1. tim@MX5-1> show bgp summary
  2. Groups: 1 Peers: 2 Down peers: 0
  3. Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
  4. bgp.evpn.0
  5.                        4          4          0          0          0          0
  6. Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped…
  7. 10.10.10.2              100       1183       1177       0       0     8:50:22 Establ
  8.   bgp.evpn.0: 2/2/2/0
  9.   PBB-EVPN-B-COMP.evpn.0: 2/2/2/0
  10.   __default_evpn__.evpn.0: 0/0/0/0
  11. 10.10.10.3              100       1183       1179       0       0     8:50:18 Establ
  12.   bgp.evpn.0: 2/2/2/0
  13.   PBB-EVPN-B-COMP.evpn.0: 2/2/2/0
  14.   __default_evpn__.evpn.0: 0/0/0/0
  15. tim@MX5-1> show route protocol bgp table PBB-EVPN-B-COMP.evpn.0
  16. PBB-EVPN-B-COMP.evpn.0: 6 destinations, 6 routes (6 active, 0 holddown, 0 hidden)
  17. + = Active Route, - = Last Active, * = Both
  18. 2:1.1.1.1:100::100100::a8:d0:e5:5b:75:c8/304 MAC/IP
  19.                    *[BGP/170] 08:25:43, localpref 100, from 10.10.10.2
  20.                       AS path: I, validation-state: unverified
  21.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  22. 2:1.1.1.1:100::100100::a8:d0:e5:5b:94:60/304 MAC/IP  
  23.                    *[BGP/170] 08:24:56, localpref 100, from 10.10.10.3
  24.                       AS path: I, validation-state: unverified
  25.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299920
  26. 3:1.1.1.1:100::100100::10.10.10.2/304 IM
  27.                    *[BGP/170] 08:25:47, localpref 100, from 10.10.10.2
  28.                       AS path: I, validation-state: unverified
  29.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  30. 3:1.1.1.1:100::100100::10.10.10.3/304 IM
  31.                    *[BGP/170] 08:24:57, localpref 100, from 10.10.10.3
  32.                       AS path: I, validation-state: unverified
  33.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299920
  34. tim@MX5-1>

 

Here we can quickly see the savings made in memory, CPU and control-plane processing. We have a stretched layer-2 network with 3000 hosts; in regular EVPN we’d by now have 3000 EVPN MAC routes being advertised and received across the network, despite only 3x sites being in play. Here with PBB-EVPN we only have 3x B-MACs in the BGP table – with 2x being learnt remotely (shown on lines 18 and 22)

Technically we could have a million mac-addresses at each site – provided the switches could handle that many mac-addresses, we’d still only be advertising the 3x B-MACs across the core from the B-Component, so PBB-EVPN does provide massive scale – it’s true that locally we’d still need to learn 1 million C-MACs, but the difference is we don’t need to advertise them all back and forth across the network – that state remains local and is represented by a B-MAC and I-SID for that specific customer or service.
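
To put some rough numbers on that, here is a back-of-the-envelope sketch. It assumes one B-MAC per site, as in this lab, and only counts MAC advertisement routes (the inclusive multicast routes are left out); the function names are just for illustration.

```python
def evpn_mac_routes(sites: int, hosts_per_site: int) -> int:
    """Regular EVPN: every C-MAC becomes its own MAC route in BGP."""
    return sites * hosts_per_site


def pbb_evpn_mac_routes(sites: int, bmacs_per_site: int = 1) -> int:
    """PBB-EVPN: only the B-MACs are advertised, whatever sits behind them."""
    return sites * bmacs_per_site


for hosts in (1_000, 1_000_000):
    print(f"{hosts} hosts per site: "
          f"EVPN={evpn_mac_routes(3, hosts)} "
          f"PBB-EVPN={pbb_evpn_mac_routes(3)}")
```

However many hosts sit behind each site, the PBB-EVPN figure stays at the number of B-MACs.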

We can take a look at the bridge mac-table to see the different mac-addresses in play, for both the B-Component and the I-Component;

  1. tim@MX5-1> show bridge mac-table
  2. MAC flags       (S -static MAC, D -dynamic MAC, L -locally learned, C -Control MAC
  3.     O -OVSDB MAC, SE -Statistics enabled, NM -Non configured MAC, R -Remote PE MAC)
  4. Routing instance : PBB-EVPN-B-COMP
  5.  Bridging domain : BR-B-100100, VLAN : 999
  6.    MAC                 MAC      Logical          NH     RTR
  7.    address             flags    interface        Index  ID
  8.    01:1e:83:01:87:04   DC                        1048575 0      
  9.    a8:d0:e5:5b:75:c8   DC                        1048576 1048576
  10.    a8:d0:e5:5b:94:60   DC                        1048578 1048578
  11. MAC flags (S -static MAC, D -dynamic MAC,
  12.            SE -Statistics enabled, NM -Non configured MAC)
  13. Routing instance : PBB-EVPN-I-COMP
  14.  Bridging domain : BR-I-100, ISID : 100100, VLAN : 100
  15.    MAC                 MAC      Logical                 Remote
  16.    address             flags    interface               BEB address
  17.    00:00:00:bc:25:2f   D        ge-1/0/0.100        
  18.    00:00:00:bc:25:31   D        ge-1/0/0.100        
  19.    00:00:00:bc:25:33   D        ge-1/0/0.100        
  20.    00:00:00:bc:25:35   D        ge-1/0/0.100        
  21.    00:00:00:bc:25:37   D        ge-1/0/0.100        
  22.    00:00:00:bc:25:39   D        ge-1/0/0.100        
  23.    00:00:00:bc:25:3b   D        ge-1/0/0.100        
  24.    00:00:00:bc:25:3d   D        ge-1/0/0.100    
  25.         <omitted>
  26.    00:00:00:bc:2f:67   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  27.    00:00:00:bc:2f:69   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  28.    00:00:00:bc:2f:6b   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  29.    00:00:00:bc:2f:6d   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  30.    00:00:00:bc:2f:6f   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  31.    00:00:00:bc:2f:71   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  32.    00:00:00:bc:2f:73   D        rbeb.32768              a8:d0:e5:5b:75:c8  
  33.         <omitted>
  34.    00:00:00:bc:3c:65   D        rbeb.32769              a8:d0:e5:5b:94:60  
  35.    00:00:00:bc:3c:67   D        rbeb.32769              a8:d0:e5:5b:94:60  
  36.    00:00:00:bc:3c:69   D        rbeb.32769              a8:d0:e5:5b:94:60  
  37.    00:00:00:bc:3c:6b   D        rbeb.32769              a8:d0:e5:5b:94:60  
  38.    00:00:00:bc:3c:6d   D        rbeb.32769              a8:d0:e5:5b:94:60  
  39.    00:00:00:bc:3c:6f   D        rbeb.32769              a8:d0:e5:5b:94:60  
  40.    00:00:00:bc:3c:71   D        rbeb.32769              a8:d0:e5:5b:94:60

 

Because there are 3000x MAC addresses currently on the network I’ve omitted most of them so you can see the important differences;

  • Lines 8 through 10 show the B-MACs – line 8 is the 802.1ah group B-MAC used for flooded traffic (01:1e:83 followed by the I-SID, 100100 being 0x018704 in hex), and lines 9 and 10 show the B-MACs learnt from MX-2 and MX-3 via BGP (we can see these in the BGP routing table on lines 18 and 22 from the previous example)
  • Lines 17 through 24 give a small sample of the locally learnt mac-addresses connected to ge-1/0/0.100 in the I-Component
  • Lines 26 through 32 give a small sample of the remotely learnt mac-addresses from MX-2’s B-MAC (a8:d0:e5:5b:75:c8)
  • Lines 34 through 40 give a small sample of the remotely learnt mac-addresses from MX-3’s B-MAC (a8:d0:e5:5b:94:60)
  • Essentially – the EVPN control-plane is only present for B-MACs, whilst the C-MAC forwarding is handled in the same way as VPLS on the forwarding plane, the big advantage is that all this information isn’t thrown into BGP – it’s kept locally.
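
One detail worth a closer look in the B-Component table is the 01:1e:83:01:87:04 entry. It matches the pattern of the 802.1ah backbone service instance group address, which is the fixed prefix 01:1e:83 followed by the 24-bit I-SID. A quick sketch to check the arithmetic (the function name is mine, not anything from Junos):

```python
def isid_group_mac(isid: int) -> str:
    """Build the 802.1ah group B-MAC for an I-SID: 01:1e:83 + 24-bit I-SID."""
    if not 0 <= isid < 2 ** 24:
        raise ValueError("I-SID must fit in 24 bits")
    octets = bytes([0x01, 0x1E, 0x83]) + isid.to_bytes(3, "big")
    return ":".join(f"{octet:02x}" for octet in octets)


print(isid_group_mac(100100))
```

For I-SID 100100 this gives the same address that shows up in the mac-table output above.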

Finally, with traffic running – I have the connection between MX-1 and P1 tapped so I can capture packets into wireshark at line-rate, let’s look at a packet in the middle of the network to see what it looks like;

diag9

We can see the MPLS labels (2x labels, one for the IGP transport and one for the EVPN), below that we have our backbone Ethernet header with source and destination B-MAC (802.1ah provider backbone bridge), below that we have our 802.1ah PBB I-SID field with the customer’s original C-MAC, and last we have the original dot1q frame (single-tagged in this case)

So that’s pretty much it – as far as the basics of PBB-EVPNs are concerned, a few basic points;

  • PBB-EVPNs are considerably more complicated to configure than regular EVPN, however if you need massive scale and the ability to handle hundreds of thousands or millions of mac-addresses on the network – it’s currently one of the best technologies to look at
  • Unfortunately PBB-EVPN is pretty much layer-2 only, most of the fancy layer-3 hooks built into regular EVPN which I demonstrated in previous blog posts aren’t supported for PBB-EVPN, it is essentially a layer-2 solution
  • PBB-EVPN does support layer-2 multi-homing which I might look into with a later blog-post

I hope anyone reading this found it useful 🙂

Segment Routing on JUNOS – The basics

Anybody who’s been to any seminar, associated with any major networking systems manufacturer or bought any recent study material, will almost certainly have come across something new called “Segment Routing” it sounds pretty cool – but what is it and why has it been created?

To understand this we first need to rewind to what most of us are used to doing on a daily basis – designing/building/maintaining/troubleshooting networks, that are built mostly around LDP or RSVP-TE based protocols. But what’s wrong with these protocols? why has Segment-Routing been invented and what problems does it solve?

Before we delve into the depths of Segment-Routing, let’s first remind ourselves of what basic LDP based MPLS is. LDP or “Label Distribution Protocol” was first invented around 1999, superseding the now defunct “TDP” or “Tag distribution protocol” in order to solve the problems of traditional IPv4 based routing. Where control-plane resources were finite in nature, MPLS enabled routers to forward packets based solely on labels, rather than destination IP address, allowing for a much simpler design. The fact that the “M” in MPLS stands for “Multiprotocol” allowed engineers to support a whole range of different services and encapsulations, that could be tunnelled between devices in a network running nothing other than traditional IPv4. The role of LDP was to generate and distribute MPLS label bindings to other devices in a network, alongside a common IGP such as ISIS or OSPF.

Back in the late 1990s and early 2000s, routers were much smaller and far less powerful – especially where relatively resource intense protocols like OSPF or ISIS were concerned. There was also the problem that protocols like OSPF – which runs directly over IP – were very difficult to modify due to the size of the IP header, so rather than modify the IGPs to support MPLS natively, the decision was made to invent a totally separate protocol (LDP) to run alongside the IGPs simply to provide the MPLS label distribution and binding capability – many people today regard LDP as a “sticking plaster”, I myself prefer the phrase “gaffer tape” 🙂

A quick refresher on how LDP works using a pile of MX routers, consider the following basic topology;

seg3

All routers have an identical configuration, the only difference is the ISIS ISO address and the IP addressing;

    tim@MX-1> show configuration protocols
    isis {
        level 1 disable;
        interface xe-2/0/0.0 {
            point-to-point;
        }
        interface lo0.0 {
            passive;
        }
    }
    ldp {
        interface xe-2/0/0.0;
    }

 

Assume LDP adjacencies are established between all devices, the following sequence of events occurs;

  • MX-4 injects its local loopback 4.4.4.4/32 into ISIS, this is advertised throughout the network – LDP also creates an MPLS label-binding for label-value 3 (the implicit-null label) which is advertised towards MX-3;

seg4

  • MX-3 receives the prefix with the label-binding of 3 (implicit-null) and creates an entry in its forwarding table with a “pop” action, for any traffic destined for 4.4.4.4 out of interface xe-0/0/0 (essentially sending the packet unlabelled). At the same time it generates a new label of “299780” for 4.4.4.4 which is advertised towards MX-2;

seg5

  • When MX-2 receives 4.4.4.4 with a label binding of 299780, it adds the entry to its forwarding table out of interface xe-0/0/1, whilst at the same time advertising the prefix towards MX-1 with a different label of “299781”. MX-2 is now aware of 2x MPLS labels for 4.4.4.4 – the label of 299780 it received from MX-3 and the new label of 299781 it generated and sent to MX-1. This essentially means any packets coming from MX-1 towards 4.4.4.4, tagged with label 299781 on xe-0/0/0, will be swapped to 299780 and forwarded out of xe-0/0/1 – hence the “hop by hop” forwarding paradigm;

seg6
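
The label operations from the walkthrough above can be modelled as a tiny lookup table. This is just a sketch of the hop-by-hop idea, using the label values from the example; the table layout and function name are made up for illustration and have nothing to do with how Junos stores its forwarding state.

```python
# Label forwarding state taken from the walkthrough above:
# router -> {incoming label: (action, outgoing label, next hop)}
LFIB = {
    "MX-2": {299781: ("swap", 299780, "MX-3")},
    "MX-3": {299780: ("pop", None, "MX-4")},  # MX-4 advertised implicit-null, so pop here
}


def forward(label, router):
    """Follow a labelled packet hop by hop until the label has been popped."""
    hops = []
    while label is not None:
        action, out_label, next_hop = LFIB[router][label]
        hops.append((router, action, label, out_label))
        label, router = out_label, next_hop
    return router, hops


# MX-1 pushes 299781 (the label MX-2 advertised for 4.4.4.4) and sends to MX-2
egress, hops = forward(299781, "MX-2")
for router, action, in_label, out_label in hops:
    print(router, action, in_label, "->", out_label)
print("delivered unlabelled to", egress)
```

Every router along the path needs its own entry for the same prefix, which is the state that starts to add up as the network grows.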

With such a small network involving only 4x routers, it’s difficult to imagine running into problems with LDP because it’s so simple and easy, however the moment you go from 4x routers to 1000x routers or beyond it starts to become far less efficient;

  • Because LSRs generate labels for remote FECs on a hop-by-hop basis you end up with a large amount of MPLS labels contained in the LFIB which have to be distributed alongside the IGP, resulting in a large amount of overhead. In the above example we have multiple labels for a single prefix with only 3 routers (with the fourth simply advertising implicit-null so that MX-3 performs PHP)
  • We have to run LDP alongside the IGP everywhere, simply for MPLS to work – it’s true that we’ve all been doing this for years so why complain about it now when it works just fine? A simple solution is always the best solution, larger networks would be much simpler if the IGP could be made to accommodate the MPLS label advertisement functionality.
  • No traffic-engineering functionality; ultimately at the end of the day, in 99% of networks LDP simply “follows” the IGP best-path mechanism, if you change the IGP metrics you end up shifting large amounts of traffic around which is often undesirable – as such LDP tends to be a pain in the neck, if you have more complex traffic requirements, for example making sure that 40Gbps of streaming video avoids a certain link in the network – with LDP it can’t be done very easily without resorting to endless hacks and tactical tweaks.

So LDP is far from perfect when we get into more complicated scenarios, if we have a larger network where we want to do any sort of traffic-engineering – the only real alternative is RSVP-TE.

RSVP-TE is essentially an extension of the original “RSVP” Resource Reservation Protocol that allows it to generate MPLS labels for prefixes, whilst at the same time using its resource reservation capabilities to reserve specific LSPs through the network that require a certain amount of bandwidth – or simply reserving a path that’s determined by the network designer, rather than the IGP and its lowest-path-cost mentality.

The rather obvious cost with RSVP-TE is that it’s a lot more complex. I’ve lost count of the number of times I’ve suggested a relatively simple RSVP-TE solution to a traffic-engineering problem, for the people in the room to simply rule it out because it’s just too complex in nature – I’ve worked with a small number of global carrier/mobile networks who almost exclusively use RSVP-TE along with its fancy features, such as “auto-bandwidth”, but the vast majority of smaller networks tend to stay away from it.

A further problem with RSVP-TE is that in large networks with numerous “P” routers and “PE” routers, the LSP state between the ingress and egress LSR must be maintained – in a network with 1000’s of routers, all of that information needs to be signalled – including bandwidth reservations, path reservations and so on and so forth, as opposed to LDP where we simply bind an MPLS label. The end result can be that in some networks control-plane processing can be extremely intense on the route engines if the network encounters a significant failure – imagine a P router with 5k signalled LSPs traversing it, if it drops a link or card – those 5k LSPs need to be recalculated and re-signalled throughout the entire network.

To make matters worse, many networks run LDP and RSVP-TE at the same time, LDP for traditional basic MPLS connectivity, with RSVP-TE LSPs running over the top to provide the traffic-engineering capability, that might be needed in certain niche parts of the network – like keeping sensitive VOIP traffic separate from bulk internet traffic – the complexity ramps up pretty quickly in these environments and you end up with a lot of different protocols stacked up on top of each other – when all we really want to do is just forward packets between routers in a network………. 😀

 

Which brings me finally to Segment routing!

 

Segment routing is essentially proposed as a replacement for LDP or RSVP-TE, where the IGP (currently ISIS or OSPF) has been extended to incorporate the MPLS labelling and segment-routing functions internally, leading to the immediate obvious benefit, of not having to run an additional protocol alongside the IGP to provide the MPLS functionality – we can do everything inside ISIS or OSPF.

To make things even cooler, Segment-routing can operate over an IPv4 or IPv6 data-plane, supports ECMP and also has extensions built into it, which allow it to cater for things like L3-VPNs or VPLS running over the top. The only thing it can’t do is reserve bandwidth in the same way that RSVP-TE can, but this can be accomplished via the use of an external controller (SDN)

Segment routing support was released on Juniper MX routers under 15.1F6

For now let’s look at a basic topology, along with some of the basic concepts and configurations, consider the below expanded topology from the LDP examples above;

seg7

Everything is the same, except that I’ve gone and added an additional link between MX-2 and MX-4. The first step is to enable segment-routing, for this network I’m using ISIS as the IGP. Turning segment-routing on is pretty simple – I just need to have MPLS and ISIS enabled on the correct interfaces and switch on “source-packet-routing” under ISIS;

    tim@MX-1# show protocols
    mpls {
        interface xe-2/0/0.0;
    }
    isis {
        source-packet-routing;
        level 1 disable;
        interface xe-2/0/0.0 {
            point-to-point;
        }
        interface lo0.0 {
            passive;
        }
    }

 

Notice how it’s called “source-packet-routing” – essentially, Segment-routing uses a source routing paradigm, where the ingress PE determines the path through the network based on a set of instructions or “segments”

Contrast this with RSVP-TE, where the control-plane is source routed (the head-end LSR computes the path through the network to the tail-end) but the packets are only sent with a single RSVP MPLS label – so the control-plane is source-routed, but the data-plane is not.
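
To make the source-routing idea a bit more concrete, here is a deliberately simplified sketch where the ingress router picks the whole path by building a stack of per-link segments. The label values, table layout and function name are all hypothetical, and the model treats the ingress like any other hop for simplicity; it is only meant to show that the path lives in the packet rather than in per-LSP state on the transit routers.

```python
# Hypothetical adjacency segments: router -> {label: neighbour reached over that link}
ADJ_SIDS = {
    "MX-1": {1012: "MX-2"},
    "MX-2": {1023: "MX-3", 1024: "MX-4"},
    "MX-3": {1034: "MX-4"},
}


def walk(ingress, label_stack):
    """Each router consumes the top label and forwards over the adjacency it names."""
    router, path = ingress, [ingress]
    for label in label_stack:
        router = ADJ_SIDS[router][label]
        path.append(router)
    return path


print(walk("MX-1", [1012, 1023, 1034]))  # steer the packet via MX-3
print(walk("MX-1", [1012, 1024]))        # use the direct MX-2 to MX-4 link
```

Changing the path only means changing the label stack at the ingress, nothing has to be re-signalled through the middle of the network.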

With “segment-routing” enabled on all the routers in the network, let’s take a look and see what’s what;

We have a normal ISIS adjacency on MX-1;

    tim@MX-1> show isis adjacency
    Interface             System         L State        Hold (secs) SNPA
    xe-2/0/0.0            MX-2           2  Up                   21
    {master}
    tim@MX-1>

 

Let’s check out the ISIS database and see if anything new is present;

  1. tim@MX-1> show isis database extensive MX-2.00
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-2.00-00 Sequence: 0x28, Checksum: 0x4cff, Lifetime: 616 secs
  5.    IS neighbor: MX-1.00                       Metric:       10
  6.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  7.    IS neighbor: MX-3.00                       Metric:       10
  8.      Two-way fragment: MX-3.00-00, Two-way first fragment: MX-3.00-00
  9.    IS neighbor: MX-4.00                       Metric:       10
  10.      Two-way fragment: MX-4.00-00, Two-way first fragment: MX-4.00-00
  11.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  12.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  13.    IP prefix: 10.10.10.2/31                   Metric:       10 Internal Up
  14.    IP prefix: 10.10.10.4/31                   Metric:       10 Internal Up
  15.   Header: LSP ID: MX-2.00-00, Length: 315 bytes
  16.     Allocated length: 335 bytes, Router ID: 2.2.2.2
  17.     Remaining lifetime: 616 secs, Level: 2, Interface: 327
  18.     Estimated free bytes: 81, Actual free bytes: 20
  19.     Aging timer expires in: 616 secs
  20.     Protocols: IP, IPv6
  21.   Packet: LSP ID: MX-2.00-00, Length: 315 bytes, Lifetime : 1198 secs
  22.     Checksum: 0x4cff, Sequence: 0x28, Attributes: 0x3 <L1 L2>
  23.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  24.     Packet type: 20, Packet version: 1, Max area: 0
  25.   TLVs:
  26.     Area address: 49.0001 (3)
  27.     LSP Buffer Size: 1492
  28.     Speaks: IP
  29.     Speaks: IPV6
  30.     IP router id: 2.2.2.2
  31.     IP address: 2.2.2.2
  32.     Hostname: MX-2
  33.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  34.       SPRING Algorithm – Algo: 0
  35.     IS neighbor: MX-1.00, Internal, Metric: default 10
  36.     IS neighbor: MX-3.00, Internal, Metric: default 10
  37.     IS neighbor: MX-4.00, Internal, Metric: default 10
  38.     IS extended neighbor: MX-1.00, Metric: default 10
  39.       IP address: 10.10.10.1
  40.       Neighbor’s IP address: 10.10.10.0
  41.       Local interface index: 328, Remote interface index: 327
  42.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  43.     IS extended neighbor: MX-3.00, Metric: default 10
  44.       IP address: 10.10.10.2
  45.       Neighbor’s IP address: 10.10.10.3
  46.       Local interface index: 329, Remote interface index: 333
  47.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299783
  48.     IS extended neighbor: MX-4.00, Metric: default 10
  49.       IP address: 10.10.10.4
  50.       Neighbor’s IP address: 10.10.10.5
  51.       Local interface index: 331, Remote interface index: 333
  52.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299785
  53.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  54.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  55.     IP prefix: 10.10.10.2/31, Internal, Metric: default 10, Up
  56.     IP prefix: 10.10.10.4/31, Internal, Metric: default 10, Up
  57.     IP extended prefix: 2.2.2.2/32 metric 0 up
  58.     IP extended prefix: 10.10.10.0/31 metric 10 up
  59.     IP extended prefix: 10.10.10.2/31 metric 10 up
  60.     IP extended prefix: 10.10.10.4/31 metric 10 up
  61.   No queued transmissions
  62. {master}
  63. tim@MX-1>

 

So if we look at the ISIS database against MX-1’s neighbour (MX-2) we can see some additional things happening in ISIS;

  • We can see that SPRING (Segment-routing) is turned on and is a known TLV
  • We can see something called a “P2P IPv4 Adj-SID” with an associated MPLS label

The “IPv4 Adj-SID” is known as the IGP adjacency segment, and is essentially a segment attached to a directly connected IGP adjacency. It’s injected locally by the router at either side of the adjacency – this can easily be demonstrated if we simply have a single link between MX-1 and MX-2;

seg8

We take another look at the ISIS database on MX-1;

  1. tim@MX-1> show isis database extensive
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-1.00-00 Sequence: 0x2, Checksum: 0xf229, Lifetime: 827 secs
  5.    IS neighbor: MX-2.00                       Metric:       10
  6.      Two-way fragment: MX-2.00-00, Two-way first fragment: MX-2.00-00
  7.    IP prefix: 1.1.1.1/32                      Metric:        0 Internal Up
  8.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  9.   Header: LSP ID: MX-1.00-00, Length: 171 bytes
  10.     Allocated length: 1492 bytes, Router ID: 1.1.1.1
  11.     Remaining lifetime: 827 secs, Level: 2, Interface: 0
  12.     Estimated free bytes: 1273, Actual free bytes: 1321
  13.     Aging timer expires in: 827 secs
  14.     Protocols: IP, IPv6
  15.   Packet: LSP ID: MX-1.00-00, Length: 171 bytes, Lifetime : 1198 secs
  16.     Checksum: 0xf229, Sequence: 0x2, Attributes: 0x3 <L1 L2>
  17.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  18.     Packet type: 20, Packet version: 1, Max area: 0
  19.   TLVs:
  20.     Area address: 49.0001 (3)
  21.     LSP Buffer Size: 1492
  22.     Speaks: IP
  23.     Speaks: IPV6
  24.     IP router id: 1.1.1.1
  25.     IP address: 1.1.1.1
  26.     Hostname: MX-1
  27.     Router Capability:  Router ID 1.1.1.1, Flags: 0x01
  28.       SPRING Algorithm – Algo: 0
  29.     IP prefix: 1.1.1.1/32, Internal, Metric: default 0, Up
  30.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  31.     IP extended prefix: 1.1.1.1/32 metric 0 up
  32.     IP extended prefix: 10.10.10.0/31 metric 10 up
  33.     IS neighbor: MX-2.00, Internal, Metric: default 10
  34.     IS extended neighbor: MX-2.00, Metric: default 10
  35.       IP address: 10.10.10.0
  36.       Neighbor’s IP address: 10.10.10.1
  37.       Local interface index: 327, Remote interface index: 328
  38.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299856
  39.   No queued transmissions
  40. MX-2.00-00 Sequence: 0x2, Checksum: 0x90bf, Lifetime: 825 secs
  41.    IS neighbor: MX-1.00                       Metric:       10
  42.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  43.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  44.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  45.   Header: LSP ID: MX-2.00-00, Length: 171 bytes
  46.     Allocated length: 284 bytes, Router ID: 2.2.2.2
  47.     Remaining lifetime: 825 secs, Level: 2, Interface: 327
  48.     Estimated free bytes: 113, Actual free bytes: 113
  49.     Aging timer expires in: 825 secs
  50.     Protocols: IP, IPv6
  51.   Packet: LSP ID: MX-2.00-00, Length: 171 bytes, Lifetime : 1198 secs
  52.     Checksum: 0x90bf, Sequence: 0x2, Attributes: 0x3 <L1 L2>
  53.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  54.     Packet type: 20, Packet version: 1, Max area: 0
  55.   TLVs:
  56.     Area address: 49.0001 (3)
  57.     LSP Buffer Size: 1492
  58.     Speaks: IP
  59.     Speaks: IPV6
  60.     IP router id: 2.2.2.2
  61.     IP address: 2.2.2.2
  62.     Hostname: MX-2
  63.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  64.       SPRING Algorithm – Algo: 0
  65.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  66.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  67.     IP extended prefix: 2.2.2.2/32 metric 0 up
  68.     IP extended prefix: 10.10.10.0/31 metric 10 up
  69.     IS neighbor: MX-1.00, Internal, Metric: default 10
  70.     IS extended neighbor: MX-1.00, Metric: default 10
  71.       IP address: 10.10.10.1
  72.       Neighbor’s IP address: 10.10.10.0
  73.       Local interface index: 328, Remote interface index: 327
  74.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  75.   No queued transmissions
  76. {master}
  77. tim@MX-1>

 

So we can see from the ISIS database that each router on either side of the adjacency has locally generated a label for its own side of the link. Consider that this information is injected into the ISIS database, and the ISIS database is flooded throughout the entire network – this gives any ingress LSR the knowledge required to perform traffic-engineering, by simply imposing whichever adjacency-segment instructions it needs for a packet to take a specific path through the network.

Take the example below: if MX-1 sends packets containing the Adj-SID that MX-2 allocated for its link to MX-3 (ADJ-SID = 10), traffic is steered via MX-3 as soon as it lands on MX-2. Note that whilst MX-2 allocates its ADJ-SID of 10 and distributes it via the IGP, only MX-2 installs that label in its forwarding-table – because it’s locally significant.

seg9

The adjacency segment is one of the two main building blocks of segment-routing, and is generally known as a local segment, simply because it’s designed to have local significance – if a packet arrives on an interface with a specific local-segment instruction in the stack, the device will act on that instruction and forward the packet in a particular way for that segment, or part of the network.
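The “local” nature of the adjacency segment is also visible in the flags byte that the router prints next to each Adj-SID (Flags:0x30 in the outputs above). The snippet below is a minimal sketch of how that byte can be decoded. The bit positions are an assumption based on my reading of the IS-IS segment-routing extensions (RFC 8667), and the function name is made up for illustration; it is not anything Junos exposes.

```python
# Bit masks for the IS-IS Adj-SID flags byte, most significant bit first.
# Assumption: layout taken from the IS-IS segment-routing extensions (RFC 8667).
ADJ_SID_FLAG_BITS = {
    "F": 0x80,  # address family: set for IPv6, clear for IPv4
    "B": 0x40,  # backup: adjacency is eligible for protection
    "V": 0x20,  # value: the SID carries a label value rather than an index
    "L": 0x10,  # local: the SID has local significance only
    "S": 0x08,  # set: the SID refers to a set of adjacencies
}


def decode_adj_sid_flags(flags: int) -> dict[str, int]:
    """Return each named flag as 0 or 1 for a raw flags byte."""
    return {name: int(bool(flags & bit)) for name, bit in ADJ_SID_FLAG_BITS.items()}


# 0x30 is the value shown against every Adj-SID in the outputs above.
print(decode_adj_sid_flags(0x30))
```

For 0x30 this gives V and L set with everything else clear, which matches what the router prints: the Adj-SID is carried as an actual label value, and that label only means something to the router that allocated it.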

The next type of segment is known as the “nodal segment” or “global segment” and is globally significant. It generally represents the loopback address of each router in the network and is configured as an index. Let’s go ahead and look at the configuration;

  1. tim@MX-1> show configuration protocols isis
  2. source-packet-routing {
  3.     node-segment ipv4-index 10;
  4. }
  5. level 1 disable;
  6. interface xe-2/0/0.0 {
  7.     point-to-point;
  8. }
  9. interface lo0.0 {
  10.     passive;
  11. }

 

So a relatively straightforward configuration, I’ll go ahead and configure the rest of the network as above but with the following indexes;

  • MX-1 = node-segment index-10
  • MX-2 = node-segment index-20
  • MX-3 = node-segment index-30
  • MX-4 = node-segment index-40

seg10

So with the node-segment index configured on each router, let’s check what’s changed inside the ISIS database on MX-1. To keep things simple for now, we’ll look only at the LSP received from MX-2;

  1. tim@MX-1> show isis database extensive MX-2
  2. IS-IS level 1 link-state database:
  3. IS-IS level 2 link-state database:
  4. MX-2.00-00 Sequence: 0x73, Checksum: 0xd32e, Lifetime: 479 secs
  5.   IPV4 Index: 20
  6.   Node Segment Blocks Advertised:
  7.     Start Index : 0, Size : 4096, Label-Range: [ 800000, 804095 ]
  8.    IS neighbor: MX-1.00                       Metric:       10
  9.      Two-way fragment: MX-1.00-00, Two-way first fragment: MX-1.00-00
  10.    IS neighbor: MX-3.00                       Metric:       10
  11.      Two-way fragment: MX-3.00-00, Two-way first fragment: MX-3.00-00
  12.    IS neighbor: MX-4.00                       Metric:       10
  13.      Two-way fragment: MX-4.00-00, Two-way first fragment: MX-4.00-00
  14.    IP prefix: 2.2.2.2/32                      Metric:        0 Internal Up
  15.    IP prefix: 10.10.10.0/31                   Metric:       10 Internal Up
  16.    IP prefix: 10.10.10.2/31                   Metric:       10 Internal Up
  17.    IP prefix: 10.10.10.4/31                   Metric:       10 Internal Up
  18.   Header: LSP ID: MX-2.00-00, Length: 335 bytes
  19.     Allocated length: 335 bytes, Router ID: 2.2.2.2
  20.     Remaining lifetime: 479 secs, Level: 2, Interface: 327
  21.     Estimated free bytes: 113, Actual free bytes: 0
  22.     Aging timer expires in: 479 secs
  23.     Protocols: IP, IPv6
  24.   Packet: LSP ID: MX-2.00-00, Length: 335 bytes, Lifetime : 1198 secs
  25.     Checksum: 0xd32e, Sequence: 0x73, Attributes: 0x3 <L1 L2>
  26.     NLPID: 0x83, Fixed length: 27 bytes, Version: 1, Sysid length: 0 bytes
  27.     Packet type: 20, Packet version: 1, Max area: 0
  28.   TLVs:
  29.     Area address: 49.0001 (3)
  30.     LSP Buffer Size: 1492
  31.     Speaks: IP
  32.     Speaks: IPV6
  33.     IP router id: 2.2.2.2
  34.     IP address: 2.2.2.2
  35.     Hostname: MX-2
  36.     Router Capability:  Router ID 2.2.2.2, Flags: 0x01
  37.       SPRING Capability – Flags: 0xc0(I:1,V:1), Range: 4096, SID-Label: 800000
  38.       SPRING Algorithm – Algo: 0
  39.     IS neighbor: MX-1.00, Internal, Metric: default 10
  40.     IS neighbor: MX-3.00, Internal, Metric: default 10
  41.     IS neighbor: MX-4.00, Internal, Metric: default 10
  42.     IS extended neighbor: MX-1.00, Metric: default 10
  43.       IP address: 10.10.10.1
  44.       Neighbor’s IP address: 10.10.10.0
  45.       Local interface index: 328, Remote interface index: 0
  46.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299784
  47.     IS extended neighbor: MX-3.00, Metric: default 10
  48.       IP address: 10.10.10.2
  49.       Neighbor’s IP address: 10.10.10.3
  50.       Local interface index: 329, Remote interface index: 0
  51.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299789
  52.     IS extended neighbor: MX-4.00, Metric: default 10
  53.       IP address: 10.10.10.4
  54.       Neighbor’s IP address: 10.10.10.5
  55.       Local interface index: 331, Remote interface index: 0
  56.       P2P IPV4 Adj-SID – Flags:0x30(F:0,B:0,V:1,L:1,S:0), Weight:0, Label: 299788
  57.     IP prefix: 2.2.2.2/32, Internal, Metric: default 0, Up
  58.     IP prefix: 10.10.10.0/31, Internal, Metric: default 10, Up
  59.     IP prefix: 10.10.10.2/31, Internal, Metric: default 10, Up
  60.     IP prefix: 10.10.10.4/31, Internal, Metric: default 10, Up
  61.     IP extended prefix: 2.2.2.2/32 metric 0 up
  62.       8 bytes of subtlvs
  63.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 20
  64.     IP extended prefix: 10.10.10.0/31 metric 10 up
  65.     IP extended prefix: 10.10.10.2/31 metric 10 up
  66.     IP extended prefix: 10.10.10.4/31 metric 10 up
  67.   No queued transmissions
  68. {master}
  69. tim@MX-1>

 

Some explanations;

  • Line 7 shows that MX-2 is advertising a nodal segment block, or SRGB (“segment-routing global block”). This is the label range from which nodal-segment labels are allocated; here it starts at 800000 and has a size of 4096 (labels 800000 to 804095)
  • Lines 46, 51 and 56 show the IGP adjacency segments we’ve already talked about (for the links to MX-2’s neighbours)
  • Line 63 is the important one – here we can see a node SID with a value of 20, which is the value I configured under MX-2;
  1. tim@MX-2> show configuration protocols isis
  2. source-packet-routing {
  3.     node-segment ipv4-index 20;
  4. }
  5. level 1 disable;
  6. interface xe-0/0/0.0 {
  7.     point-to-point;
  8. }
  9. interface xe-0/0/1.0 {
  10.     point-to-point;
  11. }
  12. interface xe-0/0/2.0 {
  13.     point-to-point;
  14. }
  15. interface lo0.0 {
  16.     passive;
  17. }


So if I go back onto MX-1 and look at the mpls.0 routing-table – I should see an egress label of 20 for 2.2.2.2?

  1. tim@MX-1> show route table mpls.0
  2. mpls.0: 12 destinations, 12 routes (12 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 0                  *[MPLS/0] 16:12:24, metric 1
  5.                       to table inet.0
  6. 0(S=0)             *[MPLS/0] 16:12:24, metric 1
  7.                       to table mpls.0
  8. 1                  *[MPLS/0] 16:12:24, metric 1
  9.                       Receive
  10. 2                  *[MPLS/0] 16:12:24, metric 1
  11.                       to table inet6.0
  12. 2(S=0)             *[MPLS/0] 16:12:24, metric 1
  13.                       to table mpls.0
  14. 13                 *[MPLS/0] 16:12:24, metric 1
  15.                       Receive
  16. 299856             *[L-ISIS/14] 15:24:07, metric 0
  17.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  18. 299856(S=0)        *[L-ISIS/14] 00:07:46, metric 0
  19.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  20. 800020             *[L-ISIS/14] 00:22:52, metric 10
  21.                     > to 10.10.10.1 via xe-2/0/0.0, Pop  
  22. 800020(S=0)        *[L-ISIS/14] 00:07:46, metric 10
  23.                     > to 10.10.10.1 via xe-2/0/0.0, Pop
  24. 800030             *[L-ISIS/14] 00:22:47, metric 20
  25.                     > to 10.10.10.1 via xe-2/0/0.0, Swap 800030
  26. 800040             *[L-ISIS/14] 00:22:40, metric 20
  27.                     > to 10.10.10.1 via xe-2/0/0.0, Swap 800040
  28. {master}
  29. tim@MX-1>


Wrong! Label 20 doesn’t seem to be anywhere, instead I have 800020..

Remember from the previous ISIS database output (line 37) that the SRGB base starts at 800000. Because global-segments must be unique across the network, all routers use the same SRGB block starting at 800000, and each router’s label is the SRGB base plus its configured loopback index. If I configured an index of “666” on MX-4, then its global-segment label would be 800666, and so on.
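The arithmetic is simple enough to sketch in a few lines of Python. This is only an illustration of the base-plus-index rule using the values from this lab (base 800000, size 4096); the function name is my own, and the range check is an assumption about what makes sense rather than a description of what Junos does internally.

```python
SRGB_BASE = 800000  # start of the label range seen in the ISIS database output
SRGB_SIZE = 4096    # number of labels in the block (800000 to 804095)


def node_sid_label(index: int, base: int = SRGB_BASE, size: int = SRGB_SIZE) -> int:
    """Return the MPLS label for a node-segment index within the SRGB."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside the SRGB (0 to {size - 1})")
    return base + index


# The indexes configured on the four lab routers.
for router, index in {"MX-1": 10, "MX-2": 20, "MX-3": 30, "MX-4": 40}.items():
    print(router, node_sid_label(index))
```

This prints 800010, 800020, 800030 and 800040 for the four routers, which lines up with the labels seen in the mpls.0 table on MX-1.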

If we look at the entire ISIS Database on MX-1 for all routers – we can see all the node segments, and their configured values;

  1. tim@MX-1> show isis database extensive | match node
  2.   Node Segment Blocks Advertised:
  3.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 10
  4.   Node Segment Blocks Advertised:
  5.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 20
  6.   Node Segment Blocks Advertised:
  7.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 30
  8.   Node Segment Blocks Advertised:
  9.       Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 40
  10. {master}
  11. tim@MX-1>
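If you wanted to turn that output into the actual labels without reading through the whole database, something like the following would do it. This is a rough sketch: the text is pasted from the output above, the regular expression is my own guess at a reasonable pattern for these lines, and the SRGB base of 800000 is assumed to be the same on every router, as it is in this lab.

```python
import re

SRGB_BASE = 800000  # assumed identical on every router, as in this lab

# Lines pasted from the "show isis database extensive | match node" output.
CLI_OUTPUT = """\
  Node Segment Blocks Advertised:
      Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 10
  Node Segment Blocks Advertised:
      Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 20
  Node Segment Blocks Advertised:
      Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 30
  Node Segment Blocks Advertised:
      Node SID, Flags: 0x40(R:0,N:1,P:0,E:0,V:0,L:0), Algo: SPF(0), Value: 40
"""


def node_sid_indexes(text: str) -> list[int]:
    """Pull the index out of every 'Node SID' line in the pasted output."""
    return [int(value) for value in re.findall(r"Node SID,.*?Value:\s*(\d+)", text)]


labels = [SRGB_BASE + index for index in node_sid_indexes(CLI_OUTPUT)]
print(labels)
```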

 

We can look at the inet.3 table to see the loopback prefixes of all the routers in the network, being resolved down to their nodal-segment labels;

  1. tim@MX-1> show route table inet.3
  2. inet.3: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 2.2.2.2/32         *[L-ISIS/14] 00:31:02, metric 10
  5.                     > to 10.10.10.1 via xe-2/0/0.0
  6. 3.3.3.3/32         *[L-ISIS/14] 00:30:57, metric 20
  7.                     > to 10.10.10.1 via xe-2/0/0.0, Push 800030
  8. 4.4.4.4/32         *[L-ISIS/14] 00:30:50, metric 20
  9.                     > to 10.10.10.1 via xe-2/0/0.0, Push 800040

 

We see the node-segments for MX-3 and MX-4, but not for MX-2 simply because of PHP – but nevertheless, we can see how it all fits together quite nicely.

It must be pointed out that in a network where packets are simply forwarded using the global-segment label of the destination (for example, sending packets from MX-1 to MX-4 without any traffic-engineering), the same label is used end to end: the SRGB base of 800000 + the index 40 = 800040. Compare this with LDP, where labels for a single destination or FEC are generated on a hop-by-hop basis and get swapped to different values at every hop. Routers also perform the same IGP-based ECMP hashing for equal-cost paths, so packet forwarding essentially behaves the same as with LDP, but with much less state information in the network.

 

The whole aim of basic segment routing is to use the global “nodal-segments” alongside local “adjacency-segments” to allow an ingress LSR to calculate an exact path through the network, with much less state than was previously possible with protocols such as RSVP-TE.

For example, if we wanted to perform basic traffic-engineering, and send packets from MX-1 to MX-4, but via the longer path through MX-3, the following things would occur;

seg11

MX-1 imposes two labels: label 299784 for the Adj-SID of MX-2’s path via MX-3, and label 800040 (the node-index 40 configured at MX-4, plus the SRGB base value of 800000), and forwards the packet to MX-2;

seg12

MX-2 receives the packet, due to the presence of the ADJ-SID=299784 label, it follows the instruction and forwards the packet out of that link, towards MX-3 – popping the ADJ-SID label in the process;

seg13

MX-3 receives the packet with label 800040 (the node-SID of MX-4) performs PHP in the standard way, and forwards the packet direct to MX-4, completing the process. It’s entirely acceptable to use explicit-null to preserve the MPLS label on egress towards MX-4 for the purposes of EXP QoS if you’re running pipe-mode.
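The three steps above can be sketched as a tiny simulation. To be clear, this is a toy model: the label values are the ones quoted in the walk-through (Adj-SIDs are allocated dynamically, so a live box will show different numbers), and the dictionary layout and function name are invented for illustration and have nothing to do with how Junos actually stores mpls.0.

```python
# Simplified per-router label tables for the walk-through above.
# Assumption: this layout is purely illustrative, not a real forwarding table.
# Each entry maps the incoming top label to (action, next router).
LABEL_TABLES = {
    "MX-2": {299784: ("pop", "MX-3")},  # Adj-SID: pop and send out of the link to MX-3
    "MX-3": {800040: ("pop", "MX-4")},  # node-SID of MX-4: penultimate hop pops (PHP)
}


def forward(first_hop: str, stack: list[int]) -> list[tuple[str, tuple[int, ...]]]:
    """Return (router, label stack on arrival) for every hop until the stack is empty."""
    router, labels = first_hop, list(stack)
    trace = []
    while labels:
        trace.append((router, tuple(labels)))
        action, next_router = LABEL_TABLES[router][labels[0]]
        if action != "pop":
            raise ValueError(f"unsupported action {action!r}")
        labels.pop(0)
        router = next_router
    trace.append((router, ()))
    return trace


# MX-1 imposes the Adj-SID on top of MX-4's node-SID and sends the packet to MX-2.
for router, labels in forward("MX-2", [299784, 800040]):
    print(router, labels)
```

The trace shows the packet arriving at MX-2 with both labels, at MX-3 with only 800040, and at MX-4 with no labels at all. Notice that neither transit router needs any per-path state; everything that describes the path travels in the label stack.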

 

Clever readers will notice that segment-routing basically all boils down to a head-end LSR programming its own path through the network, by imposing a number of MPLS labels which are treated as instructions. This leads to the obvious question of hardware support: even high-end routers have a limit on the number of MPLS labels that can be handled by an ASIC. The maximum label-depth tends to be 3-5 depending on which model of router or chipset you’re using, so it might be a while until more hardware vendors accommodate larger numbers of labels in the label stack.

Consider the fact that with segment-routing, it’s possible to perform VPN connectivity along with traffic-engineering purely inside ISIS or OSPF, by simply using a much deeper label stack – we could quite quickly end up with 3-5 labels in the stack and hit the limits of our already very expensive linecards.
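To make the label-depth problem concrete, here is a rough sketch of the check a head-end (or a controller) would effectively have to pass before imposing a stack. The TE labels are the ones from the walk-through above; the VPN label and the extra segments are made-up placeholder values, and the depth limit of 3 is simply the low end of the 3-5 range mentioned earlier.

```python
def fits_label_depth(te_segments: list[int], service_labels: list[int], max_depth: int) -> bool:
    """Return True if the combined stack is no deeper than the hardware limit."""
    return len(te_segments) + len(service_labels) <= max_depth


te_path = [299784, 800040]  # Adj-SID + node-SID from the walk-through
vpn_label = [299999]        # hypothetical service label, for illustration only

# Two segments plus a VPN label on hardware limited to three labels.
print(fits_label_depth(te_path, vpn_label, max_depth=3))

# A longer explicit path (two more made-up segments) on the same hardware.
print(fits_label_depth(te_path + [299788, 800030], vpn_label, max_depth=3))
```

The first stack just about fits; the second needs five labels and does not, which is exactly the situation described above.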

In terms of providing VPN services and performing things like traffic-engineering, as far as I can tell it’s not possible to do this manually on a Juniper router inside the CLI at this time – you need a centralised controller, or “PCE” (“path computation element”), which is generally a server running the controller software. This connects to a “PCC” (“path computation client”), which would be the head-end LSR node performing the signalling, as directed by the server (PCE). This generally takes place via a protocol known as PCEP (Path Computation Element Communication Protocol).

Essentially, a PCE that’s provisioning RSVP-TE tunnels and a PCE that’s signalling segments both tell the head-end LSR how to forward traffic. The difference is that with segment-routing no LSPs are provisioned – the head-end simply imposes a set of instructions (labels) as opposed to constructing an actual LSP through a chain of devices – again saving on state in the network.

At this time there are a few different controllers on the market: Juniper’s Northstar, Cisco’s Open SDN, and an open-source controller known as “OpenDaylight”. One of my colleagues has managed to get OpenDaylight working with IOS-XR to good effect. I may try and get hold of a demo Northstar license so I can demo this technology in action with IXIA – but that’ll be for next time.

Thanks for reading 🙂

 

 

Pseudowire Headend Termination (PWHT) For Juniper MX

I’ve been doing quite a lot of MX BNG stuff this year, so I thought I’d run through another quite flexible way of terminating broadband subscribers onto a Juniper MX router.

The feature is called Pseudowire headend termination, “PWHT”, or simply Pseudowire head-end, “PWHE”, depending on whether you work for Juniper or Cisco 😉 but it essentially solves a relatively simple problem.

In traditional broadband designs – especially in DSL “FTTC” or Fibre Ethernet “FTTP” we’re used to seeing large numbers of subscribers, connecting into the ISP edge at layer-2 with PPPoE or plain Ethernet. This is normally performed with VLANS, either via an MSAN (DSL/FTTC) or as is the case with Ethernet FTTP subscribers – a plain switched infrastructure or some form of passive-optical (PON/GPON) presentation:

 

Capture

These subscribers then terminate on a BNG node on the edge of the network, which would historically have been a Cisco 7200, GSR10k, Juniper ERX or Redback router, which essentially bridges the gap between the access network and the internet.

For very large service providers with millions of subscribers this sort of approach normally works well, because their customer base is so large; it makes sense for them to provision a full-size BNG node in every town in the country and so subscribers terminate directly at the edge of the network.

However – modern BNG can be expensive. In order to provide the required throughput and features, (IPv4 / IPv6 / VPN / Quad play / QoS) it requires a significant investment in router chassis, fancy line cards and expensive licenses, at every point in the network where BNG is to be performed. For smaller ISPs this can be a deal breaker – especially if they have small chunks of subscribers dotted around.

One way of getting around this problem is to provision a centralised BNG deployment, where that function is performed somewhere centrally inside the service-provider network. Edge connectivity (PPPoE / Ethernet VLAN) is tunnelled directly from the access network, through an intermediate edge-router (U-PE) and onto a centralised BNG node where it terminates, allowing for the ISP to service a large number of subscribers from many different remote areas – using a single BNG function:

Capture2

Essentially, in the above topology – the “U-PE” or access facing PE is running a standard EoMPLS LDP signalled “martini tunnel” back towards the centralised BNG router, buried deep inside the core somewhere.

The U-PE itself can be a cheaper, standard edge router, so long as it supports MPLS and LDP signalled EoMPLS tunnels – these can be provisioned anywhere on the network edge, whilst providing direct connectivity back to the BNG node at layer-2 – all this is done using PWHT (Pseudowire headend termination)

On Juniper – PWHT as a feature came into existence on, or around JUNOS version 13.1, before then there was a relatively simple “hack” that had to be performed, in order to provide the functionality. It basically involved the good ole trick of using physical loopback cables on the same device, in order to “make it work” as shown below:

Capture3

This is a pretty heavy handed approach and also quite expensive – as it involves burning up expensive ports on the router, simply to bridge the gap between the access network and the subscriber termination interface.

With the PWHT feature a new type of interface is defined, known as the pseudowire service interface, or “PS”. This is bound to a tunnel-services PIC, which essentially performs the heavy lifting.

Looking at this in a lab, I have the following topology set up on an MX480 containing an MPC2E-Q and a 4x10GE MIC:

Capture4

If we look at the configuration, there are a few things we need to do – let’s check out the PS interface and the l2circuit configuration:

  1. tim@MX480-3> show configuration chassis
  2. pseudowire-service {
  3.     device-count 2048;
  4. }
  5. fpc 1 {
  6.     pic 0 {
  7.         tunnel-services {
  8.             bandwidth 10g;
  9.         }
  10.     }
  11. }
  12. network-services enhanced-ip;
  13. tim@MX480-3>

 

The “pseudowire-service” statement basically enables the PWHT feature, and an MX chassis supports a total of 2048 pseudowire-service interfaces. Each interface is bound to an l2circuit that points back to a “U-PE” edge device, which provides more than enough for most deployments.

It’s also necessary to enable tunnel-services. Then, when we take a look at the “PS” interface, it’s easy to see how this fits together:

  1. ps0 {
  2.     anchor-point {
  3.         lt-1/0/0;
  4.     }
  5.     flexible-vlan-tagging;
  6.     auto-configure {
  7.         stacked-vlan-ranges {
  8.             dynamic-profile vlan-prof-0 {
  9.                 accept [ inet pppoe ];
  10.                 ranges {
  11.                     10-100,100-4000;
  12.                 }
  13.                 access-profile aaa-profile;
  14.             }
  15.         }
  16.         remove-when-no-subscribers;
  17.     }
  18.     mtu 1530;
  19.     unit 0 {
  20.         encapsulation ethernet-ccc;
  21.     }
  22. }

 

The anchor-point statement basically binds the logical-tunnel interface directly to the PS interface, so that the “heavy lifting” can be done by the MIC. Unit 0 binds directly to the l2circuit configuration, which creates the EoMPLS connectivity to the U-PE:

  1. tim@MX480-3> show configuration protocols l2circuit
  2. neighbor 1.1.1.2 {
  3.     interface ps0.0 {
  4.         virtual-circuit-id 10;
  5.         no-vlan-id-validate;
  6.     }
  7. }

 

Essentially we have a standard l2circuit configuration pointing at the U-PE (the U-PE simply has a reciprocal configuration bound to its physical access-facing interface). Because this pseudowire will be carrying multiple VLANs (S-VLAN and C-VLAN) we don’t want to consider that information when the pseudowire is signalled, so the “no-vlan-id-validate” command takes care of this.

Let’s take a look at the wider BNG configuration for completeness:

  1. dynamic-profiles {
  2.     vlan-prof-0 {
  3.         interfaces {
  4.             "$junos-interface-ifd-name" {
  5.                 unit "$junos-interface-unit" {
  6.                     no-traps;
  7.                     vlan-tags outer "$junos-stacked-vlan-id" inner "$junos-vlan-id";
  8.                     family inet {
  9.                         unnumbered-address lo0.0;
  10.                     }
  11.                     family pppoe {
  12.                         dynamic-profile pppoe-client-profile;
  13.                     }
  14.                 }
  15.             }
  16.         }
  17.     }
  18.     pppoe-client-profile {
  19.         interfaces {
  20.             pp0 {
  21.                 unit "$junos-interface-unit" {
  22.                     no-traps;
  23.                     ppp-options {
  24.                         chap;
  25.                     }
  26.                     pppoe-options {
  27.                         underlying-interface "$junos-underlying-interface";
  28.                         server;
  29.                     }
  30.                     keepalives interval 30;
  31.                     family inet {
  32.                         unnumbered-address lo0.0;
  33.                     }
  34.                 }
  35.             }
  36.         }
  37.     }
  38. }
  39. access {
  40.     radius-server {
  41.         192.168.3.158 {
  42.             port 1812;
  43.             accounting-port 1813;
  44.             secret "xxx"; ## SECRET-DATA
  45.             timeout 10;
  46.             retry 10;
  47.             source-address 192.168.3.54;
  48.         }
  49.     }
  50.     profile aaa-profile {
  51.         authentication-order radius;
  52.         radius {
  53.             authentication-server 192.168.3.158;
  54.             accounting-server 192.168.3.158;
  55.             options {
  56.                 interface-description-format {
  57.                     exclude-sub-interface;
  58.                 }
  59.                 nas-identifier mx5-1;
  60.                 accounting-session-id-format decimal;
  61.                 vlan-nas-port-stacked-format;
  62.             }
  63.         }
  64.         radius-server {
  65.             192.168.3.158 {
  66.                 port 1812;
  67.                 accounting-port 1813;
  68.                 secret "xxx"; ## SECRET-DATA
  69.                 timeout 10;
  70.                 retry 10;
  71.                 source-address 192.168.3.54;
  72.             }
  73.         }
  74.         accounting {
  75.             order radius;
  76.             accounting-stop-on-failure;
  77.             accounting-stop-on-access-deny;
  78.             immediate-update;
  79.             coa-immediate-update;
  80.             update-interval 60;
  81.             statistics volume-time;
  82.         }
  83.     }
  84.     profile no-radius-auth {
  85.         authentication-order none;
  86.     }
  87.     address-assignment {
  88.         pool Subscriber-pool {
  89.             family inet {
  90.                 network 130.0.0.0/8;
  91.                 range Sub-range-0 {
  92.                     low 130.16.0.1;
  93.                     high 130.31.255.255;
  94.                 }
  95.                 dhcp-attributes {
  96.                     maximum-lease-time 25200;
  97.                 }
  98.             }
  99.         }
  100.     }
  101. }

 

That’s the basic configuration. Let’s fire up some subscribers and see what it looks like – I’m using IXIA to generate simulated PPPoE clients. We’ll start with a single double-tagged subscriber (the S-VLAN normally represents the MSAN, the C-VLAN normally represents the subscriber’s own VLAN):

Capture

Let’s check the outputs from the MX BNG:

  1. tim@MX480-3> show subscribers
  2. Interface           IP Address/VLAN ID                      User Name                      LS:RI
  3. ps0.1073761693      10 111                                                            default:default
  4. pp0.1073761694      130.16.0.3                              user1@users.com           default:default
  5. tim@MX480-3> show subscribers detail
  6. Type: VLAN
  7. Logical System: default
  8. Routing Instance: default
  9. Interface: ps0.1073761693
  10. Interface type: Dynamic
  11. Underlying Interface: ps0
  12. Dynamic Profile Name: vlan-prof-0
  13. State: Active
  14. Session ID: 19870
  15. Stacked VLAN Id: 10
  16. VLAN Id: 111
  17. Login Time: 2016-07-10 12:34:46 UTC
  18. Type: PPPoE
  19. User Name: user1@users.com
  20. IP Address: 130.16.0.3
  21. IP Netmask: 255.255.255.255
  22. Logical System: default
  23. Routing Instance: default
  24. Interface: pp0.1073761694
  25. Interface type: Dynamic
  26. Underlying Interface: ps0.1073761693
  27. Dynamic Profile Name: pppoe-client-profile
  28. MAC Address: 00:11:01:00:00:01
  29. State: Active
  30. Radius Accounting ID: 19871
  31. Session ID: 19871
  32. Stacked VLAN Id: 10
  33. VLAN Id: 111
  34. Login Time: 2016-07-10 12:34:57 UTC
  35. tim@MX480-3>

 

So we can see the subscriber coming in with an S-VLAN of 10 and a C-VLAN of 111, and with an address handed out from the subscriber pool. Readers familiar with MX BNG will be used to seeing a “demux” interface for the layer-2 side of the service. When PWHT is used, demux is replaced with the PS interface, as shown in line 3.

Everything else remains the same. The subscriber layer-3 virtual interface is a “pp0” interface with an attached IP address placed into the inet routing table. It can be placed into a routing-instance or logical-system if needed, by altering the BNG configuration and the RADIUS config. For RADIUS I’m using FreeRADIUS with a basic configuration.

If we send some traffic – we should see it function end to end, and also see it on the PS0 interface:

Traffic works as expected:

Capture

Outputs from the “PS0” interface and attached subscriber units:

tim@MX480-3> show interfaces ps0
Physical interface: ps0, Enabled, Physical link is Up
  Interface index: 154, SNMP ifIndex: 599
  Type: Software-Pseudo, Link-level type: 90, MTU: 1530, Clocking: 1, Speed: 10000mbps
  Device flags   : Present Running
  Interface flags: Point-To-Point Internal: 0x4000
  Current address: dc:38:e1:fc:85:4a, Hardware address: dc:38:e1:fc:85:4a
  Last flapped   : Never
  Input rate     : 979688 bps (89 pps)
  Output rate    : 989712 bps (89 pps)
  Logical interface ps0.0 (Index 336) (SNMP ifIndex 601)
    Flags: Up Point-To-Point 0x4000 Encapsulation: Ethernet-CCC
    Input packets : 5061878827
    Output packets: 5068397825
    Protocol ccc, MTU: 1514
      Flags: Is-Primary
  Logical interface ps0.32767 (Index 337) (SNMP ifIndex 600)
    Flags: Up 0x4000 VLAN-Tag [ 0x0000.0 ]  Encapsulation: ENET2
    Input packets : 9950
    Output packets: 0
  Logical interface ps0.1073761693 (Index 325) (SNMP ifIndex 527)
    Flags: Up 0x4000 VLAN-Tag [ 0x8100.10 0x8100.111 ]  Encapsulation: ENET2
    Input packets : 77850
    Output packets: 6605
    Protocol inet, MTU: 1508
      Flags: Sendbcast-pkt-to-re, Unnumbered
      Donor interface: lo0.0 (Index 322)
      Addresses, Flags: Is-Default Is-Primary
        Local: 1.1.1.1
    Protocol pppoe
      Dynamic Profile: pppoe-client-profile,
      Service Name Table: None,
      Max Sessions: 32000, Max Sessions VSA Ignore: Off,
      Duplicate Protection: Off, Short Cycle Protection: Off,
      Direct Connect: Off,
      AC Name: MX480-3
tim@MX480-3>

 

That’s about it! PWHT is a pretty cool feature for tunnelling subscriber connectivity into a centralised BNG environment. It’s also possible to design resilient active/standby or active/active solutions by using multiple l2circuits.

It’s also worth pointing out, that provided you have the standard subscriber management licenses, no additional licenses are required to enable PWHT.

 

EVPN – All-active multihoming

So this is the fourth blog on EVPN, the previous blogs covered the following topics:

  • EVPN basics, route-types and basic L2 forwarding
  • EVPN IRB and Inter-VLAN routing
  • EVPN single-active multi-homing

This post will cover the ability of EVPN to provide all-active multi-homing for layer-2 traffic, where the topology contains two active PE routers connecting to a switch via a LAG. The setup is similar to the previous labs. Due to some restrictions, and in the interests of simplicity, this lab will cover all-active multi-homing for a single VLAN only (VLAN 100 in this case). Consider the network topology:

Capture5

The topology and general connectivity are the same as in the previous examples. The two big differences are that only VLAN 100 is present here, and that MX-1 and MX-2 now connect to EX4200-1 using MC-LAG.

The first consideration when running EVPN in all-active mode is that the PEs must connect to the attached switch using some sort of LAG or MC-LAG. Consider the wording from RFC 7432:


https://tools.ietf.org/html/rfc7432#section-14.1.2

“If a bridged network is multihomed to more than one PE in an EVPN network via switches, then the support of All-Active redundancy mode requires the bridged network to be connected to two or more PEs using a LAG.”

Essentially, this boils down to some basic facts around how switches work – you can’t have two different PE routers with active access-interfaces configured with the same mac-address, spanning two different control-planes, for the simple reason that you’ll create a duplicate mac-address in the layer-2 network, which will cause a nightmare.

Consider the below scenario:

Capture6

I tried this in a lab before I read the RFC, and discovered that EX4200-1 floods egress traffic to MX-1 and MX-2, resulting in lots of traffic duplication and flooding. Each time a packet with source mac-address “X” lands on ge-0/0/0 or ge-0/0/1 from MX-1 or MX-2, the switch has to update its CAM table, so essentially the whole thing is broken, which explains the wording of the RFC in relation to all-active mode.
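The flapping is easy to picture with a toy model of a learning switch. This is only a rough Python sketch, not how an EX4200 actually works internally: the `LearningSwitch` class and the frame sequence are made up for illustration, and only the port names come from the lab.

```python
class LearningSwitch:
    """Toy model of a MAC-learning switch (illustrative only)."""

    def __init__(self):
        self.cam = {}    # MAC address -> port it was last seen on
        self.moves = 0   # how many times a known MAC changed port

    def receive(self, src_mac, port):
        """Learn the source MAC of a frame arriving on a port."""
        previous = self.cam.get(src_mac)
        if previous is not None and previous != port:
            self.moves += 1   # the MAC has "moved" to another port: a flap
        self.cam[src_mac] = port


# Without a LAG: frames from the same remote MAC "X" arrive alternately
# on the two separate uplinks, one facing MX-1 and one facing MX-2.
switch = LearningSwitch()
for port in ["ge-0/0/0", "ge-0/0/1"] * 3:
    switch.receive("X", port)
print(switch.moves)  # 5

# With a LAG: both links are members of ae0, so the switch only ever
# learns "X" against the one logical port.
lagged = LearningSwitch()
for _ in range(6):
    lagged.receive("X", "ae0")
print(lagged.moves)  # 0
```

Six frames cause five CAM moves when the uplinks are separate ports, and none when they are bundled, which is really all the RFC wording is getting at.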

With Juniper, the way to get around this problem is simply to convert the Ethernet interfaces connecting to EX4200-1 to a basic MC-LAG configuration. We don’t need to configure ICCP or any serious multi-chassis configuration, we just need to make sure the LACP system-id is identical on MX-1 and MX-2, so that the EX4200 thinks it’s connected to a single device.

Let’s check the LAG configuration on MX-1 and MX-2:

MX-1

tim@MX5-1> show configuration interfaces ae0
description "MCLAG to EX4500-1";
flexible-vlan-tagging;
encapsulation flexible-ethernet-services;
esi {
    00:11:22:33:44:55:66:77:88:99;
    all-active;
}
aggregated-ether-options {
    lacp {
        system-id 00:00:00:00:00:01;
    }
}
unit 100 {
    encapsulation vlan-bridge;
    vlan-id 100;
    family bridge;
}

 

MX-2

tim@MX5-2> show configuration interfaces ae0
description "MCLAG to EX4500-1";
flexible-vlan-tagging;
encapsulation flexible-ethernet-services;
esi {
    00:11:22:33:44:55:66:77:88:99;
    all-active;
}
aggregated-ether-options {
    lacp {
        system-id 00:00:00:00:00:01;
    }
}
unit 100 {
    encapsulation vlan-bridge;
    vlan-id 100;
    family bridge;
}

 

And finally on EX4200-1 we have a basic standard LAG configuration, with nothing fancy or sexy going on 🙂

EX4200-1

 

imtech@ex4200-1> show configuration interfaces ae0
aggregated-ether-options {
    lacp {
        active;
    }
}
unit 0 {
    family ethernet-switching {
        port-mode trunk;
        vlan {
            members vlan-100;
        }
    }
}
{master:0}
imtech@ex4200-1>

 

 

From the perspective of the EX4200, it’s just a totally standard LAG with two interfaces running LACP. So long as we have EVPN all-active configured correctly on MX-1 and MX-2, everything is taken care of.

EX4200-1 verification:

imtech@ex4200-1> show lacp interfaces
Aggregated interface: ae0
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      ge-0/0/0       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/0     Partner    No    No   Yes  Yes  Yes   Yes     Fast   Passive
      ge-0/0/1       Actor    No    No   Yes  Yes  Yes   Yes     Fast    Active
      ge-0/0/1     Partner    No    No   Yes  Yes  Yes   Yes     Fast   Passive
    LACP protocol:        Receive State  Transmit State          Mux State
      ge-0/0/0                  Current   Fast periodic Collecting distributing
      ge-0/0/1                  Current   Fast periodic Collecting distributing
{master:0}
imtech@ex4200-1>

 

Aside from the fact that we’ve converted the access Ethernet interfaces to MC-LAG on MX-1 and MX-2, let’s see what’s changed in the EVPN configuration in order to get all-active EVPN working. First, let’s check MX-1:

tim@MX5-1> show configuration routing-instances
EVPN-100 {
    instance-type virtual-switch;
    route-distinguisher 1.1.1.1:100;
    vrf-target target:100:100;
    protocols {
        evpn {
            extended-vlan-list 100;
            default-gateway do-not-advertise;
        }
    }
    bridge-domains {
        VL-100 {
            vlan-id 100;
            interface ae0.100;
            routing-interface irb.100;
        }
    }
}
VPN-100 {
    instance-type vrf;
    interface irb.100;
    route-distinguisher 100.100.100.1:100;
    vrf-target target:1:100;
    vrf-table-label;
}
tim@MX5-1>

 

The configuration is the same on MX-2, apart from the route distinguisher (1.1.1.2:100). You’ll notice that the only thing which has changed on MX-1 is that the physical interface ge-1/1/5 has been replaced by the new LAG interface ae0.100 for VLAN 100. Everything else is exactly the same as the previous single-active example from last week. Let’s take a closer look at the interface on MX-1:

tim@MX5-1> show configuration interfaces ae0
description "MCLAG to EX4500-1";
flexible-vlan-tagging;
encapsulation flexible-ethernet-services;
esi {
    00:11:22:33:44:55:66:77:88:99;
    all-active;
}
aggregated-ether-options {
    lacp {
        system-id 00:00:00:00:00:01;
    }
}
unit 100 {
    encapsulation vlan-bridge;
    vlan-id 100;
    family bridge;
}

 

It’s clear to see that under the interface ESI configuration, we’ve changed the ESI mode from single-active to “all-active”, which again should be self explanatory to most readers 🙂 Again, note that this interface configuration is 100% identical on both MX-1 and MX-2.

Let’s check the EVPN instance and see what’s changed since the single-active example:

tim@MX5-1> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.1:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                13      96
    Default gateway MAC addresses:       1       0
  Number of local interfaces: 1 (1 up)
    Interface name  ESI                            Mode             Status
    ae0.100         00:11:22:33:44:55:66:77:88:99  all-active       Up
  Number of IRB interfaces: 1 (1 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Up      VPN-100
  Number of bridge domains: 1
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   300432
  Number of neighbors: 2
    10.10.10.2
      Received routes
        MAC address advertisement:             49
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                2
    10.10.10.3
      Received routes
        MAC address advertisement:             60
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by IFL ae0.100
      Local interface: ae0.100, Status: Up/Forwarding
      Number of remote PEs connected: 1
        Remote PE        MAC label  Aliasing label  Mode
        10.10.10.2       300416     300416          all-active
      Designated forwarder: 10.10.10.1
      Backup forwarder: 10.10.10.2
      Advertised MAC label: 300400
      Advertised aliasing label: 300400
      Advertised split horizon label: 300416
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.1:0
  Number of bridge domains: 0
  Number of neighbors: 1
    10.10.10.2
      Received routes
        Ethernet Segment:                       1
tim@MX5-1>

 

So we can see that MX-1 has changed from single-active to all-active, and is in the up/forwarding state.

Let’s check MX-2 to see what it looks like:

tim@MX5-2> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.2:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                47      64
    Default gateway MAC addresses:       1       0
  Number of local interfaces: 1 (1 up)
    Interface name  ESI                            Mode             Status
    ae0.100         00:11:22:33:44:55:66:77:88:99  all-active       Up
  Number of IRB interfaces: 1 (1 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Up      VPN-100
  Number of bridge domains: 1
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   300528
  Number of neighbors: 2
    10.10.10.1
      Received routes
        MAC address advertisement:             14
        MAC+IP address advertisement:           1
        Inclusive multicast:                    1
        Ethernet auto-discovery:                2
    10.10.10.3
      Received routes
        MAC address advertisement:             60
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by IFL ae0.100
      Local interface: ae0.100, Status: Up/Forwarding
      Number of remote PEs connected: 1
        Remote PE        MAC label  Aliasing label  Mode
        10.10.10.1       300400     300400          all-active
      Designated forwarder: 10.10.10.1
      Backup forwarder: 10.10.10.2
      Advertised MAC label: 300416
      Advertised aliasing label: 300416
      Advertised split horizon label: 300432
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.2:0
  Number of bridge domains: 0
  Number of neighbors: 1
    10.10.10.1
      Received routes
        Ethernet Segment:                       1
tim@MX5-2>

 

Excellent! Both MX-1 and MX-2 are in the up/forwarding state for VLAN 100, meaning that in theory they can both send and receive traffic on their access LAG interface and on the MPLS side. You’ll also notice how simple it is to get working.

I currently have 50x IXIA hosts sat behind MX-1 and MX-2, and a further 50x hosts sat behind MX-3. 50Mbps of traffic is being sent bidirectionally between each IXIA host. Let’s recap the diagram:

Capture7

With an active-active configuration, traffic from multiple hosts at the top of the network should be sent towards MX-1 and MX-2 by EX4200-1 according to its standard LAG hashing algorithm (source/destination mac). Because I have 100 hosts in total, there should be enough granularity at layer-2 to get a rough distribution, with some traffic on MX-1 and some traffic on MX-2.
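To get a feel for why more MAC pairs means a better spread, here is a rough Python sketch of hash-based member selection. The real EX4200 hash isn’t shown anywhere in this post, so CRC32 is just a stand-in, and the host MAC addresses are invented; only the member interface names come from the lab.

```python
import zlib
from collections import Counter

MEMBERS = ["ge-0/0/0", "ge-0/0/1"]  # ae0 member links on EX4200-1


def pick_member(src_mac, dst_mac, members=MEMBERS):
    """Pick a LAG member from a hash of the MAC pair (stand-in hash)."""
    key = (src_mac + dst_mac).encode()
    return members[zlib.crc32(key) % len(members)]


# 50 made-up local hosts, each talking to 50 made-up remote hosts.
local_hosts = [f"00:00:66:00:00:{i:02x}" for i in range(50)]
remote_hosts = [f"00:00:2e:00:00:{i:02x}" for i in range(50)]

# Count how many MAC pairs land on each member link.
spread = Counter(
    pick_member(src, dst) for src in local_hosts for dst in remote_hosts
)
for member, flows in sorted(spread.items()):
    print(member, flows)
```

The same MAC pair always hashes to the same link, so a single pair of hosts can never be balanced across both PEs. The split only gets reasonable once there are plenty of different pairs to hash.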

Let’s send the IXIA traffic:

IXIA

Now let’s look at the physical access interfaces on MX-1 and MX-2 to see how the traffic is being handled:

MX-1


tim@MX5-1> show configuration interfaces ge-1/1/5 
gigether-options {
 802.3ad ae0;
}

tim@MX5-1> show interfaces ae0 | match pps 
 Input rate : 5404040 bps (484 pps)
 Output rate : 10384856 bps (929 pps)

So 5Mbps in and 10Mbps out on MX-1.

Let’s check MX-2:


tim@MX5-2> show configuration interfaces ge-1/0/5 
gigether-options {
 802.3ad ae0;
}

tim@MX5-2> show interfaces ae0 | match pps 
 Input rate : 19535296 bps (1750 pps)
 Output rate : 14546816 bps (1302 pps)

So it seems to be working. MX-1 and MX-2 are both sending and receiving traffic in the same layer-2 broadcast domain.

Let’s check their MPLS-facing interfaces:

MX-1


tim@MX5-1> show isis adjacency 
Interface System L State Hold (secs) SNPA
ge-1/1/0.0 m10i-1 2 Up 19

tim@MX5-1> show interfaces ge-1/1/0 | match pps 
 Input rate : 10415216 bps (930 pps)
 Output rate : 5404040 bps (484 pps)

tim@MX5-1>

MX-2


tim@MX5-2> show isis adjacency 
Interface System L State Hold (secs) SNPA
ge-1/1/0.0 m10i-2 2 Up 24

tim@MX5-2> show interfaces ge-1/1/0 | match pps 
 Input rate : 14583752 bps (1303 pps)
 Output rate : 19535576 bps (1751 pps)

tim@MX5-2>

 

And so all seems right with the world: traffic from the MPLS network is being sent from MX-3 to both MX-1 and MX-2. Let’s look at the EVPN BGP control-plane on MX-3 to see what’s going on with all-active. We’ll take a look at a slice of the BGP table for brevity:

 

2:1.1.1.1:100::100::00:00:66:cf:82:df/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300944
2:1.1.1.1:100::100::00:00:66:cf:82:e1/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300944
2:1.1.1.1:100::100::00:00:66:cf:82:e3/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300944
2:1.1.1.1:100::100::00:00:66:d0:5d:f3/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300944
2:1.1.1.2:100::100::00:00:2e:18:6d:e1/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300960
2:1.1.1.2:100::100::00:00:2e:18:f3:c4/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300960
2:1.1.1.2:100::100::00:00:66:cf:82:d1/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300960
2:1.1.1.2:100::100::00:00:66:cf:82:d3/304
                   *[BGP/170] 01:28:27, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300960

 

 

You’ll notice that in MX-3’s BGP EVPN table, it’s receiving those good old type-2 MAC routes, but some of them are being learnt from MX-1 and some from MX-2. That’s exactly what we want, and exactly what MX-3 needs in order for egress traffic to be sent towards both MX-1 and MX-2 in the all-active fashion that we desire.

Remember that each PE learns MAC addresses from the traffic it receives on its access interface and then advertises them in BGP, so whether MX-3 sends traffic for a given host to MX-1 or MX-2 depends on how EX4200-1 hashes egress traffic in the first place. See the diagram below for an attempt at a better explanation:

Capture8

 

But what happens if the EX4200 switch has a really rubbish hashing algorithm, or there’s no granularity, to the point where nearly all the traffic arrives via MX-1 and hardly any via MX-2? Nearly all the MAC routes would be advertised by MX-1, and you’d end up with traffic polarisation and really bad load-balancing on the return path. EVPN solves this problem by using an aliasing label.

MX-3, for example, has a full table of EVPN MAC routes, so it can load-balance traffic on a per-flow basis back to MX-1 and MX-2 by making use of the aliasing label. In the case of the IXIA hosts at the top of the network, they’re all being advertised with an ESI of 00:11:22:33:44:55:66:77:88:99, which means they’re all coming from the same Ethernet segment. MX-3 can therefore treat the aliasing route from the other PE on that segment like a normal MAC route and send traffic to it anyway, even for a MAC it only learnt from one PE.

If there’s a failure somewhere on either MX-1 or MX-2, the aliasing label gets withdrawn and you’re left with MAC routes via the surviving PE only, which prevents traffic from being black-holed.
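Here’s a very simplified Python sketch of how a remote PE such as MX-3 could pick next-hops with aliasing. It’s a toy model under my own assumptions, not Junos behaviour: the dictionaries, function names and the CRC32 flow hash are all invented, and a real failure would withdraw more than one route. Only the ESI, the PE addresses and the MAC address come from the lab outputs.

```python
import zlib

ESI = "00:11:22:33:44:55:66:77:88:99"
MAC = "00:00:66:cf:82:df"

# Which PE advertised each type-2 MAC route, and for which ESI.
mac_routes = {MAC: ("10.10.10.1", ESI)}
# PEs currently advertising an auto-discovery (aliasing) route per ESI.
ad_routes = {ESI: {"10.10.10.1", "10.10.10.2"}}


def candidate_pes(mac):
    """PEs that can reach the MAC: the aliasing set, else the advertiser."""
    advertiser, esi = mac_routes[mac]
    return sorted(ad_routes.get(esi) or {advertiser})


def pick_pe(mac, flow):
    """Choose one candidate PE per flow using a stand-in hash."""
    pes = candidate_pes(mac)
    return pes[zlib.crc32(flow.encode()) % len(pes)]


before = candidate_pes(MAC)
print(before)  # ['10.10.10.1', '10.10.10.2']

# MX-1 fails and its auto-discovery route is withdrawn.
ad_routes[ESI].discard("10.10.10.1")
after = candidate_pes(MAC)
print(after)  # ['10.10.10.2']
```

Even though the MAC was only advertised by 10.10.10.1, both PEs are usable while both aliasing routes are present, and traffic falls back to the surviving PE as soon as one is withdrawn.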

 

The last thing to consider is the concept of the “designated forwarder”. Let’s re-check the EVPN instance output from earlier on:

tim@MX5-1> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.1:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                13      96
    Default gateway MAC addresses:       1       0
  Number of local interfaces: 1 (1 up)
    Interface name  ESI                            Mode             Status
    ae0.100         00:11:22:33:44:55:66:77:88:99  all-active       Up
  Number of IRB interfaces: 1 (1 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Up      VPN-100
  Number of bridge domains: 1
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   300432
  Number of neighbors: 2
    10.10.10.2
      Received routes
        MAC address advertisement:             49
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                2
    10.10.10.3
      Received routes
        MAC address advertisement:             60
        MAC+IP address advertisement:           0
        Inclusive multicast:                    1
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by IFL ae0.100
      Local interface: ae0.100, Status: Up/Forwarding
      Number of remote PEs connected: 1
        Remote PE        MAC label  Aliasing label  Mode
        10.10.10.2       300416     300416          all-active
      Designated forwarder: 10.10.10.1
      Backup forwarder: 10.10.10.2
      Advertised MAC label: 300400
      Advertised aliasing label: 300400
      Advertised split horizon label: 300416
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.1:0
  Number of bridge domains: 0
  Number of neighbors: 1
    10.10.10.2
      Received routes
        Ethernet Segment:                       1
tim@MX5-1>

 

When running in all-active mode, it’s obvious that both PE routers are forwarding traffic, but it’s important to know that the two PEs only forward unicast traffic in an all-active fashion. When two PE routers discover each other on the same Ethernet segment via the MPLS network, using BGP Ethernet Segment (type-4) routes, they elect a “designated forwarder”.

The primary role of the designated forwarder is to forward BUM (broadcast, unknown unicast and multicast) traffic towards the attached site. It would be highly undesirable for both PEs to forward broadcasts, so only one is responsible for this in order to prevent traffic duplication.
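For reference, the default election procedure described in RFC 7432 orders the PE addresses on the segment numerically and picks the one at index (VLAN modulo number of PEs). Below is a minimal Python sketch of that rule. The function name `elect_df` is my own, and I’m not claiming this is exactly what Junos runs internally, only that it agrees with the output above for VLAN 100.

```python
import ipaddress


def elect_df(pe_addresses, vlan_id):
    """Return (designated forwarder, ordered PE list) for one VLAN.

    Sketch of the default "service carving" rule from RFC 7432:
    sort the PE addresses ascending, then take index vlan_id mod N.
    """
    ordered = sorted(pe_addresses, key=ipaddress.ip_address)
    return ordered[vlan_id % len(ordered)], ordered


df_100, ordered = elect_df(["10.10.10.2", "10.10.10.1"], vlan_id=100)
print(df_100)  # 10.10.10.1

df_101, _ = elect_df(["10.10.10.2", "10.10.10.1"], vlan_id=101)
print(df_101)  # 10.10.10.2
```

With two PEs, even-numbered VLANs land on the lower address and odd-numbered VLANs on the higher one, so BUM forwarding duty can be shared across the PEs when several VLANs are present.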

Anyways, that’s about all I have time for tonight – I hope you found this useful!

EVPN – Single-active redundancy

In the previous 2 posts I looked at the basics of EVPN, including the new BGP based control-plane, and later at the integration between the layer-2 and layer-3 worlds within EVPN. However, all the previous examples were shown with basic single-site networks with no link or device redundancy. In this post I’m going to look at the first and simplest EVPN redundancy mode.

First – consider the new lab topology:

Capture4

The topology and configuration remains pretty much the same, except that MX-1 and MX-2 each connect back to EX4200-1, for VLAN 100 and VLAN 101, with the same IRB interfaces present on each MX router, essentially a very basic site with 2 PEs for redundancy.

Let’s recap the EVPN configuration on MX-1. I’ve got the exact same configuration loaded on MX-2 and MX-3, the only differences being the interface numbers and a unique RD for each router.

MX-1: 

tim@MX5-1> show configuration routing-instances
EVPN-100 {
    instance-type virtual-switch;
    route-distinguisher 1.1.1.1:100;
    vrf-target target:100:100;
    protocols {
        evpn {
            extended-vlan-list 100-101;
            default-gateway do-not-advertise;
        }
    }
    bridge-domains {
        VL-100 {
            vlan-id 100;
            interface ge-1/1/5.100;
            routing-interface irb.100;
        }
        VL-101 {
            vlan-id 101;
            interface ge-1/1/5.101;
            routing-interface irb.101;
        }
    }
}
VPN-100 {
    instance-type vrf;
    interface irb.100;
    interface irb.101;
    route-distinguisher 100.100.100.1:100;
    vrf-target target:1:100;
    vrf-table-label;
}
tim@MX5-1>

 

 

Essentially, each site is configured exactly the same, except for a unique RD per site, and differences in the interface numbering.

In terms of providing active/standby redundancy at the main site, for layer-2 and layer-3 simultaneously, we would historically use VPLS combined with VRRP on the IRB interfaces to provide connectivity.

However this isn’t a perfect solution, for the following reasons:

  1. Unlike EVPN – VPLS needs unique IPv4 GW/MAC addresses at each site, inside the same VPN, so the only way to do active-standby redundancy is with VRRP.
  2. VRRP designs can become complex, ensuring that everything is tracked and monitored – partial failures can be hard to track and things can get over-complicated.
  3. Traffic tromboning can occur where VRRP is used

Regarding point 3:

Imagine a scenario where each PE provides a layer-3 default gateway for each VLAN, where MX1 is VRRP active for VLAN 100 and MX2 is VRRP active for VLAN 101:

Capture5

It looks simple enough, but traffic tromboning can occur quite easily – due to the reliance on VRRP, for example if host-1 in VLAN 100 wants to send traffic to host-2 in VLAN 101, connected to the same switch – the following things happen:

  1. The packet hits the VRRP active VLAN 100 IRB interface on MX1
  2. Because VLAN 101 is in standby mode on MX1 – it can’t be switched locally
  3. MX1 forwards the packet towards the MPLS network, because there’s a BGP route coming from MX2 (because it’s VRRP active for VLAN 101)
  4. Rather than being routed locally, the packet has to traverse the MPLS network, in order to route between VLANs:

Capture6
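The steps above can be boiled down to a few lines of Python. This is just a toy model of the scenario with invented names (`vrrp_active`, `inter_vlan_path`); it only encodes the idea that a packet has to reach the PE which is VRRP active for the destination VLAN.

```python
# Which PE is VRRP active for each VLAN, as in the scenario above.
vrrp_active = {100: "MX1", 101: "MX2"}


def inter_vlan_path(src_vlan, dst_vlan):
    """Return the hops a routed packet takes between two local VLANs."""
    ingress = vrrp_active[src_vlan]   # gateway the sending host uses
    egress = vrrp_active[dst_vlan]    # PE that is active for the target VLAN
    if ingress == egress:
        return [ingress]              # routed locally on a single PE
    return [ingress, "MPLS network", egress]   # the trombone


path = inter_vlan_path(100, 101)
print(path)  # ['MX1', 'MPLS network', 'MX2']
```

If both VLANs happened to be active on the same PE, the function would return a single hop, which is the behaviour you’d actually want for two hosts on the same switch.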

Things like this are a pain, and can be mitigated by design and awareness from the start. But in my opinion these sorts of scenarios are good examples of why EVPN was invented: VPLS never properly solved the basic problems that we get in day-to-day designs, so for simple bread-and-butter problems like routing between VLANs you end up having a nightmare.

So how does EVPN do it differently?

First, let’s look at the configuration required to convert the lab topology into EVPN active-standby. It’s pretty simple:

MX-1: 

tim@MX5-1# run show configuration interfaces ge-1/1/5
flexible-vlan-tagging;
encapsulation flexible-ethernet-services;
esi {
    00:11:22:33:44:55:66:77:88:99;
    single-active;
}
unit 100 {
    encapsulation vlan-bridge;
    vlan-id 100;
}
unit 101 {
    encapsulation vlan-bridge;
    vlan-id 101;
}
[edit]
tim@MX5-1#

 

MX-2:

tim@MX5-2# run show configuration interfaces ge-1/0/5
flexible-vlan-tagging;
encapsulation flexible-ethernet-services;
esi {
    00:11:22:33:44:55:66:77:88:99;
    single-active;
}
unit 100 {
    encapsulation vlan-bridge;
    vlan-id 100;
}
unit 101 {
    encapsulation vlan-bridge;
    vlan-id 101;
}
[edit]
tim@MX5-2#

 

In basic EVPN where sites are single-homed, the “ESI” (Ethernet segment identifier) remains at zero. Whenever you have single-active or active-active multi-homing, however, the ESI must be configured to a non-default value. Its purpose is to identify an Ethernet segment, and as such it identifies the entire “site” or “data-centre” to other PE routers on the network. It’s configured under the physical Ethernet interface and must be the same across the segment, in this case on the access-facing interfaces of MX1 and MX2.

Secondly, under the ESI configuration the PE interfaces are configured to operate in “single-active” mode, which should be self explanatory to most readers 🙂
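The ESI rules are simple enough to sanity-check in a few lines of Python. This is a minimal sketch with helper names I’ve made up (`parse_esi`, `is_multihomed`), assuming the colon-separated, 10-octet form used in the configuration above.

```python
def parse_esi(text):
    """Parse a colon-separated ESI into 10 bytes, or raise ValueError."""
    octets = bytes(int(part, 16) for part in text.split(":"))
    if len(octets) != 10:
        raise ValueError(f"an ESI is 10 octets, got {len(octets)}")
    return octets


def is_multihomed(esi):
    """An all-zero ESI means single-homed; anything else is a shared segment."""
    return any(esi)


mx1_esi = parse_esi("00:11:22:33:44:55:66:77:88:99")  # from ge-1/1/5 on MX-1
mx2_esi = parse_esi("00:11:22:33:44:55:66:77:88:99")  # from ge-1/0/5 on MX-2
default_esi = parse_esi(":".join(["00"] * 10))        # single-homed default

print(is_multihomed(mx1_esi))      # True
print(mx1_esi == mx2_esi)          # True
print(is_multihomed(default_esi))  # False
```

The second check is the important one in practice: if the two PEs don’t carry exactly the same ESI, the rest of the network has no way of knowing they’re attached to the same segment.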

How does this alter the EVPN control-plane? Let’s have a more detailed look at the EVPN instance on MX-1:

 

tim@MX5-1> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.1:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                 2       2
    Default gateway MAC addresses:       2       0
  Number of local interfaces: 2 (2 up)
    Interface name  ESI                            Mode             Status
    ge-1/1/5.100    00:11:22:33:44:55:66:77:88:99  single-active    Up
    ge-1/1/5.101    00:11:22:33:44:55:66:77:88:99  single-active    Up
  Number of IRB interfaces: 2 (2 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Up      VPN-100
    irb.101         101      Up      VPN-100
  Number of bridge domains: 2
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   302080
    101          1   1     Extended         Enabled   301872
  Number of neighbors: 2
    10.10.10.2
      Received routes
        MAC address advertisement:              0
        MAC+IP address advertisement:           0
        Inclusive multicast:                    2
        Ethernet auto-discovery:                1
    10.10.10.3
      Received routes
        MAC address advertisement:              2
        MAC+IP address advertisement:           2
        Inclusive multicast:                    2
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by IFL ge-1/1/5.100
      Local interface: ge-1/1/5.100, Status: Up/Forwarding
      Number of remote PEs connected: 1
        Remote PE        MAC label  Aliasing label  Mode
        10.10.10.2       301008     0               single-active
      Designated forwarder: 10.10.10.1
      Backup forwarder: 10.10.10.2
      Advertised MAC label: 301232
      Advertised aliasing label: 301232
      Advertised split horizon label: 0
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.1:0
  VLAN ID: None
  Per-instance MAC route label: 299808
  MAC database status                Local  Remote
    Total MAC addresses:                 0       0
    Default gateway MAC addresses:       0       0
  Number of local interfaces: 0 (0 up)
  Number of IRB interfaces: 0 (0 up)
  Number of bridge domains: 0
  Number of neighbors: 1
    10.10.10.2
      Received routes
        Ethernet auto-discovery:                0
        Ethernet Segment:                       1
  Number of ethernet segments: 0
tim@MX5-1>

 

 

A couple of things to note:

  • EVPN is running in single-active mode on both ge-1/1/5.100 and ge-1/1/5.101
  • The access interface (ge-1/1/5) on MX1 is shown as Up/Forwarding, making this the active PE
  • The designated forwarder is MX1 (10.10.10.1)
  • The backup designated forwarder is MX2 (10.10.10.2)

Because MX-1 is the active PE, let’s take a look at BGP on MX-3 to see what routes are advertised from the redundant site to a remote site:

(Note: I currently have 2Mbps of IXIA traffic flowing bidirectionally between each site, in each VLAN)

EVPN-100.evpn.0: 17 destinations, 17 routes (17 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both
1:1.1.1.1:100::112233445566778899::0/304
                   *[BGP/170] 04:17:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
1:10.10.10.1:0::112233445566778899::FFFF:FFFF/304
                   *[BGP/170] 04:17:27, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
1:10.10.10.2:0::112233445566778899::FFFF:FFFF/304
                   *[BGP/170] 13:50:18, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300848
2:1.1.1.1:100::100::00:00:2e:18:6d:e1/304
                   *[BGP/170] 04:17:23, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
2:1.1.1.1:100::101::00:00:2e:e6:77:95/304
                   *[BGP/170] 04:17:23, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
2:1.1.1.1:100::100::00:00:2e:18:6d:e1::192.168.100.10/304
                   *[BGP/170] 04:17:23, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
2:1.1.1.1:100::101::00:00:2e:e6:77:95::192.168.101.10/304
                   *[BGP/170] 04:17:23, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
3:1.1.1.1:100::100::10.10.10.1/304
                   *[BGP/170] 04:17:26, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
3:1.1.1.1:100::101::10.10.10.1/304
                   *[BGP/170] 13:50:26, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300912
3:1.1.1.2:100::100::10.10.10.2/304
                   *[BGP/170] 13:50:18, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300848
3:1.1.1.2:100::101::10.10.10.2/304
                   *[BGP/170] 13:50:18, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 192.169.100.15 via ge-1/1/0.0, Push 300848
tim@MX5-3>

 

We covered type-2 and type-3 routes in the previous labs, but here we have a new type-1 route being received on MX-3. What’s that all about? Let’s take a deeper look:

  1. tim@MX5-3> show route protocol bgp table EVPN-100.evpn.0 extensive
  2. EVPN-100.evpn.0: 17 destinations, 17 routes (17 active, 0 holddown, 0 hidden)
  3. 1:1.1.1.1:100::112233445566778899::0/304 (1 entry, 1 announced)
  4.         *BGP    Preference: 170/-101
  5.                 Route Distinguisher: 1.1.1.1:100
  6.                 Next hop type: Indirect
  7.                 Address: 0x2a7b880
  8.                 Next-hop reference count: 16
  9.                 Source: 10.10.10.1
  10.                 Protocol next hop: 10.10.10.1
  11.                 Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  12.                 State: <Secondary Active Int Ext>
  13.                 Local AS:   100 Peer AS:   100
  14.                 Age: 4:21:25    Metric2: 1
  15.                 Validation State: unverified
  16.                 Task: BGP_100.10.10.10.1+179
  17.                 Announcement bits (1): 0-EVPN-100-evpn
  18.                 AS path: I
  19.                 Communities: target:100:100
  20.                 Import Accepted
  21.                 Route Label: 301232
  22.                 Localpref: 100
  23.                 Router ID: 10.10.10.1
  24.                 Primary Routing Table bgp.evpn.0
  25.                 Indirect next hops: 1
  26.                         Protocol next hop: 10.10.10.1 Metric: 1
  27.                         Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  28.                         Indirect path forwarding next hops: 1
  29.                                 Next hop type: Router
  30.                                 Next hop: 192.169.100.15 via ge-1/1/0.0
  31.                                 Session Id: 0x0
  32.             10.10.10.1/32 Originating RIB: inet.3
  33.               Metric: 1           Node path count: 1
  34.               Forwarding nexthops: 1
  35.                 Nexthop: 192.169.100.15 via ge-1/1/0.0
  36. 1:10.10.10.1:0::112233445566778899::FFFF:FFFF/304 (1 entry, 1 announced)
  37.         *BGP    Preference: 170/-101
  38.                 Route Distinguisher: 10.10.10.1:0
  39.                 Next hop type: Indirect
  40.                 Address: 0x2a7b880
  41.                 Next-hop reference count: 16
  42.                 Source: 10.10.10.1
  43.                 Protocol next hop: 10.10.10.1
  44.                 Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  45.                 State: <Secondary Active Int Ext>
  46.                 Local AS:   100 Peer AS:   100
  47.                 Age: 4:21:25    Metric2: 1
  48.                 Validation State: unverified
  49.                 Task: BGP_100.10.10.10.1+179
  50.                 Announcement bits (1): 0-EVPN-100-evpn
  51.                 AS path: I
  52.                 Communities: target:100:100 esi-label:single-active (label 0)
  53.                 Import Accepted
  54.                 Localpref: 100
  55.                 Router ID: 10.10.10.1
  56.                 Primary Routing Table bgp.evpn.0
  57.                 Indirect next hops: 1
  58.                         Protocol next hop: 10.10.10.1 Metric: 1
  59.                         Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  60.                         Indirect path forwarding next hops: 1
  61.                                 Next hop type: Router
  62.                                 Next hop: 192.169.100.15 via ge-1/1/0.0
  63.                                 Session Id: 0x0
  64.             10.10.10.1/32 Originating RIB: inet.3
  65.               Metric: 1           Node path count: 1
  66.               Forwarding nexthops: 1
  67.                 Nexthop: 192.169.100.15 via ge-1/1/0.0
  68. 1:10.10.10.2:0::112233445566778899::FFFF:FFFF/304 (1 entry, 1 announced)
  69.         *BGP    Preference: 170/-101
  70.                 Route Distinguisher: 10.10.10.2:0
  71.                 Next hop type: Indirect
  72.                 Address: 0x2a7ae54
  73.                 Next-hop reference count: 6
  74.                 Source: 10.10.10.2
  75.                 Protocol next hop: 10.10.10.2
  76.                 Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  77.                 State: <Secondary Active Int Ext>
  78.                 Local AS:   100 Peer AS:   100
  79.                 Age: 13:54:16   Metric2: 1
  80.                 Validation State: unverified
  81.                 Task: BGP_100.10.10.10.2+179
  82.                 Announcement bits (1): 0-EVPN-100-evpn
  83.                 AS path: I
  84.                 Communities: target:100:100 esi-label:single-active (label 0)
  85.                 Import Accepted
  86.                 Localpref: 100
  87.                 Router ID: 10.10.10.2
  88.                 Primary Routing Table bgp.evpn.0
  89.                 Indirect next hops: 1
  90.                         Protocol next hop: 10.10.10.2 Metric: 1
  91.                         Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  92.                         Indirect path forwarding next hops: 1
  93.                                 Next hop type: Router
  94.                                 Next hop: 192.169.100.15 via ge-1/1/0.0
  95.                                 Session Id: 0x0
  96.             10.10.10.2/32 Originating RIB: inet.3
  97.               Metric: 1           Node path count: 1
  98.               Forwarding nexthops: 1
  99.                 Nexthop: 192.169.100.15 via ge-1/1/0.0

 

The Type-1 route is known as an AD or Auto-Discovery route, and it comes in two distinct flavours:

  • A per-EVI AD route (line 3)
  • A per-ESI AD route (lines 36 and 68)

The first route (line 3) is known as a per-EVI route, and contains what’s known as the “aliasing label”. Technically this isn’t required in an active-standby situation, as it exists to ensure that traffic can be forwarded equally where you have multiple PEs in an active-active setup. It solves the problem of traffic polarisation: when a CE hashes traffic onto one egress link only, that gets replicated in the control-plane (only one PE learns and advertises the MAC), so return traffic is polarised too. The aliasing label gets around this simply because a remote PE treats it like a regular MAC/IP route, but more on that in the next blog 🙂

The other two routes (lines 36 and 68) are per-ESI AD routes. They contain the ESI of the site and are advertised by MX1 and MX2. You’ll notice that the community is set to “target:100:100 esi-label:single-active” with a label value of 0. This is essentially telling MX3 that the ESI is running in single-active mode. If it were running in active-active mode, a non-zero MPLS label would be present in order to cater for split-horizon filtering of BUM traffic. In this case the setup is single-active, so there will only ever be one active path at a time back to site 1.
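To make the per-EVI versus per-ESI distinction a bit more concrete, here is a rough Python sketch that splits the type-1 prefixes exactly as they are printed in the output above. The function and class names are made up for illustration, and the parsing assumes the `type:RD::ESI::tag/len` layout seen in this lab's output rather than any documented Junos format. The one protocol fact it leans on is that a per-ESI AD route carries the reserved maximum Ethernet tag (shown as FFFF:FFFF).

```python
from dataclasses import dataclass

# The reserved "maximum" Ethernet tag, as it is printed in the output above.
MAX_ET = "FFFF:FFFF"


@dataclass
class AdRoute:
    rd: str            # route distinguisher, e.g. "10.10.10.1:0"
    esi: str           # ESI value as printed (hex digits, no colons)
    ethernet_tag: str  # "0" or "FFFF:FFFF" in this lab

    @property
    def scope(self) -> str:
        # Per-ESI AD routes use the reserved maximum Ethernet tag.
        return "per-ESI" if self.ethernet_tag.upper() == MAX_ET else "per-EVI"


def parse_type1(prefix: str) -> AdRoute:
    """Split a type-1 prefix such as '1:1.1.1.1:100::1122...::0/304'."""
    key = prefix.split("/", 1)[0]          # drop the trailing /304
    route_type, rest = key.split(":", 1)   # leading digit is the route type
    if route_type != "1":
        raise ValueError(f"not a type-1 route: {prefix}")
    rd, esi, tag = rest.split("::")        # fields are separated by '::'
    return AdRoute(rd=rd, esi=esi, ethernet_tag=tag)


if __name__ == "__main__":
    for p in (
        "1:1.1.1.1:100::112233445566778899::0/304",
        "1:10.10.10.1:0::112233445566778899::FFFF:FFFF/304",
        "1:10.10.10.2:0::112233445566778899::FFFF:FFFF/304",
    ):
        route = parse_type1(p)
        print(route.rd, route.scope)
```

Run against the three type-1 prefixes from MX-3, it labels the first as per-EVI and the two with the FFFF:FFFF tag as per-ESI.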

These routes also speed up convergence. If you’re advertising thousands of MAC/IP routes and you get a link failure, the PE doesn’t have to send BGP messages to withdraw all of those routes before anything happens; it can simply withdraw the Ethernet AD routes, and remote PEs stop using everything they learnt via that segment.
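The convergence benefit is easier to see with a toy model. The Python below is only a sketch of the idea, not how Junos (or any real PE) stores its state: the `RemotePe` class and its method names are invented for illustration. A remote PE treats a MAC as usable only while the AD route for the segment it was learnt behind is still present, so one withdrawal invalidates every MAC at once.

```python
class RemotePe:
    """Toy view of a remote PE's EVPN state (hypothetical, for illustration)."""

    def __init__(self):
        self.ad_routes = set()   # (advertising PE, ESI) pairs currently valid
        self.mac_routes = {}     # MAC -> (advertising PE, ESI)

    def learn_ad(self, pe, esi):
        self.ad_routes.add((pe, esi))

    def learn_mac(self, mac, pe, esi):
        self.mac_routes[mac] = (pe, esi)

    def withdraw_ad(self, pe, esi):
        # A single withdrawal; the MAC routes themselves are untouched.
        self.ad_routes.discard((pe, esi))

    def usable_macs(self):
        # A MAC is only usable while its segment's AD route is still present.
        return [m for m, via in self.mac_routes.items() if via in self.ad_routes]


if __name__ == "__main__":
    esi = "00:11:22:33:44:55:66:77:88:99"
    mx3 = RemotePe()
    mx3.learn_ad("10.10.10.1", esi)
    for i in range(1000):
        mx3.learn_mac(f"00:00:2e:00:{i // 256:02x}:{i % 256:02x}", "10.10.10.1", esi)

    print(len(mx3.usable_macs()))    # all 1000 MACs usable
    mx3.withdraw_ad("10.10.10.1", esi)
    print(len(mx3.usable_macs()))    # none usable after one withdrawal
```

The individual MAC/IP withdrawals can still follow afterwards; the point is that forwarding no longer has to wait for them.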

Next let’s take a look at what’s going on at the main site, and see what MX1 is advertising to MX2:

 

tim@MX5-1> show route advertising-protocol bgp 10.10.10.2 evpn-esi-value 00:11:22:33:44:55:66:77:88:99 detail
VPN-100.inet.0: 8 destinations, 14 routes (8 active, 0 holddown, 0 hidden)
EVPN-100.evpn.0: 16 destinations, 16 routes (16 active, 0 holddown, 0 hidden)
* 1:1.1.1.1:100::112233445566778899::0/304 (1 entry, 1 announced)
 BGP group iBGP-PEs type Internal
     Route Distinguisher: 1.1.1.1:100
     Route Label: 301232
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [100] I
     Communities: target:100:100
__default_evpn__.evpn.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
* 1:10.10.10.1:0::112233445566778899::FFFF:FFFF/304 (1 entry, 1 announced)
 BGP group iBGP-PEs type Internal
     Route Distinguisher: 10.10.10.1:0
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [100] I
     Communities: target:100:100 esi-label:single-active (label 0)
* 4:10.10.10.1:0::112233445566778899:10.10.10.1/304 (1 entry, 1 announced)
 BGP group iBGP-PEs type Internal
     Route Distinguisher: 10.10.10.1:0
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [100] I
     Communities: es-import-target:22-33-44-55-66-77

 

You can see that there’s a new “type-4” route being advertised. This is known as an “Ethernet Segment (ES) route” and is advertised by PE routers which are configured with non-zero ESI values. It carries a special extended community (ES-Import target) which is derived from the ESI, so a PE will only import the route if it has the same ESI configured. This means that two PE routers remote from one another know that they’re both connected to the same Ethernet segment; all other PE routers, with a default or different ESI value, filter these advertisements.
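As a rough illustration of that filtering, the Python sketch below derives an ES-Import value from a configured ESI and uses it to decide whether a received type-4 route would be imported. Treat it as an assumption-laden sketch: the choice of which six octets to use is inferred purely from the single output above (ESI 00:11:22:33:44:55:66:77:88:99 gives es-import-target:22-33-44-55-66-77), and the function names are hypothetical.

```python
def es_import_target(esi: str) -> str:
    """Derive the ES-Import value from a 10-octet ESI written as aa:bb:...

    Assumption: octets 3 to 8 of the ESI are used, which is what the
    output in this lab shows. Other platforms or ESI types may differ.
    """
    octets = esi.lower().split(":")
    if len(octets) != 10:
        raise ValueError("expected a 10-octet ESI")
    return "-".join(octets[2:8])


def imports_es_route(local_esi: str, received_target: str) -> bool:
    """A PE imports a type-4 route only if the target matches its own ESI."""
    if local_esi == "00:00:00:00:00:00:00:00:00:00":
        return False  # single-homed PE, no Ethernet segment to match
    return es_import_target(local_esi) == received_target.lower()


if __name__ == "__main__":
    target = es_import_target("00:11:22:33:44:55:66:77:88:99")
    print(target)
    # MX2 shares the ESI, so it imports; a PE with a zero ESI does not.
    print(imports_es_route("00:11:22:33:44:55:66:77:88:99", target))
    print(imports_es_route("00:00:00:00:00:00:00:00:00:00", target))
```

With the lab's ESI this reproduces the community seen on MX1's type-4 route, and a PE left at the default ESI simply ignores it.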

So a quick recap: we’ve looked at the new route types, the control-plane and the configuration. The next step is to see how well it works, so first a reminder of the diagram:

Capture7

I’ve created a flow of IXIA traffic running bidirectionally between the top site and the bottom site. If I go to MX-1 and look at the MPLS-facing interface, we should see the traffic:


Physical interface: ge-1/1/0, Enabled, Physical link is Up
Interface index: 147, SNMP ifIndex: 525
Link-level type: Ethernet, MTU: 1514, MRU: 1522, Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled,
Flow control: Enabled, Auto-negotiation: Enabled, Remote fault: Online
Pad to minimum frame size: Disabled
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x0
Link flags : None
CoS queues : 8 supported, 8 maximum usable queues
Current address: a8:d0:e5:5b:7c:90, Hardware address: a8:d0:e5:5b:7c:90
Last flapped : 2016-06-10 20:15:19 UTC (5d 19:13 ago)
Input rate : 5599000 bps (500 pps)
Output rate : 5583408 bps (499 pps)

So it’s clear that traffic is being forwarded by MX-1. Because I’m sending packets at an exact rate of 1000pps, we should be able to measure how quickly fail-over occurs by counting the number of lost packets. For example, at 1000pps each lost packet represents 1ms, so if I lose 50 packets, that yields a fail-over time of 50ms.
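The arithmetic is trivial, but it’s worth writing down since every fail-over number in this post comes from it. This is just a small Python helper (the function name is my own) that converts a lost-frame count at a known constant rate into milliseconds of outage:

```python
def failover_ms(lost_frames: int, rate_pps: float) -> float:
    """Outage duration in milliseconds implied by a lost-frame count.

    Assumes the tester sends at a constant rate, so every lost frame
    stands for 1/rate_pps seconds of outage.
    """
    if rate_pps <= 0:
        raise ValueError("rate_pps must be positive")
    return lost_frames * 1000.0 / rate_pps


if __name__ == "__main__":
    print(failover_ms(50, 1000))    # the worked example above
    print(failover_ms(1077, 1000))  # the frame delta measured in the first test
```

The same helper works for any send rate; at 500pps, for instance, each lost frame would be worth 2ms instead of 1ms.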

First an easy failure – I’ll shut down ge-0/0/0 on EX4200-1, this will put the interface down/down on MX-1 and we’ll measure how long it takes to recover:


imtech@ex4200-1# set interfaces ge-0/0/0 disable
{master:0}[edit]
imtech@ex4200-1# commit
configuration check succeeds
commit complete
{master:0}[edit]
imtech@ex4200-1#

Let’s look at how much traffic was lost:

Fail1

Frames delta = 1077, so just a fraction longer than 1 second to fail over, which isn’t THAT bad; we might be able to improve it later.

Let’s check the EVPN instance to see how things have changed:

on MX1:

tim@MX5-1> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.1:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                 0       3
    Default gateway MAC addresses:       0       0
  Number of local interfaces: 2 (0 up)
    Interface name  ESI                            Mode             Status
    ge-1/1/5.100    00:11:22:33:44:55:66:77:88:99  single-active    Down
    ge-1/1/5.101    00:11:22:33:44:55:66:77:88:99  single-active    Down
  Number of IRB interfaces: 2 (0 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Down    VPN-100
    irb.101         101      Down    VPN-100
  Number of bridge domains: 2
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   0     Extended         Enabled
    101          1   0     Extended         Enabled
  Number of neighbors: 2
    10.10.10.2
      Received routes
        MAC address advertisement:              1
        MAC+IP address advertisement:           1
        Inclusive multicast:                    2
        Ethernet auto-discovery:                2
    10.10.10.3
      Received routes
        MAC address advertisement:              2
        MAC+IP address advertisement:           2
        Inclusive multicast:                    2
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by NH 1048582
      Local interface: ge-1/1/5.100, Status: Down
      Number of remote PEs connected: 1
        Remote PE        MAC label  Aliasing label  Mode
        10.10.10.2       301008     301008          single-active
      Designated forwarder: 10.10.10.2
      Advertised MAC label: 301232
      Advertised aliasing label: 301232
      Advertised split horizon label: 0
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.1:0
  VLAN ID: None
  Per-instance MAC route label: 299808
  MAC database status                Local  Remote
    Total MAC addresses:                 0       0
    Default gateway MAC addresses:       0       0
  Number of local interfaces: 0 (0 up)
  Number of IRB interfaces: 0 (0 up)
  Number of bridge domains: 0
  Number of neighbors: 1
    10.10.10.2
      Received routes
        Ethernet auto-discovery:                0
        Ethernet Segment:                       1
  Number of ethernet segments: 0
tim@MX5-1>

 

So it’s pretty clear that things have gone down, and MX2 is the new active PE router. Let’s check it out:

tim@MX5-2> show evpn instance extensive
Instance: EVPN-100
  Route Distinguisher: 1.1.1.2:100
  Per-instance MAC route label: 299776
  MAC database status                Local  Remote
    Total MAC addresses:                 1       2
    Default gateway MAC addresses:       2       0
  Number of local interfaces: 2 (2 up)
    Interface name  ESI                            Mode             Status
    ge-1/0/5.100    00:11:22:33:44:55:66:77:88:99  single-active    Up
    ge-1/0/5.101    00:11:22:33:44:55:66:77:88:99  single-active    Up
  Number of IRB interfaces: 2 (2 up)
    Interface name  VLAN ID  Status  L3 context
    irb.100         100      Up      VPN-100
    irb.101         101      Up      VPN-100
  Number of bridge domains: 2
    VLAN ID  Intfs / up    Mode             MAC sync  IM route label
    100          1   1     Extended         Enabled   302272
    101          1   1     Extended         Enabled   302224
  Number of neighbors: 1
    10.10.10.3
      Received routes
        MAC address advertisement:              2
        MAC+IP address advertisement:           2
        Inclusive multicast:                    2
        Ethernet auto-discovery:                0
  Number of ethernet segments: 1
    ESI: 00:11:22:33:44:55:66:77:88:99
      Status: Resolved by IFL ge-1/0/5.100
      Local interface: ge-1/0/5.100, Status: Up/Forwarding
      Designated forwarder: 10.10.10.2
      Advertised MAC label: 301008
      Advertised aliasing label: 301008
      Advertised split horizon label: 0
Instance: __default_evpn__
  Route Distinguisher: 10.10.10.2:0
  VLAN ID: None
  Per-instance MAC route label: 299808
  MAC database status                Local  Remote
    Total MAC addresses:                 0       0
    Default gateway MAC addresses:       0       0
  Number of local interfaces: 0 (0 up)
  Number of IRB interfaces: 0 (0 up)
  Number of bridge domains: 0
  Number of neighbors: 0
  Number of ethernet segments: 0
tim@MX5-2>

 

 

If we look at the MPLS facing interface on MX2, we should see that all traffic is being sent and received via the MPLS network:


tim@MX5-2> show interfaces ge-1/1/0
Physical interface: ge-1/1/0, Enabled, Physical link is Up
Interface index: 147, SNMP ifIndex: 526
Link-level type: Ethernet, MTU: 1514, MRU: 1522, Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled, Source filtering: Disabled,
Flow control: Enabled, Auto-negotiation: Enabled, Remote fault: Online
Pad to minimum frame size: Disabled
Device flags : Present Running
Interface flags: SNMP-Traps Internal: 0x0
Link flags : None
CoS queues : 8 supported, 8 maximum usable queues
Current address: a8:d0:e5:5b:75:90, Hardware address: a8:d0:e5:5b:75:90
Last flapped : 2016-06-10 20:08:17 UTC (5d 19:42 ago)
Input rate : 5605824 bps (502 pps)
Output rate : 5584392 bps (501 pps)

 

The solution itself is a lot more elegant than traditional FHRPs (first-hop redundancy protocols) such as VRRP or HSRP.

  • Because MX1 and MX2 automatically learn about each other via the MPLS network and the type-4 Ethernet-Segment route, and NOT via the LAN (like HSRP), if there’s any problem with the MPLS side connected to the active router, it transitions to standby and the solution fails over.

If I fail the MPLS interface on the “P” router connected to MX1, we get failover in less than 1 second:


Axians@m10i-1# set interfaces ge-0/0/2 disable
[edit]
Axians@m10i-1# commit
commit complete

Then check the packet loss in IXIA:

Fail2

The solution recovers from the failure in 912ms.

This is pretty great, not least because it works reliably, but also because most of this functionality is built directly into the protocol. I haven’t had to do any crazy tracking of routes, and I haven’t needed to go anywhere near IP SLA or any of that horror that is a massive pain when designing this sort of thing. With EVPN, things are pretty simple and work reliably.

It’s not perfect however. Unlike HSRP or VRRP, which form an adjacency over a LAN via multicast, EVPN doesn’t do this; all information about other PEs is sent and received via BGP. If you have a complex LAN environment and a failure leaves the PEs isolated, you don’t get a traditional split-brain scenario like you would with HSRP or VRRP; the solution simply doesn’t fail over at all. The basic triggers for fail-over are that the physical interface goes down, the MPLS side goes down, or the entire PE goes down.

This can easily be demonstrated by breaking the logical interface on EX4200-1 whilst leaving the physical interface up/up:


imtech@ex4200-1# set interfaces ge-0/0/0.0 disable
{master:0}[edit]
imtech@ex4200-1# commit
configuration check succeeds
commit complete

The whole solution breaks, and stays broken forever:

Fail3

So you still need to be careful with the design and the different way in which EVPN operates. Incidentally, you can use things like Ethernet OAM to get around this problem.

Just for laughs, let’s apply a basic Ethernet OAM config to MX1, MX2 and the EX4200.

OAM template (shown just on MX-1):

oam {
    ethernet {
        connectivity-fault-management {
            action-profile bring-down {
                event {
                    interface-status-tlv down;
                    adjacency-loss;
                }
                action {
                    interface-down;
                }
            }
            maintenance-domain "IEEE level 4" {
                level 4;
                maintenance-association PE1 {
                    short-name-format character-string;
                    continuity-check {
                        interval 100ms;
                        interface-status-tlv;
                    }
                    mep 1 {
                        interface ge-1/1/5.100;
                        direction down;
                        auto-discovery;
                        remote-mep 2 {
                            action-profile bring-down;
                        }
                    }
                }
            }
        }
    }
}

 

Just for clarity, the OAM configuration ensures that if there’s a problem with connectivity between MX1 and EX4200-1, or between MX2 and EX4200-1, while the physical interfaces remain up/up, OAM will detect the connectivity loss, automatically bring the line-protocol of the interface down, and force EVPN to fail over.

Let’s repeat the exact same test again, with the OAM configuration applied to the PEs and the switch:


imtech@ex4200-1# set interfaces ge-0/0/0.0 disable
{master:0}[edit]
imtech@ex4200-1# commit
configuration check succeeds
commit complete

and check the packet-loss with IXIA:

Fail4

Not bad! 612 packets lost, equals failure and convergence in 624ms, which is a lot better than the original 1077ms when failing the physical interface, and a hell of a lot better than it being down forever if the network experiences a non-direct failure (software/logical fail).

Anyway I hope you’ve found this useful. There are a few bits I’ve skipped over, but I’ll cover those in more detail when I do all-active redundancy in the next blog 🙂

 

EVPN Inter-VLAN routing + mobility

So in the last blog I essentially looked at one of the most basic aspects of EVPN: a multi-site layer-2 network with nothing fancy going on, with traffic forwarding occurring between multiple sites in the same VLAN. The fact of the matter is that there was nothing going on there that you couldn’t do with a traditional VPLS configuration, however the general idea was to demonstrate the basics and take a look at the basic control-plane first.

In this update we’ll be looking at some of the more exclusive and highly useful aspects of EVPN which make it a very attractive technology for things such as data-centre interconnect. There are a few things which are possible with EVPN which cannot be done with VPLS.

Consider the revised topology:

Capture

It’s the same topology from the first blog post, however I’ve simply added an additional VLAN (VLAN 101) to ge-0/0/22 of each EX4200 LAN switch, and an additional IXIA host.

For this post we’re going to look at a rather cool way of performing inter-VLAN forwarding between hosts in VLAN100 and VLAN101. Not that I want to spend time teaching people how to suck eggs, but generally in a simple network with multiple VLANs you have 2 common ways of performing inter-VLAN forwarding:

  • Use a good ole’ fashioned router on a stick topology
  • Bolt some additional layer-3 functionality onto your layer-2 switch

As everyone knows, the latter method is by far the most common – the vast majority of switches support layer-3 routing functionality, usually in the form of IRB/BVI/SVI depending on the vendor in question.

In a service provider network, where we generally have a number of PE routers acting together as a large distributed switch, providing layer-2 connectivity – the old fashioned way of doing this would be with VPLS. In order to enable inter-VLAN forwarding we’d add a BVI interface to the VPLS instance, this enables a PE to do standard layer-2 switching and route between VLANs at layer-3 – which is very important for data-centre interconnect applications.

EVPN has a number of enhancements which make it more suitable for modern day data-centre interconnect designs, especially where things such as VM mobility are concerned. A company or organisation with a traditional MPLS based network, might require the ability to move hosts around between data centres seamlessly, without causing any real downtime.

Let’s take a look at the basic interface configuration and routing-instance configuration:

  1. interfaces {
  2.     irb {
  3.         unit 100 {
  4.             family inet {
  5.                 address 192.168.100.1/24;
  6.             }
  7.             mac 00:00:19:21:68:10;
  8.         }
  9.         unit 101 {
  10.             family inet {
  11.                 address 192.168.101.1/24;
  12.             }
  13.             mac 00:00:19:21:68:11;
  14.         }
  15.     }
  16. routing-instances {
  17. EVPN-100 {
  18.     instance-type virtual-switch;
  19.     route-distinguisher 1.1.1.1:100;
  20.     vrf-target target:100:100;
  21.     protocols {
  22.         evpn {
  23.             extended-vlan-list 100-101;
  24.             default-gateway do-not-advertise;
  25.         }
  26.     }
  27.     bridge-domains {
  28.         VL-100 {
  29.             vlan-id 100;
  30.             interface ge-1/1/5.100;
  31.             routing-interface irb.100;
  32.         }
  33.         VL-101 {
  34.             vlan-id 101;
  35.             interface ge-1/1/5.101;
  36.             routing-interface irb.101;
  37.         }
  38.     }
  39. }
  40. VPN-100 {
  41.     instance-type vrf;
  42.     interface irb.100;
  43.     interface irb.101;
  44.     route-distinguisher 100.100.100.1:100;
  45.     vrf-target target:1:100;
  46.     vrf-table-label;
  47. }

 

First things first: lines 1 – 15 take care of the IRB interfaces for VLAN 100 and VLAN 101; more on that shortly.

Lines 16 – 39 form the configuration for the EVPN routing instance. You’ll note a couple of differences from the first EVPN blog post:

  • The extended-vlan-list has been increased to include both VLANs within the routing instance
  • A new command “default-gateway do-not-advertise” is present under the EVPN protocol configuration
  • An additional bridge-domain has been configured for Vlan 101 under the routing-instance, along with the IRB interface for each vlan
  • What looks like a totally standard L3VPN has been configured, albeit with different RTs and RDs – but it does contain the IRB interfaces from the EVPN routing instance.

The command “default-gateway do-not-advertise” controls whether the PE attaches the default-gateway extended community to the routes for its IRB interfaces. If your PE routers have different IRB MAC addresses and IPv4 addresses, each PE generates a “default-gateway route” which tells other PEs in the EVPN that this route is a default gateway somewhere. However, in this example, and as best practice, it’s simpler and easier to configure the same IRB MAC/IP on all your PEs, so the command here is “do-not-advertise” as we don’t need it at this time.

But perhaps the coolest feature, and one of the biggest advantages EVPN has over VPLS, is the way the IRB interfaces are configured. In this topology the 3x PE routers (MX5-1, MX5-2 and MX5-3) all have an identical IRB interface configuration for VLAN 100 and VLAN 101; each PE has the exact same IP address and MAC address:

MX5-1:

    imtech@MX5-1# run show configuration interfaces irb
    unit 100 {
        family inet {
            address 192.168.100.1/24;
        }
        mac 00:00:19:21:68:10;
    }
    unit 101 {
        family inet {
            address 192.168.101.1/24;
        }
        mac 00:00:19:21:68:11;
    }

MX5-2

    imtech@MX5-2# run show configuration interfaces irb
    unit 100 {
        family inet {
            address 192.168.100.1/24;
        }
        mac 00:00:19:21:68:10;
    }
    unit 101 {
        family inet {
            address 192.168.101.1/24;
        }
        mac 00:00:19:21:68:11;
    }

MX5-3

    imtech@MX5-3# run show configuration interfaces irb
    unit 100 {
        family inet {
            address 192.168.100.1/24;
        }
        mac 00:00:19:21:68:10;
    }
    unit 101 {
        family inet {
            address 192.168.101.1/24;
        }
        mac 00:00:19:21:68:11;
    }

The first time you see it, you think:

15omtr

But it’s true! All the PEs in the network have the exact same IP address and MAC address on their IRB interfaces. Why would we do that, and how does it work?

Consider the following scenario:

Capture2

Imagine a basic data-centre environment running things like VMware or OpenStack, where we can provision servers and move them around all over the place using things like vMotion. Picture the active server in the left-hand portion of the data-centre, with business as usual from a network perspective: ARP is learnt between the host and the left-hand PE, and the default-gateway is 192.168.100.1.

Now, imagine that the DC admin flicks the switch, and that active VM on the left is immediately torn down and spun up inside the right-hand DC (which could be many miles away). You’ll notice that the interface MAC address and the default-gateway are the same. This gives us the ability to move hosts around our data centres without having to worry about different default-gateways, or incurring too much downtime whilst we wait for things to re-ARP. Because everything is identical at each DC site, there’s no problem moving things around between one site and the next.

Capture3

You cannot do this with VPLS as the implementation demands that you use unique MAC-addresses, which moves us on deeper into the technology – how does EVPN achieve this breakthrough?

It essentially boils down to the way that EVPN has been engineered to more closely integrate with the layer-3 world. The software has a number of hooks which go between EVPN and L3VPN in a much more elegant fashion than VPLS. The first blog post showed how MAC addresses were learnt and inserted into the BGP control-plane; in this example, for inter-VLAN forwarding, a few extra things are happening:

  • Firstly, we have the BGP MAC advertisement from the L2 world
  • Secondly, we get a new MAC/IP advertisement containing the MAC and IP address binding that the PE has learnt – this is linked to the PE’s ARP table
  • Thirdly, we get a totally standard /32 IPv4 L3VPN route for the host’s /32 address, this is advertised to all remote PEs
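
As a rough illustration of those three advertisements, here is a minimal Python sketch that builds the route strings a PE would originate for a single host. The function name `advertisements_for_host` is hypothetical, and the strings only mimic the way Junos displays these routes in the output further down; the RD, VLAN, MAC and IP values are taken from the lab.

```python
def advertisements_for_host(rd, vlan, mac, ip):
    """Return the three routes originated for one host (illustrative only)."""
    return {
        # 1. Layer-2 MAC advertisement (EVPN type-2, MAC only)
        "evpn_mac": f"2:{rd}::{vlan}::{mac}/304",
        # 2. MAC/IP advertisement (EVPN type-2 carrying the ARP binding)
        "evpn_mac_ip": f"2:{rd}::{vlan}::{mac}::{ip}/304",
        # 3. Plain IPv4 host route exported into the L3VPN
        "l3vpn_host": f"{ip}/32",
    }


routes = advertisements_for_host(
    "1.1.1.2:100", 101, "00:00:2e:e6:77:97", "192.168.101.11"
)
for name, prefix in routes.items():
    print(f"{name:12} {prefix}")
```

The point is simply that one host showing up behind a PE results in two EVPN routes and one ordinary L3VPN route.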

Let’s recap a more basic version of the lab diagram and see what the control-plane looks like when we send some traffic between hosts in different VLANs:

Capture4

Now let’s look at the BGP control-plane on MX5-1 and see what’s going on:

  1. imtech@MX5-1> show route protocol bgp table EVPN-100.evpn.0
  2. EVPN-100.evpn.0: 8 destinations, 8 routes (8 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 2:1.1.1.2:100::101::00:00:2e:e6:77:97/304
  5.                    *[BGP/170] 00:04:38, localpref 100, from 10.10.10.2
  6.                       AS path: I, validation-state: unverified
  7.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299968
  8. 2:1.1.1.2:100::101::00:00:2e:e6:77:97::192.168.101.11/304  
  9.                    *[BGP/170] 00:04:38, localpref 100, from 10.10.10.2
  10.                       AS path: I, validation-state: unverified
  11.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299968
  12. 3:1.1.1.2:100::100::10.10.10.2/304
  13.                    *[BGP/170] 00:04:38, localpref 100, from 10.10.10.2
  14.                       AS path: I, validation-state: unverified
  15.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299968
  16. 3:1.1.1.2:100::101::10.10.10.2/304
  17.                    *[BGP/170] 00:04:38, localpref 100, from 10.10.10.2
  18.                       AS path: I, validation-state: unverified
  19.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299968
  20. imtech@MX5-1> show route protocol bgp table VPN-100.inet.0
  21. VPN-100.inet.0: 6 destinations, 9 routes (6 active, 0 holddown, 0 hidden)
  22. + = Active Route, – = Last Active, * = Both
  23. 192.168.100.0/24    [BGP/170] 00:04:44, localpref 100, from 10.10.10.2
  24.                       AS path: I, validation-state: unverified
  25.                     > to 192.169.100.11 via ge-1/1/0.0, Push 16, Push 299968(top)
  26. 192.168.101.0/24    [BGP/170] 00:04:44, localpref 100, from 10.10.10.2
  27.                       AS path: I, validation-state: unverified
  28.                     > to 192.169.100.11 via ge-1/1/0.0, Push 16, Push 299968(top)
  29. 192.168.101.11/32   [BGP/170] 00:04:44, localpref 100, from 10.10.10.2
  30.                       AS path: I, validation-state: unverified
  31.                     > to 192.169.100.11 via ge-1/1/0.0, Push 16, Push 299968(top)

You’ll immediately notice that compared to the vanilla L2VPN implementation, there’s a lot more going on – let’s break it down:

  • Line 4 is the standard MAC advertisement route, the same sort of advertisement we went over with the vanilla L2-only version of EVPN – this is for layer-2 connectivity only.
  • Line 8 is an EVPN MAC/IP route, which is basically the ARP mapping learnt directly from MX5-2 – this route makes it possible for all PEs in the network to synchronise their ARP tables with each other!
  • Line 29 is a standard L3VPN route, containing the /32 host behind MX5-2

The MAC/IP route essentially means that as soon as you move a host from one place to another, the moment a packet lands on the ingress PE interface that PE generates a new MAC/IP ARP route, and all other PEs synchronise accordingly. Meanwhile the host that’s moved doesn’t need to do anything else, other than keep sending packets at the exact same gateway IP/MAC as it did before it was moved. Essentially we have layer-2 and layer-3 working together in harmony.
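
To make the host-move behaviour a bit more concrete, here is a minimal Python sketch. It is a toy model only: the `PE` class and `host_seen` function are hypothetical names, and the "advertisement" is just a loop that copies the binding to every PE, standing in for what BGP does in the real network.

```python
class PE:
    def __init__(self, name):
        self.name = name
        self.bindings = {}  # host IP -> (host MAC, PE the host sits behind)


def host_seen(all_pes, ingress_pe, ip, mac):
    """The ingress PE learns the host and advertises a MAC/IP route;
    every PE, including the ingress one, ends up with the same binding."""
    for pe in all_pes:
        pe.bindings[ip] = (mac, ingress_pe.name)


mx1, mx2, mx3 = PE("MX5-1"), PE("MX5-2"), PE("MX5-3")
pes = [mx1, mx2, mx3]

# The IRB gateway is identical on every PE, so the host never has to re-learn it
GATEWAY = ("192.168.101.1", "00:00:19:21:68:11")

host_seen(pes, mx2, "192.168.101.11", "00:00:2e:e6:77:97")  # host starts behind MX5-2
host_seen(pes, mx3, "192.168.101.11", "00:00:2e:e6:77:97")  # host is moved behind MX5-3

for pe in pes:
    print(pe.name, pe.bindings["192.168.101.11"], "gateway:", GATEWAY)
```

After the move every PE points at MX5-3 for the host, while the gateway IP/MAC the host talks to has not changed at all.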

The last route is a standard L3VPN /32 host route for the host behind MX5-2. This means that if you have EVPN running across numerous data-centres in various places, and this is connected to a wider layer-3 network (such as traditional residential/business PE routers), these other routers don’t need to have any awareness of EVPN whatsoever. So long as they can participate in regular L3VPN, packets will always be delivered to the right place when things get moved around, because these routes are dynamically generated and advertised accordingly. This is a massive advantage over VPLS, as you don’t need to configure it in every corner of the network for it to be useful; it simply lives on your DC edge and the rest is left to vanilla L3VPN.
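
The reason a plain L3VPN router still delivers traffic to the right PE is ordinary longest-prefix matching. A minimal Python sketch, assuming a hypothetical remote router that holds the /24 via one PE and the host /32 via another (the next hops for the /24 and for the "after the move" case are chosen purely for illustration):

```python
import ipaddress


def lookup(rib, destination):
    """Longest-prefix match: return the next hop of the most specific route."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, nh) for net, nh in rib.items() if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda item: item[0].prefixlen)[1]


# Hypothetical VRF table on a router that knows nothing about EVPN
rib = {
    ipaddress.ip_network("192.168.101.0/24"): "10.10.10.1",
    ipaddress.ip_network("192.168.101.11/32"): "10.10.10.2",
}

before_move = lookup(rib, "192.168.101.11")

# The host moves, so the /32 is now advertised by a different PE
rib[ipaddress.ip_network("192.168.101.11/32")] = "10.10.10.3"
after_move = lookup(rib, "192.168.101.11")

print(before_move, after_move)
```

The /32 always wins over the /24, so traffic follows the host wherever the host route is currently being advertised from.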

There are a few more enhancements due at some point soon, including quite an interesting one which is the “MAC mobility extended-community” which is essentially a safeguard to prevent a few rather nasty situations from arising:

  • A layer-2 loop, where two PEs constantly advertise the same MAC addresses – which could overwhelm the BGP control-plane
  • A situation where a pair of hosts each in a different DC are mis-configured with the same MAC address – if they’re both sending data then each PE will be generating route advertisements,

The MAC mobility extended community defined in RFC 7432 carries a sequence number which is incremented each time a MAC address is re-advertised from a new location. If the same MAC moves a certain number of times within a specific period, it’s assumed that something is broken and the routers should perform some sort of damping and alerting procedure to prevent network meltdown.
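
Here is a minimal Python sketch of that idea, not any vendor’s implementation. The class name `MacMobilityTracker` is hypothetical, and the threshold of 5 moves within 180 seconds is an assumption based on my reading of the defaults suggested in RFC 7432; treat both numbers as tunable.

```python
from collections import deque


class MacMobilityTracker:
    """Tracks moves of a single MAC address between PEs (illustrative only)."""

    def __init__(self, max_moves=5, window_seconds=180.0):
        self.max_moves = max_moves            # assumed threshold of moves
        self.window_seconds = window_seconds  # assumed detection window
        self.sequence = 0                     # mobility sequence number
        self.location = None                  # PE currently advertising the MAC
        self.move_times = deque()             # timestamps of recent moves

    def advertise(self, pe, now):
        """Record the MAC being advertised by `pe` at time `now` (seconds).

        Returns (sequence_number, looks_duplicated)."""
        if self.location is not None and pe != self.location:
            self.sequence += 1                # a move bumps the sequence number
            self.move_times.append(now)
        self.location = pe
        # Forget moves that have fallen out of the detection window
        while self.move_times and now - self.move_times[0] > self.window_seconds:
            self.move_times.popleft()
        return self.sequence, len(self.move_times) >= self.max_moves


tracker = MacMobilityTracker()
for second in range(6):
    pe = "MX5-1" if second % 2 == 0 else "MX5-2"
    print(second, pe, tracker.advertise(pe, float(second)))
```

A single legitimate move just bumps the sequence number, whereas a MAC flapping between two PEs every second trips the duplicate flag after the fifth move.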

I hope you found this useful! In the next one I’ll be looking at some of the redundant designs, including single-active and all-active multi-homing.

 

 

 

EVPN – the basics

So I decided to take a deep dive into eVPN, I’ll mostly be looking into VLAN-aware bundling, as per RFC 7432 – and mostly because I think this will fit more closely, with the types of deployments most of the customers are used to – good old IRB interfaces and bridge-tables!

As everyone knows, VPLS has been available for many years now and it’s pretty widely deployed, most of the customers I see have some flavour of VPLS configured on their networks and use it to good effect – so why eVPN? what’s the point in introducing a new technology if the current one appears to work fine.

The reality is that multipoint layer-2 VPNs (VPLS) were never quite as polished as layer-3 VPNs. When layer-3 VPNs were first invented they became, and in many cases still are, the “go to” technology for layer-3 connectivity across MPLS networks, and the technology itself hasn’t really changed that much for well over a decade. The same cannot be said for VPLS; over the years we’ve had many different iterations of the technology:

  • Vanilla VPLS
    • LDP signalled
    • BGP signalled
  • H-VPLS (hierarchical VPLS)
    • BGP based
    • LDP based
  • VPLS auto-discovery

Along with the different types of VPLS, the technology itself has been repeatedly modified with hacks and patches, in order to get around some annoyingly simple problems, for example:

  1. VPLS auto-discovery is only supported under BGP signalling – you can’t do it if you’re using LDP signalled VPLS,
  2. H-VPLS – in order to get around the fully meshed pseudowire problem of vanilla VPLS, H-VPLS introduced a hierarchy to cut down on the number of pseudowires in large networks; unfortunately the design often ends up being cumbersome and complicated.
  3. mac-address learning – VPLS has no layer-2 control plane, it learns mac-addresses directly from the data-plane like a standard switch – which is fine if it’s taking place inside a single device, but across a large distributed network with many thousands of mac-addresses, a loss of any attachment circuit can result in stale forwarding state and slow convergence/recovery
  4. all-active CE-Multihoming – simply can’t do it in VPLS, single-homed only, which is a major pain for large-scale modern data centres with lots and lots of layer-2 connectivity
  5. Layer-3 integration – With VPLS it’s typical to use a BVI or IRB interface as the layer-3 gateway to a VLAN, however there’s no real integration between the layer-2 and layer-3 world, you still need VRRP for first hop redundancy – which comes with all the pain you’d expect (traffic black holding, complex tracking requirements, interface timers, etc)

The topology I’m going to use for this is shown below:

Capture

A few basic points about the network:

  • The 3x “P” routers in the core of the network are Juniper M10i series, running nothing other than ISIS/LDP/MPLS
  • The 3x “PE” routers, are Juniper MX5 – each with 14.1.R6.4 loaded on, connectivity is via a 20x1G MIC
  • The 3x “EX4200” switches are doing nothing other than trunking VLAN 100 towards each MX-5
  • Each IXIA port has a single host on VLAN 100

The first lab will look at eVPN with basic MPLS transport – this is essentially a replacement for vanilla VPLS, we have three sites each with a single switch – all in Vlan 100 on a common /24 subnet, nothing fancy going on, no layer-3 routing or bridging anywhere, this is all strictly layer-2 for now.

The first thing to note about eVPN is that the core of it is built around a BGP control-plane, no LDP or anything else. It’s BGP only, which is great because we all love BGP. The first step is to enable the evpn address family (AFI 25 for L2VPN and the new SAFI of 70 for EVPN).

(Output taken from MX5-1, but identical on all 3 PEs, <except for IP addressing obviously>)

    bgp {
        group iBGP-PEs {
            type internal;
            local-address 10.10.10.1;
            family evpn {
                signaling;
            }
            neighbor 10.10.10.2;
            neighbor 10.10.10.3;
        }
    }

 

This essentially enables the evpn signalling which is essential, unlike VPLS there’s no manual provisioning of pseudowires, because there are no pseudowires, just like L3 VPNs everything is handled via BGP and uses the same route-distinguishers and route-targets that we’ve all come to love.
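
For the curious, the address family enabled above is negotiated between the PEs as a multiprotocol capability in the BGP OPEN message. The Python sketch below packs that capability for AFI 25 / SAFI 70, assuming the usual layout of capability code 1, length 4, then a 2-byte AFI, a reserved byte and a 1-byte SAFI. The function name `mp_capability` is hypothetical; this is just to show what the numbers look like on the wire, not how Junos builds it.

```python
import struct

AFI_L2VPN = 25
SAFI_EVPN = 70


def mp_capability(afi, safi):
    """Pack a multiprotocol-extensions capability:
    code (1 byte) = 1, length (1 byte) = 4, AFI (2 bytes), reserved (1 byte), SAFI (1 byte)."""
    return struct.pack("!BBHBB", 1, 4, afi, 0, safi)


capability = mp_capability(AFI_L2VPN, SAFI_EVPN)
print(capability.hex())
```

Both ends need to offer this AFI/SAFI pair before any EVPN routes are exchanged over the session.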

The configuration for this lab is pretty much identical across all three PEs but we’ll look at MX5-1 for this example, first the LAN facing interface:

    ge-1/1/5 {
        flexible-vlan-tagging;
        encapsulation flexible-ethernet-services;
        unit 100 {
            encapsulation vlan-bridge;
            vlan-id 100;
        }
    }

 

Followed by the evpn routing-instance:

  1. routing-instances {
  2.     EVPN-100 {
  3.         instance-type virtual-switch;
  4.         route-distinguisher 1.1.1.1:100;
  5.         vrf-target target:100:100;
  6.         protocols {
  7.             evpn {
  8.                 extended-vlan-list 100;
  9.             }
  10.         }
  11.         bridge-domains {
  12.             VL-100 {
  13.                 vlan-id 100;
  14.                 interface ge-1/1/5.100;
  15.             }
  16.         }
  17.     }
  18. }

 

A few things to note about the routing-instance:

  • Lines 4 and 5 mark the “RD” and “RT”, which are essentially the same as in a standard L3VPN setup
  • The routing-instance is of type “virtual-switch” and the bridge-domain sits inside it,
  • This is essentially configured the same as a VPLS virtual-switch, except with a different protocol.

Before we send any traffic or try to get any connectivity, let’s take a look at the basic control-plane and exactly what sort of things BGP is getting up to, whilst things are simple.

  1. greg@MX5-1# run show bgp summary
  2. Groups: 1 Peers: 2 Down peers: 0
  3. Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
  4. bgp.evpn.0
  5.                        2          2          0          0          0          0
  6. Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Accepted/Damped…
  7. 10.10.10.2              100        231        231       0       1     1:40:54 Establ
  8.   bgp.evpn.0: 1/1/1/0
  9.   EVPN-100.evpn.0: 1/1/1/0
  10.   __default_evpn__.evpn.0: 0/0/0/0
  11. 10.10.10.3              100        229        231       0       1     1:40:40 Establ
  12.   bgp.evpn.0: 1/1/1/0
  13.   EVPN-100.evpn.0: 1/1/1/0
  14.   __default_evpn__.evpn.0: 0/0/0/0
  15. [edit]
  16. greg@MX5-1#

 

You’ll notice that before we’ve sent any traffic or done anything, that we have two types of table under each established BGP peer:

  • “bgp.evpn.0” for the core-facing BGP adjacency, (the same as regular L3VPN)
  • “EVPN-100.evpn.0” for the routing-instance table, (again the same as regular L3VPN)
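
The relationship between the two tables is the same route-target import logic as L3VPN. A minimal Python sketch, assuming a hypothetical `import_into_instance` helper; the first two prefixes are written in the style Junos uses for EVPN routes, and the third route (with target:200:200) is made up purely to show something being filtered out:

```python
def import_into_instance(bgp_evpn_table, import_target):
    """Copy routes from the core-facing table into the instance table
    when they carry the instance's route-target."""
    return [r for r in bgp_evpn_table if import_target in r["communities"]]


bgp_evpn_0 = [
    {"prefix": "3:1.1.1.2:100::100::10.10.10.2/304", "communities": {"target:100:100"}},
    {"prefix": "3:1.1.1.3:100::100::10.10.10.3/304", "communities": {"target:100:100"}},
    # Hypothetical route belonging to some other EVPN instance
    {"prefix": "3:9.9.9.9:200::200::10.10.10.9/304", "communities": {"target:200:200"}},
]

evpn_100 = import_into_instance(bgp_evpn_0, "target:100:100")
for route in evpn_100:
    print(route["prefix"])
```

Only the routes tagged with the instance’s vrf-target end up in the routing-instance table.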

You’ll also notice that we’re receiving 1 route from each PE, for each table, if we investigate further and take a look:

  1. greg@MX5-1# run show route table bgp.evpn.0
  2. bgp.evpn.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 3:1.1.1.2:100::100::10.10.10.2/304
  5.                    *[BGP/170] 00:10:42, localpref 100, from 10.10.10.2
  6.                       AS path: I, validation-state: unverified
  7.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  8. 3:1.1.1.3:100::100::10.10.10.3/304  
  9.                    *[BGP/170] 00:10:40, localpref 100, from 10.10.10.3
  10.                       AS path: I, validation-state: unverified
  11.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936
  12. [edit]
  13. greg@MX5-1# run show route table EVPN-100.evpn.0
  14. EVPN-100.evpn.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
  15. + = Active Route, – = Last Active, * = Both
  16. 3:1.1.1.1:100::100::10.10.10.1/304
  17.                    *[EVPN/170] 00:10:54
  18.                       Indirect
  19. 3:1.1.1.2:100::100::10.10.10.2/304  
  20.                    *[BGP/170] 00:10:49, localpref 100, from 10.10.10.2
  21.                       AS path: I, validation-state: unverified
  22.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  23. 3:1.1.1.3:100::100::10.10.10.3/304  
  24.                    *[BGP/170] 00:10:47, localpref 100, from 10.10.10.3
  25.                       AS path: I, validation-state: unverified
  26.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936

 

Because everyone reading this has eyes like hawks 😉  you’ll immediately notice the strange looking /304 routes coming from each adjacent PE, let’s examine the first one:

3:1.1.1.2:100::100::10.10.10.2/304  

The format is essentially: 3 : <RD> :: <VLAN-ID> :: <ROUTER-ID> /304

It also contains the “ROUTER-ID-LENGTH” which is obviously /32 however Juniper hides this from the output. It should be obvious to most people what all these values are, except for the “3” what does that mean?

It’s important to note that evpn defines a set of route types, as shown below:

  • Type 1 – Ethernet auto-discovery route
  • Type 2 – MAC/IP advertisement route
  • Type 3 – Inclusive multicast Ethernet tag route
  • Type 4 – Ethernet segment (ES) route
  • Type 5 – IP prefix route
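
If you ever need to pick these routes apart in a script, the displayed format is easy enough to split. The following Python sketch is a rough parser for the type-2 and type-3 strings exactly as they appear in the Junos output in this post; the function name `parse_evpn_prefix` is hypothetical, and it deliberately refuses other route types because their display format isn’t shown here.

```python
ROUTE_TYPES = {
    1: "Ethernet auto-discovery",
    2: "MAC/IP advertisement",
    3: "Inclusive multicast Ethernet tag",
    4: "Ethernet segment",
    5: "IP prefix",
}


def parse_evpn_prefix(text):
    """Split '<type>:<RD>::<VLAN>::<rest>/<len>' into its fields."""
    body, _, length = text.strip().rpartition("/")
    fields = body.split("::")
    route_type, rd = fields[0].split(":", 1)
    route_type = int(route_type)
    if route_type not in (2, 3):
        raise ValueError(f"display format for type {route_type} not handled here")
    parsed = {
        "type": route_type,
        "type_name": ROUTE_TYPES[route_type],
        "rd": rd,
        "vlan": int(fields[1]),
        "length": int(length),
    }
    if route_type == 3:
        parsed["originator"] = fields[2]      # router-id of the advertising PE
    else:
        parsed["mac"] = fields[2]
        parsed["ip"] = fields[3] if len(fields) > 3 else None
    return parsed


print(parse_evpn_prefix("3:1.1.1.2:100::100::10.10.10.2/304"))
print(parse_evpn_prefix("2:1.1.1.3:100::100::00:00:0e:52:42:29/304"))
```

Splitting on the double colon works because the MAC address and RD only ever contain single colons.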

Type 3 routes are for signalling the inclusive tunnel. With VLAN-aware evpn each PE generates a VLAN-specific inclusive tunnel which is used for BUM (broadcast, unknown unicast and multicast) traffic. Basically, it’s used to send BUM traffic to all PEs that have sites in the same VLAN. Let’s look at it in even more detail:

 

  1. greg@MX5-1# run show route table bgp.evpn.0 extensive
  2. bgp.evpn.0: 2 destinations, 2 routes (2 active, 0 holddown, 0 hidden)
  3. 3:1.1.1.2:100::100::10.10.10.2/304 (1 entry, 0 announced)
  4.         *BGP    Preference: 170/-101
  5.                 Route Distinguisher: 1.1.1.2:100
  6. PMSI: Flags 0x0: Label 300512: Type INGRESS-REPLICATION 10.10.10.2
  7.                 Next hop type: Indirect
  8.                 Address: 0x2fa4c34
  9.                 Next-hop reference count: 2
  10.                 Source: 10.10.10.2
  11.                 Protocol next hop: 10.10.10.2
  12.                 Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  13.                 State: <Active Int Ext>
  14.                 Local AS:   100 Peer AS:   100
  15.                 Age: 30:23  Metric2: 1
  16.                 Validation State: unverified
  17.                 Task: BGP_100.10.10.10.2+56692
  18.                 AS path: I
  19.                 Communities: target:100:100
  20.                 Import Accepted
  21.                 Localpref: 100
  22.                 Router ID: 10.10.10.2
  23.                 Secondary Tables: EVPN-100.evpn.0
  24.                 Indirect next hops: 1
  25.                         Protocol next hop: 10.10.10.2 Metric: 1
  26.                         Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  27.                         Indirect path forwarding next hops: 1
  28.                                 Next hop type: Router
  29.                                 Next hop: 192.169.100.11 via ge-1/1/0.0
  30.                                 Session Id: 0x0
  31.             10.10.10.2/32 Originating RIB: inet.3
  32.               Metric: 1           Node path count: 1
  33.               Forwarding nexthops: 1
  34.                 Nexthop: 192.169.100.11 via ge-1/1/0.0

 

Line 6 shows the PMSI (provider multicast service interface) attribute carried by the route, with a tunnel type of “ingress-replication”. One important thing to note: label 300512 is a downstream allocated label, the same as what’s commonly used in P2MP LSPs for multicast services, and it was allocated by the advertising PE (10.10.10.2). Essentially, in this case MX5-1 uses the remotely learnt service label to send BUM traffic to that remote PE, or, looking at it the other way round, the advertising PE expects to receive BUM traffic from the other PEs tagged with IR label 300512.
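
Ingress replication is simple enough to sketch in a few lines of Python. This is a toy model under some loud assumptions: the function name `ingress_replicate` is hypothetical, only label 300512 comes from the output above, and the labels for the other two PEs are invented for illustration.

```python
def ingress_replicate(frame, inclusive_routes, local_pe):
    """Return one labelled copy of a BUM frame per remote PE."""
    copies = []
    for pe, label in inclusive_routes.items():
        if pe == local_pe:
            continue  # never replicate back to ourselves
        copies.append({"to": pe, "label": label, "frame": frame})
    return copies


# PE loopback -> label that PE advertised in its type-3 route
inclusive_routes = {
    "10.10.10.1": 300400,  # hypothetical label, local PE
    "10.10.10.2": 300512,  # taken from the output above
    "10.10.10.3": 300528,  # hypothetical label
}

copies = ingress_replicate("ARP who-has 192.168.100.2", inclusive_routes, "10.10.10.1")
for copy in copies:
    print(copy)
```

Each remote PE gets its own unicast copy carrying the label that PE handed out, which is why the ingress PE does all the replication work.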

Moving on – for people new to evpn, one of the coolest concepts is the way in which BGP is used to advertise mac-addresses… rather than plain old IP subnets – this is fantastic because we now have an intelligent control-plane maintained across the whole network in a scalable and stable fashion, rather than having to rely on less reliable data-plane learning.

For the first basic test, we’ll send bi-directional traffic between host connected to EX4200-1 on MX5-1 and the host connected to EX4200-2 on MX5-2

Let’s recap the diagram and spin up some hosts:

Capture2

We’ll start with a single host at each site, and send traffic both ways, 1Mbps each way for a total of 2Mbps, (the hosts are in the same /24 VLAN100 – 192.168.100.1 and 192.168.100.2) 

Capture3

Traffic is being forwarded end to end, let’s check the routing and see how the control-plane has changed:

 

  1. greg@MX5-1# run show route table bgp.evpn.0
  2. bgp.evpn.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
  3. + = Active Route, – = Last Active, * = Both
  4. 2:1.1.1.3:100::100::00:00:0e:52:42:29/304  
  5.                    *[BGP/170] 00:04:04, localpref 100, from 10.10.10.3
  6.                       AS path: I, validation-state: unverified
  7.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936
  8. 3:1.1.1.2:100::100::10.10.10.2/304
  9.                    *[BGP/170] 00:53:37, localpref 100, from 10.10.10.2
  10.                       AS path: I, validation-state: unverified
  11.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  12. 3:1.1.1.3:100::100::10.10.10.3/304
  13.                    *[BGP/170] 00:53:35, localpref 100, from 10.10.10.3
  14.                       AS path: I, validation-state: unverified
  15.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936
  16. [edit]
  17. greg@MX5-1# run show route table EVPN-100.evpn.0
  18. EVPN-100.evpn.0: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
  19. + = Active Route, – = Last Active, * = Both
  20. 2:1.1.1.1:100::100::00:00:0e:52:23:91/304      
  21.                    *[EVPN/170] 00:04:13
  22.                       Indirect
  23. 2:1.1.1.3:100::100::00:00:0e:52:42:29/304    
  24.                    *[BGP/170] 00:04:13, localpref 100, from 10.10.10.3
  25.                       AS path: I, validation-state: unverified
  26.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936
  27. 3:1.1.1.1:100::100::10.10.10.1/304
  28.                    *[EVPN/170] 00:53:51
  29.                       Indirect
  30. 3:1.1.1.2:100::100::10.10.10.2/304
  31.                    *[BGP/170] 00:53:46, localpref 100, from 10.10.10.2
  32.                       AS path: I, validation-state: unverified
  33.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  34. 3:1.1.1.3:100::100::10.10.10.3/304
  35.                    *[BGP/170] 00:53:44, localpref 100, from 10.10.10.3
  36.                       AS path: I, validation-state: unverified
  37.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299936
  38. [edit]
  39. greg@MX5-1#

 

The type-3 routes are still present as before for the inclusive tunnels, but you’ll notice the addition of the new type-2 MAC/IP route, this is essentially a BGP NLRI containing a mac-address instead of an IP subnet – pretty cool huh?

The indirect route is the one learnt locally from the connected LAN, the one known via BGP/170 is the one from the remote PE, packets destined for that mac-address have label 299936 pushed on them, and are forwarded directly out of the MPLS facing core interface, like any regular MPLS packet.

Let’s take a more detailed look at a type-2 route:

  1. 2:1.1.1.3:100::100::00:00:0e:52:42:29/304 (1 entry, 1 announced)
  2.         *BGP    Preference: 170/-101
  3.                 Route Distinguisher: 1.1.1.3:100
  4.                 Next hop type: Indirect
  5.                 Address: 0x2705954
  6.                 Next-hop reference count: 4
  7.                 Source: 10.10.10.3
  8.                 Protocol next hop: 10.10.10.3
  9.                 Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  10.                 State: <Secondary Active Int Ext>
  11.                 Local AS:   100 Peer AS:   100
  12.                 Age: 14:20  Metric2: 1
  13.                 Validation State: unverified
  14.                 Task: BGP_100.10.10.10.3+64545
  15.                 Announcement bits (1): 0-EVPN-100-evpn
  16.                 AS path: I
  17.                 Communities: target:100:100
  18.                 Import Accepted
  19.                 Route Label: 300048
  20.                 ESI: 00:00:00:00:00:00:00:00:00:00
  21.                 Localpref: 100
  22.                 Router ID: 10.10.10.3
  23.                 Primary Routing Table bgp.evpn.0
  24.                 Indirect next hops: 1
  25.                         Protocol next hop: 10.10.10.3 Metric: 1
  26.                         Indirect next hop: 0x2 no-forward INH Session ID: 0x0
  27.                         Indirect path forwarding next hops: 1
  28.                                 Next hop type: Router
  29.                                 Next hop: 192.169.100.11 via ge-1/1/0.0
  30.                                 Session Id: 0x0
  31.             10.10.10.3/32 Originating RIB: inet.3
  32.               Metric: 1           Node path count: 1
  33.               Forwarding nexthops: 1
  34.                 Nexthop: 192.169.100.11 via ge-1/1/0.0

 

A basic recap on MPLS forwarding: for the above route, the originating PE (10.10.10.3, i.e. MX5-3) is notifying all other PEs in the network that if they receive a frame on an interface inside “EVPN-100” on VLAN 100 for destination MAC address 00:00:0e:52:42:29, they should impose MPLS label 300048 and send it that PE’s way.
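
Put another way, the ingress PE’s forwarding decision looks roughly like the Python sketch below. The table layout and the `forward` function are hypothetical; the labels and interfaces come from the lab output, and the two-label stack with the transport label on top is my assumption based on how regular MPLS VPN forwarding works.

```python
# MAC table on MX5-1 for EVPN-100 / VLAN 100 (illustrative layout)
MAC_TABLE = {
    # Locally learnt host (interface assumed from the lab config)
    "00:00:0e:52:23:91": {"local_interface": "ge-1/1/5.100"},
    # Host learnt via BGP from 10.10.10.3
    "00:00:0e:52:42:29": {
        "remote_pe": "10.10.10.3",
        "service_label": 300048,    # "Route Label" from the type-2 route
        "transport_label": 299936,  # label pushed towards 10.10.10.3
        "out_interface": "ge-1/1/0.0",
    },
}


def forward(dst_mac):
    """Return (action, label_stack, interface) for a destination MAC."""
    entry = MAC_TABLE.get(dst_mac)
    if entry is None:
        return ("flood", [], None)  # unknown unicast is treated as BUM traffic
    if "local_interface" in entry:
        return ("switch", [], entry["local_interface"])
    # Assumed stack order: transport label on top, service label underneath
    stack = [entry["transport_label"], entry["service_label"]]
    return ("push", stack, entry["out_interface"])


print(forward("00:00:0e:52:42:29"))
print(forward("00:00:0e:52:23:91"))
print(forward("00:00:de:ad:be:ef"))
```

Known remote MACs are label-switched straight to the right PE, local MACs are bridged as normal, and anything unknown falls back to the inclusive tunnel.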

Another new aspect of evpn can be seen under the “ESI” field. “ESI” stands for “Ethernet segment identifier”; essentially it’s a way of labelling individual Ethernet segments, but it’s only used for multihomed designs, and in any other design it should remain at the default of 0x0 (more on ESIs in the next blog)

To demonstrate the control-plane learning and MAC/IP advertisement mechanism more effectively, lets spin up all 3 sites with 50 hosts per site – then send a full mesh of traffic (150 streams in total) and see what the control-plane looks like,

Quick recap of the diagram showing all 3 sites, with 50 hosts per site:

Capture4

Plenty of juicy MAC/IP routes!

 

  1. greg@MX5-1# run show route summary
  2. Autonomous system number: 100
  3. Router ID: 10.10.10.1
  4. inet.0: 14 destinations, 14 routes (14 active, 0 holddown, 0 hidden)
  5.               Direct:      3 routes,      3 active
  6.                Local:      2 routes,      2 active
  7.               Static:      1 routes,      1 active
  8.                IS-IS:      7 routes,      7 active
  9.                  LDP:      1 routes,      1 active
  10. inet.3: 5 destinations, 5 routes (5 active, 0 holddown, 0 hidden)
  11.                  LDP:      5 routes,      5 active
  12. iso.0: 1 destinations, 1 routes (1 active, 0 holddown, 0 hidden)
  13.               Direct:      1 routes,      1 active
  14. mpls.0: 18 destinations, 18 routes (18 active, 0 holddown, 0 hidden)
  15.                 MPLS:      6 routes,      6 active
  16.                  LDP:      6 routes,      6 active
  17.                 EVPN:      6 routes,      6 active
  18. bgp.evpn.0: 102 destinations, 102 routes (102 active, 0 holddown, 0 hidden)
  19.                  BGP:    102 routes,    102 active
  20.  
  21. EVPN-100.evpn.0: 153 destinations, 153 routes (153 active, 0 holddown, 0 hidden)
  22.                  BGP:    102 routes,    102 active
  23.                 EVPN:     51 routes,     51 active
  24. [edit]
  25. greg@MX5-1#

 

Lots of MAC/IP routes 🙂
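
The numbers in the summary add up exactly as you’d expect. A quick Python sketch of the arithmetic, assuming one type-2 route per host and one type-3 route per PE for the single VLAN (the function name `expected_route_counts` is just for illustration):

```python
def expected_route_counts(sites, hosts_per_site):
    """Work out the route counts seen on one PE for a single-VLAN EVPN."""
    remote_sites = sites - 1
    # Learnt via BGP: every remote host (type-2) plus one type-3 per remote PE
    via_bgp = remote_sites * hosts_per_site + remote_sites
    # Originated locally: every local host plus our own type-3
    local = hosts_per_site + 1
    return {
        "bgp.evpn.0": via_bgp,
        "local (EVPN)": local,
        "EVPN-100.evpn.0": via_bgp + local,
    }


counts = expected_route_counts(sites=3, hosts_per_site=50)
for table, total in counts.items():
    print(f"{table:16} {total}")
```

That gives 102 routes learnt via BGP, 51 local routes, and 153 in the routing-instance table, matching the output above.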

A quick look at the BGP table:

 

  1. bgp.evpn.0: 102 destinations, 102 routes (102 active, 0 holddown, 0 hidden)
  2. + = Active Route, – = Last Active, * = Both
  3. 2:1.1.1.2:100::100::00:00:0f:45:a2:8a/304
  4.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  5.                       AS path: I, validation-state: unverified
  6.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  7. 2:1.1.1.2:100::100::00:00:0f:45:a2:8c/304
  8.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  9.                       AS path: I, validation-state: unverified
  10.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  11. 2:1.1.1.2:100::100::00:00:0f:45:a2:8e/304
  12.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  13.                       AS path: I, validation-state: unverified
  14.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  15. 2:1.1.1.2:100::100::00:00:0f:45:a2:90/304
  16.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  17.                       AS path: I, validation-state: unverified
  18.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  19. 2:1.1.1.2:100::100::00:00:0f:45:a2:92/304
  20.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  21.                       AS path: I, validation-state: unverified
  22.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  23. 2:1.1.1.2:100::100::00:00:0f:45:a2:94/304
  24.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  25.                       AS path: I, validation-state: unverified
  26.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  27. 2:1.1.1.2:100::100::00:00:0f:45:a2:96/304
  28.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  29.                       AS path: I, validation-state: unverified
  30.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  31. 2:1.1.1.2:100::100::00:00:0f:45:a2:98/304
  32.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  33.                       AS path: I, validation-state: unverified
  34.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  35. 2:1.1.1.2:100::100::00:00:0f:45:a2:9a/304
  36.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  37.                       AS path: I, validation-state: unverified
  38.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  39. 2:1.1.1.2:100::100::00:00:0f:45:a2:9c/304
  40.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  41.                       AS path: I, validation-state: unverified
  42.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904
  43. 2:1.1.1.2:100::100::00:00:0f:45:a2:9e/304
  44.                    *[BGP/170] 00:07:38, localpref 100, from 10.10.10.2
  45.                       AS path: I, validation-state: unverified
  46.                     > to 192.169.100.11 via ge-1/1/0.0, Push 299904

 

So yeah – it basically goes on and on,

Incidentally, what we gain by using more of the network’s resources, we lose in scalability, because you cannot get something for nothing. We all know that TCAM, forwarding-tables and BGP tables are limiting factors on even the largest routers. With evpn a very large amount of information is loaded into BGP (every single mac-address on the network), and because mac-addresses are totally non-contiguous (different blocks for different vendor NICs) they can’t be aggregated or summarised in any way.
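
A quick way to see the difference is to compare what happens to a block of IP host routes versus a handful of MAC addresses. The Python sketch below uses the standard library’s `ipaddress.collapse_addresses` for the IP side; the MAC addresses are simply ones that appear in the lab output, and the "one route per MAC" count is the assumption being illustrated.

```python
import ipaddress

# 256 contiguous /32 host routes summarise down to a single /24
host_routes = [ipaddress.ip_network(f"192.168.100.{i}/32") for i in range(256)]
summary = list(ipaddress.collapse_addresses(host_routes))
print(len(host_routes), "host routes ->", summary)

# MAC addresses have no such hierarchy: one type-2 route per MAC, nothing to collapse
macs = ["00:00:0f:45:a2:8a", "00:00:0e:52:42:29", "00:00:2e:e6:77:97"]
mac_routes = len(set(macs))
print(len(macs), "MACs ->", mac_routes, "routes")
```

IP addressing lets you hide hundreds of hosts behind one prefix, whereas every MAC in an evpn needs its own entry in BGP.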

If you had a data centre with 500k servers, you’d have 500k MAC/IP advertisements, which is a pretty large burden on the control-plane. In my own time I did some comparisons with tens of thousands of hosts on MX480 routers with RE1800x4s and high-end MPCs, and the results were not pretty: on a very large network (more than 100k hosts) the control-plane learning was very laggy, and the REs tended to suffer from very high CPU during the learning process, or if a failover occurred.

The evolution onwards from this is PBB-EVPN (provider backbone bridging EVPN) which essentially allows large numbers of hosts to be represented by a single mac-address, which enables absolutely enormous scalability (millions of hosts per site), at the expense of some feature loss – PBB-EVPNs will be the topic for another blog, where I can hopefully use IXIA to show hundreds of thousands of hosts connected!

Hope you found this useful, (if anyone even read it! 😀 )