CCIE in 3 months - Is it possible?: How Multi is MP-BGP in IOS-XR

Sunday, October 27, 2013

How Multi is MP-BGP in IOS-XR - Part #2

When I wrote the first part of "How Multi is MP-BGP in IOS-XR" two years ago, I concluded with the following:

In IOS-XR you need an IPv6 NH in order to activate the IPv6 AF for an IPv4 BGP session.
If you don't have an IPv6 NH, then the IPv4 BGP session won't even come up.
The above was done to protect against misconfiguration, because otherwise you would get a misleading v4 mapped v6 address as NH.
If you have an IPv6 NH, then the IPv4 BGP session with the IPv6 AF will come up.
If afterwards you remove the IPv6 NH, then the session deliberately remains up and you get a misleading v4 mapped v6 address as NH.

Although I didn't agree with the above behavior back then, I couldn't do anything more than accept the "solution" given: print a warning message when such a case is met.

Recently I realized that the same IOS-XR developers decided to bring even more confusion into the engineer's everyday job.

Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes
New Fact: You must display an IPv4 address as next-hop for IPv6 prefixes

I probably need to explain this a little more, because this time I'm talking about 6VPE. 6VPE is a technology/architecture that gives you IPv6 connectivity over an IPv4 core network, typically (but not only) an IPv4 MPLS one.

The original 6VPE scenario where I ran into this behavior is quite complex, but for simplicity let's assume that it includes only the following:

2 x CE routers (CE1, CE2)
2 x 6VPE routers (PE1, PE2)
2 x P/RR routers (P1, P2)

This is a very simple VPN topology (with only two sites) like the following:

CE1 <=> PE1 <=> P1 <=> P2 <=> PE2 <=> CE2

All routers exchange IPv4/IPv6 prefixes using BGP, in order to provide IPv4/IPv6 connectivity between the two CEs. IPJ describes the relevant route advertisement with a very nice picture.

If we could follow the exchange of information between some of these routers, then we would notice something like the following.

CE1 sends its IPv6 prefixes to PE1 through MP-BGP, and PE1 sends them on to P1 as VPNv6 prefixes. Since (as of now) there is no support for LDP over IPv6, the LSPs between the PEs are IPv4-based, so PE1 advertises the IPv6 prefixes with a v4-mapped IPv6 address as next-hop (it's carried in the MP_REACH_NLRI attribute, as shown in the debugs).
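In case the term "v4-mapped IPv6 address" is not familiar, here is a minimal Python sketch of the mapping, using only the standard `ipaddress` module. The helper name `to_v4_mapped` is mine and purely illustrative; it has nothing to do with how IOS or IOS-XR implement this internally.

```python
import ipaddress

def to_v4_mapped(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address (::ffff:a.b.c.d) for an IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    # Layout: 80 zero bits, 16 one bits, then the 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))

# PE1's address, as used in the outputs of this post.
pe1_next_hop = to_v4_mapped("10.10.253.164")

# The mapping is reversible: the embedded IPv4 address is PE1 itself.
assert pe1_next_hop.ipv4_mapped == ipaddress.IPv4Address("10.10.253.164")
assert pe1_next_hop == ipaddress.IPv6Address("::ffff:10.10.253.164")
```

So the IPv6 next-hop is nothing more than the PE's IPv4 address wrapped in a fixed 96-bit prefix.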

This is how the IPv6 prefix is sent from PE1 (10.10.253.164) to P1 (10.10.253.161).

PE1#sh bgp vpnv6 unicast all neighbors 10.10.253.161 advertised-routes
BGP table version is 7, local router ID is 10.10.253.164
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter, a additional-path
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:109 (default for vrf TEST-VRF)

*> 2001:DB8:2:60::22/127
                    2001:DB8:2:50::23        0         32768 ?

This is the relevant debug info that shows how exactly the IPv6 prefix (2001:DB8:2:60::22/127 with next-hop of ::FFFF:10.10.253.164) is sent from PE1 to P1.

BGP(5): (base) 10.10.253.161 send UPDATE (format) [100:109]2001:DB8:2:60::22/127, next ::FFFF:10.10.253.164, label 2689, metric 0, path Local, extended community RT:100:109
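If you want to pick such a debug line apart in a script, something like the following could do it. This is only a sketch: the regular expression is built from the single debug line shown above, so treat the assumed format (and the hypothetical `parse_update` helper) as something to verify against your own IOS release.

```python
import re

# The debug line from PE1, copied from above.
DEBUG_LINE = (
    "BGP(5): (base) 10.10.253.161 send UPDATE (format) "
    "[100:109]2001:DB8:2:60::22/127, next ::FFFF:10.10.253.164, "
    "label 2689, metric 0, path Local, extended community RT:100:109"
)

# Assumed layout: "[RD]prefix, next <next-hop>, label <label>, ..."
UPDATE_RE = re.compile(
    r"\[(?P<rd>[^\]]+)\](?P<prefix>[^,\s]+), "
    r"next (?P<next_hop>[^,\s]+), label (?P<label>\d+)"
)

def parse_update(line: str):
    """Return RD, prefix, next-hop and label as a dict, or None if no match."""
    match = UPDATE_RE.search(line)
    return match.groupdict() if match else None

fields = parse_update(DEBUG_LINE)
print(fields["prefix"], "via", fields["next_hop"], "label", fields["label"])
```

The interesting field is `next_hop`: on the sending IOS side it is the v4-mapped IPv6 address, not the plain IPv4 one.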

And this is how the IPv6 prefix is shown on P1 (10.10.253.161) when received from PE1 (10.10.253.164):

P1#sh bgp vpnv6 unicast neighbors 10.10.253.164 routes
BGP router identifier 10.10.253.161, local AS number 100
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 3888712472
BGP main routing table version 3
BGP NSR Initial initsync version 3 (Reached)
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:109
*>i2001:DB8:2:60::22/127
                      10.10.253.164           0    100      0 ?

Now, if the old fact applied to VPNv6 prefixes the same way it applies to plain IPv6 prefixes, then the BGP session with the VPNv6 AF shouldn't even come up. Instead, it comes up and works fine. But, to confuse me even more, the next-hop of an IPv6 prefix is displayed as an IPv4 address (!!!).

Reminder...
Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes

Quoting from RFC 4659 (BGP-MPLS IP Virtual Private Network (VPN) Extension for IPv6 VPN):

When the IPv6 VPN traffic is to be transported to the BGP speaker using IPv4 tunneling (e.g., IPv4 MPLS LSPs, IPsec-protected IPv4 tunnels), the BGP speaker SHALL advertise to its peer a Next Hop Network Address field containing a VPN-IPv6 address:

- whose 8-octet RD is set to zero, and

- whose 16-octet IPv6 address is encoded as an IPv4-mapped IPv6 address [V6ADDR] containing the IPv4 address of the advertising BGP speaker. This IPv4 address must be routable by the other BGP Speaker.
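To make the quoted encoding more tangible, here is a rough Python sketch that builds such a Next Hop Network Address: 8 octets of zero RD followed by the 16-octet v4-mapped IPv6 address, 24 octets in total. The function name `vpnv6_next_hop_field` is my own, and the sketch covers only this one field, not a full MP_REACH_NLRI attribute.

```python
import ipaddress

def vpnv6_next_hop_field(pe_ipv4: str) -> bytes:
    """Build the 24-octet VPN-IPv6 next hop described in the RFC 4659 quote."""
    rd = bytes(8)  # 8-octet RD set to zero
    v4 = ipaddress.IPv4Address(pe_ipv4)
    # 16-octet IPv4-mapped IPv6 address containing the PE's IPv4 address
    mapped = ipaddress.IPv6Address((0xFFFF << 32) | int(v4))
    return rd + mapped.packed

field = vpnv6_next_hop_field("10.10.253.164")  # PE1 in this post
print(len(field), field.hex())
```

The last four octets of the field are the plain IPv4 address of the advertising PE, which is the part IOS-XR chooses to display.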

Now, if we check PE2 on the other side, which is also getting some IPv6 prefixes from CE2, we'll notice that everything is fine and matches everything we know about 6VPE. This time the IPv6 prefixes have a v4-mapped IPv6 address as next-hop, which is the expected output for a 6VPE topology.

This is how an IPv6 prefix is shown on P2 (10.10.231.4) when received from PE2 (10.10.253.165):

P2#sh bgp vpnv6 unicast all neighbors 10.10.253.165 routes
BGP table version is 4, local router ID is 10.10.231.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter, a additional-path
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:141
*>i2001:DB8:10:1::22/127
                    ::FFFF:10.10.253.165
                                             0    100      0 65141 ?

Let's summarize:

P1: IPv6 prefix 2001:DB8:2:60::22/127 with 10.10.253.164 as NH
P2: IPv6 prefix 2001:DB8:10:1::22/127 with ::FFFF:10.10.253.165 as NH

If you still haven't figured out the difference between P1 and P2 based on the outputs given, let me tell you that P1 is an IOS-XR (4.2.x) based platform while P2 is an IOS (15.x) based one. In terms of functionality everything works fine, because internally IOS-XR uses the correct v4-mapped IPv6 address as next-hop. It just doesn't display it, neither in the CLI nor in the debugs.
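If you compare next-hops collected from both platforms in a script, the different display formats will bite you. A possible workaround, sketched below in Python, is to reduce every displayed next-hop to its underlying address before comparing. The helper `underlying_next_hop` is hypothetical and only deals with the two display formats seen in this post.

```python
import ipaddress

def underlying_next_hop(displayed: str):
    """Reduce a displayed next-hop to the address it really points to.

    A v4-mapped IPv6 address (IOS style) becomes the embedded IPv4 address;
    a plain IPv4 address (IOS-XR style) or a native IPv6 address is returned
    unchanged.
    """
    addr = ipaddress.ip_address(displayed)
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        return addr.ipv4_mapped
    return addr

# The same PE next-hop, as the two platforms would display it.
ios_xr_view = underlying_next_hop("10.10.253.164")
ios_view = underlying_next_hop("::FFFF:10.10.253.164")
assert ios_xr_view == ios_view
```

This doesn't fix the display, of course; it just keeps a comparison script from flagging a mismatch that isn't there.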

Now you might wonder why IOS-XR behaves so differently. After waiting several weeks for a lab reproduction from Cisco, because nobody knew about this behavior, I got a very enlightening answer from the developers:

IOS-XR decided it's best to display the actual next hop which is a v4 nexthop and is also registered with the v4 RIB for tracking purposes, rather than what is transported in the NLRI and matches the output of everything else. So the behaviour is indeed expected and works 'by design'.

Reminder...
Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes

Don't you love those guys? First they tell you that you can't have a v4 nexthop, and now they tell you that it's better to display a v4 nexthop, although the actual NLRI carries a different one.

Luckily, someone inside Cisco (thx Xander once more) agreed that this behavior is misleading, so an enhancement request (CSCuj74543) was opened. If you would like to change this behavior too, please link your case to this DDTS. At the same time a documentation DDTS (CSCuj76745) was opened in order to officially describe this IOS-XR "expected" and "by design" behavior on CCO.