Saturday, November 12, 2011

aggregate-address ... summary-only-after-a-while

As it seems, there is always something that you think you know, until it's proven otherwise.

Some years ago, when i was studying for the CCIE, i knew that in order to suppress more specific routes from an aggregate advertisement in BGP, you could use the "aggregate-address .... summary-only" command. And i believed it until recently.

Let's suppose you have the following config in an ASR1k (10.1.1.1) running 15.1(2)S2:

router bgp 100
 bgp router-id 10.1.1.1
 neighbor 10.2.2.2 remote-as 100
 neighbor 10.2.2.2 update-source Loopback0
...
 address-family ipv4
  aggregate-address 10.10.10.0 255.255.255.0 summary-only
  redistribute connected
  neighbor 10.2.2.2 activate
...

Then you have 2 subscribers logging in.
With "show bgp" the 2 /32 routes under 10.10.10.0/24 seem to be suppressed and the /24 is in the BGP table as it should be:

*> 10.10.10.0/24    0.0.0.0                            32768 i
s> 10.10.10.3/32    0.0.0.0                  0         32768 ?
s> 10.10.10.4/32    0.0.0.0                  0         32768 ?

but doing some "debug bgp updates/events" reveals the following:

BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.3/32, next 10.1.1.1, metric 0, path Local
BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.4/32, next 10.1.1.1, metric 0, path Local

...and after a while:

BGP: aggregate timer expired

BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.3/32
BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.4/32

At the same time on the peer router (10.2.2.2) you can see the above 2 /32 routes being received:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32
*>i10.10.10.3/32    10.1.1.1            0    100      0 ?
*>i10.10.10.4/32    10.1.1.1            0    100      0 ?

and immediately afterwards you can see the 2 /32 routes being withdrawn:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32

Cisco is a little bit contradictory on this behavior, as usual.


According to Cisco tac, the default aggregation logic runs every 30 seconds, but bgp update processing will be done almost every 2 seconds. That's the reason the route is being updated initially and later withdrawn (due to the aggregation processing following the initial update). They also admit that the root cause of this problem is with the BGP code. The route will be advertised as soon as the best path is completed. It may take 30 seconds or more for the aggregation logic to complete and withdraw the more specific route.
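
The timing described by TAC can be sketched with a toy model. This is only an illustration under simplifying assumptions: both timers are treated as firing at fixed multiples of their interval (2 and 30 seconds, the values quoted above), and the function name leak_window is made up for this example. The real IOS timers are load-dependent.

```python
import math

def leak_window(route_time, update_interval=2.0, aggregate_interval=30.0):
    """Return (advertise_time, withdraw_time) for a new more-specific route.

    Toy model: update processing and aggregation both run at fixed
    multiples of their interval, counted from t=0. Returns None when the
    aggregation logic runs before the route could be advertised.
    """
    if aggregate_interval == 0:
        # Timer-based aggregation disabled: suppression is immediate.
        return None
    # First update run strictly after the route shows up.
    advertised = (math.floor(route_time / update_interval) + 1) * update_interval
    # First aggregation run strictly after the route shows up.
    suppressed = (math.floor(route_time / aggregate_interval) + 1) * aggregate_interval
    if advertised >= suppressed:
        return None
    return advertised, suppressed

# A subscriber logs in 5 seconds after the last aggregation run.
print(leak_window(5.0))
# The same login with the aggregate timer set to 0.
print(leak_window(5.0, aggregate_interval=0))
```

In this model the /32 is visible on the peer from the next update run until the next aggregation run, which is the advertise-then-withdraw pattern seen in the debugs.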

Then we also have the following:

Bug CSCsu96698

BGP: /32 route being advertised while 'summary-only' is configured

Symptoms: More specific routes are advertised and withdrawn later even if config aggregate-address net mask summary-only is configured. The BGP table shows the specific prefixes as suppressed with s>
Conditions: This occurs only with very large configurations.
Workaround: Configure a distribute-list in BGP process that denies all of the aggregation child routes.

Related Bug Information

It takes 30 seconds for BGP to form aggregate route

Symptom: for approximately 30 seconds router announces specific prefixes instead of aggregate route
Conditions: bgp session up/down
Workaround: unknown yet


Release notes of 12.2SB and 12.0S

The periodic function is by default called at 60 second intervals. The aggregate processing is normally done based on the CPU load. If there is no CPU load, then the aggregate processing function would be triggered within one second. As the CPU load increases, this function call will be triggered at higher intervals and if the CPU load is very high it could go as high as the maximum aggregate timer value configured via command. By default this maximum value is 30 seconds and is configurable with a range of 6-60 seconds and in some trains 0. So, if default values are configured, then as the CPU load increases, the chances of hitting this defect is higher.


Release notes of 12.2(33)SXH6

CLI change to bgp aggregate-timer command to suppress more specific routes.

Old Behavior: More specific routes are advertised and withdrawn later, even if aggregate-address
summary-only is configured. The BGP table shows the specific prefixes as suppressed.

New Behavior: The bgp aggregate-timer command now accepts the value of 0 (zero), which
disables the aggregate timer and suppresses the routes immediately.


Command Reference for "bgp aggregate-timer"

To set the interval at which BGP routes will be aggregated or to disable timer-based route aggregation, use the bgp aggregate-timer command in address-family or router configuration mode. To restore the default value, use the no form of this command.

bgp aggregate-timer seconds
no bgp aggregate-timer

Syntax Description

seconds

Interval (in seconds) at which the system will aggregate BGP routes.

•The range is from 6 to 60 or else 0 (zero). The default is 30.
•A value of 0 (zero) disables timer-based aggregation and starts aggregation immediately.

Command Default

30 seconds

Usage Guidelines

Use this command to change the default interval at which BGP routes are aggregated.

In very large configurations, even if the aggregate-address summary-only command is configured, more specific routes are advertised and later withdrawn. To avoid this behavior, configure the bgp aggregate-timer to 0 (zero), and the system will immediately check for aggregate routes and suppress specific routes.


The interesting part is that the command reference for "aggregate-address ... summary-only" doesn't mention anything about this behavior in order to warn you.

Using the summary-only keyword not only creates the aggregate route (for example, 192.*.*.*) but also suppresses advertisements of more-specific routes to all neighbors. If you want to suppress only advertisements to certain neighbors, you may use the neighbor distribute-list command, with caution. If a more-specific route leaks out, all BGP or mBGP routers will prefer that route over the less-specific aggregate you are generating (using longest-match routing).


The following debug logs show the default aggregate-timer which is 30 secs, vs the default BGP scan timer which is 60 secs:

Nov 12 21:45:32.468: BGP: Performing BGP general scanning
Nov 12 21:45:38.906: BGP: aggregate timer expired
Nov 12 21:46:09.637: BGP: aggregate timer expired
Nov 12 21:46:32.487: BGP: Performing BGP general scanning
Nov 12 21:46:40.379: BGP: aggregate timer expired
Nov 12 21:47:11.099: BGP: aggregate timer expired
Nov 12 21:47:32.506: BGP: Performing BGP general scanning
Nov 12 21:47:41.828: BGP: aggregate timer expired
Nov 12 21:48:12.547: BGP: aggregate timer expired
Nov 12 21:48:32.525: BGP: Performing BGP general scanning
Nov 12 21:48:43.268: BGP: aggregate timer expired
Nov 12 21:49:13.989: BGP: aggregate timer expired
Nov 12 21:49:32.544: BGP: Performing BGP general scanning
Nov 12 21:49:44.765: BGP: aggregate timer expired
Nov 12 21:50:15.510: BGP: aggregate timer expired
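
If you want to check the intervals without doing the subtraction by hand, a few lines of Python are enough. This is a minimal sketch that assumes the timestamp format shown in the debug output above; the helper name intervals is made up, and only the first five log lines are pasted in.

```python
from datetime import datetime

LOG = """\
Nov 12 21:45:32.468: BGP: Performing BGP general scanning
Nov 12 21:45:38.906: BGP: aggregate timer expired
Nov 12 21:46:09.637: BGP: aggregate timer expired
Nov 12 21:46:32.487: BGP: Performing BGP general scanning
Nov 12 21:46:40.379: BGP: aggregate timer expired
"""

def intervals(log, marker):
    """Seconds between consecutive log lines that contain `marker`."""
    stamps = []
    for line in log.splitlines():
        if marker in line:
            # The timestamp is everything before the first ": " separator.
            stamp = line.split(": ", 1)[0]
            stamps.append(datetime.strptime(stamp, "%b %d %H:%M:%S.%f"))
    return [round((b - a).total_seconds(), 3) for a, b in zip(stamps, stamps[1:])]

print(intervals(LOG, "aggregate timer expired"))
print(intervals(LOG, "general scanning"))
```

The aggregate timer gaps come out slightly above 30 seconds and the scanner gap slightly above 60 seconds.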

Guess what! After changing the aggregate-timer to 0, the cpu load increases by a steady +10%, due to the BGP Router process!

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 17%/6%; one minute: 16%; five minutes: 15%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 61   329991518  1083305044        304  4.07%  4.09%  3.93%   0 IOSD ipc task
340   202049403    53391862       3784  1.35%  2.10%  2.11%   0 VTEMPLATE Backgr
404    84594181  2432529294          0  1.19%  1.18%  1.14%   0 PPP Events
229    49275197     1710570      28806  0.71%  0.33%  0.39%   0 QoS stats proces
152    39536838   801801056         49  0.63%  0.56%  0.51%   0 SSM connection m
159    51982155   383585236        135  0.47%  0.66%  0.64%   0 SSS Manager

ASR1k#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
ASR1k(config)#router bgp 100
ASR1k(config-router)# address-family ipv4
ASR1k(config-router-af)# bgp aggregate-timer 0
ASR1k(config-router-af)#^Z

...and after a while:

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 29%/6%; one minute: 26%; five minutes: 25%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
394   143774989    19776654       7269  9.91%  7.93%  7.32%   0 BGP Router
 61   329983414  1083280872        304  4.79%  3.97%  3.90%   0 IOSD ipc task
340   202044852    53390504       3784  2.71%  2.22%  2.04%   0 VTEMPLATE Backgr
404    84591714  2432460324          0  1.03%  1.16%  1.09%   0 PPP Events
159    51980758   383575060        135  0.79%  0.66%  0.62%   0 SSS Manager
152    39535734   801779715         49  0.63%  0.54%  0.50%   0 SSM connection m

Conclusions

1) By default, the "aggregate-address ... summary-only" command doesn't immediately stop the announcement of more specific routes, as it's supposed to. You need to also change the BGP aggregate-timer to 0.
2) After changing the BGP aggregate-timer to 0, the announcement of more specific routes stops, but the cpu load increases by 10%.


C'mon Cisco! You gave us NH Address Tracking and PIC, and you can't fix the aggregation process?

Saturday, September 17, 2011

AAA and VTYs in IOS-XR : Bingo

Continuing on the IOS-XR saga, this is the newest bunch of things that don't "work as expected" (© Cisco). Well, as expected by me, not by Cisco.

Everything started while trying to configure a primary and backup aaa login method on an ASR9k, when i realized that...

1) having a backup aaa login method with the same tacacs servers as the ones in the primary aaa login method (which is using the management vrf) doesn't work

Imagine the following aaa configuration:

!
tacacs source-interface MgmtEth0/RSP0/CPU0/0 vrf MGMT
tacacs source-interface Loopback0 vrf default
!
aaa group server tacacs+ TACACS-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
!
aaa group server tacacs+ TACACS-VRF-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
 vrf MGMT
!
aaa authentication login default group TACACS-VRF-AAA-GROUP group TACACS-AAA-GROUP local
!

This is supposed to work in the following way:

As long as at least one mgmt interface is up (i'm using a virtual-ip for the mgmt interfaces), tacacs communication should happen through the out-of-band mgmt interfaces. If all mgmt interfaces are down, then tacacs communication should happen through an inband interface.

Guess what! There seems to be an issue with the above scenario, because in the 2nd case (where all mgmt interfaces are down) tacacs communication doesn't happen at all. Looking at the debugs, it's like the router isn't even trying to use the second (global) tacacs group. This has already been opened as SR (according to tac this should work, so let's hope it's just a bug), so i'm waiting for developers' feedback right now.

In order to overcome the above problem, i thought of using different vty templates, each one with a different access method.

In IOS you can have the following vty configuration and then access vtys 11-15 by either using "telnet x.x.x.x 3001" or "telnet x.x.x.x 2000+y" where y is the tty number displayed by using the command "show line".

!
line vty 11 15
 login authentication BACKUP-AAA
 rotary 1
!

Since the "rotary" command is not supported in IOS-XR, this is what you can do:

!
line default
 login authentication default
!
line template VTY-TEMPLATE
 login authentication BACKUP-AAA
!	
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template VTY-TEMPLATE
!

And this is the point you realize that you can't choose a vty, because...

2) specific vtys can be accessed only through a combination of a line template and a specific ACL

First shock: You cannot easily access a specific vty line in IOS-XR. Vtys in IOS-XR work in a very different way in comparison to the IOS ones. According to the BU, when you telnet/ssh to the router, the router starts scanning from the first vty (0) to the last vty (including all custom configured ones). When a free (available) vty is found, the vty ACL is checked in order to verify whether its permit conditions are met. If the vty ACL allows this specific access, then the session is opened.

Second shock: If the vty ACL doesn't allow access, then scanning for free vtys continues until one vty is found that has an ACL that allows this specific access. So, the only way to access a specific vty is to apply a specific and unique ACL under that vty that allows e.g. your source ip. In order to access another vty, you'll have to use another source ip, and so on. Still wondering why Cisco chose such an implementation.
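
The selection logic, as the BU described it, can be sketched as follows. This models the description only, not necessarily what the router really does. The Vty class, the pick_vty function and the 192.0.2.x documentation addresses are all made up for the example, and each ACL is reduced to a list of permitted networks.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass
class Vty:
    number: int
    permitted: list      # networks permitted by this vty's ACL (simplified)
    busy: bool = False

def pick_vty(vtys, source):
    """Return the number of the first free vty whose ACL permits `source`."""
    src = ip_address(source)
    for vty in sorted(vtys, key=lambda v: v.number):
        if vty.busy:
            continue          # not free, keep scanning
        if any(src in ip_network(net) for net in vty.permitted):
            return vty.number  # free and permitted: session opens here
        # free but denied by the ACL: scanning continues with the next vty
    return None

# Default pool 0-10 permits HOST1, custom pool 11-20 permits HOST2.
vtys = [Vty(n, ["192.0.2.1/32"]) for n in range(0, 11)]
vtys += [Vty(n, ["192.0.2.2/32"]) for n in range(11, 21)]

print(pick_vty(vtys, "192.0.2.1"))
print(pick_vty(vtys, "192.0.2.2"))
```

With one unique ACL per pool, the source ip effectively selects the pool, which is the only handle you get for choosing a vty.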

So i tried the following:

!
line default
 login authentication default
 access-class ingress HOST1-ACL
 transport input telnet ssh
!
line template LINE-TEMPLATE
 login authentication BACKUP-AAA
 access-class ingress HOST2-ACL
 transport input telnet ssh
!
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template LINE-TEMPLATE
!
ipv4 access-list HOST1-ACL
 10 permit ipv4 host x.x.x.x any
 20 deny ipv4 any any log
!
ipv4 access-list HOST2-ACL
 10 permit ipv4 host y.y.y.y any
 20 deny ipv4 any any log
!

...and this is what i got when i tried to telnet from HOST2 to the router


HOST2$ telnet router
Trying z.z.z.z...
Connected to router.
Escape character is '^]'.
Connection to router closed by foreign host.

ipv4_acl_mgr[267]: %ACL-IPV4_ACL-6-IPACCESSLOGP : access-list HOST1-ACL (20) deny tcp y.y.y.y(46387) -> z.z.z.z(23), 1 packet

So i didn't manage to telnet into vtys 11-20, because my telnet session was dropped by HOST1-ACL. Is this another bug? Who knows...

And when i thought i had met every possible issue, i also found out that vty ACLs are useless for ssh sessions, because...

3) ssh sessions get established before hitting the vty ACLs

Yeap, that's another shock (3rd in a row). When you open an ssh session to an IOS-XR router, the vty (the one that the ssh session will use) is consumed regardless of your vty ACL. That means that the vty is occupied during the whole time the router is waiting for you to enter your password. It's only after you enter your password that you get disconnected because of the vty ACL. And that's a nice way to DoS an IOS-XR router.


%SECURITY-SSHD-6-INFO_GENERAL : Incoming SSH session rate limit exceeded
%SECURITY-SSHD-3-ERR_GENERAL : Failed to allocate pty


Note: the same happens with telnet, but since the username is asked after the ACL check, the time the telnet session remains open is limited.

But wait; isn't that supposed to be solved by Management Plane Protection (MPP)? Sure it is, but...

4) MPP configuration doesn't support ACLs

Who would have thought of that! MPP configuration expects you to configure hosts and networks in a Juniper kind of way (although Juniper allows you to reuse the "clients" section).


RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv4 ?
  A.B.C.D         Enter IPv4 address
  A.B.C.D/length  Enter IPv4 address with prefix

RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv6 ?
  X:X::X         Enter IPv6 address
  X:X::X/length  Enter IPv6 address with prefix


So, if you happen to have already defined ACLs for your NMS/OSS/whatever, which are already being used somewhere else, you can't reuse those ones, but you have to reconfigure all hosts and networks under the MPP section (something that makes mass router config changes even more difficult). You can't even reuse the same hosts/networks under different interfaces!

!
control-plane
 management-plane
  inband
   !
   interface GigabitEthernet0/3/0/0
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !
   interface GigabitEthernet0/3/0/1
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !

And that's surely a nice way to further "expand" your configuration (not to mention BGP dynamic neighbors that are not supported either, but that's another story).


That's 4 in a row Cisco. Bingo!!!

Note: Many thanks to Arie for helping me with the 2nd issue (once again).


Question to the public:

Is there a character in IOS-XR that fully resembles "!" as a starting comment indicator, like in IOS?

IOS


router(config-line)#login authentication BACKUP-AAA ! backup
router(config-line)#

IOS-XR


RP/0/RSP0/CPU0:router(config-line)#login authentication BACKUP-AAA ! backup
                                                                   ^
% Invalid input detected at '^' marker.

In IOS-XR, "!" works only when it is the first character in the line.

Friday, June 3, 2011

Debugging IPv6 MTU issues in Windows

A common problem you might face soon (World IPv6 Day is 5 days away) is reachability to IPv6 sites due to MTU issues. ICMPv6 has a nice internal mechanism which is supposed to help the application overcome these issues, but like in the IPv4 world, not everything is perfect.

Let's suppose that an IPv6 subscriber is using a DSL router and is connected through PPPoE to a BRAS.

TARGET <=> BRAS <=> DSL-ROUTER <=> HOST

The usual MTU for PPPoE connections is 1492 bytes, as shown below.

1500 bytes = Ethernet Payload
-     6 bytes = PPPoE header
-     2 bytes = PPP ID
---------------------------------
1492 bytes = IPv6 Packet that can be carried over a PPPoE connection

If your host is configured with 1492 (or something lower) as MTU on its LAN interface, then the OS running on it will automatically take care of "fragmentation", so you don't need to worry about anything. Unfortunately this isn't a common scenario by default. You either have to configure it manually on the host or, if you are lucky enough and the DSL modem supports advertisement of MTU to its LAN interface through RA messages (and your host accepts them), it will happen automatically.

If your host is configured with anything larger than 1492 on its LAN interface (in most cases it's the default of 1500), problems might arise.

Users with hosts running Windows can try to ping an IPv6 address (i.e. the next hop after the DSL router) in order to find possible issues with the MTU. The closer the target is, the easier it will be to troubleshoot the problem. Then you start moving towards the target until you meet the issue.

First, some numbers you will need regarding the various headers

1492 bytes = IPv6 Packet
-  40 bytes = IPv6 Header
-   8 bytes = ICMPv6 Header
-------------------------------
1444 bytes = ICMPv6 payload data

Since Windows ping uses the actual payload as a size, if you want to send a total of 1492 bytes, you have to send 1492-40-8=1444 bytes of ICMPv6 payload data. Anything larger will lead to either a problem or to fragmentation.
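
The same arithmetic as a small helper, in case you need it for a different MTU (i.e. a tunnel). A minimal sketch: the function and constant names are made up, the header sizes are the ones listed above.

```python
ETHERNET_PAYLOAD = 1500  # bytes
PPPOE_HEADER = 6
PPP_PROTOCOL_ID = 2
IPV6_HEADER = 40
ICMPV6_HEADER = 8

def pppoe_mtu(ethernet_payload=ETHERNET_PAYLOAD):
    """Largest IPv6 packet that fits in one PPPoE frame."""
    return ethernet_payload - PPPOE_HEADER - PPP_PROTOCOL_ID

def max_ping_payload(mtu):
    """Largest value for 'ping -l' that still fits in one packet of size mtu."""
    return mtu - IPV6_HEADER - ICMPV6_HEADER

print(pppoe_mtu())
print(max_ping_payload(pppoe_mtu()))
```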

Windows>ping -l 1444 x:x::x

Pinging target [x:x:xx] with 1444 bytes of data:

Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=51ms
Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 51ms, Maximum = 54ms, Average = 52ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices


1444 bytes ICMP request from HOST to TARGET


If you increase the above number, you'd better start looking for "Too big" ICMPv6 received messages from any hop towards the target, otherwise you are in trouble.

i.e. if you ping with 1446 bytes of data, you get the following:

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Packet needs to be fragmented but DF set.
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=57ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),
Approximate round trip times in milli-seconds:
    Minimum = 53ms, Maximum = 57ms, Average = 55ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices (fragmentation included)

1446 bytes ICMP request from HOST to TARGET
ICMP reply ("Too big") from DSL-ROUTER to HOST (original truncated message included)

As you can see, device DSL-ROUTER is replying with a "Too Big" message to the first packet from the HOST and informs it about the MTU (1492) supported on the next-hop link (see RFC 4443 for ICMPv6 info); that's the WAN link towards the BRAS, on which PPPoE is running.

If you are in the unfortunate position to not get any incoming packets, you can safely assume (if everything else is fine) that someone in the path is blocking ICMPv6 messages.

The reply message is exactly 1280 bytes, which is the minimum MTU that every IPv6 link must support (an ICMPv6 error message is not allowed to exceed it). This leads to the original message being truncated in the reply message to 1280-40=1240 bytes for the ICMPv6 packet or 1240-8-40-8=1184 bytes for the actual payload data. So you lose 1446-1184=262 bytes of payload data in the reply message.
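
The truncation can be recomputed for any ping size with the following sketch. The function name is made up; the only inputs are the 1280 byte limit and the header sizes used above.

```python
IPV6_MIN_MTU = 1280  # an ICMPv6 error message must not exceed this size
IPV6_HEADER = 40
ICMPV6_HEADER = 8

def truncated_echo_data(sent_data):
    """Return (kept, lost) bytes of echo data inside a "Too Big" message."""
    # Room left after the headers of the error message itself.
    room = IPV6_MIN_MTU - IPV6_HEADER - ICMPV6_HEADER
    # The embedded original packet brings its own IPv6 and ICMPv6 headers.
    room -= IPV6_HEADER + ICMPV6_HEADER
    kept = min(sent_data, room)
    return kept, sent_data - kept

print(truncated_echo_data(1446))
```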

Next packets get a successful answer from the target, because they are sent as fragmented (1432+14 bytes).

1432 bytes ICMP request from HOST to TARGET

14 bytes ICMP request from HOST to TARGET
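
The 1432+14 split is not random. Each fragment carries an extra 8 byte fragment header, and the data in every fragment except the last must be a multiple of 8 bytes. The sketch below assumes the sender fills each fragment as much as that alignment allows, which matches the capture; the function name is made up.

```python
IPV6_HEADER = 40
FRAGMENT_HEADER = 8
ICMPV6_HEADER = 8

def echo_data_per_fragment(icmp_data, mtu):
    """Bytes of echo data carried by each packet for a ping of icmp_data bytes."""
    total = ICMPV6_HEADER + icmp_data  # the fragmentable part of the packet
    if IPV6_HEADER + total <= mtu:
        return [icmp_data]             # fits, no fragmentation needed
    # Usable bytes per fragment, rounded down to a multiple of 8.
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while total > 0:
        chunk = min(total, per_fragment)
        sizes.append(chunk)
        total -= chunk
    # The ICMPv6 header travels in the first fragment.
    sizes[0] -= ICMPV6_HEADER
    return sizes

print(echo_data_per_fragment(1444, 1492))
print(echo_data_per_fragment(1446, 1492))
```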

Windows is "smart" enough to keep track of this status for some minutes (in the so called destination or route cache), so next time you send large packets, the first packet is not lost, because fragmentation happens right away.

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=52ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 52ms, Maximum = 55ms, Average = 53ms




Imho, it's better to make your host use the appropriate MTU from the beginning (i.e. hardcode 1492 or use RA's value) and not depend on ICMPv6 messages to do fragmentation. Some people have proposed to always use the minimum of 1280 (Geoff Huston, Tore Anderson), in order to be safe on every possible case (tunnels involved). I generally prefer to use the maximum possible, hoping that someone in the middle won't mess with ICMPv6 messages. I know that currently this is not the case (so stick with something lower, like 1280, for now), but this will probably change as native IPv6 gets deployed. Unless we start filtering ICMPv6 messages uncontrollably...like many do on IPv4. Does "Internet Control Message Protocol" say anything to you?

Notes

1) RFC 1981 describes Path MTU Discovery (PMTUD) for IPv6.
2) RFC 4821 will help a lot in PMTUD, when and if all vendors start implementing it.
3) In order to see clearly the fragmented IPv6 packets in Wireshark, you have to disable reassembly in preferences.
4) You can use the commands "ipv6 rc" and "ipv6 rcf" in order to view and clear the destination/route cache in WindowsXP

Monday, May 23, 2011

To forward, to peer, or to tunnel?

In an imaginary Cisco world every device would be able to talk with every other device in various layers. In the actual Cisco world, some devices can talk to some devices, while they can't talk to some other devices.

I'm talking specifically about L2 Control Protocols (L2CPs), when these need to be exchanged between different devices in order to support a requirement (i.e. create a spanning-tree loop). Cisco's L2 Protocol Tunneling (L2PT) can help in accomplishing some of these cases.

So let's suppose you have a scenario like the following.


When using the simplest form of devices (L2 switches like 3750), you can just tunnel the L2CPs between devices S1 and S2 and everything will be fine. Spanning tree running on devices C1-C4 will see a loop and will block a port depending on various parameters (priority, cost, etc).

As you move ahead and start to replace the S1 or S2 device with another (usually better), you realize that the new device supports a different way of handling L2CPs, which might be "incompatible" with the old way.

Generally, you can do the following actions on L2CPs as they enter a port:

forward: frame is forwarded to another device without any change (no local processing takes place)
drop: frame is dropped
peer: frame is processed/terminated locally
tunnel: frame is tunneled to another device after changing the destination mac address (L2PT)

Tunneling is quite common in scenarios like the above, where you need to pass the L2 frame across a L2 domain, without having the intermediate devices act upon it.

You can also achieve the same result with forwarding, as long as you don't have a native L2 domain in between, because you might end up mixing local protocols with protocols that just pass over.

It's obvious that you cannot have tunneling on one side and forwarding on the other side, because exchanged frames won't be able to "talk" to each other. i.e. for STP one side will tunnel the frame by changing the destination mac address from 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00) to 01-00-0c-cd-cd-d0, while the other side will just forward the frame by keeping the original destination address of 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00).
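
The mismatch can be shown with a tiny model of what each edge device puts on the wire. This is just an illustration: the function names are made up, and the only facts used are the mac addresses mentioned above.

```python
PVST_MAC = "01-00-0c-cc-cc-cd"
STP_MAC = "01-80-c2-00-00-00"
L2PT_MAC = "01-00-0c-cd-cd-d0"

def core_mac(original_mac, action):
    """Destination mac the frame carries across the core for a given action."""
    if action == "tunnel":
        return L2PT_MAC       # L2PT rewrites the destination address
    if action == "forward":
        return original_mac   # forwarded unchanged
    raise ValueError("frame does not cross the core with action " + action)

def same_encoding(original_mac, action_a, action_b):
    """True if both edge devices use the same destination mac in the core."""
    return core_mac(original_mac, action_a) == core_mac(original_mac, action_b)

for pair in (("tunnel", "tunnel"), ("forward", "forward"), ("tunnel", "forward")):
    print(pair, same_encoding(PVST_MAC, *pair))
```

Only the pairs with the same action on both sides agree on the destination address.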

Below you'll find a list with all available options regarding the handling of L2CPs on some known platforms:

Device     Interface             forward                 drop             peer             tunnel
3750       L2 switchport         -                       -                -                l2protocol-tunnel
ME-3400    L2 switchport         -                       -                -                l2protocol-tunnel
ME-3800X   L2 switchport         -                       l2protocol drop  l2protocol peer  -
ME-3800X   L2 service instance   l2protocol forward (1)  -                l2protocol peer  l2protocol tunnel
7600/67xx  L2 switchport         -                       -                -                l2protocol-tunnel
7600/ES    L2 switchport         -                       -                -                l2protocol-tunnel
7600/ES    L3                    -                       l2protocol drop  l2protocol peer  -
7600/ES    L3 service instance   l2protocol forward      -                -                -
ASR9000    L2 transport          by default (2)          -                -                l2protocol tunnel

As you can see, you cannot have L2 communication between a service instance on a 7600/ES and one of the smaller platforms, because 7600/ES doesn't support tunneling and the smaller platforms do not support forwarding. Actually, the biggest surprise to me was the lack of support of L2PT on the 7600 with the ES cards when using service instances. I had the impression that this would be the most feature rich platform.

Cisco's proposal is to use the same platform for such scenarios, because they haven't verified anything else and some platforms were built to be used in specific ways. So instead of supporting the same feature (L2PT was their idea after all) along the range of platforms, you should always replace them in pairs. And if by accident, you happen to have more S devices serving many overlapping rings, then you have to replace all of them.

I would prefer, instead of promoting new platforms or new designs, to focus on fixing the existing platforms, so they can cooperate with each other. After all, if a platform is good enough, it will get its share in the market.

Also, the online documentation is quite incomplete on this area. You have to guess what will happen in most cases. We had to open 3 different cases and involve our account team in order to clarify things and push for fixing the documentation. Not surprisingly enough, the peering functionality is another mess. I'll probably need to write another post describing all available options (which lead to different behavior) on these platforms.


Notes

1) "l2protocol forward" on ME-3800X will become available in the next major release. Thanks to Cisco for giving us the chance to try it earlier.
2) This is the default behavior according to Xander's doc here.
3) Arie asked me to add some extra information about PW/MST/REP/PVST-AG (and all these L2 HA) scenarios. I'll try to write a new post as soon as i find enough free time to test them.

Thursday, May 5, 2011

How Multi is MP-BGP in IOS-XR?

This caught me by surprise. I had the impression that IOS-XR as an advanced operating system would support all kinds of multi-protocol transferability over BGP.

As it seems, there is an issue when transferring IPv6 prefixes over an IPv4 peering or IPv4 prefixes over an IPv6 peering. This happens for sure on ASR9k running latest 4.1.0, but i haven't verified it on the CRS yet.

IPv4 prefixes over IPv6 peering
This doesn't seem to be supported based on the available configuration options.
What is even more worrying is that no other address family is supported either.

RP/0/RSP0/CPU0:ASR#conf t
RP/0/RSP0/CPU0:ASR(config)#router bgp 100
RP/0/RSP0/CPU0:ASR(config-bgp)#neighbor 2001::1:2:3
RP/0/RSP0/CPU0:ASR(config-bgp-nbr)#address-family ?
  ipv6  IPv6 Address Family

IPv6 prefixes over IPv4 peering
This is supported according to the configuration options, but it doesn't work.
Cisco also insists that this is definitely supported.

RP/0/RSP0/CPU0:ASR#conf t
RP/0/RSP0/CPU0:ASR(config)#router bgp 100
RP/0/RSP0/CPU0:ASR(config-bgp)#neighbor 10.11.254.37
RP/0/RSP0/CPU0:ASR(config-bgp-nbr)#address-family ?
  ipv4   IPv4 Address Family
  ipv6   IPv6 Address Family
  l2vpn  L2VPN Address Family
  vpnv4  VPNv4 Address Family
  vpnv6  VPNv6 Address Family

As soon as you enable the IPv6 address family under the IPv4 neighbor, the BGP session is dropped and it never comes up.

RP/0/RSP0/CPU0:ASR#sh bgp sum
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0000000   RD version: 1
BGP main routing table version 1
BGP scan interval 60 secs

BGP is operating in STANDALONE mode.


Process       RcvTblVer   bRIB/RIB   LabelVer  ImportVer  SendTblVer  StandbyVer
Speaker               1          1          1          1           1           1

Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
10.11.254.37      0  100        0       0        0    0    0 00:00:00 Idle

Also, debug shows that BGP makes no attempt to establish a session. It's like BGP gets disabled.

The only doc that mentions such a limitation (in IOS-XR 3.3 for CRS) is the one in http://www.cisco.com/en/US/docs/ios_xr_sw/iosxr_r3.3/conversion/reference/guide/cn33main.html#wp1028960

A given address family is only supported with a neighbor whose address is from that address family. For instance, IPv4 neighbors support IPv4 unicast and multicast address families, and IPv6 neighbors support IPv6 unicast and multicast address families. However, you cannot exchange IPv6 routing information with an IPv4 neighbor and vice versa.

I searched all CCO for more information, but i didn't manage to find something useful. Does anyone have extra information to share? TAC is struggling (as usual) to find an answer...

Update #1
Cisco verified once more that this is a supported configuration. Arie Vayner (and later tac) proposed to add an IPv6 address to the interface being used as an IPv4 next-hop. Indeed, this solved the problem and the BGP session came up. But then it became even more interesting...

Two IPv6 prefixes are learned from the IPv4 neighbor. Next-hop is an IPv6 address.

RP/0/RSP0/CPU0:ASR#sh bgp ipv6 uni 
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0800000   RD version: 5
BGP main routing table version 5
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network            Next Hop            Metric LocPrf Weight Path
* i2001::1:2:3/128    2003::1:2:3              0    100      0 ?
* i2003::/64          2003::1:2:3              0    100      0 ?

Processed 2 prefixes, 2 paths

If i remove the IPv6 address from the interface that is being used as next-hop (the one i added before), then i automatically get an IPv6 prefix with an IPv4 next-hop!!!

RP/0/RSP0/CPU0:core-distr-kln-02#sh bgp ipv6 uni 
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0800000   RD version: 6
BGP main routing table version 6
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network            Next Hop            Metric LocPrf Weight Path
*>i2001::1:2:3/128    10.11.254.41             0    100      0 ?

Processed 1 prefixes, 1 paths

The BGP session stays up, until something happens that will reset it. Then it will stay down forever, as it was happening in the beginning.

I must say that i cannot endorse such an implementation. Using exactly the same configuration, you get different results, depending on the order of (un)configuring things. Also, i cannot understand why the establishment of an IPv4 BGP session that is going to negotiate IPv4/IPv6 address-family capabilities should depend on whether an IPv6 next-hop exists or not. That should be left for the NLRI exchange routine.

After all, RFC 4271 defines among others two error conditions for the NEXT_HOP attribute:

If the NEXT_HOP attribute field is syntactically incorrect, then the Error Subcode MUST be set to Invalid NEXT_HOP Attribute. The Data field MUST contain the incorrect attribute (type, length, and value). Syntactic correctness means that the NEXT_HOP attribute represents a valid IP host address.

If the NEXT_HOP attribute is semantically incorrect, the error SHOULD be logged, and the route SHOULD be ignored. In this case, a NOTIFICATION message SHOULD NOT be sent, and the connection SHOULD NOT be closed.



Update #2
After the developers got involved, we ended up with the following:

  1. In IOS-XR you need an IPv6 NH in order to activate the IPv6 AF for an IPv4 BGP session.
  2. If you don't have an IPv6 NH, then the IPv4 BGP session won't even come up.
  3. The above was done to protect against misconfiguration, because otherwise you would get a misleading v4 mapped v6 address as NH.
  4. If you have an IPv6 NH, then the IPv4 BGP session with the IPv6 AF will come up.
  5. If afterwards you remove the IPv6 NH, then the session deliberately remains up and you get a misleading v4 mapped v6 address as NH.
Cisco agreed (thx Xander) that the behavior in 3 and 5 contradict each other, so a short-term solution (update the documentation and print a warning message) got recorded in CSCtq26829.
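
The order dependency in the five points above can be written down as a small state model. This is a sketch of the described behavior only, not of the IOS-XR code; the class and method names are made up, and the addresses are the ones from the outputs above.

```python
from ipaddress import IPv4Address, IPv6Address

class Ipv4PeerWithIpv6Af:
    """IPv4 BGP session with the IPv6 address family activated (toy model)."""

    def __init__(self, v4_next_hop):
        self.v4_next_hop = IPv4Address(v4_next_hop)
        self.v6_next_hop = None
        self.up = False

    def set_v6_next_hop(self, addr):
        self.v6_next_hop = IPv6Address(addr)

    def remove_v6_next_hop(self):
        # Point 5: an already established session is left untouched.
        self.v6_next_hop = None

    def try_establish(self):
        # Points 1, 2 and 4: the session comes up only with an IPv6 NH.
        self.up = self.v6_next_hop is not None
        return self.up

    def reset(self):
        self.up = False
        return self.try_establish()

    def advertised_next_hop(self):
        if not self.up:
            return None
        if self.v6_next_hop is not None:
            return self.v6_next_hop
        # No IPv6 NH left: fall back to the v4 mapped v6 address (::ffff:a.b.c.d).
        return IPv6Address((0xFFFF << 32) | int(self.v4_next_hop))

peer = Ipv4PeerWithIpv6Af("10.11.254.41")
print(peer.try_establish())            # no IPv6 NH yet
peer.set_v6_next_hop("2003::1:2:3")
print(peer.try_establish())
peer.remove_v6_next_hop()
print(peer.up, peer.advertised_next_hop().ipv4_mapped)
print(peer.reset())                    # after a reset it stays down
```

The same final configuration (no IPv6 NH) gives a session that is up or down depending on the path taken to reach it, which is exactly the contradiction between points 3 and 5.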

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Greece License.