Saturday, November 10, 2012

You have to make the right balance between the convergence time and MTU

Lately I'm getting the impression that Cisco is releasing new products without proper internal testing.

I'm going to talk about two recent examples, ASR1001 and ASR901, devices that are an excellent value for money, but (as usual) hide limitations that you unfortunately find out only after exhaustive testing.

ASR1001 is a fine router, a worthy replacement for the 7200, which can be used for various purposes. Of course, like every new platform from every vendor these days, it fully supports jumbo frames, and that's a nice thing. At least you get that impression until you try to use the large MTU for control/routing protocols, where an equally "nice" surprise awaits.

MTU 1500 (~0.4 sec)

17:22:35.415: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:22:35.643: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:22:35.823: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:22:35.824: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done

MTU 9216 (~48 sec!)
17:43:07.923: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:43:08.098: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:43:08.241: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:43:55.942: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done
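
The difference between the two runs can be read straight off the timestamps. The following is a minimal sketch (the function name `seconds_to_full` is only illustrative, and it assumes the log excerpt does not cross midnight) that measures the time from the FULL-to-DOWN event until the adjacency is FULL again, using the first and last lines of each excerpt above:

```python
from datetime import datetime

def seconds_to_full(log_lines):
    """Seconds from the first 'to DOWN' event to the next 'to FULL' event."""
    down_at = None
    for line in log_lines:
        # Each line starts with HH:MM:SS.mmm followed by ": %OSPF-5-ADJCHG ..."
        stamp = datetime.strptime(line.split(": ", 1)[0], "%H:%M:%S.%f")
        if down_at is None and " to DOWN," in line:
            down_at = stamp
        elif down_at is not None and " to FULL," in line:
            return (stamp - down_at).total_seconds()
    return None  # adjacency never reached FULL in this excerpt

mtu_1500 = [
    "17:22:35.415: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached",
    "17:22:35.824: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done",
]
mtu_9216 = [
    "17:43:07.923: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached",
    "17:43:55.942: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done",
]

print(seconds_to_full(mtu_1500), seconds_to_full(mtu_9216))
```

Almost all of the extra time in the MTU 9216 case is spent between LOADING and FULL, which is where the LSAs themselves are exchanged.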

While trying to use MTU 9216 in a test environment (the same issue was observed even with smaller MTUs), we hit an interesting issue during the exchange of large OSPF databases between ASR1001s (running 15.1S & 15.2S). Packets carrying LSAs were being dropped inside the router, because the underlying driver of LSMPI (Linux Shared Memory Punt Interface) could not handle such packets at high rates. These large packets were internally fragmented into smaller ones (512 bytes each), transmitted and then reassembled, which multiplied the pps rate between the involved subsystems (by 9216/512=18 for a full-size packet).
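
As a rough illustration of that multiplication, here is a small sketch. It assumes that every punted packet is simply cut into 512-byte pieces (rounding up) and ignores any per-piece overhead, which may not match the real driver; the names are mine, not Cisco's.

```python
import math

INTERNAL_CHUNK = 512  # bytes per internal piece, as described above

def internal_pieces(packet_bytes, chunk=INTERNAL_CHUNK):
    """How many internal pieces one punted packet of this size turns into."""
    return math.ceil(packet_bytes / chunk)

def internal_pps(external_pps, packet_bytes):
    """Internal packet rate caused by a given external rate of such packets."""
    return external_pps * internal_pieces(packet_bytes)

for size in (1500, 4470, 9216):
    print(size, internal_pieces(size))
```

Under this assumption, a burst of full-size 9216-byte OSPF packets loads the punt path 18 times harder in pps terms than the same number of packets that fit into one piece.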

To be more exact, packets punted from the ESP to the RP are received by the Linux kernel of the RP. The Linux kernel then sends those packets to the IOSD process through LSMPI, as you can see in the following diagram.


So, this is the complete punt path on the ASR1000 Series router:

QFP <==> RP Linux Kernel <==> LSMPI <==> Fast-Path Thread <==> Cisco IOS Thread

Since there is a built-in limit on the pps rate that LSMPI can handle, fragmenting the packets internally raises the internal pps rate and some sub-packets get discarded. Losing one sub-packet means the complete packet is dropped and has to be retransmitted by the neighboring router, which in turn leads to longer convergence times...if convergence can be accomplished at all after such packet losses (there were cases where even after minutes there was no convergence, or the adjacency seemed FULL but the RIB didn't have any entries from the LSADB). Things can get messier if you also run BGP (with path MTU discovery) and there is a large number of updates that need to be processed. At some point in the past, IOS included code that internally retransmitted only the lost 512-byte packets, so OSPF couldn't actually tell it was losing packets, but that code was removed because it caused other issues (probably overloading the LSMPI even more).

So this leads to the question: "At what layer should the router handle control plane packet loss internally?" As low as possible, in order to hide it from the actual protocol, or should everything be left to the protocol itself?

You can use the following command in order to check for issues in the LSMPI path (look out for "Device xmit fail").

ASR1001#show platform software infrastructure lsmpi

LSMPI Driver stat ver: 3

Packets:
         In: 17916
        Out: 4713

Rings:
         RX: 2047 free    0    in-use    2048 total
         TX: 2047 free    0    in-use    2048 total
     RXDONE: 2047 free    0    in-use    2048 total
     TXDONE: 2046 free    1    in-use    2048 total

Buffers:
         RX: 6877 free    1317 in-use    8194 total

Reason for RX drops (sticky):
     Ring full        : 0
     Ring put failed  : 0
     No free buffer   : 0
     Receive failed   : 0
     Packet too large : 0
     Other inst buf   : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Bad particle ins : 0
     Bad buf cond     : 0
     DS rd req failed : 0
     HT rd req failed : 0
Reason for TX drops (sticky):
     Bad packet len   : 0
     Bad buf len      : 0
     Bad ifindex      : 0
     No device        : 0
     No skbuff        : 0
     Device xmit fail : 103
     Device xmit rtry : 0
     Tx Done ringfull : 0
     Bad u->k xlation : 0
     No extra skbuff  : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Other inst buf   : 0
...
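
If you collect this output regularly, a few lines of script can pull out the non-zero drop counters for you. This is only a sketch that assumes the output keeps the layout shown above; the function name `nonzero_drops` is made up for the example.

```python
import re

def nonzero_drops(show_output):
    """Return {(direction, reason): count} for every non-zero drop counter."""
    drops = {}
    direction = None
    for line in show_output.splitlines():
        header = re.match(r"\s*Reason for (RX|TX) drops", line)
        if header:
            direction = header.group(1)  # counters that follow belong to RX or TX
            continue
        counter = re.match(r"\s*(.+?)\s*:\s*(\d+)\s*$", line)
        if direction and counter and int(counter.group(2)) > 0:
            drops[(direction, counter.group(1))] = int(counter.group(2))
    return drops

sample = """
Packets:
         In: 17916
        Out: 4713
Reason for RX drops (sticky):
     Ring full        : 0
     No free buffer   : 0
Reason for TX drops (sticky):
     No skbuff        : 0
     Device xmit fail : 103
     Device xmit rtry : 0
"""

print(nonzero_drops(sample))
```

Since the counters are marked as sticky, comparing two snapshots taken some time apart tells you more than a single reading.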

Keep in mind that ICMP echoes cannot be used to verify this behavior, because ICMP replies are handled by the ESP/QFP, so you won't notice this issue.

Note: Cisco has an excellent doc describing all cases of packets drops on the ASR1k platform here.
Also, there is a relevant bug (CSCtz53398) that is supposed to provide a workaround.

Cisco's answer? "You have to make the right balance between the convergence time and MTU"!

I tend to agree with them, but until now i had the impression that a larger MTU would lower the convergence time (as long as the CPU could follow). Well, time to reconsider....


Saturday, November 12, 2011

aggregate-address ... summary-only-after-a-while

As it seems, there is always something that you think you know, until it's proven the other way around.

Some years ago, when i was studying for the CCIE, i knew that in order to suppress more specific routes from an aggregate advertisement in BGP, you could use the "aggregate-address .... summary-only" command. And i believed it until recently.

Let's suppose you have the following config on an ASR1k (10.1.1.1) running 15.1(2)S2:

router bgp 100
 bgp router-id 10.1.1.1
 neighbor 10.2.2.2 remote-as 100
 neighbor 10.2.2.2 update-source Loopback0
...
 address-family ipv4
  aggregate-address 10.10.10.0 255.255.255.0 summary-only
  redistribute connected
  neighbor 10.2.2.2 activate
...

Then you have 2 subscribers logging in.
With "show bgp" the 2 /32 routes under 10.10.10.0/24 seem to be suppressed and the /24 is in the BGP table as it should be:

*> 10.10.10.0/24    0.0.0.0                            32768 i
s> 10.10.10.3/32    0.0.0.0                  0         32768 ?
s> 10.10.10.4/32    0.0.0.0                  0         32768 ?

but doing some "debug bgp updates/events" reveals the following:

BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.3/32, next 10.1.1.1, metric 0, path Local
BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.4/32, next 10.1.1.1, metric 0, path Local

...and after a while:

BGP: aggregate timer expired

BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.3/32
BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.4/32

At the same time on the peer router (10.2.2.2) you can see the above 2 /32 routes being received:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32
*>i10.10.10.3/32    10.1.1.1            0    100      0 ?
*>i10.10.10.4/32    10.1.1.1            0    100      0 ?

and immediately afterwards you can see the 2 /32 routes being withdrawn:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32

Cisco is a little bit contradictory about this behavior, as usual.


According to Cisco TAC, the default aggregation logic runs every 30 seconds, while BGP update processing is done almost every 2 seconds. That's the reason the route is initially advertised and later withdrawn (the aggregation processing follows the initial update). They also admit that the root cause of this problem is in the BGP code. The route is advertised as soon as best-path selection is completed. It may take 30 seconds or more for the aggregation logic to complete and withdraw the more specific route.

Then we also have the following:

Bug CSCsu96698

BGP: /32 route being advertised while 'summary-only' is configured

Symptoms: More specific routes are advertised and withdrawn later even if config aggregate-address net mask summary-only is configured. The BGP table shows the specific prefixes as suppressed with s>
Conditions: This occurs only with very large configurations.
Workaround: Configure a distribute-list in BGP process that denies all of the aggregation child routes.

Related Bug Information

It takes 30 seconds for BGP to form aggregate route

Symptom: for approximately 30 seconds router announces specific prefixes instead of aggregate route
Conditions: bgp session up/down
Workaround: unknown yet


Release notes of 12.2SB and 12.0S

The periodic function is by default called at 60 second intervals. The aggregate processing is normally done based on the CPU load. If there is no CPU load, then the aggregate processing function would be triggered within one second. As the CPU load increases, this function call will be triggered at higher intervals and if the CPU load is very high it could go as high as the maximum aggregate timer value configured via command. By default this maximum value is 30 seconds and is configurable with a range of 6-60 seconds and in some trains 0. So, if default values are configured, then as the CPU load increases, the chances of hitting this defect is higher.


Release notes of 12.2(33)SXH6

CLI change to bgp aggregate-timer command to suppress more specific routes.

Old Behavior: More specific routes are advertised and withdrawn later, even if aggregate-address
summary-only is configured. The BGP table shows the specific prefixes as suppressed.

New Behavior: The bgp aggregate-timer command now accepts the value of 0 (zero), which
disables the aggregate timer and suppresses the routes immediately.


Command Reference for "bgp aggregate-timer"

To set the interval at which BGP routes will be aggregated or to disable timer-based route aggregation, use the bgp aggregate-timer command in address-family or router configuration mode. To restore the default value, use the no form of this command.

bgp aggregate-timer seconds
no bgp aggregate-timer

Syntax Description

seconds

Interval (in seconds) at which the system will aggregate BGP routes.

•The range is from 6 to 60 or else 0 (zero). The default is 30.
•A value of 0 (zero) disables timer-based aggregation and starts aggregation immediately.

Command Default

30 seconds

Usage Guidelines

Use this command to change the default interval at which BGP routes are aggregated.

In very large configurations, even if the aggregate-address summary-only command is configured, more specific routes are advertised and later withdrawn. To avoid this behavior, configure the bgp aggregate-timer to 0 (zero), and the system will immediately check for aggregate routes and suppress specific routes.


The interesting part is that the command reference for "aggregate-address ... summary-only" doesn't mention anything about this behavior in order to warn you.

Using the summary-only keyword not only creates the aggregate route (for example, 192.*.*.*) but also suppresses advertisements of more-specific routes to all neighbors. If you want to suppress only advertisements to certain neighbors, you may use the neighbor distribute-list command, with caution. If a more-specific route leaks out, all BGP or mBGP routers will prefer that route over the less-specific aggregate you are generating (using longest-match routing).
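
To illustrate why a leaked more-specific route matters, here is a minimal longest-match sketch using Python's standard `ipaddress` module. It is not how any router implements its lookup; the function name `best_match` is made up, and the prefixes are the ones from the example above.

```python
import ipaddress

def best_match(destination, prefixes):
    """Return the most specific prefix that contains the destination, or None."""
    dest = ipaddress.ip_address(destination)
    matching = [ipaddress.ip_network(p) for p in prefixes]
    matching = [net for net in matching if dest in net]
    return max(matching, key=lambda net: net.prefixlen, default=None)

# What the peer sees while the /32 is still being advertised ...
during_leak = ["10.10.10.0/24", "10.10.10.3/32"]
# ... and after the aggregate timer has expired and the /32 is withdrawn.
after_withdraw = ["10.10.10.0/24"]

print(best_match("10.10.10.3", during_leak))
print(best_match("10.10.10.3", after_withdraw))
```

During the leak the peer follows the /32, afterwards it falls back to the /24, so every subscriber login causes an advertisement and a withdrawal on the peer instead of no churn at all.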


The following debug logs show the default aggregate-timer which is 30 secs, vs the default BGP scan timer which is 60 secs:

Nov 12 21:45:32.468: BGP: Performing BGP general scanning
Nov 12 21:45:38.906: BGP: aggregate timer expired
Nov 12 21:46:09.637: BGP: aggregate timer expired
Nov 12 21:46:32.487: BGP: Performing BGP general scanning
Nov 12 21:46:40.379: BGP: aggregate timer expired
Nov 12 21:47:11.099: BGP: aggregate timer expired
Nov 12 21:47:32.506: BGP: Performing BGP general scanning
Nov 12 21:47:41.828: BGP: aggregate timer expired
Nov 12 21:48:12.547: BGP: aggregate timer expired
Nov 12 21:48:32.525: BGP: Performing BGP general scanning
Nov 12 21:48:43.268: BGP: aggregate timer expired
Nov 12 21:49:13.989: BGP: aggregate timer expired
Nov 12 21:49:32.544: BGP: Performing BGP general scanning
Nov 12 21:49:44.765: BGP: aggregate timer expired
Nov 12 21:50:15.510: BGP: aggregate timer expired

Guess what! After changing the aggregate-timer to 0, the cpu load increases by a steady +10%, due to the BGP Router process!

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 17%/6%; one minute: 16%; five minutes: 15%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 61   329991518  1083305044        304  4.07%  4.09%  3.93%   0 IOSD ipc task
340   202049403    53391862       3784  1.35%  2.10%  2.11%   0 VTEMPLATE Backgr
404    84594181  2432529294          0  1.19%  1.18%  1.14%   0 PPP Events
229    49275197     1710570      28806  0.71%  0.33%  0.39%   0 QoS stats proces
152    39536838   801801056         49  0.63%  0.56%  0.51%   0 SSM connection m
159    51982155   383585236        135  0.47%  0.66%  0.64%   0 SSS Manager

ASR1k#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
ASR1k(config)#router bgp 100
ASR1k(config-router)# address-family ipv4
ASR1k(config-router-af)# bgp aggregate-timer 0
ASR1k(config-router-af)#^Z

...and after a while:

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 29%/6%; one minute: 26%; five minutes: 25%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
394   143774989    19776654       7269  9.91%  7.93%  7.32%   0 BGP Router
 61   329983414  1083280872        304  4.79%  3.97%  3.90%   0 IOSD ipc task
340   202044852    53390504       3784  2.71%  2.22%  2.04%   0 VTEMPLATE Backgr
404    84591714  2432460324          0  1.03%  1.16%  1.09%   0 PPP Events
159    51980758   383575060        135  0.79%  0.66%  0.62%   0 SSS Manager
152    39535734   801779715         49  0.63%  0.54%  0.50%   0 SSM connection m

Conclusions

1) By default, the "aggregate-address ... summary-only" command doesn't immediately stop the announcement of more specific routes, as it's supposed to. You need to also change the BGP aggregate-timer to 0.
2) After changing the BGP aggregate-timer to 0, the announcement of more specific routes stops, but the cpu load increases by 10%.


C'mon Cisco! You gave us NH Address Tracking and PIC, and you can't fix the aggregation process?

Saturday, September 17, 2011

AAA and VTYs in IOS-XR : Bingo

Continuing on the IOS-XR saga, this is the newest bunch of things that don't "work as expected" (© Cisco). Well, as expected by me, not by Cisco.

Everything started while trying to configure a primary and backup aaa login method on an ASR9k, when i realized that...

1) having a backup aaa login method with the same tacacs servers as the ones in the primary aaa login method (which is using the management vrf) doesn't work

Imagine the following aaa configuration:

!
tacacs source-interface MgmtEth0/RSP0/CPU0/0 vrf MGMT
tacacs source-interface Loopback0 vrf default
!
aaa group server tacacs+ TACACS-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
!
aaa group server tacacs+ TACACS-VRF-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
 vrf MGMT
!
aaa authentication login default group TACACS-VRF-AAA-GROUP group TACACS-AAA-GROUP local
!

This is supposed to work in the following way:

As long as at least one mgmt interface is up (i'm using a virtual-ip for the mgmt interfaces), tacacs communication should happen through the out-of-band mgmt interfaces. If all mgmt interfaces are down, then tacacs communication should happen through an inband interface.

Guess what! There seems to be an issue with the above scenario, because in the 2nd case (where all mgmt interfaces are down) tacacs communication doesn't happen at all. Looking at the debugs, it's like the router isn't even trying to use the second (global) tacacs group. This has already been opened as SR (according to tac this should work, so let's hope it's just a bug), so i'm waiting for developers' feedback right now.

In order to overcome the above problem, i thought of using different vty templates, each one with a different access method.

In IOS you can have the following vty configuration and then access vtys 11-15 by either using "telnet x.x.x.x 3001" or "telnet x.x.x.x 2000+y" where y is the tty number displayed by using the command "show line".

!
line vty 11 15
 login authentication BACKUP-AAA
 rotary 1
!

Since the "rotary" command is not supported in IOS-XR, this is what you can do:

!
line default
 login authentication default
!
line template VTY-TEMPLATE
 login authentication BACKUP-AAA
!
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template VTY-TEMPLATE
!

And this is the point you realize that you can't choose a vty, because...

2) specific vtys can be accessed only through a combination of a line template and a specific ACL

First shock: You cannot easily access a specific vty line in IOS-XR. Vtys in IOS-XR work in a very different way in comparison to the IOS ones. According to the BU, when you telnet/ssh to the router, the router starts scanning from the first vty (0) to the last vty (including all custom configured ones). When a free (available) vty is found, the vty ACL is checked in order to verify whether its permit conditions are met. If the vty ACL allows this specific access, then the session is opened.

Second shock: If the vty ACL doesn't allow access, then scanning for free vtys continues until a vty is found whose ACL allows this specific access. So, the only way to access a specific vty is to apply a specific and unique ACL under that vty that allows e.g. your source ip. In order to access another vty, you'll have to use another source ip, and so on. Still wondering why Cisco chose such an implementation.
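
The selection logic, as the BU described it, can be sketched like this. It is a model of the description only, not of the actual IOS-XR code; the `Vty` class, the `select_vty` function and the 192.0.2.x addresses are all made up for the example, and each ACL is reduced to a simple "is this source permitted" check.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Vty:
    number: int
    busy: bool
    permits: Callable[[str], bool]  # stand-in for the vty's ingress ACL

def select_vty(vtys: List[Vty], source_ip: str) -> Optional[int]:
    """Scan from the lowest vty upwards; take the first free one whose ACL permits."""
    for vty in sorted(vtys, key=lambda v: v.number):
        if vty.busy:
            continue  # not available, keep scanning
        if vty.permits(source_ip):
            return vty.number
        # ACL denies this source: keep scanning instead of rejecting
    return None

# Default pool (0-10) permits only host 1, custom pool (11-20) only host 2.
pool = [Vty(n, False, lambda ip: ip == "192.0.2.1") for n in range(0, 11)]
pool += [Vty(n, False, lambda ip: ip == "192.0.2.2") for n in range(11, 21)]

print(select_vty(pool, "192.0.2.1"), select_vty(pool, "192.0.2.2"))
```

In this model the source address is the only thing that steers a session towards a particular pool, which is why a unique ACL per vty (or per pool) is needed.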

So i tried the following:

!
line default
 login authentication default
 access-class ingress HOST1-ACL
 transport input telnet ssh
!
line template LINE-TEMPLATE
 login authentication BACKUP-AAA
 access-class ingress HOST2-ACL
 transport input telnet ssh
!
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template LINE-TEMPLATE
!
ipv4 access-list HOST1-ACL
 10 permit ipv4 host x.x.x.x any
 20 deny ipv4 any any log
!
ipv4 access-list HOST2-ACL
 10 permit ipv4 host y.y.y.y any
 20 deny ipv4 any any log
!

...and this is what i got when i tried to telnet from HOST2 to the router


HOST2$ telnet router
Trying z.z.z.z...
Connected to router.
Escape character is '^]'.
Connection to router closed by foreign host.

ipv4_acl_mgr[267]: %ACL-IPV4_ACL-6-IPACCESSLOGP : access-list HOST1-ACL (20) deny tcp y.y.y.y(46387) -> z.z.z.z(23), 1 packet

So i didn't manage to telnet into vtys 11-20, because my telnet session was dropped by HOST1-ACL. Is this another bug? Who knows...

And when i thought i had met every possible issue, i also found out that vty ACLs are useless for ssh sessions, because...

3) ssh sessions get established before hitting the vty ACLs

Yep, that's another shock (3rd in a row). When you open an ssh session to an IOS-XR router, the vty (the one that the ssh session will use) is consumed regardless of your vty ACL. That means that the vty is occupied during the whole time the router is waiting for you to enter your password. It's only after you enter your password that you get disconnected because of the vty ACL. And that's a nice way to DoS an IOS-XR router.


%SECURITY-SSHD-6-INFO_GENERAL : Incoming SSH session rate limit exceeded
%SECURITY-SSHD-3-ERR_GENERAL : Failed to allocate pty


Note: the same happens with telnet, but since the username is requested after the ACL check, the time the telnet session remains open is limited.

But wait; isn't that supposed to be solved by Management Plane Protection (MPP)? Sure it is, but...

4) MPP configuration doesn't support ACLs

Who would have thought of that! MPP configuration expects you to configure hosts and networks in a Juniper kind of way (although Juniper allows you to reuse the "clients" section).


RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv4 ?
  A.B.C.D         Enter IPv4 address
  A.B.C.D/length  Enter IPv4 address with prefix

RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv6 ?
  X:X::X         Enter IPv6 address
  X:X::X/length  Enter IPv6 address with prefix


So, if you happen to have already defined ACLs for your NMS/OSS/whatever, which are already being used somewhere else, you can't reuse those ones, but you have to reconfigure all hosts and networks under the MPP section (something that makes mass router config changes even more difficult). You can't even reuse the same hosts/networks under different interfaces!

!
control-plane
 management-plane
  inband
   !
   interface GigabitEthernet0/3/0/0
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !
   interface GigabitEthernet0/3/0/1
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !

And that's surely a nice way to further "expand" your configuration (not to mention BGP dynamic neighbors that are not supported either, but that's another story).


That's 4 in a row Cisco. Bingo!!!

Note: Many thanks to Arie for helping me with the 2nd issue (once again).


Question to the public:

Is there a character in IOS-XR that fully resembles "!" as a starting comment indicator, like in IOS?

IOS


router(config-line)#login authentication BACKUP-AAA ! backup
router(config-line)#

IOS-XR


RP/0/RSP0/CPU0:router(config-line)#login authentication BACKUP-AAA ! backup
                                                                   ^
% Invalid input detected at '^' marker.

In IOS-XR, "!" works only when it is the first character in the line.

Friday, June 3, 2011

Debugging IPv6 MTU issues in Windows

A common problem you might face soon (World IPv6 Day is 5 days away) is reachability to IPv6 sites due to MTU issues. ICMPv6 has a nice internal mechanism which is supposed to help the application overcome these issues, but like in the IPv4 world, not everything is perfect.

Let's suppose that an IPv6 subscriber is using a DSL router and is connected through PPPoE to a BRAS.

TARGET <=> BRAS <=> DSL-ROUTER <=> HOST

The usual MTU for PPPoE connections is 1492 bytes, as shown below.

1500 bytes = Ethernet Payload
-     6 bytes = PPPoE header
-     2 bytes = PPP ID
---------------------------------
1492 bytes = IPv6 Packet that can be carried over a PPPoE connection

If your host is configured with 1492 (or something lower) as MTU on its LAN interface, then the OS running on it will automatically take care of "fragmentation", so you don't need to worry about anything. Unfortunately this isn't a common scenario by default. You either have to configure it manually on the host or, if you are lucky enough and the DSL modem supports advertisement of the MTU on its LAN interface through RA messages (and your host accepts them), it will happen automatically.

If your host is configured with anything larger than 1492 on its LAN interface (in most cases it's the default of 1500), problems might arise.

Users with hosts running Windows can try to ping an IPv6 address (e.g. the next hop after the DSL router) in order to find possible issues with the MTU. The closer the target is, the easier it will be to troubleshoot the problem. Then you start moving towards the target until you meet the issue.

First, some numbers you will need regarding the various headers

1492 bytes = IPv6 Packet
-  40 bytes = IPv6 Header
-   8 bytes = ICMPv6 Header
-------------------------------
1444 bytes = ICMPv6 payload data

Since Windows ping uses the actual payload as a size, if you want to send a total of 1492 bytes, you have to send 1492-40-8=1444 bytes of ICMPv6 payload data. Anything larger will lead to either a problem or to fragmentation.
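
The same arithmetic as a small helper, in case you want the right `-l` value for other MTUs. The header sizes are the ones listed above; the function names are only illustrative.

```python
IPV6_HEADER = 40      # bytes
ICMPV6_HEADER = 8     # bytes
PPPOE_OVERHEAD = 6 + 2  # PPPoE header + PPP ID, bytes

def pppoe_mtu(ethernet_payload=1500):
    """Largest IPv6 packet that fits into one PPPoE frame."""
    return ethernet_payload - PPPOE_OVERHEAD

def max_ping_payload(mtu):
    """Largest value for 'ping -l' that still fits into one packet of this MTU."""
    return mtu - IPV6_HEADER - ICMPV6_HEADER

for mtu in (1500, pppoe_mtu(), 1280):
    print(mtu, max_ping_payload(mtu))
```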

Windows>ping -l 1444 x:x::x

Pinging target [x:x:xx] with 1444 bytes of data:

Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=51ms
Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 51ms, Maximum = 54ms, Average = 52ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices


1444 bytes ICMP request from HOST to TARGET


If you increase the above number, you'd better start looking for "Too big" ICMPv6 received messages from any hop towards the target, otherwise you are in trouble.

For example, if you ping with 1446 bytes of data, you get the following:

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Packet needs to be fragmented but DF set.
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=57ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),
Approximate round trip times in milli-seconds:
    Minimum = 53ms, Maximum = 57ms, Average = 55ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices (fragmentation included)

1446 bytes ICMP request from HOST to TARGET
ICMP reply ("Too big") from DSL-ROUTER to HOST (original truncated message included)

As you can see, DSL-ROUTER replies to the first packet with a "Too Big" message and informs the HOST about the MTU (1492) supported on the next-hop link (see RFC 4443 for ICMPv6 info); that's the WAN link towards the BRAS, where PPPoE is running.

If you are in the unfortunate position to not get any incoming packets, you can safely assume (if everything else is fine) that someone in the path is blocking ICMPv6 messages.

The reply message is exactly 1280 bytes, which is the minimum MTU that every IPv6 link must support. This leads to the original message being truncated in the reply message to 1280-40=1240 bytes for the ICMPv6 packet, or 1240-8-40-8=1184 bytes for the actual payload data. So you lose 1446-1184=262 bytes of payload data in the reply message.

Next packets get a successful answer from the target, because they are sent as fragmented (1432+14 bytes).

1432 bytes ICMP request from HOST to TARGET

14 bytes ICMP request from HOST to TARGET
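
The 1432+14 split follows from the fragmentation rules: each fragment carries a 40-byte IPv6 header plus an 8-byte Fragment header, and every fragment except the last must carry a multiple of 8 bytes. The sketch below assumes the sender fills each fragment as much as possible, which is what the capture above shows for Windows; other stacks may split differently. The function name is made up.

```python
IPV6_HEADER = 40
FRAGMENT_HEADER = 8
ICMPV6_HEADER = 8

def fragment_sizes(ping_payload, mtu):
    """Bytes of the ICMPv6 message (header + data) carried by each packet."""
    message = ICMPV6_HEADER + ping_payload
    if IPV6_HEADER + message <= mtu:
        return [message]  # fits as-is, no Fragment header needed
    # Room per fragment, rounded down to a multiple of 8 bytes.
    room = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    while message > 0:
        take = min(room, message)
        sizes.append(take)
        message -= take
    return sizes

sizes = fragment_sizes(1446, 1492)
# The first fragment also carries the 8-byte ICMPv6 header.
data_per_fragment = [sizes[0] - ICMPV6_HEADER] + sizes[1:]
print(sizes, data_per_fragment)
```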

Windows is "smart" enough to keep track of this status for some minutes (in the so called destination or route cache), so next time you send large packets, the first packet is not lost, because fragmentation happens right away.

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=52ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 52ms, Maximum = 55ms, Average = 53ms




Imho, it's better to make your host use the appropriate MTU from the beginning (e.g. hardcode 1492 or use the RA's value) and not depend on ICMPv6 messages to do fragmentation. Some people have proposed to always use the minimum of 1280 (Geoff Huston, Tore Anderson), in order to be safe in every possible case (tunnels involved). I generally prefer to use the maximum possible, hoping that someone in the middle won't mess with ICMPv6 messages. I know that currently this is not the case (so stick with something lower, like 1280, for now), but this will probably change as native IPv6 gets deployed. Unless we start filtering ICMPv6 messages uncontrollably...like many do on IPv4. Does "Internet Control Message Protocol" say anything to you?

Notes

1) RFC 1981 describes Path MTU Discovery (PMTUD) for IPv6.
2) RFC 4821 will help a lot in PMTUD, when and if all vendors start implementing it.
3) In order to see clearly the fragmented IPv6 packets in Wireshark, you have to disable reassembly in preferences.
4) You can use the commands "ipv6 rc" and "ipv6 rcf" in order to view and clear the destination/route cache in WindowsXP

Monday, May 23, 2011

To forward, to peer, or to tunnel?

In an imaginary Cisco world every device would be able to talk with every other device in various layers. In the actual Cisco world, some devices can talk to some devices, while they can't talk to some other devices.

I'm talking specifically about L2 Control Protocols (L2CPs), when these need to be exchanged between different devices in order to support a requirement (e.g. create a spanning-tree loop). Cisco's L2 Protocol Tunneling (L2PT) can help in accomplishing some of these cases.

So let's suppose you have a scenario like the following.


When using the simplest form of devices (L2 switches like 3750), you can just tunnel the L2CPs between devices S1 and S2 and everything will be fine. Spanning tree running on devices C1-C4 will see a loop and will block a port depending on various parameters (priority, cost, etc).

As you move ahead and start to replace the S1 or S2 device with another (usually better), you realize that the new device supports a different way of handling L2CPs, which might be "incompatible" with the old way.

Generally, you can do the following actions on L2CPs as they enter a port:

forward: frame is forwarded to another device without any change (no local processing takes place)
drop: frame is dropped
peer: frame is processed/terminated locally
tunnel: frame is tunneled to another device after changing the destination mac address (L2PT)

Tunneling is quite common in scenarios like the above, where you need to pass the L2 frame across a L2 domain, without having the intermediate devices act upon it.

You can also achieve the same result with forwarding, as long as you don't have a native L2 domain in between, because you might end up mixing local protocols with protocols that just pass over.

It's obvious that you cannot have tunneling on one side and forwarding on the other side, because the exchanged frames won't be able to "talk" to each other. For example, for STP one side will tunnel the frame by changing the destination mac address from 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00) to 01-00-0c-cd-cd-d0, while the other side will just forward the frame by keeping the original destination address of 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00).
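
A toy model of the mismatch, using the mac addresses mentioned above. It only looks at the destination address each side puts on the wire for an STP frame and treats the two sides as compatible when those addresses match; everything else (VLAN handling, thresholds, what "peer" does locally) is left out, and the function names are made up.

```python
STP_MACS = {"01-00-0c-cc-cc-cd", "01-80-c2-00-00-00"}
L2PT_MAC = "01-00-0c-cd-cd-d0"

def mac_in_transit(original_dst, action):
    """Destination mac of the frame as it leaves the device, or None if nothing leaves."""
    if action in ("drop", "peer"):
        return None  # dropped, or terminated locally
    if action == "tunnel" and original_dst in STP_MACS:
        return L2PT_MAC  # L2PT rewrites the destination address
    return original_dst  # forward: frame leaves unchanged

def sides_compatible(action_a, action_b, original_dst="01-80-c2-00-00-00"):
    """True when both sides send the frame with the same destination mac."""
    a = mac_in_transit(original_dst, action_a)
    b = mac_in_transit(original_dst, action_b)
    return a is not None and a == b

for pair in (("tunnel", "tunnel"), ("forward", "forward"), ("tunnel", "forward")):
    print(pair, sides_compatible(*pair))
```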

Below you'll find a list with all available options regarding the handling of L2CPs on some known platforms:

Device     Interface            forward                 drop             peer             tunnel
3750       L2 switchport        -                       -                -                l2protocol-tunnel
ME-3400    L2 switchport        -                       -                -                l2protocol-tunnel
ME-3800X   L2 switchport        -                       l2protocol drop  l2protocol peer  -
ME-3800X   L2 service instance  l2protocol forward (1)  -                l2protocol peer  l2protocol tunnel
7600/67xx  L2 switchport        -                       -                -                l2protocol-tunnel
7600/ES    L2 switchport        -                       -                -                l2protocol-tunnel
7600/ES    L3                   -                       l2protocol drop  l2protocol peer  -
7600/ES    L3 service instance  l2protocol forward      -                -                -
ASR9000    L2 transport         by default (2)          -                -                l2protocol tunnel

As you can see, you cannot have L2 communication between a service instance on a 7600/ES and one of the smaller platforms, because 7600/ES doesn't support tunneling and the smaller platforms do not support forwarding. Actually, the biggest surprise to me was the lack of support of L2PT on the 7600 with the ES cards when using service instances. I had the impression that this would be the most feature rich platform.

Cisco's proposal is to use the same platform for such scenarios, because they haven't verified anything else and some platforms were built to be used in specific ways. So instead of supporting the same feature (L2PT was their idea after all) along the range of platforms, you should always replace them in pairs. And if by accident, you happen to have more S devices serving many overlapping rings, then you have to replace all of them.

I would prefer, instead of promoting new platforms or new designs, to focus on fixing the existing platforms, so they can cooperate with each other. After all, if a platform is good enough, it will get its share in the market.

Also, the online documentation is quite incomplete on this area. You have to guess what will happen in most cases. We had to open 3 different cases and involve our account team in order to clarify things and push for fixing the documentation. Not surprisingly enough, the peering functionality is another mess. I'll probably need to write another post describing all available options (which lead to different behavior) on these platforms.


Notes

1) "l2protocol forward" on ME-3800X will become available in the next major release. Thanks to Cisco for giving us the chance to try it earlier.
2) This is the default behavior according to Xander's doc here.
3) Arie asked me to add some extra information about PW/MST/REP/PVST-AG (and all these L2 HA) scenarios. I'll try to write a new post as soon as i find enough free time to test them.

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Greece License.