Saturday, November 12, 2011

aggregate-address ... summary-only-after-a-while

As it seems, there is always something that you think you know, until it's proven the other way around.

Some years ago, when i was studying for the CCIE, i knew that in order to suppress more specific routes from an aggregate advertisement in BGP, you could use the "aggregate-address .... summary-only" command. And i believed it until recently.

Let's suppose you have the following config on an ASR1k (10.1.1.1) running 15.1(2)S2:

router bgp 100
 bgp router-id 10.1.1.1
 neighbor 10.2.2.2 remote-as 100
 neighbor 10.2.2.2 update-source Loopback0
...
 address-family ipv4
  aggregate-address 10.10.10.0 255.255.255.0 summary-only
  redistribute connected
  neighbor 10.2.2.2 activate
...

Then you have 2 subscribers logging in.
With "show bgp" the 2 /32 routes under 10.10.10.0/24 seem to be suppressed and the /24 is in the BGP table as it should be:

*> 10.10.10.0/24    0.0.0.0                            32768 i
s> 10.10.10.3/32    0.0.0.0                  0         32768 ?
s> 10.10.10.4/32    0.0.0.0                  0         32768 ?

but doing some "debug bgp updates/events" reveals the following:

BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.3/32, next 10.1.1.1, metric 0, path Local
BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.4/32, next 10.1.1.1, metric 0, path Local

...and after a while:

BGP: aggregate timer expired

BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.3/32
BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.4/32

At the same time on the peer router (10.2.2.2) you can see the above 2 /32 routes being received:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32
*>i10.10.10.3/32    10.1.1.1            0    100      0 ?
*>i10.10.10.4/32    10.1.1.1            0    100      0 ?

and immediately afterwards you can see the 2 /32 routes being withdrawn:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32

Cisco is, as usual, a little bit contradictory about this behavior.


According to Cisco TAC, the default aggregation logic runs every 30 seconds, but BGP update processing is done almost every 2 seconds. That's the reason the more specific route is advertised initially and withdrawn later (due to the aggregation processing following the initial update). They also admit that the root cause of this problem is in the BGP code. The route will be advertised as soon as the best path calculation is completed. It may take 30 seconds or more for the aggregation logic to complete and withdraw the more specific route.
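The timing TAC describes can be sketched as a toy model in Python. This is only an illustration under loud assumptions: both timers are treated as strictly periodic and aligned at t=0, the function name `leak_window` and its parameters are made up for this sketch, and the real IOS scheduler (which also depends on CPU load) is certainly more complex.

```python
import math


def leak_window(route_time, aggregate_interval=30.0, update_interval=2.0):
    """Toy model: return (advertised_at, withdrawn_at) for a more specific
    route that appears at route_time, or None if it never leaks.

    Assumption: update processing runs every update_interval seconds and the
    aggregation logic every aggregate_interval seconds, both starting at t=0.
    An aggregate_interval of 0 stands for immediate suppression.
    """
    if aggregate_interval == 0:
        return None  # suppression happens right away, nothing is advertised
    # First update run at or after the route shows up.
    advertised_at = math.ceil(route_time / update_interval) * update_interval
    # First aggregation run strictly after the route shows up.
    withdrawn_at = (math.floor(route_time / aggregate_interval) + 1) * aggregate_interval
    if withdrawn_at <= advertised_at:
        return None  # aggregation ran first, so the route was already suppressed
    return advertised_at, withdrawn_at


if __name__ == "__main__":
    print(leak_window(5.0))                        # leaks for a while
    print(leak_window(5.0, aggregate_interval=0))  # never leaks
```

In this model a subscriber route that appears 5 seconds after an aggregation run is visible to the peer from t=6 until t=30, which is the kind of window seen in the debugs above.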

Then we also have the following:

Bug CSCsu96698

BGP: /32 route being advertised while 'summary-only' is configured

Symptoms: More specific routes are advertised and withdrawn later even if config aggregate-address net mask summary-only is configured. The BGP table shows the specific prefixes as suppressed with s>
Conditions: This occurs only with very large configurations.
Workaround: Configure a distribute-list in BGP process that denies all of the aggregation child routes.

Related Bug Information

It takes 30 seconds for BGP to form aggregate route

Symptom: for approximately 30 seconds router announces specific prefixes instead of aggregate route
Conditions: bgp session up/down
Workaround: unknown yet


Release notes of 12.2SB and 12.0S

The periodic function is by default called at 60 second intervals. The aggregate processing is normally done based on the CPU load. If there is no CPU load, then the aggregate processing function would be triggered within one second. As the CPU load increases, this function call will be triggered at higher intervals and if the CPU load is very high it could go as high as the maximum aggregate timer value configured via command. By default this maximum value is 30 seconds and is configurable with a range of 6-60 seconds and in some trains 0. So, if default values are configured, then as the CPU load increases, the chances of hitting this defect is higher.


Release notes of 12.2(33)SXH6

CLI change to bgp aggregate-timer command to suppress more specific routes.

Old Behavior: More specific routes are advertised and withdrawn later, even if aggregate-address
summary-only is configured. The BGP table shows the specific prefixes as suppressed.

New Behavior: The bgp aggregate-timer command now accepts the value of 0 (zero), which
disables the aggregate timer and suppresses the routes immediately.


Command Reference for "bgp aggregate-timer"

To set the interval at which BGP routes will be aggregated or to disable timer-based route aggregation, use the bgp aggregate-timer command in address-family or router configuration mode. To restore the default value, use the no form of this command.

bgp aggregate-timer seconds
no bgp aggregate-timer

Syntax Description

seconds

Interval (in seconds) at which the system will aggregate BGP routes.

•The range is from 6 to 60 or else 0 (zero). The default is 30.
•A value of 0 (zero) disables timer-based aggregation and starts aggregation immediately.

Command Default

30 seconds

Usage Guidelines

Use this command to change the default interval at which BGP routes are aggregated.

In very large configurations, even if the aggregate-address summary-only command is configured, more specific routes are advertised and later withdrawn. To avoid this behavior, configure the bgp aggregate-timer to 0 (zero), and the system will immediately check for aggregate routes and suppress specific routes.


The interesting part is that the command reference for "aggregate-address ... summary-only" doesn't mention anything about this behavior in order to warn you.

Using the summary-only keyword not only creates the aggregate route (for example, 192.*.*.*) but also suppresses advertisements of more-specific routes to all neighbors. If you want to suppress only advertisements to certain neighbors, you may use the neighbor distribute-list command, with caution. If a more-specific route leaks out, all BGP or mBGP routers will prefer that route over the less-specific aggregate you are generating (using longest-match routing).
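Why a leaked /32 matters can be shown with a small longest-match sketch in Python. It only uses the standard `ipaddress` module. The function name `best_match` is hypothetical, and the prefixes are the ones from the example above.

```python
import ipaddress


def best_match(destination, prefixes):
    """Return the most specific prefix (as a string) that contains destination,
    or None if no prefix matches. This is plain longest-prefix matching."""
    addr = ipaddress.ip_address(destination)
    matching = [
        net for net in (ipaddress.ip_network(p) for p in prefixes)
        if addr in net
    ]
    if not matching:
        return None
    return str(max(matching, key=lambda net: net.prefixlen))


if __name__ == "__main__":
    # While the /32 is leaked, the peer prefers it over the aggregate.
    print(best_match("10.10.10.3", ["10.10.10.0/24", "10.10.10.3/32"]))
    # After the withdrawal only the aggregate is left.
    print(best_match("10.10.10.3", ["10.10.10.0/24"]))
```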


The following debug logs show the default aggregate-timer which is 30 secs, vs the default BGP scan timer which is 60 secs:

Nov 12 21:45:32.468: BGP: Performing BGP general scanning
Nov 12 21:45:38.906: BGP: aggregate timer expired
Nov 12 21:46:09.637: BGP: aggregate timer expired
Nov 12 21:46:32.487: BGP: Performing BGP general scanning
Nov 12 21:46:40.379: BGP: aggregate timer expired
Nov 12 21:47:11.099: BGP: aggregate timer expired
Nov 12 21:47:32.506: BGP: Performing BGP general scanning
Nov 12 21:47:41.828: BGP: aggregate timer expired
Nov 12 21:48:12.547: BGP: aggregate timer expired
Nov 12 21:48:32.525: BGP: Performing BGP general scanning
Nov 12 21:48:43.268: BGP: aggregate timer expired
Nov 12 21:49:13.989: BGP: aggregate timer expired
Nov 12 21:49:32.544: BGP: Performing BGP general scanning
Nov 12 21:49:44.765: BGP: aggregate timer expired
Nov 12 21:50:15.510: BGP: aggregate timer expired

Guess what! After changing the aggregate-timer to 0, the cpu load increases by a steady +10%, due to the BGP Router process!

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 17%/6%; one minute: 16%; five minutes: 15%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 61   329991518  1083305044        304  4.07%  4.09%  3.93%   0 IOSD ipc task
340   202049403    53391862       3784  1.35%  2.10%  2.11%   0 VTEMPLATE Backgr
404    84594181  2432529294          0  1.19%  1.18%  1.14%   0 PPP Events
229    49275197     1710570      28806  0.71%  0.33%  0.39%   0 QoS stats proces
152    39536838   801801056         49  0.63%  0.56%  0.51%   0 SSM connection m
159    51982155   383585236        135  0.47%  0.66%  0.64%   0 SSS Manager

ASR1k#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
ASR1k(config)#router bgp 100
ASR1k(config-router)# address-family ipv4
ASR1k(config-router-af)# bgp aggregate-timer 0
ASR1k(config-router-af)#^Z

...and after a while:

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 29%/6%; one minute: 26%; five minutes: 25%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
394   143774989    19776654       7269  9.91%  7.93%  7.32%   0 BGP Router
 61   329983414  1083280872        304  4.79%  3.97%  3.90%   0 IOSD ipc task
340   202044852    53390504       3784  2.71%  2.22%  2.04%   0 VTEMPLATE Backgr
404    84591714  2432460324          0  1.03%  1.16%  1.09%   0 PPP Events
159    51980758   383575060        135  0.79%  0.66%  0.62%   0 SSS Manager
152    39535734   801779715         49  0.63%  0.54%  0.50%   0 SSM connection m

Conclusions

1) By default, the "aggregate-address ... summary-only" command doesn't immediately stop the announcement of more specific routes, as it's supposed to. You need to also change the BGP aggregate-timer to 0.
2) After changing the BGP aggregate-timer to 0, the announcement of more specific routes stops, but the cpu load increases by 10%.


C'mon Cisco! You gave us NH Address Tracking and PIC, and you can't fix the aggregation process?

Saturday, September 17, 2011

AAA and VTYs in IOS-XR : Bingo

Continuing on the IOS-XR saga, this is the newest bunch of things that don't "work as expected" (© Cisco). Well, as expected by me, not by Cisco.

Everything started while trying to configure a primary and backup aaa login method on an ASR9k, when i realized that...

1) having a backup aaa login method with the same tacacs servers as the ones in the primary aaa login method (which is using the management vrf) doesn't work

Imagine the following aaa configuration:

!
tacacs source-interface MgmtEth0/RSP0/CPU0/0 vrf MGMT
tacacs source-interface Loopback0 vrf default
!
aaa group server tacacs+ TACACS-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
!
aaa group server tacacs+ TACACS-VRF-AAA-GROUP
 server x.x.x.x
 server y.y.y.y
 vrf MGMT
!
aaa authentication login default group TACACS-VRF-AAA-GROUP group TACACS-AAA-GROUP local
!

This is supposed to work in the following way:

As long as at least one mgmt interface is up (i'm using a virtual-ip for the mgmt interfaces), tacacs communication should happen through the out-of-band mgmt interfaces. If all mgmt interfaces are down, then tacacs communication should happen through an inband interface.

Guess what! There seems to be an issue with the above scenario, because in the 2nd case (where all mgmt interfaces are down) tacacs communication doesn't happen at all. Looking at the debugs, it's like the router isn't even trying to use the second (global) tacacs group. This has already been opened as SR (according to tac this should work, so let's hope it's just a bug), so i'm waiting for developers' feedback right now.

In order to overcome the above problem, i thought of using different vty templates, each one with a different access method.

In IOS you can have the following vty configuration and then access vtys 11-15 by either using "telnet x.x.x.x 3001" or "telnet x.x.x.x 2000+y" where y is the tty number displayed by using the command "show line".

!
line vty 11 15
 login authentication BACKUP-AAA
 rotary 1
!

Since the "rotary" command is not supported in IOS-XR, this is what you can do:

!
line default
 login authentication default
!
line template VTY-TEMPLATE
 login authentication BACKUP-AAA
!
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template VTY-TEMPLATE
!

And this is the point you realize that you can't choose a vty, because...

2) specific vtys can be accessed only through a combination of a line template and a specific ACL

First shock: You cannot easily access a specific vty line in IOS-XR. Vtys in IOS-XR work in a very different way in comparison to the IOS ones. According to the BU, when you do a telnet/ssh to the router, the router starts a scanning from the first vty (0) to the last vty (including all custom configured ones). When a free (available) vty is found, the vty ACL is checked in order to verify whether its permit conditions are met. If the vty ACL allows this specific access, then the session is opened.

Second shock: If the vty ACL doesn't allow access, then scanning for free vtys continues until one vty is found that has an ACL that allows this specific access. So, the only way to access a specific vty is to apply a specific and unique ACL under that vty that allows e.g. your source ip. In order to access another vty, you'll have to use another source ip, and so on. Still wondering why Cisco chose such an implementation.
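The scanning logic, as the BU describes it, can be sketched in Python as follows. This is a model of the description only, not of the actual IOS-XR code. The names `Vty` and `select_vty` and the 192.0.2.x source addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Vty:
    number: int
    busy: bool
    allowed_sources: Set[str]  # stands in for the permit entries of the vty ACL


def select_vty(vtys: List[Vty], source_ip: str) -> Optional[int]:
    """Scan the vtys from the lowest number to the highest and return the first
    free one whose ACL permits source_ip, or None if there is no such vty."""
    for vty in sorted(vtys, key=lambda v: v.number):
        if vty.busy:
            continue  # vty already in use, keep scanning
        if source_ip in vty.allowed_sources:
            return vty.number  # free and permitted: the session opens here
        # free but denied by the ACL: keep scanning the next vty
    return None


if __name__ == "__main__":
    # Default pool 0-10 permits one host, custom pool 11-20 permits another.
    pool = [Vty(n, False, {"192.0.2.1"}) for n in range(0, 11)]
    pool += [Vty(n, False, {"192.0.2.2"}) for n in range(11, 21)]
    print(select_vty(pool, "192.0.2.2"))
```

If the router really behaved like this model, the second host would land on the first vty of the custom pool.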

So i tried the following:

!
line default
 login authentication default
 access-class ingress HOST1-ACL
 transport input telnet ssh
!
line template LINE-TEMPLATE
 login authentication BACKUP-AAA
 access-class ingress HOST2-ACL
 transport input telnet ssh
!
vty-pool default 0 10
vty-pool VTY-POOL 11 20 line-template LINE-TEMPLATE
!
ipv4 access-list HOST1-ACL
 10 permit ipv4 host x.x.x.x any
 20 deny ipv4 any any log
!
ipv4 access-list HOST2-ACL
 10 permit ipv4 host y.y.y.y any
 20 deny ipv4 any any log
!

...and this is what i got when i tried to telnet from HOST2 to the router


HOST2$ telnet router
Trying z.z.z.z...
Connected to router.
Escape character is '^]'.
Connection to router closed by foreign host.

ipv4_acl_mgr[267]: %ACL-IPV4_ACL-6-IPACCESSLOGP : access-list HOST1-ACL (20) deny tcp y.y.y.y(46387) -> z.z.z.z(23), 1 packet

So i didn't manage to telnet into vtys 11-20, because my telnet session was dropped by HOST1-ACL. Is this another bug? Who knows...

And when i thought i had met every possible issue, i also found out that vty ACLs are useless for ssh sessions, because...

3) ssh sessions get established before hitting the vty ACLs

Yeap, that's another shock (3rd in a row). When you open an ssh session to an IOS-XR router, the vty (the one that the ssh session will use) is consumed regardless of your vty ACL. That means that the vty is occupied during the whole time the router is waiting for you to enter your password. It's only after you enter your password that you get disconnected because of the vty ACL. And that's a nice way to DoS attack an IOS-XR router.


%SECURITY-SSHD-6-INFO_GENERAL : Incoming SSH session rate limit exceeded
%SECURITY-SSHD-3-ERR_GENERAL : Failed to allocate pty


Note: the same happens with telnet, but since the username is asked after the ACL check, the time the telnet session remains open is limited.

But wait; isn't that supposed to be solved by Management Plane Protection (MPP)? Sure it is, but...

4) MPP configuration doesn't support ACLs

Who would have thought of that! MPP configuration expects you to configure hosts and networks in a Juniper kind of way (although Juniper allows you to reuse the "clients" section).


RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv4 ?
  A.B.C.D         Enter IPv4 address
  A.B.C.D/length  Enter IPv4 address with prefix

RP/0/RSP0/CPU0:router(config-telnet-peer)#address ipv6 ?
  X:X::X         Enter IPv6 address
  X:X::X/length  Enter IPv6 address with prefix


So, if you happen to have already defined ACLs for your NMS/OSS/whatever, which are already being used somewhere else, you can't reuse those ones, but you have to reconfigure all hosts and networks under the MPP section (something that makes mass router config changes even more difficult). You can't even reuse the same hosts/networks under different interfaces!

!
control-plane
 management-plane
  inband
   !
   interface GigabitEthernet0/3/0/0
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !
   interface GigabitEthernet0/3/0/1
    allow SSH peer
     address ipv6 2001:db8::69/64
    !
   !

And that's surely a nice way to further "expand" your configuration (not to mention BGP dynamic neighbors that are not supported either, but that's another story).


That's 4 in a row Cisco. Bingo!!!

Note: Many thanks to Arie for helping me with the 2nd issue (once again).


Question to the public:

Is there a character in IOS-XR that fully resembles "!" as a starting comment indicator, like in IOS?

IOS


router(config-line)#login authentication BACKUP-AAA ! backup
router(config-line)#

IOS-XR


RP/0/RSP0/CPU0:router(config-line)#login authentication BACKUP-AAA ! backup
                                                                   ^
% Invalid input detected at '^' marker.

In IOS-XR, "!" works only when it is the first character in the line.

Friday, June 3, 2011

Debugging IPv6 MTU issues in Windows

A common problem you might face soon (World IPv6 Day is 5 days away) is reachability to IPv6 sites due to MTU issues. ICMPv6 has a nice internal mechanism which is supposed to help the application overcome these issues, but like in the IPv4 world, not everything is perfect.

Let's suppose that an IPv6 subscriber is using a DSL router and is connected through PPPoE to a BRAS.

TARGET <=> BRAS <=> DSL-ROUTER <=> HOST

The usual MTU for PPPoE connections is 1492 bytes, as shown below.

1500 bytes = Ethernet Payload
-     6 bytes = PPPoE header
-     2 bytes = PPP ID
---------------------------------
1492 bytes = IPv6 Packet that can be carried over a PPPoE connection

If your host is configured with 1492 (or something lower) as MTU on its LAN interface, then the OS running on it will automatically take care of "fragmentation", so you don't need to worry about anything. Unfortunately this isn't a common scenario by default. You either have to configure it manually on the host or, if you are lucky enough and the DSL modem supports advertisement of the MTU on its LAN interface through RA messages (and your host accepts them), it will happen automatically.

If your host is configured with anything larger than 1492 on its LAN interface (in most cases it's the default of 1500), problems might arise.

Users with hosts running Windows can try to ping an IPv6 address (i.e. the next hop after the DSL router) in order to find possible issues with the MTU. The closer the target is, the easier it will be to troubleshoot the problem. Then you start moving towards the target until you meet the issue.

First, some numbers you will need regarding the various headers

1492 bytes = IPv6 Packet
-  40 bytes = IPv6 Header
-   8 bytes = ICMPv6 Header
-------------------------------
1444 bytes = ICMPv6 payload data

Since Windows ping uses the actual payload as a size, if you want to send a total of 1492 bytes, you have to send 1492-40-8=1444 bytes of ICMPv6 payload data. Anything larger will lead to either a problem or to fragmentation.
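The same arithmetic as a small Python helper, in case you want to try other MTU values. The function and constant names are just illustrative.

```python
IPV6_HEADER = 40        # fixed IPv6 header, in bytes
ICMPV6_HEADER = 8       # ICMPv6 echo request header, in bytes
PPPOE_OVERHEAD = 6 + 2  # PPPoE header + PPP protocol ID, in bytes


def pppoe_mtu(ethernet_payload=1500):
    """Largest IPv6 packet that fits in one PPPoE frame."""
    return ethernet_payload - PPPOE_OVERHEAD


def max_ping_payload(mtu):
    """Largest value for 'ping -l' that still fits in one packet of size mtu."""
    return mtu - IPV6_HEADER - ICMPV6_HEADER


if __name__ == "__main__":
    mtu = pppoe_mtu()
    print(mtu, max_ping_payload(mtu))  # 1492 1444
```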

Windows>ping -l 1444 x:x::x

Pinging target [x:x:xx] with 1444 bytes of data:

Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=51ms
Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 51ms, Maximum = 54ms, Average = 52ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices


1444 bytes ICMP request from HOST to TARGET


If you increase the above number, you'd better start looking for "Too big" ICMPv6 received messages from any hop towards the target, otherwise you are in trouble.

i.e. if you ping with 1446 bytes of data, you get the following:

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Packet needs to be fragmented but DF set.
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=57ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 3, Lost = 1 (25% loss),
Approximate round trip times in milli-seconds:
    Minimum = 53ms, Maximum = 57ms, Average = 55ms

These are the relevant Wireshark captures.

The ICMP conversation between all involved devices (fragmentation included)

1446 bytes ICMP request from HOST to TARGET
ICMP reply ("Too big") from DSL-ROUTER to HOST (original truncated message included)

As you can see, device DSL-ROUTER is replying with "Too Big" message in the first packet to the HOST and informs it about the MTU (1492) supported in the next-hop link (see RFC 4443 for ICMPv6 info); that's the WAN link towards the BRAS, where PPPoE is running on.

If you are in the unfortunate position to not get any incoming packets, you can safely assume (if everything else is fine) that someone in the path is blocking ICMPv6 messages.

The reply message is exactly 1280 bytes, which is the minimum MTU that every IPv6 link must support. This leads to the original message being truncated in the reply message to 1280-40=1240 bytes for the ICMPv6 packet or 1240-8-40-8=1184 bytes for the actual payload data. So you lose 1446-1184=262 bytes of payload data in the reply message.

Next packets get a successful answer from the target, because they are sent as fragmented (1432+14 bytes).
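The 1432+14 split can be reproduced with a short Python sketch. It assumes the sender fills the first fragment as much as possible. Every fragment except the last must carry a multiple of 8 bytes, and each fragment costs an extra 8-byte Fragment header. The function name `fragment_payloads` is hypothetical.

```python
IPV6_HEADER = 40      # fixed IPv6 header, in bytes
FRAGMENT_HEADER = 8   # IPv6 Fragment extension header, in bytes
ICMPV6_HEADER = 8     # ICMPv6 echo request header, in bytes


def fragment_payloads(ping_payload, mtu):
    """Return the number of ping payload bytes carried by each fragment.

    Assumption: the first fragment is filled up to the MTU, with the
    fragmentable part rounded down to a multiple of 8 bytes.
    """
    fragmentable = ICMPV6_HEADER + ping_payload  # what has to be split
    if IPV6_HEADER + fragmentable <= mtu:
        return [ping_payload]  # fits in one packet, no fragmentation
    per_fragment = (mtu - IPV6_HEADER - FRAGMENT_HEADER) // 8 * 8
    sizes = []
    offset = 0
    while offset < fragmentable:
        chunk = min(per_fragment, fragmentable - offset)
        sizes.append(chunk)
        offset += chunk
    sizes[0] -= ICMPV6_HEADER  # the ICMPv6 header travels in the first fragment
    return sizes


if __name__ == "__main__":
    print(fragment_payloads(1446, 1492))  # [1432, 14]
    print(fragment_payloads(1444, 1492))  # [1444]
```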

1432 bytes ICMP request from HOST to TARGET

14 bytes ICMP request from HOST to TARGET

Windows is "smart" enough to keep track of this status for some minutes (in the so called destination or route cache), so next time you send large packets, the first packet is not lost, because fragmentation happens right away.

Windows>ping -l 1446 x:x:xx

Pinging target [x:x:xx] with 1446 bytes of data:

Reply from x:x:xx: time=54ms
Reply from x:x:xx: time=53ms
Reply from x:x:xx: time=55ms
Reply from x:x:xx: time=52ms

Ping statistics for x:x:xx:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 52ms, Maximum = 55ms, Average = 53ms




Imho, it's better to make your host use the appropriate MTU from the beginning (i.e. hardcode 1492 or use RA's value) and not depend on ICMPv6 messages to do fragmentation. Some people have proposed to always use the minimum of 1280 (Geoff Huston, Tore Anderson), in order to be safe on every possible case (tunnels involved). I generally prefer to use the maximum possible, hoping that someone in the middle won't mess with ICMPv6 messages. I know that currently this is not the case (so stick with something lower, like 1280, for now), but this will probably change as native IPv6 gets deployed. Unless we start filtering ICMPv6 messages uncontrollably...like many do on IPv4. Does "Internet Control Message Protocol" say anything to you?

Notes

1) RFC 1981 describes Path MTU Discovery (PMTUD) for IPv6.
2) RFC 4821 will help a lot in PMTUD, when and if all vendors start implementing it.
3) In order to see clearly the fragmented IPv6 packets in Wireshark, you have to disable reassembly in preferences.
4) You can use the commands "ipv6 rc" and "ipv6 rcf" in order to view and clear the destination/route cache in WindowsXP

Monday, May 23, 2011

To forward, to peer, or to tunnel?

In an imaginary Cisco world every device would be able to talk with every other device in various layers. In the actual Cisco world, some devices can talk to some devices, while they can't talk to some other devices.

I'm talking specifically about L2 Control Protocols (L2CPs), when these need to be exchanged between different devices in order to support a requirement (i.e. create a spanning-tree loop). Cisco's L2 Protocol Tunneling (L2PT) can help in accomplishing some of these cases.

So let's suppose you have a scenario like the following.


When using the simplest form of devices (L2 switches like 3750), you can just tunnel the L2CPs between devices S1 and S2 and everything will be fine. Spanning tree running on devices C1-C4 will see a loop and will block a port depending on various parameters (priority, cost, etc).

As you move ahead and start to replace the S1 or S2 device with another (usually better), you realize that the new device supports a different way of handling L2CPs, which might be "incompatible" with the old way.

Generally, you can do the following actions on L2CPs as they enter a port:

forward: frame is forwarded to another device without any change (no local processing takes place)
drop: frame is dropped
peer: frame is processed/terminated locally
tunnel: frame is tunneled to another device after changing the destination mac address (L2PT)

Tunneling is quite common in scenarios like the above, where you need to pass the L2 frame across a L2 domain, without having the intermediate devices act upon it.

You can also achieve the same result with forwarding, as long as you don't have a native L2 domain in between, because you might end up mixing local protocols with protocols that just pass over.

It's obvious that you cannot have tunneling on one side and forwarding on the other side, because exchanged frames won't be able to "talk" to each other. E.g. for STP one side will tunnel the frame by changing the destination mac address from 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00) to 01-00-0c-cd-cd-d0, while the other side will just forward the frame by keeping the original destination address of 01-00-0c-cc-cc-cd (or 01-80-c2-00-00-00).
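The mismatch can be sketched in a few lines of Python. This is a deliberately simplified model: it only looks at the destination mac address that each side puts on the wire, and the function names are made up for the sketch.

```python
PVST_MAC = "01-00-0c-cc-cc-cd"   # PVST+ BPDU destination
STP_MAC = "01-80-c2-00-00-00"    # IEEE STP BPDU destination
L2PT_MAC = "01-00-0c-cd-cd-d0"   # L2PT tunnel destination


def core_dst_mac(action, original_mac):
    """Destination mac seen inside the L2 domain for a given ingress action."""
    if action == "tunnel":
        return L2PT_MAC       # L2PT rewrites the destination
    if action == "forward":
        return original_mac   # frame passes unchanged
    raise ValueError("only 'tunnel' and 'forward' put a frame on the wire")


def sides_interoperate(action_a, action_b, original_mac):
    """True if both edge devices use the same destination mac in the core."""
    return core_dst_mac(action_a, original_mac) == core_dst_mac(action_b, original_mac)


if __name__ == "__main__":
    print(sides_interoperate("tunnel", "tunnel", PVST_MAC))   # True
    print(sides_interoperate("tunnel", "forward", PVST_MAC))  # False
```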

Below you'll find a list with all available options regarding the handling of L2CPs on some known platforms:

Device     Interface            forward                 drop             peer             tunnel
3750       L2 switchport        -                       -                -                l2protocol-tunnel
ME-3400    L2 switchport        -                       -                -                l2protocol-tunnel
ME-3800X   L2 switchport        -                       l2protocol drop  l2protocol peer  -
ME-3800X   L2 service instance  l2protocol forward (1)  -                l2protocol peer  l2protocol tunnel
7600/67xx  L2 switchport        -                       -                -                l2protocol-tunnel
7600/ES    L2 switchport        -                       -                -                l2protocol-tunnel
7600/ES    L3                   -                       l2protocol drop  l2protocol peer  -
7600/ES    L3 service instance  l2protocol forward      -                -                -
ASR9000    L2 transport         by default (2)          -                -                l2protocol tunnel

As you can see, you cannot have L2 communication between a service instance on a 7600/ES and one of the smaller platforms, because 7600/ES doesn't support tunneling and the smaller platforms do not support forwarding. Actually, the biggest surprise to me was the lack of support of L2PT on the 7600 with the ES cards when using service instances. I had the impression that this would be the most feature rich platform.

Cisco's proposal is to use the same platform for such scenarios, because they haven't verified anything else and some platforms were built to be used in specific ways. So instead of supporting the same feature (L2PT was their idea after all) along the range of platforms, you should always replace them in pairs. And if by accident, you happen to have more S devices serving many overlapping rings, then you have to replace all of them.

I would prefer, instead of promoting new platforms or new designs, to focus on fixing the existing platforms, so they can cooperate with each other. After all, if a platform is good enough, it will get its share in the market.

Also, the online documentation is quite incomplete on this area. You have to guess what will happen in most cases. We had to open 3 different cases and involve our account team in order to clarify things and push for fixing the documentation. Not surprisingly enough, the peering functionality is another mess. I'll probably need to write another post describing all available options (which lead to different behavior) on these platforms.


Notes

1) "l2protocol forward" on ME-3800X will become available in the next major release. Thanks to Cisco for giving us the chance to try it earlier.
2) This is the default behavior according to Xander's doc here.
3) Arie asked me to add some extra information about PW/MST/REP/PVST-AG (and all these L2 HA) scenarios. I'll try to write a new post as soon as i find enough free time to test them.

Thursday, May 5, 2011

How Multi is MP-BGP in IOS-XR?

This caught me by surprise. I had the impression that IOS-XR, as an advanced operating system, would support all kinds of multi-protocol transferability over BGP.

As it seems, there is an issue when transferring IPv6 prefixes over an IPv4 peering or IPv4 prefixes over an IPv6 peering. This happens for sure on ASR9k running latest 4.1.0, but i haven't verified it on the CRS yet.

IPv4 prefixes over IPv6 peering
This doesn't seem to be supported based on the available configuration options.
What is even more worrying is that no other address family is supported either.

RP/0/RSP0/CPU0:ASR#conf t
RP/0/RSP0/CPU0:ASR(config)#router bgp 100
RP/0/RSP0/CPU0:ASR(config-bgp)#neighbor 2001::1:2:3
RP/0/RSP0/CPU0:ASR(config-bgp-nbr)#address-family ?
  ipv6  IPv6 Address Family

IPv6 prefixes over IPv4 peering
This is supported according to the configuration options, but it doesn't work.
Cisco also insists that this is definitely supported.

RP/0/RSP0/CPU0:ASR#conf t
RP/0/RSP0/CPU0:ASR(config)#router bgp 100
RP/0/RSP0/CPU0:ASR(config-bgp)#neighbor 10.11.254.37
RP/0/RSP0/CPU0:ASR(config-bgp-nbr)#address-family ?
  ipv4   IPv4 Address Family
  ipv6   IPv6 Address Family
  l2vpn  L2VPN Address Family
  vpnv4  VPNv4 Address Family
  vpnv6  VPNv6 Address Family

As soon as you enable the IPv6 address family under the IPv4 neighbor, the BGP session is dropped and it never comes up.

RP/0/RSP0/CPU0:ASR#sh bgp sum
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0000000   RD version: 1
BGP main routing table version 1
BGP scan interval 60 secs

BGP is operating in STANDALONE mode.


Process       RcvTblVer   bRIB/RIB   LabelVer  ImportVer  SendTblVer  StandbyVer
Speaker               1          1          1          1           1           1

Neighbor        Spk    AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down  St/PfxRcd
10.11.254.37      0  100        0       0        0    0    0 00:00:00 Idle

Also, debug shows that BGP makes no attempt to establish a session. It's like BGP gets disabled.

The only doc that mentions such a limitation (in IOS-XR 3.3 for CRS) is the one in http://www.cisco.com/en/US/docs/ios_xr_sw/iosxr_r3.3/conversion/reference/guide/cn33main.html#wp1028960

A given address family is only supported with a neighbor whose address is from that address family. For instance, IPv4 neighbors support IPv4 unicast and multicast address families, and IPv6 neighbors support IPv6 unicast and multicast address families. However, you cannot exchange IPv6 routing information with an IPv4 neighbor and vice versa.

I searched all CCO for more information, but i didn't manage to find something useful. Does anyone have extra information to share? TAC is struggling (as usual) to find an answer...

Update #1
Cisco verified once more that this is a supported configuration. Arie Vayner (and later tac) proposed to add an IPv6 address to the interface being used as an IPv4 next-hop. Indeed, this solved the problem and the BGP session came up. But then it became even more interesting...

Two IPv6 prefixes are learned from the IPv4 neighbor. Next-hop is an IPv6 address.

RP/0/RSP0/CPU0:ASR#sh bgp ipv6 uni 
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0800000   RD version: 5
BGP main routing table version 5
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network            Next Hop            Metric LocPrf Weight Path
* i2001::1:2:3/128    2003::1:2:3              0    100      0 ?
* i2003::/64          2003::1:2:3              0    100      0 ?

Processed 2 prefixes, 2 paths

If i remove the IPv6 address from the interface that is being used as next-hop (the one i added before), then i automatically get an IPv6 prefix with an IPv4 next-hop!!!

RP/0/RSP0/CPU0:core-distr-kln-02#sh bgp ipv6 uni 
BGP router identifier 10.11.254.38, local AS number 100
BGP generic scan interval 60 secs
BGP table state: Active
Table ID: 0xe0800000   RD version: 6
BGP main routing table version 6
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
Network            Next Hop            Metric LocPrf Weight Path
*>i2001::1:2:3/128    10.11.254.41             0    100      0 ?

Processed 1 prefixes, 1 paths

The BGP session stays up, until something happens that will reset it. Then it will stay down forever, as it was happening in the beginning.

I must say that i cannot endorse such an implementation. Using exactly the same configuration, you get different results, depending on the order of (un)configuring things. Also, i cannot understand why the establishment of an IPv4 BGP session that is going to negotiate IPv4/IPv6 address-family capabilities should depend on whether an IPv6 next-hop exists or not. That should be left for the NLRI exchange routine.

After all, RFC 4271 defines among others two error conditions for the NEXT_HOP attribute:

If the NEXT_HOP attribute field is syntactically incorrect, then the Error Subcode MUST be set to Invalid NEXT_HOP Attribute. The Data field MUST contain the incorrect attribute (type, length, and value). Syntactic correctness means that the NEXT_HOP attribute represents a valid IP host address.

If the NEXT_HOP attribute is semantically incorrect, the error SHOULD be logged, and the route SHOULD be ignored. In this case, a NOTIFICATION message SHOULD NOT be sent, and the connection SHOULD NOT be closed.



Update #2
After the developers got involved, we ended up with the following:

  1. In IOS-XR you need an IPv6 NH in order to activate the IPv6 AF for an IPv4 BGP session.
  2. If you don't have an IPv6 NH, then the IPv4 BGP session won't even come up.
  3. The above was done to protect against misconfiguration, because otherwise you would get a misleading v4 mapped v6 address as NH.
  4. If you have an IPv6 NH, then the IPv4 BGP session with the IPv6 AF will come up.
  5. If afterwards you remove the IPv6 NH, then the session deliberately remains up and you get a misleading v4 mapped v6 address as NH.
Cisco agreed (thx Xander) that the behaviors in 3 and 5 contradict each other, so a short-term solution (update the documentation and print a warning message) got recorded in CSCtq26829.
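
To make the "v4 mapped v6 address as NH" wording concrete, here is a minimal Python sketch of what an IPv4-mapped IPv6 address is and how it can be told apart from a native IPv6 next-hop. It uses only the standard ipaddress module; the function names are made up for illustration, and it says nothing about how IOS-XR encodes the next-hop internally.

```python
import ipaddress

def v4_mapped(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4 address."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))

def describe_next_hop(next_hop: str) -> str:
    """Classify an IPv6 next-hop as native or IPv4-mapped."""
    addr = ipaddress.IPv6Address(next_hop)
    if addr.ipv4_mapped is not None:
        return f"IPv4-mapped ({addr.ipv4_mapped})"
    return "native IPv6"

# Next-hops taken from the two "sh bgp ipv6 uni" outputs above.
print(describe_next_hop(str(v4_mapped("10.11.254.41"))))
print(describe_next_hop("2003::1:2:3"))
```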

Wednesday, May 4, 2011

BRAS/Server initiated renewal for DHCPv6-PD leases - When?

One major issue when dealing with IPv6 CPEs is the currently missing capability to automatically renew the IPv6 addresses on the CPE's LAN after a disconnect/reconnect of the subscriber's dynamic session.

Although there are some tricks (#1, #2) for client (subscriber) initiated renewal, not all CPE vendors support those tricks. Also many times it is preferable to have the BRAS/BNG, or generally the ISPs, control this renewal, since all the AAA (and BSS/OSS) systems are usually managed by them.

The DHCPv6 "Reconfigure" message was made to help in the above case. According to RFC 3315:

RECONFIGURE (10) A server sends a Reconfigure message to a client to inform the client that the server has new or updated configuration parameters, and that the client is to initiate a Renew/Reply or Information-request/Reply transaction with the server in order to receive the updated information.
...
The client includes a Reconfigure Accept option if the client is willing to accept Reconfigure messages from the server.


It's obvious that without this support, a client must wait until it renews its lease to get configuration updates, which might be from some hours to many days. Btw, shouldn't the change of the WAN interface state on the CPE automatically cause the renewal of the delegated prefix on its LAN???

Also, according to the recently approved informational RFC 6204, the support of the DHCPv6 Reconfigure option is a MUST for IPv6 CPEs.

WAA-4: The IPv6 CE router MUST be able to support the following DHCPv6 options: IA_NA, Reconfigure Accept and DNS_SERVERS.

Now, someone malicious might translate the above "MUST be able to support" phrase into "ok, it's not actually required to support it now, but you must be able to support it in the future". It definitely would be better to have it as "MUST support".

A recent "IPv6 CE Router Interoperability Whitepaper" from UNH-IOL shows that none of the CPEs that were tested supported this option.

The last issue discovered during the testing was IPv6 CE router lack of support for DHCP Reconfigure. According to draft-ietf-v6ops-ipv6-cpe-router-09, “WAA-4: The IPv6 CE router MUST be able to support the following DHCPv6 options: IA_NA, Reconfigure Accept [RFC3315], DNS_SERVERS [RFC3646].” Therefore the IPv6 CE routers should have included the Reconfigure Accept in DHCPv6 Request or Solicit messages.
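
As a rough illustration of what the testers were looking for, the sketch below walks the options of a DHCPv6 client message and reports whether the Reconfigure Accept option is present. The message and option codes are the ones assigned in RFC 3315; the helper names and the sample DUID are made up, and the sketch ignores relay encapsulation and nested options.

```python
import struct

SOLICIT, REQUEST = 1, 3        # DHCPv6 message types (RFC 3315)
OPTION_CLIENTID = 1
OPTION_RECONF_ACCEPT = 20      # zero-length option

def option(code: int, data: bytes = b"") -> bytes:
    """Encode one option: 2-byte code, 2-byte length, then the data."""
    return struct.pack("!HH", code, len(data)) + data

def iter_options(payload: bytes):
    """Yield (code, data) for each top-level option in the payload."""
    offset = 0
    while offset + 4 <= len(payload):
        code, length = struct.unpack_from("!HH", payload, offset)
        offset += 4
        yield code, payload[offset:offset + length]
        offset += length

def accepts_reconfigure(message: bytes) -> bool:
    """True if a Solicit/Request carries the Reconfigure Accept option."""
    if message[0] not in (SOLICIT, REQUEST):
        return False
    # Options start after the 1-byte msg-type and 3-byte transaction-id.
    return any(code == OPTION_RECONF_ACCEPT for code, _ in iter_options(message[4:]))

header = bytes([SOLICIT]) + b"\x12\x34\x56"
duid = bytes.fromhex("00030001001122334455")   # made-up DUID-LL
without_accept = header + option(OPTION_CLIENTID, duid)
with_accept = without_accept + option(OPTION_RECONF_ACCEPT)
print(accepts_reconfigure(without_accept), accepts_reconfigure(with_accept))
```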

It gets a little bit more complicated, if you check what RFC 3633 says about the "Reconfigure" message when it is used for Prefix Delegation:

13.1. Delegating Router behavior

The delegating router initiates a configuration message exchange with a requesting router, as described in section 19, "DHCP Server-Initiated Configuration Exchange" of RFC 3315, by sending a Reconfigure message (acting as a DHCP server) to the requesting router, as described in section 19.1, "Server Behavior" of RFC 3315. The delegating router specifies the IA_PD option in the Option Request option to cause the requesting router to include an IA_PD option to obtain new information about delegated prefix(es).

13.2. Requesting Router behavior

The requesting router responds to a Reconfigure message, acting as a DHCP client, received from a delegating router as described in section 19.4, "Client Behavior" of RFC 3315. The requesting router MUST include the IA_PD Prefix option(s) (in an IA_PD option) for prefix(es) that have been delegated to the requesting router by the delegating router from which the Reconfigure message was received.
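
The delegating-router side described in 13.1 can be sketched as follows: a Reconfigure message with transaction-id 0, the two DUIDs, a Reconfigure Message option asking for a Renew, and an Option Request option listing IA_PD. The codes come from RFC 3315 and RFC 3633; the function name and the DUIDs are made up, and the DHCP authentication that RFC 3315 requires for Reconfigure is deliberately left out to keep the sketch short.

```python
import struct

RECONFIGURE, RENEW = 10, 5          # DHCPv6 message types
OPTION_CLIENTID, OPTION_SERVERID = 1, 2
OPTION_ORO = 6                      # Option Request option
OPTION_RECONF_MSG = 19              # Reconfigure Message option
OPTION_IA_PD = 25                   # RFC 3633

def option(code: int, data: bytes = b"") -> bytes:
    """Encode one option: 2-byte code, 2-byte length, then the data."""
    return struct.pack("!HH", code, len(data)) + data

def build_reconfigure(server_duid: bytes, client_duid: bytes) -> bytes:
    """Build an unauthenticated Reconfigure that asks for a Renew with IA_PD."""
    header = bytes([RECONFIGURE]) + b"\x00\x00\x00"   # transaction-id is 0
    return (header
            + option(OPTION_SERVERID, server_duid)
            + option(OPTION_CLIENTID, client_duid)
            + option(OPTION_RECONF_MSG, bytes([RENEW]))
            + option(OPTION_ORO, struct.pack("!H", OPTION_IA_PD)))

# Made-up DUIDs, for illustration only.
msg = build_reconfigure(bytes.fromhex("000300010000aaaaaaaa"),
                        bytes.fromhex("00030001001122334455"))
print(msg.hex())
```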


So, if someone claims support of the "Reconfigure" option, what does it refer to? DHCPv6 or DHCPv6-PD? What about Relay?

On the server side, Juniper MX series already support it (it's called "dynamic reconfiguration for DHCPv6"), but Cisco ASR1k doesn't. Cisco CNR 7.x also supports it, so does (or will) Dibbler 0.8.0. ISC DHCPv6 server and Windows Server 2008 probably don't.

Notes

  • Our experience with IPv6 CPEs until now is disappointing on this matter. Although we have feedback from various CPE vendors that they will support it, none of them actually supports it now.
  • Wouldn't it be interesting to have the "Reconfigure" message be sent by the BRAS/BNG DHCPv6 server to the client, when the router receives a Radius CoA (RFC 3576) packet for this specific subscriber?

Saturday, April 30, 2011

VTY IPv6 ACLs in IOS-XR

One of the first things you have to do before adding IPv6 addresses in a router, is to protect its management plane. A simple way to implement a part of that is to define an ACL (Access List) under the relevant terminal lines (VTYs).

In IOS it's quite simple.
One ACL for IPv4 and one ACL for IPv6, which cannot share the same name.

! IOS
!----
ip access-list extended IPV4-VTY-ACL
 permit ip 10.0.0.0 0.0.0.255 any
 deny   ip any any log
!
ipv6 access-list IPV6-VTY-ACL
 permit ipv6 2001:DB8::/32 any
 deny   ipv6 any any log
!
line vty 0 10
 access-class IPV4-VTY-ACL in
 ipv6 access-class IPV6-VTY-ACL in
!

In IOS-XR it gets a little bit tricky.
One ACL for IPv4 and one ACL for IPv6, which must share the same name.

! IOS-XR
!-------
ipv4 access-list VTY-ACL
 10 permit ipv4 10.0.0.0 0.0.0.255 any
 20 deny   ipv4 any any log
!
ipv6 access-list VTY-ACL
 10 permit ipv6 2001:DB8::/32 any
 20 deny   ipv6 any any log
!
vty-pool default 0 10
line default
 access-class ingress VTY-ACL
!

Ok, then you think that this is good because it saves you typing.
So you expect to meet the same behavior when viewing the ACLs. Bad luck. You still have to use the "ipv6" keyword in order to view the ipv6 ACL.

RP/0/RSP0/CPU0:ASR#sh access-lists VTY-ACL
ipv4 access-list VTY-ACL
 10 permit ipv4 10.0.0.0 0.0.0.255 any
 20 deny ipv4 any any log

RP/0/RSP0/CPU0:ASR#sh access-lists ipv4 VTY-ACL
ipv4 access-list VTY-ACL
 10 permit ipv4 10.0.0.0 0.0.0.255 any
 20 deny ipv4 any any log

RP/0/RSP0/CPU0:ASR#sh access-lists ipv6 VTY-ACL
ipv6 access-list VTY-ACL
 10 permit ipv6 2001:DB8::/32 any
 20 deny ipv6 any any log

Talking about uniformity...
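
The shared-name lookup of the IOS-XR access-class can be modeled with a few lines of Python: one ACL name, two tables, and the address family of the source decides which one is consulted. This is only a sketch of first-match semantics using the entries from the example above; the data structure and function name are made up, and what the router does when one of the two ACLs is missing is not modeled.

```python
import ipaddress

# (address family, ACL name) -> ordered entries, mirroring the IOS-XR config above.
ACLS = {
    ("ipv4", "VTY-ACL"): [
        ("permit", ipaddress.ip_network("10.0.0.0/0.0.0.255")),  # wildcard mask = /24
        ("deny", ipaddress.ip_network("0.0.0.0/0")),
    ],
    ("ipv6", "VTY-ACL"): [
        ("permit", ipaddress.ip_network("2001:db8::/32")),
        ("deny", ipaddress.ip_network("::/0")),
    ],
}

def vty_allowed(acl_name: str, source: str) -> bool:
    """First-match evaluation; the source's address family selects the ACL."""
    addr = ipaddress.ip_address(source)
    family = "ipv4" if addr.version == 4 else "ipv6"
    for action, network in ACLS.get((family, acl_name), []):
        if addr in network:
            return action == "permit"
    return False  # this sketch denies when nothing matches

for src in ("10.0.0.5", "10.0.1.5", "2001:db8::1", "2001:db9::1"):
    print(src, vty_allowed("VTY-ACL", src))
```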


Notes

IOS-XR offers a different way to protect the mgmt-plane by using the MPP feature (Management Plane Protection).

Sample IPv6 Addressing/Dimensioning Plan for ISPs

This is a high level summary of an IPv6 addressing & dimensioning plan for mid-sized service providers. Obviously it doesn't apply to all cases, but i hope other people will find it useful too.

First you define 3 levels of PoPs (Points of Presence), depending on number of customers and address consumption:

  • Level-1 PoP (Large)
  • Level-2 PoP (Medium)
  • Level-3 PoP (Small)
Then you define 2 types of customers:
  • Residential
  • Business

General rules
  • Keep the boundary on /32,/40,/48,/56,/64,/128 for easier management.
  • Avoid hex letters (A,B,C,D,E,F) on infrastructure addresses, at least until you run out of numbers.
  • Keep infrastructure addresses in a single block for easier ACL management.
  • Loopbacks, Management, Internal can be contained in a single /41, leaving Public on the other /41. 
  • Keep customer addresses in a single block per PoP for easier route aggregation.
  • Too much aggregation won't help you much in case of multiple internet exits.

Each ISP gets at least a /32 from its RIR. A sample dimensioning could be the following:

1 x /32
  • 1 x /40 : Infrastructure Addresses
    • 1 x /48 for all PoPs : Loopbacks & Management
      • 1 x /56 for all PoPs : Loopbacks
        • 1 x /64 per Loopback Category
          • 1 x /128 per Loopback Interface
      • 1 x /56 per PoP : Management
        • 1 x /64 per Management LAN
    • 1 x /48 for all PoPs : Reserved
    • 1 x /48 per PoP : Internal Networks
      • X1 x /56 : Routers P2P Links
        • 1 x /64 per Routers P2P Link 
      • X2 x /56 : Routers LANs
        • 1 x /64 per Routers LAN
      • X3 x /56 : Hosts LANs
        • 1 x /64 per Hosts LAN
      • X4 x /56 : Servers LANs
        • 1 x /64 per Servers LAN
      • X5 x /56 : Other
        • 1 x /64 per Other
    • 1 x /48 per PoP : Public Networks
      •  X1 x /56 : Routers P2P Links
        • 1 x /64 per Routers P2P Link 
      •  X2 x /56 : Routers LANs
        • 1 x /64 per Routers LAN 
      •  X3 x /56 : Hosts LANs
        • 1 x /64 per Hosts LAN 
      •  X4 x /56 : Servers LANs
        • 1 x /64 per Servers LAN 
      • X5 x /56 : Other
        • 1 x /64 per Other
  • A x /40 : Level-1 PoP Customers
    • N1 x /40 per PoP : Business Customers
      • 1 x /48 per Large Customer
      • 1 x /56 per Small Customer
    • N2 x /40 per PoP : Residential Customers
      • 1 x /56 per Customer LAN
      • 1 x /64 per Customer WAN
  • B x /40 : Level-2 PoP Customers
    •  M1 x /40 per PoP : Business Customers
      •  1 x /48 per Large Customer
      •  1 x /56 per Small Customer
    •  M2 x /40 per PoP : Residential Customers
      •  1 x /56 per Customer LAN
      •  1 x /64 per Customer WAN
  • C x /40 : Level-3 PoP Customers
    •  L1 x /40 per PoP : Business Customers
      •  1 x /48 per Large Customer
      •  1 x /56 per Small Customer
    •  L2 x /40 per PoP : Residential Customers
      •  1 x /56 per Customer LAN
      •  1 x /64 per Customer WAN
  • D x /40 : Reserved
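
A small Python sketch of how the top of this hierarchy could be carved out of a /32, using the documentation prefix 2001:db8::/32 as a stand-in for a real allocation. Which block gets which number is an arbitrary choice made for the example (first /40 for infrastructure, second /40 for the first PoP's customers, Public in the second /41 as per the general rules); only the prefix lengths come from the plan.

```python
import ipaddress
from itertools import islice

def carve(net, new_prefix, count):
    """Return the first `count` subnets of `net` with the given prefix length."""
    return list(islice(net.subnets(new_prefix=new_prefix), count))

allocation = ipaddress.ip_network("2001:db8::/32")    # stand-in for the RIR allocation

infra, customers_pop1 = carve(allocation, 40, 2)       # infrastructure, then customers
private_half, public_half = carve(infra, 41, 2)        # Public lives in the other /41

loop_mgmt, reserved, internal_pop1 = carve(private_half, 48, 3)
public_pop1 = carve(public_half, 48, 1)[0]

loopbacks = carve(loop_mgmt, 56, 1)[0]                 # /56 for all loopbacks
loopback_category = carve(loopbacks, 64, 1)[0]         # /64 per loopback category
loopback0 = ipaddress.ip_network(f"{loopback_category.network_address + 1}/128")

first_residential_lan = carve(customers_pop1, 56, 1)[0]

for name, net in [("infrastructure", infra), ("public /41", public_half),
                  ("internal PoP1", internal_pop1), ("public PoP1", public_pop1),
                  ("loopback0", loopback0), ("customer LAN", first_residential_lan)]:
    print(f"{name:15} {net}")
```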

Calculations

A,B,C depend on the total number of /40 per type of PoP (A>B>C).
N1,N2,M1,M2,L1,L2 depend on the number of customers per type of PoP (N1>M1>L1 & N2>M2>L2)

The sum of (N1 x /40) + (N2 x /40) over all Level-1 PoPs equals A x /40.
The sum of (M1 x /40) + (M2 x /40) over all Level-2 PoPs equals B x /40.
The sum of (L1 x /40) + (L2 x /40) over all Level-3 PoPs equals C x /40.
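
The bookkeeping above can be checked with a few lines of Python. A /32 holds 256 x /40, so 1 (infrastructure) + A + B + C + D must equal 256. The PoP counts and the per-PoP numbers below are invented purely to show the arithmetic; they are not a recommendation.

```python
def slash40_budget(pops, infra=1, total=256):
    """pops maps a PoP level to a list of (business_40s, residential_40s), one per PoP.

    Returns the /40 count per level (A, B, C) and what is left as reserved (D).
    """
    per_level = {level: sum(b + r for b, r in sites) for level, sites in pops.items()}
    reserved = total - infra - sum(per_level.values())
    if reserved < 0:
        raise ValueError("plan does not fit in a single /32")
    return per_level, reserved

# Hypothetical network: N1=3, N2=8, M1=2, M2=4, L1=1, L2=2.
example = {
    "Level-1": [(3, 8)] * 3,   # 3 large PoPs
    "Level-2": [(2, 4)] * 5,   # 5 medium PoPs
    "Level-3": [(1, 2)] * 8,   # 8 small PoPs
}
per_level, reserved = slash40_budget(example)
print(per_level, "reserved:", reserved)
```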

Notes

The above plan is based on what i believe to be current best practices and recommendations for a specific type of service provider. Some things will change, as they changed in the past:

  1. Initially a /48 was recommended for all sites in the general case (RFC 3177), now it's per case (RFC 6177).
  2. Some years ago a /127 was not recommended for p2p links (RFC 3627), now /127 has resurfaced (RFC 6164).
  3. Currently only /64 is used by SLAAC (RFC 4862), but someone thought something longer would be better (draft-yhb-6man-slaac-improvement).

I am sure we'll see a lot of changes in the following months/years regarding the length of prefixes in IPv6. I just hope we don't have to move to something new with more than 128 bits.

Tuesday, April 26, 2011

IOS-XR still lacks proper help output

Is it really so hard for Cisco to put a logic in the list of available configuration commands?

This is what you get if you enter "?" under a BVI in an ASR9k running IOS-XR 4.0.1.

RP/0/RSP0/CPU0:ASR9k1(config-if)#?    
  address-family   AFI/SAFI configuration
  arp              Configure Address Resolution Protocol
  bandwidth        Set the bandwidth of an interface
  clear            Clear the uncommitted configuration
  commit           Commit the configuration changes to running
  crypto           Set crypto parameters
  dampening        configure state dampening on the given interface
  describe         Describe a command without taking real actions
  description      Set description for this interface
  do               Run an exec command
  exit             Exit from this submode
  frame-relay      Frame Relay interface configuration commands
  ipv4             IPv4 interface subcommands
  load-interval    Specify interval for load calculation for an interface
  local-proxy-arp  Enable local proxy ARP
  logging          Per-interface logging configuration
  mac-address      Set the Mac address(xxxx.xxxx.xxxx) on an interface
  maintenance      Configure maintenance embargo flag on the given interface
  mtu              Set the MTU on an interface
  no               Negate a command or set its defaults
  proxy-arp        Enable proxy ARP
  pwd              Commands used to reach current submode
  root             Exit to the global configuration mode
  service-policy   Configure a service policy
  show             Show contents of configuration
  shutdown         shutdown the given interface
  vrf              Set VRF in which the interface operates

I thought IOS-XR would improve such simple things, especially as now it has become mature.

Wouldn't it be better to have some of the commands that are not applicable to interface configuration under a different paragraph, like below?

RP/0/RSP0/CPU0:ASR9k1(config-if)#?    
  address-family   AFI/SAFI configuration
  arp              Configure Address Resolution Protocol
  bandwidth        Set the bandwidth of an interface
  crypto           Set crypto parameters
  dampening        configure state dampening on the given interface
  description      Set description for this interface
  frame-relay      Frame Relay interface configuration commands
  ipv4             IPv4 interface subcommands
  load-interval    Specify interval for load calculation for an interface
  local-proxy-arp  Enable local proxy ARP
  logging          Per-interface logging configuration
  mac-address      Set the Mac address(xxxx.xxxx.xxxx) on an interface
  maintenance      Configure maintenance embargo flag on the given interface
  mtu              Set the MTU on an interface
  no               Negate a command or set its defaults
  proxy-arp        Enable proxy ARP
  service-policy   Configure a service policy
  shutdown         shutdown the given interface
  vrf              Set VRF in which the interface operates
  ---------------------------------------------------------
  clear            Clear the uncommitted configuration
  commit           Commit the configuration changes to running
  describe         Describe a command without taking real actions
  do               Run an exec command
  exit             Exit from this submode
  pwd              Commands used to reach current submode
  root             Exit to the global configuration mode
  show             Show contents of configuration

Just wondering....


Notes

NX-OS has already solved that problem.

Saturday, April 23, 2011

How to assign IPv6 addresses to broadband CPEs

During the last months i've been experimenting a lot with all possible IPv6 address assignment methods to a broadband subscriber. As is the case with most ISPs and IPv6, we are testing every possible scenario before we put one (or a combination) of them into production; nevertheless we still haven't decided which path to follow on every aspect. And although we have made up our minds on many of those dilemmas, the one that will remain unresolved for a little bit more is the choice between dynamic vs static addresses.

Based on various RFCs, Broadband Forum's TRs, blogs (e.g. Ivan's) and IETF WG presentations, i have gathered below all possible address assignment scenarios/methods, when it comes to a broadband (PPPoE) IPv6 subscriber CPE. Single host connectivity is out of the scope of this list, because it doesn't seem applicable for wireline providers.

Some of these methods have already been verified internally (using mostly Cisco equipment, followed closely by Juniper), others are waiting to be verified and a few are under evaluation whether they actually can be implemented and verified. So, if you believe that something is not applicable, please write down your objection, so we can discuss it. Also, if you find something missing, please feel free to add it in your comment.

I have put them in 3 different areas, based on the interface and the address type. Each area includes various mechanisms related to address assignment.

  • CPE WAN Interface - Link-Local IPv6 Address
    • IPv6CP (RFC 5072)
      • Random/Unique Interface-Id
      • "Framed-Interface-Id" from Radius (RFC 3162)
  • CPE WAN Interface - Global Unicast IPv6 Address/Prefix
    • SLAAC (RFC 4862)
      • RA using a locally configured IPv6 pool on BRAS
      • RA using a "Framed-IPv6-Pool" from Radius (RFC 3162) to define a locally configured IPv6 pool
      • RA using "Framed-IPv6-Prefix" from Radius (RFC 3162)
    • Stateful DHCPv6 (RFC 3315)
      • IA_NA using a locally configured IPv6 address prefix pool on BRAS
      • IA_NA using an external DHCPv6 server, having BRAS as Relay Agent
      • IA_NA using "Framed-IPv6-Pool" from Radius (RFC 3162) to define a locally configured IPv6 pool on BRAS
      • IA_NA using "Framed-IPv6-Address" from Radius (draft-ietf-radext-ipv6-access)
  • CPE LAN Interface - Global Unicast IPv6 Prefix
    • DHCPv6-PD (RFC 3633)
      • IA_PD using "Delegated-IPv6-Prefix" from Radius (RFC 4818)
      • IA_PD using "Framed-IPv6-Prefix" from Radius (RFC 3162) for $username-dhcpv6 (if RFC 4818 is not supported)
      • IA_PD using "Framed-IPv6-Prefix" from Radius (RFC 3162) (if global addresses are not used on the WAN interface)
      • IA_PD using a locally configured IPv6 prefix pool on BRAS
      • IA_PD using an external DHCPv6-PD server, having BRAS as Relay Agent
      • IA_PD using "Framed-IPv6-Pool" from Radius (RFC 3162) to define a  locally configured IPv6 prefix pool on BRAS
      • IA_PD using "Delegated-IPv6-Prefix-Pool" from Radius (draft-ietf-radext-ipv6-access) to define a locally configured IPv6 prefix pool on BRAS
Acronyms
RA = Router Advertisement
IA_NA = Identity Association for Non-temporary Addresses
IA_PD = Identity Association for Prefix Delegation

Color Codes
GREEN : Verified
ORANGE : To be verified
BLUE : To be implemented and verified
RED : Probably not applicable

What i see as the most interesting options are the following two:
  • "Framed-IPv6-Prefix" for WAN (through SLAAC) and "Delegated-IPv6-Prefix" for LAN
  • "Framed-IPv6-Address" for WAN (through DHCPv6) and "Delegated-IPv6-Prefix" for LAN
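
To show what these two combinations look like as per-user RADIUS reply attributes, here is a Python sketch that represents each Access-Accept as a plain dictionary and runs a few sanity checks on it. The attribute names are the standard ones listed above; the prefixes, the /56 size for the LAN and the check function itself are made up for illustration.

```python
import ipaddress

# Hypothetical reply attributes for one subscriber, one dict per option.
OPTION_1 = {"Framed-IPv6-Prefix": "2001:db8:100:300::/64",     # WAN via SLAAC
            "Delegated-IPv6-Prefix": "2001:db8:200:100::/56"}  # LAN via DHCPv6-PD
OPTION_2 = {"Framed-IPv6-Address": "2001:db8:100:300::1",      # WAN via DHCPv6 IA_NA
            "Delegated-IPv6-Prefix": "2001:db8:200:100::/56"}

def check(attrs: dict) -> list:
    """Return a list of problems found in a set of reply attributes."""
    problems = []
    wan_prefix = attrs.get("Framed-IPv6-Prefix")
    wan_address = attrs.get("Framed-IPv6-Address")
    delegated = attrs.get("Delegated-IPv6-Prefix")
    if wan_prefix and ipaddress.IPv6Network(wan_prefix).prefixlen != 64:
        problems.append("SLAAC needs a /64 on the WAN")
    if wan_address:
        ipaddress.IPv6Address(wan_address)   # raises ValueError if malformed
    if not (wan_prefix or wan_address):
        problems.append("no WAN addressing attribute")
    if delegated is None:
        problems.append("no Delegated-IPv6-Prefix for the LAN")
    elif wan_prefix and ipaddress.IPv6Network(delegated).overlaps(
            ipaddress.IPv6Network(wan_prefix)):
        problems.append("delegated prefix overlaps the WAN prefix")
    return problems

print(check(OPTION_1), check(OPTION_2))
```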

Hopefully we'll soon have support of "draft-ietf-radext-ipv6-access" by BRAS vendors and we'll be able to verify another bunch of them.


Notes 
  • Radiator has been used as the radius server for all our internal tests. Its flexibility probably puts it at the top of the market.
  • Technically, IPv6CP is not an address assignment method (like IPCP). It just helps/hints the peer to build a possible IPv6 link-local address by negotiating a 64bit Interface-Id option. There have been various drafts (draft-huang-ipv6cp-options, draft-qin-pppext-ipv6-addr-pref) that were supposed to add more options to it (like Prefix, IPv6-Address, Delegated-Prefix, DNS-IPv6-Address, etc.), but they expired and never moved into standard status. Two new ones (draft-hu-pppext-ipv6cp-requirements, draft-hu-pppext-ipv6cp-extensions) have brought the issue to the surface again. IETF members' usual answer is that PPP is a link-layer protocol, so higher level (e.g. network, application) protocols should be left outside.

Saturday, April 16, 2011

How to find the peer IPv6 address of a PPPoE subscriber

In the IPv4 world you could very easily do the following on a BRAS/BNG, find the subscriber's IPv4 address and ping it.

bbras#sh users | i test
  Vi4          test PPPoVPDN     00:01:42 10.11.12.13
bbras#p 10.11.12.13

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 10.11.12.13, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/18/24 ms

Although ping wouldn't always work, in many cases (especially under a managed CPE environment), ping was an easy way to verify subscriber's connectivity, something that is useful for the call-center or 1st level support. Besides checking basic connectivity, using just a single command (as the one shown above, or the ones shown below) you could easily (or with a little bit of text searching) find the IPv4 address of a subscriber.

bbras#sh caller user test

  User: test, line Vi4, service PPPoVPDN
        Connected for 1d05h, Idle for 00:01:54
  Timeouts:    Limit     Remaining Timer Type
               01:00:00  00:58:05  PPP idle
  PPP: LCP Open, multilink Closed, PAP (<-), IPCP, IPV6CP
  IP: Local 10.10.10.10, remote 10.11.12.13
      Access list (I/O) is 120/not set
  Counts: 27378 packets input, 534558 bytes, 0 no buffer
          0 input errors, 0 CRC, 0 frame, 0 overrun
          13703 packets output, 280432 bytes, 0 underruns
          0 output errors, 0 collisions, 0 interface resets

bbras#sh ip int virtual-access 4
Virtual-Access4 is up, line protocol is up
  Interface is unnumbered. Using address of Loopback0 (10.10.10.10)
  Broadcast address is 255.255.255.255
  Peer address is 10.11.12.13

bbras#sh ppp int virtual-Access 4
PPP Serial Context Info
-------------------
Interface        : Vi4
PPP Serial Handle: 0x0
PPP Handle       : 0x0
SSS Handle       : 0x0
AAA ID           : 0
Access IE        : 0x0
SHDB Handle      : 0x0
State            : Down
Last State       : Init
Last Event       : None

PPP Session Info
----------------
Interface        : Vi4
PPP ID           : 0xC600001D
Phase            : UP
Stage            : Local Termination
Peer Name        : test
Peer Address     : 10.11.12.13
Control Protocols: LCP[Open] PAP+ IPCP[Open]
Session ID       : 29
AAA Unique ID    : 59
SSS Manager ID   : 0x3B
SIP ID           : 0x4F00003A
PPP_IN_USE       : 0x11

Vi4 LCP: [Open]
Our Negotiated Options
Vi4 LCP:    AuthProto PAP (0x0304C023)
Vi4 LCP:    MagicNumber 0x547CCD04 (0x0506547CCD04)
Vi4 LCP:    EndpointDisc 1 bbras
Vi4 LCP:    (0x13130162627261732D6C6C752D6B6C6E)
Vi4 LCP:    (0x2D3331)
Our Rejected options
  MRRU
Peer's Negotiated Options
Vi4 LCP:    MagicNumber 0x3DB09C3A (0x05063DB09C3A)

Vi4 IPCP: [Open]
Our Negotiated Options
Vi4 IPCP:    Address 10.10.10.10 (0x0306C2DBE763)
Peer's Negotiated Options
Vi4 IPCP:    Address 10.11.12.13 (0x0306C2DB711D)
Vi4 IPCP:    PrimaryDNS 10.1.1.1 (0x8106C15C9603)
Vi4 IPCP:    SecondaryDNS 10.2.2.2 (0x8306C15C030B)


Now, in the IPv6 world, you can have quite a few IPv6 "addresses" on the CPE (link-local address, SLAAC/DHCPv6 prefix/addresses for the WAN, DHCPv6-PD prefix/addresses for the LAN) and such limited info on the BRAS/BNG that there is actually no easy way to do something similar.

First of all, there isn't any "show ipv6 users" command. And if there was one, which IPv6 address should it display there?

Secondly, although in 99% of the cases you can set the Framed-Interface-Id per user, this doesn't mean it will be honored. The problem with Framed-Interface-Id is that it is used as a hint to the peer, so you cannot always depend on your own being used at the end. Btw, Broadband Forum TR-187 R-32 proposes to have the BRAS/BNG decline the tentative Interface-Id received from CPE in case a Framed-Interface-Id from Radius is being used.

In any case, a manual concatenation of the prefix + Id would produce the required IPv6 addresses.

So if you want to find the IPv6 address of a subscriber, you have to do some of the following steps:

bbras#sh ipv6 int virtual-access 4
Virtual-Access4 is up, line protocol is up
  IPv6 is enabled, link-local address is FE80::EE44:76FF:FEC8:5C1B
  No Virtual link-local address(es):
  Interface is unnumbered. Using address of Loopback0
  No global unicast address is configured

Note: the peer IPv6 address is not included in the "show ipv6 int" output, as one might have expected. The one shown above is the link-local IPv6 address on the BRAS/BNG side.

Adding the "prefix" keyword at the end of the previous command, reveals the Framed-IPv6-Prefix being used on this interface:

bbras#sh ipv6 int virtual-access 4 prefix
IPv6 Prefix Advertisements Virtual-Access4
Codes: A - Address, P - Prefix-Advertisement, O - Pool
       U - Per-user prefix, D - Default
       N - Not advertised, C - Calendar

PD default [LA] Valid lifetime 2592000, preferred lifetime 604800
OD 2999:2148:100:300::/64 [LA] Valid lifetime 2592000, preferred lifetime 604800

Under PPP you can find IPv6CP and the corresponding Framed-Interface-Id:

bbras#sh ppp int virtual-Access 4 | b IPV6CP:
Vi4 IPV6CP: [Open]
Our Negotiated Options
Vi4 IPV6CP:    Interface-Id EE44:76FF:FEC8:5C1B (0x010AEE4476FFFEC85C1B)
Peer's Negotiated Options
Vi4 IPV6CP:    Interface-Id 3131:3131:3A31:3131 (0x010A313131313A313131)

So, now you have the following info:

Framed-IPv6-Prefix: 2999:2148:100:300::/64
Framed-Interface-Id: 3131:3131:3A31:3131
Link Local prefix: FE80::/10

Based on these strings, you can create the following IPv6 addresses:

Peer global address: 2999:2148:100:300:3131:3131:3A31:3131
Peer link-local address: FE80::3131:3131:3A31:3131
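
The manual concatenation is easy to script. The following Python sketch ORs the 64-bit Interface-Id into the prefix, using the values collected above; the function names are made up, and it assumes the peer really uses the negotiated Interface-Id for both addresses.

```python
import ipaddress

def interface_id_to_int(interface_id: str) -> int:
    """Convert an Interface-Id like '3131:3131:3A31:3131' into a 64-bit integer."""
    groups = interface_id.split(":")
    if len(groups) != 4:
        raise ValueError("expected four 16-bit groups")
    value = 0
    for group in groups:
        value = (value << 16) | int(group, 16)
    return value

def peer_address(prefix: str, interface_id: str) -> ipaddress.IPv6Address:
    """Combine a prefix (/64 or shorter) with an Interface-Id."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen > 64:
        raise ValueError("prefix leaves no room for a 64-bit Interface-Id")
    return ipaddress.IPv6Address(
        int(network.network_address) | interface_id_to_int(interface_id))

iid = "3131:3131:3A31:3131"
print(peer_address("2999:2148:100:300::/64", iid))   # peer global address
print(peer_address("FE80::/10", iid))                # peer link-local address
```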

And you can verify connectivity to them accordingly:

bbras#p 2999:2148:100:300:3131:3131:3A31:3131

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2999:2148:100:300:3131:3131:3A31:3131, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/16/16 ms

bbras#p FE80::3131:3131:3A31:3131%Virtual-Access4

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to FE80::3131:3131:3A31:3131, timeout is 2 seconds:
Packet sent with a source address of FE80::EE44:76FF:FEC8:5C1B%Virtual-Access4
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/17/20 ms


Many thanks to Ole Troan for sharing the "%interface" trick with me ("%interface" is usually used to define the zone index on UNIX systems). I hope our talk about an IPv6 output similar to "show users" will have a positive end inside IOS code.

Sunday, March 13, 2011

Trying to calculate the IPv6 BGP table in 2015

As you may have already noticed, during the last months there isn't anything new written here by me. This is mainly due to two reasons: 1) lack of free time due to new responsibilities in my company and 2) most of my writing happens in our internal wiki (almost 200 "docs" during the last 12 months!).

On the other hand, there are a lot of "new" things happening in the internet, one of them being the IPv6 "paranoia". Since we have made IPv6 be a strict requirement in my company for any new product evaluation during the last 2 years, one of the things i'm trying lately to calculate is the size of the IPv6 BGP routing table in the forthcoming years.

Geoff has a diagram showing the size of IPv4 BGP table from 1994 till today.


As you can see, it started at ~20k in 1994 and has reached ~360k in 2011.

Imho, it's too risky to believe the same trend-line will also be used by IPv6. I expect IPv6 to have a higher rate and a sharper curve, especially in the next 2 years, where IPv4 will probably lower its increase rate. The only thing that can stop this is a translation method that works for everything (current translations have a lot of issues). If such a thing doesn't appear soon (lately, IPv6 related RFCs are being updated/published/expired quite frequently), everyone will sooner or later try to move to IPv6, meaning more routes in the global IPv6 BGP table.

Currently, the IPv6 BGP table has ~5k routes, but the potential for rapid increase seems quite promising. As you can see in the following diagram, the IPv6 table has nearly doubled its size during the last year and 2011 started quite aggressively.


Based on the above diagram, one could suppose the increase will continue to happen on the same rate during the next years too.

Another approach is based on the number of prefixes that each ASN announces. The average in IPv4 is a steady ~10 prefixes per ASN all these years (currently there are ~37k IPv4 ASNs and ~360k IPv4 routes). The average in IPv6 is 1.4 prefixes per ASN (currently there are ~3.5k IPv6 ASNs and ~5k IPv6 routes), but that might change soon.
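
The per-ASN arithmetic, plus two what-if cases, in a few lines of Python. The route and ASN counts are the approximate figures quoted above; the two scenarios (every current IPv4 ASN announcing IPv6 at today's IPv6 ratio, or at today's IPv4 ratio) are assumptions added only to show the range this approach produces.

```python
def prefixes_per_asn(routes: int, asns: int) -> float:
    """Average number of announced prefixes per ASN."""
    return routes / asns

def projected_routes(asns: int, average: float) -> int:
    """Table size if `asns` networks each announce `average` prefixes."""
    return round(asns * average)

ipv4_avg = prefixes_per_asn(360_000, 37_000)   # ~360k routes, ~37k ASNs
ipv6_avg = prefixes_per_asn(5_000, 3_500)      # ~5k routes, ~3.5k ASNs
print(f"IPv4: {ipv4_avg:.1f} prefixes/ASN, IPv6: {ipv6_avg:.1f} prefixes/ASN")

# What-if: all ~37k ASNs announce IPv6.
print("at the IPv6 ratio:", projected_routes(37_000, ipv6_avg))
print("at the IPv4 ratio:", projected_routes(37_000, ipv4_avg))
```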

Another thing to keep in mind is the average prefix size in IPv6. It started as ~/33, had many ups and downs and has now reached ~/38. It will probably keep showing this up-down behavior, but in general i expect a small increase as time passes by, for the reason i explain below. Also, having as a reference the IPv4 diagram, where it has stabilized somewhere above /22, i expect the IPv6 one to stabilize around /44.

IPv4 didn't have a competition and/or push from another protocol all the past years, so its increase was based solely on the number of new networks being connected to the internet. IPv6 has to support all current IPv4 networks in a shorter time-frame than IPv4 did (the translation case applies here too), plus all the new ones that are coming out due to more devices getting connected to the internet. This of course doesn't necessarily mean more IPv6 prefixes in the BGP table.

A /32 is usually what a provider gets these days when asking for IPv6 space. And although a single /32 announcement could probably cover all routing needs of the provider, this rarely happens in reality. The usual interconnections between providers are based on multiple 10 Gbps or 2.5 Gbps links. In order to distribute the /32 traffic among all of them, something must be done in terms of traffic engineering: split the /32 into multiple /40s, /48s, etc. (imho, an IPv4 /24 is like an IPv6 /48 in terms of network announcement; most providers will filter anything smaller). Currently this isn't needed, because IPv6 traffic is minimal, so a single 1 Gbps link is more than enough. But in the near future, a 10 Gbps link won't be enough to service a provider's /32.
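
How many extra routes such a split can generate is easy to quantify. The Python sketch below counts the more-specifics inside a /32 and hands the first few /40s to a set of upstream links; the link names are hypothetical and the documentation prefix stands in for a real allocation.

```python
import ipaddress
from itertools import islice

def more_specifics(aggregate: str, new_prefix: int) -> int:
    """Number of /new_prefix networks that fit inside the aggregate."""
    network = ipaddress.ip_network(aggregate)
    return 2 ** (new_prefix - network.prefixlen)

aggregate = "2001:db8::/32"
print("/40s per /32:", more_specifics(aggregate, 40))
print("/48s per /32:", more_specifics(aggregate, 48))

# Hypothetical traffic engineering: one /40 announced per upstream link,
# in addition to the covering /32.
links = ["upstream-A", "upstream-B", "upstream-C", "upstream-D"]
subnets = islice(ipaddress.ip_network(aggregate).subnets(new_prefix=40), len(links))
plan = dict(zip(links, subnets))
for link, prefix in plan.items():
    print(link, prefix)
```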

Meanwhile, 40 Gbps and 100 Gbps are around the corner. I don't know if these will become a commodity before IPv6 traffic explosion. If yes, we might see IPv6 table increasing in the beginning and decreasing after a while.

Regarding vendors, the things are quite complicated as usual. Here is the information i managed to gather about IPv6 routes capacity (FIB), based on the available info on the internet.

Cisco CRS-1 : 1m IPv6 routes
Cisco ASR9k : 1m IPv6 routes
Cisco ASR1k/RP2 : 2m/4m IPv6 routes (?)
Cisco ASR1k/RP1 : 250k IPv6 routes
Cisco C12k : 450k IPv6 routes
Cisco 7600/CXL : 500k IPv6 routes
Cisco 6500/CXL :  500k IPv6 routes
Juniper T640 : 750k IPv6 routes
Juniper MX960 : 720k IPv6 routes
AlcaLu 7750 : 512k IPv6 routes (RADIX compression)
Brocade XMR : 240k IPv6 routes
Brocade MLX : 240k IPv6 routes
Ericsson SE1200 : 2m IPv6 routes

Note : These have not been verified by the vendors. If you have any other numbers, please inform me.

Also, before taking into account the above numbers, you should remember that dual-stack will consume IPv4+IPv6 space (future IPv4 estimations talk about 500k routes) and some of these numbers might be hardcoded, while others are configurable.

A linear scenario starting from today would be the following, which would lead us to ~29k IPv6 routes.


Some "exponential" scenarios starting from today would be the following, which would lead us to ~104k/141k/194k IPv6 routes.


Lastly, an exponential scenario based on past utilisation would be the following, which would lead us to ~32k IPv6 routes.
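
For readers who want to play with the numbers, here is a small Python sketch of the two growth models, together with the inverse question: which yearly step or yearly factor is implied by each of the end values quoted above. It assumes a start of ~5k routes in 2011 and four years until 2015; the exact parameters behind the original diagrams are not known here.

```python
def linear_projection(start: float, per_year: float, years: int) -> float:
    """Table size after adding a fixed number of routes each year."""
    return start + per_year * years

def exponential_projection(start: float, factor: float, years: int) -> float:
    """Table size after multiplying by a fixed factor each year."""
    return start * factor ** years

def implied_linear_step(start: float, end: float, years: int) -> float:
    """Routes per year needed to go from start to end linearly."""
    return (end - start) / years

def implied_factor(start: float, end: float, years: int) -> float:
    """Yearly growth factor needed to go from start to end exponentially."""
    return (end / start) ** (1 / years)

START, YEARS = 5_000, 4   # assumptions: ~5k routes in 2011, target year 2015

print("linear, ~29k:", implied_linear_step(START, 29_000, YEARS), "routes/year")
for target in (104_000, 141_000, 194_000, 32_000):
    print(f"exponential, ~{target // 1000}k: x{implied_factor(START, target, YEARS):.2f} per year")
```

Under these assumptions the three "exponential" scenarios correspond to the table growing by roughly 2.1x, 2.3x and 2.5x every year, while the one based on past utilisation corresponds to roughly 1.6x.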



Personally, i would feel very safe with ~125k IPv6 routes until 2015. I would even risk for ~50k. Anyway, i believe it will be much easier to predict in 2 years from now.

As a last note, i have added a poll to the right regarding your estimation about the IPv6 BGP table in 2015. Please feel free to submit your vote.

 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Greece License.