Sunday, February 23, 2014

IPv6 RADIUS accounting is still a mess


Since we started putting IPv6 into production on BRAS/BNG platforms (almost 3 years ago), we have been facing the following issue: RADIUS accounting records were missing either the IPv4 or the IPv6 address information (or both). When only IPv4 was being used, everything was easy: just wait for IPCP to complete and then send the accounting record. Now, with the addition of IPv6 and, most importantly, DHCPv6 into the equation, things have become complicated and, even worse, trying to fix IPv6 has broken IPv4.

The issue is mainly caused by the extra PPP/NCP negotiation introduced by IPv6 and by DHCPv6 running asynchronously on top of IPv6 and PPP. Unfortunately, most of these issues go unnoticed until you have a lot of IPv6 subscribers and a lot of different CPEs.

Over the years, Cisco has tried various solutions to fix this issue, perhaps going about it the wrong way.

My proposal?

  • Allow the first accounting packet (start) to be sent within a configurable (preferably small) timeframe, assuming at least one of the IPv4/IPv6 address assignments has completed.
  • Allow a second accounting packet (interim/alive) to be sent a little later (with a larger timeframe compared to the first one), only if some IPv6 info wasn't included in the first accounting packet.
  • Both timeframes should be configurable by the user.
  • Allow an extra accounting packet (interim/alive) to be sent whenever a DHCPv6 prefix delegation happens, only if its info wasn't included in the previous two accounting packets (see the sketch after this list).
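
To make the proposal a bit more concrete, here is a rough Python sketch of the intended behavior. It is purely illustrative; the timeframe values, attribute names and helper functions are my own assumptions, not any vendor implementation.

# Purely illustrative sketch of the proposed accounting behaviour (not vendor code).
# The timeframe values, attribute names and helper functions are assumptions.

import time

START_TIMEFRAME = 0.5      # assumed: small window for the first (start) record
INTERIM_TIMEFRAME = 5.0    # assumed: larger window for the late IPv6 interim record

class Session:
    def __init__(self):
        self.ipv4 = None          # IPCP-assigned address
        self.ipv6_ifid = None     # IPV6CP interface-id
        self.slaac_prefix = None  # ICMPv6 RA prefix, if SLAAC is used
        self.pd_prefix = None     # DHCPv6-PD delegated prefix
        self.reported = set()     # attribute names already sent in accounting

def unreported_info(s):
    """Address info that has been assigned but not yet reported via accounting."""
    return {k: v for k, v in vars(s).items()
            if k != 'reported' and v is not None and k not in s.reported}

def send_record(kind, s):
    info = unreported_info(s)
    if info:                      # only send if there is something new to report
        print('sending', kind, 'with', info)
        s.reported.update(info)

def on_ppp_up(s):
    # 1st record: wait at most START_TIMEFRAME for at least one of IPv4/IPv6
    # to complete, then send the start record with whatever is available.
    deadline = time.time() + START_TIMEFRAME
    while time.time() < deadline and not (s.ipv4 or s.ipv6_ifid):
        time.sleep(0.05)
    send_record('Accounting-Start', s)
    # 2nd record: if some IPv6 info was still missing, allow a later interim.
    time.sleep(INTERIM_TIMEFRAME)
    send_record('Interim-Update', s)

def on_dhcpv6_pd(s, prefix):
    # Extra record: whenever a delegated prefix shows up late, report it,
    # unless it was already included in a previous record.
    s.pd_prefix = prefix
    send_record('Interim-Update', s)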

TR-187 from the Broadband Forum describes a similar behavior (in R-67 to R-73), but it includes an accounting record for each phase of the PPP/NCP/IPCP negotiation.

Here is a more detailed description of my proposal:

Everything happening at the PPP layer should produce a single accounting record, assuming "logical" PPP timeouts forbid the NCP negotiation from continuing for a long time. PPP should complete at least one of the IPv4/IPv6 address negotiations during this time, otherwise the PPP session should be torn down. If at least one protocol address assignment is successful, then send the accounting start record for this protocol and allow an extra accounting (interim/alive) record to be sent as soon as the other protocol completes too. But don't add extra delay to the initial accounting while waiting for both the IPv4 and IPv6 address assignments to happen. This initial accounting record (start) should include both the IPv4 (IPCP) address and the IPv6 (IPV6CP) interface-id, with the capability to also add the ICMPv6 RA prefix if SLAAC is being used. Based on my measurements (on an averagely loaded BRAS/BNG), IPCP/IPV6CP assignments should take less than 100 ms, while SLAAC completion might take up to 500 ms after PPP completes.

A 2nd accounting record should be sent just after acquiring any IPv6 address/prefix through DHCPv6 and/or DHCPv6-PD. Since DHCPv6 runs asynchronously on top of IPv6 and you cannot guarantee the exact timeframe within which the CPE will ask for an address/prefix using it, it's better to have a separate accounting record for it, even if that comes some seconds later. Based on my measurements, DHCPv6 completion might happen up to 3-5 seconds after PPP completes, depending mostly on the CPE's ability to initiate DHCPv6 at the right time (after that, it's just a few ms until the prefix is assigned).

A 3rd accounting record should be sent in very rare cases, i.e. when the IPv4 assignment completes early, SLAAC and/or DHCPv6 IA_NA complete afterwards, and DHCPv6-PD happens at an even later time (although the relevant RFCs "forbid" such behavior, the network should be able to cope with "arrogant" CPEs too).

Of course, if the DHCPv6/DHCPv6-PD assignment happens within the initial small timeframe, then its info should also be included in the initial accounting packet, if possible. Similarly, if SLAAC didn't complete within the initial timeframe, then its info should be included in the second accounting packet.

Also, it's obvious that if PPP protocol rejects are received for IPv4 (i.e. on IPv6-only services) or IPv6 (i.e. on IPv4-only services), then there shouldn't be any waiting time for the rejected protocol to complete address assignment.

An even better proposal would be to fix the periodic accounting (interim) records so that they include every possible type of new information (including IPv4/IPv6 addresses), either as soon as it becomes available or after some configurable time.

All of the above is worth discussing only if the prefix assigned through DHCPv6-PD is indeed included in the RADIUS accounting packets. There have been numerous attempts at fixing that, and IPv4/IPv6 accounting in general ("aaa accounting include auth-profile delegated-ipv6-prefix", "aaa accounting delay-start extended-time", "aaa accounting delay-start [all]", CSCua18679, CSCub63085, CSCuj80527, CSCtn24085, CSCuj09925), so let's hope this time Cisco will be successful.

PS: Some things might work better or worse depending on whether the DHCPv6-PD assignment happens through a local pool or a pool configured on RADIUS. Ivan's blog includes some relevant information too.

Sunday, October 27, 2013

How Multi is MP-BGP in IOS-XR - Part #2

When I was writing the first part of "How Multi is MP-BGP in IOS-XR" two years ago, I concluded with the following:

  1. In IOS-XR you need an IPv6 NH in order to activate the IPv6 AF for an IPv4 BGP session.
  2. If you don't have an IPv6 NH, then the IPv4 BGP session won't even come up.
  3. The above was done to protect against misconfiguration, because otherwise you would get a misleading v4-mapped IPv6 address as NH.
  4. If you have an IPv6 NH, then the IPv4 BGP session with the IPv6 AF will come up.
  5. If you afterwards remove the IPv6 NH, then the session deliberately remains up and you get a misleading v4-mapped IPv6 address as NH.
Although back then I didn't agree with the above behavior, I couldn't do anything more than accept the "solution" given: print a warning message if such a case is encountered.

Recently I realized that the same IOS-XR developers have decided to bring even more confusion into the engineer's everyday job.

Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes
New Fact: You must display an IPv4 address as next-hop for IPv6 prefixes

I probably need to explain this a little bit more, because I'm talking about 6VPE this time. 6VPE is a technology/architecture that allows you to have IPv6 connectivity over an IPv4 MPLS (and not only) network.

The original 6VPE scenario where I encountered this behavior is quite complex, but for simplicity let's assume that it includes only the following:

2 x CE routers (CE1, CE2)
2 x 6VPE routers (PE1, PE2)
2 x P/RR routers (P1, P2)

This is a very simple VPN topology (with only two sites) like the following:

CE1 <=> PE1 <=> P1 <=> P2 <=> PE2 <=> CE2

All routers exchange IPv4/IPv6 prefixes using BGP, in order to have IPv4/IPv6 connectivity between the two CEs. IPJ describes the relevant route advertisement with a very nice picture.

If we could follow the exchange of information between some of these routers, then we would notice something like the following.

CE1 sends its IPv6 prefixes to PE1 through MP-BGP and PE1 sends them to P1 accordingly. Since (as of now) there is no support for LDP over IPv6, PE1 sends the IPv6 prefixes using a v4-mapped IPv6 address as next-hop (that's encoded inside the NLRI as shown in the debugs).

This is how the IPv6 prefix is sent from PE1 (10.10.253.164) to P1 (10.10.253.161).

PE1#sh bgp vpnv6 unicast all neighbors 10.10.253.161 advertised-routes
BGP table version is 7, local router ID is 10.10.253.164
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter, a additional-path
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:109 (default for vrf TEST-VRF)

*> 2001:DB8:2:60::22/127
                    2001:DB8:2:50::23        0         32768 ?
 


This is the relevant debug info that shows how exactly the IPv6 prefix (2001:DB8:2:60::22/127 with next-hop of ::FFFF:10.10.253.164) is sent from PE1 to P1.

BGP(5): (base) 10.10.253.161 send UPDATE (format) [100:109]2001:DB8:2:60::22/127, next ::FFFF:10.10.253.164, label 2689, metric 0, path Local, extended community RT:100:109


And this is how the IPv6 prefix is shown on the P1 (10.10.253.161) when received from PE1 (10.10.253.164):

P1#sh bgp vpnv6 unicast neighbors 10.10.253.164 routes
BGP router identifier 10.10.253.161, local AS number 100
BGP generic scan interval 60 secs
Non-stop routing is enabled
BGP table state: Active
Table ID: 0x0   RD version: 3888712472
BGP main routing table version 3
BGP NSR Initial initsync version 3 (Reached)
BGP scan interval 60 secs

Status codes: s suppressed, d damped, h history, * valid, > best
              i - internal, r RIB-failure, S stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network            Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:109
*>i2001:DB8:2:60::22/127
                      10.10.253.164           0    100      0 ?


Now, if the old fact applied to VPNv6 prefixes the way it does to plain IPv6 prefixes, then the IPv6 BGP session shouldn't even come up. Instead, it comes up and works fine. But, just to confuse me even more, the next-hop of an IPv6 prefix is shown as an IPv4 address (!!!).

Reminder...
Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes

Quoting from RFC 4659 (BGP-MPLS IP Virtual Private Network (VPN) Extension for IPv6 VPN):

When the IPv6 VPN traffic is to be transported to the BGP speaker using IPv4 tunneling (e.g., IPv4 MPLS LSPs, IPsec-protected IPv4 tunnels), the BGP speaker SHALL advertise to its peer a Next Hop Network Address field containing a VPN-IPv6 address:

- whose 8-octet RD is set to zero, and

- whose 16-octet IPv6 address is encoded as an IPv4-mapped IPv6  address [V6ADDR] containing the IPv4 address of the advertising BGP speaker. This IPv4 address must be routable by the other BGP Speaker.
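
To make the encoding tangible, here is a minimal Python sketch (standard library only) that builds the IPv4-mapped IPv6 next-hop for the PE addresses used in this lab; the helper function name is my own.

# Minimal sketch: the IPv4-mapped IPv6 next-hop described in RFC 4659,
# built with the Python standard library (the function name is my own).
import ipaddress

def v4_mapped_nexthop(ipv4_str):
    # ::FFFF:a.b.c.d -- the IPv4 address embedded in the low 32 bits of ::FFFF:0:0/96
    return ipaddress.IPv6Address('::ffff:' + ipv4_str)

nh = v4_mapped_nexthop('10.10.253.164')
print(nh)               # the v4-mapped IPv6 next-hop carried in the VPNv6 NLRI
print(nh.ipv4_mapped)   # 10.10.253.164 -- the embedded IPv4 address that IOS-XR chooses to display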


Now, if we check PE2 on the other side, which is also getting some IPv6 prefixes from CE2, we'll notice that everything is fine and in line with everything we know about 6VPE. So this time the IPv6 prefixes have a v4-mapped IPv6 address as next-hop, which is the expected output for a 6VPE topology.

This is how an IPv6 prefix is shown on the P2 (10.10.231.4) when received from PE2 (10.10.253.165):

P2#sh bgp vpnv6 unicast all neighbors 10.10.253.165 routes
BGP table version is 4, local router ID is 10.10.231.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale, m multipath, b backup-path, x best-external, f RT-Filter, a additional-path
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
Route Distinguisher: 100:141
*>i2001:DB8:10:1::22/127
                    ::FFFF:10.10.253.165
                                             0    100      0 65141 ?


Let's summarize:

P1: IPv6 prefix 2001:DB8:2:60::22/127 with 10.10.253.164 as NH
P2: IPv6 prefix 2001:DB8:10:1::22/127 with ::FFFF:10.10.253.165 as NH

If you still haven't figured out the difference between P1 and P2 based on the outputs given, let me tell you that P1 is an IOS-XR (4.2.x) based platform, while P2 is an IOS (15.x) based one. In terms of functionality everything works fine, because internally IOS-XR uses the correct v4-mapped IPv6 address as next-hop. It just doesn't display it in the CLI, nor in the debugs.

Now you might wonder why there is such a difference in IOS-XR's behavior. After waiting several weeks for a lab reproduction from Cisco, because nobody knew about this behavior, I got a very enlightening answer from the developers:

IOS-XR decided it's best to display the actual next hop which is a v4 nexthop and is also registered with the v4 RIB for tracking purposes, rather than what is transported in the NLRI and matches the output of everything else. So the behaviour is indeed expected and works 'by design'.


Reminder...
Old Fact: You can't have an IPv4 address as next-hop for IPv6 prefixes

Don't you love those guys? First they tell you that you can't have a v4 next-hop, but now they tell you that it's better to display a v4 next-hop, even though the actual NLRI carries a different one.

Luckily, someone inside Cisco (thx Xander once more) agreed that this behavior is misleading, so an enhancement request (CSCuj74543) was opened. If you would like to change this behavior too, please link your case to this DDTS. At the same time a documentation DDTS (CSCuj76745) was opened in order to officially describe this IOS-XR "expected" and "by design" behavior on CCO.

Saturday, November 10, 2012

You have to make the right balance between the convergence time and MTU

Lately I'm getting the impression that Cisco is releasing new products without the proper internal testing.

I'm going to talk about two recent examples, the ASR1001 and the ASR901, devices that are excellent value for money but (as usual) hide limitations that you unfortunately discover only after exhaustive testing.

The ASR1001 is a fine router, a worthy replacement for the 7200, which can be used for various purposes. Of course, like every new platform from every vendor these days, it fully supports jumbo frames, and that's a nice thing. At least you get that impression until you try to use the large MTU for control/routing protocols, where a not-so-nice surprise might be waiting for you.

MTU 1500 (just a few ms)

17:22:35.415: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:22:35.539: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:22:35.643: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:22:35.823: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:22:35.824: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done

MTU 9216 (~48 sec!)
17:43:07.923: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from FULL to DOWN, Neighbor Down: Interface down or detached
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from DOWN to INIT, Received Hello
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from INIT to 2WAY, 2-Way Received
17:43:08.001: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from 2WAY to EXSTART, AdjOK?
17:43:08.098: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXSTART to EXCHANGE, Negotiation Done
17:43:08.241: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from EXCHANGE to LOADING, Exchange Done
17:43:55.942: %OSPF-5-ADJCHG: Process 1, Nbr x.x.x.x on GigabitEthernet0/0/0 from LOADING to FULL, Loading Done

While trying to use an MTU of 9216 in a test environment (the same issue was observed even with smaller MTUs), we hit an interesting issue during the exchange of large OSPF databases between ASR1001s (running 15.1S & 15.2S). Packets (carrying LSAs) were being dropped internally in the router, because the underlying driver of the LSMPI (Linux Shared Memory Punt Interface) was not capable of handling such packets at fast rates. These large packets get fragmented internally into smaller ones (512 bytes each), transmitted and then reassembled, something that increases the pps rate between the involved subsystems by a factor of up to 18 (9216/512).
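
As a rough back-of-the-envelope illustration of that amplification, here is a small Python calculation; the pps budget below is an arbitrary assumption (only the ratio matters), and I'm assuming 1500-byte packets get chunked the same way.

# Rough illustration of why a larger MTU stresses the LSMPI punt path.
# The pps budget is an arbitrary assumption; only the ratio is the point.
import math

LSMPI_PPS_BUDGET = 10000   # assumed internal sub-packet budget, for illustration only
CHUNK = 512                # internal fragment size mentioned above

for mtu in (1500, 9216):
    subpackets = math.ceil(mtu / CHUNK)
    print(mtu, subpackets, LSMPI_PPS_BUDGET // subpackets)
# columns: MTU, internal sub-packets per punted packet, full packets/s that fit in the budget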

To be more exact, packets punted from the ESP to the RP are received by the Linux kernel of the RP. The Linux kernel then sends those packets to the IOSD process through the LSMPI, as shown in the punt path below.


So, this is the complete punt path on the ASR1000 Series router:

QFP <==> RP Linux Kernel <==> LSMPI <==> Fast-Path Thread <==> Cisco IOS Thread

Since there is a built-in limit on the pps rate that the LSMPI can handle, fragmenting the packets internally because of their size increases the internal pps rate and some sub-packets get discarded, which leads to complete packets being dropped and then retransmitted by the neighboring routers, which in turn leads to longer convergence times... if convergence can be accomplished at all after such packet loss (there were cases where even after minutes there was no convergence, or the adjacency seemed FULL but the RIB didn't have any entries from the LSDB). Things can get messier if you also run BGP (with path MTU discovery) and there is a large number of updates that need to be processed. Some time in the past, IOS included code that internally retransmitted only the lost 512-byte chunks, so OSPF couldn't actually tell it was losing packets, but because it caused other issues (probably overloading the LSMPI even more) it was removed.

So this leads to the question: at what layer should the router handle control-plane packet loss internally? As low as possible, in order to hide it from the actual protocol, or should everything be left to the protocol itself?

You can use the following command in order to check for issues in the LSMPI path (look out for "Device xmit fail").

ASR1001#show platform software infrastructure lsmpi

LSMPI Driver stat ver: 3

Packets:
         In: 17916
        Out: 4713

Rings:
         RX: 2047 free    0    in-use    2048 total
         TX: 2047 free    0    in-use    2048 total
     RXDONE: 2047 free    0    in-use    2048 total
     TXDONE: 2046 free    1    in-use    2048 total

Buffers:
         RX: 6877 free    1317 in-use    8194 total

Reason for RX drops (sticky):
     Ring full        : 0
     Ring put failed  : 0
     No free buffer   : 0
     Receive failed   : 0
     Packet too large : 0
     Other inst buf   : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Bad particle ins : 0
     Bad buf cond     : 0
     DS rd req failed : 0
     HT rd req failed : 0
Reason for TX drops (sticky):
     Bad packet len   : 0
     Bad buf len      : 0
     Bad ifindex      : 0
     No device        : 0
     No skbuff        : 0
     Device xmit fail : 103
     Device xmit rtry : 0
     Tx Done ringfull : 0
     Bad u->k xlation : 0
     No extra skbuff  : 0
     Consecutive SOPs : 0
     No SOP or EOP    : 0
     EOP but no SOP   : 0
     Particle overrun : 0
     Other inst buf   : 0
...

Keep in mind that ICMP echoes cannot be used to verify this behavior, because ICMP replies are handled by the ESP/QFP, so you won't notice the issue with a simple ping.
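
If you want to automate that check, a quick and dirty way is to save the command output to a file and flag any non-zero sticky drop counter. A minimal sketch (the file name is an assumption):

# Minimal sketch: flag non-zero sticky drop counters from a saved copy of
# "show platform software infrastructure lsmpi" (the file name is an assumption).
import re

section = None
drops = []
with open('lsmpi.txt') as f:
    for line in f:
        m = re.match(r'Reason for (\S+) drops', line)
        if m:
            section = m.group(1)                   # 'RX' or 'TX'
            continue
        m = re.match(r'\s+(.+?)\s*:\s*(\d+)\s*$', line)
        if section and m and int(m.group(2)) > 0:
            drops.append((section, m.group(1), int(m.group(2))))

for section, reason, count in drops:
    print(f'{section} drop: {reason} = {count}')   # e.g. TX drop: Device xmit fail = 103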

Note: Cisco has an excellent doc describing all cases of packet drops on the ASR1k platform here.
Also, there is a relevant bug (CSCtz53398) that is supposed to provide a workaround.

Cisco's answer? "You have to make the right balance between the convergence time and MTU"!

I tend to agree with them, but until now I had the impression that a larger MTU would lower the convergence time (as long as the CPU could keep up). Well, time to reconsider...


Saturday, November 12, 2011

aggregate-address ... summary-only-after-a-while

As it seems, there is always something that you think you know, until it's proven otherwise.

Some years ago, when I was studying for the CCIE, I learned that in order to suppress more specific routes from an aggregate advertisement in BGP, you could use the "aggregate-address ... summary-only" command. And I believed that until recently.

Let's suppose you have the following config on an ASR1k (10.1.1.1) running 15.1(2)S2:

router bgp 100
 bgp router-id 10.1.1.1
 neighbor 10.2.2.2 remote-as 100
 neighbor 10.2.2.2 update-source Loopback0
...
 address-family ipv4
  aggregate-address 10.10.10.0 255.255.255.0 summary-only
  redistribute connected
  neighbor 10.2.2.2 activate
...

Then you have 2 subscribers logging in.
With "show bgp" the 2 /32 routes under 10.10.10.0/24 seem to be suppressed and the /24 is in the BGP table as it should be:

*> 10.10.10.0/24    0.0.0.0                            32768 i
s> 10.10.10.3/32    0.0.0.0                  0         32768 ?
s> 10.10.10.4/32    0.0.0.0                  0         32768 ?

but doing some "debug bgp updates/events" reveals the following:

BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.3/32, next 10.1.1.1, metric 0, path Local
BGP(0): 10.2.2.2 send UPDATE (format) 10.10.10.4/32, next 10.1.1.1, metric 0, path Local

...and after a while:

BGP: aggregate timer expired

BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.3/32
BGP: topo global:IPv4 Unicast:base Remove_fwdroute for 10.10.10.4/32

At the same time on the peer router (10.2.2.2) you can see the above 2 /32 routes being received:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32
*>i10.10.10.3/32    10.1.1.1            0    100      0 ?
*>i10.10.10.4/32    10.1.1.1            0    100      0 ?

and immediately afterwards you can see the 2 /32 routes being withdrawn:

RP/0/RP0/CPU0:ASR9k#sh bgp neighbor 10.1.1.1 routes | i /32

Cisco is, as usual, a little bit contradictory about this behavior.


According to Cisco TAC, the default aggregation logic runs every 30 seconds, while BGP update processing is done almost every 2 seconds. That's the reason the route is initially advertised and later withdrawn (the aggregation processing follows the initial update). They also admit that the root cause of this problem is in the BGP code. The route will be advertised as soon as the best path is completed; it may take 30 seconds or more for the aggregation logic to complete and withdraw the more specific route.
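
Just to illustrate the window this opens, here is a toy Python calculation based on the timer values above; the arrival times are made up, and both timers are assumed to be free-running from t=0.

# Toy illustration of the leak window, based on the timer values mentioned by TAC.
# Arrival times are made up; both timers are assumed to tick from t=0.
UPDATE_INTERVAL = 2      # seconds between BGP update runs ("almost every 2 seconds")
AGGREGATE_TIMER = 30     # default aggregation interval

def leak_window(arrival):
    """Seconds a more-specific route stays advertised before aggregation withdraws it."""
    advertised = ((arrival // UPDATE_INTERVAL) + 1) * UPDATE_INTERVAL
    withdrawn = ((arrival // AGGREGATE_TIMER) + 1) * AGGREGATE_TIMER
    return withdrawn - advertised

for t in (1, 5, 29):
    print(t, leak_window(t))   # e.g. a route arriving at t=1s can leak for ~28 seconds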

Then we also have the following:

Bug CSCsu96698

BGP: /32 route being advertised while 'summary-only' is configured

Symptoms: More specific routes are advertised and withdrawn later even if config aggregate-address net mask summary-only is configured. The BGP table shows the specific prefixes as suppressed with s>
Conditions: This occurs only with very large configurations.
Workaround: Configure a distribute-list in BGP process that denies all of the aggregation child routes.

Related Bug Information

It takes 30 seconds for BGP to form aggregate route

Symptom: for approximately 30 seconds router announces specific prefixes instead of aggregate route
Conditions: bgp session up/down
Workaround: unknown yet


Release notes of 12.2SB and 12.0S

The periodic function is by default called at 60 second intervals. The aggregate processing is normally done based on the CPU load. If there is no CPU load, then the aggregate processing function would be triggered within one second. As the CPU load increases, this function call will be triggered at higher intervals and if the CPU load is very high it could go as high as the maximum aggregate timer value configured via command. By default this maximum value is 30 seconds and is configurable with a range of 6-60 seconds and in some trains 0. So, if default values are configured, then as the CPU load increases, the chances of hitting this defect is higher.


Release notes of 12.2(33)SXH6

CLI change to bgp aggregate-timer command to suppress more specific routes.

Old Behavior: More specific routes are advertised and withdrawn later, even if aggregate-address
summary-only is configured. The BGP table shows the specific prefixes as suppressed.

New Behavior: The bgp aggregate-timer command now accepts the value of 0 (zero), which
disables the aggregate timer and suppresses the routes immediately.


Command Reference for "bgp aggregate-timer"

To set the interval at which BGP routes will be aggregated or to disable timer-based route aggregation, use the bgp aggregate-timer command in address-family or router configuration mode. To restore the default value, use the no form of this command.

bgp aggregate-timer seconds
no bgp aggregate-timer

Syntax Description

seconds

Interval (in seconds) at which the system will aggregate BGP routes.

•The range is from 6 to 60 or else 0 (zero). The default is 30.
•A value of 0 (zero) disables timer-based aggregation and starts aggregation immediately.

Command Default

30 seconds

Usage Guidelines

Use this command to change the default interval at which BGP routes are aggregated.

In very large configurations, even if the aggregate-address summary-only command is configured, more specific routes are advertised and later withdrawn. To avoid this behavior, configure the bgp aggregate-timer to 0 (zero), and the system will immediately check for aggregate routes and suppress specific routes.


The interesting part is that the command reference for "aggregate-address ... summary-only" doesn't mention anything about this behavior to warn you; it only says the following:

Using the summary-only keyword not only creates the aggregate route (for example, 192.*.*.*) but also suppresses advertisements of more-specific routes to all neighbors. If you want to suppress only advertisements to certain neighbors, you may use the neighbor distribute-list command, with caution. If a more-specific route leaks out, all BGP or mBGP routers will prefer that route over the less-specific aggregate you are generating (using longest-match routing).


The following debug logs show the default aggregate timer (30 secs) vs the default BGP scan timer (60 secs):

Nov 12 21:45:32.468: BGP: Performing BGP general scanning
Nov 12 21:45:38.906: BGP: aggregate timer expired
Nov 12 21:46:09.637: BGP: aggregate timer expired
Nov 12 21:46:32.487: BGP: Performing BGP general scanning
Nov 12 21:46:40.379: BGP: aggregate timer expired
Nov 12 21:47:11.099: BGP: aggregate timer expired
Nov 12 21:47:32.506: BGP: Performing BGP general scanning
Nov 12 21:47:41.828: BGP: aggregate timer expired
Nov 12 21:48:12.547: BGP: aggregate timer expired
Nov 12 21:48:32.525: BGP: Performing BGP general scanning
Nov 12 21:48:43.268: BGP: aggregate timer expired
Nov 12 21:49:13.989: BGP: aggregate timer expired
Nov 12 21:49:32.544: BGP: Performing BGP general scanning
Nov 12 21:49:44.765: BGP: aggregate timer expired
Nov 12 21:50:15.510: BGP: aggregate timer expired
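
A quick way to double-check those intervals is to diff the timestamps. A small sketch, assuming the debug output has been saved to a file:

# Small sketch: compute the interval between successive "aggregate timer expired"
# debug messages from a saved debug log (the file name is an assumption).
from datetime import datetime

timestamps = []
with open('bgp-debug.txt') as f:
    for line in f:
        if 'aggregate timer expired' in line:
            # e.g. "Nov 12 21:45:38.906: BGP: aggregate timer expired"
            stamp = ' '.join(line.split()[:3]).rstrip(':')
            timestamps.append(datetime.strptime(stamp, '%b %d %H:%M:%S.%f'))

for prev, cur in zip(timestamps, timestamps[1:]):
    print((cur - prev).total_seconds())   # roughly 30.7 seconds apart in the log above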

Guess what! After changing the aggregate-timer to 0, the CPU load increases by a steady +10%, due to the BGP Router process!

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 17%/6%; one minute: 16%; five minutes: 15%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 61   329991518  1083305044        304  4.07%  4.09%  3.93%   0 IOSD ipc task
340   202049403    53391862       3784  1.35%  2.10%  2.11%   0 VTEMPLATE Backgr
404    84594181  2432529294          0  1.19%  1.18%  1.14%   0 PPP Events
229    49275197     1710570      28806  0.71%  0.33%  0.39%   0 QoS stats proces
152    39536838   801801056         49  0.63%  0.56%  0.51%   0 SSM connection m
159    51982155   383585236        135  0.47%  0.66%  0.64%   0 SSS Manager

ASR1k#conf t
Enter configuration commands, one per line.  End with CNTL/Z.
ASR1k(config)#router bgp 100
ASR1k(config-router)# address-family ipv4
ASR1k(config-router-af)# bgp aggregate-timer 0
ASR1k(config-router-af)#^Z

...and after a while:

ASR1k#sh proc cpu s | exc 0.00
CPU utilization for five seconds: 29%/6%; one minute: 26%; five minutes: 25%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
394   143774989    19776654       7269  9.91%  7.93%  7.32%   0 BGP Router
 61   329983414  1083280872        304  4.79%  3.97%  3.90%   0 IOSD ipc task
340   202044852    53390504       3784  2.71%  2.22%  2.04%   0 VTEMPLATE Backgr
404    84591714  2432460324          0  1.03%  1.16%  1.09%   0 PPP Events
159    51980758   383575060        135  0.79%  0.66%  0.62%   0 SSS Manager
152    39535734   801779715         49  0.63%  0.54%  0.50%   0 SSM connection m

Conclusions

1) By default, the "aggregate-address ... summary-only" command doesn't immediately stop the announcement of more specific routes, as it's supposed to. You also need to change the BGP aggregate-timer to 0.
2) After changing the BGP aggregate-timer to 0, the announcement of more specific routes stops, but the CPU load increases by 10%.


C'mon Cisco! You gave us NH Address Tracking and PIC, and you can't fix the aggregation process?

 