Tuesday, March 3, 2009

Would you mind waiting for 2-3 mins in a console?

Details about bug CSCed45578 :

Console locked for 2 mins after booting with system accounting

Symptom:

When an IOS system is configured with "aaa accounting system", then users may
be prevented from starting an exec session on the console or terminal lines for
as much as three minutes, after the system reloads.

Conditions:


This delay occurs when the RADIUS and/or TACACS+ accounting server(s) are not
reachable at boot time.

Workaround:


Configure the following hidden global command:

router(config)#no aaa accounting system guarantee-first


Now imagine this on a router using many different authentication methods:


aaa authentication login default group tacacs+
aaa authentication login CONSOLE-ACCESS local
aaa authentication ppp default group radius
!
aaa accounting exec default start-stop group tacacs+
aaa accounting network default start-stop group radius
aaa accounting system default start-stop group radius
!
line con 0
login authentication CONSOLE-ACCESS

Terminal clients are authenticated through tacacs and their accounting is sent to tacacs too. These are probably the engineers trying to do their job on the router.

Console clients are authenticated using the local usernames ("enable" could be used too) but their accounting is sent to tacacs. These are probably the high-level engineers looking to do a somewhat more important job on the router.

PPP (i.e. adsl) customers are authenticated through radius and their accounting is sent to radius too.

System accounting (before reload and after reload/power-on) is also sent to radius, so any ppp users from this router that are still in the database can be erased after the aforementioned events.


What do you think will happen immediately after a reload, if the radius server is not reachable, but the router has ip connectivity?


1) ppp customers will be unable to connect (radius auth)
2) engineers will be unable to connect (tacacs auth)
3) high-level engineers will be unable to connect (local auth)

If you answered 1 (as I believed until some months ago), then you're 1/3 correct.
NOBODY will be able to connect to this router, regardless of the access method used.
This will last for 2-3 minutes in the best case (the retransmit and timeout intervals of the aaa server do not seem to have any effect), but I have seen it last 7-8 minutes in some cases (I guess before bug CSCed45578 was fixed).

Well, tell me honestly; would you mind waiting for 2-3 mins in a console, while trying to troubleshoot an issue? But let's suppose you can wait (some mins are nothing in comparison to eternity). What if there is a crash or an event happening during this small time and you need to further troubleshoot the issue? EEM can help in some circumstances, but it's not a panacea.

Unfortunately you have to live with it! And according to Cisco this behavior is expected.

But, there is a workaround! No need to worry. "no aaa accounting system guarantee-first", which was a hidden command in older releases, will fix the above issue, so anyone not affected by the non-working aaa server can have access to the router as soon as it boots up.

And just when you thought a solution had finally been found, you discover that this command causes another headache. And here is the complete story...


When "aaa accounting system default start-stop group radius" is enabled, the router will send an Accounting-On packet to the radius server at some point while it boots up. It will also try to send an Accounting-Off before going down, but that is not guaranteed (e.g. after a crash or a lost connection).

When the radius server receives the Accounting-Off packet, it assumes that all users disconnected from the router. For the same reason, when it receives the Accounting-On packet it assumes that the router just booted up, so the users were disconnected some time ago and now there should be nobody connected.

If you have configured:

aaa accounting system default start-stop group radius
aaa accounting system guarantee-first ! enabled by default => not shown

then the router guarantees that the first aaa packet sent will be the Accounting-On. So, no login (in whatever form) is allowed until a response is received from the radius server (or whatever system accounting method you have used), or the 2-3 min waiting period passes. This way you're sure that the Accounting-On packet will be the first one to reach the radius server.

If you have configured:

aaa accounting system default start-stop group radius
no aaa accounting system guarantee-first

then some users might log in before the router sends the Accounting-On packet. As a consequence, there is a possibility that some accounting start packets will be sent before the Accounting-On packet and/or some others might be lost. For most radius server implementations it is expected that all the previous user entries will be cleared from the database, when the Accounting-On packet is received. So, in your database, you'll see the above logged-in users as not connected.
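To make the ordering problem concrete, here is a minimal Python sketch of how a radius accounting database might react to these packets. The SessionTable class, the packet tuples and the NAS name "router1" are all made up for illustration; real radius servers differ, and this only models the "Accounting-On/Off clears everything for that router" behaviour described above.

```python
class SessionTable:
    """Toy model of a radius accounting database (hypothetical, not a real server)."""

    def __init__(self):
        # (nas, session_id) -> username
        self.active = {}

    def handle(self, nas, kind, session_id=None, user=None):
        if kind in ("Accounting-On", "Accounting-Off"):
            # The router rebooted or is going down: drop every session it owned.
            for key in [k for k in self.active if k[0] == nas]:
                del self.active[key]
        elif kind == "Start":
            self.active[(nas, session_id)] = user
        elif kind == "Stop":
            self.active.pop((nas, session_id), None)

    def users(self, nas):
        return sorted(u for (n, _), u in self.active.items() if n == nas)


def replay(packets):
    """Feed a list of packets from one router into a fresh table."""
    table = SessionTable()
    for packet in packets:
        table.handle("router1", *packet)
    return table.users("router1")


# With guarantee-first: Accounting-On always arrives before any user Start.
in_order = [("Accounting-On",), ("Start", "s1", "alice")]
# Without it: a user may log in before Accounting-On is sent.
out_of_order = [("Start", "s1", "alice"), ("Accounting-On",)]

print(replay(in_order))      # alice is shown as connected
print(replay(out_of_order))  # alice is connected, but the database has lost her
```

In the second case the user is still connected to the router, yet the database shows nobody, which is exactly the inconsistency that guarantee-first is meant to prevent.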

I understand the motivation behind Cisco's implementation, but I think it lacks flexibility. In my case I don't care what happens to the accounting records produced by the con/vty exec sessions during bootup. I prefer to lose them (I might not even create them in the first place) rather than postpone my access to the router.

A solution that comes to my mind right now would be to add this first-packet-guarantee specifically on each authentication method, like below:

aaa authentication login default group tacacs+
aaa authentication login CONSOLE-ACCESS enable
aaa authentication ppp default group radius guarantee-system-acct-first
aaa authentication ppp AAA-RADIUS group radius guarantee-system-acct-first

or disable this first-packet-guarantee per aaa method:

aaa authentication login default group tacacs+ no-guarantee-system-acct-first
aaa authentication login CONSOLE-ACCESS enable no-guarantee-system-acct-first
aaa authentication ppp default group radius
aaa authentication ppp AAA-RADIUS group radius

or enable/disable it per aaa server group, something like:

aaa group server radius RADIUS-GROUP
server 1.1.1.1 auth-port 1645 acct-port 1646
server 2.2.2.2 auth-port 1645 acct-port 1646
guarantee-system-acct-first


Before an aaa method or server group with the "guarantee-system-acct-first" parameter is invoked, the router should make sure that the "Accounting-On" packet has been sent and a response has been received. All other aaa methods should be invoked without any restrictions.

This way, engineers will be able to access the router without any waiting (through tacacs/local auth), but ppp customers will not be allowed to log in until the radius server has responded to the Accounting-On packet or the waiting time has passed.
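The proposed rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not anything IOS does: the AaaMethod class, the guarantee_first flag and the 180-second limit (taken from the "2-3 mins" above) are all assumptions of mine.

```python
from dataclasses import dataclass

# Assumption: upper bound of the "2-3 mins" waiting period mentioned above.
MAX_WAIT_SECONDS = 180


@dataclass(frozen=True)
class AaaMethod:
    name: str
    guarantee_first: bool  # hypothetical per-method flag from the proposal


def login_allowed(method, acct_on_acked, seconds_since_boot):
    """Decide whether a login using this aaa method may proceed."""
    if not method.guarantee_first:
        # Methods without the flag are never held back.
        return True
    # Flagged methods wait for the Accounting-On response or for the timeout.
    return acct_on_acked or seconds_since_boot >= MAX_WAIT_SECONDS


console = AaaMethod("CONSOLE-ACCESS", guarantee_first=False)
vty = AaaMethod("default (tacacs+)", guarantee_first=False)
ppp = AaaMethod("ppp default (radius)", guarantee_first=True)

# Five seconds after boot, radius unreachable:
for method in (console, vty, ppp):
    print(method.name, login_allowed(method, acct_on_acked=False, seconds_since_boot=5))
```

With today's global behaviour every method effectively has the flag set, which is why even a console login with local authentication has to wait.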

Cisco recommends handling this through a PER (Product Enhancement Request), without of course being able to guarantee (have they tried using the above command? :-p) that it will be implemented.

I, personally, have had negative experiences with 3 PERs so far and I don't see a reason to make it 4. How about you? Have you ever submitted a PER to Cisco, and what was the result? Please spare a minute and complete the poll to the right.

Update 9 Mar 2009: Bug CSCsy20392 has been opened, for everyone interested in following it. It has been opened as an enhancement (to change the default behaviour of the accounting guarantee-first command) and hopefully a PER will be attached to it.

7 comments:

  1. Really good findings and workaround.

    regards
    shivlu jain

  2. From my experience, only those who haven't developed real equipment code in C programming language and below get somewhat frustrated with bugs. Those who have know that bugs are inevitable given human nature and the fact of minimum time to market times. When you ask for the next super feature you require ASAP to maintain your client base and offer new services, think again because more bugs will surely creep in. How much time do you think a developer has to develop code and resolve bugs? Surely less than it took you to write this article and definitely much less than the time it took you to come to those conclusions. And not many people decide to write real equipment code, because it is plainly too hard, while most people try to get it out the easy way in this life. While others are waiting for mistakes to stop happening, I am going to make my next mistake, because I just can't help it. Grace to those who never made any mistakes in their lives.

  3. Anonymous, my complaint is not about a bug. It's about a misbehavior that is NOT considered a bug.
    If Cisco considered it a bug, then I wouldn't mind waiting for some months until it got fixed.

  4. Misbehavior happens everywhere. No matter in which environment you work, I am sure you have noticed cases with your colleagues misbehaving to clients or not paying attention to detail in complaints from clients, while you thought you could do better, but even in those cases you simply could not decide to do their job for the rest of your life. Managers are interested in average behavior and do not hire only the best, because they cannot afford to be left with only a handful of people. In many environments there exist escalation procedures and only if you manage to go up to very high levels would your issues have chances to be resolved. And even this would depend on the severity of your issue and whether workarounds exist compared to many other existing issues.

    People often find it easy to complain against an organization as a whole, such as cisco, and they do not consider the fact that every organization consists of many people, and that what they say actually corresponds to the work of some of those people, or even to factors not directly under those people's control.

    Also have in mind that in the engineering area, it is common for one type of engineering team to blame the other. Software people think their software would have been perfect and would have met the deadlines, if only hardware people had done their job right. Network engineers think they are the best (especially in the provider arena), and their networks would be perfect, if only those bugs did not exist. This story goes even further and affects non-engineering teams. Sales people think they would sell more if those network engineers were more careful when touching their routers and could resolve all client issues. I guess I feel the most pity for the hardware people who can only blame themselves and God!

  5. I have to agree with most of what you say. But this doesn't change the fact that all the cisco engineers I have talked to have agreed that what is happening in my case is not correct. It's the BU that has another opinion and considers the above as expected.

  6. Reading up on this one just now. Thanks for the well explained post. Question is: for older ios (presumably those without CSCed45578 or CSCed24846 integrated) that do not have the option to turn off guarantee-first, will we have the above problem with no solution at all, except to turn off 'aaa accounting system'?

    Replies
    1. Although I haven't verified that, I believe what you say will probably happen.


 
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Greece License.