Should the crash be an option?

Saturday, May 16, 2009

Last weekend I had a 7600/RSP720 crash multiple times in the middle of the night for no apparent reason. The console showed the following messages:


%Software-forced reload


03:31:06 EET Sat May 9 2009: Unexpected exception to CPU Vector 1500, PC = 0x0A7FA5B0, LR = 0x0A7FA550

-Traceback= A7FA5B0 A7FA550 A558DD8 A40B698 A442928 A442994 8814E18 8815FB8 8816CD4 A874C8C A8755BC A82EFCC A82F1C0 A8306C0 A7F05EC A87A0DC

CPU Register Context:
MSR = 0x00029200  CR  = 0x24444024  CTR = 0x0A44DA0C  XER   = 0x00000000
R0  = 0x0A7FA550  R1  = 0x13D4C3E0  R2  = 0xFFFCFFFC  R3    = 0x137005B4
R4  = 0xFFFFFFFE  R5  = 0x00000000  R6  = 0x13D4C3B8  R7    = 0x00029200
R8  = 0x00029200  R9  = 0x00000000  R10 = 0x0D06DBF8  R11   = 0x137005B0
R12 = 0x00013FB4  R13 = 0x04044000  R14 = 0x00000000  R15   = 0x00000000
R16 = 0x0E844B94  R17 = 0x0E8663A8  R18 = 0x0C236BD4  R19   = 0x00000000
R20 = 0x13731364  R21 = 0x79D97210  R22 = 0x13D4C580  R23   = 0x0E851E4C
R24 = 0x0D050000  R25 = 0x00000000  R26 = 0x1373158C  R27   = 0x0000001F
R28 = 0x00000001  R29 = 0x0C420000  R30 = 0x103FEAA0  R31   = 0x00000000

Writing crashinfo to bootdisk:crashinfo_20090509-033106-EET

*** System received a Software forced crash ***
signal= 0x17, code= 0x1500, context= 0xce08b9c
PC = 0x8273dcc, Vector = 0x1500, SP = 0x1b394d08
rommon 2 >

Booting manually from rommon didn't help: once the boot completed (and you had about 10-20 seconds of CLI access!), the RSP crashed again. Since rommon doesn't provide a way to view crashinfo files, you're stuck with the crashes and left guessing.

The crash appeared at a random time, on a 7600 that gets reconfigured about once a month, so my best guess was a faulty or stuck supervisor card. After the on-call engineer OIRed the card, it booted fine and I finally had a look at the crashinfo files. The crashinfo on the RP didn't show anything useful, but the crashinfo on the SP showed the following:


%ONLINE-SP-6-INITFAIL: Module 6: Failed to synchronize Port asic

CCO describes it as follows:

Description
The cause of the crash is that the Pinnacle ASIC failed to synchronize. This is usually caused by a bad contact or a badly seated card.

Workaround
The system recovers without user intervention. If the error message recurs, then reseat the concerned line card or module.

Yeah, sure. The system was stuck in rommon after a few crashes:


Active crashed three times, disabling auto-boot and dropping to rommon

If anyone would like to answer, these are my questions:

Why couldn't the RSP print an error message (after 3 crashes) on the console while in IOS and sit there waiting for the user to act? What benefit does this specific crash and return to rommon offer?

I understand that a crash/reload is generally used to force the device to recover from a bad situation and possibly avoid long downtime. In my case, the situation was a bad contact or a badly seated card (although there is a lot of arguing that can be done here). How was the reload supposed to solve this? Was the reload going to "reseat" the module somehow? Are any pins moved while reloading? Does rommon offer better seating?

As it turned out, the crash didn't help at all: the RSP returned to rommon and sat there waiting, when it could have stayed in IOS, which provides much better feedback to the user.

On the other hand, there's the redundancy factor, which, imho, is the only case in which a crash/reload should occur (I'm still referring to the case of a badly seated module). As long as you have a second working RSP, it should take the active role. But the system knows whether you have a redundant setup long before it gives you the CLI prompt, so it should be easy to check for it.
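The policy argued for here can be sketched as a small decision function. This is only an illustration of the argument, not how IOS actually behaves: `SupervisorState`, `Action`, `choose_action` and the `fault_is_seating` flag are hypothetical names, and the threshold of three is taken from the "crashed three times" console message quoted earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RELOAD = "reload and hope the fault clears"
    SWITCHOVER = "let the standby supervisor take the active role"
    STAY_DEGRADED = "stay in IOS, print the error, wait for the operator"


@dataclass
class SupervisorState:
    consecutive_crashes: int   # crashes since the last clean boot
    standby_ready: bool        # a second, working supervisor is present
    fault_is_seating: bool     # fault looks like a bad contact / badly seated module


# Assumption: mirrors the "crashed three times" message from the post.
MAX_CRASHES = 3


def choose_action(state: SupervisorState) -> Action:
    # With a redundant supervisor, giving up the active role is useful.
    if state.standby_ready:
        return Action.SWITCHOVER
    # A reload cannot reseat a module, and a crash loop helps nobody:
    # stay up in a degraded state so the crashinfo can be read from the CLI.
    if state.fault_is_seating or state.consecutive_crashes >= MAX_CRASHES:
        return Action.STAY_DEGRADED
    # Otherwise a reload might still recover the device.
    return Action.RELOAD
```

For the situation described in this post (no standby, seating fault, three crashes), `choose_action(SupervisorState(3, False, True))` returns `Action.STAY_DEGRADED` instead of dropping to rommon.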

What I'm trying to say is that a crash followed by a reload is not always a panacea when there is a problem. It should be an option used only when there is no other solution. In the case of a badly seated module, crashing and dropping into rommon isn't of any help (at least until rommon gets upgraded to something more intelligent).

6 comments:

stretch, 26 May, 2009 00:40
In two words: bad code. I'm not ragging on Cisco here, but problems like this are symptomatic of IOS' ancient monolithic architecture. The absence of process and memory separation makes proper error containment and handling difficult. Hopefully we'll see more mature error recovery as modular IOS and NX-OS gain ground.
Tassos, 28 May, 2009 23:43
Jeremy, generally I agree with you. I even opened a TAC case in order to get some feedback from the developers, but nothing good came out of it. Imho, in this particular case IOS could help if smarter code were used.
Anonymous, 03 September, 2009 07:20
As for IOS disabling auto-boot after 3 consecutive active crashes: would you rather IOS just be permanently stuck in an endless crash-reboot cycle? Is that going to reseat your module and fix your problem for you? Nope, more than likely all you will get is a bootflash overflowing with crash dump files.

As for the "I can't boot the system up to view the crashdump files" situation: ask your Cisco contact what the "CRASHINFO" environment variable in rommon does, and see if you can get it to write crashdump files to another location, like disk0:, which uses a simple enough FAT format that you should be able to view the files on a typical PC.

It's not that Cisco IOS has bad code, it's that their documentation concerning IOS crash situations and rommon environment variables is poor (the latter being non-existent).
Tassos, 06 September, 2009 17:13
Anonymous,

I would prefer to have it stuck in IOS, like it was for some seconds before the crash. I don't expect all services to work, but I expect to have access to system processes and files in order to do my job as fast as possible. And since this was happening for 20 secs (and during this time CLI access was working fine), I tend to believe that it's possible to make it last longer.

Regarding the use of the crashinfo variable, I was actually looking for a way to see the crashdump file while it's on the router. When the router is in a remote location and your only help is the on-call 1st level engineer, you try to do as much as possible remotely. Using IOS, I can do "more file", see its contents and probably find the cause, something that is not available in rommon.
Anonymous, 21 November, 2009 12:17
Hi

You can use priv in rommon and then fdump dev:filename to view the crashdump.

brgds

TS
Anonymous, 15 November, 2010 15:07
This is a hardware problem with RSP720. But, this can also be an outcome of a possible bug, which may be triggered by some config changes.