Last weekend I had a 7600/RSP720 crash multiple times in the middle of the night for no apparent reason. The console showed the following messages:
03:31:06 EET Sat May 9 2009: Unexpected exception to CPU Vector 1500, PC = 0x0A7FA5B0, LR = 0x0A7FA550
-Traceback= A7FA5B0 A7FA550 A558DD8 A40B698 A442928 A442994 8814E18 8815FB8 8816CD4 A874C8C A8755BC A82EFCC A82F1C0 A8306C0 A7F05EC A87A0DC
CPU Register Context:
MSR = 0x00029200 CR = 0x24444024 CTR = 0x0A44DA0C XER = 0x00000000
R0 = 0x0A7FA550 R1 = 0x13D4C3E0 R2 = 0xFFFCFFFC R3 = 0x137005B4
R4 = 0xFFFFFFFE R5 = 0x00000000 R6 = 0x13D4C3B8 R7 = 0x00029200
R8 = 0x00029200 R9 = 0x00000000 R10 = 0x0D06DBF8 R11 = 0x137005B0
R12 = 0x00013FB4 R13 = 0x04044000 R14 = 0x00000000 R15 = 0x00000000
R16 = 0x0E844B94 R17 = 0x0E8663A8 R18 = 0x0C236BD4 R19 = 0x00000000
R20 = 0x13731364 R21 = 0x79D97210 R22 = 0x13D4C580 R23 = 0x0E851E4C
R24 = 0x0D050000 R25 = 0x00000000 R26 = 0x1373158C R27 = 0x0000001F
R28 = 0x00000001 R29 = 0x0C420000 R30 = 0x103FEAA0 R31 = 0x00000000
Writing crashinfo to bootdisk:crashinfo_20090509-033106-EET
*** System received a Software forced crash ***
signal= 0x17, code= 0x1500, context= 0xce08b9c
PC = 0x8273dcc, Vector = 0x1500, SP = 0x1b394d08
rommon 2 >
Booting manually from rommon didn't help: once the boot completed (and you were given about 10-20 seconds of CLI access!), the RSP crashed again. Since rommon doesn't provide a way to view crashinfo files, you're stuck with the crashes, left to guess at the cause.
This crash appeared at a random time, on a 7600 that gets configured about once a month, so my best guess was a faulty or stuck supervisor card. After the on-call engineer OIRed the card, it booted fine and I finally had a look at the crashinfos. The crashinfo on the RP didn't show anything useful, but the crashinfo on the SP showed the following:
%ONLINE-SP-6-INITFAIL: Module 6: Failed to synchronize Port asic
CCO describes it as follows:
The cause of the crash is that the Pinnacle ASIC failed to synchronize. This is usually caused by a bad contact or a badly seated card.
The system recovers without user intervention. If the error message recurs, then reseat the concerned line card or module.
Yeah, sure. The system was stuck in rommon after a few crashes:
Active crashed three times, disabling auto-boot and dropping to rommon
If anyone would like to answer, these are my questions:
Why couldn't the RSP print an error message on the console (after 3 crashes) while still in IOS and sit there waiting for the user to act? What benefit does this specific crash and return to rommon offer?
I understand that a crash/reload is generally used to force the device to recover from a bad situation and possibly avoid a long downtime. In my case, the cause was a bad contact or a badly seated card (although there is a lot of arguing that can be done here). How was the reload supposed to solve this? Was the reload going to "reseat" the module somehow? Are any pins moved while reloading? Does rommon offer better seating?
As it turned out, the crash didn't help at all, because the RSP dropped to rommon and sat there waiting, while it could have stayed in IOS, which provides much better feedback to the user.
On the other hand, there is the redundancy factor, which, imho, is the only case where a crash/reload should occur (I'm still referring to the case of a badly seated module). As long as you have a second working RSP, it should take the active role. But the system knows whether it has a redundant setup long before giving you the CLI prompt, so it should be easy to check for it.
What I'm trying to say is that a crash followed by a reload is not always a panacea. It should be an option used only when there is no other solution. In the case of a badly seated module, crashing and dropping into rommon is of no help (at least until rommon gets upgraded to something more intelligent).
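To make the argument concrete, here is a minimal sketch in Python of the kind of decision I'd like to see. It is purely illustrative and has nothing to do with how IOS or rommon are actually implemented: the `Fault` fields, the `Action` names and the `choose_action` function are all made up for this post, and the only number taken from the real system is the limit of three crashes from the console message above.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical outcomes, named for this illustration only.
    SWITCHOVER = "fail over to the standby supervisor"
    STAY_UP_AND_ALERT = "stay in IOS, log the fault, wait for the operator"
    RELOAD = "reload the supervisor"


@dataclass
class Fault:
    # All fields are assumptions; none of them map to a real IOS structure.
    needs_physical_fix: bool  # e.g. a badly seated module
    standby_available: bool   # a second, healthy supervisor is present
    crash_count: int          # crashes seen so far


MAX_CRASHES = 3  # mirrors the "crashed three times" console message


def choose_action(fault: Fault) -> Action:
    # With a redundant supervisor, handing over the active role is worthwhile.
    if fault.standby_available:
        return Action.SWITCHOVER
    # A reload cannot reseat a card, and repeated crashes suggest that
    # reloading is not helping: stay up and tell the operator instead.
    if fault.needs_physical_fix or fault.crash_count >= MAX_CRASHES:
        return Action.STAY_UP_AND_ALERT
    # Otherwise a reload is a reasonable attempt at recovery.
    return Action.RELOAD


if __name__ == "__main__":
    # My case: badly seated module, no working standby, three crashes.
    print(choose_action(Fault(True, False, 3)).value)
```

The point of the ordering is that the reload becomes the last resort rather than the first reaction, and that the redundancy check happens before anything destructive.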