When Disaster Strikes…

So, everyone would agree that a helping hand is nice to have now and then. Like the time I thought it would be a good idea to skateboard while holding onto my brother’s car as he drove down the street. It was his helping hand reaching down to pick me up off the road (bleeding) and sitting me in the car that I won’t forget (I still have that scar on my hip). It was brother helping brother – an understanding that when one is down, the other will help get him on his feet (hopefully before mom sees so that we could get our story coordinated as to how it happened). In the UCS world, the brothers in this scenario are the Fabric Interconnects (I’m not sure who the mother is).

There are times when a Fabric Interconnect might encounter a software failure, for whatever reason, and land at the “loader” prompt. It’s rare, but it can happen. The loader prompt is a lonely place and it’s not pleasant. The good news is that if you still have a single FI working, you can use it to resurrect the broken FI. First off, if you ever find yourself staring at the loader prompt, stop cursing and just try to unplug it and plug it back in. Don’t worry if “dir” shows no files – just try it. I’ve seen it work and the FI comes right back on the next boot. If that doesn’t work, you have some work to do…

The loader is just that – a loader. It “loads” an OS – like a bootstrap. You need 3 files to permanently get out of the loader – kickstart, system, and the UCSM image. Luckily all of these live on your remaining FI. The bad news is you can’t get to them without bringing it down as well. So, if you’re in production and can’t afford to bring down the entire UCS pod, you should stop reading and call TAC. They can get you the 3 magic files you need and can get it all running without bringing anything additional offline. But if you’re in a situation where you can afford to take down the remaining FI, you can fix this problem yourself.

To make this work, you will need:

  • Non-functional FI
  • Functional FI
  • FTP Server
  • TFTP Server

Your basic recovery will include:

  1. Disconnect the L1/L2 cables between the FIs to avoid messing up the cluster data they share
  2. Boot FI-A to loader
  3. Force FI-B to loader
  4. Boot kickstart on FI-B
  5. Assign IP address to FI-B
  6. FTP kickstart, system, and ucsm images from FI-B to an FTP server
  7. Reboot FI-B back to its normal state
  8. Get kickstart image onto TFTP server (unless FTP/TFTP are the same server)
  9. Boot kickstart image on FI-A via TFTP server
  10. FTP kickstart, system, and ucsm images down to FI-A
  11. Copy ucsm image file to the root
  12. Load system image on FI-A
  13. “Activate” the firmware on FI-A
  14. Connect L1/L2 cables back and rejoin the cluster

Reboot the “good” FI (known in this document now as FI-B), and begin pressing CTRL+R to interrupt the boot process. You will find FI-B now stops at the loader prompt too. Now type

boot /installables/switch/ <tab>

which will show you all files in this folder. You are looking for the obvious kickstart file and you want the latest one. To make the display easier to read, I would type this:

boot /installables/switch/ucs-6100-k9-kickstart <tab>

Backspace is not an option so if you make an error, use the arrow keys and the “delete” key to fix typos.
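
If the tab-completion listing is long, picking the latest kickstart by eye is easy to get wrong, because a plain alphabetical sort puts “2.4” after “2.10”. Here is a minimal Python sketch of comparing the version numbers numerically instead. The two file names in the example are hypothetical (only the ucs-6100-k9-kickstart prefix comes from the listing above), and the helper names are my own.

```python
import re

def version_key(filename):
    """Return the numeric parts of the version as a tuple, for sorting."""
    # Everything after the first "." is treated as the version string,
    # e.g. "5.0.3.N2.2.10.418.bin" (assumed naming pattern).
    version = filename.split(".", 1)[1] if "." in filename else ""
    if version.endswith(".bin"):
        version = version[:-4]
    return tuple(int(number) for number in re.findall(r"\d+", version))

def latest_image(filenames, prefix="ucs-6100-k9-kickstart"):
    """Pick the file with the highest version among those matching prefix."""
    candidates = [name for name in filenames if name.startswith(prefix)]
    if not candidates:
        raise ValueError(f"no file starts with {prefix!r}")
    return max(candidates, key=version_key)

# Hypothetical listing, typed in from the loader's tab completion.
listing = [
    "ucs-6100-k9-kickstart.5.0.3.N2.2.4.120.bin",
    "ucs-6100-k9-kickstart.5.0.3.N2.2.10.418.bin",
]
print(latest_image(listing))
```

This only compares the digits it finds, so treat the result as a hint and confirm it against the version you know the pod was running.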

Select the latest image, hit enter, and FI-B now begins to boot the kickstart image. Give it a few minutes and you should find it stops at the “boot” prompt. This prompt is not as lonely as the loader prompt, but it’s still not a fun place to be (at least you can backspace now). You actually will have much more functionality than you did with loader, but won’t need it for this exercise. At this point you need to assign an IP address to FI-B so that you can FTP the kickstart image to an FTP server. The commands will look like this:

#Config t

#int mgmt 0

#ip address X.X.X.X <mask>

#no shut

Wait 10-15 seconds

#<ctrl+z to return the shell to the top level>

# copy bootflash:installables/switch/ucs-6100-k9-kickstart <tab>

Select the latest version and copy it to the FTP server.

DO NOT USE THE FILES IN THE ROOT OF BOOTFLASH AT ANY TIME DURING THIS PROCESS. Nothing catastrophic will happen, but the FI will not boot in the end.

The shell will prompt you for the FTP server address and credentials.

You need to allow about 10-15 additional seconds after you “no shut” the interface for the IP to become active and useable.

You now need to copy the system and UCSM images as well, since you will need them soon enough. The other two files will look something like:

installables/switch/ucs-manager-k9.2.1.0.418.bin

installables/switch/ucs-6100-k9-system.5.0.3.N2.2.10.418.bin

Your versions will be different.
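
Before rebooting FI-B, it is worth confirming that all 3 files really landed on the FTP server. The sketch below is one way to do that from the FTP server itself, assuming the files keep the name prefixes shown above. The directory path in the usage note is hypothetical, and the MD5 helper is only useful if you have a checksum to compare against (for example, one published alongside the image download).

```python
import hashlib
from pathlib import Path

# Name prefixes of the 3 images needed to get out of the loader.
REQUIRED_PREFIXES = {
    "kickstart": "ucs-6100-k9-kickstart",
    "system": "ucs-6100-k9-system",
    "ucsm": "ucs-manager-k9",
}

def find_images(filenames):
    """Map each image type to the matching .bin file names (may be empty)."""
    return {
        kind: sorted(
            name for name in filenames
            if name.startswith(prefix) and name.endswith(".bin")
        )
        for kind, prefix in REQUIRED_PREFIXES.items()
    }

def missing_images(filenames):
    """Return the image types that have no matching file."""
    found = find_images(filenames)
    return [kind for kind in REQUIRED_PREFIXES if not found[kind]]

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 of a file in chunks so large images fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_directory(directory):
    """Print which images are present in directory, with size and MD5."""
    names = [entry.name for entry in Path(directory).iterdir() if entry.is_file()]
    for kind, matches in find_images(names).items():
        if not matches:
            print(f"{kind}: MISSING")
        for name in matches:
            path = Path(directory) / name
            print(f"{kind}: {name} {path.stat().st_size} bytes md5={md5_of(path)}")
    return missing_images(names)
```

Calling check_directory("/srv/ftp/ucs") (a made-up path) returns an empty list when all 3 image types are present. A zero-byte or suspiciously small file in the output is a good reason to repeat the copy while FI-B is still at the boot prompt.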

Once all 3 files are copied to the FTP server and you are back at the boot prompt, reboot FI-B back into production. You now need to make the kickstart file available via TFTP, using whatever process you normally use to make that happen. One word of caution here – TFTP blows. It runs on UDP, it’s slow, and its error handling is minimal. If your first attempt at booting fails, try a different TFTP server program (trust me on this – I had bruises on my head from banging it on the wall). Once the file is available via TFTP, return to FI-A, which is at the loader prompt. You will now boot that kickstart image via TFTP using these commands:

Incidentally, you cannot ping this address from an outside station. Just FYI
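
Since a flaky TFTP server is the most likely thing to waste your time here, it can help to test the server from a workstation before blaming the FI. The sketch below sends a standard TFTP read request (opcode 1, as defined in RFC 1350) and reports whether the server answers with the first data block or with an error. The function names are my own, and the server address and file name in the usage note are placeholders.

```python
import socket
import struct

OP_RRQ, OP_DATA, OP_ERROR = 1, 3, 5  # TFTP opcodes from RFC 1350

def build_rrq(filename, mode="octet"):
    """Build a read request: opcode, file name, NUL, transfer mode, NUL."""
    return (
        struct.pack("!H", OP_RRQ)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )

def describe_reply(packet):
    """Summarize the first packet the server sent back."""
    if len(packet) < 4:
        return "malformed reply"
    opcode, value = struct.unpack("!HH", packet[:4])
    if opcode == OP_DATA:
        return f"ok: received block {value} ({len(packet) - 4} bytes)"
    if opcode == OP_ERROR:
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return f"server error {value}: {message}"
    return f"unexpected opcode {opcode}"

def probe_tftp(server, filename, timeout=5.0, port=69):
    """Ask the server for filename and report on the first reply only."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (server, port))
        try:
            # 4 header bytes plus at most 512 data bytes per block.
            packet, _ = sock.recvfrom(516)
        except socket.timeout:
            return "no reply (timeout)"
    # The transfer is abandoned here; the server will retry and give up.
    return describe_reply(packet)
```

For example, probe_tftp("192.0.2.10", "kickstart.bin") with your own server address and file name should come back with “ok: received block 1 (512 bytes)” if the server is serving the file. A timeout or “File not found” here means the problem is on the TFTP side, not on FI-A.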

Then it begins loading the image. It should take just a few seconds to actually start booting. FI-A will now land at the boot prompt like FI-B did earlier. You need to rebuild the filesystem on FI-A, so type:

#init system

This will take a few minutes. When it’s done, you can now use FTP to copy the 3 files down to FI-A. Use this command to retrieve each file:

#Copy ftp: bootflash:

The shell will prompt you for everything it needs to copy the files.

After all 3 are copied, one very important command needs to be run now and it won’t make sense, but you must do this. You need to run this command:

Copy bootflash:/ucs-manager-k9.2.1.0.418.bin bootflash:/nuova-sim-mgmt-nsg.0.1.0.001.bin

The destination name, nuova-sim-mgmt-nsg.0.1.0.001.bin, must be typed exactly as shown.

Now that you have all 3 files local on the FI, you would be able to recover much quicker if the FI were to lose power. At this moment, if that happened, you would be returned to the loader prompt, but you would be able to boot via bootflash instead of TFTP. Anyway, you are now at the boot prompt and need to finish booting. Type load bootflash://ucs-6100-k9-system.5.0.3.N2.2.10.418.bin. This will start loading the system image and when it’s done loading, it will look for the UCSM image that you also copied and the FI should come up. It will walk you through the setup menu and since the L1/L2 cables are not connected, I would go ahead and set it up as standalone – we will join it to the cluster soon.

Once you are logged into the FI, you need to activate the current firmware to set the startup variables. The easiest way to do this is in the GUI. Just go into Firmware Management, select “Activate Firmware”, and select the FI. You will likely see that no version is in the startup column. Regardless, you need to activate the version that is already there. If it doesn’t let you, exit Firmware Management, navigate to the Fabric Interconnect on the left-side Tree menu, and activate the firmware from there using the “force” option. This will fix up the ucsm image file that we copied to the root as well (turns it into a symbolic link).

That’s about it. You should be OK to erase the config on FI-A (#connect local-mgmt), hook up the L1/L2 cables and rejoin the cluster on the next reboot. I really hope you don’t need to ever use this… I mainly wrote this blog for myself because in the lab we do a lot of crazy stuff and I often forget a step here and there. So I wanted it all written down to refer back to and I’ve wanted to get this one done for quite some time.

Thanks for reading…

-Jeff

28 thoughts on “When Disaster Strikes…”

  1. Jeff, Call me as I’m in Atlanta for training with Earthlink Tuesday and Wednesday, March 12th and 13th. I hope all is well. I sent you a Linked In invite as well but don’t have your number. My cell is still the same if you have it. Andrew

  2. Pingback: Resetting UCS to Factory Defaults | Jeff Said So

  3. Massive thanks for writing this up. You wished no one would ever need to use it but I just did after a failed upgrade to 2.2(1c)

    I was lucky the kickstart image had copied ok to FI-A during an auto firmware install. It failed to copy the system or manager. So i skipped the TFTP boot part but used the rest.

    I used this to grab both of the working FI-B and copy them down to FI-A wipe and reload/rejoin the cluster.

    Great work and thanks again.

    Joe.

  4. This was a lifesaver. Had an FI that was panicking after booting up. Used the method to get in and do the system init and get it on the latest code. Hope I never have to use this again… but glad this info was out there. Thanks!

  5. [in engineering lab]

    So, I have both FI’s with corrupted bootflash. I did “init system” on both, and loaded 2.2(1.210) kickstart/system on both. I brought both up as standalone in order to install full infrastructure bundle on both (if i brought them up in “cluster” mode, I could not do anything with them, could not login to UCSM GUI, could not even do “scope firmware” from console, gave “timeout communicating with DME”, etc.) OK. got both completely configured, 2.2(1.214) running on both UCSM and FI.

    Now, I convert both to cluster mode via:
    connect local-mgmt
    enable cluster 10.193.184.190

    Now, I can no longer log into the GUI using the mgmt IP: (“UCSM is not available on secondary node”), and I cannot login to the GUI using the virtual IP (10.193.184.190) (no response whatsoever)

    Both FIs still think they are primary (-A) – I did not see a way when transitioning from standalone to cluster mode to tell the node it was about to become a secondary.

    I’ve been at this for two days now, need this setup for fixing some VIC FW issues that are due soon, any help appreciated!!

    thanks,
    -reese

    • So, I can tell you for certain that if you put an FI into cluster mode and never join the subordinate node, you will get the error logging in that you described. The method you used to create the cluster is ONLY used on the primary. The only method for the subordinate to join is through initial setup. So you need to a) make sure the ethernet connection is good between the two nodes on the L1 and/or L2 ports, b) enter local mgmt on either of the nodes and run “erase config” and reboot and c) when that node comes back up, tell it to join the cluster that has already been created.

  6. thank you very much!

    i come from china. I have a ccie dc Rack.

    I upgraded my ccie dc rack using your method.

    One big help you gave is that I now know the special name nuova-sim-mgmt-nsg.0.1.0.001.bin and the special directory /installables/switch/ .

    thank you again.

    • I assume you have already rebooted the FI? Does any output come to the console? Make sure the console cable is known to be working.

      • Hello Jeff,

        I have the same problem… both FIs are POWERed ON but no output on the screen. The console cable is oK, I’ve tested it with a switch that is above the FIs.
        All lights are green, but the stat LED is off and if I hit the ID button, it does not start flashing blue. This happened after an upgrade to version 2.2.x

  7. Hi guys.

    I just did this after a corrupted MGMT partition.
    There is NO NEED to take FI A down.
    Download the infrastructure bundle file (ucs-k9-bundle-infra.2.2.3c.A.bin). Use 7-zip to extract the .A file inside.
    Then extract everything from this.

    /Kristian

  8. hy jeff!
    whenever i login my ucsm and give login and password config/config.i get the ERROR:
    login error
    failed login info: UCSM is not available on secondary node

    PLZ!!! tell me how to solve this problem. I will very very thankful to you.:)

        • I believe if you ssh to the FI and run
          Scope fabric interconnect a (or b)
          Show

          It will show the ip of the FI. Do this on both FI’s and then try to login to each one in the GUI.
          There is a virtual ip shared between the FI’s but I’m not at a system to find the command.

  9. Hi Jeff

    could you please let me know how to fix these issues. Actually i tried to downgrade the FI’s because its not able to discover the Rack mount servers

    primary FI : not able to login with our default password

    Secondary FI : svcconfig init: /opt/db/sam.config NOT Found. Sleeping for a while
    svcconfig init: /opt/db/sam.config NOT Found. Sleeping for a while
    svcconfig init: /opt/db/sam.config NOT Found. Sleeping for a while

    • There was a corner-case issue where the FI file system came up read-only. I believe rebooting the FI fixed the issue and this was addressed in a later 2.0.1 release. If rebooting did not fix it, TAC can RMA the FI. If you do not have TAC coverage, look at this article (there is nothing physically wrong with your FI). http://jeffsaidso.com/2013/01/when-disaster-strikes/

  10. hi am getting,
    FI-B is want to update the firmware version of peer FI-A ?
    when choosed ‘NO’
    it is going to the initial setup, what i have to do?

    • In regards to the orphaned code FI’s:

      Perform a con local; erase config on both

      Setup temporarily each in standalone with L1/L2 disconnected

      Erase config; reconnect L1/L2

      Setup cluster..

  11. Jeff:

    Cisco is putting out such crappy software – getting off 2.2.3d ALWAYS leads to some form of corruption. In my lab, auto-upgrade or manual processes fail with F997438. This process is not a palatable fix for an upgrade process that worked well for 6 years. WHAT is going on over there?

  12. Hi Sir,

    My office has an UCS 6120xp which lost all his image stuff. it starts at loader prompt and “dir” shows no files. The 6120xp is quite old so we do not have TAC coverage any more. And we dont have any other FI either. Is there any chance i can get the necessary three image file from somewhere?
    So desperate…

  13. Hi Jeff. . .

    We are using 2 FIs, FI A and FI B, in cluster mode. Now I want to use FI A in standalone mode without affecting FI B and without affecting service profiles on FI A. How can we do it?

    • I’m afraid you can’t do this smoothly. At least not that I am aware of, and I’ve been away from UCS for a few years now. You can convert from Standalone to Cluster mode, but not the other way. You can remove the standby/secondary FI without disrupting anything (and erase its config and use it how you like), but the surviving FI will always know it’s missing. This will cascade redundancy errors throughout the system. You can reduce some of the errors by editing the service profiles to remove all storage and network connections that were attached to the failed FI. Wish there was better news. Maybe someone else will know something that was done for this after I left that team. The reason I doubt it exists is that all teams have limited resources to inject new features. The features have to be prioritized and are driven by customer demand. For this to get done, there would need to be high demand for it to usurp other features on the list. It’s just sort of the way it works – customer requests and demands really do drive the business. Good luck.
