UCS Boot-from-SAN Troubleshooting with the Cisco VIC (Part 2)

So, first let me define some terms. The Cisco VIC is also called “Palo” – a codename that sort of stuck (much to the chagrin of the marketing team). Palo’s official name is M81KR – now do you see why “Palo” sort of stuck 🙂 ? We have some new VIC cards as well, the VIC-1240 and VIC-1280, and Sean McGee (@mseanmcgee) talks more about the VIC-1280 here. The VIC-1240 is a built-in option on the M3 blades.

Now that we settled that, where is Part 1 of this article? Well, my good friend Ryan Hughes (@angryjesters) got the ball rolling on this. He took it upon himself to write an excellent article explaining how to access the obscure-but-useful command called LUNLIST. So if you are looking for Part 1 to this article, I’m not the author of it. I learned some things reading Ryan’s article, which is not all that surprising since I rarely spend time with Ryan without learning something. You should check out his site if you have not seen the article already, but briefly, LUNLIST is a command that shows you what the Cisco VIC HBA can actually “see” on the fabric, much like a typical HBA BIOS would…but way cooler.

Why am I writing Part 2? Well, in the comments to Ryan’s article, a commenter noted that during the HBA POST process, the VIC itself will show you if zoning and LUN masking are correct and that LUNLIST may not be needed. While that comment is partly true, LUNLIST is definitely needed and is a great help in troubleshooting. There are prerequisites that must be met for the VIC to show success during POST, and when POST does not show you what you expect, you don’t always know where to start. Is the problem in the profile, the Fabric Interconnect, the upstream FC switch, or the array itself? It’s this kind of thing that makes server administrators irritated with boot from SAN to begin with. Too much of the setup is out of their control and requires a lot of joint troubleshooting with the SAN team. Cisco UCS certainly makes this a lot easier, and I wrote an article back in late 2010 that outlines the basics of a boot from SAN scenario in UCS. Check it out if you are not familiar with the process, but I believe there is always room for improvement, and this area is no different. So with Cisco UCS 2.0 we introduced LUNLIST (and we’re not close to being done in this area, by the way). LUNLIST has a cousin command called LUNMAP that has been around a long time, but LUNLIST is the steroid-using one of the two, and when I am troubleshooting, I look solely to LUNLIST. Let’s see why…

As Ryan pointed out, LUNLIST only works prior to the OS HBA driver loading. Once the driver loads, the VIC boot BIOS is no longer in control and will not return valid data. This means that it’s more difficult to use LUNLIST to determine if your configuration is “looser” than it should be by having excess LUNs allowed to the wrong host(s). One reason I like LUNLIST compared with legacy HBA BIOS tools is that I do not have to open a KVM to the server in question and I do not have to catch the server at just the right second during POST. I can just let all the servers attempt to boot and then, from one CLI, quickly and easily look at any number of HBAs in any number of servers. Pretty cool stuff. Another reason I like LUNLIST better is that in a single output, it can tell me if my problem is in the Boot Policy, the zoning config, or the LUN masking. Let’s take a look at some output to show you what I mean.

To get to the command, you need to gain access to the UCS CLI and run the following:

  • connect: connects to the VIC’s management processor
  • attach-fls: attaches to the fabric login service of the adapter

Once you run lunlist, you see output similar to the below. This one is from a server where the end-to-end configuration was all done correctly and the server could boot from SAN or attempt an installation to do so:

Now let’s break it apart and describe what you are seeing:

So, you now may be starting to see the usefulness of this command. But perhaps it will make more sense if you look at the output of a non-working configuration….

  1. Incorrect LUN masking:

    Here is the LUNLIST output from a server that is having an issue with incorrect LUN masking. The host has not been allowed access to the LUN. The same problem would likely result if the host is not set up in the array at all, or if it was created on the array but someone mistyped the host’s WWPN. Zoning is correct because the Nameserver Query Response succeeds (line 11) and returns a WWPN target that matches the WWPN target in the boot policy (line 5). The HBA successfully logged into the fabric and was able to see that a LUN of ID 0x00 is visible (line 9). But when the LUN is queried for additional information, it fails with “access failure” (line 7).

  2. Incorrect Zoning:

    In this example, the host is not zoned correctly. It is either in a zone by itself or not zoned at all. This is an easier one to troubleshoot because the host cannot see a LUN, nor can it see any available WWPN targets. Look at lines 8 and 9 and notice that there is no response returned for either of these queries. Note that the PLOGI is unsuccessful (fc_id in line 5 is 0x000000) because the host was unable to establish a session with the target.

  3. Incorrect SAN Boot Target in the boot policy:

    In this example, you can clearly see that the WWPN configured in the boot policy (line 5) does not match the available target found on the fabric (line 10). In this situation, the PLOGI (line 5) is once again unsuccessful because a session cannot be established between the host and the target.

     

  4. Incorrect LUN ID in the boot policy

    In this example, someone entered the incorrect LUN ID into the boot policy for the server (line 7) and it does not match the LUN ID found on the fabric (line 9).

     

  5. Lastly, I want to show what it looks like when a properly configured host has multiple LUNs presented. I simulated additional targets in the output below, and I wish I could show you actual multiple targets too, but my lab just isn’t that big 🙂 . However, if you would like to donate a larger array, I’d be happy to include it in my future examples 🙂

  6. One last example is what happens when you run LUNLIST and the OS is up and running with the driver for the VIC loaded (which means LUNLIST won’t work). You will get this instead:

That’s pretty much all there is to it. Hopefully this will be useful to you when you need to troubleshoot a UCS blade that’s not booting from SAN. Remember, you need UCS 2.x (or higher) for the command to work and you can only use the command prior to the OS loading. As always, please let me know your feedback and thanks for stopping by.
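The four failure signatures in examples 1 through 4 can be summarized as a small decision function. The Python below is only a sketch, not a Cisco tool: `parse_lunlist` and `diagnose` are hypothetical helper names, the text layout is assumed from a typical lunlist dump for a single vHBA (real output may differ by firmware release), and the rule order is my reading of the examples above.

```python
import re

HEX = r"0x[0-9a-fA-F]+"

# Sample text for one vHBA; the layout is an assumption, not a guaranteed format.
SAMPLE = """\
vnic : 16 lifid: 5
- FLOGI State : flogi est (fc_id 0x5b0080)
- PLOGI Sessions
- WWNN 20:08:00:a0:98:7e:a3:ae WWPN 20:08:00:a0:98:7e:a3:ae fc_id 0x5b0003
- LUN's configured (SCSI Type, Version, Vendor, Serial No.)
LUN ID : 0x0000000000000000 access failure
- REPORT LUNs Query Response
- Nameserver Query Response
- WWPN : 20:08:00:a0:98:7e:a3:ae
"""

def parse_lunlist(text):
    """Split one vHBA's lunlist text into its sections (hypothetical helper)."""
    info = {"plogi": [], "configured": [], "reported": [], "nameserver": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip().lstrip("-\u2013").strip()  # drop leading dash markers
        if line.startswith("PLOGI Sessions"):
            section = "plogi"
        elif line.startswith("LUN") and "configured" in line:
            section = "configured"
        elif line.startswith("REPORT LUNs"):
            section = "reported"
        elif line.startswith("Nameserver"):
            section = "nameserver"
        elif section == "plogi":
            m = re.search(r"WWPN\s+(\S+)\s+fc_id\s+(" + HEX + ")", line)
            if m:  # (boot policy target WWPN, fc_id as int)
                info["plogi"].append((m.group(1).lower(), int(m.group(2), 16)))
        elif section in ("configured", "reported"):
            m = re.search(r"LUN ID\s*:\s*(" + HEX + r")(.*)", line)
            if m and section == "configured":  # (LUN ID, access failed?)
                info["configured"].append((int(m.group(1), 16), "access failure" in m.group(2)))
            elif m:
                info["reported"].append(int(m.group(1), 16))
        elif section == "nameserver":
            m = re.search(r"WWPN\s*:\s*(\S+)", line)
            if m:
                info["nameserver"].append(m.group(1).lower())
    return info

def diagnose(info):
    """Map parsed sections to the likely problem area, following examples 1-4."""
    if not info["plogi"]:
        return "no boot target parsed"
    wwpn, fc_id = info["plogi"][0]
    if fc_id == 0:  # PLOGI never established a session
        if not info["nameserver"]:
            return "zoning"  # no targets visible at all
        if wwpn not in info["nameserver"]:
            return "boot policy target WWPN"  # visible target differs from policy
        return "PLOGI failed"
    reported = set(info["reported"])
    if reported and any(lun not in reported for lun, _ in info["configured"]):
        return "boot policy LUN ID"
    if any(failed for _, failed in info["configured"]):
        return "LUN masking"
    return "ok"

print(diagnose(parse_lunlist(SAMPLE)))  # LUN masking
```

The checks run from the fabric inward (PLOGI first, then LUN ID, then LUN access), which mirrors the order I would troubleshoot in: there is no point looking at masking if the host never established a session with the target.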

-Jeff

Article Update:

If you have more than 2 vHBAs, you may need to know which are which. There is an additional command in the adapter shell you can run called “vnicpci” which will list all interfaces along with their associated server interfaces.

Update #2

If you are using rack servers (UCS C-series), the syntax changes slightly to target the specific server (because there is no chassis). To target rack server 5, for example, you would type:

“connect adapter 0/5/1” (chassis 0, server 5, adapter 1). Technically speaking, you could just use “connect adapter 5/1” since rack servers do not require a chassis #, but to keep the syntax as close as possible to blades, I add the “0” in place of the chassis.
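
If you check a lot of servers, it can help to generate the lines you need to type. This is a minimal Python sketch: `lunlist_commands` is a hypothetical helper, it only builds strings (it does not connect to anything), and the command order is assumed from the steps described in this article.

```python
def lunlist_commands(server, adapter=1, chassis=0):
    """Return the CLI lines to type, in order, to reach lunlist.

    chassis=0 follows the rack-server convention described above;
    pass the real chassis number for a blade.
    """
    return [
        f"connect adapter {chassis}/{server}/{adapter}",  # target the adapter
        "connect",      # connect to the VIC's management processor
        "attach-fls",   # attach to the fabric login service
        "lunlist",      # dump what each vHBA can see
    ]

# Rack server 5, adapter 1
for line in lunlist_commands(5):
    print(line)
```

For a blade, something like `lunlist_commands(3, chassis=1)` would produce `connect adapter 1/3/1` as the first line.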

9 thoughts on “UCS Boot-from-SAN Troubleshooting with the Cisco VIC (Part 2)”

  1. Hi Jeff..
    It was a nice article..
    I have a question about adding vHBAs to the service profile in the case of SAN boot. Is it really necessary to separate the SAN boot LUN vHBA from the clustered shared VMFS LUN vHBAs? If yes, can you please explain in brief? What will happen if I configure both the SAN boot LUN and the shared VMFS LUNs to pass through the same vHBAs?

    Thanks in advance
    Anil

    • I’ll start by saying that this is a question better suited for VMware than me, but I would separate them myself. It adds very little in terms of complexity, but could add quite a bit in terms of performance. Assume you had no SAN at all. Would you boot ESX and have it access its VMFS “data” LUN on the same SCSI controller, or would you prefer to use different controllers if you could?
      My only caveat here is that I believe the ESXi kernel loads totally into RAM once booted, so it would not need to go back and access the boot LUN. However, that’s not something I know with 100% certainty so you should get some opinions from VMware as well.
      I would also suggest not booting from FC at all for ESXi and instead booting from PXE.

  2. Very good article for troubleshooting. Thanks for that.
    I have one problem I can’t troubleshoot: SAN boot doesn’t work.
    I configured a 6324 FI along with NetApp, and connected the FI port directly to the NetApp storage (end-to-end FCoE, not FC).
    When I ran that command, I got this output.
    vnic : 16 lifid: 5
    – FLOGI State : flogi est (fc_id 0x5b0080)
    – PLOGI Sessions
    – WWNN 20:08:00:a0:98:7e:a3:ae WWPN 20:08:00:a0:98:7e:a3:ae fc_id 0x5b0003
    – LUN’s configured (SCSI Type, Version, Vendor, Serial No.)
    LUN ID : 0x0000000000000000 access failure
    – REPORT LUNs Query Response
    – Nameserver Query Response
    – WWPN : 20:08:00:a0:98:7e:a3:ae

    Based on your explanation, I checked the zoning. It matches the flogi database, so it’s correct.
    The boot target is OK, as follows:
    vHBA2, Target Primary: 20:08:00:a0:98:7e:a3:ae

    Test4-UCS-B(nxos)# show flogi database
    ——————————————————————————–
    INTERFACE VSAN FCID PORT NAME NODE NAME
    ——————————————————————————–
    vfc686 102 0x5b0000 50:0a:09:81:80:60:c0:10 50:0a:09:80:80:60:c0:10
    vfc686 102 0x5b0003 20:08:00:a0:98:7e:a3:ae 20:09:00:a0:98:7e:a3:ae
    vfc731 102 0x5b0080 20:00:00:25:b5:00:0b:0e 20:00:00:25:b5:00:00:1e

    Test-UCS-B(nxos)# show zoneset
    zoneset name ucs-Test4-UCS-vsan-102-zoneset vsan 102
    zone name ucs_Test4-UCS_B_1_Blade1SP3_vHBA2 vsan 102
    pwwn 20:00:00:25:b5:00:0b:0e
    pwwn 20:08:00:a0:98:7e:a3:ae

    zone name ucs_Test4-UCS_B_2_Blade1SP3_vHBA2 vsan 102
    pwwn 20:00:00:25:b5:00:0b:0e
    pwwn 20:08:00:a0:98:7e:a3:ae

    If you have time, could you please help me figure it out? I can’t find a way to fix it.
    Looking forward to your feedback soon.
    Thanks & Best Regards,

    • I am also facing the same problem. Any idea what the reason could be? Please share the fix if you resolved it, ASAP.

  3. Really nice article, awesome explanation and very useful for troubleshooting :).

    @ Thant Zin Soe – it seems you have direct integration of the NetApp storage system, with the FI in “SAN switch mode”. Please verify the zone creation for the initiator and target on the FI.
