UCS Chassis Discovery Policy

So, today’s article will be a short one, but a useful one nonetheless. Here’s the scenario: you have 3 different sets of workloads on your blades that require 3 different levels of bandwidth. Because of this, you put them in different chassis to accommodate. Some of these chassis require 20G, some require 40G, and some require 80G. Just because they have varying bandwidth requirements should not mean that you cannot move a workload from, say, the 20G chassis to the 80G chassis if that happens to be where your excess server capacity lies at the moment. UCS is totally flexible with any bandwidth requirement you have (you might call it a ‘FlexNetwork’, but I won’t :)). Unlike competing blade solutions, UCS can deliver this varying bandwidth while keeping all the servers under a single UI in a single domain of management. If you are a user of HP Virtual Connect Enterprise Manager, this would be analogous to having all of your blades in a single “Domain Group” while still having varying bandwidth requirements. But why stop at blades? Why not manage “server” objects generically and allow rack and blade servers to be pooled together? We’ve got ya covered there too.

UCS has a feature called the “Chassis Discovery Policy” that can be used to handle the varying bandwidth requirements. Although the initials are CDP, this is not your father’s CDP (Cisco Discovery Protocol) :). The purpose of the policy is to tell UCS Manager the MINIMUM number of IO Module (IOM) to Fabric Interconnect (FI) links that must be present in order for the chassis to be properly discovered. Go re-read that last sentence – it’s important. This policy can be found on the Global Policies screen when selecting Equipment on the Tree-View of the Equipment tab. While it may seem straightforward, it has some caveats, so let me answer the most common questions first.

 

Q1. From what I can tell, there is only one policy for all my chassis. Does this mean that all my chassis have to have the same number of links?

A1. No. That would make my opening paragraph a lie and I don’t (knowingly) lie on my own blog! :) Remember, this is a discovery policy only, not a link-management policy. If the chassis is already discovered, this policy has no effect. If you are trying to reduce the number of IOM->FI cables, this setting has no effect.

 

Q2. Why is there no 3-link setting?

A2. We don’t support 3 links as an initial setting because there are 8 blade slots in the chassis and the connections for those 8 blades do not divide evenly across 3 ports. (In a cable-failure scenario on an existing, properly configured chassis, we do support 3 links.)
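
The arithmetic behind A2 can be sketched in a few lines of Python. This is only an illustration, not UCS Manager's pinning logic: the round-robin slot-to-link mapping and the function name `slots_per_link` are assumptions made for the example.

```python
from collections import Counter
from typing import List

BLADE_SLOTS = 8  # blade slots in one chassis, per A2 above


def slots_per_link(links: int) -> List[int]:
    """Count how many blade slots land on each IOM-to-FI link.

    Assumes a simple round-robin mapping (slot 1 -> link 1, slot 2 -> link 2, ...).
    That mapping is an illustrative assumption, not a documented pinning table.
    """
    counts = Counter((slot - 1) % links + 1 for slot in range(1, BLADE_SLOTS + 1))
    return [counts[link] for link in range(1, links + 1)]


for links in (1, 2, 3, 4):
    shares = slots_per_link(links)
    label = "even" if len(set(shares)) == 1 else "uneven"
    print(f"{links}-link: {shares} ({label})")
```

With 1, 2, or 4 links every link carries the same number of slots; with 3 links one link ends up with fewer slots than the other two, which is the imbalance A2 refers to.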

 

Q3. Should I just set the policy to match the number of chassis links I have cabled?

A3. Not necessarily. Read on…

 

Q4. Do I need to change the policy in order to add or remove IOM-FI cables?

A4. No. All you need to do in that situation is acknowledge the chassis.

 

The policy values of 1-link, 2-link, and 4-link refer to the number of links between a single IO module (IOM) and its respective Fabric Interconnect (FI). In other words, it’s the number of links for a single side of a chassis, not the total number of links from an entire chassis.
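
To connect the per-IOM link count back to the 20G/40G/80G chassis in the opening scenario, here is a minimal sketch. It assumes each IOM-to-FI link runs at 10 Gb and that a chassis has two IOMs (one per fabric); the constant and function names are made up for the example.

```python
LINK_SPEED_GBPS = 10   # assumption: each IOM-to-FI link is a 10 Gb link
IOMS_PER_CHASSIS = 2   # assumption: one IO module per fabric (A and B)


def chassis_bandwidth_gbps(links_per_iom: int) -> int:
    """Aggregate chassis bandwidth across both fabrics for a given per-IOM link count."""
    return links_per_iom * IOMS_PER_CHASSIS * LINK_SPEED_GBPS


for links in (1, 2, 4):
    print(f"{links}-link per IOM -> {chassis_bandwidth_gbps(links)}G per chassis")
```

Under those assumptions, 1, 2, and 4 links per IOM work out to 20G, 40G, and 80G per chassis.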

Now, because this is my blog and I don’t technically speak for Cisco here, I’m free to give my own opinion, so I will. What I recommend to my customers is to leave this policy at its default value of 1-link and never change it. With this setting, if a chassis with more than one link is discovered, UCS Manager will find it and add it to inventory (along with the hardware it contains). However, the chassis will be in a non-optimal state because not all chassis links are active yet. In turn, UCSM will return some errors such as “FEX not configured”, “fabric-unsupported-conn”, or “unsupported connectivity”. Like I said, the chassis and its blades are functional at this point, but only using a single link from each IO module. To make the additional links functional, you need to right-click the chassis and choose “Acknowledge Chassis”. Because you are still staging the chassis, acknowledge the warning to continue the process. If the chassis were in production already, you should get a maintenance window to do this. Once complete, all links that are cabled become active automatically and begin carrying traffic.

That’s all there is to it. Let me know if you have any questions. Thanks for reading!

-Jeff

P.S. The current User Guide speaks to the fact that an incorrect setting of the Discovery Policy could lead to a chassis not getting discovered at all. That was the case at one time (and may be the case again some day), but the current behavior in 1.3.1 – 1.4.3 is that a chassis is always discovered, regardless of the policy setting. However, the newly discovered chassis will warn the user that the policy doesn’t match the actual topology. If the behavior changes, I’ll try to remember to come back and update this blog. If not, the documentation will likely get updated. If this really bugs you, read my previous article on how you can contact the UCS Docs team and let them know.

Update (5-8-12):  Starting with 2.0 versions of UCSM, the chassis will NOT be discovered if the policy is set for a number of links greater than the physical topology contains.
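
The P.S. and the update above describe two different behaviors, which can be summarized in a small sketch. The function name and the use of a bare major version number are simplifications made for this example; this is a model of the behavior as described here, not UCS Manager code.

```python
def chassis_discovered(policy_links: int, cabled_links: int, ucsm_major: int) -> bool:
    """Model whether a chassis gets discovered, per the P.S. and the update.

    policy_links: links per IOM required by the Chassis Discovery Policy
    cabled_links: links per IOM actually cabled
    ucsm_major:   1 for the 1.3.1 - 1.4.3 behavior, 2 for 2.0 and later
    """
    if ucsm_major >= 2:
        # 2.0+: not discovered if the policy asks for more links than exist
        return cabled_links >= policy_links
    # 1.3.1 - 1.4.3: always discovered, with a warning on a mismatch
    return True


print(chassis_discovered(policy_links=4, cabled_links=2, ucsm_major=1))
print(chassis_discovered(policy_links=4, cabled_links=2, ucsm_major=2))
print(chassis_discovered(policy_links=1, cabled_links=4, ucsm_major=2))
```

Note that a policy of 1-link is satisfied by any cabled chassis in both models, which fits the recommendation to leave the policy at its default.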

 

28 thoughts on “UCS Chassis Discovery Policy”

  1. Jeff:

    Cool article – answered some of the questions we have regarding link policy and connections. The remaining questions are: What happens if a chassis used a 4-link policy to get in and that chassis were to lose two links, would that pose a problem requiring re-acknowledgement to clear said fault? Is re-acknowledgement disruptive if servers are already in that chassis with running service profiles? How can we go about swapping chassis and reclaiming a chassis number?

    • Hey Jean – glad you liked the article. Answers to your questions:
      If you lose x # of links during production, the server NICs/HBAs pinned to those links will fail. The redundant path will take over at that point. If the links are repaired, UCS will automatically use them again. If the customer chooses to re-acknowledge with some links down, UCS will re-pin the failed NICs/HBAs in each affected server to the remaining good links. UCS will warn you before a re-ack that the action is disruptive to the servers in just that chassis.
      Starting in 1.4.2b, chassis renumbering is supported. So you can decommission any chassis and when you recommission it, UCS will ask you what chassis number it should use (if it is available).

      Thanks for stopping by Jean.

  2. Good one – perfect understanding of the chassis discovery policy and number of links between IOM & FI. I would add just 2c here: care must be taken about “when” you hit the “acknowledge chassis” button. When you “acknowledge chassis”, what you are telling UCSM is that “I acknowledge current connectivity of the chassis”. Every “Fabric Port” of the IOM (the one connected to FI) has two states: ‘Discovery’ and ‘Acknowledgement’. You can see that under “Status Details” of a given fabric port (under ‘Equipment’ –> ‘Chassis’ –> ‘IO Module’ in the GUI). Discovery is an operational state – it can be ‘absent’ or ‘present’. Ack tells whether the link is used by the system or not.

    When admin hits “acknowledge chassis”, UCSM takes the snapshot of Discovery state – and if link is ‘Present’, then it is marked as ‘Acknowledge’ (and if not present, then un-ack) — and all the ack’ed ports are used to pass data.

    So, before hitting ‘acknowledge chassis’, it is advisable to make sure that the links are all in ‘present’ state.
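
The snapshot behavior described in this comment can be modeled roughly as follows. The `FabricPort` class and `acknowledge_chassis` function are hypothetical names for this sketch, and the state strings simply mirror the ‘present’/‘absent’ wording above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FabricPort:
    port_id: int
    discovery: str       # operational state: "present" or "absent"
    acked: bool = False  # whether the link is used to pass data


def acknowledge_chassis(ports: List[FabricPort]) -> List[int]:
    """Snapshot the discovery state: present links are acked, absent ones un-acked."""
    for port in ports:
        port.acked = port.discovery == "present"
    return [port.port_id for port in ports if port.acked]


ports = [
    FabricPort(1, "present", acked=True),
    FabricPort(2, "absent", acked=True),  # link went down after the last ack
    FabricPort(3, "present"),             # newly cabled, not yet acked
]
print(acknowledge_chassis(ports))  # ports that carry data after the ack
```

In this model, acknowledging while port 2 is absent drops it from the set of links in use, which is why checking that all links are ‘present’ first matters.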

  3. A follow-up. The 6248 and 2204XP/2208XP IOM now support port-channeling of IOM-to-FI links. There is a new link grouping preference for this.

    The port-channel uses an 8-bit hash of the L2/L3 header to select an outgoing port number, which means it’s fairly efficient for any number of links, from 1 to 8 links. I.e., you’re not stuck to a binary number of links like 1, 2, 4, or 8. The “show port-channel load-balance forwarding-path” command can be used to identify which path is taken for a given L2/L3 header.
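
As a rough illustration of why an 8-bit hash spreads reasonably well over any link count, the sketch below splits the 256 possible hash values across N links. The modulo mapping is an assumption chosen for the example; the actual hash-to-port mapping used by the hardware is not described here.

```python
from typing import List

HASH_BUCKETS = 256  # an 8-bit hash has 256 possible values


def buckets_per_link(links: int) -> List[int]:
    """Count how many of the 256 hash values map to each link (modulo mapping assumed)."""
    counts = [0] * links
    for hash_value in range(HASH_BUCKETS):
        counts[hash_value % links] += 1
    return counts


for links in (1, 2, 3, 5, 8):
    print(f"{links} link(s): {buckets_per_link(links)}")
```

Even with 3 or 5 links, the per-link share of hash values differs by at most one bucket out of 256, so a non-binary link count costs very little balance in this model.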

    A benefit of port-channels is that links can be added/removed without needing to re-ACK the chassis. In fact, in my testing the “action” number of links did not seem to affect port-channeling behavior one bit. So Jeff’s recommendation to keep the “action” to “1-link” is probably the right one.

  4. While I would normally agree with anyone who says my recommendation is the right one, there are some caveats on this topic.

    From a support perspective, you are still required to use 1, 2, 4, or 8 links. Here is what the UCS help file says:
    *****************
    Action field
    Specifies the minimum threshold for the number of links between the
    chassis and the fabric interconnect. This can be one of the following:

    1-link
    2-link
    4-link
    8-link
    Platform Max
    ***************************************

    I personally tried to get this changed, because as you point out, you are not restricted to these choices when using port channels. However, I was overruled because there are too many corner-cases of how customers might choose to swap back and forth between PC and pinning and that would cause issues.

    Further, I heard about an issue on 2.x that required the user to select the exact number of links that were in use for proper discovery. I do not agree with this behavior and have not tried it myself to verify, but that’s full-disclosure.

    Thanks for reading Craig.

  5. Hi all,
    I reached this article searching for “acknowledge chassis”. Here is the issue I have, yesterday I removed the expansion module from FI A because I had two FC ports down. Cisco sent the part and I changed it.

    The thing is that I have some VIF down, waiting for flogi errors in the GUI, which are false positives. I logged in to FI via SSH and did show flogi database and every WWN is there. I guess I am affected by defect CSCtn89396 http://tools.cisco.com/Support/BugToolKit/search/getBugDetails.do?method=fetchBugDetails&bugId=CSCtn89396

    The workaround:
    Option #1 Initiate fail over again
    Option #2 Re-acknowledge the blade

    My question is, if I acknowledge the chassis, will I have some service disruption?

    Thank you guys
    Javi

    • Acknowledging the chassis is disruptive (today). The bug is totally cosmetic – I would ignore it. I “think” this is fixed in 2.01m and above.

      • Ok, I will ignore the flogi errors.

        one more thing, I have just seen in my SAN uplink switch that I have 14 vHBAs (all of them actually, the rest of the ports are just waiting) through the same FC port…

        how can I redistribute FC traffic without disruption?

        Thanks for your help.

        • UCS will automatically distribute traffic across SAN uplinks within each FI a) when the servers log in and b) as uplink ports are added or removed.

    • If you re-ack the chassis, you will lose connectivity to the chassis for a second or two. A maint window is recommended. This is different than re-ack for a blade though. A single blade re-ack is disruptive to just that blade and re-applies the entire profile.

  6. Hi, I want to know: if I decommission the chassis and recommission it, would that chassis gain back all its settings, or is it disruptive?

    Thanks in Advance

    • It is most definitely disruptive. Decomm is meant for troubleshooting purposes and for removing retired hardware from the database. If you were to decomm/recomm, the chassis would come back exactly as it was and all profiles would re-associate automatically.

  7. Hi Jeff,

    Best practices dictate that I need to change my chassis policy from 4-link to port-channel. Can I do this while the system is in production or is this a disruptive process?

    Cheers

    • I need to update this article to include the port-channel options that didn’t exist at that time. Port Channel is definitely what I recommend, but that is independent of the number of links. Changing the discovery policy is not disruptive because it only affects a chassis that you have not discovered yet. However, what you are most likely wanting to do is change a chassis that is already discovered. In that case, you can do this two different ways:
      1) Click on the chassis in question and change the Connectivity Policy (this would need to be done on each chassis)
      2) Change the global Policy to Port Channel. This will affect each chassis, but not immediately. By default, each chassis uses the global policy for connectivity. If you change the global policy, any chassis that you re-ack will pick up the new settings. This is disruptive. There is no way to go from discrete to port channel on an already discovered chassis without disruption.

      You can tell what any chassis is using at any time by clicking on a chassis and checking the Connectivity Policy tab. Look at the two fabrics listed (A and B) and see the Ctype line. If it’s discrete, it will say “Mux Fabric”. If it’s Port Channel, it will say “Mux FabricPC”.

      Hope that helps.

  8. Hi Jeff,

    Thanks for the info. I was hoping I could do it on the fly but I guess I always knew in the back of my mind that would not be possible. I will have to schedule downtime over a weekend then. The burdens of an IT person 🙂

  9. Pingback: Cisco UCS Port-Channeling | Keeping It Classless

  10. Jeff,
    If I configure a server port and it gets automatically added into the port channel, and if I want to remove it down the road, will I need to re-ack the chassis? Or does the port channel configuration allow you to add/remove on the fly with no disruption? Thanks!

    • Port channels are great in that they do not require re-ack to add/remove cables. I always set my chassis links up for PC, even if they have just 1 link, to give me this flexibility.

  11. Hi Jeff,

    Just a follow-up question regarding my changing to port channel question above. If I go with option 1) Click on the chassis in question and change the Connectivity Policy (this would need to be done on each chassis), are the following assumptions true:

    1) I can do this on a production system one fabric at a time without interruption.
    2) If I leave the global policy to none and change only the policies on the chassis, I do not need to re-ack the chassis after implementing change 1)

    I look forward to your response.

    • I’m pretty certain a re-ack is going to be needed and this will cause a 2-3 second disconnect. I’d plan for downtime.

  12. Jeff,

    Great blog with lots of information. I have a query on the last part of your answer to question 1.

    A1. No. That would make my opening paragraph a lie and I don’t (knowingly) lie on my own blog! :) Remember, this is a discovery policy only, not a link-management policy. If the chassis is already discovered, this policy has no effect.

    “If you are trying to reduce the number of IOM->FI cables, this setting has no effect”

    –> UCSM 2.x
    discovery policy 8x-link
    Port-channeling enabled
    We need to reduce the number of server ports from 8 to 4. Does removing four cables on either IOM require a re-ack of the chassis so that blades pinned to those cables are re-pinned onto other ports? Or can we simply remove those cables without disruption, meaning the blades will automatically be re-pinned since port-channeling is enabled?

    I see in Q4 that adding/removing cables requires chassis acknowledgement. The port-channeling is the confusing part here.

    Thanks in advance..:)

    • I wrote this prior to the portchannel (PC) feature existing between FI and IOM. If you have functioning PC already, you can add and remove cables at your pleasure with no need to re-ack.
      I don’t do a lot of deployments these days but when I did I always encouraged people to use PC ALWAYS – even if it’s just a single cable. That way you don’t have to worry about the re-ack ever again.
      Note that switching to PC from pinning in the very beginning does require re-ack. Good luck and thanks for stopping by.

  13. Hi Jeff,
    Great blog. We had an issue this morning where we upgraded from 2.2(2c) to 2.2(3d). Post upgrade, we noticed that on 1 of our 7 chassis, of the 4 uplinks on either side, only the 1st per IOM is active and the other 3 are in an uninitialized state. I take it that to resolve this we would need to re-ack the chassis. These are production ESX hosts and we can do it in a maintenance window, but would we be required to power off the VMs, or will the disruption be short enough to be seen as a temporary drop in traffic?

    Second question: the global policy is set to 1-link and none. For chassis 1-4 we enabled port channel at the chassis level; for chassis 5-7 we overlooked this, and the Admin State is set to global, so they are all Mux Fabric (all chassis have 4 uplinks per side). If we change the global policy to port channel, I understand that chassis 5-7 will require a re-ack, but will chassis 1-4 as well, or will these remain the same because they are not configured to global?
    Thanks in advance.

    • With ESXi, you likely can ACK without disruption, but I would schedule a window. Switching to port channels means you won’t ever have to re-ACK in the future – which is great. To make the switch does require a re-ACK, however, it will only happen on the chassis that rely on the global policy (assuming you only change the global policy to fix 5-7).
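
The answer above boils down to a simple rule: only chassis that inherit the global policy are affected when it changes. A minimal sketch of that rule, using made-up state labels ("global", "port-channel") and a hypothetical function name:

```python
from typing import Dict, List


def chassis_needing_reack(admin_state: Dict[int, str]) -> List[int]:
    """Return chassis IDs that pick up a changed global policy and so need a re-ack.

    admin_state maps chassis ID -> per-chassis connectivity setting. The labels
    "global" and "port-channel" are illustrative, not literal UCSM values.
    """
    return sorted(cid for cid, state in admin_state.items() if state == "global")


# Scenario from the question: chassis 1-4 set explicitly, 5-7 left on global.
states = {1: "port-channel", 2: "port-channel", 3: "port-channel", 4: "port-channel",
          5: "global", 6: "global", 7: "global"}
print(chassis_needing_reack(states))
```

Chassis with an explicit per-chassis setting are left out of the result, matching the statement that only the chassis relying on the global policy need the re-ack.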
