
Topics - PureSystemsTech

Pages: [1] 2
1) Nutanix / NOS Upgrade: did not grant shutdown token
« on: February 16, 2018, 09:55:01 AM »
All,

This post is in reference to an error you may see in the log file "Genesis.out" under /home/nutanix/data/logs when an upgrade hangs and does not continue.

This happened to me on my CE cluster, which is a single-node cluster, where a shutdown token was not being granted to allow the upgrade to finish.

When you run upgrade_status on the CVM CLI you will see the following for a while, until you realize "hey, this thing is hung":

Quote
nutanix@NTNX-d02003b1-A-CVM:192.168.1.41:~$ upgrade_status
2018-02-16 06:42:47 INFO zookeeper_session.py:110 upgrade_status is attempting to connect to Zookeeper
2018-02-16 06:42:47 INFO upgrade_status:38 Target release version: el7.3-release-ce-2018.01.31-stable-c3b9964290bf2f28799481fed5cf32f92ab3dadc
2018-02-16 06:42:47 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2018-02-16 06:42:47 INFO upgrade_status:96 SVM 192.168.1.41 still needs to be upgraded. Installed release version: el7.3-release-ce-2017.11.30-stable-ab2ac46f51d4745d43126c9ad1871b7314400bab, node is currently upgrading

If you see this message in Genesis.out

Quote
Master 192.168.1.41 did not grant shutdown token to my ip 192.168.1.41, trying again in 30 seconds

Try to run the following command to grant a token.

Quote
echo -n '{"request_reason": "nos_upgrade", "request_time": 1496860678.099324, "requester_ip": "192.168.1.50"}' | zkwrite /appliance/logical/genesis/node_shutdown_token
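The request_time and requester_ip values in the command above are fixed values. As a rough sketch only, assuming you would rather fill in the current time and your own CVM's IP (I have not confirmed that Genesis requires either), the JSON could be generated like this. The function name and defaults are made up for illustration; only the three field names come from the command above.

```python
import json
import sys
import time


def build_shutdown_token_payload(requester_ip, reason="nos_upgrade", request_time=None):
    """Return the JSON text for the shutdown-token znode.

    Field names are copied from the command in this post; the function
    name and defaults are illustrative, not part of any Nutanix tool.
    """
    if request_time is None:
        # Assumption: the current epoch time in seconds is acceptable here.
        request_time = time.time()
    payload = {
        "request_reason": reason,
        "request_time": request_time,
        "requester_ip": requester_ip,
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # 192.168.1.41 is the single CVM from this post; substitute your own.
    # write() is used instead of print() so no trailing newline is added,
    # mirroring the "echo -n" in the original command.
    sys.stdout.write(build_shutdown_token_payload("192.168.1.41"))
```

The output is meant to be piped into the same zkwrite command shown above in place of the echo.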

Then tail the Genesis.out logs for the following messages.

Quote
Failed to read zknode /appliance/logical/genesis/node_shutdown_priority_list with error 'no node'
2018-02-16 06:45:15 INFO cluster_manager.py:4224 Successfully granted token to 192.168.1.41 reason nos_upgrade
2018-02-16 06:45:15 INFO node_manager.py:2266 Finishing upgrade to version el7.3-release-ce-2018.01.31-stable-c3b9964290bf2f28799481fed5cf32f92ab3dadc, you can view progress at /home/nutanix/data/logs/finish.out
2018-02-16 06:45:17 INFO ha_service.py:959 Checking if any nodes are shutting down due to upgrade
2018-02-16 06:45:17 INFO ha_service.py:977 Node 192.168.1.41 is going down

Now when you run upgrade_status on the CVM cli you should see that your SVM (CVM) is up to date.

Quote
nutanix@NTNX-d02003b1-A-CVM:192.168.1.41:~$ upgrade_status
2018-02-16 14:52:44 INFO zookeeper_session.py:110 upgrade_status is attempting to connect to Zookeeper
2018-02-16 14:52:44 INFO upgrade_status:38 Target release version: el7.3-release-ce-2018.01.31-stable-c3b9964290bf2f28799481fed5cf32f92ab3dadc
2018-02-16 14:52:44 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2018-02-16 14:52:44 INFO upgrade_status:96 SVM 192.168.1.41 is up to date

I hope this helps! :)

2) VMware Support / Packet Drops on DOWN VMNIC
« on: September 16, 2015, 08:46:10 AM »
All,

I have no solution as of yet, but I am documenting this issue on MPS portfolio.

The issue is simple (or so we thought). We are seeing packet drops on vmnics that are administratively down in VMware.

The Infrastructure
Two HS23 (both exhibiting same issues)
Running ESXi 5.5 Update 2
Emulex 4-port One Connect Ethernet Adapter (driver version - 10.2.298.5)


The Issue
As you can see below, when I pull statistics from vmnic0 it shows a large number of packets received and a large number of packet drops. However, as mentioned above, this vmnic is administratively down.

esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 31547633303
   Packets sent: 0
   Bytes received: 16187874324373
   Bytes sent: 0
   Receive packets dropped: 1482863150
   Transmit packets dropped: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0


As you can see, I have no vmnic0/1 in any vSwitch.

/usr/sbin/esxcfg-vswitch -l
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         5632        9           128               1500    vmnic2,vmnic3

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  Production            0        2           vmnic2,vmnic3
  VMotion               0        1           vmnic2,vmnic3
  Management Network    0        1           vmnic2,vmnic3

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitchUSB0      5632        4           128               1500    vusb0

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  IMM_Network0          0        1           vusb0


If I list the statistics of vmnic2 right afterwards, I see that the packets received count on this vmnic IS ALMOST THE SAME AS vmnic0!
esxcli network nic stats get -n vmnic2
NIC statistics for vmnic2
   Packets received: 31547645008
   Packets sent: 23805666719
   Bytes received: 16187878339822
   Bytes sent: 39966911268856
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
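To put numbers on "almost the same", here is a small sketch that parses the esxcli output and compares the two NICs. The parsing helper is my own and assumes the "Name: value" layout shown in this post; the counters are copied from the two outputs above (abbreviated).

```python
def parse_nic_stats(text):
    """Parse 'esxcli network nic stats get' style output into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the "NIC statistics for vmnicX" header and any non-numeric lines.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


# Counters copied from the two outputs in this post (abbreviated).
VMNIC0 = """NIC statistics for vmnic0
   Packets received: 31547633303
   Packets sent: 0
   Receive packets dropped: 1482863150
"""
VMNIC2 = """NIC statistics for vmnic2
   Packets received: 31547645008
   Packets sent: 23805666719
   Receive packets dropped: 0
"""

nic0 = parse_nic_stats(VMNIC0)
nic2 = parse_nic_stats(VMNIC2)

# How far apart are the receive counters of the down NIC and the active NIC?
rx_gap = nic2["Packets received"] - nic0["Packets received"]
# What fraction of the packets vmnic0 reports as received were dropped?
drop_ratio = nic0["Receive packets dropped"] / float(nic0["Packets received"])

print("rx gap: %d packets" % rx_gap)
print("vmnic0 drop ratio: %.2f%%" % (drop_ratio * 100))
```

With these counters the gap is 11705 packets out of roughly 31.5 billion, and vmnic0 reports about 4.7% of its received packets as dropped. The two commands were run moments apart, so a small gap is expected even if the counters are mirroring each other.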


When I run the command below and use Wireshark to review the .pcap file I see (as confirmed by VMware technical support) that the file shows packets that mirror what vmnic2 is receiving. It is almost as if the vmnic ports are replicating their traffic across ports.

pktcap-uw --uplink vmnic0 -o /vmfs/volumes/ESXFshare-01/vmnic0_pktcap.pcap


Currently this is unresolved. VMware has asked that I open a hardware ticket for my Emulex card to see if there is a hardware defect causing this "packet replication" issue. I will update when/if a resolution is reached.

Thanks!

3) VMware Support / VMware Performance Monitoring in the CLOUD
« on: July 03, 2015, 12:18:14 PM »
Identify issues with your VMware environments before they happen. Identify configuration issues, locate exhausted or underutilized resources, and isolate problems. Access configuration information, including operating system (OS) utilization levels, devices, processor/memory information, and virtual features. At-a-Glance Cluster level, Host level, and virtual machine (VM) level statistics allow you to customize your graphical user interface (GUI) dashboard for each level within your VMware infrastructure. Configuration and Capacity Data allow you to view configurations, objects, and capacity from your dashboard. You can drill down from the Cluster level, through the Host level, and into the VM level statistics.

Configuration and Charts

  • Full Usage Details for CPU, Memory, Disk and Network
  • Object types / Charts for:
      • vCenters (Data Center)
      • Clusters
      • Hosts
      • VMs
  • Configuration and Capacity Data for all object types
  • Real world problem solving assistance charts, for example:
      • Memory Ballooning
      • Memory Swap In and Out Rate
      • CPU Ready Time Percentage
      • I/O Latency
      • Network Packet Drops
  • Stacked charts showing VM metrics across the entire environment
  • Comparison of metrics between any two virtual machines in your environment


http://galileosuite.com/monitoring/server/vmware.php

4) VMware Support / Windows 2012 VM Loses WAN Access - ESXi 5.5
« on: May 08, 2015, 07:57:04 AM »
All,

I had a very strange problem after running Windows updates this morning on all 3 of our Domain Controllers running Windows 2012 Standard on ESXi 5.5 (build ID 2403361). After reading numerous articles telling me VMXNET3 adapters are the way to go I tried what was quite possibly the dumbest fix ever.

We already had VMXNET3 because this has happened before on e1000 adapters so we made the change a while back.

To fix the problem I did the following:

1. Open the vSphere Client and find the VM
2. Right-click and "Edit Settings"
3. Remove the network adapter then add the same adapter back using the add hardware wizard choosing VMXNET3
4. Click "OK" to apply the change
5. After a quick blip in your Remote Desktop session you should see your internet connection back to normal.

And the Internet Returns!  ;)

5) VMware Support / VMFS Datastore Not Persistently Mounting Snapshots
« on: April 08, 2015, 03:01:18 PM »
All,

Recently we added a couple of hosts to our VMware environment and noticed that 10 out of 21 datastores would not mount upon reboot.

VMware gave us the following solution.

run esxcli storage vmfs snapshot mount -l <datastore_name>
The -l in this command is short for --volume-label and names the snapshot volume to mount. According to VMware's documentation the mount persists across reboots unless -n (--no-persist) is also given.

This was not the case for us: after a reboot these 10 datastores were still not seen. VMware had no solution other than taking downtime to resignature the datastores, which ended up being the only viable option after almost 2 months of waiting for VMware to fix their persistent-mount command.

THE SOLUTION WAS AS FOLLOWS:

The final solution was to resignature the VMFS volumes using the following command.

esxcli storage vmfs snapshot resignature -l=<datastore_name>

PLEASE NOTE - this command needs to be run on a datastore with NO RUNNING VMS!


   1. Shutdown VMs running on datastore
   2. On problem host run esxcli storage vmfs snapshot list ....to verify it is seen as a snapshot volume
Example:
49708f15-345694dc-877a-00215e521038
   Volume Name: ESXSshare-06
   VMFS UUID: 49708f15-345694dc-877a-00215e521038
   Can mount: true
   Reason for un-mountability:
   Can resignature: true
   Reason for non-resignaturability:
   Unresolved Extent Count: 1

   3. Remove any VMs from inventory
   4. Unmount datastore from ALL hosts FIRST before performing maintenance
   5. Perform resignature

      1. Example -  esxcli storage vmfs snapshot resignature -l=ESXSshare-06

   6. In vSphere datastore view rename the datastore back to its original name
   7. Rescan for VMFS volumes on ALL hosts
   8. Validate datastore is seen from ALL hosts tab in the datastore view

Now this is where it gets a little tricky. The datastore will now mount automatically, but every single VM that was removed from inventory still has pointers to the disk's old signature in its .VMX file. To fix this you need to browse the datastore and download the .VMX file locally. Open it with Notepad or Notepad++ (preferred utility) and change the path to the vmdk files.

Example:

In the .VMX file find the following line(s)......
sched.swap.derivedName = "/vmfs/volumes/49708cc0-68198024-4582-00215e521038/ATSVMDC/ATSVMDC-16886a2c.vswp"
Change the UUID in the file path above to the NEW UUID assigned by the resignature. This can be found a few ways, but I like to SSH to the host, cd to the /vmfs/volumes directory, then ls -l to list the volumes and their mounts. The mounts are shown as DATASTORE_NAME -> UUID.

Example:
lrwxr-xr-x    1 root     root            35 Apr  8 18:58 ESXFshare-01 -> 551c95f2-989d58c6-d057-40f2e987aa40
lrwxr-xr-x    1 root     root            35 Apr  8 18:58 ESXFshare-02 -> 551c94a1-1ade708e-24ce-40f2e987aa40
lrwxr-xr-x    1 root     root            35 Apr  8 18:58 ESXFshare-03 -> 551c95b0-82cded88-fe13-40f2e987aa40
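If you have a lot of .VMX files to fix, the edit can be scripted. Below is a minimal sketch, assuming you have already downloaded the .VMX file and know both the old and the new UUID. The function name is hypothetical. The pairing of the old UUID with ESXFshare-01's new UUID is purely for illustration, since the post does not say which datastore ATSVMDC lives on.

```python
import re

# A VMFS UUID as it appears in the paths above: 8-8-4-12 hex digits.
VMFS_UUID = r"[0-9a-f]{8}-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{12}"


def replace_vmfs_uuid(vmx_text, old_uuid, new_uuid):
    """Swap old_uuid for new_uuid wherever it follows /vmfs/volumes/."""
    for uuid in (old_uuid, new_uuid):
        if not re.fullmatch(VMFS_UUID, uuid):
            raise ValueError("not a VMFS UUID: %r" % uuid)
    return vmx_text.replace("/vmfs/volumes/" + old_uuid + "/",
                            "/vmfs/volumes/" + new_uuid + "/")


# The .VMX line quoted in this post.
line = ('sched.swap.derivedName = '
        '"/vmfs/volumes/49708cc0-68198024-4582-00215e521038/ATSVMDC/ATSVMDC-16886a2c.vswp"')

# Illustrative pairing only: old UUID from the line above, new UUID from the ls -l listing.
fixed = replace_vmfs_uuid(line,
                          "49708cc0-68198024-4582-00215e521038",
                          "551c95f2-989d58c6-d057-40f2e987aa40")
print(fixed)
```

Only paths under /vmfs/volumes/ are touched, so other occurrences of a similar string in the file are left alone. Keep a copy of the original .VMX before uploading the edited one.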



Once this step is carefully completed, move onto step #9.

    9. Add back VMs to inventory
   10. Bring up VMs that were originally shut down
   11. Repeat


If none of the above steps work please respond to this post. If you are not a member, join already, IT'S FREE!  ;D

All,

Today we tried to setup LDAP on our Flex v7000. When you go into the v7000 it shows the LDAP configuration as being pre-defined by an internal registry. To me this means that the FSM is the controller of LDAP for the chassis including the v7000. HOWEVER, the option to have the FSM predefined as the LDAP server for the chassis has been turned off in our case because we had an issue previously where we lost the FSM and then couldn't get into the CMM or IMMs without a total CMM reset.

I'm opening this thread in the hopes that we can come to a resolution on this issue that will be posted here. We will keep working on the issue and I will probably test setting the FSM as the internal LDAP server JUST TO SEE if we can get into the v7000 via LDAP when this option is turned on.

Has anyone else seen this?

7) Chassis Networking / SI4093 - INT ports DISABLED
« on: November 24, 2014, 01:21:08 PM »
All,

I am writing this topic after an experience I had onsite with a customer recently.

The Setup
> Pure Chassis 1 x240
> 2 - SI4093 Switches - no upgrade
> 1 - FC5022 SAN switch

The Issue
Upon installation of the OS on the x240 we noticed the NIC was "disconnected". When we checked the interfaces we saw the below.

IBMBC2-E1#sh interface status
-----------------------------------------------------------------------
Alias   Port   Speed    Duplex     Flow Ctrl      Link     Description
------- ----   -----   --------  --TX-----RX--   ------   -------------
INTA1    1     1G/10G    full      no     no    disabled     INTA1
INTA2    2     1G/10G    full      no     no    disabled     INTA2
INTA3    3     1G/10G    full      no     no    disabled     INTA3
INTA4    4     1G/10G    full      no     no    disabled     INTA4
INTA5    5     1G/10G    full      no     no    disabled     INTA5
INTA6    6     1G/10G    full      no     no    disabled     INTA6
INTA7    7     1G/10G    full      no     no    disabled     INTA7
INTA8    8     1G/10G    full      no     no    disabled     INTA8
INTA9    9     1G/10G    full      no     no    disabled     INTA9
INTA10   10    1G/10G    full      no     no    disabled     INTA10
INTA11   11    1G/10G    full      no     no    disabled     INTA11
INTA12   12    1G/10G    full      no     no    disabled     INTA12
INTA13   13    1G/10G    full      no     no    disabled     INTA13
INTA14   14    1G/10G    full      no     no    disabled     INTA14
EXT1     43    10000     full      no     no       up        2-CISCO-Te9-7
EXT2     44    1G/10G    full      no     no      down       EXT2
EXT3     45    1G/10G    full      no     no      down       EXT3
EXT4     46    1G/10G    full      no     no      down       EXT4
EXT5     47    1G/10G    full      no     no      down       EXT5
EXT6     48    1G/10G    full      no     no      down       EXT6
EXT7     49    1G/10G    full      no     no      down       EXT7
EXT8     50    1G/10G    full      no     no      down       EXT8
EXT9     51    1G/10G    full      no     no      down       EXT9
EXT10    52    1G/10G    full      no     no      down       EXT10
EXTM     65      any     auto     yes    yes      down       EXTM
MGT1     66     1000     full      no     no       up        MGT1



The Solution:

We determined the issue was with the SI4093's feature called Failover Monitoring (there is a link to the SI4093 7.8 admin guide below). The following is the recommendation from the IBM development team to alleviate this problem.

The SPAR definition for SPAR 1 (default) is configured to monitor an LACP aggregation group and, according to the switch log, the upstream switch for EXT1 is not configured for LACP.

SPAR 1 definition:
spar 1
 uplink adminkey 1000
 domain default member INTA1-INTA14
 enable
 exit

Switch log indicating LACP status:
Nov 12 12:19:52 IBMBC2-E1 NOTICE  link: link up on port EXT1
Nov 12 12:19:52 IBMBC2-E1 WARNING failover: Trigger 1 is down, control ports are auto disabled.                                                         
Nov 12 12:19:52 IBMBC2-E1 NOTICE  lacp: LACP is down on port EXT1                         
Nov 12 12:19:54 IBMBC2-E1 NOTICE  lacp: LACP is suspended on port for not receiving any LACPDUs

In the trigger, the configuration is as follows:                                         
Failover Info: Trigger                                                                   
Current global Failover setting: OFF                                                     
Current global VLAN Monitor settings: OFF                                                 
Current Trigger 1 setting: enabled                                                       
limit 5                                                                                   
Auto Monitor settings:                                                                   
Manual Monitor settings:                                                                 
Manual Monitor settings:                                                                 
 LACP port adminkey 1000                                                                 
Manual Control settings:                                                                 
  ports INTA1-INTA14                                                                     

The monitor is then triggered by the state of LACP group adminkey 1000 and, as LACP is in a down/suspended state, the control ports are disabled.
(Note: if LACP were active, the limit of 5 would also cause the trigger.)
From the SI4093 Applications Guide:                                                       
The failover limit lets you specify the minimum number of operational links required within each trigger before the trigger initiates a failover event.

For example, if the limit is two, a failover event occurs when the number of operational links in the trigger is two or fewer.

When you set the limit to zero (the default for each trigger),                                                           
the SI4093 initiates a failover event only when no links in the trigger are operational.                                                                 
** The above would apply if LACP was active on the customer uplinks **
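The limit rule quoted above is easy to get backwards, so here it is as a tiny sketch. This only restates the Applications Guide wording in code and is not how the switch firmware is implemented; the function name is made up.

```python
def failover_triggered(operational_links, limit=0):
    """True when a trigger with this limit would start a failover event.

    Follows the wording quoted above: failover occurs when the number of
    operational links is at or below the limit. With the default limit of 0
    that means only when no links in the trigger are operational.
    """
    if operational_links < 0 or limit < 0:
        raise ValueError("counts must be non-negative")
    return operational_links <= limit


# One active uplink (EXT1) with the limit of 5 from this customer's trigger.
print(failover_triggered(1, limit=5))
# The same single uplink with the default limit of 0.
print(failover_triggered(1, limit=0))
```

This is why the note says the limit of 5 would still have tripped the trigger even with LACP up: a single operational uplink is below the limit, so the INT ports would stay disabled until the limit is lowered.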
                   
Please verify whether the switch on the uplink is LACP capable and whether configuring the uplink switch is an option.                                       
It is also possible to reconfigure the SI4093 without LACP as the aggregation.
** Very important note from the Applications Guide **                                     
** to prevent network loop **                                                             

Each SPAR must include one or more internal server ports and one or more external uplink ports.
However, if multiple external ports are to be included in a particular SPAR, they must first be configured as a Link Aggregation Group (LAG), thus operating together as a single logical port connected to the same upstream network entity. Any given SPAR cannot include multiple, independent (non-LAG) uplink ports.
Each internal or external port can be member of only one SPAR at any given time.                                               

Because the SI4093 does not permit any SPAR to include multiple non-LAG uplink ports, the possibility of creating a broadcast loop is eliminated.                                                             

Please see document at url following (SPAR overview, pg 85):                             
http://pic.dhe.ibm.com/infocenter/flexsys/information/topic/com.ibm.acc.networkdevices.doc/00cg964.pdf                                               
IBM Flex System Fabric SI4093 System Interconnect Module                                 
Application Guide for Networking OS 7.8   

Summary:
The default failover trigger is invoked because the SPAR is defined for an LACP aggregation and the EXT1 uplink has no active LACP session with the uplink switch.
Please verify whether LACP is defined on the uplink switch port that EXT1 is connected to.
It is possible to modify the SI4093 for different uplink configurations and to reflect this change in the failover settings that disable internal links for teaming.

Please review the SI4093 documentation prior to making changes, as the configuration of uplinks from a SPAR must follow the recommendations to prevent a network loop.

Hello all,

I am receiving this message when I try to access an LPAR via the FSM's VTMENU. Has anyone else experienced this issue? I will reply if I find a resolution. I am thinking a reset of the FSP (flexible service processor) might do the trick, so I'll try that first.

ALL MYPURESUPPORT USERS! PLEASE BE AWARE OF THE FOLLOWING TECHNICAL SECURITY BULLETIN FROM IBM FOR SECURITY ISSUES WITH THE FSM SOFTWARE.


1.  PureFlex System

- TITLE: Security Bulletin: IBM Flex System Manager (FSM) is affected by vulnerabilities stemming from FSM's use of IBM DB2: (CVE-2012-2194, CVE-2012-2196, CVE-2012-2197, CVE-2012-4826, CVE-2013-4033, CVE-2013-5466)
- URL: http://www.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5096284&brandind=5431802&myns=purflex&mync=E
- ABSTRACT: Security vulnerabilities have been discovered in versions of IBM DB2 that are embedded in IBM FSM.

10) Chassis Management Module (CMM) / CMM HUNG Can't Login via GUI or SSH
« on: August 29, 2014, 10:49:33 AM »
All,

I am creating this topic because we are currently experiencing issues where we cannot log in to our CMM via the GUI or CLI. This happened to our lab Flex a few weeks ago and the only thing I could do to fix it was a manual reseat of the CMM (however, this is not easily possible for this customer since their Flex is in a remote location). No commands from the FSM to the CMM work.

I am opening a ticket with support to see if they can help determine what causes this.

Stay tuned!

12) General Discussion / TRY GALILEO PERFORMANCE EXPLORER!
« on: June 10, 2014, 03:18:53 PM »
At Advanced Technology Services Group we have developed Galileo Performance Explorer, a performance monitoring and metrics trending tool that shows historical performance statistics from core systems in your enterprise infrastructure landscape.

At the link below are good business cases for the tool and I suggest everyone take a look and also sign up for a free trial and give it a test run!

http://galileosuite.com/case-studies.php

"IBM PureSystems users can gain greater insights into server and storage assets by leveraging Galileo performance monitoring customized and certified for the integrated PureFlex infrastructure.

This unique Galileo solution runs as a virtual appliance in an affordable "private cloud" structure. This secure intranet environment maintains in-house control over data while delivering Galileo's deeper server and storage insights into actual system performance that can help you:

  • Empower better capacity planning and IT budgeting
  • Exploit virtualization technology more efficiently, for optimum benefit
  • Avoid disruption of business-critical applications, due to system overload
  • Minimize underutilization of hardware, as well as unnecessary purchases
The Galileo solution for PureFlex systems supports:

  • CentOS (Community ENTerprise Operating System)
  • Red Hat Enterprise Linux
  • SUSE Linux Enterprise Server

All,

I recently had an issue during an FSM install. For 3 days we could not get networking on ETH1. The first vendor had given up and used a single network for data and management using ETH0, which routes through the CMM. However, this networking schema does not allow the FSM to use a 10GB network and, more importantly, it does not allow for management deeper into the systems or for some of the more advanced functions that the FSM provides.

To make a long story short I worked with support for those 3 long days and it was determined by LEVEL 3 (product engineers) that we had a bad NIC and requested a replacement. I had brought my own FSM with me and after a rebuild of that it was still having the issue so I dug further and noticed that VLAN tagging was enabled for the internal switch-port INTA1. THIS IS NOT CORRECT!!

Since the FSM setup does not allow for configuration of a VLAN, you must have VLAN tagging DISABLED and the customer's VLAN defined on the switch-port the FSM is on (in my case INTA1). Once tagging was disabled we were able to get network immediately and ping the customer's gateway.

WHAT A NIGHTMARE.  >:(

14) VMware Support / IBM Upward Integration Module for VMware
« on: April 28, 2014, 02:36:04 PM »
If any of you VMware admins haven't seen this, check it out. As long as you have the FSM licensed, then I am told you would get this integration module free of charge!

https://www-947.ibm.com/support/entry/myportal/docdisplay?lndocid=migr-vmware

It can do the following:

  • Overview of the host or cluster status including information summary and health messages of the managed entities
  • Collects and analyzes system information to help diagnose system problems.
  • Acquires and applies the latest UpdateXpress System Packs and individual firmware updates to your ESXi system.
  • Provides non-disruptive system updates, automating the update process of the hosts in a cluster environment without any workload interruption. NOTE: I have seen more success using this tool than the FSM for node updates  8)
  • Monitoring and providing a summary of power usage, thermal history, and fan speed and a trend chart of the managed host. Enable or disable the Power Metric function on a host and set the power capping for a power-capping capable host to limit the server power usage. Support power throttling and provide notification if the server power usage exceeds the specified value.
  • Manage the current system settings on the host including IMM, uEFI, and boot order settings for the host.
  • Monitoring the server hardware status and automatically evacuating virtual machines in response to predictive failure alerts to protect your workloads.

15) Flex System Manager (FSM) / HEARTBLEED FIX FOR FSM 1.3.1
« on: April 24, 2014, 10:06:20 AM »
All,

I received a fix bulletin from IBM for the Heartbleed OpenSSL vulnerability.

http://www.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5095202&brandind=5431802&myns=purflex&mync=E

Please respond with any questions about the fix. I am currently implementing this in our lab.

Thanks!
