Tuesday, April 17, 2012

A lesson learned about RAM...

One of my customers has a simple whitebox ESXi 5 server with only local SATA disks. A whitebox VMware ESXi 5 server is just a more or less standard PC built from industry-standard components. Nothing fancy, except that you need a modern CPU with hardware virtualization support and a decent network card (not the Realtek NICs you typically find in a standard PC; a dedicated NIC is often needed).

Anyway, the customer has this ESXi 5 server running around 9 VMs, with a total of 32 GB of RAM. About a month ago, some of the Windows VMs were randomly hit by the typical blue screen (STOP error). The STOP errors pointed to driver errors and memory errors. At first I suspected the disks; maybe some read errors?

Anyway, the problems disappeared after a few days, and I thought everything was back in perfect order. And then it started again.

This time, I suspected the RAM modules.

So I downloaded Memtest86+ (http://www.memtest.org/) and booted the server from it. And boy, that was a lot of errors. I counted over 6000 errors after 2 passes with all the modules installed. The errors showed up pretty quickly in the test, so in my case I did not bother testing for days, as some other people do.

After removing the RAM modules and testing them one by one, I found the faulty module and RMA'ed it to the seller, and now we are back on track.

Investigation:
Now I started to wonder how this could happen, as the server had worked fine for almost a year. RAM errors don't often appear by themselves in a server running 24/7; usually they are present from the factory. Then I found out that for almost a year my customer had only been running 3 small Windows servers, with a total RAM usage of around 5 GB. A bit of overkill with 32 GB of RAM, of course, but the customer added 6 new VMs just before the problem started. After that, the total RAM usage was around 27 GB.

The faulty RAM was in slot 2 (counting from slot 0), so I guess that the faulty module was hardly used, or at least that the faulty memory cells were not heavily used. I am not sure how ESXi distributes memory across the RAM modules, but I presume that it is more or less random. With only 5 GB of 32 GB in use, there was a low chance of hitting the faulty cells.
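
As a rough back-of-the-envelope sketch of that guess: if we assume (and this is only my assumption, not documented ESXi behavior) that in-use memory is spread uniformly at random over all installed RAM, then the chance that one specific faulty location is in use is simply the fraction of RAM in use. The function name below is made up for illustration.

```python
def chance_faulty_cell_in_use(used_gb: float, total_gb: float) -> float:
    """Chance that one specific memory location is currently in use.

    ASSUMPTION: memory usage is spread uniformly at random over all RAM.
    This is a simplified model, not how ESXi is known to allocate memory.
    """
    if total_gb <= 0 or not 0 <= used_gb <= total_gb:
        raise ValueError("need 0 <= used_gb <= total_gb and total_gb > 0")
    return used_gb / total_gb


# Figures from the post: 5 GB in use before the new VMs, 27 GB after.
before = chance_faulty_cell_in_use(5, 32)
after = chance_faulty_cell_in_use(27, 32)

print(f"before: {before:.1%}")  # before: 15.6%
print(f"after:  {after:.1%}")   # after:  84.4%
```

Under that simple model, the chance of touching the bad spot at any given moment went from roughly 1 in 6 to roughly 5 in 6 once the new VMs were added.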

Lesson learned:
Always run Memtest before you put a server into production, even if it's a costly HP/Dell/IBM server with lots of fancy hardware. Especially if it is an ESXi server running many VMs.

A note about "real" server RAM:
Real servers from a well-known vendor always use ECC RAM, versus non-ECC RAM in standard PCs. The price tag is a lot higher for ECC RAM, but on the other hand, one of the nice things ECC does is correct, on the fly, the kind of RAM errors that I experienced. Of course it cannot handle all types of errors, but it definitely decreases the chances of your server going crazy.

Thursday, April 12, 2012

How to use OS customization for CentOS 6 in vCenter 5

Do you have a VMware vCenter 5 Server with a CentOS 6 template, only to discover that you cannot use OS customizations on it like you can with Ubuntu and most Windows OSes?

This is the solution for you.
(warning: there is an easier way to do it, look at the comments section)

Background:
If you don't know what I am talking about, the goal is this:
You have spent a lot of time creating a great VM, with an OS and maybe some applications as well, to be used as a master copy for your new VMs. That works great with Windows and a few Linux distributions (like Ubuntu and Red Hat), but not with CentOS. CentOS is based on Red Hat and is very popular in the IT hosting industry.

Now the problem is that you could make it work in CentOS 5 with a little manual editing, but kernel changes in CentOS 6 broke that old solution.
Note: This step-by-step guide is not supported by VMware or its support organization, so use it at your own risk!
This only works in vCenter 5. If you want to achieve the same in vCenter 4.1, you have to change the OS setting on the VM to RedHat.

What happens when you clone a CentOS 6 template in vCenter?
The answer is that the device manager (udev), used with kernel 2.6.13 and above, remembers the NIC settings from the template, so you end up with 2 NICs in your cloned VM. Note that you have to set the OS type of the VM template to RedHat (not CentOS!) in order to use a Guest Customization in vCenter, otherwise you will receive an error message in vCenter. (Hint: convert the template to a VM and then edit its settings to change the OS type.)

Here is an example:


[root@centostemplate ~]# cat /etc/udev/rules.d/70-persistent-net.rules
# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.
# PCI device 0x15ad:0x07b0 (vmxnet3) (custom name provided by external tool)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:02:2f", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x15ad:0x07b0 (vmxnet3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:ef:34", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"

What you end up with is 2 NICs on the cloned VM: eth0 being a copy of the template's original NIC, and eth1 being the new NIC in your VM.

This is problematic, as eth0 is not shown at all if you run ifconfig. It will only show eth1 with DHCP, and even if you set a static IP (in the Customization Wizard), it will not work.

Now for the solution:
Remove the eth0 section from the file /etc/udev/rules.d/70-persistent-net.rules
Example (remove the eth0 entry: the first "# PCI device" comment line and the rule ending in NAME="eth0"):



# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.
# PCI device 0x15ad:0x07b0 (vmxnet3) (custom name provided by external tool)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:02:2f", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x15ad:0x07b0 (vmxnet3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:ef:34", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"


And then change NAME from eth1 to eth0 in the remaining rule.
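
If you clone often, the same edit can be scripted. The following is only a minimal sketch, assuming a VM with a single NIC and a rules file laid out like the example above; the function name and the way the stale MAC address is passed in are my own choices for illustration, and the sample text is shortened. It works on the file content as a string, so you can check the result before writing anything back to /etc/udev/rules.d/70-persistent-net.rules.

```python
import re


def fix_persistent_net_rules(rules_text: str, stale_mac: str) -> str:
    """Drop the rule for the template's old NIC and rename the new NIC to eth0.

    ASSUMPTION: exactly one rule should remain afterwards (single-NIC VM).
    """
    kept = []
    for line in rules_text.splitlines():
        is_rule = line.startswith("SUBSYSTEM==")
        if is_rule and stale_mac.lower() in line.lower():
            # Also drop the "# PCI device ..." comment directly above the rule.
            if kept and kept[-1].startswith("# PCI device"):
                kept.pop()
            continue
        kept.append(line)

    rule_indexes = [i for i, l in enumerate(kept) if l.startswith("SUBSYSTEM==")]
    if len(rule_indexes) != 1:
        raise ValueError(f"expected exactly 1 remaining rule, found {len(rule_indexes)}")

    # Rename whatever name udev assigned (eth1 in the example) to eth0.
    i = rule_indexes[0]
    kept[i] = re.sub(r'NAME="eth\d+"', 'NAME="eth0"', kept[i])
    return "\n".join(kept) + "\n"


# Shortened sample using the MAC addresses from the example above.
SAMPLE = """\
# PCI device 0x15ad:0x07b0 (vmxnet3) (custom name provided by external tool)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:02:2f", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
# PCI device 0x15ad:0x07b0 (vmxnet3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:50:56:42:ef:34", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
"""

fixed = fix_persistent_net_rules(SAMPLE, stale_mac="00:50:56:42:02:2f")
print(fixed)
```

The function refuses to continue if it does not end up with exactly one rule, so a VM with several NICs is not silently renamed to a duplicate eth0.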

Now you have a working NIC, but with the wrong config.

To correct the config:

[root@centostemplate ~]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="static"
HWADDR="00:50:56:42:ef:34"
IPV6INIT="no"
IPV6_AUTOCONF="no"
NM_CONTROLLED="no"
ONBOOT="yes"
IPADDR="192.168.10.125"
NETMASK="255.255.255.0"
NETWORK="192.168.10.0"
BROADCAST="192.168.10.255"


Note that you have to edit HWADDR to match the new NIC's MAC address. If you are unsure what the correct MAC address is, just edit the VM and look at the MAC address in the network adapter settings.
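
The HWADDR change can be scripted in the same spirit. Again, this is just a sketch under my own assumptions: the helper name is hypothetical, it expects the quoted KEY="value" style shown in the listing above, and it only transforms the text, leaving it to you to write the result back to /etc/sysconfig/network-scripts/ifcfg-eth0.

```python
import re

# Six colon-separated hex pairs, e.g. 00:50:56:42:ef:34
MAC_RE = re.compile(r"^[0-9a-f]{2}(:[0-9a-f]{2}){5}$", re.IGNORECASE)


def set_hwaddr(ifcfg_text: str, new_mac: str) -> str:
    """Return the ifcfg content with HWADDR set to new_mac.

    Replaces an existing HWADDR line, or appends one if none is present.
    """
    if not MAC_RE.match(new_mac):
        raise ValueError(f"not a valid MAC address: {new_mac!r}")

    new_line = f'HWADDR="{new_mac}"'
    out = []
    replaced = False
    for line in ifcfg_text.splitlines():
        if line.startswith("HWADDR="):
            out.append(new_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Shortened sample: the clone still carries the template's old MAC address.
OLD_IFCFG = """\
DEVICE="eth0"
BOOTPROTO="static"
HWADDR="00:50:56:42:02:2f"
ONBOOT="yes"
"""

updated = set_hwaddr(OLD_IFCFG, "00:50:56:42:ef:34")
print(updated)
```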

Reboot the server and you're done!
That's it. Maybe a little extra work, but on the other hand, now you can use Guest Customizations in vCenter, which saves a lot of work hours!

Credit to http://aaronwalrath.wordpress.com/2011/02/26/cloned-red-hatcentosscientific-linux-virtual-machines-and-device-eth0-does-not-seem-to-be-present-message/ for pointing me in the right direction!