Tuesday, April 17, 2012

A lesson learned about RAM...

One of my customers has a simple whitebox ESXi 5 server with only local SATA disks. A whitebox VMware ESXi 5 server is just a more or less standard PC built from industry-standard components. Nothing fancy, except that you need a modern CPU with hardware virtualization support and a decent network card (the Realtek NICs you typically find in a standard PC usually won't do, so a dedicated NIC is often needed).

Anyway, the customer has this ESXi 5 server running around 9 VMs, with a total of 32 GB of RAM. About a month ago, some of the Windows VMs started getting hit at random by the typical blue screen (STOP error). The STOP errors pointed to driver errors and memory errors. At first I thought something was wrong with the disk, maybe some read errors?

Anyway, the problems disappeared after a few days, and I thought everything was in perfect order. And then it started again.

This time, I suspected the RAM modules.

So I downloaded Memtest86+ (http://www.memtest.org/) and booted the server from it. And boy, that was a lot of errors. I counted over 6000 errors after 2 passes with all the modules installed. The errors showed up pretty quickly in the test, so in my case I did not bother testing for days, as some other people do.

After removing the RAM modules and testing them one by one, I found a faulty module, RMA'ed it to the seller, and now we are back on track.

Investigation:
Now I started to wonder how this could happen, as the server had worked fine for almost a year. RAM errors don't often just appear by themselves in a server running 24/7; usually they are present from the factory. Then I found out that for almost a year my customer had only been running 3 small Windows servers, with a total RAM usage of around 5 GB. 32 GB of RAM was a little bit overkill for that, of course, but the customer added 6 new VMs just before the problem started, and then the total RAM usage was around 27 GB.

The faulty RAM was in slot 2 (counting from slot 0), so I guess that the faulty module was hardly used, or at least that the faulty memory cells were not heavily used. I am not sure how ESXi allocates memory across the modules, but I presume that it is more or less random. And with only 5 GB of 32 GB in use, there was a low chance of hitting the faulty cells.
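To put a rough number on that guess, here is a minimal Python sketch. It assumes exactly what I presumed above: that memory in use is spread uniformly at random over all 32 GB, and that the fault sits in one small region. I have not verified that ESXi actually behaves this way, and the function name is just something I made up for the illustration.

```python
def chance_fault_in_use(used_gb, total_gb):
    """Chance that one small faulty region lies inside the memory in use.

    ASSUMPTION: memory in use is placed uniformly at random across all
    installed RAM. This is a guess, not documented ESXi behaviour.
    """
    if not 0 <= used_gb <= total_gb:
        raise ValueError("used_gb must be between 0 and total_gb")
    return used_gb / total_gb


# Figures from this post: ~5 GB in use for the first year, ~27 GB afterwards.
before = chance_fault_in_use(5, 32)
after = chance_fault_in_use(27, 32)
print(f"3 small VMs: {before:.0%} chance the bad spot is in use")
print(f"9 VMs:       {after:.0%} chance the bad spot is in use")
```

Under that assumption the chance goes from roughly 16% to roughly 84%, which fits with the problems only showing up after the 6 new VMs were added.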

Lesson learned:
Always run a memory test before you put a server into production, even if it's a costly HP/Dell/IBM server with lots of fancy hardware. Especially if it is an ESXi server running many VMs.

A note about "real" server RAM:
Real servers from a well-known vendor normally use ECC RAM, versus non-ECC RAM for standard PCs. The price tag is a lot higher on ECC RAM, but on the other hand, one of the nice things ECC does is correct, on the fly, RAM errors like the ones I experienced. Of course it cannot handle all types of errors, but it definitely decreases the chances of your server going crazy.
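To give an idea of how correcting on the fly can work, here is a small Python sketch of a Hamming(7,4) code, which stores 4 data bits together with 3 parity bits and can repair any single flipped bit. This is a simplified stand-in chosen for illustration: real ECC modules do the same kind of thing in hardware over much wider words, and the function names here are my own.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit word laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word):
    """Return (data_bits, syndrome); a non-zero syndrome is the 1-based
    position of the single bit that was flipped and has been repaired."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        w[syndrome - 1] ^= 1  # flip the bad bit back
    return [w[2], w[4], w[5], w[6]], syndrome


# Simulate one bad memory cell: store a word, flip one bit, read it back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # a single-bit RAM error
recovered, position = hamming74_decode(stored)
assert recovered == [1, 0, 1, 1]
print("recovered data:", recovered, "- repaired bit position:", position)
```

A code like this only repairs one flipped bit per word. If two bits in the same word go bad, it will "repair" the wrong bit, which is one reason ECC cannot handle every type of error.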
