Archive for the ‘Electrical’ Category

Why did my Data Center UPS Fail?

Friday, October 7th, 2011

I hear this all the time. Most people move out of a datacenter because something bad happened, and it's usually a major power failure that causes the most trouble. In this article, I am going to outline and analyze a power failure event that occurred at an unnamed facility. This is a true story.

About 2 years ago, I fielded a call from someone who had lost power at their current data center provider. In addition to being down, they also had some equipment failures (power supplies and some RAM went bad in a few systems). Their provider told them that nothing was wrong with the UPS; rather, it was an issue with the utility caused by a brownout. As soon as I heard this, I told the person that this explanation was completely bogus.

Let's recap the cardinal rules of a good UPS:

1. An online UPS setup should always provide clean line power regardless of supply.

2. If an online UPS fails, an auto-sync transformer bridges line power and utility within 1 Hz and no power is lost; only backup capability is lost.

And let's recap what you need to do in order to make sure the above rules always apply:

1. Check your batteries every 3 months.

2. Replace a battery as soon as its internal resistance rises by 10%.

3. Replace a battery as soon as it's 4 years old, even if its internal resistance is still within spec.

4. Provide suitable cooling to the UPS.

5. CHECK THE BATTERIES.
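
The replacement rules above are easy to turn into an automated check against a battery report. Here is a minimal sketch in Python, assuming each battery has a recorded baseline internal resistance from when it was installed. The `BatteryReading` type, its field names, and the sample numbers are all made up for illustration; they are not from any vendor's report format.

```python
from dataclasses import dataclass

RESISTANCE_RISE_LIMIT = 0.10  # rule 2: replace at a 10% rise over baseline
MAX_AGE_YEARS = 4.0           # rule 3: replace at 4 years, even if in spec


@dataclass
class BatteryReading:
    label: str                 # hypothetical identifier for the battery
    baseline_milliohms: float  # internal resistance when installed (assumed known)
    measured_milliohms: float  # latest quarterly reading
    age_years: float


def needs_replacement(b: BatteryReading) -> bool:
    """Apply the resistance-rise and age rules to one battery."""
    rise = (b.measured_milliohms - b.baseline_milliohms) / b.baseline_milliohms
    return rise >= RESISTANCE_RISE_LIMIT or b.age_years >= MAX_AGE_YEARS


def batteries_to_replace(readings):
    """Return the labels of every battery that fails either rule."""
    return [b.label for b in readings if needs_replacement(b)]


# Illustrative sample data only.
readings = [
    BatteryReading("A1", 4.0, 4.1, 1.5),  # small rise, young: keep
    BatteryReading("A2", 4.0, 4.6, 1.0),  # 15% rise after 1 year: replace
    BatteryReading("A3", 4.0, 4.0, 4.5),  # in spec but over 4 years: replace
]
print(batteries_to_replace(readings))  # prints ['A2', 'A3']
```

Note that battery A2 is flagged after only 1 year. Age alone is not a safe proxy for health, which is why the readings have to be taken every 3 months.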

I can't stress enough how important batteries are. The entire UPS is built around the concept of having working batteries. Almost every line-affecting outage of a UPS is due to a battery problem. At Quonix, we use Liebert Series 300 UPS systems that have had inverter boards fail, induction coils burn out, and input filters short out, and we NEVER lost output line power. That's why the Lieberts cost so much: they are designed to handle failures, but that requires good batteries.

Getting back to the story about the brownout. Any UPS that experiences a brownout or any kind of dirty power would immediately engage its batteries in order to provide clean power while it activates the GENSET cut-over. This requires the UPS to run on batteries for 5-7 seconds. If the batteries can't hold, the UPS will drop offline into bypass mode and auto-sync to utility line power. Once a UPS goes into bypass and syncs to utility power, it no longer provides power protection or line conditioning, so all the dirty power goes straight through. If power was lost, GENSET power now comes straight through. And when utility power returns, the GENSET cuts out, causing another small blip. This is why the server power supplies and RAM went bad: the dirty, and possibly surging, power came right through the UPS into the rack cabinet.
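
To make the sequence concrete, here is a deliberately simplified sketch in Python of the decision described above. It is a toy model, not how any real UPS firmware works: the `Mode` names and the two boolean inputs are assumptions for illustration, and it ignores timing, the GENSET hand-off, and the fact that a real unit stays in bypass until it is reset or repaired.

```python
from enum import Enum


class Mode(Enum):
    ONLINE = "online"    # input is clean, output is conditioned
    BATTERY = "battery"  # input is dirty, batteries carry the load
    BYPASS = "bypass"    # input is passed straight through to the load


def ups_mode(input_clean: bool, batteries_hold: bool) -> Mode:
    """Toy model: which mode the UPS ends up in for a given input condition."""
    if input_clean:
        return Mode.ONLINE
    # Dirty input: ride through on batteries if they can hold,
    # otherwise drop to bypass.
    return Mode.BATTERY if batteries_hold else Mode.BYPASS


def load_sees_clean_power(input_clean: bool, batteries_hold: bool) -> bool:
    """The load is protected in every mode except bypass."""
    return ups_mode(input_clean, batteries_hold) is not Mode.BYPASS


# A brownout with healthy batteries versus the same brownout with bad ones.
print(ups_mode(False, True).value, load_sees_clean_power(False, True))
print(ups_mode(False, False).value, load_sees_clean_power(False, False))
```

The only path that exposes the rack to dirty power is the one where the batteries fail to hold. A brownout by itself should never reach the load.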

Many providers don't properly maintain their batteries. They just assume the batteries will last 4-5 years. Not the case. I've seen brand new battery cabinets have 1 battery go bad after as little as 1 year. Sometimes it's just a random manufacturer defect. And in many cases, all it takes is 1 bad battery to foul the entire array.

Want to be sure your provider is on top of things? Easy, just ask for a copy of their UPS and battery preventative maintenance contract. If they have one, and they should, it should be easy for them to fax or email you a copy. You can even request a battery report. At Quonix, the vendor we use for our battery maintenance sends us a detailed graphical report with the health of each battery: voltage, impedance, internal resistance, temperature, and age.

Why are breaker ratings so important? Should I care if my datacenter does not follow them?

Wednesday, July 1st, 2009

Those who use datacenter services for their IT operations know much about OS technologies, databases, and server hardware, but unfortunately, many IT professionals are not familiar with safe power practices and the ramifications if they are not followed.

The most common area of confusion is the 80% breaker rating limit. Some datacenters will only allow you to run continuous duty equipment up to 80% of the breaker's amp rating. For example, on a 20 amp 120V breaker power feed, that means you're limited to 16 amps of continuous power draw. These datacenters are following national fire and electrical code standards, and they are operating in a safe environment.
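
The arithmetic is simple enough to script if you are planning a rack. Here is a small sketch in Python of the calculation above; the function names are made up for illustration, and the volt-amp figure is just amps times volts for the 120V example.

```python
CONTINUOUS_PERCENT = 80  # continuous loads limited to 80% of the breaker rating


def continuous_limit_amps(breaker_amps: float) -> float:
    """Maximum continuous draw allowed on a breaker under the 80% rule."""
    return breaker_amps * CONTINUOUS_PERCENT / 100


def is_within_limit(breaker_amps: float, draw_amps: float) -> bool:
    """True if a continuous draw stays at or under the 80% limit."""
    return draw_amps <= continuous_limit_amps(breaker_amps)


limit = continuous_limit_amps(20)
print(f"{limit:.1f} A continuous, about {limit * 120:.0f} VA at 120 V")
# prints: 16.0 A continuous, about 1920 VA at 120 V

print(is_within_limit(20, 15.5))  # prints True
print(is_within_limit(20, 18.0))  # prints False
```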

The sad thing is that many datacenters do not enforce the 80% rule, and instead allow their users to run all the way to 100% of the breaker rating in continuous draw. In addition to running an unsafe environment, these datacenters further propagate misconceptions about how to run your power environment effectively. When you allow someone to do something that is incorrect, that person assumes it is correct since you let them do it. Time and time again we have seen new customers come into our facility and say, “my previous provider allowed us to do…”. We then have to break down that myth and explain that what they were doing was unsafe and a code violation.

Why is it dangerous to exceed 80% breaker rating?

Let's start with the obvious. First, you're more likely to trip your breaker. Second, you're more likely to shorten the lifetime of the breaker, causing it to fail or false trip. And more importantly, you run the risk of an electrical fire. NEC standards for wire sizing are based on 80% limits as well, so if you run 100% of the breaker rating through those wires 24×7, the wires will overheat. Over time, the excess heat can cause the jacket to degrade and eventually fail, leaving bare wire exposed, which could trigger an electrical fire.

Additional concerns involve the slow degradation of your power system as a whole, including your UPS, inverters, and transformers. All electrical equipment is designed by the manufacturer to not exceed 80% of its max rating for more than 3 hours. If you do exceed the limit, you violate the warranty and run the gear in a fashion that it was not designed for.

Why would a datacenter violate the 80% power rule?

There are many reasons. One is money. By allowing customers to pull a few extra amps, they can get a bit more business since their offering will appear cheaper. Another reason is that they lack the ability to monitor power usage. A more serious reason, though, is a lack of communication between sales people, customer service, and operations. The operations people may know it's wrong, but the sales people don't, and they allow it to occur. It's not caught due to the communication breakdown.

What does this all mean?

Simple. When looking for a colocation provider, stay away from datacenters that don't enforce 80% breaker limits. If they don't, they run an unsafe environment. By itself, it may not seem like a big deal, but that environment will usually fail to meet other criteria of a quality provider. In some ways, the handling of power infrastructure is a litmus test for quality datacenter operations as a whole. If the operations group of a datacenter provider has a very good handle on power, more often than not, they have a good handle on everything else!