Jul 11, 2014

So what does a reliability engineer do?

Reader Smitty50 (an old and dear friend) asks "What does a reliability engineer do?"

Good question.

Depends on the kind of organization we are talking about. In an organization that operates systems, a reliability engineer uses measurements and statistical methods to ensure that the overall system stays within the required level of reliability as the system and the components from which it is built age and the system's capabilities change.
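To make that a little more concrete, here is a minimal sketch of the kind of arithmetic involved. It is a textbook simplification, not anybody's actual method: it assumes every component has a constant failure rate (the exponential model) and that the system fails if any one component fails (a series system). The function names and the failure rates are made up for illustration.

```python
import math


def component_reliability(failure_rate_per_hour, hours):
    """Probability that one component survives `hours` of operation.

    Assumes a constant failure rate (exponential model), which is a
    simplification: real components that wear out have rising failure rates.
    """
    return math.exp(-failure_rate_per_hour * hours)


def series_system_reliability(failure_rates_per_hour, hours):
    """Reliability of a system that works only if every component works.

    For independent components in series, the reliabilities multiply.
    """
    reliability = 1.0
    for rate in failure_rates_per_hour:
        reliability *= component_reliability(rate, hours)
    return reliability


def meets_requirement(failure_rates_per_hour, hours, required_reliability):
    """Check the predicted system reliability against the required level."""
    predicted = series_system_reliability(failure_rates_per_hour, hours)
    return predicted >= required_reliability


# Hypothetical numbers: two components, 1000 hours of operation.
rates = [1e-6, 2e-6]
print(series_system_reliability(rates, 1000))
print(meets_requirement(rates, 1000, 0.99))
```

In practice the failure rates come from measurements, and they drift as components age, which is why the check has to be repeated over the life of the system.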

These are critically important people. Our lives and our civilization depend on the engineers responsible for ensuring the reliability of our telecommunication systems, our bridges and road systems, our sewers, our water supply, the great hydroelectric dams, and so many other things that we all take for granted.

But reliability has to be built in from the beginning. A system and every component of the system have to be designed and built to be reliable. You cannot test in reliability any more than you can test in quality. So the reliability engineer must be an important part of development too! The difference between a mature engineering discipline and typical new product development is the presence of reliability engineers in the design process and the quality of the mathematical tools they have and use.

That role is pretty common in fields like civil engineering, so common in fact that you never see it as a separate job, because civil engineers are trained in reliability from the ground up. Remember the Tacoma Narrows Bridge? Never again. You can also find people in that role all through aerospace and telecommunications.

In software? Ehh... not so much. My bet is that if Microsoft had to comply with the same reliability requirements that Boeing or Airbus have to comply with, they would never have shipped a single product. I was at a meeting with MS in Redmond and remember being called a liar by an MS manager when we told him the reliability of the telephone system. We were told that what we claimed (a matter of public record, BTW) was not possible. But then, Windows blue screens a lot more often than a 747 crashes or your landline phone stops working.

Recently I had my cancer treatments disrupted because the version of Windows that ran the equipment blue screened and the medical staff had to wait two weeks for a technician to come out and fix it. I found a facility that uses equipment that is not controlled by Windows. Who wants to actually die, or at least be seriously burned, by a blue screen of death?

OK, I shouldn't just pick on MS; pretty much every software product is just as bad. The only software that seems to even approach a professional level of reliability is software used in aircraft, and open source software. Not all of open source is that good. But I can't remember the last time my Ubuntu Linux box crashed.