Here's my latest take on high availability:
1. Very, very few really need it.
2. The vast majority can live with being down for days.
3. Those who really, truly need it must be able to spend a lot of everything on it - continually.
4. Those who buy HA-gear (software and hardware) without realising how much they have to spend on it will suffer much worse downtime than they would have without the HA-gear.
Pity those folks who buy the gear and think it works as advertised, out of the box, like the vendor said, like the POC showed, and so on. They're the true victims.
Typically, it takes 8-9 months to truly test and stabilise a RAC system. As I've said elsewhere, some people elect to spend all of those nine months before going into production, whereas others split it so that some of the time is spent before and, indeed, some of it after going into production.
But that's not all: Even when the system has been stabilised and runs fine, it will go down a couple of times a year, or more often, and create problems that you have never seen before.
It's then time to call in external experts. But instead of just fixing the current cause of your IT crisis, I'd like to suggest that you treat the situation as one where you need to spend a good deal of resources on stabilising your system again - until the next IT crisis shows up.
Your system will never be truly stable when it's complex. The amount of effort and money you'll need to spend on having humans who can react to problems and run the system day-to-day, and - very important - on keeping them on their toes with realistic (terribly expensive) test systems, courses, drills on realistic gear, networks of people who can help right now, and so forth... is huge.
The ironic thing is this: If you decide that you can live with downtime, and therefore with a much less complex system - your uptime will increase. Of course.
Mark Souza of Microsoft said it: Complexity is the enemy of availability.
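
To put rough numbers on that, here is a toy back-of-the-envelope model in Python. It is only a sketch: it assumes failures are independent, and every availability figure in it is made up for illustration, not measured on any real RAC system. The function names and the three "extra" components (interconnect, clusterware, shared storage) are my own hypothetical stand-ins for the added complexity.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def serial_availability(availabilities):
    """Everything in the chain must be up: multiply the availabilities."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant parts: the group is down only if all of them are down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, chosen purely for illustration.
NODE = 0.995          # one plain database server
EXTRA_PART = 0.998    # each extra piece the cluster depends on

# The simple system: a single server.
single_node = NODE

# The "HA" system: two redundant nodes, but now also depending on
# three extra parts (say interconnect, clusterware, shared storage).
clustered = serial_availability(
    [parallel_availability([NODE, NODE])] + [EXTRA_PART] * 3
)

for name, a in [("single node", single_node), ("clustered", clustered)]:
    print(f"{name}: {a:.5f} available, "
          f"{expected_downtime_hours(a):.1f} hours down per year")
```

With these invented numbers the redundant pair on its own looks superb, yet the cluster as a whole comes out slightly worse than the single server, because each extra part is one more thing that has to be up. Change the figures and the result changes, of course; the point is only that the added pieces sit in series with the redundancy you paid for.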