So few really need uptime
Here's my latest take on high availability:
1. Very, very few really need it.
2. The vast majority can live with being down for days.
3. Those who really, truly need it must be able to spend a lot of everything on it - continually.
4. Those who buy HA-gear (software and hardware) without realising how much they have to spend on it will suffer downtime much worse than without the HA-gear.
Pity those folks who buy the gear and think it works as advertised, out of the box, like the vendor said, like the POC showed, and so on. They're the true victims.
Typically, it takes 8-9 months to truly test and stabilise a RAC system. As I've said somewhere else, some people elect to spend all of those nine months before going production whereas others split it so that some of the time is spent before and, indeed, some of it after going production.
But that's not all: Even when the system has been stabilised and runs fine, it will go down a couple of times a year, or more often, and create problems that you never saw before.
It's then time to call in external experts, but instead of just fixing the current cause of your IT crisis, I'd like to suggest that you instead consider the situation as one where you need to spend a good deal of resources in stabilising your system again - until the next IT crisis shows up.
Your system will never be truly stable when it's complex. The amount of effort and money you'll need to spend on humans being able to react to problems, running the system day-to-day, and - very important - keeping them on their toes by having realistic (terribly expensive) test systems, courses, drills on realistic gear, networks of people who can help right now, and so forth... is huge.
The ironic thing is this: If you decide that you can live with downtime, and therefore with a much less complex system - your uptime will increase. Of course.
Mark Souza of Microsoft said it: Complexity is the enemy of availability.
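To put rough numbers on that quote, here is a minimal back-of-the-envelope sketch. It assumes the textbook "series" model: the service is up only when every part it depends on is up, and the parts fail independently. The 0.999 figures and the count of five parts are made up for illustration, not measurements from any real system, and the model deliberately ignores whatever redundancy the HA gear is supposed to add. It only shows what each extra moving part that has to work costs you.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series_availability(parts):
    """Availability of a system that is up only when every part is up.

    Assumes the parts fail independently, so the availabilities multiply.
    """
    return prod(parts)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical numbers, purely for illustration.
simple_system = [0.999]        # one box, one moving part
complex_system = [0.999] * 5   # five parts that must all work at once

for name, parts in [("simple", simple_system), ("complex", complex_system)]:
    a = series_availability(parts)
    print(f"{name}: availability {a:.4f}, "
          f"about {downtime_hours_per_year(a):.1f} hours down per year")
```

Under these assumptions every additional part that must be working pulls the total down, which is the arithmetic behind "moving parts" being the thing to count.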

14 Comments:
I totally agree.
Dominic Delmolino
www.oraclemusings.com
And you should know!
For those of you who don't know Dom, he was one of the pillars (2nd in command, or so) of Oracle's System Performance Group under Cary Millsap. Dom was the first human being on the Planet to master Symmetric Replication (as it was called).
But don't all added features increase complexity at some level? I once posted that if you are 100% averse to complexity you should be using dbopen(3), recno(3), hash(3) (and family) on flat files, not Oracle. Or maybe C-ISAM?
Sure, the "features" that enable RAC functionality are a lot more visible than the underpinnings of, say, data guard but both are very complex. No?
I totally agree too.
We used to have a (dual) OPS instance way back from 1996-1999: Oracle7.
In the end, it *definitely* gave us more downtime than it promised us uptime.
We're back to a single instance now: way more stable!
I remember one particular situation where both OPS instances would crash if a stored procedure was deployed on instance A, that would (temporarily) invalidate another stored pl/sql module that some session would still have a shared pool lock on at the other instance.
Oracle never got around to fixing that one for us.
Wonder if it's still there.
We had a couple of query scripts that needed to be run prior to delivering a stored procedure, just to check for a (known to us) situation that would crash the lot. And sometimes we had to just opt to bring down one instance, do the deploy, and re-integrate the other instance again.
And I'm not talking about the unexplained freezes that would occur when the instances were running for a long time (that is, 2-3 weeks). Yep, proactive bounces in the weekend were the cure.
We (the dba-team) used to joke, that Larry's software was starting to look a lot like Bill's.
Toon
Yeah, everything is complex if you dive deep enough into it, I guess.
A fair question to ask these days could for instance be: Is a relational database really needed for this or that data and purpose? And another could be: Would it be OK to lose data?
We've been brought up in the RDBMS school to think the obvious and only answer is Yes to both questions, but we shouldn't let that cloud our judgement of the real needs of customers.
When something is not really needed, although it would be cool if it worked, you've got the cart before the horse.
A simple backup of your system, where restore and recovery is tested regularly, is what most people really need. They don't need DR-sites, DataGuard, and such.
For political reasons you might have to implement all sorts of things, and I still haven't found an effective way of preventing that from happening.
Toon's story about OPS reminds me of the early 90's where an airport was running OPS to make sure the passengers could find their gates.
The system would crash (cluster, OPS, the works) two-four times a year, which was pretty bad since nobody knew where to go then.
Every time I'd look and look at logs, dumps, traces, bug texts, TAR's (called PMS, and such back then) without finding the reason for the crash. The hardware vendor would do the same.
We were probably not good enough back then, either, but we tried for a couple of years.
Then I suggested they just run on one node, i.e. stop using OPS, and they've been stable ever since.
No, OPS is not RAC, but similar situations exist aplenty to prove my point that it takes SO much effort and dedication to get REAL high-availability.
I should point out to you a few guys who have been very inspirational for me in this area:
James Morle with his "Disposable Computing Architecture" (if it breaks, throw it away and get the spare from the cupboard).
Jeff Needham, who introduced me to the notion that a DL-585 from HP in fact IS your HA - and also introduced me to the term "moving parts".
Mark Souza, head of Customer Advisory Team at MS, for the quote "Complexity is the enemy of availability" (or something like it).
Excellent point. Most 'standard' hardware these days is so redundant anyway (multiples of everything, hot-swappable stuff, etc.) that the odds of the server going down are pretty small. The odds of some BOFH doing something bad to the data that does not get picked up for days, or a drunk sysadmin/DBA (like me) doing an rm -r in the wrong directory, are a lot higher, methinks. I also worry about the storage system that RACs and stuff run on: there's usually only one of them to share between the servers, unless there is a spare that is being mirrored to. What if a repair guy forgets his radioactive pacemaker inside there for a few days? The KISS principle seems to be applicable here too - Keep It Simple, Stupid.
You are the owner of such a nice blog, and it has more than 60 indexed posts, so I am a little surprised by the negative title used for it. Don't you blog regularly?
It's referring to a website I created several years ago called Vi Bruger Ikke (vibrugerikke.dk) which was later translated into wedonotuse.com.
So it's just a joke calling it we do not use blogs.
Mogens
Hi Mogens
I did like your take on this. Conversely - those customers who do have the discipline to put a cost on their downtime also often have the discipline and savvy to really make RAC (and other resilience technologies) really work - and they really are great to work with...
Kind regards
Andrew (back in touch after some years...)
http://blogs.oracle.com/asparks
great post!
Hi Andrew,
Ah, good to have you back. What have you been doing?
Mogens