Wednesday, December 14, 2005

System-wide slowdowns

When I talk to customers with performance problems, I ask these three questions:

1. Is everything slow, or is it just specific stuff like a form, a report or a job (Cary Millsap would call it a Business Function, I think :) )?
2. Has it always been slow or did it just start?
3. Do you know what changed?

If everything really IS slow, i.e. what I call a system-wide slowdown, then I still don't think you can find the answer quickly and reliably (except perhaps through pure luck combined with decades of experience) by looking at system-wide data such as 'sar', Statspack, v$sysstat, v$system_event and the like.

[With one exception, though: most of the time this situation is caused by running too many reports or other heavy jobs at the same time, i.e. saturating the CPUs with CPU-intensive, long-running jobs. Cary wrote about 'The Magic of 2' some years ago, and he'll 'kill' that topic during his Master Class in Copenhagen in January, so that'll be fun. Anyway, when you can establish that CPU usage is 100% (or 0% idle), it's usually just month-end or a similar situation where many reports have to be run at once.]
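By the way, if you're on 10g you can ask the database itself about host CPU through v$osstat (on earlier versions you'll have to ask 'sar' or friends at the OS level). A minimal sketch, nothing more:

  SELECT stat_name, value
    FROM v$osstat
   WHERE stat_name IN ('NUM_CPUS', 'BUSY_TIME', 'IDLE_TIME');

The time values are cumulative centiseconds, so if BUSY_TIME grows by about NUM_CPUS * 100 per elapsed second between two samples, the box is flat out.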

Back to the main topic: in the relatively few cases of system-wide slowdowns I've been involved in where the cause wasn't simply too many 'batch' or CPU-intensive jobs running simultaneously, it has been possible to find the reason by 10046'ing something that is used widely, because whatever slows the whole system down will also impact that fellow.

To be safe, you might want to 10046 (perhaps it should be called Deep-six when it's level 12...) two or three such typical thingies to be sure.
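For the record, here's roughly how I'd switch the thing on; the sid and serial# (42, 1234) are made up for the example and would come from v$session, and dbms_system.set_ev is of course undocumented:

  -- In your own session:
  ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

  -- In somebody else's session:
  EXEC sys.dbms_system.set_ev(42, 1234, 10046, 12, '');

  -- And off again:
  EXEC sys.dbms_system.set_ev(42, 1234, 10046, 0, '');

The trace file lands in user_dump_dest, ready for tkprof or your favourite profiler.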

The symptom of whatever is bothering the whole system will show up in the trace files.

So, again, I don't think you can use the system-wide stuff for anything :-).

This should conclude my trilogy on this topic. 'Baselining' and 'Statspack' were the first two.

Monday, December 05, 2005

What good is Statspack really?

I remember the days of bstat/estat, and especially the Oracle Support bulletin (from the version 6 days) about how to interpret the numbers in the resulting report.txt file. The bulletin was a long list of things you could look at in the file, and that was about it. From it you could, of course, conclude nothing about performance problems.

Of course? Yes, looking at aggregations will not, ever, tell you the reason for a slowdown in an application or even in a whole system. The tracing of a form, a report or a job can do that. Not various summaries and averages.

The one exception is when only one thing is running :-) ... then the summary becomes just a summary of that one form, report or whatever. Otherwise, things will be mixed together and hence messed up.

Dave Ensor, Graham Wood and a few others made the st7 utility, which allowed you to collect multiple snapshots and get a report for the interval between any two of them. Still, I couldn't figure out how to link the users' performance problems to these numbers.

Then came Statspack. Now, Statspack is much better technically, AND it can collect SQL statement information that will help you identify heavy statements. AND it will summarise the waits, etc. at the top.
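To be fair, collecting the stuff is easy enough (assuming Statspack has already been installed as PERFSTAT via spcreate.sql):

  -- Take a snapshot, wait through the interesting interval, take another:
  EXEC statspack.snap;
  EXEC statspack.snap;

  -- Then report on the interval between any two snap ids:
  @?/rdbms/admin/spreport.sql

It's the 'and then what?' part that bothers me.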

I have still not found anyone who can use a Statspack report to identify the reason why a specific form, report or job runs slowly.

Also, having the heaviest SQL statements listed is fine, but if the response time changes (see one of my earlier posts) for an end user, the likely cause is a new and exciting execution plan for a SQL statement in the application, and that statement doesn't have to be the heaviest at all.
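If you suspect a plan change, by the way, v$sql_plan (9i and up) will show you the plan the statement actually runs with; a rough sketch, where &hash_value would be taken from v$sql or from a trace file:

  SELECT id, parent_id, operation, options, object_name
    FROM v$sql_plan
   WHERE hash_value = &hash_value
   ORDER BY id;

Compare that with what the plan looked like back when the user was happy, and the 'new and exciting' part usually jumps out.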

I fail to see what I can use Statspack for at all. Please enlighten me :-).