Performance Monitoring's Cutting Edge

Howard Fosdick

A monitoring tool must be able to lead the DBA straight to a problem across heterogeneous, highly distributed environments. How are the major vendors meeting the challenge?

If Rip Van Winkle were an Oracle DBA, he'd only need to have taken a three-year nap to be astounded by the advances made in performance and system monitoring. He might be a little overwhelmed by how important performance and database management are to deriving full value from his corporation's applications and by how essential it is to monitor the Oracle database to ensure that it's available and working efficiently. He would have to learn about security, managing tablespaces and other database objects, proper placement of data on disk, and space management.
      Understanding current trends is essential to deciding upon your company's approach to performance monitoring and what tools you may need.
      Performance management occurs at several different levels. The highest level relies on systemwide or aggregate statistics that define the state of the system. This high-level monitoring takes place operating system-, DBMS-, and applicationwide. And the combination of these three high-level viewpoints should constitute a "macro" view that provides immediate information on the state of the system and on the general nature of any problems that may exist. A good monitoring tool will give you sufficient information to address any operating system-, database-, or applicationwide problems. For example, it should be able to alert you to turn on OS asynchronous I/O, increase the size of the database buffers, or adjust the application's batch schedule to fit your nightly window.
      In addition to these global views, you'll also need views to monitor individual programs, SQL statements, or queries. At this level, a good performance monitoring and analysis tool should lead the intelligent programmer or DBA straight to the source of the problem and provide sufficient information to help resolve it. It should provide details on a broad range of program or query behavior, such as locking traces, real I/Os, buffer hits, and the like.
      The market currently offers three types of performance-monitoring tools: those specializing in global views, those that focus on specific aspects of SQL statements or program performance, and those that integrate both views into a single tool. If you choose this third type, look carefully to see how it handles the transition from the global to the specific. For example, make sure it can trace a particular global problem down to the particular SQL statement or query whose behavior causes the problem.
      Performance-monitoring tools are currently able to measure and display a vast range of "counters" or performance analytics. Computer Associates' Unicenter-TNG, for example, was initially oriented toward operating system and network availability and management but has now added components such as an Oracle Agent to expand the product's purview into the realm of database availability and reliability. In contrast, BMC Software's traditional core strength has been database technology, and its Patrol products have grown to cover closely related networking and operating system problems. Platinum Technology also evolved from a core expertise in DBMSs and widened its scope by purchasing a large number of companies in the past two years that produced specific "point" solutions to operating system, networking, and database issues. Platinum is now busy molding these products into a more coherent, integrated whole. Oracle's products have always focused on database issues but necessarily cover some closely related platform availability concerns. Oracle's tools address several operating systems but target only Oracle's DBMS.

TRENDS

The past few years have seen a variety of trends in database management and performance (see Table 1). One overarching trend is the shift from physically centralized homogeneous systems toward distributed systems on a variety of platforms. Most companies today expect to be able to oversee a shortlist of diverse platforms from advanced workstations at any physical location. Environments frequently consist of various operating systems--MVS, various flavors of Unix, and Windows NT--and a mixture of databases, including Oracle, Informix, SQL Server, and DB2. Networks also add to the blend, with companies mixing NetWare and LANs, wide-area TCP/IP networks, and mainframe networks based on SNA and VTAM.

TABLE 1. Major trends in database and performance management.

Distributed Issues Trends

Control of data center systems • Workstation control of distributed systems

Host monitoring • Intelligent agents to monitor distributed systems via workstation consoles

Managing homogeneous platforms (single operating system/single database system) • Managing heterogeneous platforms (multiple operating systems/several database systems)

Point tool solutions • Integrated toolsets monitor and manage operating systems, databases, and networks

Narrow monitoring subjects • Expanded range of monitoring counters

Active Monitoring Issues Trends

Human monitoring via data center consoles • Software-system monitoring
• Agents raise alerts or event triggers to indicate problems

Scheduled jobs to address problems • "Fix-it" programs run automatically to resolve problems

Passive monitoring • Proactive monitoring and proactive, preventative maintenance

Expertise Issues Trends

High expertise requirement • Expertise embedded in software
• Software recommendations
• Proactive software
• Self-monitoring/self-maintaining systems
• Embedded expert systems

Character interfaces • Graphical management

Help systems that explain complex parameters in their own terms • Extensive, highly explanatory help systems that make explicit recommendations

Databases no longer function in isolation: They are bound up, inseparably, with operating systems and networks, with availability and performance issues cutting across all three areas. Companies want to move systems out to their users and design their computer support according to the companies' needs (not the other way around, as was necessary when expensive computer hardware dictated the plan). For this reason, monitoring tools must be able to handle heterogeneous, highly distributed environments.
The tools from Computer Associates, BMC, Platinum, and Oracle represent a new breed of monitoring/availability tools that have evolved to meet most of these requirements only in the past several years. How have they done it? One key is that all their products are based on intelligent agent architecture, a design that places independent modules, or agents, on each monitored node in a distributed system (see Figure 1). These agents are responsible for collecting performance data at the node on which they run and reporting it back to a central collector node. Workstations running GUI tools then access the data from the collector node and present it to administrators, DBAs, and other support personnel.

      Separating remote agents from the collector mechanism can yield many benefits. For example, individual nodes only need to run the code or agents necessary to monitoring that machine. One node might run an Oracle agent, another a Sybase agent, and another both. Similarly, machines can run operating system- specific agents for platforms such as AIX, HP/UX, Solaris, and NT. Separating remote agents from the collector mechanism also ensures higher reliability in a distributed system. If a node goes down, only its agent(s) are out of the picture. The collector continues to oversee other platforms and agents. When the node recovers, it can "catch up" by receiving any queued messages, commands, or schedules from the collector node. Thus an agent architecture permits the greatest degree of reliability and flexibility possible in complex distributed systems.
      The relative "intelligence" of agents is a competitive issue among vendors. Some agents are merely data-gathering and reporting tools. CA Unicenter-TNG, BMC Patrol, Platinum POEMS, and Oracle Enterprise Manager and Performance Pack all go beyond these simple tasks by raising event alerts or triggers that notify administrators when predefined thresholds are exceeded. For example, if the Oracle log archive directory fills up, intelligent agents in the tools monitoring the event can either notify a person (through console flags, email, phone, or pager) or trigger a program that can automatically address the problem. Either way, the agents are able to take measures proactively before Oracle is affected.
      Ultimately, agents may develop a greater degree of independent decision making. After all, as a DBA, do I need to know at 3 a.m. that the log archive has filled up? Why not just have the intelligent agent run a corrective script to fix the problem, and then have it send me an email that I'll see when I log in first thing in the morning?
      The degree of "proactivity" in agents is as important as their intelligence. The current trend is to shift from simple monitoring to sophisticated independent action and the ability to address and correct an increasingly wide range of problems. The ultimate goal is to fully automate systems monitoring and correction through agents.
      Several competitive points also arise among vendors concerning the collector node. Can responsibility for the collector shift dynamically among nodes as performance or availability requirements dictate? What resources and overhead do the collector node and remote monitoring agents consume? Does the data reside in a standard database that administrators can query directly?
      A key feature in making agent architecture work well is the ability to schedule and monitor batch jobs and scripts remotely. The collector node needs to be able to create job schedules of arbitrary complexity in the distributed system, then communicate with the affected remote nodes, assign them their portions of the schedule, and ensure that they successfully execute the programs. Not only must the monitoring tool have a sophisticated job scheduler embedded, but this job scheduler must also address all the complexities inherent to distributed environments. For example, if a node goes down, the tool must know whether and when to reschedule the job, ensure that the job still completes, and notify administrators as to what happened and why.
      Intelligent agent architecture also yields design benefits in distributed scheduling. Individual agents are responsible for executing and monitoring their portions of the overall schedule on their individual machines: They can queue jobs for execution, organize scheduling, and reschedule jobs in a timely manner if the machine becomes unavailable for some reason or if a crisis occurs.

EXPERTISE ISSUES

Performance measurement, management, and improvement is typically a high-expertise endeavor. DBAs and support personnel often work with product-specific terminology. How well a tool masks this complexity--by explaining terms or concepts and rendering the environment easier to manage--is a key aspect of product differentiation. Of course, the trend is toward having software rather than humans manage system performance, reducing the need for expertise. For example, if a system is intelligent enough to see the need for a data reorganization, generate the script to achieve this task, schedule the script, and oversee its successful completion, fewer personnel will need to have intimate knowledge of reorganization utilities.
While total automation may be the holy grail, even relatively unsophisticated management tools can do a lot to relieve the need for top-notch expertise. For example, merely having good documentation (online and printed) and a decent help system goes a long way toward helping database personnel do their jobs. Just as many programmers don't like to document their programs, many vendors skimp on their product documentation. When they do provide documentation, it is usually of a much lower caliber than the quality of the product calls for.
Table 2 shows examples of "poor," "better," and "best" documentation. The example of poor documentation represents the kind of help a software developer might write if his or her work is not checked or held to a high standard. The example of better documentation at least provides an explanation of the term that is not self-referential. A DBA can understand exactly what is meant by the "buffer hit ratio" from this definition. The example of best documentation goes the extra mile. It capitalizes on recent browser technology advancements by including hyperlinks to related topics. More importantly, it gives advice on what to do--a significant step beyond merely describing a problem. Finally, the example of the kind of documentation we should expect in the future shows that expert systems should prescribe as well as describe. Optimally, such a help system should evolve into a self-correcting mechanism that describes the actions it took on its own to resolve an availability problem or to tune performance dynamically.

TABLE 2. Variations among help systems.

POOR:
Buffer Hit Ratio: relative number of buffer hits.
BETTER:
Buffer Hit Ratio: relative number of I/O requests that are satisfied by reading information from Oracle's memory buffer, vs. those that require actual reading of data from disk.
BEST:
Buffer Hit Ratio: relative number of I/O requests that are satisfied by reading information from Oracle's buffer, versus those that require actual reading of data from disk.
Because Oracle's buffer is memory or RAM, I/O requests that are satisfied by buffer access are much faster than those that require actual disk I/O. A system with a high buffer hit ratio performs much better than one with a low buffer hit ratio.
If the buffer hit ratio is consistently below 90 percent, you may wish to improve it by increasing the size of Oracle's database buffers. This may entail purchasing more memory for your system.
For a conceptual overview, click on SYSTEM GLOBAL AREA (SGA) and ORACLE BUFFER CACHE. For detailed information on setting the size of the buffer cache, click on the INIT.ORA parameter: DB_BLOCK_BUFFERS. Also click on related parameter: SHARED_POOL_SIZE.

THE FUTURE:
Contains the explanatory information of "The Best" above, plus:
To initiate dynamic data collection on the buffer hit ratio and receive recommendations on optimal settings for this system, click here.
To generate and schedule a script that will implement the recommended optimal settings, click here.

UNICENTER-TNG

Computer Associates first gained recognition as leader in providing add-in software for data center management back in the mainframe era. In the 1990s, it expanded its purview to cover almost all the common operating systems as well as the networking components of distributed systems. Recently it has expanded into database monitoring and problem resolution as well.
      Table 3 lists some of the major areas of functionality for Unicenter-TNG for operating system platforms, including event management, file management, tape management, workload management or scheduling, security management, problem management, system resource accounting, and report and spool management. In addition to propagating these tools to almost every popular platform, Computer Associates has also vastly expanded its automated alert and trigger system with a Systems Alert component. Systems Alert specializes in identifying and averting a wide range of common operating system problems, such as full file systems, undue swapping, kernel problems, excessive CPU utilization, and thrashing. Computer Associates' Systems Alert and Event Management are both thoroughly grounded in the intelligent agent architecture.

TABLE 3. CA-Unicenter components.

Component Function

Event Management Centralized message and alert collection; can oversee or automatically respond to messages and alerts

File Management A file management system that includes backup/recovery capabilities

Tape Management Labels and tracks tapes; ensures that tape backups are properly retained

Workload Management
(job scheduling): An industrial-strength job scheduling system that handles any level of job-scheduling complexity or remote job submission and control, and uses calendars to control schedules

Security Management A policy-based security system that uses a central security database for all access determinations

Problem Management Tracks and monitors problems so that a site can easily identify their status and systematically manage their resolution.

Report and Spool Management Manage printers and reports in the distributed system so that they are spooled and routed in the most efficient and appropriate manner

System Resource Accounting A facility designed to track computer resource usage and permit chargeback accounting and billing if desired

Systems Alert Separate systems alert modules are available that provide intelligent agents to monitor remote operating systems, raise alerts when appropriate, and run scripts to automate problem handling

CA-20/20 Spreadsheet interface for viewing system statistics and reporting

      The company tackled the challenge of remote database management with its Oracle Agent. As Table 4 shows, this intelligent agent handles a number of common database problems in the areas of table and tablespace use, sequence numbers, disk I/O, workload balancing, and licensing. As with competing products, Oracle Agent gathers key monitoring information from Oracle's dynamic V$ tables. It measures parameters in dba_users, dba_tablespaces, dba_extents, dba_free_ space, dba_sequences, dba_data_files, dba_ segments, and other areas.

TABLE 4. CA-Unicenter's Oracle Agent features.

Feature Function

Tablespace utilization Space monitoring for tablespaces

Table utilization Space monitoring for tables and their constituent extents

Tablespace fragmentation Monitors tablespaces for excessive fragmentation and issues event alerts if thresholds are exceeded

Sequence number rollover Detects rollovers when sequence numbers reach their allowable maximums and can issue a trap in this event

Disk I/O Can watch for, detect, and help correct unbalanced workloads across channels or I/O adaptors

Licensing Oracle databases can have either session usage licensing or named-users licensing. This tool issues traps or event alerts for violations of user and session thresholds

      A key distinguishing factor among products is configurability: How customizable is the tool, and how much expertise does customization require? The Oracle Agent is highly configurable and can be tailored as necessary for unique environments.
      Computer Associates' remote agent architecture relies on SNMP running on top of TCP/IP for communication. The tools use the management information base as their repository. Whether products employ widely recognized standards such as these wherever possible is another distinguishing competitive feature.
      While CA's tools are exceptional for monitoring and managing operating systems, their Oracle-specific capabilities are still evolving. The Oracle Agent focuses primarily on availability issues and threshold alerts. It offers no administrative interface for the DBA and little that aids database performance analysis and tuning.

OEM AND PERFORMANCE PACK

As you'd expect, Oracle's tools have a different scope than those of its competitors; they monitor Oracle DBMSs only. However, they accomplish this task across different operating systems (and do an excellent job of masking operating system differences).
As the tools are from the database vendor, they feature a wider range of performance and availability parameters than many competing tools and embody more intimate knowledge of the database. They show elegant design and integration with the underlying database. They are more likely to stay up to the latest release of the base DBMS than tools from vendors that must address a range of different DBMSs. On the downside, Oracle's tools address operating system and network monitoring and control them only in so far as they affect the DBMS, rather than as areas of concern and management in their own right. You'll probably need to buy some other product to manage these areas if you use Oracle's tools.
Figure 2 illustrates Oracle's version of the intelligent agent architecture I described earlier, which it calls the Oracle Enterprise Manager (OEM) architecture. Intelligent agents reside on remote monitored nodes while the repository serves as the centralized collector node and a communication daemon provides the failsafe communication mechanism across these components.

OEM's GUI enables four major functions: job scheduling (for scheduling and overseeing of remote jobs), event management (for remotely monitoring database events and raising alerts if thresholds are exceeded or problems occur), service discovery (for automating setup of the environment by discovering Oracle services throughout the network), and utilities (for database management, maintenance, and optimization). See Table 5 for a list of individual GUI tools.

TABLE 5. OEM tools.
Tool	Function
Instance Manager	Starts up/shuts down databases, reviews/edits initialization parameters, manages in-doubt transactions and user sessions
Schema Manager	Creates, alters, deletes, and views database objects including tables, views, triggers, indexes, synonyms, clusters, database links, functions, procedures, packages and package bodies, snapshots, snapshot logs, refresh groups, and sequences
Security Manager	Creates, alters, deletes, and views users, profiles, and roles
Storage Manager	Manages tablespaces, data files, and rollback segments
Backup Manager	Backs up databases, tablespaces, and data files via scripts and the Backup Wizard
Data Manager	Exports, imports, and loads data, optionally via wizards
SQL Worksheet	Runs scripts, SQL queries, PL/SQL code, and OEM commands
Software Manager	Runs software configuration jobs and tracks assets
Network Topology Manager	Assists in defining and configuring the network topology for the DBMS

Oracle emphasizes the fact that its OEM console permits the use of third-party plug-in tools and allows customers to integrate their own applications.
Beyond OEM, Oracle offers a separate license for its Performance Pack, its basic tool for dynamic performance monitoring and tuning (see Table 6). This add-in integrates so seamlessly into OEM that many DBAs are not aware that a separate license is involved. Finally, Oracle offers additional systems management applications for specific database functions, such as Replication Manager, Media Server Manager, Parallel Server Manager, WebServer Manager, and Biometric Manager for sign-on via fingerprint.

TABLE 6. Oracle Performance Pack components.
Component	Function
Oracle Expert	Aids in database configuration and collects statistics to aid in databasewide performance tuning. Provides tuning recommendations and educates personnel in how to best address performance issues.
Lock Manager	Monitors locks to help avoid deadlocks and lock waits. Displays user name, session ID, owner of the locked object, name of the object, and object type.
Performance Manager	Realtime charts for DBMS-wide performance monitoring. Can chart global information on users, throughput, tablespaces, redo logs, buffers, caches, and I/O in any number of ways.
Topsessions	Allows you to list high resource-consumption database users by sorting on any of several different resources. Identifies performance "hogs" and helps rectify these programs or queries.
Tablespace Manager	Previously part of the base OEM product, Tablespace Manger helps monitor tablespaces and correct fragmentation and coalescing problems.
Oracle Trace	Collects data on events you specify, for example: SQL statements, a user logon, SQL events, and transaction events. Useful for performance analysis and diagnostic purposes.

MORE TO COME

This brief review of two major products for monitoring, managing, and tuning Oracle databases illustrates how far such products have come in the past three years and how far they have yet to go. Even the largest, most aggressive vendors are challenged by the tremendous scope of the tasks before them in meeting customer needs. One thing's certain: If we gave Rip Van Winkle another three-year nap, he'd wake up even more astounded by how far we'll have traveled toward automated monitoring of highly distributed heterogeneous systems.

Howard Fosdick is an independent consultant who works with Oracle, SQL Server, and DB2 under various Unix flavors and Windows NT; he specializes in performance and availability issues. You can reach Howard at [email protected].

This is a copy of an article published @ http://www.oreview.com/