Today I would like to review Harvard Pilgrim’s experience with the implementation of OpenView ITO to control and monitor its HP3000 servers. I will start with a short history of the data processing environment at HPHC and the initiation of the OpenView project. Then I will review the role of the console in computer operations for those who may not be Operators or may be somewhat removed from day-to-day operations.
Next we will look at OpenView for the HP3000 in detail. We will examine how OpenView interfaces to the HP3000, how it collects data, and message processing in OpenView.
Lastly we will look at viewing messages using the Active Message Browser, the View/Some command, and message browsers derived from node icons and message group icons. We will end with an example of setting up message filters in OpenView and some tips on the construction of message filters.
Harvard Pilgrim Health Care is the top HMO in the Northeastern United States. It was formed from the merger of Harvard Community Health Plan and Pilgrim Health Care in 1995. Both organizations, at the time, used the HP3000 for their data processing needs. Pilgrim Health Care employed seven HP3000’s and Harvard Community Health Plan used two HP3000’s. After the merger, five additional HP3000’s were purchased. Two data centers were maintained until June of 1998. In the fall of that year, they were combined into a single data center.
The new data center housed fourteen HP3000’s under one roof, along with a host of other technologies including IBM, VAX, HP-UX, NT and Digital Unix systems. Only one or two Operators focus on the HP3000 and its operation during a shift.
Those of you who have worked closely with the HP3000 know that the Operators “live” on the console. In our data center, there are also several workstations and terminals that are used to perform tasks on the HP3000 that are not specific to the console.
The need for an enterprise management solution was recognized as part of the Data Center consolidation. At that time, we chose HP OpenView as the tool that would give us the functionality and flexibility to manage our operations. We contracted with HP for a consulting engagement in the fall of 1998. The consultant was on site for two periods, first for five days and later for eight, with a week's hiatus in between. HP provided tools in the form of scripts that could be used to monitor HP3000 resources and processes. There was also extensive one-on-one training on how to configure and use HP OpenView. Later, two HPHC staff members were sent to an OpenView course at HP. During the fall of 1998, we gained practical experience in the configuration of HP OpenView. It was a time of experimentation in which we learned how to use many of OpenView’s features.
During the fall and the winter of 1998-1999 certain important lessons were learned about enterprise management.
1. Enterprise management is not for the Timid!
2. Enterprise management is not a turnkey system.
3. Each site must do a significant amount of configuration work to tailor any enterprise management solution to their operation.
4. Training the operators and technical staff is an essential key to success.
5. In-house staff must serve the role of on-going development and training.
6. Development is on-going. There is no final end point, only the next milestone.
7. Management must strongly support the effort if it is going to succeed.
Next, let’s take a look at the HP3000 console. The Operators “live” on the console. Almost all software, including operating system software, custom software, and third party software, sends messages to the console. The messages that can be observed on the console are limited because the scrollable memory of the console is limited. This is true whether the console is a PC or a terminal. New messages push the old messages out of memory at a fairly rapid rate.
There are certain key events you don’t want to miss in the life of an HP3000 or any other computer for that matter. For our operation, the key events we monitor with OpenView are the following:
Resource or Process | Critical Events |
1. Backups | Bad tapes, mount requests |
2. Tape drives | Tapes stuck in drive |
3. Printers | Forms requests, out of paper statuses, mechanical problems |
4. Job Schedulers | Hung schedules, abended jobs |
5. Network software and hardware | Connection problems |
6. Mirrored disks | Lost mirrored pairs |
7. Disks | Bad sectors |
8. Disk space | Low disk space |
9. Databases | Database shadow problems, full datasets |
10. Programs | Critical success or failure messages |
Next, let’s take a look at how HP OpenView connects to the HP3000. There are two parts to the OpenView software. There is software that runs on a Management Server. In our case, this is an HP9000/K420. It is on the same network segment as the HP3000s it manages. This is not a requirement for OpenView, only the way we have chosen to implement it. The OpenView management server could be far away in some other city in the case of a widely distributed system. The Management Server is accessed using an HP700 workstation, an HP X-term or a PC running Reflection-X.
On the HP3000 there is an OpenView account called OVOPC that contains the HP OpenView software. The OpenView job that runs continuously is called OPCAGTJ,AGENT.OVOPC. You can check on the status of the OpenView agent by doing a SHOWJOB to see if it is running and by running the program
opcagt.bin.ovopc -status
This will show you what OpenView processes are running on the HP3000.
On the HP3000 there are four primary processes by which OpenView collects information about the system. First is the console logfile interceptor. Console messages are intercepted, copied and sent to the OpenView agent. The messages still appear in their entirety on the physical console and the physical console still has all its functionality. The OpenView agent runs the console messages through a series of filters. These filters pick out important messages that are then sent to the management server for display.
Next is the logfile reader. Any ASCII file can be designated as a logfile for OpenView to read. The logfile must have sequential records. OpenView can run a script to perform an evaluation, generate a record from the evaluation process and add it to the logfile. Then it reads the entry and sends it to the OpenView agent. The logfile can also be read by sensing a change in the modification time and date. MPE utilities such as SHOWJOB, DISCFREE, SHOWOUT, DBUTIL, VOLUTIL, etc. can be run and the output reformatted to produce one line of text per event or resource. Also, a job can send output to a disk file which can be designated a logfile as well. Logfile readers are used to monitor disk space, volume set status, mirrored disk status, and spoolfile queues.
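To make the logfile reader idea concrete, here is a minimal Python sketch of the general technique: notice that a sequential text file has changed, then pick up only the records added since the last look. It is not OpenView code. The class name, the file name and the sample record layout are all made up for illustration.

```python
import os
import tempfile


class LogfileReader:
    """Hypothetical reader: return only records appended since the last poll."""

    def __init__(self, path):
        self.path = path
        self.offset = 0          # bytes already handed on
        self.signature = None    # (modification time, size) seen at the last poll

    def poll(self):
        stat = os.stat(self.path)
        signature = (stat.st_mtime, stat.st_size)
        if signature == self.signature:
            return []            # file unchanged, nothing new to forward
        self.signature = signature
        with open(self.path, "rb") as handle:
            handle.seek(self.offset)
            data = handle.read()
        self.offset += len(data)
        # One record per line; blank lines are ignored.
        return [line for line in data.decode("ascii").splitlines() if line.strip()]


def demo():
    """Write two made-up disk space records and poll after each one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "discfree.log")   # assumed file name
        with open(path, "w", encoding="ascii") as out:
            out.write("LDEV 1 FREE 12%\n")         # invented record layout
        reader = LogfileReader(path)
        first = reader.poll()
        with open(path, "a", encoding="ascii") as out:
            out.write("LDEV 2 FREE 4%\n")
        second = reader.poll()
        third = reader.poll()                      # nothing was added
    return first, second, third
```

Each poll returns only what is new, which is why a script can keep appending one line per event or resource to the same file.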
The third way OpenView monitors activities on the HP3000 is through the use of threshold monitors. Threshold monitors are scripts that return a value to OpenView about the status of a resource or process. OpenView can run these scripts at intervals and compare the returned value to a threshold. Thresholds can be descending, ascending or triggered by a transition in value either way. We use this method to monitor dataset fullness, outstanding tape replies, forms requests, the presence of standing jobs, and Netbase status.
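The three kinds of threshold can be sketched in a few lines of Python. This is my own reading of ascending, descending and transition thresholds, not OpenView's implementation, and the function name and arguments are hypothetical.

```python
def threshold_crossed(kind, threshold, previous, current):
    """Decide whether a monitor value should raise a message.

    kind      -- "ascending", "descending" or "transition" (assumed names)
    threshold -- the configured limit
    previous  -- value from the prior run, or None on the first run
    current   -- value returned by the monitor script this run
    """
    if kind == "ascending":
        # Trouble when the value climbs to the limit, e.g. dataset fullness.
        return current >= threshold
    if kind == "descending":
        # Trouble when the value falls to the limit, e.g. free disk space.
        return current <= threshold
    if kind == "transition":
        # Trouble when the value moves from one side of the limit to the other.
        if previous is None:
            return False
        return (previous < threshold) != (current < threshold)
    raise ValueError(f"unknown threshold kind: {kind}")
```

For example, a monitor that counts outstanding tape replies could use an ascending threshold, while a monitor that reports whether a standing job is present (1) or absent (0) could use a transition threshold of 1.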
The fourth method by which OpenView gets information from the HP3000 is by messages sent from the opcmsg facility. Messages can be sent directly to the OpenView agent from a job stream. The syntax is as follows:
opcmsg.bin.ovopc "a=app o=msggroup s=severity 'msg_text=text'"
Where
app | is the name of an application label that will appear on the OpenView message display. |
msggroup | is the name of the message group label in the OpenView display. |
severity | is the message severity or importance. |
msg_text | is the text that will appear in the OpenView message display. |
This program can be run in a job stream, inside an “if” statement that tests for an error condition and spawns an OpenView message in response to that error condition. For example, consider the following portion of a job:
!JOB FULLBACK, MANAGER.SYS;OUTCLASS=LP,1,1
::
!IF CIERROR .EQ. !ERRORNO !THEN
! RUN OPCMSG.BIN.OVOPC "a=SYSTEM o=OUTPUT &
!S=major 'msg_text=problem with backup detected, see &
!instructions in message section'"
!ENDIF
::
!EOJ
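If many job streams send messages this way, it helps to build the argument string the same way every time. The Python sketch below composes a string in the form shown above. It is only an illustration: the function name is hypothetical, the list of severity names is taken from the browser severity labels described in this talk, and the validation rules are my own assumptions rather than anything opcmsg is known to enforce.

```python
# Severity names as described for the message browser (assumed to be the
# values opcmsg accepts).
SEVERITIES = {"critical", "major", "minor", "warning", "normal"}


def build_opcmsg_args(app, msggroup, severity, text):
    """Compose the quoted argument string in the form shown in the talk."""
    severity = severity.lower()
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    for label in (app, msggroup):
        # Labels are single words here so the space-separated form stays unambiguous.
        if not label or any(ch.isspace() for ch in label):
            raise ValueError(f"label must be one word: {label!r}")
    if "'" in text or '"' in text:
        # Quote characters in the text would break the nested quoting.
        raise ValueError("message text must not contain quote characters")
    return f"\"a={app} o={msggroup} s={severity} 'msg_text={text}'\""
```

Calling `build_opcmsg_args("SYSTEM", "OUTPUT", "major", "problem with backup detected")` yields the same kind of string used in the FULLBACK job above.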
There is one other way OpenView can ascertain the state of the HP3000. This is through the use of SNMP traps. SNMP services can be run on the HP3000 and be polled by the management server. This is the method used to tell if the HP3000 is up or down, at least as far as the network is concerned. It is even possible to write your own MIBs for specialized monitoring. We have not implemented this as yet, however.
Let’s take a look at how information is displayed on the management server. This is the OpenView desktop that is seen on an HP700 workstation or X-term when you log into the Management Server and bring up OpenView. If you use Reflection-X on a PC, only one or two windows at a time can be displayed with any degree of detail unless you use a large monitor. There are four basic windows that display information in OpenView.
First and foremost is the Active Message Browser window. This is where all the messages from all the HP3000 systems are displayed. This is our consolidated console screen. The Active Message Browser displays all the messages that were not suppressed by the OpenView agent on the HP3000. Messages from the console interceptor, logfile reader, threshold monitor, opcmsg messages and SNMP statuses are displayed.
There are also three windows with various icons. The three windows are the Node Bank window, the Message Group window and the Application Group window. The Node Bank window contains an icon for each system that is monitored. The Message Group window represents categories of messages. When a message is filtered and sent to the management server, a message group label is attached. It is a useful handle by which groups of like messages can be displayed together. The Application Group window contains icons that can perform tasks on a managed node. Click on the MPE icon to open up a new window of MPE applications. Drag an HP3000 icon over the application icon for disk space and a window will open up showing the output from a DISCFREE on that system.
How you arrange these windows is up to you. This is how we configured ours.
In the Browser, on the left-hand side, are colored bars under the heading of Sev. These are the message severity labels. A severity label is attached when a message is filtered on the HP3000 and sent to the Management Server. The label is an attribute that is determined when you design a message filter. Red indicates critical messages like a downed system or bad disks. Orange indicates a major problem such as a halted production schedule. Yellow is a minor problem such as several unanswered tape replies. Blue is a warning label such as low disk space. Green means that this is a “normal” message. Green messages usually are informational and do not represent the presence of a problem. The next column from the left is headed by a funny acronym called SUIAONE. Each letter in this acronym stands for a message attribute that was attached by the filter that matched the message. The letters signify the following:
S - Status, standard message, either owned or marked.
U - Unmatched, event is unmatched by any message condition filter but the message was passed to the browser.
I - Instructions are attached to the message suggesting a course of action.
A - Automatic actions are attached to the message
O - Operator initiated actions attached to the message
N - Notes or annotations are attached to the message
E - Escalation from another or to another management server
Next are columns for the date, time and sources of the message.
The sixth, seventh and eighth columns from the left contain three message labels that are assigned by the message filter on the HP3000. These are Application, MsgGroup and Object. You can use these labels for any short label information you want the message browser to display. Keep in mind that these labels are useful for grouping messages into like categories.
On the far right side of the browser window is the message that was passed to the Management Server from the HP3000. Many of these messages start with a label such as RRUN037 or SYS012. When the message filter processes the message on the HP3000, it is possible to rewrite the message that is sent to the message browser. This can make an esoteric message much more readable. One item I add to each re-written message is a label that tells which filter did the processing. When you have over 350 filters processing messages, it can be hard to tell which filter is responsible if you don’t add a filter designation to the message displayed. Don’t confuse this label with the MsgGroup, Application or Object label that OpenView assigns.
The message browser displayed when an OpenView session is first started is not the only way to view system messages. Other message browsers can be opened that display only selected messages. The icons in the Node Bank and MsgGroup windows can be used to launch browser windows with selected messages. Right click on a node icon and a pull down menu will appear. Drag down to message browser. A new browser window will open up with only messages from that node. Right click on a message group icon and drag down to message browser. A new browser window will open up that contains only messages that have been labeled with that message group name. Now you can see how useful the message group attribute can be. These additional browsers are updated in real time just like the main browser.
This works well if you are interested in all messages from a node or message group, but what if you want to select by date or even a particular message phrase? This is possible too. In the main browser note the menu items across the top. Left click on the View menu and drag down to Some….
A window opens up that allows you to select messages by any number of message attributes. Messages can be selected by node, message group, application, object, severity code, a message phrase, date and time, matched or unmatched (by a filter) and owned or unowned. Select multiple criteria to create a browser window of your own design. It can stay up continuously and display new messages that meet your criteria. This is a great way to monitor important events or processes. It also demonstrates the use of message labels such as MsgGroup, Application and Object. If all messages from backups are assigned the label “Backup” by the filters on the HP3000, you can easily monitor backups for a particular system or systems. If you want to see all the NETIPC error messages from a point in time when you knew there was a problem, just enter the phrase NETIPC and select the date and time of the event. Browsers can find the critical messages that you need to know.
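The way several selection criteria combine can be pictured with a short Python sketch. It models a message as a small record and keeps only the messages that meet every criterion supplied. The field names, the function name and the sample data are assumptions for illustration, not OpenView's own data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Message:
    node: str
    msg_group: str
    application: str
    severity: str
    when: datetime
    text: str


def select(messages, node=None, msg_group=None, severity=None,
           phrase=None, start=None, end=None):
    """Keep messages that satisfy every criterion that was supplied."""
    chosen = []
    for m in messages:
        if node is not None and m.node != node:
            continue
        if msg_group is not None and m.msg_group != msg_group:
            continue
        if severity is not None and m.severity != severity:
            continue
        if phrase is not None and phrase.lower() not in m.text.lower():
            continue
        if start is not None and m.when < start:
            continue
        if end is not None and m.when > end:
            continue
        chosen.append(m)
    return chosen
```

Selecting with `msg_group="Backup"` and one node name corresponds to watching backups for a single system, while `phrase="NETIPC"` with a start and end time corresponds to looking back at a known problem.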
There is one more type of browser that is used. This is the History Browser. Messages in the Active Message Browser can be acknowledged. When an Operator acknowledges a message in the Active Message Browser, that message is sent to the history database. From this location, only the History Browser can view it.
The History Browser allows you to look back in time even further than the Active Message Browser. From the Active Message Browser window, click on View and drag down to History. A new browser window opens up that shows all the acknowledged messages on the system. We keep about 50,000 messages in our database. This should allow a retrospective view of approximately one month, depending on the number of systems that are monitored.
In summary there are three types of browser windows: the main browser or Active Message Browser, the filtered browser and the History Browser. It is not unusual to have several browser windows open simultaneously.
Now that we have seen how to look at messages, let’s take a closer look at the whole life cycle of a message. Here is the sequence:
1. Messages are created on the HP3000.
2. The OpenView agent on the HP3000 picks up messages.
3. Messages are matched against a set of selection filters.
4. If the message filter does not suppress the message, it is sent to the Management Server.
5. Automatic actions are performed if any are attached to the message, e.g. execute a script, send a helpdesk ticket, beep someone.
6. Messages are displayed in the “Active Message Browser” with date, time, severity, and system of origin. The original message text may be re-written to produce text that is more meaningful.
7. The Operator observes the message.
8. The Operator decides if any operator initiated actions should be performed, for example a reply to a tape request or forms request.
9. The Operator may read any attached instructions in the message.
10. The operator adds any annotations to the message such as who was called and what was done.
11. The Operator acknowledges the message.
12. The message leaves the active message browser and goes to the history database.
13. The message can be viewed in the History Browser if there is unfinished business.
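The back half of this sequence, from arrival in the browser through acknowledgement, can be sketched as a tiny Python model. It is a simplification for illustration only. The class and method names are hypothetical and it ignores actions, instructions and ownership.

```python
class MessageStore:
    """Toy model of the active browser and the history database."""

    def __init__(self):
        self.active = []    # messages waiting for an Operator
        self.history = []   # acknowledged messages

    def receive(self, text, suppressed=False):
        """Steps 3-6: a message the filters did not suppress becomes active."""
        if suppressed:
            return None
        message = {"text": text, "annotations": [], "acknowledged": False}
        self.active.append(message)
        return message

    def annotate(self, message, note):
        """Step 10: record who was called and what was done."""
        message["annotations"].append(note)

    def acknowledge(self, message):
        """Steps 11-12: move the message from the active list to history."""
        self.active = [m for m in self.active if m is not message]
        message["acknowledged"] = True
        self.history.append(message)
```

The point of the model is that acknowledging does not delete anything. The message, with its annotations, simply changes location.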
We talked about message attributes when we discussed the various browser windows. Now that we have discussed the life cycle of a message and touched upon more attributes, it is a good time to summarize all the attributes and objects of a message. These are as follows:
Message Attributes and Objects |
Message Attribute | Message Object |
Node | Automatic actions |
Time | Operator initiated actions |
Date | Instructions |
Severity (importance) | Annotations or Notes |
Application label | |
Message group label | |
Object label | |
Ownership | |
When you click on a message in the message browser, a window opens up showing these message attributes. This window contains fields that correspond to all the various message attributes and objects listed above.
We have talked a lot about message filtering to this point. Let’s take a closer look at message filters.
The construction of message filters is the most important task that you will perform in OpenView.
A message is one line of text on the console. Each line of a multi-line message on the console is a single message to OpenView. We only want to see a fraction of all the console messages. Just how selective do we want to be? On average, 7,000 to 15,000 messages pass over the console of each HP3000 in our data center each day. At present, there are fourteen HP3000’s which we wish to monitor. At roughly 10,000 messages per system, that equates to about 140,000 lines of messages a day! Excluding backups, we only want to see 5 to 10 important messages a day. Therefore OpenView should reject more than 99.9% of all console messages and pass less than 0.1% of the messages to the Active Message Browser. Message filtering has to be very selective to do this. The criteria for critical message selection have to be even more selective.
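The arithmetic behind these figures is easy to check. The short Python sketch below uses only the numbers quoted in this talk, with 10,000 messages per system taken as a round mid-range figure. The function names are mine.

```python
SYSTEMS = 14                          # HP3000s being monitored
PER_SYSTEM_PER_DAY = (7_000, 15_000)  # console messages per system per day
WANTED_PER_DAY = (5, 10)              # important messages we want to see


def daily_volume(per_system):
    """Total console lines per day across all monitored systems."""
    return SYSTEMS * per_system


def pass_fraction(wanted, total):
    """Fraction of console messages that should reach the browser."""
    return wanted / total


typical_total = daily_volume(10_000)  # round mid-range figure
fraction = pass_fraction(WANTED_PER_DAY[1], typical_total)
print(f"{typical_total} lines a day, pass about {fraction:.4%}")
```

Even at the upper figure of ten important messages a day, the fraction passed is well below one message in a thousand.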
Let’s look at the process for building message filters. First, collect some console logfiles. Next look for message patterns. In the message patterns, there are variable parts and constant parts. The variable parts are job numbers, job names, error numbers, and some message phrases. Next decide on the messages to detect and the messages to ignore. There should be far more messages to ignore than messages that are important.
Select the kind of wild cards you will use. They are as follows:
Wild Card Character | Type of character |
* | Wild card any part of a message, space, words, numbers |
@ | Wild card a character string |
# | Wild card a number or string of numbers |
S | Wild card separators such as spaces or tabs |
There are other operators that are used in creating a message matching expression.
Operator | Function |
[ ] | Enclose groups of expressions |
| | Logical “or” |
! | Logical “not” |
< > | Encloses a wild card and declares a variable. |
With these operators and wild cards we can build our message filter. Consider the representative message:
NetIPC ERROR in VT; Job: #J1543; PIN 763; Info: 1
The phrase “NetIPC ERROR in VT; Job:” is constant from instance to instance. The job number, PIN number and info number will vary, however. Substitute wild cards for the variable portions. After substitution, the expression looks as follows:
NetIPC ERROR in VT; Job: #<*>; PIN <#>; Info: <#>
A “*” has replaced the letter J and the job number, and “#” has replaced the PIN number and info number.
Next, assign variables to the wild card portions. The expression now looks like this:
NetIPC ERROR in VT; Job: #<*.sess>; PIN <#.pin>; Info: <#.info>
After each wild card, a period and the name of the variable have been placed. The wild card and the variable name must be inside the “< >”. If this expression matches a message, you can use the variables to print a modified message in the Active Message Browser. For example:
A NetIPC error has occurred in a VT session for Job #<sess> and Pin number <pin>. The info number is <info>
This expression will be displayed with the session number, pin number and info number from the original message. The message has been rewritten to be more user friendly. This is only a brief introduction to the pattern matching language for message filtering. There are more advanced expressions that use inequalities and more complex substitution. The documentation from HP for the OpenView software reviews this. Below is an example of a screen used to specify the above message filter. The whole re-written message scrolls beyond the Message text window.
Harvard Pilgrim employs over 350 message filters in the console filter template for the HP3000. About 120 pass messages to the Active Message Browser and the rest suppress messages.
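For those more at home with regular expressions, the NetIPC filter described earlier can be approximated in Python. This sketch is not the OpenView pattern matcher. It translates only the four wild cards listed in this talk into regular expression pieces, and the translation of each wild card is my own approximation. The pattern below includes the colons that appear in the sample message, so that it matches the message exactly as printed.

```python
import re

# Approximate regular expression equivalents of the wild cards (assumptions).
WILDCARDS = {
    "*": r".*?",   # any part of a message
    "#": r"\d+",   # a number or string of numbers
    "@": r"\S+",   # a character string without separators
    "S": r"\s+",   # separators such as spaces or tabs
}
TOKEN = re.compile(r"<([*#@S])(?:\.(\w+))?>")


def compile_filter(pattern):
    """Turn a pattern such as 'PIN <#.pin>' into a compiled regular expression."""
    parts = []
    position = 0
    for token in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[position:token.start()]))  # constant text
        body = WILDCARDS[token.group(1)]
        name = token.group(2)
        parts.append(f"(?P<{name}>{body})" if name else f"(?:{body})")
        position = token.end()
    parts.append(re.escape(pattern[position:]))
    return re.compile("".join(parts))


def rewrite(template, variables):
    """Fill <name> placeholders in the rewritten message text."""
    return re.sub(r"<(\w+)>", lambda m: variables[m.group(1)], template)


PATTERN = "NetIPC ERROR in VT; Job: #<*.sess>; PIN <#.pin>; Info: <#.info>"
TEMPLATE = ("A NetIPC error has occurred in a VT session for Job #<sess> "
            "and Pin number <pin>. The info number is <info>")


def process(message):
    """Return the rewritten text, or None if the filter does not match."""
    match = compile_filter(PATTERN).search(message)
    return rewrite(TEMPLATE, match.groupdict()) if match else None
```

Passing the sample message through `process` returns the friendlier text with the job number, PIN and info number filled in, and any other console line returns `None`.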
From our experience I would like to offer the following tips and warnings concerning message filters.
QA in OpenView should not be overlooked. If no messages appeared on an HP3000 console for even five minutes, this would appear strange to most Operators. This is especially true on busy systems. But what if no messages for a particular system are seen in OpenView? Since we only allow important messages to be displayed, no messages might mean everything is fine and nothing needs attention. How can you be sure this is the case?
I would recommend sending test messages to the OpenView agent on a regular basis. Depending on your operation, you might want to send a series of test messages daily. Unexpected messages are the most important messages to test. Send a TELLOP message that contains a string that is not matched by any message filter. Is it displayed? Hopefully it is. You don’t want to miss any important but unexpected messages.
To test the message filters, run a job that sends messages to the console that contain critical phrases. You can also place test entries in a logfile and test your logfile readers. It is also important to test your notification systems like computer generated beeper messages, automatic helpdesk tickets or E-mail messages. You should do this to have more confidence that your system works when you need it.
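One way to act on the results of regular test messages is to check which systems have not been heard from lately. The Python sketch below assumes you record, by whatever means, the time the last test message from each node was seen. The function name, the node names and the 24 hour default are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def silent_nodes(expected_nodes, last_test_seen, now,
                 max_age=timedelta(hours=24)):
    """Return nodes whose last test message is missing or too old."""
    silent = []
    for node in expected_nodes:
        seen = last_test_seen.get(node)
        if seen is None or now - seen > max_age:
            silent.append(node)
    return silent
```

A node that appears in this list is one where silence may not mean that all is well, so its agent and filters deserve a closer look.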