Title, Management of Exchange based Messaging Architectures

Presentation #377

Author

Ian BROMEHEAD

Company

Hewlett Packard France

Address,

Avenue Steve Biko, 38090 Villefontaine, France

Telephone Number, +33.4.74.99.31.61

Fax Number +33.4.74.99.30.05

E-mail Address ian-martyn_bromehead@hp.com

Author

Norman FOLLETT

Company

Hewlett Packard

Address,

Telephone Number,

Fax Number

E-mail Address norman_follett@hp.com

Abstract

A White Paper to illustrate management principles for Microsoft Exchange based upon HP OpenView products and HP Consulting designed best practices.

As Exchange becomes the choice for enterprise messaging backbone design and a pervasive technology for the development of knowledge management, unified messaging, workflow and collaboration, it challenges the capacity of IT organisations to monitor and maintain the key components of such backbones.

This white paper provides insight into the solutions and services designed by Hewlett Packard, in collaboration with the Microsoft product team and Microsoft Consulting Services, to provide Exchange management solutions to clients.

Abstract

Introduction

Enabling Exchange Management

Keeping Exchange Alive - Operations Management

Principles

Requirement Scenarios

Monitoring Exchange

Central monitoring of complete Exchange environment

Multiple Exchange systems

Server health

Service Start Order

Service Dependencies

Performance of MS Exchange against pre-defined thresholds.

Operations Management Tools

ManageX

IT/Operations

Microsoft Exchange Standard Management tools

Implementations - SNMP

Administration

Disaster Recovery

Resource and Performance Management

Increasing the availability of Exchange services

Hardware Management

Network Management

IT Service Management

Appendix A

HP Exchange Management Products and Services

Introduction

All predictions indicate that e-mail based requirements in organisations will continue to grow in importance.

Analysis of several major implementations indicates that the e-mail backbone, together with Internet and WWW infrastructures, is becoming increasingly important, as shown at least by the influence these have on business productivity.

Clearly, the messaging backbone is being integrated ever more closely into business processes, teamwork, collaboration and tasks, either by users or through application integration.

This poses new challenges in specifying and designing management solutions that ensure all key components are monitored and kept in an operational state, and that allow service level management services to be specified for mission critical environments.

Managing Microsoft Exchange Server and Windows NT environments is thus an important responsibility for IT organizations. Ensuring the maximum uptime and performance of Exchange Servers is vital to a company's communications infrastructure.

Whether Microsoft Exchange is being deployed within an environment of 50 users or in a global roll out of 30,000 users, best practice is to have an integrated Exchange and NT Management solution to maximize user productivity and return on investment. The ideal management solution will proactively notify operators before critical problems occur, enabling troubleshooting of the system and application, and execution of automated corrective actions without manual intervention of a local administrator.

The following are common problems in managing Exchange environments:

· Ensuring that Exchange Server Site Connectors are up and running

· Monitoring MTA Queue Lengths

· Monitoring Log File sizes of Information Stores

· Consolidation of Exchange and NT event log messages

· Ensuring successful Directory Replications

· Maintaining availability of Internet Mail Connectors

· Centrally managing performance of Exchange and NT counters

· Ensuring services are running

· Providing consistent global guidelines, thresholds, and management policies

The solutions and services HP has developed are based upon the requirements our clients have expressed and the benefits they expect from them.

Requirement	Benefit
Proactive Monitoring	Detect & act before user call
Trend Analysis and alarming	Automated capacity planning
Distributed task delegation	Optimised management data flow
Automatic Actions	Optimal reaction time on failure
Local/remote control	Freedom to administrate
Baseline Value Analysis	Master architecture characteristics
Service Level Agreement surveillance	Optimise SLA, reduce downtime, coordinate key admin players
Disaster Recovery procedures	Reduce downtime
High Availability	Improve SLA, improve system usage predictability
Load Balancing of web user connections	Optimal user connectivity, cater for high connection rates transparently

Hewlett Packard’s solutions allow clients to obtain a standard or fully tailored solution for Exchange Management.

The following are typical service tasks, all or part of which are performed during such projects. They may involve HP's engineers alone, but more typically involve teams made up of the client's resources, partners and HP personnel.

Ø Requirements Analysis & benefits assessment

Ø Solutions design and implementation

Ø Proof-of-Concept and Integration Services

Ø Large Scale, International Deployment

Ø ITM & ITSM Integrated solutions & services

Ø Turnkey solutions and outsourcing

Ø Training & Education services

Enabling Exchange Management

Our approach to solution architecture design, deployment and management, applied to Exchange, covers many typical management requirements:

¨ Operations Management

¨ Administration

¨ Disaster Recovery

¨ Resource & Performance Management

¨ Reporting

¨ Hardware & Cluster Management

¨ Network Traffic Management

The following sections of this white paper provide details on each of these areas and close with an overview of the IT Service Management that can be proposed, building on the measurement and management information generated by the management solutions.

Exchange management solutions and services begin with the basic monitoring and running operations that must be designed to provide Operations and Disaster Recovery management. The latter is a very sensitive subject, not only for ensuring that an Exchange architecture functions, but also with respect to the architecture design, as we'll see later.

Resource and performance management and reporting provide solutions targeting both short-term performance analysis and long-term capacity planning, to ensure that users do not suffer unplanned service degradation and that IT staff are informed of trends that may affect the service levels provided through production facilities.

Finally, Network Management will provide insight into architectural issues which otherwise remain largely undetected and unmeasured, translating into high network operating costs.

Together, the above management areas comprise the essential measurement components required to begin addressing ITSM for Exchange. The value of each of these components for designing ITSM solutions is conditioned by the type of ITSM SLOs to be addressed in the contracts.

Keeping Exchange Alive - Operations Management

Principles

Let us first look at operations management basics, with example implementations using some of HP's products and a look at other solutions.

Operations scenarios that form part of HP's Exchange Management best practices include monitoring of MTA queue length, of Exchange Server resource dependencies such as CPU load and memory consumption, and of message delivery. Carefully specified and designed monitoring solutions will assist operators in maintaining service and, in case of outages, in repairing them before users are seriously affected.

Operations management centres must be adapted to the number of operators, their geographical location and their procedures. Our experience shows that intelligent agent architectures, which allow such monitoring to be configured from one or more consoles and monitoring tasks to be delegated for remote execution, provide powerful, scalable solutions.

This methodology is dependent on proprietary solutions since only these provide powerful task delegation to the agent sitting on the Exchange server.

We'll see later on how industry standard protocols such as SNMP can be used to provide more basic management, and we'll discuss the relative merits of both.

Furthermore, and before we discuss our advice to you concerning what should be managed in Exchange architectures, it is important to understand that whilst instrumentation for measurement purposes is relatively rich in Exchange, experience shows that architectural aspects need to be designed too.

These are not modelled in standard generic solutions available on the market.

Requirement Scenarios

One example scenario would be monitoring of the number of Exchange transaction logfiles and the space they occupy. If these increase significantly, then it's likely that the backup jobs are not functioning correctly, since these logfiles are typically removed when an online backup is performed successfully.

The ultimate result would be the Exchange IS services stalling and no e-mail services to the user.
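The logfile scenario can be sketched as a small check that an agent might run periodically. This is a minimal illustration in Python, not part of any HP product; the directory passed in, the `.log` extension filter and the threshold values are all assumptions made for the example.

```python
import os
import tempfile


def summarize_logs(log_dir, extension=".log"):
    """Return (file_count, total_bytes) for log files directly inside log_dir."""
    count = 0
    total = 0
    for entry in os.scandir(log_dir):
        if entry.is_file() and entry.name.lower().endswith(extension):
            count += 1
            total += entry.stat().st_size
    return count, total


def check_logs(log_dir, max_files, max_bytes):
    """Return a list of alert strings; an empty list means nothing to report."""
    count, total = summarize_logs(log_dir)
    alerts = []
    if count > max_files:
        alerts.append(f"log file count {count} exceeds {max_files}")
    if total > max_bytes:
        alerts.append(f"log volume {total} bytes exceeds {max_bytes}")
    return alerts


if __name__ == "__main__":
    # Demonstration on a throwaway directory with hypothetical file names.
    with tempfile.TemporaryDirectory() as demo_dir:
        for i in range(4):
            with open(os.path.join(demo_dir, f"sample{i}.log"), "wb") as fh:
                fh.write(b"\0" * 1024)
        print(check_logs(demo_dir, max_files=3, max_bytes=10_000))
```

An empty result means the agent stays silent; a growing count between two successful backup windows is the condition described above.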

Consider another scenario whereby users connect to the Exchange infrastructure via Internet or intranet servers.

In this case, the initial design of the Web services needs to be qualified, for example through monitoring of the number of simultaneous connections from the web. If these connections exceed the initial or desired design criteria, the net result will be poor user response times.

In this case the measurement may just be informational, or could be used to condition load balancing software such that secondary web servers are activated to share the load.

In fact these and many other operations management scenarios are encountered.

Monitoring Exchange

Central monitoring of complete Exchange environment

Central monitoring implies that some form of agent will perform, at the least, monitoring of each Exchange server in the architecture. When monitoring the whole architecture, the operators' tasks are greatly facilitated by some form of graphical representation of that architecture. Simple iconic symbols are frequently sufficient and are to be preferred to continuous visual analysis of the event log by the operator: the same events and their severity are used to control the state and colour of the icons.

Multiple Exchange systems

It may seem obvious, but it's worth underlining that the management solution has to span all of the Exchange servers, including bridgeheads, connectors, and web servers providing access to Exchange user services via the web. If remote access, fax and portable telephone servers are used, then the management solution must be able to provide operators with management tasks related to these.

Server health

As we've already discussed, the use of an intelligent agent provides the power to delegate management tasks to the agent sitting on the Exchange server. The alternatives typically absorb network bandwidth through polling between the manager and the agents.

As mentioned earlier, Exchange is quite well instrumented in version 5.5 SP1. The perfmon counters it exposes assist in designing the management solution and we’ll go through some of those which are most useful.
But this approach is not sufficient to monitor the health of Exchange. We will show that intelligent monitoring of several types of management data, calculations based upon them, and then alarming on the results obtained, can lead an operator more quickly to ascertaining what the true state and health of Exchange services really are.

As above, the actual state of each service needs to be monitored so that any state changes are detected, operators are alerted and history data is generated.
Although at first sight it may seem logical and desirable to program the intelligent agent to automatically perform actions to restart these services, this is not always best practice.
Consider the case of the MSExchangeIS service stopping. This means that all messaging services are no longer available to the users: a major service outage. In this particular case, we may consider that what we need to do from an operational perspective is to restart the service so that users can work, but the Exchange services experts will more than likely want to analyse event information available in logfiles to diagnose the original failure. In some cases valuable diagnosis information is destroyed when a service restart is attempted.

Conversely, if a connector service on a bridgehead server stops, it may be acceptable to simply attempt to restart this service.

The management solutions must not only report on the state of Exchange services, but also provide the capacity to perform actions on them, both manually by an operator and through task delegation to the agent.

Service Start Order

The startup order for the Exchange services is:

Microsoft Exchange System Attendant

Microsoft Exchange Directory

Microsoft Exchange Information Store

Microsoft Exchange Message Transfer Agent

Microsoft Exchange Event Service

Microsoft Exchange Internet Mail Connector

Microsoft Exchange Key Manager

Service Dependencies

Since there are dependencies between services, starting a service that depends on a second service will start the needed service(s). For example, if the Information Store is requested to start before the System Attendant and the Directory, the NT service manager will start the System Attendant and the Directory.

The Exchange dependencies are:

The DS depends on the SA

The IS depends on the DS & SA

The ES depends on the DS & IS

The MTA depends on the DS & SA

The IMC depends on the SA, DS, IS & MTA

Given these rules concerning operational state changes to Exchange, it is clear that there is room to automate these sequences and to respect the dependencies even when changing state manually, by programming an intelligent agent that can faithfully reproduce the correct sequence for an operator, repeatedly and without error.
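The dependency rules above lend themselves to automation. The following Python sketch is only an illustration of the principle, not the logic of any actual agent: it encodes the listed dependencies in a table and derives the order in which services would need to be started or stopped. The function names are hypothetical, and the Key Manager is left out because no dependency rule is listed for it.

```python
# Dependencies exactly as listed above, using the paper's abbreviations.
DEPENDS_ON = {
    "SA": [],
    "DS": ["SA"],
    "IS": ["DS", "SA"],
    "MTA": ["DS", "SA"],
    "ES": ["DS", "IS"],
    "IMC": ["SA", "DS", "IS", "MTA"],
}


def start_sequence(service, depends_on=DEPENDS_ON):
    """Return the services to start, in order, so that `service` can run."""
    ordered = []

    def visit(name):
        if name in ordered:
            return
        for dependency in depends_on[name]:
            visit(dependency)  # prerequisites come first
        ordered.append(name)

    visit(service)
    return ordered


def full_start_order(depends_on=DEPENDS_ON):
    """Return one valid start order covering every known service."""
    ordered = []
    for name in depends_on:
        for step in start_sequence(name, depends_on):
            if step not in ordered:
                ordered.append(step)
    return ordered


def stop_sequence(service, depends_on=DEPENDS_ON):
    """Return the services to stop, dependents first, ending with `service`."""
    affected = [
        name
        for name in full_start_order(depends_on)
        if service in start_sequence(name, depends_on)
    ]
    return list(reversed(affected))


if __name__ == "__main__":
    print("start IMC:", start_sequence("IMC"))
    print("stop DS:  ", stop_sequence("DS"))
```

Keeping the rules in a single table means that an operator action on any one service can be expanded into the full sequence without the operator having to remember it.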

If it is required to stop or start several servers simultaneously, in a particular sequence, or automatically, then use of an intelligent agent and management console should be preferred.
Note that the native tools supplied with Exchange provide a means to interact with only one server at a time, since they do not supply an agent working with the management console. Operations on multiple servers become tedious and prone to human error.

Performance of MS Exchange against pre-defined thresholds.

The counters exposed through perfmon provide valuable information on the state of Exchange. Whilst these can be exposed to an SNMP agent through MIB extensions, it is desirable to delegate analysis of these counters to an intelligent agent. The basis of the management solution is then the agent locally analysing the values of key counters and comparing them to threshold values. This has the added advantage of generating messages for the operator's attention only when exceptions occur, and it enables solutions in which designers develop specific actions that the agent executes when these exceptions occur.
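The principle of local threshold analysis can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the implementation of any HP agent: the `Threshold` type, the way samples are passed in as a dictionary and the limits used in the demonstration are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    counter: str              # counter name as the agent sees it
    limit: float              # threshold value
    alert_above: bool = True  # False: alert when the value falls below limit


def exceptions(samples, thresholds):
    """Return messages only for counters that breach their threshold."""
    messages = []
    for t in thresholds:
        value = samples.get(t.counter)
        if value is None:
            messages.append(f"{t.counter}: no sample available")
            continue
        breached = value > t.limit if t.alert_above else value < t.limit
        if breached:
            relation = "above" if t.alert_above else "below"
            messages.append(
                f"{t.counter}: {value} is {relation} threshold {t.limit}"
            )
    return messages


if __name__ == "__main__":
    # Illustrative limits only; real values belong to the site's own policy.
    policy = [
        Threshold("MSExchangeMTA Work Queue Length", 50),
        Threshold("LogicalDisk % Free Space", 10, alert_above=False),
    ]
    sample = {
        "MSExchangeMTA Work Queue Length": 75,
        "LogicalDisk % Free Space": 35,
    }
    for message in exceptions(sample, policy):
        print(message)
```

Because the comparison runs on the managed server, nothing crosses the network while all counters stay within their limits.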

An important part of ensuring that IT Management teams are equipped to deal with exceptional circumstances lies in the experience and judgement capacity of the staff. In our experience solutions that provide customisation capacity to develop features to facilitate the staff’s decisions through intelligent calculations, analyses and correlation should be sought.

An example would be to design a solution which checks IS datastore compression times over, say, a weekly period (an event is generated by Exchange version 5.5 SP1). If this time increases significantly, it might be due to the server being loaded on that day, or it might indicate degradation of the datastore in the IS. Whilst this degradation, if confirmed, is quite difficult to resolve, our advice in this case would be to work on establishing a secondary server from backup tapes or replication, ready to take over should the primary actually fail. Prevention being better than cure is a good management criterion.
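A calculation of this kind could look like the sketch below, which compares the most recent compression time with the average of the preceding ones. It is a hedged illustration only: how the durations are extracted from the Exchange events is not shown, and the function name and the 1.5 alert factor are assumptions chosen for the example.

```python
from statistics import mean


def compression_trend(durations_min, factor=1.5):
    """Compare the latest duration with the mean of the earlier ones.

    durations_min: durations in minutes, oldest first.
    Returns (ratio, flagged), where flagged is True when the latest run
    took more than `factor` times the earlier average.
    """
    if len(durations_min) < 2:
        raise ValueError("need at least two samples")
    *history, latest = durations_min
    baseline = mean(history)
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = latest / baseline
    return ratio, ratio > factor


if __name__ == "__main__":
    week = [30, 32, 31, 29, 30, 31, 60]  # hypothetical nightly durations
    ratio, flagged = compression_trend(week)
    print(f"latest run took {ratio:.2f}x the earlier average; flagged={flagged}")
```

A single flagged night may only reflect server load; a ratio that stays high over several days is the pattern that would justify preparing the secondary server.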

ANY changes to Exchange architectures and components, including management, should be subject to rigorous "Change Management" procedures, so that in critical situations all the changes can be understood and, if necessary, played back.

Operations Management Tools

ManageX

The product respects key Microsoft technologies:

- MMC, WBEM (V4), CIM, COM/DCOM (Manager<>Agent, & agent tasks)

MMC is 1 to 1; ManageX MMC extends this to 1 to many. This means that one action can be applied to multiple servers in one operation.

Also 3rd party MMC plug-ins are fully supported allowing true management integration.

ManageX also supports SNMP and is integrated with IT/Operations and TopTools out-of-the-box.

Users may gain access to management functions from any ManageX console or through Web interfaces

Each ManageX console may be customised according to operator's functions in the organisation

ManageX automatically deploys all components to the target machine; its intelligent agent, the Smart Broker, takes care of this.

The features are obtained through tasks developed and included in intelligence policies and functionality modules. The Smart Broker supports all Microsoft supported scripting and development languages, and others: WSH, VB, VBS, C++, Perl, REXX, JavaScript.

Snap-Ins provide basic management features for these products, and optional Smart-PlugIns add even further management capacity

Although Microsoft Windows-NT’s perfmon is usually well known to most operators, getting directly to it with pre-configured graphs from the operator desktop and ManageX is a useful feature for diagnosis and trouble shooting

Furthermore, version 4 of ManageX increases its capacity to provide data collection.

Its purpose is to collect data daily on the target nodes (the data being perfmon counters exposed by NT and Exchange); the console then transfers this data from each agent. All information can then be stored in an ODBC compatible database for later use.

One such usage is via the OpenView Service Reporter, but more about this later.

IT/Operations

IT/Operations is a Unix based enterprise management server, comprising intelligent agents for heterogeneous operating systems including Windows-NT, and the capacity to expose management information and delegate tasks to operators equipped with Unix and Windows-NT graphical workstations.

The server uses an Oracle database to store and manage management data.

The server builds upon HP OpenView Network Node Manager (both Unix and Windows-NT based versions), to discover nodes and enrich its database.

The server supports many legacy applications for help desk and trouble ticket management.

The agent provides many similar features to the ManageX Smart Broker, including event interception, filtering and forwarding, services monitoring and autorestart, and perfmon counter integration for threshold based exception handling.

IT/Operations architectures are highly flexible, numerous references exist to illustrate this product's scalability, and the application is popularly judged as best of its breed for management solutions by independent analysts.

The Exchange SPI is available for both management platforms.

Microsoft Exchange Standard Management tools

Microsoft's own solutions, though useful tools, are not typically positioned as major solutions for enterprise management consoles. They nevertheless provide the means to perform efficient operations management, typically on a 1 to 1 basis and without any intelligent task delegation.

Such applications are:

· MTA Queue Monitor : providing visual analysis of outstanding work

· Link Monitor : providing visual, instantaneous analysis of inter server link status

· Server Monitor : providing basic server health information

· Gateway Monitor : provides monitoring of Exchange gateway servers

· Information Store Monitor : provides status information on the IS service

· Directory Service Monitor : provides status information on the DS service

· System Attendant Monitor : indicates whether this global service is functioning

Most of these tools provide straightforward 1 to 1 server monitoring, which is particularly tedious for multiple servers in several sites, typical of enterprise messaging backbones.

Implementations - SNMP

At the beginning of this section we indicated that efficient solutions are best designed using intelligent agents, typically programmable and customisable, but these rely on proprietary solutions/protocols.

SNMP is a very successful and popular protocol, and it is implemented by Microsoft to “manage” Exchange.

Operations management is possible through the use of the SNMP agents supplied with Windows-NT, which can be configured to take into account the industry standard “MAD” MIB which models messaging data and objects.
This modelling is generic, and basing solutions on this is not optimal for Exchange architectures.

It has the advantage of being a very economic solution requiring minimal skills to activate the management services.

Ø Only basic Exchange management tasks & features

Ø requires considerable MIB expertise

Ø Standard mgmt with MAD MIB

Ø personalisation through MIB add-ons using perf2mib tools

Ø little or no exception handling

Ø very limited possibility for monitoring of Exchange NT services

The basic features of the MAD MIB can be extended through tools provided in the Microsoft Exchange resource kit. "Perf2mib" and "mibcc" are tools which, when used correctly, allow the addition of targeted performance counters to the MIB.

Perf2mib.exe is used to compile Performance Monitor counters into a new MIB for Microsoft Exchange Server.

Mibcc.exe recompiles the Perfmib.mib file created by Perf2mib.exe and creates a new Mib.bin file.

Unfortunately there is no easy method to add monitoring services, and the MAD mib does not include any definitions of SNMP traps. SMS can be used to convert Windows-NT events into SNMP.

As described in previous sections we recommend use of the counters:

§ MSExchangeMTA

§ MSExchangeMTA Connections

§ MSExchangeIMC

§ MSExchangeIS

§ MSExchangeIS Private

§ MSExchangeDS

§ Processor

§ Process

§ Logical Disk

§ Paging File

Place the perf2mib command in a batch file, for example:

Perf2mib Perfmib.mib Perfmib.ini MSExchangeMTA 1 MTA "MSExchangeMTA Connections" 2 MTAconnection MSExchangeIMC 3 IMC MSExchangeIS 4 IS "MSExchangeIS Private" 5 ISprivate "MSExchangeIS Public" 6 ISPublic MSExchangeDS 7 MSExchangeDS Processor 8 CPU Process 9 Process LogicalDisk 10 Disk "Paging File" 11 PagingFile

If you do not use a batch file, take great care when typing the command.
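One way to reduce typing errors is to generate the command line from a table. The Python sketch below is only an illustration: it reproduces the example command shown above, and the reading of each argument group as (performance object, index, short name) is inferred from that example rather than taken from the tool's documentation. The function names are hypothetical.

```python
# (performance object, short name) pairs taken from the example command above.
OBJECTS = [
    ("MSExchangeMTA", "MTA"),
    ("MSExchangeMTA Connections", "MTAconnection"),
    ("MSExchangeIMC", "IMC"),
    ("MSExchangeIS", "IS"),
    ("MSExchangeIS Private", "ISprivate"),
    ("MSExchangeIS Public", "ISPublic"),
    ("MSExchangeDS", "MSExchangeDS"),
    ("Processor", "CPU"),
    ("Process", "Process"),
    ("LogicalDisk", "Disk"),
    ("Paging File", "PagingFile"),
]


def quote(argument):
    """Wrap an argument in double quotes when it contains a space."""
    return f'"{argument}"' if " " in argument else argument


def build_command(mib_file, ini_file, objects):
    """Assemble the command line, numbering the objects from 1."""
    parts = ["Perf2mib", mib_file, ini_file]
    for index, (perf_object, short_name) in enumerate(objects, start=1):
        parts += [quote(perf_object), str(index), short_name]
    return " ".join(parts)


if __name__ == "__main__":
    # The printed line can be saved as the single line of a batch file.
    print(build_command("Perfmib.mib", "Perfmib.ini", OBJECTS))
```

Adding or removing a counter object then becomes a one-line change to the table, and the numbering stays consistent.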

When the above example command is used, the following MIB table is generated and is then supported by the SNMP agent in the server. The MIB must be loaded into the management server so that it can generate requests and transmit them to each SNMP agent.

This defines the only method for operations management using SNMP.
Varying degrees of network bandwidth will be absorbed through polling requests between the management server and the managed servers.

Whilst this may be performed by intermediate mid-level managers, the solution is bound by the management designed into the MAD MIB.

Advantages

ü Simple and economic

ü all Microsoft Exchange and any 3rd party perfmon counters > MIB

ü any SNMP based manager

ü can be used locally in more complex MOM type configurations

Disadvantages

ü No MSX server bound exception handling

ü Disappearing MSX services instances

ü Requires engineering for large scale installations

ü limited to perfmon objects and proprietary MIB definitions

ü requires a minimum of SNMP expertise

ü SNMP stack and sub-agent dependencies

ü unreliable over noisy nets (UDP)

Administration

Administration of Exchange in most cases and for most tasks is ONLY possible through the native administration tools.

This currently means through the "Exchange Administrator tool" supplied with the product, or through development of specific tools using the administration API.

Configuration changes are thus largely manageable only through tedious, manual procedures, which require careful execution if human error through repetition is to be avoided.

This is an area where the Configuration and Change Management procedures in ITSM offers become highly desirable.

HP is currently working with Microsoft on the development of Exchange administration through the efficient use of the Admin API to Exchange, and WBEM/CIM technologies.

Here are some of the administration features which are possible:

ü Public Folder Setup

· World Wide Public Folder

· Country Specific Public Folder

· Owner

· Replication

ü Mailbox Actions

· Bulk Mail Creation Program

· Modifications via Forms

· Delete -> Hide

· Distribution lists

· Conference Room Setup

Disaster Recovery

Disaster recovery begins with basic backup solutions to provide the means to repair a damaged Exchange datastore.
In the simplest cases, it may be possible to perform backup jobs offline where the information store is stopped, but it should be noted that in this case the transaction logfiles are never removed and there are risks involved in stopping the IS service.

It is best practice to perform online backup using backup products which use the Exchange service through the backup API. In this case, a maximum of datastore integrity is maintained.

Similarly, in the event of problems with the service and restore of data, this is best performed online once again through the backup API. In this case, the data is restored and transaction logs replayed to provide maximum integrity and context restore.

Exchange restore operations are disk I/O intensive and much slower than backups: expect a maximum of 40 to 50 GB/h for backup and less than 15 GB/h for restore in the best cases.

The following tables expose some of our results of backup/restore rates using different popular products.

All of these tests were performed on various Intel Quad and Dual Xeon 400 MHz processor systems, equipped with 512 MB of RAM or more, and with different disk subsystem types and numbers of peripheral tape drives.

Software	Task	Database size (GB)	# DLTs	Duration (h:mm:ss)	Throughput (GB/h)
Arcserve	Backup	9.56	1	0:21:43	25.80
Arcserve	Backup	9.56	2	0:21:43	25.80
Seagate	Backup	9.56	1	0:21:41	26.47
Arcserve	Restore	9.56	2	1:40:19	5.59
Omniback II V3	Backup	11.28	1	0:20:00	33.83
Omniback II V3	Backup	11.28	2	0:16:57	39.91
Omniback II V3	Restore	11.28	1	0:52:15	12.95
Omniback II V3	Restore	11.28	2	1:01:07	11.07
Omniback II V3	Restore	11.28	1	0:43:40	15

We are currently performing further tests with ArcServe and BackupExec, and gathering data from our clients using these products.
Based upon our tests, we expect ArcServe to provide at least similar or slightly better results to those we have measured using HP Omniback II.

Use of dedicated SCSI controllers, and even RAID controllers with multiple tape drives, should be considered to maximise tape throughput and eliminate single points of failure at the media level.

The table shows that little extra performance is obtained when using multiple tape drives especially during restore operations. Clearly restore operations, in all cases, are slowest and thus most critical in the disaster recovery situation.

These rates will vary greatly according to the hardware configuration of the backup, disks types and disk I/O systems. In fact the restore time is a deciding factor during the architecture design of the Exchange servers to the extent that in most cases the number of users per server has to be severely restricted to ensure a reasonable restore time in the event of a crash.
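The sizing reasoning can be made concrete with simple arithmetic: restore time is datastore size divided by restore rate, and the largest acceptable datastore is the restore window multiplied by that rate. The Python sketch below is an illustration only; it uses the best-case restore rate of about 15 GB/h quoted earlier in this section, and the four-hour restore window is a hypothetical service target.

```python
def restore_hours(datastore_gb, rate_gb_per_hour):
    """Estimated restore duration in hours for a datastore of the given size."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return datastore_gb / rate_gb_per_hour


def max_datastore_gb(window_hours, rate_gb_per_hour):
    """Largest datastore that can be restored within the allowed window."""
    return window_hours * rate_gb_per_hour


def format_duration(hours):
    """Render a duration in hours as, for example, '1h20'."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60}h{total_minutes % 60:02d}"


if __name__ == "__main__":
    rate = 15  # GB/h, best-case restore figure quoted in the text
    for size_gb in (20, 50, 75, 100):
        print(f"{size_gb} GB -> {format_duration(restore_hours(size_gb, rate))}")
    # Hypothetical target: service must be restored within 4 hours.
    print("max datastore for a 4 h window:", max_datastore_gb(4, rate), "GB")
```

Working backwards from the restore window in this way gives the datastore ceiling per server, and hence the number of users each server can reasonably host.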

Disk system	Task	20 GB	50 GB	75 GB	100 GB
Disk System Type 1	Backup	0h30	1h17	1h55	2h34
Disk System Type 1	Restore	1h20	3h20	5h00	6h40
Disk System Type 2	Backup	0h42	1h46	2h39	3h32
Disk System Type 2	Restore	1h22	3h25	5h08	6h51
Disk System Type 3	Backup	0h57	2h22	3h33	4h44
	Restore	1h08	2h50	4h16	5h41
	Restore	4h37	11h33	17h19	23h06

In sharing this information, it is our intention to alert those responsible for designing and operating disaster recovery solutions and procedures to two essential aspects of Exchange:

a. The size of the datastore should not be allowed to grow to a capacity that implies a long restore time after a disaster or major overhaul.
Of course the impact of this on IT staff will depend on individual circumstances, but given the increasing importance attached to messaging systems, clearly downtime must be minimised.

b. Disk system types have the most impact on backup and restore rates, much more than the number of DLT drives used to store and recover data.

In the above chart, disk system 3 may seem completely unacceptable. From a performance perspective this may be true, but the trade-off in this particular case is that this disk system has multiple power supplies and controllers, thus eliminating many if not all single points of failure.
We are thus faced with a set of compromises, and decisions to be made between performance and availability considerations at least (notwithstanding economics too).

Clearly the management system must monitor backup services to ensure that the backups are performing correctly. As for any other NT based service, this means monitoring the backup services themselves, trapping and filtering events generated by the backup system, and monitoring resource consumption during execution of backup services.

This needs to be taken into account to advise the client on the hardware setup for backup, and the most appropriate products and their implementation characteristics.

These best practices treat the issues concerning the design of backup/restore solutions, but are incomplete since it is also essential to ensure that the environment on which the backup system relies is also in the correct state.

Monitoring of transaction logfiles (they should be regularly deleted), of free space on the volume and drive on which these are located (the IS service will cease to operate if it cannot write these logfiles), and of the number of logs over time is highly recommended.

It is surprising the number of times we encounter perfect backup systems and procedures, which do not include any media verification.
Clearly if the disaster recovery is a simple procedure based upon restore of the last backup, it is vital that some form of media testing is regularly performed.

Best practice in this sense is to regularly schedule restore of the data onto a secondary server, with automated test schedule to verify that the backup has worked correctly.

In some cases it is desirable to manage warm backup solutions. In this case, a secondary server is maintained as up to date as possible, either based on replication of the primary Exchange server datastores, or via restore of the last backup (which has the advantage of testing the media as well).

Perfecting a methodology to do this requires a great deal of experience and well designed procedures, especially when the Exchange service to the user must be disturbed as little as possible or not at all, but it is possible to design such setups.

A cluster of Exchange servers can provide a solution for disaster recovery since the data store is shared between two servers with passive switchover. Microsoft Cluster Server provides this, but this cluster solution is considered difficult to maintain in a stable state. In the case of an MSCS based Exchange configuration, the backup/restore must also be carefully designed so that all the configuration data of the cluster groups is faithfully restorable.

Given that in this type of configuration, one server is executing Exchange whilst the other is not, standard manager/agent configurations used on a normal Exchange server, cannot be used as is. They must be modified to cater for execution switchover on the active and the passive nodes, and only report on the active node.

Testing the media is vital. Offline restore test suites, potentially automated, are essential in any disaster recovery strategy.

Including third-party hardware subsystems, such as those from EMC², also places constraints on the management architecture. EMC² solutions can be managed from the enterprise console through the integration they now provide. This is obviously highly desirable, at least to supply the operator with state information concerning the backup of data, which is handled by a specially designed mechanism in the EMC² case.

Resource and Performance Management

We have already seen that operations management makes use of performance counters to provide valuable exception handling based on thresholds.

We have also seen that Exchange's consumption of server resources should be monitored, with alerts raised to expose exceptions.

Clearly, performance data can be used not only for day-to-day operational tasks, but also for trend analysis, architecture validation and evolution planning.

A great deal of this depends on obtaining the right measurements at the right frequency and then managing them in archives.

Best practice is to use the following counters:

- MSExchangeMTA, Messages Delivered per Minute
This counter measures the rate at which messages are delivered by the MTA to the IS. Normal load is 10 - 40 messages per minute. If this number is constantly under 5 per minute while there are pending items in the MTA queue, it is likely that the server is under severe load or that there is a problem with one of the processes. If this number is extraordinarily high (greater than 200 per minute) for an extended period, it is likely that there is a stuck message in the MTA queue.

- MSExchangeMTA, Work Queue Length
The level should increase and decrease. An acceptable range would be 0-50. When messages are stuck in the queue, the counter will remain level or only increase for extended periods of time. Watch for “artificial floors” on the MTA queue. Used here, “artificial floor” means the work queue length remains at or above a non-zero, positive integer. This can mean a number of things. It could mean that there are corrupt or stuck messages in the queues, or it could simply mean your queues house a number of messages that have been sent with the deferred delivery option in Exchange clients.

- Exchange service processes - Object: Process, Counter: % Processor Time, Instances: DSAMAIN, EMSMTA, MAD and STORE. No instance should be at 0% or at 100% all of the time. An instance always at 0% indicates a “dead” process. Check Service Control Manager to verify that the service is running. An instance always at 100% usually indicates that something is out of order - check other services and the Event Viewer to pinpoint the problem.

- Paging File, % Usage - Make certain that the usage is in a reasonable range, generally 15%-35%. When the level of usage exceeds 60% there is usually something wrong. If the usage constantly exceeds 90% then the situation needs to be treated as a problem: there is either a problem with one of the processes, the server needs a RAM upgrade, or the paging file was incorrectly allocated during setup.

- LogicalDisk, Free Megabytes, Instance: E: - This is the amount of free space on the transaction log drive. Monitor this object to ensure that the drive does not fill up with .LOG files. Normally, the .LOG files are removed whenever an on-line backup is performed. If the .LOG files are not being removed, verify that the backups are being done correctly and complete successfully (see Backup Procedures).

- MSExchangeIS, Active Connection Count - This measures the number of logons to the Store Service. This number should be greater than zero. If the server has active mailboxes and there are zero connections then a problem exists. Use a test account to see if there is a problem making a connection to the server.

- MSExchangeDS, Pending Replications - Shows the number of replication objects yet to be processed.

- MSExchangeDS, Remaining Replication Updates - Measures the number of objects being processed by the DS. This number usually starts at 100 and decreases to 0 within 1 - 3 minutes.
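The rules of thumb in the counter list above can be expressed as simple checks over a window of samples. The Python sketch below uses the numeric limits quoted in the list; the function names, the returned labels and the reading of "constantly" as "every sample in the window" are assumptions made for the example. How the samples are collected (for instance from Performance Monitor logs) is outside its scope.

```python
def _require_samples(samples):
    """Reject empty windows, for which 'constantly' has no meaning."""
    if not samples:
        raise ValueError("at least one sample is required")


def classify_mta_rate(rates_per_minute, queue_lengths):
    """Classify MSExchangeMTA 'Messages Delivered per Minute' over a window."""
    _require_samples(rates_per_minute)
    _require_samples(queue_lengths)
    # Constantly under 5 per minute while items are pending in the queue.
    if all(r < 5 for r in rates_per_minute) and all(q > 0 for q in queue_lengths):
        return "low"
    # Above 200 per minute for the whole window.
    if all(r > 200 for r in rates_per_minute):
        return "high"
    return "normal"


def artificial_floor(queue_lengths):
    """Return the lowest Work Queue Length seen; above zero means a floor."""
    _require_samples(queue_lengths)
    return min(queue_lengths)


def classify_paging_file(usage_percent):
    """Classify Paging File '% Usage' samples."""
    _require_samples(usage_percent)
    if all(u > 90 for u in usage_percent):
        return "problem"
    if max(usage_percent) > 60:
        return "warning"
    return "ok"


def classify_process_cpu(cpu_percent):
    """Classify '% Processor Time' samples for one Exchange process."""
    _require_samples(cpu_percent)
    if all(c <= 0 for c in cpu_percent):
        return "dead"
    if all(c >= 100 for c in cpu_percent):
        return "saturated"
    return "ok"
```

For example, artificial_floor([7, 9, 7, 12]) returns 7, meaning the queue never drained below seven messages during the window, which is the "artificial floor" pattern to investigate.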

We have already mentioned the pre-configured graphs, which we recommend setting up and using for diagnostics and to assist operators in performing the regular tasks that maintain service. These graphs are also exceedingly important for defining the "baseline" behaviour of Exchange components over time, and thereby for specifying improved threshold values tailored to the particular client environment.
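One simple way to turn a baseline into a tailored threshold is sketched below in Python. It sets the threshold at the baseline mean plus k standard deviations; this rule and the default k = 3 are assumptions chosen for illustration, not a method prescribed by this paper, and the function names are hypothetical.

```python
import statistics


def baseline_threshold(baseline_samples, k=3.0):
    """Derive an upper threshold as mean + k standard deviations.

    The mean-plus-k-sigma rule and k=3 are illustrative assumptions.
    """
    if len(baseline_samples) < 2:
        raise ValueError("at least two baseline samples are required")
    mean = statistics.fmean(baseline_samples)
    spread = statistics.pstdev(baseline_samples)
    return mean + k * spread


def exceptions(samples, threshold):
    """Return the samples that exceed the threshold."""
    return [s for s in samples if s > threshold]
```

The baseline should be collected over a period that is representative of normal load, and recomputed when the architecture or the user population changes.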

Automated scanning and storage of Exchange performance data allows operators and administrators (architects) to validate, later on, the choices made when the Exchange servers were designed and commissioned.

Extraction of key parameters, either for real-time graph generation or for automated report generation, facilitates analysis of trends and of server and user behaviour, and provides key evidence that helps IT staff plan ahead and define the evolution of the architecture.

These services thus ensure the efficient dissemination of the right management data empowering the decision makers at the right time.

Increasing the availability of Exchange services

When a cluster is used in an Exchange architecture, Exchange 5.* is only capable of using one server at a time, i.e. active/passive mode.

Given this, operations management packages must be modified so that management can be distributed to each node, but only the agents on the active node actually manage.

This is available with some products, so that Windows-NT services not running on the passive node of a Microsoft Cluster Server are not restarted erroneously, and any events generated on the passive node are not forwarded by default.
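The behaviour described above amounts to filtering agent actions by the node's cluster role. The Python sketch below illustrates the idea only; the Event type and both function names are hypothetical, and the name of the active node is passed in as a parameter because how it is discovered depends on the cluster and management product in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A management event raised by an agent (hypothetical structure)."""
    node: str
    source: str
    message: str


def forwardable_events(events, active_node):
    """Keep only events raised on the active node of the cluster."""
    return [e for e in events if e.node == active_node]


def services_to_restart(stopped_services, node, active_node):
    """Restart stopped Exchange services only on the active node.

    On the passive node the services are expected to be stopped,
    so nothing should be restarted there.
    """
    return list(stopped_services) if node == active_node else []
```

After a switchover, the same functions apply unchanged once the caller supplies the new active node name.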

Where users gain access to Exchange through web sites, measures to increase availability must also apply to the web services being provided, and the management solution must cover these. Solutions are available not only to manage the web sites themselves, but also to provide load balancing and server failure/recovery management.

HardWare Management

In our experience of deploying management solutions, we have seen many cases of clients forgetting, or choosing not, to include hardware monitoring.

The following example illustrates why this matters.

A RAID disk fails. The RAID configuration allows the file system services to be maintained transparently to the user. Typically the only subsystem capable of detecting this state is a hardware management package.

It is thus crucial that such hardware management packages, which are frequently proprietary, are included in the overall design.

Typically hardware management packages are 100% proprietary, but usually have some form of alerting and/or monitoring capability, either SNMP or web-based.

As we approach the deployment of Windows 2000 and Platinum, it is clear that generic hardware management will become more widely implementable through the WBEM and WMI interfaces and definitions.

TopTools version 4.1 is capable of supplying management information (based upon DMI support) for certain servers from HP, Compaq, and IBM.

Our recommendations to clients choosing a hardware supplier are frequently based largely on the following management capabilities:

- Easy Installation

- Manage Servers, Desktops & Mobiles from One Application

- Inventory upload

- Built-in Discovery and representation through SNMP and WMI

- Built-in Alerting (at least SNMP)

- Database Integration hooks
These products frequently manage their own database. The hooks allow management solutions to search and extract vital information to facilitate management tasks.

- Web access to hardware management

- Access by LAN and Modem

- Secured access (e.g. encrypted password)

- Role-based Access for flexible operator management

- Modem Dialback/LAN Security

- Unified IPMI Event Log Access

- On-line Memory Diagnostics

- Notification mechanisms beyond just alerts

- Remote Control

- Full Power Cycle

- Graphical Console Redirection

- Remote BIOS/Firmware Updates

NetWork Management

In the context of an Exchange network, the most crucial network management features to implement concern status checks of the links between each Exchange server in the architecture.

We recommend monitoring Exchange components to identify problems such as link status, breaks in bridgehead and Exchange site connectivity, and excessive bandwidth consumption. This helps the help desk and operators determine whether a user’s problems are due to his own usage, and provides extra information for diagnosing the true problem. Here it is particularly worthwhile to use tools with a graphical representation of the objects in question: the context is more easily determined using graphic elements and colour than by wading through a list of events.

In the case of single Exchange site configurations, there are no administration functions which allow us to modify the behavior of information flow and synchronization between Exchange servers, or how the network bandwidth between them is used.

Given this, it is in our experience useful to exploit network analysis to characterize the network usage on the basis of network connections that are expected and confirmed, and through typical bandwidth consumption measured between servers over a representative period.

The information generated by this analysis serves as a baseline for defining exception thresholds. Exceptions can be analysed and detected online if the management solution allows continuous analysis (for example through permanently installed probes and the instrumentation increasingly deployed in network elements), or through reporting tools and methods.
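As a sketch of how such a baseline might be applied, the Python example below flags server-to-server connections that were not expected and bandwidth samples above a percentile of the baseline period. The 95th percentile, the treatment of connections as undirected pairs and all names are assumptions made for the example; collecting the measurements from probes is not shown.

```python
def _normalise(pair):
    """Treat a connection as undirected: (A, B) equals (B, A)."""
    return tuple(sorted(pair))


def unexpected_connections(observed, expected):
    """Return observed server pairs that are not in the expected set."""
    expected_set = {_normalise(p) for p in expected}
    return sorted({_normalise(p) for p in observed} - expected_set)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, with a minimum rank of 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def bandwidth_exceptions(baseline_kbps, current_kbps, pct=95):
    """Return the threshold and the current samples that exceed it."""
    threshold = percentile(baseline_kbps, pct)
    return threshold, [v for v in current_kbps if v > threshold]
```

A connection reported by unexpected_connections is not necessarily a fault; it may simply reveal a connector or replication path that was missing from the documented design.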

In the case of multiple Exchange sites, the administration staff can determine which traffic flows between which servers and when. In this case, the above measurement should be used to validate correct configuration of Exchange components, and correct usage of the network architecture (routing essentially) and its bandwidth.

Network analysis tools such as these probes can frequently be configured to capture and decode packets automatically, triggered by thresholds or by a specific user address, which helps isolate intermittent problems. This requires expert understanding of the protocols and their usage, but in many cases leverages the investment already made in the same network traffic analysis tools.

IT Service Management

The design of IT Service Management solutions is a major part of our consulting services work once the management instrumentation, tools and hooks are implemented. These provide the management data that allows service level agreements to be negotiated and monitored constantly, so that the parties responsible for the SLA are alerted when trends indicate service degradation and an SLA breach in the IT architecture.

The SLOs being monitored here rely on the measurement data and management functions that I've discussed so far. Without those "hooks" and measurements, no SLO monitoring would be possible.

Basically, you cannot manage what you cannot measure.
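To illustrate how a trend in measured data can raise an alert before an SLO is breached, the Python sketch below fits a straight line to a series of measurements and projects when it will reach the target. The choice of a least-squares line, the use of availability percentages per reporting period and the function names are all assumptions made for this example.

```python
import math


def linear_trend(values):
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_breach(values, slo_target):
    """Periods after the last sample until the trend reaches the target.

    Returns 0 if the last sample is already below the target and
    None if the trend is flat or improving.
    """
    if values[-1] < slo_target:
        return 0
    slope, intercept = linear_trend(values)
    if slope >= 0:
        return None
    # Solve intercept + slope * x = slo_target for x.
    breach_x = (slo_target - intercept) / slope
    return max(0, math.ceil(breach_x - (len(values) - 1)))
```

With availability figures of 100.0, 99.5, 99.0 and 98.5 percent and a target of 97.0, the projection gives three further periods before the target is reached, which is the margin available to the responsible parties to react.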

The IT Service centre used in this example can be configured to publish the dashboard seen here via the web, so that people not equipped with any management platform can follow the SLO monitoring from their desktop and react before users are affected.

Conclusions

- Choose management interfaces which, wherever possible, allow a common “look-and-feel” for management tasks

- With a common data store, preferably using industry standard RDBMS products

- Use products with intelligent agents to allow delegation of management tasks close to Exchange servers

- Choose management products which allow links to legacy management packages (trouble ticket, inventory, cable management, ...)

- Choose products which are native to the OS, thus providing strong integration with Exchange, but ...

- Also integration to provide a global vision of Exchange’s condition, resource consumption, trends and exceptions

- Services to leverage from the “basic-to-basics” solutions into a scalable management architecture, to build full SLA monitoring of key Exchange components

- Choose solutions that scale to the size of your existing Exchange architecture design but allow expansion through mid-level management that can be federated through enterprise solutions

- Exchange performance baselining (network, systems, Exchange components)

- Manage the associated components on which the Exchange architecture depends (Web access, management solutions, backup/disaster recovery design and optimisation)

- Ensure the solution allows turnkey tailoring to fit Exchange-based collaborative solutions (e.g. Eastman SoftWare) and fax, GSM, SMS, ... integrated services (e.g. Fenestrae)

“You can’t manage what you can’t measure”

Appendix A

HP Exchange Management Products and Services

The solutions and best practices discussed in this white paper are based upon HP Consulting services specifically developed to provide assessment, design, deployment and management solutions, delivered turnkey to a specific client’s requirements. HP OSD also provides Exchange Management Utility services.

For more information concerning HP Consulting please consult http://www.hp.com/go/consulting

HP Consulting services are based upon, but not restricted to, the following HP OpenView products:

- HP OpenView ManageX – Windows-NT and BackOffice Management

- HP OpenView IT/Operations – Heterogeneous Enterprise Systems Management

- HP OpenView Smart PlugIn for Exchange Management
for ManageX
for IT/Operations

- HP PerfView – Heterogeneous Systems Performance Management

- HP Measureware - Heterogeneous Systems Performance Management

- HP OpenView Service Reporter – Performance Reporting Management

- HP OpenView Network Node Manager – SNMP Systems and Network Management

- HP OpenView NetMetrix – Network Performance and Traffic Analysis

- HP TopTools – HP NetServer HardWare management

For more information concerning these management products, please consult http://openview.hp.com

Send email to Interex or to the Webmaster
©Copyright 1999 Interex. All rights reserved.