Author: Ian BROMEHEAD
Company: Hewlett Packard France
Address: Avenue Steve Biko, 38090 Villefontaine, France
Telephone Number: +33.4.74.99.31.61
Fax Number: +33.4.74.99.30.05
E-mail Address: ian-martyn_bromehead@hp.com
Author: Norman FOLLETT
Company: Hewlett Packard
E-mail Address: norman_follett@hp.com
A White Paper illustrating management principles for Microsoft Exchange, based upon HP OpenView products and best practices designed by HP Consulting
As Exchange becomes the choice for enterprise messaging backbone design, and a pervasive technology for the development of knowledge management, unified messaging, workflow and collaboration, it poses challenges to the capacity of IT organisations to monitor and maintain the key components of such backbones.
This white paper provides insight into the solutions and services designed by Hewlett Packard, in collaboration with the Microsoft product team and Microsoft Consulting Services, to provide Exchange Management solutions to clients.
Title: Management of Exchange based Messaging Architectures
Abstract
Introduction
Enabling Exchange Management
Keeping Exchange Alive - Operations Management
Principles
Requirement Scenarios
Monitoring Exchange
Central monitoring of complete Exchange environment
Multiple Exchange systems
Server health
Service Start Order
Service Dependencies
Performance of MS Exchange against pre-defined thresholds
Operations Management Tools
ManageX
IT/Operations
Microsoft Exchange Standard Management tools
Implementations - SNMP
Administration
Disaster Recovery
Resource and Performance Management
Increasing the availability of Exchange services
Hardware Management
Network Management
IT Service Management
Appendix A
HP Exchange Management Products and Services
All predictions indicate that e-mail based requirements in organisations will continue to grow in importance.
Analysis of several major implementations indicates that the e-mail backbone, together with Internet and WWW infrastructures, is becoming increasingly important, as shown at least by the level of influence these have on business productivity.
Clearly, the messaging backbone is being integrated more and more closely into business processes, teamwork, collaboration and tasks, either via users, through application integration, or both.
This poses new challenges in specifying and designing management solutions that ensure all key components are monitored and kept in an operational state, and that service level management services can be specified for mission critical environments.
Managing Microsoft Exchange Server and Windows NT environments is thus an important responsibility for IT organizations. Ensuring the maximum uptime and performance of Exchange Servers is vital to a company's communications infrastructure.
Whether Microsoft Exchange is being deployed within an environment of 50 users or in a global roll out of 30,000 users, best practice is to have an integrated Exchange and NT Management solution to maximize user productivity and return on investment. The ideal management solution will proactively notify operators before critical problems occur, enabling troubleshooting of the system and application, and execution of automated corrective actions without manual intervention of a local administrator.
The following are the common problems in managing Exchange environments:
· Ensuring that Exchange Server Site Connectors are up and running
· Monitoring MTA Queue Lengths
· Monitoring Log File sizes of Information Stores
· Consolidation of Exchange and NT event log messages
· Ensuring successful Directory Replications
· Maintaining availability of Internet Mail Connectors
· Centrally managing performance of Exchange and NT counters
· Ensuring services are running
· Providing consistent global guidelines, thresholds, and management policies
The solutions and services HP has developed are based upon the requirements our clients have expressed, and the benefits they expect from them.
Requirement | Benefit
Proactive Monitoring | Detect & act before user call
Trend Analysis and alarming | Automated capacity planning
Distributed task delegation | Optimised management data flow
Automatic Actions | Optimal reaction time on failure
Local/remote control | Freedom to administrate
Baseline Value Analysis | Master architecture characteristics
Service Level Agreement surveillance | Optimise SLA, reduce downtime, coordinate key admin players
Disaster Recovery procedures | Reduce downtime
High Availability | Improve SLA, improve system usage predictability
Load Balancing of web user connections | Optimal user connectivity, cater for high connection rates transparently
Hewlett Packard’s solutions allow clients to obtain a standard or fully tailored solution for Exchange Management.
The following are typical service tasks, all or part of which are performed during such projects. They may involve HP's engineers alone, but more typically involve teams comprising the client's resources, partners and HP personnel.
Ø Requirements Analysis & benefits assessment
Ø Solutions design and implementation
Ø Proof-of-Concept and Integration Services
Ø Large Scale, International Deployment
Ø ITM & ITSM Integrated solutions & services
Ø Turnkey solutions and outsourcing
Ø Training & Education services
Our approach in solutions architecture design, deployment and management applied to Exchange covers many typical management requirements:
¨ Operations Management
¨ Administration
¨ Disaster Recovery
¨ Resource & Performance Management
¨ Reporting
¨ Hardware & Cluster Management
¨ Network Traffic Management
The following sections of this white paper provide details on each of these areas, and close with a presentation of the IT Service Management that can be proposed, building on the measurement and management power provided by the information that management solutions generate.
Exchange management solutions and services begin with the basic monitoring and running operations that must be designed to provide Operations and Disaster Recovery management. The latter is a very sensitive subject, not only for ensuring that an Exchange architecture works from a functional point of view, but also with respect to the architecture design, as we'll see later.
Resource and performance management and reporting provide solutions targeting both short term performance analysis and long term capacity planning, to ensure that users do not suffer unplanned service degradation and that IT staff are informed of trends that may affect the service levels provided through production facilities.
Finally, Network Management will provide insight into architectural issues which otherwise remain largely undetected and unmeasured, translating into high network operating costs.
Together, the above management areas comprise the essential measurement components required to begin addressing ITSM for Exchange. The value of each of these components for designing ITSM solutions is conditioned by the type of ITSM SLOs to be addressed in the contracts.
So let's first look at operations management basics, with example implementations using some of HP's products and a look at other solutions.
Operation scenarios showing typical measurement criteria which are part of HP's Exchange Management best practices include monitoring of MTA queue length, of the Exchange Server's resource dependencies such as CPU load and memory consumption, and of message delivery. Carefully specified and designed monitoring solutions will assist the operators in maintaining service and, in case of outages, repairing before users are seriously affected.
Operations management centres must be adapted to the number of operators, their geographical location and their procedures. Much of our experience shows that intelligent agent architectures, which allow such monitoring to be configured from one or more consoles and monitoring tasks to be delegated for remote execution, provide powerful, scalable solutions.
This methodology is dependent on proprietary solutions, since only these provide powerful task delegation to the agent sitting on the Exchange server.
We'll see later on how industry standard protocols such as SNMP can be used to provide more basic management, and we'll discuss the relative merits of both.
Furthermore, before we discuss our advice concerning what should be managed in Exchange architectures, it is important to understand that whilst instrumentation for measurement purposes is relatively rich in Exchange, experience shows that monitoring of architectural aspects needs to be designed too.
These are not modelled in the standard generic solutions available on the market.
One example scenario would be monitoring of the number of Exchange transaction logfiles and the space they occupy. If these increase significantly, then it is likely that the backup jobs are not functioning correctly, since these logfiles are typically removed when an online backup is performed successfully.
The ultimate result would be the Exchange IS services stalling, leaving users with no e-mail service.
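The logfile scenario can be sketched as a small check that an agent might run periodically. This is a minimal illustration in Python and not a description of any HP product: the function name, the `*.log` file pattern and both threshold values are assumptions chosen for the example.

```python
from pathlib import Path


def check_transaction_logs(log_dir, max_files=200, max_total_mb=1000.0):
    """Return (file_count, total_mb, alarm) for the *.log files in log_dir.

    The default thresholds are illustrative assumptions, not recommended
    values; they should be derived from the backup schedule of each site.
    """
    files = [p for p in Path(log_dir).glob("*.log") if p.is_file()]
    total_mb = sum(p.stat().st_size for p in files) / (1024 * 1024)
    # Raise the alarm when either the count or the occupied space is too high.
    alarm = len(files) > max_files or total_mb > max_total_mb
    return len(files), total_mb, alarm
```

An alarm from such a check would typically prompt the operator to verify the last online backup before the volume fills up.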
Consider another scenario whereby users connect to the Exchange infrastructure via Internet or intranet servers.
In this case, the initial design of the Web services needs to be qualified, for example through monitoring of the number of simultaneous connections from the web. If these connections exceed the initial or desired design criteria, the net result will be poor user response times.
The measurement may just be informational, or could be used to condition load balancing software such that secondary web servers are activated to share the load.
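As a rough illustration of how such a measurement could condition load balancing, the following Python sketch derives the number of web servers that should be active from the measured connection count. The function name and the idea of a fixed design limit per server are assumptions made for the example; real load balancing software applies its own rules.

```python
import math


def web_servers_required(current_connections, design_connections_per_server):
    """Number of web servers needed to keep each one within its design limit."""
    if design_connections_per_server <= 0:
        raise ValueError("design_connections_per_server must be positive")
    if current_connections < 0:
        raise ValueError("current_connections cannot be negative")
    # At least one server stays active even when nobody is connected.
    return max(1, math.ceil(current_connections / design_connections_per_server))
```

With a hypothetical design limit of 100 simultaneous connections per server, a measurement of 250 connections would call for three active servers.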
In fact these and many other operations management scenarios are encountered.
Central monitoring of the complete Exchange environment implies that some form of agent will perform at least monitoring of each Exchange server in the architecture.
When monitoring the whole architecture, some form of graphical representation of the architecture will greatly facilitate the operators' tasks. Simple iconic symbols are frequently sufficient, and are to be preferred to continuous visual analysis of the event log by the operator: the same events and their severity are used to control the state and colour of such icons.
It may seem obvious, but it's worth underlining that the management solution has to span all of the Exchange servers, including bridgeheads, connectors, and web servers providing access to Exchange user services via the web. If remote access, fax and mobile telephone servers are used, then the management solution must be able to provide operators with management tasks related to these.
As we've already discussed, the use of an intelligent agent provides the power to delegate management tasks to the agent sitting on the Exchange server. The alternatives typically absorb network bandwidth through polling between the manager and the agents.
As mentioned earlier, Exchange is quite well instrumented in version 5.5 SP1. The perfmon counters it exposes assist in designing the management solution, and we'll go through some of the most useful ones.
But this approach alone is not sufficient to monitor the health of Exchange. We will show that intelligent monitoring of several types of management data, calculations based upon them, and alarming on the results obtained can lead an operator more quickly to the true state and health of Exchange services.
As above, the actual state of each service needs to be monitored such that any state changes are detected, operators are alerted and history data is generated.
Although at first sight it may seem logical and desirable to program the intelligent agent to restart these services automatically, this is not always best practice.
Consider the case of the MSExchangeIS service stopping. This means that all messaging services are no longer available to the users: a major service outage. In this particular case, we may consider that what we need to do from an operational perspective is to restart the service so that users can work, but the Exchange services experts will more than likely want to analyse the event information available in logfiles to diagnose the original failure. In some cases valuable diagnosis information is destroyed when a service restart is attempted.
Conversely, if a connector service on a bridgehead server stops, it may be acceptable simply to attempt to restart this service.
The management solution must not only report on the state of Exchange services, but also provide the capacity to perform actions on them, both manually by an operator and through task delegation to the agent.
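The distinction between services that may be restarted automatically and those that should only be reported can be sketched as follows. This Python fragment is an assumption-laden illustration: the state names, the action names and the choice of MSExchangeIMC as the only automatically restarted service are hypothetical, and a real agent would obtain service states from the NT service manager.

```python
def detect_state_changes(previous, current, auto_restart=frozenset({"MSExchangeIMC"})):
    """Compare two {service: state} snapshots and propose an action per change.

    Returns a list of (service, old_state, new_state, action) tuples.
    Services outside auto_restart are never restarted automatically, so that
    diagnostic information is preserved for the Exchange experts.
    """
    events = []
    for service, state in sorted(current.items()):
        old_state = previous.get(service)
        if old_state == state:
            continue  # no state change, nothing to report
        if state == "stopped" and service in auto_restart:
            action = "restart"
        elif state == "stopped":
            action = "alert_operator"
        else:
            action = "log_only"
        events.append((service, old_state, state, action))
    return events
```

Every detected change is returned, so the same list can feed both operator alerts and the history data mentioned above.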
The startup order for the Exchange services is:
Microsoft Exchange System Attendant
Microsoft Exchange Directory
Microsoft Exchange Information Store
Microsoft Exchange Message Transfer Agent
Microsoft Exchange Event Service
Microsoft Exchange Internet Mail Connector
Microsoft Exchange Key Manager
Since there are dependencies between services, starting a service that depends on a second service will start the needed service(s). For example, if the Information Store is requested to start before the System Attendant and the Directory, the NT service manager will start the System Attendant and the Directory.
The Exchange dependencies are:
The DS depends on the SA
The IS depends on the DS & SA
The ES depends on the DS & IS
The MTA depends on the DS & SA
The IMC depends on the SA, DS, IS & MTA
Given these rules concerning operational state changes to Exchange, it's clear that there is room to automate these sequences and respect the dependencies, even when changing this state manually, by programming an intelligent agent which can faithfully reproduce the correct sequence for an operator, repeatedly and without error.
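The dependency rules listed above lend themselves to automation. The following Python sketch (standard library `graphlib`, Python 3.9 or later) derives a valid start sequence from the dependencies exactly as stated in this paper, and a stop sequence as its reverse. The abbreviations are those used in the dependency list; the function names are assumptions, and the sketch does not call the NT service manager.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it depends on, as listed above.
DEPENDS_ON = {
    "SA": set(),
    "DS": {"SA"},
    "IS": {"DS", "SA"},
    "ES": {"DS", "IS"},
    "MTA": {"DS", "SA"},
    "IMC": {"SA", "DS", "IS", "MTA"},
}


def start_order(dependencies=DEPENDS_ON):
    """Return a start sequence in which every dependency comes first."""
    return list(TopologicalSorter(dependencies).static_order())


def stop_order(dependencies=DEPENDS_ON):
    """Return a stop sequence: dependants are stopped before what they need."""
    return list(reversed(start_order(dependencies)))
```

Several valid sequences may exist; any one of them respects the dependencies, which is the property an operator cannot easily guarantee by hand across many servers.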
If it's required to stop or start several servers simultaneously, in a particular sequence, or automatically, then use of an intelligent agent and management console should be preferred.
Note that the native tools supplied with Exchange provide a means to interact with one server at a time, since they do not supply an agent working with the management console. Operations on multiple servers become tedious and prone to human error.
The counters exposed through perfmon provide valuable information on the state of Exchange. Whilst these can be exposed to an SNMP agent through MIB extensions, it is desirable to delegate analysis of these counters to an intelligent agent. The management solution then relies on the agent locally analysing the values of key counters and comparing these to threshold values. This has the added advantage of only generating messages for the operators' attention when exceptions occur, and it enables solutions in which designers can develop specific actions that the agent executes when these exceptions occur.
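The principle of local threshold analysis can be sketched in a few lines of Python. The function and the way counters are named are assumptions made for illustration; how an agent actually reads perfmon counters is outside the scope of this sketch.

```python
def evaluate_counters(samples, thresholds):
    """Return one message per counter whose sampled value exceeds its threshold.

    samples and thresholds are {counter_name: number} dictionaries. Counters
    without a configured threshold are ignored, so nothing is sent to the
    operator unless an exception actually occurs.
    """
    messages = []
    for counter, value in samples.items():
        limit = thresholds.get(counter)
        if limit is not None and value > limit:
            messages.append(f"{counter}: value {value} exceeds threshold {limit}")
    return messages
```

For example, with a threshold of 50 on the MTA work queue length, a sample of 75 produces a single message, while all counters within their limits produce none.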
An important part of ensuring that IT Management teams are equipped to deal with exceptional circumstances lies in the experience and judgement of the staff. In our experience, solutions that can be customised to support the staff's decisions through intelligent calculations, analyses and correlation should be sought.
An example would be to design a solution which checks IS datastore compression times over, say, a weekly period (an event is generated by Exchange version 5.5 SP1). If this time increases significantly, it might be due to the server being loaded on that day, or it might be indicative of a degradation of the datastore in the IS. Whilst this degradation, if confirmed, is a problem quite difficult to resolve, our advice in this case would be to work on establishing a secondary server from backup tapes or replication, ready to take over should the primary actually fail. Prevention is better than cure, and that is a good management criterion.
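A very simple form of such a check compares the most recent compression time with the average of the preceding ones. The Python sketch below is only an illustration: the function name and the factor of 1.5 are assumptions, and a single high value still needs the operator's judgement, since it may simply reflect a loaded server on that day.

```python
from statistics import mean


def significant_increase(durations_minutes, factor=1.5):
    """True if the latest duration exceeds factor times the mean of earlier ones.

    durations_minutes is a chronological list, oldest first. With fewer than
    two values there is no history to compare against, so False is returned.
    """
    if len(durations_minutes) < 2:
        return False
    *history, latest = durations_minutes
    return latest > factor * mean(history)
```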
ANY changes to Exchange architectures and components, including management, should be subject to rigorous "Change Management" procedures, such that in the event of critical situations all the changes can be understood and, if necessary, played back.
ManageX respects key Microsoft technologies: MMC, WBEM (V4), CIM and COM/DCOM (Manager<>Agent, & agent tasks).
MMC is 1 to 1; ManageX MMC extends this to 1 to many. This means that one action can be applied to multiple servers in one operation.
Third party MMC plug-ins are also fully supported, allowing true management integration.
ManageX also supports SNMP and is integrated with IT/Operations and TopTools out-of-the-box.
Users may gain access to management functions from any ManageX console or through Web interfaces.
Each ManageX console may be customised according to the operator's function in the organisation.
ManageX automatically deploys all components to the target machine; its intelligent agent, the Smart Broker, takes care of this.
The features are obtained through tasks included in intelligence policies and functionality modules. The Smart Broker supports all scripting and development languages supported by Microsoft, and others: WSH, VB, VBS, C++, Perl, REXX, JavaScript.
Snap-Ins provide basic management features for these products, and optional Smart Plug-Ins add even further management capacity.
Although Microsoft Windows-NT's perfmon is usually well known to most operators, getting directly to it with pre-configured graphs from the operator desktop and ManageX is a useful feature for diagnosis and troubleshooting.
Furthermore, version 4 of ManageX increases its capacity to provide data collection.
Its purpose is to collect data daily on the target nodes (the data being perfmon counters exposed by NT and Exchange); the console then transfers this data from each agent. All information can then be stored in an ODBC compatible database for later use.
One such usage is via the OpenView Service Reporter, but more about this later.
IT/Operations is a Unix based enterprise management server, comprising intelligent agents for heterogeneous operating systems including Windows-NT, and the capacity to expose management information and delegate tasks to operators equipped with Unix and Windows-NT graphical workstations.
The server uses an Oracle database to store and manage management data.
The server builds upon HP OpenView Network Node Manager (both Unix and Windows-NT based versions) to discover nodes and enrich its database.
The server supports many legacy applications for help desk and trouble ticket management.
The agent provides many features similar to those of the ManageX Smart Broker, including event interception, filtering and forwarding, services monitoring and automatic restart, and perfmon counter integration for threshold based exception handling.
IT/Operations architectures are highly flexible, numerous references exist to illustrate this product's scalability, and the application is widely judged by independent analysts to be best of breed for management solutions.
The Exchange SPI is available for both management platforms.
Microsoft's own solutions, though useful tools, are not typically positioned as major solutions for enterprise management consoles. They nevertheless provide the means to perform efficient operations management, typically on a 1 to 1 basis and without any intelligent task delegation.
Such applications are:
· MTA Queue Monitor: providing visual analysis of outstanding work
· Link Monitor: providing visual, instantaneous analysis of inter server link status
· Server Monitor: providing basic server health information
· Gateway Monitor: provides monitoring of Exchange gateway servers
· Information Store Monitor: provides status information on the IS service
· Directory Service Monitor: provides status information on the DS service
· System Attendant Monitor: indicates whether this global service is functioning
Most of these tools provide straightforward 1 to 1 server monitoring, which is particularly tedious for multiple servers in several sites, typical of enterprise messaging backbones.
At the beginning of this section we indicated that efficient solutions are best designed using intelligent agents, typically programmable and customisable, but these rely on proprietary solutions/protocols.
SNMP is a very successful and popular protocol, and it is implemented by Microsoft to “manage” Exchange.
Operations management is possible through the use of the SNMP agents supplied with Windows-NT, which can be configured to take into account the industry standard “MAD” MIB, which models messaging data and objects.
This modelling is generic, and basing solutions on it is not optimal for Exchange architectures.
It has the advantage of being a very economic solution requiring minimal skills to activate the management services.
Ø Only basic Exchange management tasks & features
Ø requires considerable MIB expertise
Ø Standard mgmt with MAD MIB
Ø personalisation through MIB add-ons using perf2mib tools
Ø little or no exception handling
Ø very limited possibility for monitoring of Exchange NT services
The basic features of the MAD MIB can be extended through tools provided in the Microsoft Exchange resource kit. “Perf2mib” and “mibcc” are tools which, when used correctly, allow targeted performance counters to be added to the MIB.
Perf2mib.exe is used to compile Performance Monitor counters into a new MIB for Microsoft Exchange Server.
Mibcc.exe recompiles the Perfmib.mib file created by Perf2mib.exe and creates a new Mib.bin file.
Unfortunately there is no easy method to add monitoring services, and the MAD mib does not include any definitions of SNMP traps. SMS can be used to convert Windows-NT events into SNMP.
As described in previous sections, we recommend use of the following counters:
§ MSExchangeMTA
§ MSExchangeMTA Connections
§ MSExchangeIMC
§ MSExchangeIS
§ MSExchangeIS Private
§ MSExchangeDS
§ Processor
§ Process
§ Logical Disk
§ Paging File
Put the perf2mib command in a batch file, for example:
Perf2mib Perfmib.mib Perfmib.ini MSExchangeMTA 1 MTA "MSExchangeMTA Connections" 2 MTAconnection MSExchangeIMC 3 IMC MSExchangeIS 4 IS "MSExchangeIS Private" 5 ISprivate "MSExchangeIS Public" 6 ISPublic MSExchangeDS 7 MSExchangeDS Processor 8 CPU Process 9 Process LogicalDisk 10 Disk "Paging File" 11 PagingFile
If you do not use a batch file, take great care when typing the command!
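Because the command line is long and easy to mistype, it can be generated rather than typed. The Python sketch below rebuilds the example command from a list of (perfmon object, identifier, MIB name) triples and writes it to a batch file. The argument layout is inferred solely from the example command in this paper and has not been checked against the resource kit documentation; the helper names are hypothetical.

```python
# (perfmon object, numeric identifier, name used in the MIB), taken from the
# example command in this paper.
COUNTER_OBJECTS = [
    ("MSExchangeMTA", 1, "MTA"),
    ("MSExchangeMTA Connections", 2, "MTAconnection"),
    ("MSExchangeIMC", 3, "IMC"),
    ("MSExchangeIS", 4, "IS"),
    ("MSExchangeIS Private", 5, "ISprivate"),
    ("MSExchangeIS Public", 6, "ISPublic"),
    ("MSExchangeDS", 7, "MSExchangeDS"),
    ("Processor", 8, "CPU"),
    ("Process", 9, "Process"),
    ("LogicalDisk", 10, "Disk"),
    ("Paging File", 11, "PagingFile"),
]


def quote(arg):
    """Wrap an argument in double quotes only when it contains a space."""
    return f'"{arg}"' if " " in arg else arg


def build_perf2mib_command(objects, mib_file="Perfmib.mib", ini_file="Perfmib.ini"):
    """Assemble the command line in the layout used by the example above."""
    parts = ["Perf2mib", mib_file, ini_file]
    for perf_object, object_id, mib_name in objects:
        parts += [quote(perf_object), str(object_id), mib_name]
    return " ".join(parts)


def write_batch_file(path, objects):
    """Write the command to a batch file with Windows line endings."""
    with open(path, "w", newline="\r\n") as handle:
        handle.write(build_perf2mib_command(objects) + "\n")
```

Keeping the object list in one place also makes it easier to apply the same MIB extension consistently on every server.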
When the above example command is used, a corresponding MIB table is generated and is then supported by the SNMP agent in the server. The MIB must be loaded into the management server so that it can generate requests and transmit them to each SNMP agent.
This defines the only method for operations management using SNMP.
Varying degrees of network bandwidth will be absorbed by polling requests between the management server and the managed servers.
Whilst this polling may be performed by intermediate mid-level managers, the solution is bound by the management designed into the MAD MIB.
ü Simple and economic
ü all Microsoft Exchange and any 3rd party perfmon counters > MIB
ü any SNMP based manager
ü can be used locally in more complex MOM type configurations
ü No MSX server bound exception handling
ü Disappearing MSX services instances
ü Requires engineering for large scale installations
ü limited to perfmon objects and proprietary MIB definitions
ü requires a minimum of SNMP expertise
ü SNMP stack and sub-agent dependencies
ü unreliable over noisy nets (UDP)
Administration of Exchange, in most cases and for most tasks, is ONLY possible through the native administration tools.
This currently means through the "Exchange Administrator tool" supplied with the product, or through development of specific tools using the administration API.
Configuration changes are thus largely managed through tedious, manual procedures, which require careful execution if human error through repetition is to be avoided.
This is an area where the Configuration and Change Management procedures in ITSM offerings become highly desirable.
HP is currently working with Microsoft on the development of Exchange administration through efficient use of the Exchange Admin API and WBEM/CIM technologies.
Here are some of the administration features which are possible:
ü Public Folder Setup
· World Wide Public Folder
· Country Specific Public Folder
· Owner
· Replication
ü Mailbox Actions
· Bulk Mail Creation Program
· Modifications via Forms
· Delete -> Hide
· Distribution lists
· Conference Room Setup
Disaster recovery begins with basic backup
solutions to provide the means to repair a damaged Exchange datastore.
In the simplest cases, it may be possible to perform backup jobs offline where
the information store is stopped, but it should be noted that in this case the
transaction logfiles are never removed and there are risks involved in stopping
the IS service.
It is best practice to perform online backup using backup products which access the Exchange service through the backup API. In this case, maximum datastore integrity is maintained.
Similarly, in the event of problems with the service requiring a restore of data, this is best performed online, once again through the backup API. In this case, the data is restored and the transaction logs are replayed to provide maximum integrity and context restore.
Exchange restore operations are disk I/O intensive and much slower than backup operations: expect a maximum of 40 to 50 Gbytes/hr for backup, and less than 15 Gbytes/hr for restore in the best cases.
The following tables present some of our results for backup/restore rates using different popular products.
All of these tests were performed on various Intel Quad and Dual Xeon 400 MHz processor systems, equipped with 512 Mbytes of RAM or more, and with different disk subsystem types and numbers of tape drives.
Software | Task | Database size (GB) | # DLTs | Duration (h:mn:s) | Throughput (GB/h)
Arcserve | Backup | 9.56 | 1 | 21:43 | 25.80
Arcserve | Backup | 9.56 | 2 | 21:43 | 25.80
Seagate | Backup | 9.56 | 1 | 21:41 | 26.47
Arcserve | Restore | 9.56 | 2 | 1:40:19 | 5.59
Omniback IIV3 | Backup | 11.28 | 1 | 0:20:00 | 33.83
Omniback IIV3 | Backup | 11.28 | 2 | 0:16:57 | 39.91
Omniback IIV3 | Restore | 11.28 | 1 | 0:52:15 | 12.95
Omniback IIV3 | Restore | 11.28 | 2 | 1:01:07 | 11.07
Omniback IIV3 | Restore | 11.28 | 1 | 0:43:40 | 15
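The throughput column is simply the database size divided by the duration expressed in hours. The Python sketch below shows the arithmetic for durations written as in the table, either "h:mn:s" or "mn:s"; the function names are assumptions introduced for the example.

```python
def duration_hours(text):
    """Convert a duration written as 'h:mn:s' or 'mn:s' into hours."""
    parts = [int(p) for p in text.split(":")]
    if len(parts) == 2:
        parts = [0] + parts  # no hours field: treat as minutes and seconds
    if len(parts) != 3:
        raise ValueError(f"unexpected duration format: {text!r}")
    hours, minutes, seconds = parts
    return hours + minutes / 60 + seconds / 3600


def throughput_gb_per_hour(size_gb, duration):
    """Throughput in GB/h for a job that moved size_gb in the given duration."""
    return size_gb / duration_hours(duration)
```

Applied to the 11.28 GB restore that took 0:52:15, this gives roughly 12.95 GB/h, in line with the table.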
We are currently performing further tests with ArcServe and BackupExec, and gathering data from our clients using these products.
Based upon our tests, we expect ArcServe to provide results at least similar to, or slightly better than, those we have measured using HP Omniback II.
Use of dedicated SCSI controllers, and even RAID controllers with multiple tape drives, should be considered, to maximise tape throughput and eliminate single points of failure at the media level.
The table shows that little extra performance is obtained when using multiple tape drives, especially during restore operations. Clearly restore operations are, in all cases, the slowest and thus the most critical in a disaster recovery situation.
These rates will vary greatly according to the hardware configuration of the backup system, the disk types and the disk I/O systems. In fact the restore time is a deciding factor during the architecture design of the Exchange servers, to the extent that in most cases the number of users per server has to be severely restricted to ensure a reasonable restore time in the event of a crash.
Disk system | Task | 20 GB | 50 GB | 75 GB | 100 GB
Disk System Type 1 | Backup | 30mn | 1h17 | 1h55 | 2h34
Disk System Type 1 | Restore | 1h20 | 3h20 | 5h | 6h40
Disk System Type 2 | Backup | 0h42 | 1h46 | 2h39 | 3h32
Disk System Type 2 | Restore | 1h22 | 3h25 | 5h08 | 6h51
Disk System Type 3 | Backup | 57mn | 2h22 | 3h33 | 4h44
Disk System Type 3 | Restore | 1h08 | 2h50 | 4h16 | 5h41
 | Restore | 4h37 | 11h33 | 17h19 | 23h06
In sharing this information, it is our intention to alert those who are responsible for designing and operating disaster recovery solutions and procedures to two essential aspects of Exchange:
a. The size of the datastore should not be allowed to grow to a capacity that implies a long restore time after a disaster or major overhaul. Of course the impact of this on IT staff will depend on individual circumstances, but given the increasing importance attached to messaging systems, clearly downtime must be minimised.
b. Disk system types have the most impact on backup and restore rates, much more than the number of DLT drives used to store and recover data.
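Point (a) can be turned into a simple sizing rule: for a measured restore rate and an acceptable restore window, the datastore should not exceed their product. The Python sketch below assumes the restore time scales linearly with datastore size, which is consistent with the Disk System Type 1 restore figures above (20 GB in 1h20, that is 15 GB/h, and 100 GB in 6h40). The function names and the four hour window in the usage note are hypothetical.

```python
def restore_hours(datastore_gb, restore_rate_gb_per_hour):
    """Estimated restore time, assuming it scales linearly with size."""
    if restore_rate_gb_per_hour <= 0:
        raise ValueError("restore rate must be positive")
    return datastore_gb / restore_rate_gb_per_hour


def max_datastore_gb(restore_rate_gb_per_hour, max_restore_hours):
    """Largest datastore that can be restored within the allowed window."""
    return restore_rate_gb_per_hour * max_restore_hours
```

At 15 GB/h, an organisation that tolerates at most four hours of restore time would cap each datastore at about 60 GB.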
In the above chart, disk system 3 may seem completely unacceptable. From a performance perspective this may be true, but the trade-off in this particular case is that this disk system has multiple power supplies and controllers, thus eliminating many if not all single points of failure.
We are thus faced with a set of compromises, and decisions to be made between performance and availability considerations at least (not forgetting economics).
Clearly the management system must monitor backup services to ensure that the backups are performing correctly. As for any other NT based service, this monitoring should watch the backup services, trap and filter events generated by the backup system, and monitor resource consumption during execution of backup services.
This needs to be taken into account when advising the client on the hardware setup for backup, and on the most appropriate products and their implementation characteristics.
These best practices address the design of backup/restore solutions, but are incomplete, since it is also essential to ensure that the environment on which the backup system relies is in the correct state.
Monitoring of the transaction logfiles (they should be regularly deleted), of the free space on the volume and drive on which they are located (the IS service will cease to operate if it cannot write these logfiles), and of the number of logs over time is highly recommended.
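The free space check on the logfile volume can be sketched with the Python standard library. The 500 MB floor is an arbitrary assumption for illustration, and the `disk_usage` parameter exists only so that the check can be exercised without a real Exchange volume.

```python
import shutil


def log_volume_alarm(path, min_free_mb=500.0, disk_usage=shutil.disk_usage):
    """Return (free_mb, alarm) for the volume that holds path.

    alarm is True when free space falls below min_free_mb, an illustrative
    floor that should be sized from the observed logfile growth rate.
    """
    usage = disk_usage(path)
    free_mb = usage.free / (1024 * 1024)
    return free_mb, free_mb < min_free_mb
```

Recording the returned free space at each run also provides the history needed to follow the growth of the logs over time.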
It is surprising how often we encounter otherwise perfect backup systems and procedures which do not include any media verification.
Clearly, if disaster recovery is a simple procedure based upon restore of the last backup, it is vital that some form of media testing is regularly performed.
Best practice in this sense is to regularly schedule a restore of the data onto a secondary server, with an automated test schedule to verify that the backup has worked correctly.
In some cases it is desirable to manage warm backup solutions. In this case, a secondary server is kept as up to date as possible, either based on replication of the primary Exchange server datastores, or via restore of the last backup (which has the advantage of testing the media as well).
Perfecting a methodology to do this requires a great deal of experience and well designed procedures, especially when the Exchange service to the user must be disturbed as little as possible or not at all, but it is possible to design such setups.
A cluster of Exchange servers can provide a solution for disaster recovery, since the data store is shared between two servers with passive switchover. Microsoft Cluster Server provides this, but this cluster solution is considered difficult to maintain in a stable state. In the case of an MSCS based Exchange configuration, the backup/restore must also be carefully designed such that all the configuration data of the cluster groups is faithfully restorable.
Given that in this type of configuration, one server is executing Exchange whilst the other is not, standard manager/agent configurations used on a normal Exchange server, cannot be used as is. They must be modified to cater for execution switchover on the active and the passive nodes, and only report on the active node.
Testing the media is vital. Offline restore test suites, potentially automated, are essential in any disaster recovery strategy.
Inclusion of 3rd party hardware subsystems, such as those from EMC², also poses constraints on the management architecture. Management of EMC² solutions is possible from the enterprise console through the integration they now provide. This is obviously highly desirable, at least to supply the operator with state information concerning the backup of data, which is specially designed in the EMC² case.
We have already seen that operations management makes use of performance counters to provide valuable exception handling based on thresholds.
We have also seen how the consumption of server resources by Exchange should be monitored and alerted on, in order to expose exceptions.
Clearly, performance data can be used not only for day-to-day operational tasks, but also for trend analysis, architecture validation and evolution planning.
A great deal of this depends on obtaining the right measurements at the right frequency and then managing them in archives.
Best practice is to use the following counters:
- MSExchangeMTA, Messages Delivered per Minute - This counter measures the
rate at which messages are delivered by the MTA to the IS. Normal load is
10 - 40 messages per minute. If this number is constantly under 5 per minute
when there are pending items in the MTA queue, then it is likely that the
server is under severe load or there is a problem with one of the processes.
If this number is extraordinarily high (greater than 200 per minute) for an
extended amount of time, then it is likely that there is a stuck message in
the MTA queue.
- MSExchangeMTA, Work Queue Length - The level should increase and decrease.
An acceptable range would be 0-50. When messages are stuck in the queue, the
counter will remain level or only increase for extended periods of time. Watch
for “artificial floors” on the MTA queue. Used here, “artificial floor” means
the work queue length remains at or above a non-zero, positive integer. This
can mean a number of things. It could mean that there are corrupt or stuck
messages in the queues, or it could simply mean your queues house a number of
messages that have been sent with the deferred delivery option in Exchange
clients.
- Exchange services processes - This is the object: Process, Counter:
% Processor Time, Instances: DSAMAIN, EMSMTA, MAD, and STORE. No instance
should be at 0% or at 100% all of the time. An instance always at 0% indicates
a “dead” process. Check Service Control Manager to verify that the service is
running. An instance always at 100% usually indicates that something is out of
order - check other services and the Event Viewer to pinpoint the problem.
- Paging File, % Usage - Make certain that the usage is in a reasonable
range, generally 15%-35%. When the level of usage exceeds 60% there is usually
something wrong. If the usage constantly exceeds 90% then the situation needs
to be treated as a problem: there is either a problem with one of the
processes, the server needs a RAM upgrade, or the paging file was incorrectly
allocated during setup.
- LogicalDisk, Free Megabytes, Instance: E: - This is the amount of free
space on the transaction log drive. Monitor this counter to ensure that the
drive does not fill up with .LOG files. Normally, the .LOG files are removed
whenever an on-line backup is performed. If the .LOG files are not being
removed, verify that the backups are being done correctly and complete
successfully (see Backup Procedures).
- MSExchangeIS, Active Connection Count - This measures the number of logons
to the Store Service. This number should be greater than zero. If the server
has active mailboxes and there are zero connections then a problem exists. Use
a test account to see if there is a problem making a connection to the server.
- MSExchangeDS, Pending Replications - Shows the number of replication
objects yet to be processed.
- MSExchangeDS, Remaining Replication Updates - Measures the number of
objects being processed by the DS. This number usually starts at 100 and
decreases to 0 within 1 - 3 minutes.
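As an illustration only, some of the threshold rules above can be expressed as a small script. The sketch below is in Python; the function names, the five-sample window used to approximate "constantly" and "for an extended amount of time", and the assumption that samples arrive as plain lists are all ours and are not part of any HP or Microsoft product.

```python
from typing import Sequence

WINDOW = 5  # assumed number of consecutive samples that counts as "sustained"


def check_mta_rate(rates: Sequence[float], queue_lengths: Sequence[int],
                   window: int = WINDOW) -> str:
    """Classify MSExchangeMTA 'Messages Delivered per Minute' samples.

    'low'  : under 5 per minute while the work queue holds pending items.
    'high' : above 200 per minute throughout the window.
    """
    recent = list(zip(rates, queue_lengths))[-window:]
    if len(recent) < window:
        return "ok"  # not enough history to call anything sustained
    if all(rate < 5 and queue > 0 for rate, queue in recent):
        return "low"
    if all(rate > 200 for rate, _ in recent):
        return "high"
    return "ok"


def has_artificial_floor(queue_lengths: Sequence[int],
                         window: int = WINDOW) -> bool:
    """True when the work queue length never returned to zero in the window."""
    recent = list(queue_lengths)[-window:]
    return len(recent) == window and min(recent) > 0


def check_paging_file(usage_percent: Sequence[float],
                      window: int = WINDOW) -> str:
    """Classify 'Paging File, % Usage' samples against the 60% and 90% levels."""
    recent = list(usage_percent)[-window:]
    if not recent:
        return "ok"
    if len(recent) == window and all(u > 90 for u in recent):
        return "problem"  # constantly above 90%
    if recent[-1] > 60:
        return "warning"  # latest sample above 60%
    return "ok"
```

A floor detected by has_artificial_floor is only a prompt for investigation, since deferred-delivery messages produce the same pattern as stuck messages.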
We have already mentioned the pre-configured graphs, which we recommend
be set up and used for diagnostics and to assist operators in performing the
regular tasks that maintain service. These graphs are also exceedingly
important for defining the "baseline" behaviour of Exchange components over
time, and thereby for specifying improved threshold values tailored to the
particular client environment.
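One simple way to turn a recorded baseline into a tailored threshold is sketched below in Python. The rule used here (mean plus a multiple of the standard deviation) is our illustrative choice and only one of several possible rules; the function names and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence


def baseline_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Return mean + k * sample standard deviation of the archived samples.

    Needs at least two samples; k is an assumed tuning factor.
    """
    return statistics.mean(history) + k * statistics.stdev(history)


def exceptions(samples: Sequence[float], threshold: float) -> List[float]:
    """Return the samples that exceed the tailored threshold."""
    return [s for s in samples if s > threshold]


# Hypothetical archived work queue lengths from a quiet period.
history = [10, 12, 11, 13, 14]
limit = baseline_threshold(history, k=2.0)
print(round(limit, 2), exceptions([12, 15, 19], limit))
```

The value of k and the length of the archive would need to be validated against each client environment before the resulting threshold is used for alerting.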
Automated collection and storage of Exchange performance data allows
operators and administrators (architects) to later validate the choices made
when the Exchange servers were designed and commissioned.
Extraction of key parameters, either for real-time graph generation or
for potentially automated report generation, facilitates analysis of trends
and of server and user behaviour, and provides key evidence for IT staff to
plan ahead and define the evolution of the architecture.
These services thus ensure the efficient dissemination of the right
management data, empowering the decision makers at the right time.
When a cluster is used in an Exchange architecture, Exchange 5.* is only
capable of using one server at a time, i.e. active/passive mode.
Given this, operations management packages must be modified so that the
management agents can be distributed to each node, but only the agents on the
active node actually manage.
This is available with some products, such that Windows NT services not
actually running on the passive node of a Microsoft Cluster Server are not
restarted erroneously, and any events generated on the passive node are not
forwarded by default.
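The active/passive behaviour described above can be sketched as two small decision functions. This is a minimal Python illustration under our own assumptions: the Event type and function names are hypothetical, and the name of the active node is passed in as a parameter because how it is obtained from the cluster is product-specific and not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    node: str      # cluster node on which the event was raised
    service: str   # Windows NT service name, e.g. "MSExchangeIS"
    message: str


def should_forward(event: Event, active_node: str) -> bool:
    """Forward an event to the console only if it comes from the active node."""
    return event.node.lower() == active_node.lower()


def should_restart(service_running: bool, node: str, active_node: str) -> bool:
    """Restart a stopped service only on the node currently running Exchange."""
    return (not service_running) and node.lower() == active_node.lower()
```

With these rules, a stopped MSExchangeIS service on the passive node is neither restarted nor reported, which is the behaviour an unmodified agent would get wrong.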
Where users gain access to Exchange through web sites, increasing
availability also has to apply to the web services being provided, so that the
management solution covers these. Solutions are available not only to manage
the web sites themselves, but also to supply load balancing and server
failure/recovery management.
In our experience of management solution deployments, we have seen many
cases of clients forgetting, or choosing not, to include hardware monitoring.
The following example illustrates its importance.
A RAID disk fails. The RAID configuration allows the file system
services to be maintained transparently to the user. Typically the only
subsystem capable of detecting this state is a hardware management package.
It is thus crucial that such hardware management packages, frequently
proprietary, are included in the overall design.
Typically hardware management packages are 100% proprietary, but
usually have some form of alerting and/or monitoring capability, either
SNMP or web-based.
As we approach deployment of Windows 2000 and Platinum, it is clear that
generic hardware management will become more widely implementable through the
WBEM and WMI interfaces and definitions.
TopTools version 4.1 is capable
of supplying management information (based upon DMI support) for certain
servers from HP, Compaq, and IBM.
Our recommendations to clients when choosing a hardware supplier are frequently based largely upon the following management capabilities:
- Easy installation
- Manage servers, desktops and mobiles from one application
- Inventory upload
- Built-in discovery and representation through SNMP and WMI
- Built-in alerting, at least via SNMP
- Database integration hooks: frequently these products will manage their
own DB. These hooks allow management solutions to search and extract vital
information to facilitate management tasks
- Web access to hardware management
- Access by LAN and modem
- Secured access (e.g. encrypted password)
- Role-based access for flexible operator management
- Modem dialback/LAN security
- Unified IPMI event log access
- On-line memory diagnostics
- Notification mechanisms beyond just alerts
- Remote control
- Full power cycle
- Graphical console redirection
- Remote BIOS/firmware updates
In the context of an Exchange network, the most crucial network management
features to implement concern status checks of the links between each Exchange
server in the architecture.
Monitoring of Exchange components is recommended to identify problems such
as link status, breaks in bridgehead and Exchange site connectivity, and
excessive bandwidth occupation. This will help the help desk and operators to
determine whether a user’s problems are due to his own usage, and provides
extra information for diagnosing the true problem. In this case, it is
particularly recommended to use tools with a graphical representation of the
objects in question, since the particular context is more easily determined
from graphic elements and colour than by wading through a list of events.
In the case of single Exchange site configurations, there are no
administration functions which allow us to modify the behaviour of information
flow and synchronization between Exchange servers, or how the network
bandwidth between them is used.
Given this, it is in our experience
useful to exploit network analysis to characterize the network usage on the
basis of network connections that are expected and confirmed, and through
typical bandwidth consumption measured between servers over a representative
period.
The information generated by this analysis serves as a baseline for
defining exception thresholds. Exceptions can be analysed and detected online
if the management solution allows continuous analysis (for example through
permanently installed probes and the instrumentation increasingly deployed in
network elements), or through reporting tools and methods.
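The comparison of expected against confirmed network connections can be illustrated with a short Python sketch. The server names, the function names and the idea of supplying the observed connections as a list of pairs are assumptions made for this example; in practice the observed list would come from whichever probe or analysis tool is deployed.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def link(a: str, b: str) -> Link:
    """Normalise a server pair so that (A, B) and (b, a) compare equal."""
    first, second = sorted((a.upper(), b.upper()))
    return (first, second)


def compare_links(expected: Iterable[Link],
                  observed: Iterable[Link]) -> Dict[str, List[Link]]:
    """Report expected links that were not seen, and seen links not expected."""
    exp = {link(a, b) for a, b in expected}
    obs = {link(a, b) for a, b in observed}
    return {"missing": sorted(exp - obs), "unexpected": sorted(obs - exp)}


# Hypothetical site with three Exchange servers.
expected = [("EXCH1", "EXCH2"), ("EXCH2", "EXCH3")]
observed = [("exch2", "exch1"), ("EXCH1", "EXCH9")]
print(compare_links(expected, observed))
```

A missing link points to a possible connectivity break between servers, whereas an unexpected link suggests traffic that the architecture design did not foresee.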
In the case of multiple Exchange sites, the administration staff can
determine which traffic flows between which servers and when. In this case,
the above measurements should be used to validate the correct configuration of
Exchange components and the correct usage of the network architecture
(essentially routing) and its bandwidth.
Network analysis tools such as these probes can frequently be configured
to capture and decode packets automatically, based on thresholds or on a
specific user address, which helps to isolate intermittent problems. This
requires expert understanding of the protocols and their usage, but in many
cases leverages the investment already made in the same network traffic
analysis tools.
The design of IT Service Management solutions is a major part of our
consulting services work, once the management instrumentation, tools and hooks
are implemented. These provide the management data that allows service level
agreements (SLAs) to be negotiated and constantly monitored, so that the
parties responsible for an SLA are alerted when trends indicate service
degradation and an SLA breach in the IT architecture.
The service level objectives (SLOs) monitored here rely on the
measurement data and management functions discussed so far. Without those
"hooks" and measurements, no SLO monitoring would be possible.
Basically, you cannot manage what
you cannot measure.
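To illustrate how measured data can warn of a coming SLO breach before it happens, the Python sketch below fits a straight line to recent samples and projects when it would reach the agreed limit. The least-squares trend is our illustrative choice, not a description of any product; the function names, the metric and the limit are hypothetical, and samples are assumed to be equally spaced in time.

```python
from typing import Optional, Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples at x = 0, 1, 2, ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def samples_until_breach(values: Sequence[float],
                         limit: float) -> Optional[float]:
    """Projected number of further samples before the trend reaches `limit`.

    Returns 0.0 if the trend line is already at or above the limit, and
    None if there is too little data or the trend is not rising.
    """
    if len(values) < 2:
        return None
    slope, intercept = fit_line(values)
    current = intercept + slope * (len(values) - 1)
    if current >= limit:
        return 0.0
    if slope <= 0:
        return None
    return (limit - current) / slope


# Hypothetical daily average delivery times in seconds, with an SLO of 160 s.
print(samples_until_breach([100, 110, 120, 130], limit=160))
```

A projection of this kind is only as good as the samples behind it, so it complements the threshold alerts described earlier instead of replacing them.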
The IT Service centre used in this example can be configured to publish
the dashboard seen here via the web, so that people not equipped with any
management platform can follow the SLO monitoring from their desktop and react
before users are affected.
Conclusions
- Choose management interfaces which, wherever possible, allow a common
“look-and-feel” for management tasks
- With a common data store, preferably using industry standard RDBMS products
- Use products with intelligent agents to allow delegation of management
tasks close to the Exchange servers
- Choose management products which allow links to legacy management packages
(trouble ticket, inventory, cable management, ...)
- Choose products which are native to the OS, thus providing strong
integration with Exchange, but ...
- Also look for integration that provides a global vision of Exchange’s
condition, resource consumption, trends and exceptions
- Use services to move from the “basic-to-basics” solutions to a scalable
management architecture and build full SLA monitoring of key Exchange
components
- Choose solutions that scale to the size of your existing Exchange
architecture design but allow expansion through mid-level management that can
be federated through enterprise solutions
- Exchange performance baselining (network, systems, Exchange components)
- Management of the associated components on which the Exchange architecture
depends (Web access, management solutions, backup/disaster recovery design and
optimisation)
- Ensure the solution allows turnkey tailoring to fit Exchange-based
collaborative solutions (e.g. Eastman SoftWare) and fax, GSM, SMS, ...
integrated services (e.g. Fenestrae)
“You can’t manage what you can’t measure”
The solutions and best practices discussed in this white paper are based
upon HP Consulting services specifically developed to provide assessment,
design, deployment and management solutions, turnkey to a specific client’s
requirements. HP OSD also provides Exchange Management Utility services.
For more information concerning HP Consulting please consult http://www.hp.com/go/consulting
HP Consulting services are based upon, but not restricted to, the following HP OpenView products:
- HP OpenView ManageX - Windows-NT and BackOffice Management
- HP OpenView IT/Operations - Heterogeneous Enterprise Systems Management
- HP OpenView Smart PlugIn for Exchange Management (for ManageX, for IT/Operations)
- HP PerfView - Heterogeneous Systems Performance Management
- HP Measureware - Heterogeneous Systems Performance Management
- HP OpenView Service Reporter - Performance Reporting Management
- HP OpenView Network Node Manager - SNMP Systems and Network Management
- HP OpenView NetMetrix - Network Performance and Traffic Analysis
- HP TopTools - HP NetServer HardWare management
For more information concerning these management products, please consult http://openview.hp.com