Between September 1998 and February 1999 I was responsible for the “highwater” benchmarking of Microsoft Exchange Server on new HP NetServers. During that time, I published results for two configurations that established new performance records for systems with fault tolerant disk arrays. My test systems used mirrored and striped-mirror disk configurations implemented with HP NetRAID-3Si Disk Array Controllers, HP Rack Storage/8 and HP Rack Storage/12 disk enclosures for external storage, and HP 6107A 9GB 10K hot swap disk drives. This is the story of how and why I did it that way.
The Exchange Messaging Benchmark was created by Microsoft to allow NT computer vendors to demonstrate the performance and scalability of Exchange on their systems. A test configuration consists of the server under test and several smaller client systems. The clients all run a simulator program called loadsim that simulates the Exchange messaging and scheduling activity for a number of “average corporate MAPI users”. Each client in the benchmark typically simulates between 500 and 1000 users. My published results were for systems serving 12600 users and 17600 users.
Some benchmarks are like the 100-yard dash. You run it, and you see how fast you went. The whole thing’s over in a few minutes. The Exchange benchmark is more like the pole vault. You set the height in advance, and then you see if you can jump it successfully. With Exchange, the height is the number of users in the simulation, and in order to make it, the server has to handle the load while keeping several stress parameters within defined limits. So you can clear the bar, but then be disqualified because your heart rate was too high on the way over. Oh, and one other thing… Each jump takes about 10 hours.
The most visible part of the published score is the user count, but it isn’t called that. It’s reported as MMB (MAPI Messaging Benchmark) to emphasize that it’s a load level and not a recommended number of users for a real configuration. The next part of the score is the response time. The loadsim clients clock all the reads, replies, sends, forwards, and deletes during the benchmark run. A weight is assigned to each operation during the scoring period, and the 95th percentile of the weighted averages determines the response time. Any response time under 1000 milliseconds is legal, but you get bragging rights for shorter times. My big system scored 17600 MMB with a response time of 243 milliseconds.
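For a rough feel for the scoring mechanics, here is a minimal sketch of a weighted 95th-percentile calculation. It is my own interpretation only: the operation weights and the exact aggregation that loadsim uses are not reproduced here, so treat every number in it as a placeholder.

```python
# Hypothetical operation weights -- placeholders, not loadsim's real values.
WEIGHTS = {"read": 1, "reply": 2, "send": 2, "forward": 2, "delete": 1}

def weighted_95th_percentile(samples):
    """samples: (operation, response_time_ms) pairs collected by the clients."""
    weighted = []
    for op, ms in samples:
        weighted.extend([ms] * WEIGHTS[op])   # heavier operations count more often
    weighted.sort()
    return weighted[int(0.95 * len(weighted)) - 1]

# Fake data just to exercise the function.
samples = [("read", 120), ("send", 310), ("reply", 240), ("delete", 80),
           ("forward", 290), ("read", 150), ("send", 200), ("read", 95)]
print(f"{weighted_95th_percentile(samples)} ms")   # must stay under 1000 ms to be legal
```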
The statistics for the run are measured during a 4-hour steady state period in the middle of an 8- or 9-hour loadsim run. Besides the limit on response time, there are limits on several server resource parameters measured by Performance Monitor. One counter (MSExchangeIS Private Send Queue Size) tracks an internal queue in Exchange, and when it’s out of bounds, the server is usually overloaded and heading for trouble.
Of interest here is a counter that sets off alarms well before there’s real trouble on the server. It is the Average Disk Queue Length, and the benchmark rules state that over the steady state period, the “average should be less than the number of spindles in the physical device”. I started taking it very seriously when Microsoft rejected my very first benchmark submittal four days before our system introduction event. The Average Disk Queue Length on my 12-disk array had come in at 12.663.
I divide Exchange’s disk requirements into three pieces: the Information Store (IS), the Information Store Logs (IS Logs), and everything else.

By the way, sometimes a “disk” is just a disk, and sometimes a “disk” is really a bunch of disks in a disk array. Just remember that if I talk about the Information Store disk, I’m talking about the logical disk that holds the Information Store files, even if it’s really implemented as twenty-four 9GB drives on four SCSI buses attached to a couple of NetRAID-3Si disk array controllers.
The Information Store consists of priv.edb and pub.edb, the private and public database files. After the user mailboxes are set up for the benchmark, the size of priv.edb is about 2.6 MB per user (over 30 GB for the 12600 MMB run). The pub.edb file is only about 17 MB, and doesn’t have much effect on IS performance. The disk I/O to the Information Store consists of random reads and writes, averaging about 4K in size. I have accepted the conventional wisdom that the access pattern in the Information Store is random without tracing it myself. The read/write ratio is nearly 2:1. The rate is usually around 0.04 ops/second for each simulated user, but it varies some with the memory in the system. Exchange appears to issue these requests in groups of 64. On my systems, the Information Store is always on the G: drive.
Exchange logs all IS transactions to the IS log files before committing them to the database. I enable circular logging during benchmarks, which means that Exchange can recycle the 5 MB log files after all their transactions have been committed. With circular logging, only a handful of log files are created during a run, so I usually allocate a small partition for the IS Logs. Transfers to the log files are small sequential writes, at a rate of about 14 bytes/sec for each user. Without circular logging, several gigabytes of log files will accumulate during an 8-hour run on a large benchmark. On my systems, the IS Logs are on the F: drive.
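To see where “several gigabytes” comes from, here is a quick back-of-the-envelope sketch (my own arithmetic, using the roughly 14 bytes/sec per user rate above):

```python
# Rough estimate of IS log accumulation without circular logging.
users = 17600                        # the big benchmark run
log_rate_per_user = 14               # ~14 bytes/sec of log traffic per simulated user
run_seconds = 8 * 60 * 60            # an 8-hour run

log_bytes = users * log_rate_per_user * run_seconds
print(f"~{log_bytes / 2**30:.1f} GB of log files")   # roughly 6-7 GB
```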
The rest of Exchange includes the Directory Store database and its logs, and the executables, libraries, and miscellaneous files of the installation. I leave them all in their default positions in the \exchsrvr directory on the C: drive.
The tables below show the measured disk I/O characteristics for the IS and IS Logs on three benchmark sizes. With Random I/O (Information Store), we’re mostly interested in operations/sec because the data transfer rate is too low to be a limiting factor. With Sequential I/O (IS Logs), we’re more interested in the rate of transfer.
| Users | I/O Ops/sec | I/O Ops/sec per user | %Reads | Average Bytes/Read | Average Bytes/Write | Information Store Size |
|-------|-------------|----------------------|--------|--------------------|---------------------|------------------------|
| 8000  | 341         | .043                 | 64%    | 4097               | 4690                | 20 GB                  |
| 12600 | 489         | .039                 | 61%    | 4098               | 4928                | 32 GB                  |
| 17600 | 759         | .043                 | 64%    | 4097               | 4888                | 48 GB                  |

Table 1. Information Store Disk Characteristics (Random I/O)
| Users | Average KBytes/sec | I/O Ops/sec | I/O Ops/sec per user | %Reads | Average Bytes/Write |
|-------|--------------------|-------------|----------------------|--------|---------------------|
| 8000  | 121                | 100         | .013                 | 0%     | 1244                |
| 12600 | 177                | 124         | .010                 | 0%     | 1459                |
| 17600 | 231                | 142         | .008                 | 0%     | 1670                |

Table 2. IS Logs Disk Characteristics (Sequential I/O)
A few more words about the Average Disk Queue Length: Exchange shovels I/O requests to the Information Store disk at up to 64 at a time. Performance Monitor sums the in-process times for each request and divides by the elapsed time of the sample period to compute the Average Disk Queue Length, so the number is for a conceptual queue. I use an internal disk benchmark that maintains a constant number of outstanding requests and measures the resulting I/O ops/sec. In that benchmark, Performance Monitor’s Average Disk Queue Length matches the benchmark’s outstanding request level. The Exchange benchmark is sort of the opposite of the internal disk benchmark: Exchange maintains a fairly constant rate of I/O ops/sec, and measures the resulting Average Disk Queue Length. If the disk subsystem can’t keep up with the required rate, things blow up pretty quickly and it’s obvious that you have to fix something. The usual case is that the subsystem does keep up with the rate, but only by having lots of pending requests and thus a higher Average Disk Queue Length. A DAC can usually get more total throughput from a disk array when there are lots of pending requests, because it can find more optimizations through sorting and combining when there are more things to choose from. The downside is that the response time for individual requests is longer because of the time spent waiting on the queue.
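Here is a minimal sketch (my own illustration, not Performance Monitor’s actual code) of that calculation, and the shortcut it implies:

```python
# Average Disk Queue Length as Performance Monitor computes it: sum of each
# request's in-progress time during the sample period, divided by the period.
def avg_disk_queue_length(busy_seconds_per_request, sample_period_sec):
    return sum(busy_seconds_per_request) / sample_period_sec

# Toy example: 10 requests in a 1-second sample, each in progress for 200 ms,
# gives a conceptual queue length of 2.0 even if they never overlapped 10 deep.
print(avg_disk_queue_length([0.200] * 10, 1.0))          # 2.0

# By Little's law the same number is (I/O rate) x (average time per request).
# At 341 ops/sec (Table 1, 8000 users) and a hypothetical ~20 ms per request,
# that's about 6.8 -- near the 6.9 measured for the valid RAID 10 run below.
print(341 * 0.020)                                        # ~6.8
```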
As long as you don’t mix the Information Store and the IS Logs on the same disk, the IS Logs are easy to accommodate. On my published benchmarks, the F: drive was a 500 MB FAT partition that shared a disk with the C: partition and a paging partition. It’s the Information Store that requires the large size and higher random I/O performance of a disk array. All arrays discussed here were implemented with HP NetRAID-3Si Disk Array Controllers (DACs).
RAID stands for Redundant Array of Independent Disks. It was originally coined with I for “Inexpensive”, but the drives I’ve been using are better described as independent, because they aren’t inexpensive. The mental picture that I use for thinking about RAID performance is simple. I think about a typical operation (in our case, a 4K random read or write), and then I think about how many of those little green access lights come on during that operation. The fewer, the better. I think about how many operations I could start at a time before all the lights come on. The more, the better. For illustrations, we’ll use arrays of 4 disk drives. Remember that my FGL (Flashing Green Lights) model ignores DAC features like caching, read-ahead, and reordering. It’s just to give you an intuitive feel for the relative performance of the different RAID levels when you’re thinking about 4K random I/Os.
RAID 0 (Striping): The simplest, fastest, and cheapest RAID configuration would more correctly be called AID, because it doesn’t offer any type of redundancy. Data is simply striped across the disks, and the NetRAID terms Stripe Depth and Stripe Size both refer to the amount of data written to each disk before moving to the next one. In a 4-disk RAID 0 array with a Stripe Depth of 16K, a 4K read to offset 0 would touch the first block on the first disk. A 4K read to offset 32K would touch the first block on the third disk, and 4K reads to offsets 64K, 128K, 192K, or 256K would all touch different blocks back on the first disk. The total size of the stripe across the 4 disks is 4 * 16K = 64K, but in NetRAID terminology, the Stripe Size is 16K.
| Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|--------|--------|--------|--------|
| 0K     | 16K    | 32K    | 48K    |
| 64K    | 80K    | 96K    | 112K   |
| 128K   | 144K   | 160K   | 176K   |
| 192K   | 208K   | 224K   | 240K   |
| …      |        |        |        |

Table 3. File system offsets in a 4-disk RAID 0 array with Stripe Depth = 16K
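To make the layout concrete, here is a minimal sketch (my own illustration, not the NetRAID firmware) of the address mapping behind Table 3:

```python
# RAID 0 address mapping: which disk, and which offset on that disk, serves a
# given array offset, for a 4-disk array with a 16K Stripe Depth as in Table 3.
STRIPE_DEPTH = 16 * 1024
NUM_DISKS = 4

def raid0_map(array_offset):
    stripe_index = array_offset // STRIPE_DEPTH        # which 16K block overall
    disk = stripe_index % NUM_DISKS                    # rotates across the disks
    disk_offset = (stripe_index // NUM_DISKS) * STRIPE_DEPTH + array_offset % STRIPE_DEPTH
    return disk, disk_offset

# The examples from the text: offset 0 hits disk 1, 32K hits disk 3,
# and 64K, 128K, 192K, and 256K all land back on disk 1, at different blocks.
for off in (0, 32*1024, 64*1024, 128*1024, 192*1024, 256*1024):
    disk, disk_off = raid0_map(off)
    print(f"offset {off//1024:>3}K -> disk {disk + 1}, offset {disk_off//1024}K on that disk")
```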
Regarding the flashing green lights: A 4K read or write will light up a single disk. A 4-disk RAID 0 array could have up to 4 of these operations going at the same time. So the FGL model predicts that a 4-disk RAID 0 array reads at 4X and writes at 4X compared to the performance of a single disk.
With no redundancy, if any disk in a RAID 0 array fails, all data in the array is lost and must be recovered from a backup. I wouldn’t take that chance with a mail system, so I have self-righteously chosen to believe that benchmarking Exchange with RAID 0 arrays is cheating. A look at Microsoft’s Exchange benchmark results page will show you that many others do not share my belief.
RAID 1 (Mirroring): With mirroring, the data on one disk is duplicated on a second disk. Writes go to both disks and reads can come from either disk. In NetRAID terminology, a RAID 1 array consists of exactly two disks. If one disk fails, all the information is available from the other disk. I use a RAID 1 array for my system disk, which also contains the F: partition with the IS Logs, and a paging partition. Since our example requires 4 disks and the Information Store will require even more than that, let’s move on.
RAID 10 (Spanning with Mirrored Arrays): The way to make a larger mirrored disk array is to use striping (RAID 0) across mirrored pairs (RAID 1). The NetRAID documentation calls this RAID 10 (1 and 0, get it?), but the NetRAID setup programs never use that term. In the setup programs, you create adjacent RAID 1 arrays and then “span” up to four of them to form the maximum 8-disk RAID 10 array. The mirrored pairs are really striped and not concatenated, so the Stripe Depth chosen for the first pair controls the striping between the pairs.
| Disk 1 | Disk 2 (Mirror of Disk 1) | Disk 3 | Disk 4 (Mirror of Disk 3) |
|--------|---------------------------|--------|---------------------------|
| 0K     | 0K                        | 16K    | 16K                       |
| 32K    | 32K                       | 48K    | 48K                       |
| 64K    | 64K                       | 80K    | 80K                       |
| 96K    | 96K                       | 112K   | 112K                      |
| …      |                           |        |                           |

Table 4. File system offsets in a 4-disk RAID 10 array with Stripe Depth = 16K
And what about the lights? Consider a 4-disk RAID 10 array. A 4K read will light up a single disk, and it could be on either side of a mirrored pair, since both have the same data. Up to 4 reads could be happening from the array at the same time. A 4K write will light up two disks, since both sides of a pair have to get a copy of the new data. Only 2 writes can happen at the same time. So the FGL model predicts that a 4-disk RAID 10 array reads at 4X and writes at 2X over a single disk. You can see that it’s a pretty simple model, and I enjoy watching the flashing lights in my head.
Data protection in a RAID 10 array is pretty good, since it can tolerate the loss of any single disk without losing data or performance. When a failed disk is replaced, the controller will copy the good side of the mirror to the new disk to reestablish the pair. A RAID 10 array can tolerate a double failure as long as it doesn’t take out both sides of a mirrored pair.
The downside of RAID 10 (and RAID 1) is cost. For every two disks you buy, you get the capacity of one. The usable capacity of our 4-disk example is 2 disks. With NetRAID’s limit of 8 disks in a RAID 10 array, the capacity is usually not sufficient for an Exchange Information Store. My solution is to use NT stripe sets to stripe across the RAID 10 arrays implemented on the DAC. But I refuse to call it RAID 100 for Mirrors Striped and Striped again. I just call it RAID 10. My 17600-user system used NT striping to combine four 6-disk RAID 10 arrays into a 24-disk G: drive. The 12600-user system used NT striping to combine two 6-disk RAID 10 arrays for the G: drive.
RAID 5 (Striping with Distributed Parity): RAID 5 is apparently an often-recommended configuration but it’s not ideal for the Exchange Information Store. It is striped like RAID 0 except that one block in every stripe is used to hold encoded parity information for that stripe. If any block in the stripe is lost (including the parity block), the other blocks can be used to recreate the lost data. The parity blocks are distributed across the disks for performance reasons.
| Disk 1 | Disk 2 | Disk 3 | Disk 4 |
|--------|--------|--------|--------|
| 0K     | 16K    | 32K    | Parity |
| 48K    | 64K    | Parity | 80K    |
| 96K    | Parity | 112K   | 128K   |
| Parity | 144K   | 160K   | 176K   |
| …      |        |        |        |

Table 5. File system offsets in a 4-disk RAID 5 array with Stripe Depth = 16K
And the flashing green lights? As in normal striping, a 4K read only lights up one disk. Parity isn’t checked during normal reads, so there isn’t any activity to the parity block. On our example array, up to 4 different reads could happen at the same time. But writing is a different story. Referring to Table 5, a write at 16K on Disk 2 must also write a new parity block on Disk 4. One way to do that is to read the corresponding data from Disk 1 and Disk 3, combine it with the Disk 2 data to calculate a new parity, and then write the new data to Disk 2 and the parity to Disk 4. Did you see all the lights? All four disks lit up to perform this write, so only one write can be happening at a time. The FGL model predicts that a 4-Disk RAID 5 array can read at 4X but only write at 1X. Remember, we’re still talking about small random I/Os.
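Here is a minimal sketch (my own illustration, assuming simple XOR parity) of the write path just described, with a comment marking each green light:

```python
from functools import reduce

# Small-write path for the stripe row containing offset 16K (Table 5): read the
# peer data blocks from Disks 1 and 3, XOR them with the new data to compute
# fresh parity, then write the data block to Disk 2 and the parity to Disk 4.
def xor_blocks(*blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def raid5_small_write(disks, stripe_row, data_disk, parity_disk, new_data):
    peer_disks = [d for d in range(len(disks)) if d not in (data_disk, parity_disk)]
    peers = [disks[d][stripe_row] for d in peer_disks]   # reads: two disks light up
    new_parity = xor_blocks(new_data, *peers)
    disks[data_disk][stripe_row] = new_data              # write: the data disk lights up
    disks[parity_disk][stripe_row] = new_parity          # write: the parity disk lights up
    # ...and that's all four green lights for one small write.

# Tiny 4-disk example with one stripe row of 4-byte "blocks".
disks = [[b"\x00" * 4] for _ in range(4)]
raid5_small_write(disks, stripe_row=0, data_disk=1, parity_disk=3, new_data=b"\x12\x34\x56\x78")
assert xor_blocks(*[d[0] for d in disks[:3]]) == disks[3][0]   # parity still covers the data
```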
If we extend this algorithm to larger arrays, the read performance goes up with the number of disks, but the array always writes at 1X. Some DACs use an algorithm that writes at 2X on an 8-disk RAID 5 array, but has some drawbacks at initialization time. A better way around the write performance problem is to use RAID 50, described below.
A RAID 5 array can tolerate the loss of one drive without losing data, but performance will suffer while the drive is out because the missing data has to be reconstructed from the other drives on every read. Any double failure on a RAID 5 array will cause data loss.
The minimum size for a RAID 5 array is 3 disks. The NetRAID-3Si limits the maximum size to 8 disks. The cost of the redundancy is one extra disk per array, so our 4-disk array has the capacity of 3 disks.
RAID 50 (Spanning with RAID 5 arrays): By striping across small RAID 5 arrays, we get back some of the write performance without paying as much as we would for a RAID 10 solution. For example, with an 8-disk RAID 50 implemented by spanning two 4-disk RAID 5 arrays, the FGL model predicts performance as 8X read and 2X write, with the capacity of 6 disks. You could also create a 9-disk RAID 50 by spanning three 3-disk RAID 5 arrays. The FGL prediction for that array would be 9X read and 3X write with the capacity of 6 disks. For comparison, an 8-disk RAID 10 array has an FGL predicted performance of 8X read and 4X write, but only provides the capacity of 4 disks.
The chart below shows the performance using the Flashing Green Lights (FGL) model for RAID 0, 10, 5, and 50 assuming 66% random reads. Remember that the simple-minded model doesn’t consider the effects of caching and other DAC features, and it assumes that pending read operations are evenly distributed across the drives. Armed with the simple-minded FGL model of RAID performance, you are ready to answer burning questions like “Which is faster, RAID 10 or RAID 5?” Your appropriate response should be “Define every aspect of the system configuration and the benchmark, give me the hardware, and I’ll run an experiment.” Or, if you prefer fewer words: “It depends.”
Graph 1. FGL Performance Predictions (1/(.66/ReadRatio + .34/WriteRatio))
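The formula under Graph 1 is just a workload-weighted harmonic combination of the read and write multipliers. Here is a minimal sketch (my own arithmetic) that reproduces the predictions quoted after the case study:

```python
# The FGL mix formula from Graph 1, applied to the read/write multipliers
# the model predicts for the case-study configurations.
def fgl_mix(read_x, write_x, read_fraction=0.66):
    # Harmonic combination weighted by the 66% read / 34% write workload.
    return 1 / (read_fraction / read_x + (1 - read_fraction) / write_x)

for name, (reads, writes) in {
    "8-disk RAID 10 (8X read, 4X write)": (8, 4),
    "5-disk RAID 5  (5X read, 1X write)": (5, 1),
    "8-disk RAID 50 (8X read, 2X write)": (8, 2),
}.items():
    print(f"{name}: ~{fgl_mix(reads, writes):.1f}X")
# Prints roughly 6.0X, 2.1X, and 4.0X -- the 6X/2X/4X figures quoted below.
```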
On my benchmark systems, I used RAID 10. It is simply the coolest thing going for the Exchange Information Store. After we discuss the rest of the configuration issues, I’ll show you the results of some experiments that support my enthusiasm. I ran an 8000-user benchmark using an 8-disk RAID 10 array for the G: drive. Then I ran it with a 5-disk RAID 5 array (lower cost, equal capacity), and with an 8-disk RAID 50 array (equal cost, greater capacity). The results are in the case study below.
The other decisions during DAC configuration are Stripe Depth, Read Policy, Write Policy, and the mysterious Cache Policy. Here are my recommendations.
Stripe Depth: 16K. The usual recommendation is to use large stripes for sequential access and small stripes for random access. But on the NetRAID predecessor to the NetRAID-3Si, that recommendation often ended with “…but use 64K anyway”. Considering both sets of advice, I set up my NetRAID-3Si benchmark system with 32K stripes. When Microsoft rejected my 12600-user benchmark for violating the Average Disk Queue Length limit, I spent a busy weekend looking for the tweak that would let me publish a result in time for the announcement. It turned out to be Stripe Depth. When I decreased the Stripe Depth from 32K to 16K, the improvement in Information Store disk array performance was worth a 9% reduction in the Average Disk Queue Length.
Here’s what didn’t work: a Stripe Depth of 8K was slower than 16K, even though the small random I/Os to the Information Store would suggest that smaller should be better. I also tried 128K stripes just in case the conventional wisdom was completely wrong, but it wasn’t. The Average Disk Queue Length shot up to 16 with 128K Stripes.
There is one minor consideration for folks who do a lot of benchmarks. Before every benchmark, I restore a fresh database from a backup on another disk array. That copy goes noticeably faster with big stripes.
Write Policy: Write-Back. The choices are Write-Through and Write-Back. In a Write-Through cache, data is sent to the disk at the same time it is cached. I suspect that the DAC doesn’t report the write complete until it’s on the disk. In a Write-Back cache, data doesn’t get written to the disk until something forces it out of the cache. Write operations complete without having to wait on the disk. But a Write-Back cache is only safe if the cache is protected against power failure. Remember, Exchange thinks the data is safely on the disk, when it might still be sitting in the DAC cache. The NetRAID-3Si has battery backup, so I used Write-Back caching for my benchmarks. The NetServer that I used for my 12600-user benchmark had a built-in NetRAID DAC. But it didn’t have battery backup. So I plugged in a NetRAID-3Si and cabled it to the internal hot-swap bays in the NetServer, just so I could use Write-Back caching. The 8000-user case study includes a benchmark run using Write-Through caching.
Read Policy: Adaptive. There are three settings: Read-Ahead, Normal (no read-ahead), and Adaptive (read-ahead when previous accesses were sequential). With Read-Ahead, data beyond the request are read and cached, on the assumption that they’ll be requested soon. I didn’t want to use Read-Ahead for random I/O, but I didn’t want to rule it out completely in case the benchmark had its sequential moments. So I used Adaptive Read-Ahead. A coworker suggested that I shouldn’t trust the firmware to adapt correctly, so I later set up a test with Normal Read Policy (no read-ahead). There was no difference in performance, so I still use Adaptive Read-Ahead.
Cache Policy: Cached I/O. You can choose Direct I/O and Cached I/O. The NetRAID documentation says this about Direct I/O: “Read Data is only cached if read repeatedly”. Cached I/O caches everything and follows the Read and Write Policies. I have not tested Direct I/O with the Exchange Information Store, so it might be the next performance frontier.
Changing the DAC configuration: I change array configurations a lot. If you just want to change the Read, Write, or Cache Policy, you can do it for individual logical drives using NetRAID Assistant, and you don’t need to reboot. I think I remember reading somewhere that you shouldn’t change parameters while the disk subsystem is under load.
If you want to change the RAID level or the Stripe Depth, or you want to rearrange your disks, you have to redo the configuration of everything on the DAC. Since my NT system disk is also on the DAC, I always approach this with some hesitation. Here’s my typical procedure for something like changing the G: drive from RAID 5 to RAID 10:
1. Make a backup. (Just kidding, I never make backups on my test systems.)
2. Stop all the services and then use Disk Administrator to delete the drive that you’re going to change. This might be superstition.
3. Write down the configuration for any array that you want to keep (like the RAID 1 array that has your system on it). Make sure you know which drives are in the array in what order, the RAID level, and the Stripe Depth. Everything else is adjustable later. You can get this from NetRAID Assistant while you’re still in NT, or from NetRAID Express during boot up.
4. Reboot, and use <Control>M during the boot screens to enter NetRAID Express. It’s a goofy DOS-like interface, but I’ve come to love it.
5. View the configuration one last time to make sure you have it written down correctly.
6. Select New Configuration. This step always makes me a little nervous.
7. Make sure you correctly set up the arrays that you wanted to keep intact.
8. Set up the other arrays and save the configuration as you leave the setup screen. If you made a mistake, go back to Select New Configuration again.
9. Initialize ONLY the new arrays. Don’t make a mistake here, or you’ll wish you hadn’t just laughed about step 1.
10. Reboot.
11. Use Disk Administrator to set up the new arrays in NT. The ones you kept should already be there just as they were.
Of course, if you mess up, you’ll lose your system disks or worse. I’d recommend that you practice this five or six times before it matters. There’s probably a better way, but this method works for me.
The logical arrays that you configure on the NetRAID look like physical disks to NT. The Disk Administrator program allows you to allocate partitions on them or combine them into NT stripe sets.
Partition setup: Here’s a trick that you need to know. When you enter Disk Administrator for the first time after reconfiguring the NetRAID-3Si, the new arrays will appear as free space. Right click and select “Create Extended” on each one. Ignore the warnings about large disks. Disk Administrator still shows the disk as free space but the hash marks change direction. Now go ahead and create logical drives or select disks for a stripe set. If you don’t create the extended partition first, the DAC BIOS and NT’s mysterious DOS compatibility stuff will conspire to misalign your partition with the blocks on the disks. When this happens, some of the 4K I/O operations will cross Stripe Depth boundaries and hit two disks instead of one. On simple 4K random I/O tests, I’ve seen a 10% performance drop when the partition is not aligned with the DAC.
How do you tell whether you’re aligned or not? A resource kit program called diskmap will show you the starting offset of the partition. Usage is diskmap /d<drive#> /h (drive# matches the Disk Administrator number, /h gives hexadecimal output). Here’s what it says when I create a primary partition without first creating an extended partition.
Cylinders  HeadsPerCylinder  SectorsPerHead  BytesPerSector  MediaType
      452                ff              3f             200          c

TrackSize = 7e00, CylinderSize = 7d8200, DiskSize = 21e3ba400 (8675MB)
Signature = 0xa922bc5f

StartingOffset  PartitionLength  StartingSector  PartitionNumber
     000007e00        21e3b2600              3f                1
And here’s the result when I create the extended partition before creating my logical drive:
Cylinders  HeadsPerCylinder  SectorsPerHead  BytesPerSector  MediaType
      452                ff              3f             200          c

TrackSize = 7e00, CylinderSize = 7d8200, DiskSize = 21e3ba400 (8675MB)
Signature = 0xa922bc5e

StartingOffset  PartitionLength  StartingSector  PartitionNumber
     0007e0000        21dbda400              3f                1
The important number is StartingOffset. The largest alignment boundary for 0x7e00 is 0x200, or 512 bytes. (What’s the highest power of two that will divide the offset with no remainder?) In fact, it starts 512 bytes short of being aligned on a 32K boundary. So this partition won’t be aligned with our 16K stripes.
The largest alignment boundary for 0x7e0000 is 0x20000 or 128K. So the logical drive in the extended partition will be aligned with any Stripe Size up to 128K. For well-behaved applications that do their I/O at aligned offsets, this partition setup will avoid the inefficiency of crossing stripe boundaries unnecessarily.
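If you’d rather not do the power-of-two arithmetic in your head, here is a minimal helper sketch (my own, not part of diskmap) that reports the largest alignment boundary for a StartingOffset:

```python
# Largest power-of-two boundary that evenly divides a partition's StartingOffset.
def largest_alignment(starting_offset_hex):
    offset = int(starting_offset_hex, 16)
    return offset & -offset      # lowest set bit = largest power-of-two divisor

for raw in ("000007e00", "0007e0000"):
    boundary = largest_alignment(raw)
    print(f"StartingOffset 0x{int(raw, 16):x} is aligned on a {boundary}-byte boundary")
# 0x7e00   -> 512 bytes, so some 4K I/Os will straddle our 16K stripes.
# 0x7e0000 -> 131072 bytes (128K), aligned for any Stripe Depth up to 128K.
```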
You can use this trick when you’re creating a single logical drive per DAC disk (remember that NT sees our DAC disk array as one physical disk). I haven’t figured out how to guarantee alignment of subsequent logical drives on the same disk. The 8000-user case study includes a run with an unaligned G: drive to show the effect of not doing this simple optimization.
Formatting: I don’t use Disk Administrator to format the drives. Instead I use format /fs:ntfs /a:16k where the /a option matches my Stripe Depth. This might be superstition. I don’t think I ever proved that it mattered.
I set up an 8000-user benchmark to demonstrate the effects of some of the configuration choices. The server was configured with 4 processors and 1 GB of memory to make sure that the disk subsystem was the dominant bottleneck. The first configuration does everything right (RAID 10, Write-Back, Aligned) and scores 8000 MMB with a 217 millisecond response time. The disk queue length of 6.9 is less than the number of disks.
The second configuration uses a 5-disk RAID 5 array with the same capacity as the first configuration, saving the cost of 3 disk drives. The 95th percentile response time went up by almost 4X but is still within benchmark limits. But the disk queue length of 20.6 is 4X above the number of disks, so the run isn’t valid for score. The send queue also shows the effect of the much lower disk performance of the RAID 5 array.
The third configuration uses an 8-disk RAID 50 array that costs the same as the first configuration, but provides 1.5 times the capacity (it stripes two 4-disk RAID 5 arrays with 3 disks of capacity each). The response time performance is much better than the RAID 5 case, but nearly 2X worse than the RAID 10 configuration. The send queue shows no problems, but the disk queue length is about 25% too high for a valid benchmark. The RAID 50 system would definitely be usable, but with worse response times than the RAID 10 system.
The fourth configuration differs from the first by using Write-Through caching on the DAC. The response time is about 10% worse, the send queue isn’t affected at all, but the disk queue length is about 25% too high for a valid benchmark. In this case… let me emphasize that… IN THIS CASE, the difference between Write-Through and Write-Back caching looks like more of a “benchmark tweak” than a real-world tweak. Remember that in this benchmark we’re talking about small, random I/Os delivered in large groups. I’ve seen real-world examples where Write-Back caching on the DAC more than doubled the performance over Write-Through caching. It just depends on the application.
The fifth configuration differs from the first by having an unaligned partition for the G: drive. The response time is about 20% worse, the send queue is up a little, and the disk queue length disqualifies the benchmark with a heart-breaking 5% violation of the limit. In this case, the alignment is a free performance boost, so I’ll defend it as being more than just a “benchmark tweak”.
| RAID Config for G: Drive | Cache Write Policy | Partition Setup | Mean Response Time | 95th Percentile Response Time | Avg Disk Queue Length for G: Drive | IS Private Send Queue Size |
|--------------------------|--------------------|-----------------|--------------------|-------------------------------|------------------------------------|----------------------------|
| 8 Disks RAID 10          | Write-Back         | Aligned         | 52 msec            | 217 msec                      | 6.9                                | 0.9 Avg, 5 Peak            |
| 5 Disks RAID 5           | Write-Back         | Aligned         | 140 msec           | 806 msec                      | 20.6 (Violates benchmark limit)    | 7.9 Avg, 150 Peak          |
| 8 Disks RAID 50          | Write-Back         | Aligned         | 86 msec            | 422 msec                      | 10.2 (Violates benchmark limit)    | 1.7 Avg, 8 Peak            |
| 8 Disks RAID 10          | Write-Through      | Aligned         | 67 msec            | 240 msec                      | 10.4 (Violates benchmark limit)    | 0.9 Avg, 5 Peak            |
| 8 Disks RAID 10          | Write-Back         | Unaligned       | 58 msec            | 268 msec                      | 8.4 (Violates benchmark limit)     | 1.1 Avg, 5 Peak            |

Table 6. 8000-user Exchange Benchmark Case Study
Quick! What’s the FGL Read/Write performance of an 8-disk RAID 10 on random I/O? (Answer: 8X/4X)
How about a 5-disk RAID 5? (Answer: 5X/1X)
And an 8-disk RAID 50? (Answer: 8X/2X)
(This will be on the test.)
From Graph 1, the relative performance predictions for these three configurations are roughly:
· 8-disk RAID 10: 6X
· 5-disk RAID 5: 2X
· 8-disk RAID 50: 4X
While all the configurations in the case study delivered the required rate of about 340 I/O ops/sec, you can see the effect of their performance differences in the response times and queue lengths of the results. It’s not a precise way to predict real results, but I hope the FGL model will give you a feel for the performance of the different RAID configurations.
Despite being the result of a burning desire to get a good publishable score on the Exchange Messaging Benchmark, most of the recommendations given here will serve you well in the real world, too. At least to the extent that your real world involves Exchange 5.5 on HP NetServers with NetRAID-3Si controllers. You don’t want to stretch these results too far. Disk subsystem performance is extremely dependent on I/O sizes and patterns. While 16K was the right stripe size for me, different applications, or even a different message workload on Exchange might be better served by a different (probably larger) size.
The benchmark uses a pretty small Information Store compared to its message load (2.6 Mbytes/user) and by doing so emphasizes performance over capacity in the disk subsystem. A much larger RAID 10 array that accommodates bigger user mailboxes might have more performance than you need (is that possible?) for a given load. A much larger RAID 50 array might have as much performance as you need for less cost per megabyte. (Although why anyone would ever choose anything other than RAID 10 is beyond me…)
The benchmark also ignores other practical issues like multiple server traffic, huge message sizes, public message serving, and backup and recovery times. An Exchange Server acting as a bulletin board for satellite maps will probably have much different I/O characteristics than the ones exhibited by this benchmark.
Given all that, I’ll say it again: RAID 10 is the coolest, 16K stripes work best, and don’t forget to extend your partitions.
Thanks to Dale McAtee of HP for helping me through my first Exchange benchmark when all appeared to be lost, to Jerry McKinney of HP for teaching me how to set up NT systems again and again, and to Ron Jones of HP for exploring various aspects of NetRAID performance and telling me some of the secrets.