COMMUNICATOR 3000 MPE/iX General Release 5.0 (Core Software Release C.50.00)
Memory Page Deallocation (MPD)
by Steve Flynn
Systems Technology Division
MPD and Current Systems
This article presents an overview of Memory Page Deallocation (MPD), a
new feature available with MPE/iX Release 5.0. Rather than deallocating
an entire bank of memory when an error occurs, the system now
deallocates only the affected page.
When an existing HP 3000 is upgraded to MPE/iX 5.0, it also benefits from
the MPD software. Most of the MPD operations described below work the
same way on these systems. Please refer to the last section of this
article for a discussion of the minor exceptions.
Memory Failures.
Memory boards are subject to two types of failures: hard errors and soft
errors. Hard errors are caused by the failure of a single chip within a
memory board, which affects all words associated with that chip. Soft
errors occur when a bit within a word changes value. This is typically
caused by alpha particles emitted by the casing material surrounding
the chip.
HP's current memory design is single-bit correct, double-bit detect. It
is important to note that our ECC design does not perform error
correction on the memory cell itself, but fixes the value in the cache
line. The memory cell still contains the failure. If this is a soft
failure, the data in memory is corrected when the cache line is written
back to memory. If this is a hard failure, the memory cell is always in
error.
In either case, if another failure were to occur on the same word, the
error would go from single-bit (correctable) to double-bit (detectable
only) and cause the system to fail the next time the word is read. The
purpose of page deallocation is to permanently remove from use those
pages that contain single-bit or double-bit errors.
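The single-bit correct, double-bit detect behavior can be illustrated with
a small sketch. The article does not describe HP's actual ECC code, so this
example substitutes a textbook extended Hamming (8,4) code as a stand-in;
the names encode and decode are hypothetical. Note that decode returns a
corrected copy and leaves the stored word untouched, which mirrors the
point that the memory cell itself still contains the failure.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (a list of bits).

    Index 0 holds the overall parity bit; indexes 1, 2, and 4 hold the
    Hamming check bits; indexes 3, 5, 6, and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(stored):
    """Return (data, status) without modifying the stored word."""
    syndrome = 0
    for position in range(1, 8):
        if stored[position]:
            syndrome ^= position
    parity_odd = sum(stored) % 2 == 1

    if syndrome == 0 and not parity_odd:
        word, status = stored, "ok"
    elif parity_odd:
        # Odd number of flipped bits: treat as a single-bit error and fix
        # a copy. A syndrome of 0 here means the parity bit itself flipped.
        word = list(stored)
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        return None, "double-bit error detected"

    data = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return data, status
```

Flipping one bit of an encoded word still decodes to the original data
with status "corrected"; flipping a second bit of the same word yields
"double-bit error detected", the condition that page deallocation is meant
to avoid.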
Components of MPD
MPD provides a mechanism by which memory pages containing errors can be
made unavailable for system use. A memory page is 4 KB in size and is
deallocated if it contains one of the following errors:
* A solid (hard) single-bit error
* A soft error that recurs within a 24-hour period
* A double-bit error
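As a minimal sketch of the rules above, the following maps a failing
physical address to its 4 KB page and applies the three deallocation
criteria. The function names, string labels, and flags are hypothetical
and chosen only for illustration; they are not MPE/iX interfaces.

```python
PAGE_SIZE = 4096  # 4 KB page size, as stated in the article


def page_number(physical_address):
    """Return the number of the page that contains the given address."""
    return physical_address // PAGE_SIZE


def should_deallocate(error_type, reproduced=False, recurred_within_24h=False):
    """Apply the three deallocation criteria to one reported error.

    error_type is "single-bit" or "double-bit" (labels are assumptions).
    reproduced means the single-bit error is solid; recurred_within_24h
    means a soft error was already seen at this location in the last day.
    """
    if error_type == "double-bit":
        return True
    if error_type == "single-bit":
        return reproduced or recurred_within_24h
    raise ValueError("unknown error type: " + repr(error_type))
```

For example, an isolated soft single-bit error does not meet any of the
criteria, while the same error recurring within 24 hours does.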
Numerous system components work together to implement memory page
deallocation:
Page Deallocation Table (PDT).
This is a table that contains an entry for each memory page that has been
deallocated, at some point in time, due to an error. Each entry contains
the address and the nature of the error (single or double-bit).
One important feature of this table is that it is implemented in
Non-Volatile RAM, thus preserving deallocated pages between system boots.
NOTE Older systems do not implement the PDT.
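One way to picture the PDT is as a small fixed-capacity table keyed by
address. The sketch below is an assumption-laden model: the article states
only that each entry holds an address and an error type and that the table
has a size, so the class names, the capacity parameter, and the behavior
when the table is full are all hypothetical. Persistence in NVRAM is not
modeled.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    SINGLE_BIT = "single-bit"
    DOUBLE_BIT = "double-bit"


@dataclass(frozen=True)
class PDTEntry:
    address: int       # address of the failing location
    kind: ErrorKind    # nature of the error


class PageDeallocationTable:
    """In-memory model of a PDT with a fixed number of entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = {}

    def add(self, address, kind):
        """Add an entry; return False if the address is already present."""
        if address in self._entries:
            return False
        if len(self._entries) >= self.capacity:
            # Assumption: a full table rejects new entries.
            raise OverflowError("PDT is full")
        self._entries[address] = PDTEntry(address, kind)
        return True

    def entries(self):
        return list(self._entries.values())

    def __len__(self):
        return len(self._entries)
```

Keeping both capacity and the current entry count available matches the
two figures that MEMDIAG and LOGTOOL are described as displaying.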
Memory Selftest.
Each time the system is reset, the memory selftest executes. If it finds
a double-bit error, the address is entered into the PDT along with the
fact that this was a double-bit error.
MEMLOGP.
The Memory Logging Process, MEMLOGP, is a process that periodically
(every hour by default) checks the status of each memory controller on
the system for occurrences of single-bit errors.
MEMDIAG/LOGTOOL.
Information about deallocated pages is kept in two places: the PDT, which
is NVRAM based, and the MEMLOGP memory log file, which is disk based.
MEMDIAG and LOGTOOL can be used to display the contents of the memory
log file. Information such as memory board slot number, physical address,
page number, and error type is displayed. The size of the PDT and the
number of entries currently in the table are also displayed.
O/S Memory Manager.
The O/S memory manager is involved during two phases, system boot and
while the system is running.
During the early portion of boot, the memory manager reads the PDT and
deallocates any pages found there.
Once the system is up, the memory manager provides services to MEMLOGP to
allow pages to be deallocated online.
Predictive.
HP Predictive Support analyzes internal error logs on disk drives, system
log files and memory logs for error trends. When an error rate exceeds
its threshold, an EVENT is generated. HP Response Center Engineers and
Customer Engineers analyze event information and take appropriate action
to solve the problem.
MEMSCAN is a software module within Predictive that scans system memory
log files. MEMSCAN provides page deallocation trending information to
support engineers, such as the status of the PDT size and the
identification of boards or banks that have a significant number of
pages deallocated. Bank deallocation or board replacement is recommended
if the total number of deallocated pages exceeds a certain threshold.
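The kind of trending described here can be sketched as a simple count of
deallocated pages per board compared against a threshold. The article does
not give the threshold value or the log record layout, so the function
name, the (board slot, page number) tuple format, and the threshold
parameter below are all hypothetical.

```python
from collections import Counter


def boards_to_review(deallocated_pages, threshold):
    """Return the board slots whose deallocated-page count exceeds threshold.

    deallocated_pages is an iterable of (board_slot, page_number) tuples,
    an assumed stand-in for records read from the memory log file.
    """
    per_board = Counter(slot for slot, _page in deallocated_pages)
    return sorted(slot for slot, count in per_board.items() if count > threshold)
```

With three deallocated pages on the board in slot 2 and one on the board in
slot 5, a threshold of 2 would flag only slot 2.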
General Operation
MPD comes into effect while the system is being started as well as when it
is online.
During system startup, memory is tested and any pages with bad locations
are made unavailable to the system.
While the system is online, an attempt is made to correct memory locations
containing soft errors (scrubbing) and to deallocate pages that contain
solid errors.
System Startup.
The following shows the general system startup flow that occurs with
respect to MPD.
1. Memory selftest executes. If any double-bit errors are discovered
during testing, and there is not an entry in the PDT corresponding
to this address, an entry is made.
2. During the boot process, the Operating System obtains the contents
of the PDT. Each page in the PDT is made unavailable for
allocation by the system's memory manager.
3. MEMLOGP reads the PDT and adds to the memory log file any new PDT
entries (discovered by selftest) that are not already contained
there.
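The three startup steps above can be sketched as follows. This is an
illustrative model only: the PDT and the memory log file are represented
as plain dictionaries mapping an address to an error label, and the
function name and labels are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KB page size, as stated in the article


def system_startup(selftest_double_bit_addresses, pdt, memory_log):
    """Model the MPD startup flow and return the set of unavailable pages.

    pdt and memory_log are dicts of address -> error label; both are
    updated in place.
    """
    # Step 1: selftest adds double-bit errors not already in the PDT.
    for address in selftest_double_bit_addresses:
        pdt.setdefault(address, "double-bit")

    # Step 2: every page named in the PDT is made unavailable.
    unavailable_pages = {address // PAGE_SIZE for address in pdt}

    # Step 3: MEMLOGP copies PDT entries missing from the memory log file.
    for address, kind in pdt.items():
        memory_log.setdefault(address, kind)

    return unavailable_pages
```

In this model, an address that selftest reports but that is already in
the PDT keeps its existing entry, matching the condition in step 1.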
Online Operation.
The following shows the operation of MPD while the system is online.
1. MEMLOGP wakes up and reads the memory controller status register
to determine whether a single-bit error has been logged.
2. If so, MEMLOGP requests the O/S memory manager to release the
affected page for testing.
3. If the O/S cannot release the page, MEMLOGP logs the error in the
memory log file as it does today.
4. If the O/S does release the page, MEMLOGP performs a scrubbing
operation (write/read test) on the page.
5. If the single-bit error is reproduced (hard error), the page is
entered into the PDT and memory log file. A request is made to
the O/S memory manager to make this page unavailable for system
use.
6. If the single-bit error is not reproduced (soft error) but another
soft error was detected at this location within the previous 24
hours, the page is entered into the PDT and memory log file. A
request is made to the O/S memory manager to make this page
unavailable for system use.
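The online decision flow in the steps above can be sketched as a single
function. This is a simplified model, not MEMLOGP's implementation: the
outcome of the page release request and of the scrubbing test are passed
in as booleans, the data structures and return strings are hypothetical,
and the handling of an isolated soft error (remember the time and return
the page to use) is an assumption, since the article does not describe
that case.

```python
from datetime import datetime, timedelta

SOFT_ERROR_WINDOW = timedelta(hours=24)


def handle_single_bit_error(address, now, page_released, error_reproduced,
                            last_soft_error, pdt, memory_log):
    """Return the action taken for one logged single-bit error.

    last_soft_error maps address -> time of the previous soft error.
    pdt maps address -> error label; memory_log is a list of tuples.
    """
    if not page_released:
        # Step 3: the page cannot be tested, so only log the error.
        memory_log.append((now, address, "single-bit error logged"))
        return "logged only"

    if error_reproduced:
        # Step 5: scrubbing reproduced the error, so it is a hard error.
        reason = "hard error"
    else:
        # Step 6: soft error; check for a repeat within 24 hours.
        previous = last_soft_error.get(address)
        last_soft_error[address] = now
        if previous is None or now - previous > SOFT_ERROR_WINDOW:
            return "scrubbed"  # assumption: page returns to normal use
        reason = "repeated soft error"

    pdt[address] = "single-bit"
    memory_log.append((now, address, reason))
    return "deallocated"
```

Passing the release and scrub outcomes as arguments keeps the sketch free
of any guess about the memory manager's real service interface.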
MPD and Current Systems
The one exception to MPD operation is that older systems were not
designed with a Page Deallocation Table. Because of this, the system
startup routine is slightly different. If the memory selftest detects a
double-bit error during system startup, the system does not boot (the
same operation as today), unlike the 3000 991/995. However, while the
system is running, MEMLOGP keeps track of deallocated pages in its
disk-based memory log file. During the next startup, these pages are
deallocated before the system comes up.
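For contrast with systems that have a PDT, the startup behavior on older
systems can be sketched as below. The function name and the use of an
exception to represent a failed boot are illustrative assumptions; the
memory log file is modeled as a simple list of addresses.

```python
PAGE_SIZE = 4096  # 4 KB page size, as stated in the article


def legacy_startup(selftest_double_bit_addresses, memory_log_addresses):
    """Model startup on a system without a PDT.

    Returns the set of pages deallocated before the system comes up.
    """
    if selftest_double_bit_addresses:
        # Without a PDT, selftest has nowhere to record the bad page,
        # so a double-bit error stops the boot.
        raise RuntimeError("double-bit error found by selftest; system does not boot")

    # Pages recorded by MEMLOGP on disk are deallocated before the system is up.
    return {address // PAGE_SIZE for address in memory_log_addresses}
```

Here the disk-based memory log file is the only record that survives a
restart, which is why it drives deallocation at boot.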