
Memory Page Deallocation (MPD) [ COMMUNICATOR 3000 MPE/iX General Release 5.0 (Core Software Release C.50.00) ] MPE/iX Communicators


COMMUNICATOR 3000 MPE/iX General Release 5.0 (Core Software Release C.50.00)

Memory Page Deallocation (MPD) 

by Steve Flynn 
Systems Technology Division 

MPD and Current Systems 

This article presents an overview of Memory Page Deallocation (MPD), a
new feature available with MPE/iX Release 5.0.  With MPD, the system
deallocates only the page that contains a memory error, rather than an
entire bank of memory.

When an HP 3000 is upgraded to MPE/iX 5.0, it also benefits from the MPD
software.  Most of the MPD operations described below work in a similar
manner on an upgraded system.  Please refer to the last section of this
article for a discussion of the minor exceptions to MPD operation.

Memory Failures.   

Memory boards are subject to two types of failures: hard errors and soft
errors.  Hard errors are caused by a single chip failure within a memory
board, causing failures on all words associated with that chip.  Soft
errors occur when a bit within a word changes value.  This is typically
caused by alpha particles emitted by the casing material that surrounds
the chip.

HP's current memory design is single-bit correct, double-bit detect.  It
is important to note that our ECC design does not perform error
correction on the memory cell itself, but fixes the value in the cache
line.  The memory cell still contains the failure.  If this is a soft
failure, the data in memory is corrected when the cache line is written
back to memory.  If this is a hard failure, the memory cell is always in
error.

In either case, if another failure were to occur on the same word, the
error would go from single-bit (correctable) to double-bit (detectable
only), and the system would fail the next time the word is read.  The
purpose of page deallocation is to permanently remove from use those
pages of memory that contain single-bit or double-bit errors.
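
The single-bit correct, double-bit detect behavior can be illustrated
with a small sketch.  It uses a textbook extended Hamming(8,4) code as a
stand-in; HP's actual ECC word layout is not described in this article,
so the bit positions and the names encode and read_word are assumptions
made for illustration only.  As in the description above, the read
returns a corrected copy and leaves the stored word unchanged.

```python
# Textbook extended Hamming(8,4) SECDED code, NOT HP's actual ECC layout.
# Index 0 = overall parity, 1/2/4 = Hamming parity bits, 3/5/6/7 = data bits.


def encode(nibble):
    """Encode a 4-bit value as a list of 8 code bits."""
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2      # makes the total number of 1 bits even
    return code


def read_word(stored):
    """Return (status, value) for a stored word without modifying it."""
    syndrome = 0
    for position in range(1, 8):
        if stored[position]:
            syndrome ^= position     # XOR of the positions holding a 1
    parity_error = sum(stored) % 2 == 1

    if syndrome != 0 and not parity_error:
        # Two bits are wrong: detectable, but not correctable.
        return "double-bit detected", None

    corrected = list(stored)         # the "cache line" copy; stored is untouched
    status = "ok"
    if parity_error:
        corrected[syndrome] ^= 1     # syndrome 0 means the parity bit itself
        status = "single-bit corrected"
    value = (corrected[3] | (corrected[5] << 1)
             | (corrected[6] << 2) | (corrected[7] << 3))
    return status, value


cell = encode(0b1011)
cell[5] ^= 1                         # first failure in the memory cell
print(read_word(cell))               # ('single-bit corrected', 11)
cell[2] ^= 1                         # second failure on the same word
print(read_word(cell))               # ('double-bit detected', None)
```

Because read_word never writes back to the stored word, a hard failure
stays in place, and a second failure on that word is no longer
correctable.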

Components of MPD 

MPD provides a mechanism by which memory pages containing errors can be
made unavailable for system use.  A memory page is 4 Kbytes (4096 bytes)
in size and is deallocated if it contains one of the following errors:

   *   A solid (hard) single-bit error

   *   A soft failure re-occurring within a 24-hour period

   *   A double-bit error
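
The three criteria above can be written as a small decision function.
This is a sketch only: the names page_number and should_deallocate, the
error-kind strings, and the choice to treat exactly 24 hours as still
inside the window are assumptions, not part of MPD itself.

```python
from datetime import datetime, timedelta

PAGE_SIZE = 4096                          # bytes per memory page
SOFT_ERROR_WINDOW = timedelta(hours=24)   # repeat window for soft failures


def page_number(physical_address):
    """Map a physical byte address to the page that contains it."""
    return physical_address // PAGE_SIZE


def should_deallocate(error_kind, now, previous_soft_error=None):
    """Decide whether a page with this error meets a deallocation criterion.

    error_kind is one of the hypothetical labels "solid_single_bit",
    "soft_single_bit" or "double_bit".  previous_soft_error is the time
    of the last soft failure at the same location, or None.
    """
    if error_kind in ("solid_single_bit", "double_bit"):
        return True
    if error_kind == "soft_single_bit":
        if previous_soft_error is None:
            return False
        return now - previous_soft_error <= SOFT_ERROR_WINDOW
    raise ValueError(f"unknown error kind: {error_kind!r}")


now = datetime(1995, 1, 2, 9, 0)
print(page_number(0x2FFF))                                   # 2
print(should_deallocate("soft_single_bit", now,
                        now - timedelta(hours=3)))           # True
```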

Numerous system components work together to implement memory page
deallocation:

Page Deallocation Table (PDT).   

This is a table that contains an entry for each memory page that has been
deallocated, at some point in time, due to an error.  Each entry contains
the address and the nature of the error (single or double-bit).

One important feature of this table is that it is implemented in
Non-Volatile RAM, thus preserving deallocated pages between system boots.
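
As a rough model of the table described above, the sketch below keeps
one entry per address together with the error type.  It is an in-memory
stand-in only: the class and field names, the fixed capacity, and the
behavior when the table is full are all assumptions, and the
Non-Volatile RAM storage itself is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdtEntry:
    address: int        # address of the failing location
    error_type: str     # "single-bit" or "double-bit"


class PageDeallocationTable:
    """Hypothetical in-memory stand-in for the NVRAM-based PDT."""

    def __init__(self, capacity):
        self.capacity = capacity      # assumed fixed table size
        self._entries = {}            # address -> PdtEntry

    def add(self, address, error_type):
        """Record an error; return False if the address is already listed."""
        if error_type not in ("single-bit", "double-bit"):
            raise ValueError(f"unknown error type: {error_type!r}")
        if address in self._entries:
            return False
        if len(self._entries) >= self.capacity:
            # Assumption: the article does not say what a full table does.
            raise OverflowError("PDT is full")
        self._entries[address] = PdtEntry(address, error_type)
        return True

    def entries(self):
        return list(self._entries.values())

    def __len__(self):
        return len(self._entries)


pdt = PageDeallocationTable(capacity=4)
pdt.add(0x0040_2000, "double-bit")
pdt.add(0x0040_2000, "double-bit")    # duplicate address, ignored
print(len(pdt), pdt.capacity)         # 1 4
```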


NOTE Older systems do not implement the PDT.

Memory Selftest.

Each time the system is reset, the memory selftest executes.  If it
finds a double-bit error, the address is entered into the PDT along with
the fact that this was a double-bit error.

MEMLOGP.

The Memory Logging Process, MEMLOGP, periodically (every hour by
default) checks the status of each memory controller on the system for
occurrences of single-bit errors.

MEMDIAG/LOGTOOL.

Information about deallocated pages is kept in two places: the PDT,
which is NVRAM based, and the MEMLOGP memory log file, which is disk
based.  MEMDIAG and LOGTOOL can be used to display the contents of the
memory log file.  Information such as memory board slot number, physical
address, page number and error type is displayed.  The size of the PDT
and the number of entries currently in the table are also displayed.

O/S Memory Manager.

The O/S memory manager is involved during two phases: system boot and
while the system is running.  During the early portion of boot, the
memory manager reads the PDT and deallocates any pages found there.
Once the system is up, the memory manager provides services to MEMLOGP
to allow pages to be deallocated online.

Predictive.

HP Predictive Support analyzes internal error logs on disk drives,
system log files and memory logs for error trends.  When an error rate
exceeds its threshold, an EVENT is generated.  HP Response Center
Engineers and Customer Engineers analyze event information and take
appropriate action to solve the problem.

MEMSCAN is a software module within Predictive which scans system memory
log files.  MEMSCAN provides page deallocation trending information to
support engineers, such as PDT table size status and identification of
boards or banks that have a significant number of pages deallocated.
Bank deallocation or board replacement recommendations occur if the
total number of deallocated pages exceeds a certain threshold.

General Operation

MPD comes into effect while the system is being started as well as when
it is online.  During system startup, memory is tested and any pages
with bad locations are made unavailable to the system.  While the system
is online, an attempt is made to correct memory locations containing
soft errors (scrubbing) and to deallocate, online, pages that contain
solid errors.

System Startup.

The following shows the general system startup flow that occurs with
respect to MPD.

   1.  Memory selftest executes.  If any double-bit errors are discovered
       during testing, and there is not an entry in the PDT
       corresponding to this address, an entry is made.

   2.  During the boot process, the Operating System obtains the
       contents of the PDT.  Each page in the PDT is made unavailable
       for allocation by the system's memory manager.

   3.  MEMLOGP reads the PDT and adds to the memory log file any new PDT
       entries (discovered by selftest) that the log file does not
       already contain.

Online Operation.

The following shows the operation of MPD while the system is online.

   1.  MEMLOGP wakes up, reads the memory controller status register,
       and determines whether a single-bit error has been logged.

   2.  If one has, MEMLOGP requests the O/S memory manager to release
       the page for testing.

   3.  If the O/S cannot release the page, MEMLOGP logs the error in the
       memory log file as it does today.

   4.  If the O/S does release the page, MEMLOGP performs a scrubbing
       operation (write/read test) on the page.

   5.  If the single-bit error is reproduced (hard error), the page is
       entered into the PDT and memory log file.  A request is made to
       the O/S memory manager to make this page unavailable for system
       use.

   6.  If the single-bit error is not reproduced (soft error) and
       another soft error was detected at this location within 24 hours,
       the page is entered into the PDT and memory log file.  A request
       is made to the O/S memory manager to make this page unavailable
       for system use.

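
Steps 2 through 6 of the online operation can be sketched as a single
function.  This is a minimal illustration under stated assumptions, not
MEMLOGP's actual code: the function name, the callables standing in for
the O/S memory manager and the scrubbing test, and the use of a set and
a list for the PDT and the memory log file are all hypothetical.  The
article does not say what is recorded for a first soft error, so the
sketch only remembers when it occurred.

```python
from datetime import datetime, timedelta

SOFT_ERROR_WINDOW = timedelta(hours=24)


def handle_single_bit_error(page, now, release_page, scrub_reproduces_error,
                            deallocate_page, pdt, memory_log, last_soft_error):
    """Process one page for which a single-bit error has been logged.

    release_page, scrub_reproduces_error and deallocate_page are
    hypothetical callables; pdt is a set, memory_log a list, and
    last_soft_error a dict mapping page -> time of the last soft error.
    """
    # Steps 2 and 3: ask for the page; if it cannot be released, only log.
    if not release_page(page):
        memory_log.append((page, "single-bit error, page not released"))
        return "logged_only"

    # Step 4: scrub (write/read test) the released page.
    if scrub_reproduces_error(page):
        reason = "hard single-bit error"             # step 5
    else:
        previous = last_soft_error.get(page)
        last_soft_error[page] = now
        if previous is None or now - previous > SOFT_ERROR_WINDOW:
            return "scrubbed"                        # isolated soft error
        reason = "repeated soft single-bit error"    # step 6

    # Steps 5 and 6: record the page and make it unavailable.
    pdt.add(page)
    memory_log.append((page, reason))
    deallocate_page(page)
    return "deallocated"


pdt, memory_log, last_soft_error, removed = set(), [], {}, []
start = datetime(1995, 1, 1, 12, 0)
for hours in (0, 5):
    outcome = handle_single_bit_error(
        42, start + timedelta(hours=hours),
        release_page=lambda page: True,
        scrub_reproduces_error=lambda page: False,
        deallocate_page=removed.append,
        pdt=pdt, memory_log=memory_log, last_soft_error=last_soft_error)
    print(outcome)      # "scrubbed", then "deallocated"
```
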
MPD and Current Systems

The one exception to MPD operation is that older systems were not
designed with a Page Deallocation Table.  Because of this, the system
startup routine is slightly different.  During system startup, if the
memory selftest detects a double-bit error, the system does not boot.
This is the same operation as today, and it differs from the 3000
991/995.  However, while the system is running, MEMLOGP keeps track of
deallocated pages in its disk-based memory log file.  During the next
startup, these pages are deallocated before the system comes up.
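
The difference between the two startup paths can be summarized in a
short sketch.  The function name, its parameters, and the use of sets of
page numbers are assumptions made for illustration; the sketch also
assumes that, on a system with a PDT, the pages deallocated at boot are
exactly those listed in the PDT.

```python
def boot_page_deallocation(has_pdt, pdt_pages, memory_log_pages,
                           selftest_double_bit_pages):
    """Return (pages deallocated at boot, memory log pages after boot)."""
    memory_log = set(memory_log_pages)

    if has_pdt:
        # Selftest adds any newly found double-bit pages to the PDT.
        pdt = set(pdt_pages) | set(selftest_double_bit_pages)
        # MEMLOGP copies PDT entries missing from the memory log file.
        return pdt, memory_log | pdt

    # Older systems have no PDT: a double-bit error stops the boot.
    if selftest_double_bit_pages:
        raise RuntimeError("memory selftest found a double-bit error; "
                           "system does not boot")
    # Pages recorded by MEMLOGP on disk are deallocated instead.
    return set(memory_log), memory_log


print(boot_page_deallocation(True, {3}, {3}, {9}))
print(boot_page_deallocation(False, set(), {3, 5}, set()))
```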

