Debugging Examples

Example 1

The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter.

On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:

msgbuf+10/s
        .
        .
        .
interrupt type 15, pcsq.pcoq = 0.3b2cc, isr.ior = 0.0
Data page fault on interrupt stack
   B2352A HP-UX () #1: Fri Aug 14 00:49:59 PDT 1992
panic: (display==0xbf00, flags==0x0) Interrupt
PC-Offset Stack Trace (read across, most recent is 1st):
  0x0013e81c  0x000cddb8  0x000bc93c  0x0003b2cc  0x0012e2bc  
0x0016b350
End Of Stack

First we translate the hex stack trace in the panic message into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic message stack trace, we get the following symbolic stack trace:

panic+40:       addil   800,dp
interrupt+7E8:  rsm     1,r0
$ihndlr_rtn:    rsm     1,r0
sp_timeout+2C:  ldws    0(arg3),arg2
softclock+94:   b,n     softclock+30
external_interrupt+350:         ldil    261000,r22

The address where the illegal data access occurred is sp_timeout+2C. The isr.ior in the panic message indicates that the data address that caused the panic is 0.0, and the instruction at sp_timeout+2C is ldws 0(arg3),arg2, so arg3 must have been 0 at the time of the panic. So we are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know which source code line sp_timeout+2C corresponds to. Here is the source code for sp_timeout():

struct sp {
  unsigned sp_state;
  queue_t  *sp_rdq;
  mblk_t   *mp;
  mblk_t   *last_mp;
};
 
static sp_timeout(lp)
struct sp *lp;
{
  mblk_t *temp;
  unsigned int s;
 
  if (lp->sp_state & SPOPEN) {
    /* Put message on driver's read queue */
      s = splstr();
      temp = lp->mp;
      lp->mp = lp->mp->b_next;
      if (lp->mp == NULL) lp->last_mp = NULL;
      temp->b_next = NULL;
      putq(lp->sp_rdq,temp);
      splx(s);
  }
}

Here is the relevant portion of the assembly code. The instruction which caused the panic is marked with an "*."

 sp_timeout,20?ia                        # adb command
 sp_timeout:                             # adb's response
 sp_timeout:     stw     rp,-14(sp)
 sp_timeout+4:   stwm    r3,40(sp)
 sp_timeout+8:   stw     r4,-3C(sp)
 sp_timeout+0xC: or      arg0,r0,r3
 sp_timeout+10:  ldws    0(r3),arg1
 sp_timeout+14:  bb,>=,n arg1,31,sp_timeout+58
 sp_timeout+18:  bl      tmxlwsrv+6C,rp  (splstr)
 sp_timeout+1C:  or      r0,r0,r0
 sp_timeout+20:  or      ret0,r0,r4
 sp_timeout+24:  ldws    8(r3),arg1
 sp_timeout+28:  ldws    8(r3),arg3
*sp_timeout+2C:  ldws    0(arg3),arg2
 sp_timeout+30:  stws    arg2,8(r3)

At sp_timeout+0xC, arg0, which corresponds to the source code variable lp is moved to r3. We know arg0 is lp, because lp is the first argument to sp_timeout(). sp_timeout+0x14 looks like the if statement in the source code, because bb is a branch instruction. sp_timeout+0x18 is the call to splstr(). sp_timeout+0x28 loads arg3 with the memory contents at location r3 + 0x8. arg3 is the source code variable lp->mp. We can guess this because mp is 8 bytes from the start of lp, according to the declaration for the struct sp. So our problem is that lp->mp is NULL. We want to confirm this, and want to look at the rest of *lp. To do so, we need to find the value of r3 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at the time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:

pcoqh

Procedure Address

Frame Size

0x1fdb80
0x1fdb50
0x1fdad0
0x16860
0x167e0
0x16560
0x16330
0x162f0

0x24b34
0xc8f48
0xc8938
0x13e8cc
0xcddb8
0xbc93c
0x3b2cc
0x12e2bc

doadump+0xec
panic_boot+0x354
boot+0x24
panic+0xf0
interrupt+0x7e8
$ihndlr_rtn
sp_timeout+0x2c
softclock+0x94

Now that we have the values of sp, we want to look into the stack of the procedure above sp_timeout() in the stack trace to find what value that procedure saved in its stack for r3. In this case, the procedure above sp_timeout() is $ihndlr_rtn. $ihndlr_rtn is one of the low-level kernel utility procedures which is hand-coded in assembly and does not create a normal stack frame. Instead it creates a "save state" area, which contains the values of all the registers at the time the trap or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are located at "top of save state area" + "register number" * 4. For example, r3 will be 3*4 = 12 off of the beginning of the save state area. To find the top of the save state area, subtract the size of the save_state structure from the value of sp for $ihndlr_rtn:

0x16560-0x230>sp         # set sp to top of trap save state
<sp/X
16330:       0xF000009   # first word of save state area
<sp+0xC/X                # find contents of r3 (lp) at sp + 3*4
icsBase+33C: 24C258
24C258+0x8/X             # find 8 off of r3 (lp->mp)
sp_sp+18:    0           # lp->mp is NULL
0x24c258/4X              # look at all of lp:
             # state *sp_rdq     *mp       *last_mp
sp_sp+10:    1        1040C00     0         10F7C00

We can also use strdb to look at lp. (See the strdb section of this chapter for details.) There may be several instances of the sp driver, each with a different minor number, so we must look at each one until we find the instance whose q_ptr is the same as the address we have for lp. lp is a pointer to the sp driver's private data, which is also pointed to by q_ptr. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in use for the sp driver:

:la sp
 
sp MAJOR = 115
ACTIVE Minor  2  Stream head RQ = 0x0810eb000
ACTIVE Minor  1  Stream head RQ = 0x081107a00
ACTIVE Minor  0  Stream head RQ = 0x0810ebe00

The strdb STREAMS/UX subsystem command lm will show us what modules may have been pushed into the stream above the sp driver:

:lm sp 0
 
STREAM Head
lmodc
Driver sp

In this case, the panicking stream happens to correspond to the sp with minor number 1. From the strdb STREAMS/UX subsystem, we use ":qh sp 1" to get to the read queue of the stream head containing sp driver with minor number 1. Then the o command to get to the write queue of the stream head. Next the n command twice to get from the stream head through the module lmodc to the driver sp. Here is the display of the q information for driver sp, minor number 1. Note that q_ptr is 0x24c258, which is the address of lp.

:qh sp 1
 
struct queue 0x1040c74
 
q_qinfo   =  0x1e545c   q_pad1[2] = 00
q_first   =  0x0q_other   =  0x1040c00
q_last    =  0x0
q_next    =  0x0
q_link    =  0x0
q_ptr     =  0x24c258
q_count   =  0
q_flag    =  0x1128
   QWANTR
   QUSE
   QOLD
   QSYNCH
q_minpsz  =  0
q_maxpsz  =  256
q_hiwat   =  0x8000
q_lowat   =  0x4000
q_bandp   =  0x105fd40
q_nband   =  1
q_pad1[0] = 00
q_pad1[1] = 00

Now that we have reached the queue structure for the panicking sp driver instance, we can use strdb or adb to examine its contents. Using the strdb command :b, we can look at q_ptr, and see that its mp field (the third word) is NULL:

:b 0x24c258
 
0x0024c258 00 00 00 01 01 04 0c 00 00 00 00 00 01 0f 7c 00 | ................
0x0024c268 00 00 00 01 01 0f 8e 00 00 00 00 00 00 00 00 00 | ................
0x0024c278 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c288 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c298 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2a8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2c8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2d8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2e8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c2f8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c308 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c318 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c328 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................
0x0024c338 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

At this point, we have probably learned all that we can from the dump and must turn to the source code to discover the cause of this problem. We next examine the code carefully everywhere that lp->mp is updated or should be updated. Because sp driver's put routine, spput(), should be updating lp->mp, we look at it first.

static spput(q, mp)
queue_t *q;
mblk_t *mp;
{
  struct sp *lp;
  unsigned int s;
 
  switch (mp->b_datap->db_type) {
  case M_DATA:
  case M_PROTO:
  case M_PCPROTO:
    s = splstr();
    lp = q->q_ptr;
    if (!lp->last_mp)
      lp->last_mp = mp;
    else
      lp->last_mp->b_next = mp;
    splx(s);
    timeout(sp_timeout,lp,1);
    break;
  default:
    printf("Routine spput: Should not be here\n");
    break;
  }
}

Note that spput() never updates lp->mp. It just adds the new message to the tail of the list using lp->last_mp. But once sp_timeout() has processed the last message on the list and set lp->mp to NULL, spput() will never update lp->mp to point at the next message it receives. This causes sp_timeout() to be called with lp->mp == NULL. If we change spput() if statement to properly update lp->mp as shown below, this panic will be fixed.

if (!lp->mp)
        /*
         * head of list is NULL so list is empty -- put new message
         * at head of list
         */
        lp->mp = mp;
else
        /*
         * list is not empty -- put new message at tail of list
         */
        lp->last_mp->b_next = mp;
/*
 * update list tail pointer to point to new message
 */
lp->last_mp = mp;

Example 2

The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter.

On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:

msgbuf+0xc/s
        .
        .
        .
trap type 15, pcsq.pcoq = 0.3b584, isr.ior = 0.0
   B2352A HP-UX () #1: Fri Aug 14 00:49:59 PDT 1992
panic: (display==0xbf00, flags==0x0) Data segmentation fault
PC-Offset Stack Trace (read across, most recent is 1st):
  0x0013e81c  0x000cc108  0x000bd3f4  0x0003b584  0x00049a48  0x0004bd0c
  0x0002f7d4  0x00046178  0x00049a48  0x000460d0  0x00046594  0x0012cc10
  0x000bedd0  0x00024cf0
End Of Stack

panic+40:             addil   800,dp
trap+0xA28:           b       trap+0xF18
$call_trap+20:        rsm     1,r0
spput+4C:             stws    r31,0(r1)
csq_lateral+80:       b,n     csq_lateral+8C
puthere+4C:           ldw     -54(sp),rp
lmodcsrv+5C:          bl      getq,rp
sq_wrapper+50:        ldw     -54(sp),rp
csq_lateral+80:       b,n     csq_lateral+8C
runq_run+58:          b,n     runq_run+74
str_sched_daemon+264: b       str_sched_daemon+160

The address where the illegal data access occurred is spput+4C. The isr.ior in the panic message indicates that the data address that caused the panic is 0.0, and the instruction at spput+4C is stws r31,0(r1), so r1 must have been 0 at the time of the panic. We are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know to which source code line spput+4C corresponds to. Here is the source code for spput():

struct sp {
  unsigned sp_state;
  queue_t  *sp_rdq;
  mblk_t   *mp;
  mblk_t   *last_mp;
};
 
static spput(q, mp)
queue_t *q;
mblk_t *mp;
{
  struct sp *lp;
  unsigned int s;
 
  switch (mp->b_datap->db_type) {
  case M_DATA:
  case M_PROTO:
  case M_PCPROTO:
    lp = q->q_ptr;
    if (!lp->mp)
      lp->mp = mp;
    else
      lp->last_mp->b_next = mp;
    lp->last_mp = mp;
    timeout(sp_timeout,lp,1);
    break;
  default:
    printf("Routine spput: Should not be here\n");
    break;
  }
}

Here is the relevant portion of the assembly code. The instruction where the panic occurred is marked with an "*".

  spput,40?ia
  spput:
  spput:          stw     rp,-14(sp)
  spput+4:        ldo     40(sp),sp
  spput+8:        or      arg1,r0,r31
  spput+0xC:      ldw     14(r31),r22
  spput+10:       ldbs    0xD(r22),arg1
  spput+14:       ldo     -41(r0),arg2
  spput+18:       ldo     -41(arg1),arg3
  spput+1C:       combt,=,n       arg2,arg3,spput+30
  spput+20:       ldo     -40(r0),ret1
  spput+24:       combt,=,n       ret1,arg3,spput+30
  spput+28:       ldo     42(r0),r19
  spput+2C:       combf,=,n       r19,arg3,spput+78
  spput+30:       ldw     14(arg0),arg1
  spput+34:       ldws    8(arg1),ret0
  spput+38:       comibf,=,n      0,ret0,spput+48
  spput+3C:       stws    r31,8(arg1)
  spput+40:       b       spput+54
  spput+44:       stws    r31,0xC(arg1)
  spput+48:       ldws    0xC(arg1),r1
* spput+4C:       stws    r31,0(r1)
  spput+50:       stws    r31,0xC(arg1)
  spput+54:       ldil    3B000,rp
  spput+58:       ldo     298(rp),r20
  spput+5C:       extru,= r20,1F,1,r21
  spput+60:       ldw     -4(dp),r21
  spput+64:       ldo     1(r0),arg2
  spput+68:       bl      spclose+0xB4,rp  (timeout)

First, we try to get a general idea where spput+0x4C falls in the source code. It occurs before the call to timeout() at spput+0x68. The pattern of combt and combf instructions from spput+0x1C to spput+0x2C correspond to the switch statement in the source code. We guess this by noticing that we have loaded a value into arg3 which we compare against three different values, which resembles the first three case statements in the switch statement. It is unlikely that the default case of the switch statement, which just does a printf(), would cause the system to panic. spput+0x4C is probably in the source code in the case statement for M_DATA, M_PROTO, and M_PCPROTO. The comibf instruction at spput+0x38 must correspond to the if (!lp->mp) source statement, because it is a conditional branch statement, and it is comparing a register to 0 (zero).

We may be able to determine whether we executed the "if" clause or the "else" clause of the if statement, based on the fact that we know we executed spput+0x4C (because a trap occurred while executing it). The comibf instruction branches to its target address if the condition it is checking is false. This comibf instruction compares ret0 to zero. If ret0 equals zero, comibf will not branch, and execution will continue to spput+0x3C and spput+0x40. spput+0x40 is an unconditional branch to spput+0x54, which is past spput+0x4C. Therefore, if ret0 had been zero, we never would have executed spput+0x4C. So ret0 was not zero. Since we know that the comibf instruction corresponds to "if (!lp->mp)," we know that lp->mp was not NULL, and the comibf instruction branches to spput+0x48 if lp->mp is not NULL, we can be confident that spput+0x48 and spput+0x4C are part of the else clause of the if statement, which consists of one statement, "lp->last_mp->b_next = mp;."

Now we know which source code line we panicked on. We need to determine which source code pointer the register r1 corresponds to, because dereferencing r1 is what caused the panic. To do this, we work backwards from spput+0x4C to see where r1's contents came from. On spput+0x48, r1 gets loaded from arg1 + 0xC. Now we look backward to see where arg1 came from. It is tempting to assume that arg1 is the second argument to spput, which is mp. But at spput+0x10, arg1 is the target of a load, so at spput+0x48 arg1 does not contain mp. It is also tempting to look at spput+0x44 for the origins of arg1's contents, because that instruction has arg1 as its target. But because we took the comibf at spput+0x38, we must have branched around spput+0x44, so we can ignore this instruction. Looking further backward to spput+0x30, arg1 gets loaded from arg0 + 0x14. arg0 has not been the target of a load instruction since the beginning of spput, so it must still contain the first argument to spput, q. Looking at the source code, the only time that q is referenced is to set lp in the statement before the if. So arg1 must correspond to lp. Looking at the source code line where the panic occurred, "lp->last_mp->b_next = mp," and the assembly code lines spput+0x48 and spput+0x4C, it appears that spput+0x48 is setting r1 to lp->last_mp, and spput+0x4C is attempting to put the contents of r31 into memory location r1 + 0, which must be "lp->last_mp->b_next".

So our problem is that lp->last_mp is NULL. It may help us to look at the rest of *lp, and to do so we need to find the value of arg1 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at the time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:

pcoqh

Procedure Address

Frame Size

0x1fdb80
0x1fdb50
0x1fdad0
0x7ffe6f88
0x7ffe6f08
0x7ffe6e08
0x7ffe6bd8
0x7ffe6b98

0x24b34
0xc8f48
0xc8938
0x13e8cc
0xcc108
0xbd3f4
0x3b584
0x49a48

doadump+0xEC
panic_boot+354
boot+0x24
panic+0xf0
trap+0xf18
$call_trap
spput+0x4c
csq_lateral+0x80

Now that we have the values of sp, we want to look into the stack frame of the procedure above spput() in the stack trace, to find what value that procedure saved in its stack for arg1. In this case, the procedure above spput() is $call_trap. $call_trap is one of the low-level kernel utility procedures which is hand-coded in assembly and does not create a normal stack frame. Instead it creates a "save state" area, which contains the values of all the registers at the time the trap or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are located at "top of save state area" + "register number" * 4. So, for example, arg1, which is also known as r25, will be 25*4 = 100 off of the beginning of the save state area. To find the top of the save state area, subtract the size of the save_state structure (0x230 in release 9.0) from the value of sp for $call_trap:

0x7ffe6e08-0x230>sp       # set sp to top of trap save state
<sp/X
7FFE6BD8:    0xF000009    # first word of save state area 
0x7ffe6bd8+0x4/X
7FFE6BDC:    0            # find contents of r1 (lp->last_mp)
                          # at sp + 1*4.  NULL, as we thought
0x7ffe6bd8+0x64/X         # find contents of arg1 (lp) at
                          # sp + 25*4.
7FFE6C38:     0xFFFFFFBF

0xFFFFFFBF is a very unlikely value for lp. It is more likely that the contents of arg1 were changed in the process of taking a trap. The four arg registers are considered scratch registers, and the trap path is very likely to have overwritten these registers before it created the save state area. However, there is an alternative way to find out the value of lp. If we can determine what the procedure that called spput() set arg0 to before the call, we will know the value of q, and lp is q->q_ptr.

  csq_lateral+40,15?ia
  csq_lateral+40: ldws    8(r3),arg2
  csq_lateral+44: depi    -1,1E,1,arg2
  csq_lateral+48: stws    arg2,8(r3)
  csq_lateral+4C: bl      csq_turnover+108,rp  (UNCRIT)
  csq_lateral+50: or      r6,r0,arg0
  csq_lateral+54: ldw     10(r5),ret1
  csq_lateral+58: comibt,=,n      0,ret1,csq_lateral+68
  csq_lateral+5C: ldw     10(r5),arg0
  csq_lateral+60: ldw     1C(arg0),r19
  csq_lateral+64: bb,<,n  r19,18,csq_lateral+84
  csq_lateral+68: ldw     1C(r5),arg1
  csq_lateral+6C: stw     r0,1C(r5)
  csq_lateral+70: ldw     14(r5),r6
  csq_lateral+74: ldw     18(r5),arg0
* csq_lateral+78: ble     0(sr4,r6)
  csq_lateral+7C: or      r31,r0,rp
  csq_lateral+80: b,n     csq_lateral+8C

The procedure which called spput() is csq_lateral(). The point where the call was made is marked with an asterisk. Note that the procedure call here is made using the instruction ble instead of the usual instruction bl. This is because csq_lateral does not know the name of the procedure it is going to call. csq_lateral() is passed a structure which contains the address of a procedure to call and the arguments with which to call it. Because the compiler cannot tell at compile time how far away in the executable image the procedure address is, it must use a branch and link external, ble, instruction in order to be sure it will be able to reach the procedure address being branched to.

At csq_lateral+0x74, arg0 is loaded from r5 + 0x18. So if we can find out what value r5 had at that point, we can determine the value of q. r5 is a callee save register, so there is a chance that spput saved r5 in its stack frame. We look at the first few instructions of spput:

spput/6i
spput:
spput:          stw     rp,-14(sp)
                ldo     40(sp),sp
                or      arg1,r0,r31
                ldw     14(r31),r22
                ldbs    0xD(r22),arg1
                ldo     -41(r0),arg2

We see that spput did not save r5. Callee registers are only saved if the callee plans to overwrite the register. So we cannot get r5 from spput's stack frame, but if spput did not save r5 that means it did not overwrite it; therefore, the value for r5 in the save state area will be the same value that r5 had at csq_lateral+0x74. Look at 4*5 into the save state area:

<sp+0x14/X              # sp + 4*5 == r5
7FFE6BEC:       11002A0
11002A0+0x18/X          # q is r5 + 0x18
11002B8:        10EE674
10EE674+0x14/X          # lp is q + 0x14
10EE688:        24C278
24C278+0xC/X            # lp->last_mp = lp + 0xC
sp_sp+3C:       0       # lp->last_mp is NULL
0x24c278/4X             # look at all of lp:
                # state sp_rdq        mp        last_mp
sp_sp+30:       1       10EE600       0          0

Note that at the point the panic occurred, lp->mp was NULL, even though we can be sure that at the time we checked lp->mp at instruction spput+0x38, lp->mp was not NULL. How can this be true? As we saw in the previous example, sp_timeout() modifies the lp structure, and it runs out of timeout. In other words, spput() calls timeout() to schedule sp_timeout() to run after a specified amount of time. At each system clock tick, the kernel examines the list of procedures created by timeout() and schedules those procedures whose time has expired to run. Because a clock tick is a high level interrupt, it can occur at any time, and may suspend spput() if it is running. A clock tick may have occurred between spput+0x38 and spput+0x4C, allowing sp_timeout() to run and set lp->mp to NULL. In order to prevent this, we need to protect access to the lp structure by using splstr() around all critical sections of code in the sp driver which manipulate lp. So spput() source code should be changed as shown below:

case M_DATA:
case M_PROTO:
case M_PCPROTO:
  /*
   * Use splstr() to protect access to q->q_ptr area from
   * interrupts which may schedule sp_timeout().
   */
  s = splstr();
  lp = q->q_ptr;
  if (!lp->mp)
    lp->mp = mp;
  else
   lp->last_mp->b_next = mp;
  /*
   * Return to previous interrupt level
   */
  splx(s);

In order to protect access to q->q_ptr, sp_timeout() must also call splstr() before it accesses q->q_ptr. The source code for sp_timeout() in the first example in this section shows the correct use of splstr().

See the STREAMS/UX synchronization section of Chapter 3 for guidelines on protecting module and driver critical sections.

Example 3

The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter.

On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:

msgbuf+0xc/s
        .
        .
        .
trap type 15, pcsq.pcoq = 0.9ea14, isr.ior = 0.d
@(#)9245XA HP-UX (A.09.00) #0: Thu Aug 13 23:17:54 PDT 1992
panic: (display==0xbf00, flags==0x0) Data segmentation fault
 
PC-Offset Stack Trace (read across, most recent is 1st):
 0x0016da70 0x000e5a68 0x000d34cc 0x0009ea14 0x00099714 0x0009
2fdc
 0x0006e0c8 0x0006dbb8 0x0006d2a8 0x001954e8 0x00194fa4 0x000b
7e24
 0x001846d4 0x00181730 0x00156538 0x00156af8 0x001567b8 0x000e
6d80
 0x000d3aac
End Of Stack

panic+30:            addil   -1000,dp
trap+0xADC:          b       trap+1004
$call_trap+20:       rsm     1,r0
flushq+60:           ldbs    0xD(r21),r22
q_free+1C:           ldw     -0xA4(sp),r31
osr_pop_subr+0xB44:  b       osr_pop_subr+0xB4C
osr_close_subr+4D8:  stw     ret0,-40(sp)
pse_close+8A0:       stw     ret0,-3C(sp)
hpstreams_close+58:  stw     ret0,-40(sp)
call_open_close+448: or      ret0,r0,r3
closed+138:          or      ret0,r0,r5
ufs_close+11C:       movb,tr r0,ret0,ufs_close+15C
vn_close+24:         ldw     -54(sp),rp
vno_close+50:        addil   -59800,dp
closef+0xE8:         ldw     18(r3),arg0
exit+2B4:            bl      uffree,rp
rexit+20:            ldw     -54(sp),rp
syscall+2A4:         ldhs    0(r9),r19

The address where the illegal data access occurred is flushq+0x60. The isr.ior in the panic message indicates that the data address that caused the panic is 0.d, and the instruction at flushq+0x60 is ldbs 0xD(r21),r22, so r21 must have been 0 at the time of the panic. So we are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know which variable r21 was supposed to contain. We do not have source code for flushq(), because it is a STREAMS/UX internal procedure, but we do know from its man page what arguments it takes, and we do have the assembly version of the code. Here is the relevant portion of the assembly. The instruction where the panic occurred is marked with an "*".

  flushq,20?ia
  flushq:
  flushq:         stw     rp,-14(sp)
  flushq+4:       ldo     40(sp),sp
  flushq+8:       stw     arg0,-64(sp)
  flushq+0xC:     stw     arg1,-68(sp)
  flushq+10:      ldw     -68(sp),r20
  flushq+14:      zdepi   1,10,1,r21
  flushq+18:      and     r20,r21,r22
  flushq+1C:      stw     r22,-3C(sp)
  flushq+20:      ldw     -68(sp),r31
  flushq+24:      addil   -8000,r0
  flushq+28:      ldo     -1(r1),r19
  flushq+2C:      and     r31,r19,r20
  flushq+30:      stw     r20,-68(sp)
  flushq+34:      ldw     -64(sp),r21
  flushq+38:      ldws    4(r21),r22
  flushq+3C:      stw     r22,-34(sp)
  flushq+40:      ldw     -34(sp),r1
  flushq+44:      comibt,=,n      0,r1,flushq+120
  flushq+48:      or      r0,r0,r0
  flushq+4C:      ldw     -34(sp),r31
  flushq+50:      ldws    0(r31),r19
  flushq+54:      stw     r19,-38(sp)
  flushq+58:      ldw     -34(sp),r20
  flushq+5C:      ldw     14(r20),r21
* flushq+60:      ldbs    0xD(r21),r22
  flushq+64:      stw     r22,-40(sp)
  flushq+68:      ldw     -68(sp),r1

We can find flushq()'s calling sequence in its man page in SVR4PG:

void flushq(queue_t *q, int flag)

It is more likely that *q or one of its members is NULL than the parameter flag being the cause of our problem. We will trace the use of the first argument, originally in arg0, through flushq, to see how it might be related to the contents of r21.

At flushq+0x8, arg0 is pushed onto the stack at offset sp - 0x64. Neither arg0 nor -64(sp) is referenced again until flushq+0x34. At flushq+0x34, r21 is loaded with -64(sp), so at this point r21 contains *q. At flushq+0x38, r22 is loaded from memory location 4 + r21. Looking at the structure definition for queue_t, found in /usr/include/sys/stream.h, we see that the second word in a queue_t structure, which would be found at memory location r21 + 4, is the q_first pointer.

struct queue {
    struct qinit * q_qinfo;  /* procedures and limits for queue */
    struct msgb *  q_first;  /* head of message queue */
    struct msgb *  q_last;   /* tail of message queue */
    struct queue * q_next;   /* next QUEUE in Stream */
    struct queue * q_link;   /* link to scheduling queue */
    caddr_t        q_ptr;    /* to private data structure */
    ulong          q_count;  /* weighted count of characters on q 
*/
    ulong          q_flag;   /* QUEUE state */
    long           q_minpsz; /* min packet size accepted */
    long           q_maxpsz; /* max packet size accepted */
    ulong          q_hiwat;  /* high water mark, for flow control 
*/
    ulong          q_lowat;  /* low water mark */
    struct qband * q_bandp;   /* band information */
    unsigned char  q_nband;   /* number of bands */
    unsigned char  q_pad1[3]; /* reserved */
    struct queue * q_other;   /* pointer to other Q in queue pair 
*/
    QUEUE_KERNEL_FIELDS
};

So r22 now contains q->q_first. At flushq+0x3C, r22 is stored back in the stack, at sp - 0x34.

At this point, it may be useful to try and work backwards from flushq+0x5C, where r21 gets loaded from 0x14 + r20, because at the next instruction, flushq+0x60, we know that r21 is NULL. We notice that at flushq+0x58, r20 is loaded from sp - 0x34. At flushq+0x3C, we know that sp - 0x34 was q->q_first. Checking the instructions between flushq+0x3C and flushq+0x58 shows that sp - 0x34 has not been stored to by any of these instructions, only loaded from. So at flushq+0x58, r20 is loaded with q->q_first. At flushq+0x5C, r21 is loaded with some field of q->q_first. Looking at the structure definition for struct msgb, also found in /usr/include/sys/stream.h, we find that the sixth word of the msgb structure, which would be found at memory location r20 + 5 words == r20 + 0x14, is b_datap.

struct  msgb {
   struct msgb *   b_next;  /* next message on queue */
   struct msgb *   b_prev;  /* previous message on queue */
   struct msgb *   b_cont;  /* next message block of message */
   unsigned char * b_rptr;  /* first unread data byte in buffer 
*/
   unsigned char * b_wptr;  /* first unwritten data byte */
   struct datab *  b_datap;  /* data block */
   unsigned char   b_band;  /* message priority */
   unsigned char   b_pad1;
   unsigned short  b_flag;   /* message flags */
   long      b_pad2;
   MSG_KERNEL_FIELDS
};

So our problem is that q->q_first->b_datap is NULL. We want to confirm this, and to look at the rest of the q structure. To do that we need to find the value of sp - 0x64 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:

pcoqh

Procedure Address

Frame Size

0x2418c0
0x241890
0x2417d0
0x7ffe7750
0x7ffe7710
0x7ffe7650
0x7ffe7420

0x1c374
0xdfcd0
0xdf3a8
0x16db14
0xe5a68
0xd34cc
0x9ea14

doadump+0xec
panic_boot+0x354
boot+0x34
panic+0xd4
trap+0xadc
$call_trap
flushq+0x60

0x30
0xc0
0x80
0x40
0xc0
0x230
0x40

Now that we have the values of sp for flushq, we know the q address we are interested in is at 0x7ffe7420 - 0x64:

0x7ffe7420-0x64/X
7FFE73BC:       5E9C00

Looking at the first few words of the q structure, we can determine the value of q_first, which is the second word:

5E9C00/4X
5E9C00:         294160          5D8C00          6C1880          0

Looking at q_first, we can see that the sixth word, b_datap, is NULL:

5D8C00/8X
5D8C00:         646480      0               646400          644000
                6440D1      0               0               0

We can also use strdb to look at q and q_first. See the strdb section of this chapter for more information. Because there may be several instances of the sp driver, each with a different minor number, we must look at each one until we find the stream which contains a queue whose address is the same as the address we have for q. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in use for the sp driver:

:la sp
 
sp MAJOR = 115
ACTIVE Minor 0x000013  Stream head RQ = 0x00607b00
ACTIVE Minor 0x000012  Stream head RQ = 0x00605c00

These instances of sp are far fewer than we had expected. lm on minor number 0x12 shows that lmodc has already been popped off the stream:

:lm sp 0x12
 
STREAM Head
Driver sp

and using :qh sp 0x12, and o and n as needed to traverse all the queues in this stream shows that none of these queues have address 0x5e9c00. lm on sp 0x13 shows that lmodc is still pushed above sp on this stream, but traversing all the queues in this stream shows that none of them are the queue we are looking for. We can use the strdb primary mode :x command to format q_first as a struct msgb to confirm our finding from adb that q->q_first->b_datap is NULL. (We find the structure type for q_first from /usr/include/sys/stream.h).

:x msgb 0x5d8c00
 
struct msgb 0x5d8c00                    S:1
 
b_next  =  0x646480
b_prev  =  0x0
b_cont  =  0x646400
b_rptr  =  0x644000
b_wptr  =  0x6440d1
b_datap =  0x0
b_band  =    0
b_pad1  = 00
b_flag  =  0x0
b_pad2  =  0

b_datap could be NULL because its resources have been freed, or it could be NULL because the data structure was corrupted in some way. To try to narrow this down, we want to look at the message buffer b_cont. If its b_datap is also NULL, the possibility of corruption becomes less likely. We can use :x msgb 0x646400 to format the b_cont field of q->q_first. It is easier, however, to see if there is a navigation key available for the b_cont field. "?" lists the available navigation keys:

navigation for structure msgb
'n'      = b_next (msgb)
'p'      = b_prev (msgb)
'm'      = b_rptr (b_rptr)
'c'      = b_cont (msgb)
'd'      = b_datap (datab)

Using the c navigation key, we see that b_datap for b_cont is also NULL. This makes it very likely that this message has already been freed.

struct msgb 0x646400            S:2
 
b_next  =  0x5d8c00
b_prev  =  0x0
b_cont  =  0x0
b_rptr  =  0x651400
b_wptr  =  0x6517e1
b_datap =  0x0
b_band  =  0
b_pad1  =  00
b_flag  =  0x0
b_pad2  =  0

Now we try to get information about the queue which was pointing to this message at the time of the panic. We use :x to format 0x5e9c00 as a queue structure to see what information it may still contain.

:x queue 0x5e9c00
 
struct queue 0x5e9c00               S:3
 
q_qinfo   =  0x294160   q_pad1[0] = 00
q_first   =  0x5d8c00   q_pad1[1] = 00
q_last    =  0x6c1880   q_pad1[2] = 00
q_next    =  0x0        q_other   =  0x5e9c74
q_link    =  0x0
q_ptr     =  0x0
q_count   =  24896
q_flag    =  0x1135
   QREADR
   QFULL
   QWANTW
   QUSE
   QOLD
   QSYNCH
q_minpsz  =  0
q_maxpsz  =  256
q_hiwat   =  0x8000
q_lowat   =  0x4000
q_bandp   =  0x539d00
q_nband   =  1

Note that this is a read queue whose q_next pointer is NULL. This implies that this queue is not a connected part of a stream, and is in the process of being closed. To find out what driver or module this queue is being used by, we want to look at q_qinfo. We could use :x qinit 0x294160, or look for an appropriate navigation key:

?
 
navigation for structure queue
'i'      = q_qinfo (qinit)
'm'      = q_first (msgb)
'z'      = q_last (msgb)
'n'      = q_next (queue)
'l'      = q_link (queue)
'b'      = q_bandp (qband)
'o'      = q_other (queue)

We use the i navigation key to print the following:

struct qinit 0x294160                S:4
 
qi_putp   =  0x785ac
qi_srvp   =  0x78794
qi_qopen  =  0x7841c
qi_qclose =  0x78490
qi_qadmin =  0x0
qi_minfo  =  0x294148
qi_mstat  =  0x0

Using the adb i command, we can find out the name of the qi_putp routine:

0x785ac/i
lmodcput:
lmodcput:       stw     rp,-14(sp)

This means the module lmodc was using the queue on which the panic occurred. We can double check this by looking at the qi_minfo structure in strdb. Again, we can either use :x module_info 0x294148, or we could see if there is a navigation key available for qi_minfo:

?
 
navigation for structure qinit
'i'      = qi_minfo (module_info)
's'      = qi_mstat (module_stat)

Using the qinit i navigation key to print the module_info structure:

struct module_info 0x294148             S:5
 
mi_idnum  =  0x3ec
mi_idname =  0x23a0a8
mi_minpsz =  0
mi_maxpsz =  256
mi_hiwat  =  0x8000
mi_lowat  =  0x4000

and using the adb s command to print mi_idname as a string:

0x23a0a8/s
lmcinfo+10:     lmodc

So we had the panic occur on an lmodc read queue which was in the process of being closed. Our stack trace confirms this. We are making the exit system call, close all open file descriptors and as part of process clean-up. The last close of a stream causes each module and driver to be popped and its resources freed, including its message buffers. Whenever a panic occurs which involves b_datap being NULL, the cause is usually that the buffer has already been freed but a pointer to it was not zeroed out, and a module or driver continues to access the buffer through this non-zeroed pointer. The best way to find the cause of this problem is to look through the source code for all calls to freemsg() or freeb(), and check that all pointers to the buffer being freed are zeroed out.

static spclose(q)
queue_t *q;
{
  struct sp *lp;
  unsigned int s;
  mblk_t *mp, *t_mp;
 
  lp = (struct sp *) (q->q_ptr);
  /* Free messages queued by spput() on interim mesg queue. */
  s = splstr();
  mp = lp->mp;
  while (mp != NULL) {
    t_mp = mp;
    mp = mp->b_next;
    freemsg(t_mp);
  }
  splx(s);
  flushq(WR(q), 1);
  q->q_ptr = NULL;
}

For the sp driver, we found that spclose() calls freemsg():

freemsg() is called to free all messages held in the interim message queue in our private data area, but we do not zero out the pointers lp->mp or lp->last_mp, which point to the head and tail of the private interim queue. A call to sp_timeout() may still be pending in the timeout queue. When sp_timeout() is executed, because lp->mp is non-NULL, it will call putq() to pass lp->mp up to sp's read queue, where sp's service routine will call putnext() to put it in lmodc's read queue. When flushq() is called on lmodc's read queue, it tries to free this already freed message, causing a trap type 15 panic on the NULL b_datap. Adding the following code to spclose() will fix this problem:

        .
        .
        .
    freemsg(t_mp);
  }
  splx(s);
  /*
   * NULL out list pointers to insure the messages they point to
   * will not be freed twice.
   */
  lp->mp = NULL;
  lp->last_mp = NULL;
  flushq(WR(q), 1);
  q->q_ptr = NULL;
}

Debugging Examples

Technical documentation

» Table of Contents

» Index

Example 1

Example 2

Example 3