|
|
STREAMS/UX for the HP 9000 Reference Manual > Chapter 6 Debugging STREAMS/UX Modules and DriversDebugging Examples |
|
The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:
First we translate the hex stack trace in the panic message into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic message stack trace, we get the following symbolic stack trace:
The address where the illegal data access occurred is sp_timeout+2C. The isr.ior in the panic message indicates that the data address that caused the panic is 0.0, and the instruction at sp_timeout+2C is ldws 0(arg3),arg2, so arg3 must have been 0 at the time of the panic. So we are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know which source code line sp_timeout+2C corresponds to. Here is the source code for sp_timeout():
Here is the relevant portion of the assembly code. The instruction which caused the panic is marked with an "*."
At sp_timeout+0xC, arg0, which corresponds to the source code variable lp is moved to r3. We know arg0 is lp, because lp is the first argument to sp_timeout(). sp_timeout+0x14 looks like the if statement in the source code, because bb is a branch instruction. sp_timeout+0x18 is the call to splstr(). sp_timeout+0x28 loads arg3 with the memory contents at location r3 + 0x8. arg3 is the source code variable lp->mp. We can guess this because mp is 8 bytes from the start of lp, according to the declaration for the struct sp. So our problem is that lp->mp is NULL. We want to confirm this, and want to look at the rest of *lp. To do so, we need to find the value of r3 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at the time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:
Now that we have the values of sp, we want to look into the stack of the procedure above sp_timeout() in the stack trace to find what value that procedure saved in its stack for r3. In this case, the procedure above sp_timeout() is $ihndlr_rtn. $ihndlr_rtn is one of the low-level kernel utility procedures which is hand-coded in assembly and does not create a normal stack frame. Instead it creates a "save state" area, which contains the values of all the registers at the time the trap or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are located at "top of save state area" + "register number" * 4. For example, r3 will be 3*4 = 12 off of the beginning of the save state area. To find the top of the save state area, subtract the size of the save_state structure from the value of sp for $ihndlr_rtn:
We can also use strdb to look at lp. (See the strdb section of this chapter for details.) There may be several instances of the sp driver, each with a different minor number, so we must look at each one until we find the instance whose q_ptr is the same as the address we have for lp. lp is a pointer to the sp driver's private data, which is also pointed to by q_ptr. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in use for the sp driver:
The strdb STREAMS/UX subsystem command lm will show us what modules may have been pushed into the stream above the sp driver:
In this case, the panicking stream happens to correspond to the sp with minor number 1. From the strdb STREAMS/UX subsystem, we use ":qh sp 1" to get to the read queue of the stream head containing sp driver with minor number 1. Then the o command to get to the write queue of the stream head. Next the n command twice to get from the stream head through the module lmodc to the driver sp. Here is the display of the q information for driver sp, minor number 1. Note that q_ptr is 0x24c258, which is the address of lp.
Now that we have reached the queue structure for the panicking sp driver instance, we can use strdb or adb to examine its contents. Using the strdb command :b, we can look at q_ptr, and see that its mp field (the third word) is NULL:
At this point, we have probably learned all that we can from the dump and must turn to the source code to discover the cause of this problem. We next examine the code carefully everywhere that lp->mp is updated or should be updated. Because sp driver's put routine, spput(), should be updating lp->mp, we look at it first.
Note that spput() never updates lp->mp. It just adds the new message to the tail of the list using lp->last_mp. But once sp_timeout() has processed the last message on the list and set lp->mp to NULL, spput() will never update lp->mp to point at the next message it receives. This causes sp_timeout() to be called with lp->mp == NULL. If we change spput() if statement to properly update lp->mp as shown below, this panic will be fixed.
The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:
First we translate the hex stack trace in the panic message into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic message stack trace, we get the following symbolic stack trace:
The address where the illegal data access occurred is spput+4C. The isr.ior in the panic message indicates that the data address that caused the panic is 0.0, and the instruction at spput+4C is stws r31,0(r1), so r1 must have been 0 at the time of the panic. We are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know to which source code line spput+4C corresponds to. Here is the source code for spput():
Here is the relevant portion of the assembly code. The instruction where the panic occurred is marked with an "*".
First, we try to get a general idea where spput+0x4C falls in the source code. It occurs before the call to timeout() at spput+0x68. The pattern of combt and combf instructions from spput+0x1C to spput+0x2C correspond to the switch statement in the source code. We guess this by noticing that we have loaded a value into arg3 which we compare against three different values, which resembles the first three case statements in the switch statement. It is unlikely that the default case of the switch statement, which just does a printf(), would cause the system to panic. spput+0x4C is probably in the source code in the case statement for M_DATA, M_PROTO, and M_PCPROTO. The comibf instruction at spput+0x38 must correspond to the if (!lp->mp) source statement, because it is a conditional branch statement, and it is comparing a register to 0 (zero). We may be able to determine whether we executed the "if" clause or the "else" clause of the if statement, based on the fact that we know we executed spput+0x4C (because a trap occurred while executing it). The comibf instruction branches to its target address if the condition it is checking is false. This comibf instruction compares ret0 to zero. If ret0 equals zero, comibf will not branch, and execution will continue to spput+0x3C and spput+0x40. spput+0x40 is an unconditional branch to spput+0x54, which is past spput+0x4C. Therefore, if ret0 had been zero, we never would have executed spput+0x4C. So ret0 was not zero. Since we know that the comibf instruction corresponds to "if (!lp->mp)," we know that lp->mp was not NULL, and the comibf instruction branches to spput+0x48 if lp->mp is not NULL, we can be confident that spput+0x48 and spput+0x4C are part of the else clause of the if statement, which consists of one statement, "lp->last_mp->b_next = mp;." Now we know which source code line we panicked on. We need to determine which source code pointer the register r1 corresponds to, because dereferencing r1 is what caused the panic. To do this, we work backwards from spput+0x4C to see where r1's contents came from. On spput+0x48, r1 gets loaded from arg1 + 0xC. Now we look backward to see where arg1 came from. It is tempting to assume that arg1 is the second argument to spput, which is mp. But at spput+0x10, arg1 is the target of a load, so at spput+0x48 arg1 does not contain mp. It is also tempting to look at spput+0x44 for the origins of arg1's contents, because that instruction has arg1 as its target. But because we took the comibf at spput+0x38, we must have branched around spput+0x44, so we can ignore this instruction. Looking further backward to spput+0x30, arg1 gets loaded from arg0 + 0x14. arg0 has not been the target of a load instruction since the beginning of spput, so it must still contain the first argument to spput, q. Looking at the source code, the only time that q is referenced is to set lp in the statement before the if. So arg1 must correspond to lp. Looking at the source code line where the panic occurred, "lp->last_mp->b_next = mp," and the assembly code lines spput+0x48 and spput+0x4C, it appears that spput+0x48 is setting r1 to lp->last_mp, and spput+0x4C is attempting to put the contents of r31 into memory location r1 + 0, which must be "lp->last_mp->b_next". So our problem is that lp->last_mp is NULL. It may help us to look at the rest of *lp, and to do so we need to find the value of arg1 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at the time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:
Now that we have the values of sp, we want to look into the stack frame of the procedure above spput() in the stack trace, to find what value that procedure saved in its stack for arg1. In this case, the procedure above spput() is $call_trap. $call_trap is one of the low-level kernel utility procedures which is hand-coded in assembly and does not create a normal stack frame. Instead it creates a "save state" area, which contains the values of all the registers at the time the trap or interrupt took place. The structure save_state is defined in /usr/include/machine/save_state.h. The general registers are stored first, and are located at "top of save state area" + "register number" * 4. So, for example, arg1, which is also known as r25, will be 25*4 = 100 off of the beginning of the save state area. To find the top of the save state area, subtract the size of the save_state structure (0x230 in release 9.0) from the value of sp for $call_trap:
0xFFFFFFBF is a very unlikely value for lp. It is more likely that the contents of arg1 were changed in the process of taking a trap. The four arg registers are considered scratch registers, and the trap path is very likely to have overwritten these registers before it created the save state area. However, there is an alternative way to find out the value of lp. If we can determine what the procedure that called spput() set arg0 to before the call, we will know the value of q, and lp is q->q_ptr.
The procedure which called spput() is csq_lateral(). The point where the call was made is marked with an asterisk. Note that the procedure call here is made using the instruction ble instead of the usual instruction bl. This is because csq_lateral does not know the name of the procedure it is going to call. csq_lateral() is passed a structure which contains the address of a procedure to call and the arguments with which to call it. Because the compiler cannot tell at compile time how far away in the executable image the procedure address is, it must use a branch and link external, ble, instruction in order to be sure it will be able to reach the procedure address being branched to. At csq_lateral+0x74, arg0 is loaded from r5 + 0x18. So if we can find out what value r5 had at that point, we can determine the value of q. r5 is a callee save register, so there is a chance that spput saved r5 in its stack frame. We look at the first few instructions of spput:
We see that spput did not save r5. Callee registers are only saved if the callee plans to overwrite the register. So we cannot get r5 from spput's stack frame, but if spput did not save r5 that means it did not overwrite it; therefore, the value for r5 in the save state area will be the same value that r5 had at csq_lateral+0x74. Look at 4*5 into the save state area:
Note that at the point the panic occurred, lp->mp was NULL, even though we can be sure that at the time we checked lp->mp at instruction spput+0x38, lp->mp was not NULL. How can this be true? As we saw in the previous example, sp_timeout() modifies the lp structure, and it runs out of timeout. In other words, spput() calls timeout() to schedule sp_timeout() to run after a specified amount of time. At each system clock tick, the kernel examines the list of procedures created by timeout() and schedules those procedures whose time has expired to run. Because a clock tick is a high level interrupt, it can occur at any time, and may suspend spput() if it is running. A clock tick may have occurred between spput+0x38 and spput+0x4C, allowing sp_timeout() to run and set lp->mp to NULL. In order to prevent this, we need to protect access to the lp structure by using splstr() around all critical sections of code in the sp driver which manipulate lp. So spput() source code should be changed as shown below:
In order to protect access to q->q_ptr, sp_timeout() must also call splstr() before it accesses q->q_ptr. The source code for sp_timeout() in the first example in this section shows the correct use of splstr(). See the STREAMS/UX synchronization section of Chapter 3 for guidelines on protecting module and driver critical sections. The following core dump was obtained while using a modified version of the sp driver, which is described in example #2 in the strdb section of this chapter. On entry to adb, we first look at the msgbuf to look for the panic message and hex stack trace. The interesting portion of msgbuf for this dump is:
First we translate the hex stack trace in the panic message into procedure names and addresses. Using the adb i command for each of the hex addresses in the panic message stack trace, we get the following symbolic stack trace:
The address where the illegal data access occurred is flushq+0x60. The isr.ior in the panic message indicates that the data address that caused the panic is 0.d, and the instruction at flushq+0x60 is ldbs 0xD(r21),r22, so r21 must have been 0 at the time of the panic. So we are probably dereferencing a null pointer. Our first task is to find out which pointer this is. To do this we need to know which variable r21 was supposed to contain. We do not have source code for flushq(), because it is a STREAMS/UX internal procedure, but we do know from its man page what arguments it takes, and we do have the assembly version of the code. Here is the relevant portion of the assembly. The instruction where the panic occurred is marked with an "*".
We can find flushq()'s calling sequence in its man page in SVR4PG: void flushq(queue_t *q, int flag) It is more likely that *q or one of its members is NULL than the parameter flag being the cause of our problem. We will trace the use of the first argument, originally in arg0, through flushq, to see how it might be related to the contents of r21. At flushq+0x8, arg0 is pushed onto the stack at offset sp - 0x64. Neither arg0 nor -64(sp) is referenced again until flushq+0x34. At flushq+0x34, r21 is loaded with -64(sp), so at this point r21 contains *q. At flushq+0x38, r22 is loaded from memory location 4 + r21. Looking at the structure definition for queue_t, found in /usr/include/sys/stream.h, we see that the second word in a queue_t structure, which would be found at memory location r21 + 4, is the q_first pointer.
So r22 now contains q->q_first. At flushq+0x3C, r22 is stored back in the stack, at sp - 0x34. At this point, it may be useful to try and work backwards from flushq+0x5C, where r21 gets loaded from 0x14 + r20, because at the next instruction, flushq+0x60, we know that r21 is NULL. We notice that at flushq+0x58, r20 is loaded from sp - 0x34. At flushq+0x3C, we know that sp - 0x34 was q->q_first. Checking the instructions between flushq+0x3C and flushq+0x58 shows that sp - 0x34 has not been stored to by any of these instructions, only loaded from. So at flushq+0x58, r20 is loaded with q->q_first. At flushq+0x5C, r21 is loaded with some field of q->q_first. Looking at the structure definition for struct msgb, also found in /usr/include/sys/stream.h, we find that the sixth word of the msgb structure, which would be found at memory location r20 + 5 words == r20 + 0x14, is b_datap.
So our problem is that q->q_first->b_datap is NULL. We want to confirm this, and to look at the rest of the q structure. To do that we need to find the value of sp - 0x64 at the time of the panic. We may be able to extract this information from the stack if we know the value of sp at time of the panic. To get this information, we do a manual stack back-trace. See "Manual Stack Back-Tracing" for details on how this is done. The resulting table is shown below:
Now that we have the values of sp for flushq, we know the q address we are interested in is at 0x7ffe7420 - 0x64:
Looking at the first few words of the q structure, we can determine the value of q_first, which is the second word:
Looking at q_first, we can see that the sixth word, b_datap, is NULL:
We can also use strdb to look at q and q_first. See the strdb section of this chapter for more information. Because there may be several instances of the sp driver, each with a different minor number, we must look at each one until we find the stream which contains a queue whose address is the same as the address we have for q. The strdb STREAMS/UX subsystem la command will tell us what minor numbers are in use for the sp driver:
These instances of sp are far fewer than we had expected. lm on minor number 0x12 shows that lmodc has already been popped off the stream:
and using :qh sp 0x12, and o and n as needed to traverse all the queues in this stream shows that none of these queues have address 0x5e9c00. lm on sp 0x13 shows that lmodc is still pushed above sp on this stream, but traversing all the queues in this stream shows that none of them are the queue we are looking for. We can use the strdb primary mode :x command to format q_first as a struct msgb to confirm our finding from adb that q->q_first->b_datap is NULL. (We find the structure type for q_first from /usr/include/sys/stream.h).
b_datap could be NULL because its resources have been freed, or it could be NULL because the data structure was corrupted in some way. To try to narrow this down, we want to look at the message buffer b_cont. If its b_datap is also NULL, the possibility of corruption becomes less likely. We can use :x msgb 0x646400 to format the b_cont field of q->q_first. It is easier, however, to see if there is a navigation key available for the b_cont field. "?" lists the available navigation keys:
Using the c navigation key, we see that b_datap for b_cont is also NULL. This makes it very likely that this message has already been freed.
Now we try to get information about the queue which was pointing to this message at the time of the panic. We use :x to format 0x5e9c00 as a queue structure to see what information it may still contain.
Note that this is a read queue whose q_next pointer is NULL. This implies that this queue is not a connected part of a stream, and is in the process of being closed. To find out what driver or module this queue is being used by, we want to look at q_qinfo. We could use :x qinit 0x294160, or look for an appropriate navigation key:
We use the i navigation key to print the following:
Using the adb i command, we can find out the name of the qi_putp routine:
This means the module lmodc was using the queue on which the panic occurred. We can double check this by looking at the qi_minfo structure in strdb. Again, we can either use :x module_info 0x294148, or we could see if there is a navigation key available for qi_minfo:
Using the qinit i navigation key to print the module_info structure:
and using the adb s command to print mi_idname as a string:
So we had the panic occur on an lmodc read queue which was in the process of being closed. Our stack trace confirms this. We are making the exit system call, close all open file descriptors and as part of process clean-up. The last close of a stream causes each module and driver to be popped and its resources freed, including its message buffers. Whenever a panic occurs which involves b_datap being NULL, the cause is usually that the buffer has already been freed but a pointer to it was not zeroed out, and a module or driver continues to access the buffer through this non-zeroed pointer. The best way to find the cause of this problem is to look through the source code for all calls to freemsg() or freeb(), and check that all pointers to the buffer being freed are zeroed out.
For the sp driver, we found that spclose() calls freemsg(): freemsg() is called to free all messages held in the interim message queue in our private data area, but we do not zero out the pointers lp->mp or lp->last_mp, which point to the head and tail of the private interim queue. A call to sp_timeout() may still be pending in the timeout queue. When sp_timeout() is executed, because lp->mp is non-NULL, it will call putq() to pass lp->mp up to sp's read queue, where sp's service routine will call putnext() to put it in lmodc's read queue. When flushq() is called on lmodc's read queue, it tries to free this already freed message, causing a trap type 15 panic on the NULL b_datap. Adding the following code to spclose() will fix this problem:
|
|