Debugging OCI/Client-Side Tools & Applications

July 12, 2010

Applies to:

Oracle Server – Enterprise Edition – Version: 11.1.0.6
Information in this document applies to any platform.

Purpose

This note is for INTERNAL Oracle use.
It’s referencing the source of this document on the Development’s Wiki:

Debugging OCI/Client-Side Tools & Applications

This document describes techniques that can be used to diagnose problems in OCI on the client and is intended for internal use only.

Scope and Application

They have been compiled based on expertise and observations by the OCI development group over the years. It describes a step-by-step process to determine whether the problem is in OCI or not. Once the problem is determined to be in OCI, one can use the diagnostics implemented in OCI to obtain the necessary dumps and traces to quickly resolve them by having them available to the OCI development or DDR group. Any problems with internal logic may not be completely resolved by these dumps and traces but they are a good starting point as the logic follows the data stored in the OCI handles that are dumped. We hope to resolve 90-95% of the problems using these techniques there by reducing the one-off patches made specifically to diagnose the problems. A concerted effort was made to add sufficient tracing in problematic code but there may still be room to add more that will have to be addressed in the future. Hope you find this information helpful. Any suggestions to improve this document are welcome and can be sent to Kevin Neel or Rajendra Pingte. So are you ready to diagnose problems in OCI? Please refer to “Links to additional documentation” section for additional information.

Links to additional documentation:

These links are for Oracle internal use only.

OCI and OCCI fault diagnosability
Managing Diagnostic Data
Oracle Internal site for diagnosability

Debugging OCI/Client-Side Tools & Applications

Is it an OCI problem?

First step is to determine whether the problem is in OCI or not to resolve it expediently. There have been cases in the past where considerable time was spent to identify this fact because the application was provided by a third-party and the source code did not exist. With 11G, OCI has addressed this problem by sending the client characteristics to the server that can be queried from V$SESSION_CONNECT_INFO. Columns of interest are:

  • CLIENT_CHARSET – Shows the client character set
  • CLIENT_CONNECTION – Homogeneous or heterogeneous connection
  • CLIENT_OCI_LIBRARY – Whether home-based or instant client
  • CLIENT_VERSION – Version of the client library, for 11G it is 11.1.0.6.0
  • CLIENT_DRIVER – Empty when application uses OCI directly, JDBC driver will show “JDBC-thin”

One can query this information for the session and serial number in question to determine whether the application uses JDBC driver or not. In certain situations this check may not even be necessary because it may be obvious from the connect string, the library dependency (ldd on UNIX, “dumpbin /dependents” if Visual Studio is installed or “listdlls” from http://www.microsoft.com/technet/sysinternals/default.mspx on Windows), test case supplied or an internal Oracle product that links with OCI library. Sometimes it may be possible to determine that the fault is not with Oracle code from the stack trace or core dump when the application crashes.

Once it is determined that the application is using OCI, next step is to determine whether the problem is with application, Oracle-provided driver or OCI itself. Typical problems reported fall into following categories:

  1. Data loss / Incorrect results
  2. Protocol violations
  3. Crashes
  4. Hangs/Infinite loops
  5. Multi-threaded application failures
  6. Memory leaks
  7. Non-blocking errors
  8. Errors due to configuration problems
  9. Object-Relational defects
  10. Event-based tracing
  11. Client/Server communication

We look into each category of problem and potential ways to identify the offending layer (application, Oracle-provided driver or OCI itself) in detail below.

By default OCI now captures important data structure in its call context to aid in first failure analysis. This information will be dumped always when an internal error is encountered. See “First-failure capture and call context” section for details.

First-failure capture and call context:

In 11G, OCI has attempted to reduce the one-off patches required to diagnose problems. It also has attempted to capture sufficient information to be dumped to aid first-failure analysis. It is accomplished by recording important structures in its call context to be dumped as a result of internal errors. A signal handler has been implemented to dump the call stack that works on all platforms. It now reports the library dependency that can be used to determine the location where the shared objects were loaded from. Client characteristics, like character set, version, driver, connection type and client type can be queried from the server. Trace facility has been completely re-written and is thread-safe now. Object diagnostics have been improved and enhanced to help diagnose problems expediently. These topics are described in detail below. Sample output along with necessary setting are also provided.

OCI records the messages sent to the server in its call context to aid first-failure analysis. It records the RPC parameters sent to the server to be dumped when an internal error is encountered. Internal errors are:

case OER(21500): /* internal error */ case OER(602): /* core dump */ case OER(603): /* fatal error */

OCI will also dump the RPC when event 10842 is set to level 8 or higher for the following errors:

case OER(3114): /* not connected to Oracle */ case OER(3135): /* lost contact */ case OER(3122): /* closing window */ case OER(1033): /* initialization or shutdown in progress */ case OER(1034): /* Oracle not available */ case OER(1092): /* instance terminated */ case OER(12153): /* TNS: not connected */ case OER(28511): /* Lost rpc connection to heterogeneous agent */ case OER(22): /* The remote session is gone */

One can also dump the RPC for every call issues to the server by setting the event 10842 to level 16. OCI uses the DDE(Dynamic Data Extractor, a new mechanism in 11G) interface to dump the call context. In addition to RPC, it records user and statement handles to aid first-failure analysis further. All bind and defined variables are available in the statement handles; user handle provides information of about the connection. These recorded structures are dumped using the OCI’s generic print facility KGP that displays the filed name along with its value. This is also called the current or active context that is dumped whenever an internal error is encountered.

OCI also maintains its Process State and System State. They are also dumped when an internal error is encountered. Process and system state information provides detailed view of an application or mid-tier.

NOTE: When diagnosability is disabled by adding the lines “DIAG_ADR_ENABLED=FALSE “ and “DIAG_DDE_ENABLED=FALSE” in sqlnet.ora, first-failure capture facility is disabled.

1. Data Loss / Incorrect Results

We refer to the data that is sent by the application to the server and received from the server incorrectly. OCI performs data conversion based on NLS settings or the form-of-use specified in the bind and define buffers. One can enable SQL*Net tracing on the client and server for the specified session and look at the data sent and received from the server. If one sees that the data sent is incorrect, it may be due to conversion problems in OCI or the application/driver is sending wrong data to OCI. If the data received from the server as noted in the server trace file is incorrect then it must have been stored incorrectly or any conversion done on it prior to it being sent to the client may be incorrect. Currently, OCI does not provide a way to trace the conversion logic for relational data types therefore a patch may be necessary in such situations. For object data types, event 24473 can be set to level 3 to trace the pickle and unpickle operations that will dump the image representation to a trace file. See how to set the event on client side below.

2. Protocol Violations

A protocol violation is encountered when client and server do not understand the data received from each other. They are hard to reproduce and capturing the reason for the failure is important. In 11G, OCI will dump the data in the network buffer to be analyzed by the development group. When such errors are encountered, it is best to provide the trace file to the development group for further analysis. A reproducible test may be necessary or a complete sqlnet trace for the client and server may be required in certain situations to completely understand the reason.

3. Crashes

In a way crashes are easier to diagnose. OCI now provides a SEGV handler on all platforms to dump the call stack and information necessary to determine the cause. OCI registers its signal handler only when an application has not registered its own. Once registered, it dumps the call stack, registers and memory addresses that are necessary to aid first-failure analysis. Signals captured on UNIX are: SIGQUIT, SIGILL, SIGABRT, SIGFPE, SIGSEGV, SIGBUS, SIGSYS, SIGTRAP, SIGXCPU, and SIGXFSZ. On Windows, it uses OS-provided structured exception handling feature. For those having access to ADE and the source see test/tkpg/src/tkpgd08 for an example.

First step is to look at the call stack and locate the function that crashes. If the function name is an address then the crash may be in application code or some OS function. When only the address is displayed it is necessary to look further up the stack and see if there is any Oracle function on the stack. There are various ways to determine that the function belongs to Oracle client library: one can run the nm utility on UNIX or “dumpbin /exports” on Windows. There are other tools provided by different OS’s that also can be used and this document does not cover all of them.

Here is an example of the crash in tkpgd08 on Linux:

Dump continued from file: /ade/rpingte_rp1mn/oracle/log/oradiag_rpingte/diag/clients/user_rpingte/host_1318267835_11/trace/ora_30555_3046456240.trc
oci-24550
[11] [[si_signo=11] [si_errno=0] [si_c] [0xB594C8E8] [] [] [] [] []
========= Dump for incident 1 (oci 24550 [11]) ========

-----
Call Stack Trace -----
calling              call     entry                argument values in hex
location             type
    point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
dbgc_dmp()+179
      call     B67B0638             B594872C ? 3 ?
dbgexPhaseII()+298   call     00000000             8097D00 ? 3 ?
dbgexProcessError()
 call     dbgexPhaseII()       28 ? B594B184 ? 1060020 ?
+1306
dbgeExecuteForError  call     B67A4508             80974F8
? 80A86C4 ? 0 ? 0 ?
()+65
dbgePostErrorDirect  call     dbgeExecuteForError  80974F8 ? 80A86C4 ? 0 ? 0 ?
()+1747
                ()                   0 ?
kpeDbgSignalHandler  call     B67A0DB8             80974F8 ? B7DE68B4 ? 5FE6 ?
()+224
                                            1 ? 3 ? 0 ?
skgesig_sigactionHa  call     00000000             B594C8E8 ? B7FEADC0
?
ndler()+214
lxsCntXExc()+202     signal   00000000             B ? B594DDA0 ? B594DE20 ?
lxsCntExc()+57       call
  B67B08D8             FFFFFFF6 ? B594E1C0 ?
                                                   B594E1A8 ? 20000040 ?

                                               B594E1D0 ? 80C01E0 ?
ttctrmbk()+214       call     B67AB648             FFFFFFF6
? A ? 20000040 ?
                                                   B594E1D0 ? 80C01E0 ?
ttcacs()+2157        call     ttctrmbk()
          0 ? 0 ? 0 ? 144F01 ?
                                                   B7FF0001 ? 1 ?
ttcdrv()+991         call
    B67B6A18             80C00CC ? 80C2E50 ? 0 ? 0 ?
                                                   1 ? 80CBFBC ?
nioqwa()+48
         call     00000000             80C211C ? 0 ?
upirtrc()+1133       call     00000000             80C0170 ? 0 ? B7D7B4E0
?
                                                   80C211C ? 80C00F4 ? 0 ?
kpurcsc()+96         call     B679D258
      80C00CC ? 5E ? 80C1FFC ?
                                                   80C211C ? 80C2E50 ?

                               B7D7F8B4 ? 80C3650 ?
                                                   80CBFBC ? 1 ? B6803A8A
?
                                                   80CBF70 ?
kpuexec()+9214       call     B67A66E8             80903A8
? 5E ? 80C1FFC ?
                                                   80C211C ? 80C2E50 ?

                 B7D7F8B4 ? 80C3650 ?
                                                   80CBFBC ? 1 ? B6803A8A ?

                                           80CBF70 ? B5953364 ?
OCIStmtExecute()+50  call     B67B41A8             80903A8 ?
80CBF70 ? 8090988 ?
                                                   1 ? 0 ? 0 ? 0 ? 0 ? 0 ?
threadfunc2()+1498   call
   OCIStmtExecute()     80903A8 ? 80CBF70 ? 8090988 ?
                                                   1 ? 0 ? 0 ? 0 ? 0 ?
start_thread()+129
  call     00000000             BFFFAB80 ? 0 ? 0 ? 0 ?
__clone()+94         call     00000000             B5953BB0 ? 0 ? 0 ? 0 ? 0 ?
                                                   0 ?

tkpgd08 provided a bogus pointer in the bind variable which is passed to lxsCntExc() as the first argument having the hex value FFFFFFF6 resulting in this crash.

SEGV handler is on by default and one can turn it off by adding the following entry in sqlnet.ora:

DIAG_SIGHANDLER_ENABLED=FALSE

The location of dump files varies, please refer to http://st-doc.us.oracle.com/11/111/appdev.111/b28395/oci10new.htm#CHDCIAHJ and ADRCI documentation for details.http://st-doc.us.oracle.com/11/111/appdev.111/b28395/oci10new.htm#CHDCIAHJ and ADRCI documentation for details.

4. Hangs/Infinite loops

A client application can hang or run in an infinite loop either in its code or in the client shared library provided by Oracle. One can attach to the application process and determine the call stack. Debugger can be used if available. When a debugger is not available, one can use pstack on UNIX periodically, dbx on AIX or ADPlus on Windows (http://support.microsoft.com/kb/286350). Alternatively, one can issue a kill command and obtain the stack.

5. Multi-threaded application failures

Typical problems in a multi-threaded application are crashes, inconsistent results or dead locks. Diagnosis of crashes was provided in section “Crashes” earlier. Inconsistent results are generally due to synchronization problems and one can try to reproduce the problem without using the threaded mode. If it is difficult to modify the application or the source code itself does not exist, one can turn on event 10842 to trace the functions invoked by the application, parameters passed and the data sent and received from the server (see “First-failure capture and call context” section on capturing RPC information) Enabling tracing may not reproduce the problem as it will alter the timing, and a diagnostic patch may be necessary. A dead lock will result in a hang and to diagnose the problem one should obtain the stack traces using the utilities described in section ”Hangs/Infinite loops” section. Enabling event-based tracing described in “Event-based tracing” section will also help diagnosing the problem.

6. Memory leaks

Memory leaks in OCI can be determined using the event 10235, please refer to kgh.h for necessary levels or $VIEW/rdbms/notes/kgh_debug.txt ADE view on a development machine for details. When memory leak cannot be determined using the event 10235, one may modify the application or the test provided to add the following memory allocation and deallocation hooks which are specifically for Linux and Windows. Linux:

Add the following lines in the main .c file:

/* Prototypes for __malloc_hook, __free_hook */ #include <malloc.h> /* Prototypes for our hooks. */ static void my_init_hook (void); static void *my_malloc_hook (size_t, const void *); static void my_free_hook (void*, const void *); /* Override initializing hook from the C library. */ void (*__malloc_initialize_hook) (void) = my_init_hook; static void (*old_malloc_hook) (size_t, const void *); static void (*old_free_hook) (void*, const void *); static void my_init_hook (void) { old_malloc_hook = __malloc_hook; old_free_hook = __free_hook; __malloc_hook = my_malloc_hook; __free_hook = my_free_hook; } static void * my_malloc_hook (size_t size, const void *caller) { void *result; /* Restore all old hooks */ __malloc_hook = old_malloc_hook; __free_hook = old_free_hook; /* Call recursively */ result = malloc (size); /* Save underlying hooks */ old_malloc_hook = __malloc_hook; old_free_hook = __free_hook; /* printf might call malloc, so protect it too. */ printf ("allocated=%u bytes", (unsigned int) size); printf("\n"); /* Restore our own hooks */ __malloc_hook = my_malloc_hook; __free_hook = my_free_hook; return result; } static void my_free_hook (void *ptr, const void *caller) { /* Restore all old hooks */ __malloc_hook = old_malloc_hook; __free_hook = old_free_hook; /* Call recursively */ free (ptr); /* Save underlying hooks */ old_malloc_hook = __malloc_hook; old_free_hook = __free_hook; /* printf might call free, so protect it too. */ printf ("freed address=%p\n", ptr); /* Restore our own hooks */ __malloc_hook = my_malloc_hook; __free_hook = my_free_hook; }

NOTE: Preferred approach is to use LD_PRELOAD requiring no modification of the application source. A shared library may be provided in future to accomplish this.

Windows: Refer to crtdbgSetupHook interface on MSDN. OCI will add this capability in future release that can be enabled using the OCI’s trace event.

Memory leaks in object cache can be detected using event 24471 and is described in “Event-based tracing” section below.

7. Non-blocking errors

To diagnose problems with non-blocking enable event-based tracing described in section below.

8. Errors due to configuration problems

Any issues with configuration where the LD_LIBRARY_PATH or PATH are set incorrectly can be determined using the OS-provided tools (ldd on UNIX and dumpbin on Windows) OCI now reports the library dependency when an internal error is encounter and is noted in the incident trace file under section “Program dependency”. For example, tkpgd08 will show the following:

9. Object-Relational defects Typical problems in objects are: memory leaks, duration leaks, duration mismatch, mismatch between in-memory object representation and on-disk representation during unpickle and pickle operation, incorrect type descriptor passed to pickle and unpickle operation. Image handles and object memory pointers are now recorded in the call context and will be dumped when any internal errors are encountered during pickle or unpickle operations for first-failure analysis. If the information in the incident trace file is insufficient, one can enable various events in the object layer described below.    

10. Event-based tracing

Following events are available in OCI for tracing:

A. Event 10842

This event traces the functions invoked by the application and can be set on both the client and the server. On client, it can be set in the environment variable EVENT_10842 or sqlnet.ora by specifying EVENT_10842=level. Possible values are 0-16. The levels and their meanings have not changed in 11G. The trace facility is now thread-safe and works with diagnosability enabled and disabled. With diagnosability enabled, a separate trace file is created for each thread. Without diagnosability, two trace files are used and trace records are written to one file and then to the other when the file size limit is reached cycling between the two. With release 11G and beyond, “diagnosability” enabled is the preferred way to enable tracing. When diagnosability is turned off, one can specify a directory to write the traces to using the following environment variable:
  • ORA_CLIENTTRACE_DIR controls where the trace output is written. This environment variable is necessary when diagnosability is disabled which is not preferred in 11G. Output is written to the file $ORA_CLIENTTRACE_DIR/ora_skgu_pid.trc.
 
EVENT_10842 Settings on client
The environment variable EVENT_10842 controls what to trace. The value is specified in the format: "server=<>;user=<>;stmt=<>;level=<>;interval=<>" where:
server Comma separated list of servers "all" for all servers
user Comma separated list of users "all" for all users
stmt Comma separated list of statement types "all" for all statement types
  • level = 1 through 15 thus:
1 Trace all server attach and server detach calls for servers listed in "server" attribute of the environment variable
2 Trace all session begin, logon, session end, logoff calls for the users listed in "user" attribute of the environment variable
3 Trace all prepare, execute, fetch calls for the specified statement types listed in "stmt" attribute of environment variable
4 Trace all Bind, Define, Describe calls
5 Trace all OCI LOB calls
7 Get statistical info on all connection pooling /connection related calls
8 Get statistical info on all session info
9 Get statistical info on all handle info
10 Get statistical info on time taken in execute and fetch calls
11 Get statistical info on transaction related calls
15 Trace all calls with statistical info
  • interval = interval in seconds. Can be omitted.
  • filelen = length of the trace files in bytes. Can be omitted.
 
Examples
  1. To trace all OCI relational functions :
    setenv EVENT_10842 "server=all;user=all;stmt=all;level=15"
  2. To trace all OCI connection(server) calls for servers inst1 and inst2.
    setenv EVENT_10842 "server=inst1,inst2;user=all;stmt=all;level=1"

    In this case, the values set for user and stmt attributes of environment variable are ignored.

  3. To trace all OCI users (sessions) calls for users scott and miller.
    setenv EVENT_10842 "server=all;user=scott,miller;stmt=all;level=2"

    In this case, the values set for server and stmt attributes of environment variable are ignored.

  4. To trace all OCI statement execution related calls for statement type SELECT.
    setenv EVENT_10842 "server=all;user=all;stmt=select;level=3"

    In this case, the values set for server and user attributes of environment variable are ignored.

  5. To Trace all OCI bind, define related calls, the values of server, user and stmt does not matter.
    setenv EVENT_10842 "server=all;user=all;stmt=all;level=4"
  6. To trace all OCI LOB related calls, other attribute values are ignored, as only level number needs to be set to 5.
    setenv EVENT_10842 "server=all;user=all;stmt=all;level=5"
Event 10842 Settings on server

On server one can set the event 10842 to control what to trace. Only

the level is necessary, specifying other arguments mentioned above will

result in data base startup failure as the server does not know how to

parse them. Here is an example of setting the trace event to level 15

in the init.ora:

event="10842 trace name context forever, level 15"

B. Event 24471

Following levels are supported (from koh.h):

0x00000001 trace memory allocations & free
0x00000002 trace duration start & end
0x00000004 set optimal & max cache size to 0 to detect memory leaks. When this event is set object cache will free objects that are not used
0x00000008 trace kohih (object instance) allocations and frees
0x00000010 trace TSM events
0x00000020 Dump call stack for every koh function called. This is useful to find the caller of the object function. The appropriate levels and generate the trace files.

To detect memory leaks, use the method documented in test/tkpg/sosd/tkpgkohmem.pl.

To detect duration related issues, set the level to trace duration

start and end calls and follow the method described in

test/tkpg/sosd/tkpgkohdur.pl

C. Event 24472

This event is useful to trace object pin, duration and transaction operations in the object cache. Levels are as follows:

0x00000001 trace pins (pin, unpin, promote, etc)
0x00000002 trace duration operations
0x00000004 trace transaction operations

D. Event 24473

This event is useful for object data operations and the possible trace levels are:

0x00000001 trace pickle operations
0x00000002 trace unpickle operations
0x00000004 trace based on a type name

11. Client/Server communication

Issues

that may arise due to client and server communication include network

errors and errors in the range 3100-3199. To diagnose network related

issues, enable SQL*NET tracing. One can also use the method described

in “First-failure capture and call context” section above to dump the

data exchanged between the client and server

© 2010, Oracle Clinic. All rights reserved.

tags: , , , , , , , ,
posted in it技术, oracle by admin

Follow comments via the RSS Feed | Leave a comment | Trackback URL

Leave Your Comment

 
CopyRight YOUYUS.COM