Moy Blog

September 27, 2009

Quick tip for debugging deadlocks

Filed under: C/C++,linux — moy @ 00:34

If you ever find yourself with a deadlock in your application, you can use gdb to attach to the application, then sometimes you find one of the threads that is stuck trying to lock mutex x. Then you need to find out who is currently holding x and therefore deadlocking your thread (and equally important why the other thread is not releasing it).

At least on recent libc implementations in Linux, the mutex object seems to have a member named “__owner”. Let me show you what I recently saw when debugging a deadlocked application.

(gdb) f 4
#4  0x0805ab46 in ACE_OS::mutex_lock (m=0xa074248) at include/ace/OS.i:1406
1406      ACE_OSCALL_RETURN (ACE_ADAPT_RETVAL (pthread_mutex_lock (m), ace_result_),
(gdb) p *m
$8 = {__data = {__lock = 2, __count = 0, __owner = 17828, __kind = 0, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}},
  __size = "\002\000\000\000\000\000\000\000�E\000\000\000\000\000\000\001\000\000\000\000\000\000", __align = 2}

We can see that the __owner is 17828. This number is the LWP (Light-weight process) id of the thread holding the lock. Now you can go to examine that thread stack and find out why that thread is also stuck.

This example also brings up a regular point of confusion for some Linux application developers. What is the difference between LWP and POSIX thread id ( the pthread_t type in pthread.h)?. The difference is that pthread_t is a user space concept, is simply an identifier for the thread library implementing POSIX threads to refer to the thread and its resources, state etc. However the LWP is an implementation detail of how the Linux kernel implements threads, which is done through the “thread group” concept and LWP’s, that are processes that share memory pages and other resources with the other processes in the same thread group.

From the Linux kernel point of view the pthread_t value doesn’t mean anything, the LWP id is how you identify threads in the kernel, and they share the same numbering as regular processes, since LWPs are just a special type of process. Knowing this is useful when using utilities like strace. When you want to trace a particular thread of a multi threaded application, you need to provide the LWP of the thread you want to trace, a common mistake is to provide the process id, which in a multithreaded application the process id is just the LWP of the first thread in the application (the one that started executing main()).

Here is how you get each identifier in a C program:

#include <stdio.h>
#include <syscall.h>
#include <pthread.h>
 
int main()
{
  pthread_t tid = pthread_self();
  int sid = syscall(SYS_gettid);
  printf("LWP id is %d\n", sid);
  printf("POSIX thread id is %d\n", tid);
  return 0;
}

It’s important to note that getting the POSIX thread id is much faster than the LWP, because pthread_self() is just a library call and libc most likely has this value cached somewhere in user space, no need to go down to the kernel. As you can see, getting the LWP requires a call to the syscall() function, which effectively executes the requested system call, this is expensive (well, compared with the time required to enter a simple user space function).

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress