The Case Of A Stuck LoadLibrary Call

I frequently encounter impossible bugs which after full analysis have a reasonable explanation although I have no source code to the problematic library. Todays fault is featured by

….

suspense

the Intel Math Kernel Library.

Lets use some fancy neural network library (AlgoLibrary!UsingCaffee) like Berkeley AI caffee.dll which in turn uses the Intel Math Kernel Library (mkl_core.dll) which hangs forever in a call to LoadLibrary. The first thing to do when you are stuck is taking memory dump (e.g. use Task Manager or procdump)

and loading it into Windbg or Visual Studio. Since you are reading this I assume you want to know more about the mysterious commands Windbg has to offer. You can load a memory dump into Windbg from the File – Menu

To get an overview where all threads of a process are standing or executing you can dump with the ~* command the following command on all threads. The dump thread stack command in Windbg is k. To dump all threads you need to execute therefore

~*k

107  Id: 67b0.8094 Suspend: 0 Teb: 00000049`9df88000 Unfrozen
Call Site
ntdll!NtWaitForSingleObject+0x14
ntdll!LdrpDrainWorkQueue+0xe5
ntdll!LdrpLoadDllInternal+0xc2
ntdll!LdrpLoadDll+0xf1
ntdll!LdrLoadDll+0x8c
KERNELBASE!LoadLibraryExW+0x16f
KERNELBASE!LoadLibraryExA+0x31
KERNELBASE!LoadLibraryA+0x39
mkl_core!mkl_serv_mpi_get_ppn+0x48
mkl_core!mkl_serv_mkl_malloc+0x7e7
caffe!caffe::CaffeMallocHost+0x16
caffe!caffe::SyncedMemory::to_cpu+0x116
caffe!caffe::SyncedMemory::mutable_cpu_data+0xe
caffe!caffe::Blob<float>::Reshape+0x175
caffe!caffe::Blob<float>::Reshape+0x138
caffe!caffe::InputLayer<float>::LayerSetUp+0x184
caffe!caffe::Layer<float>::SetUp+0x15
caffe!caffe::Net<float>::Init+0x898
caffe!caffe::Net<float>::Net<float>+0x473
AlgoLibrary!UsingCaffee

When you load a memory dump into Windbg you can issues with the k command a full stack walk which will output hopefully a nice stack trace after you have resolved your symbols from Microsoft, Intel and the caffee.dll. We see that LoadLibrary is stuck while caffee tries to allocate memory which internally loads a library. We would expect that our call from AlgoLibrary!UsingCaffee should be finished within a few ms. But this one hangs forever.

The allocation code comes from this:

 *ptr = mkl_malloc(size ? size:1, 64);

When LoadLibrary hangs it is due to some other thread holding the OS Loader lock. This loader lock is effectively serializing dll loading and global/thread static variables initialization and release. See https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain for a more elaborate description.

With Windbg and the !peb command we can check further

dt ntdll!_PEB 499df0d000
+0x0e8 NumberOfHeaps    : 0x15
+0x0ec MaximumNumberOfHeaps : 0x20
+0x0f8 GdiSharedHandleTable : 0x00000183`71d00000 Void
+0x100 ProcessStarterHelper : (null) 
+0x108 GdiDCAttributeList : 0x14
+0x10c Padding3         : [4]  ""
+0x110 LoaderLock       : 0x00007ffe`f447a568 _RTL_CRITICAL_SECTION
+0x118 OSMajorVersion   : 0xa
+0x11c OSMinorVersion   : 0
+0x120 OSBuildNumber    : 0x3839
+0x122 OSCSDVersion     : 0
+0x124 OSPlatformId     : 2

We can dump the Loader Lock structure by clicking on the blue underlined link in Windbg. But this formatting has not survived in WordPress.

0:107> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffef447a568)                 : 0x7ffef447a568 [Type: _RTL_CRITICAL_SECTION *]
[+0x000] DebugInfo        : 0x7ffef447a998 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
[+0x008] LockCount        : -2 [Type: long]
[+0x00c] RecursionCount   : 1 [Type: long]
[+0x010] OwningThread     : 0x6710 [Type: void *]
[+0x018] LockSemaphore    : 0x0 [Type: void *]
[+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Ahh now we know that thread 0x6710 is owning the lock. Switch to that thread with the command ~~[0x6710]s which allows you specify OS thread id directly. The usual ~dds command switches to debugger enumerated threads which start always at 0. That is convenient but sometimes you need to switch by thread id.

0:114> k

ntdll!NtYieldExecution+0x14
KERNELBASE!SwitchToThread+0x1d
mkl_core!mkl_serv_lock+0x6d
mkl_core!mkl_serv_thread_free_buffers+0x384
mkl_core+0x1016
mkl_core!sdttrsb_4dfl+0x31a
ntdll!LdrpCallInitRoutine+0x4b
ntdll!LdrShutdownThread+0x151
ntdll!RtlExitUserThread+0x3e
kernel32!BaseThreadInitThunk+0x1c
ntdll!RtlUserThreadStart+0x21

That stack trace is strange because it calls SwitchToThread which basically tells the OS if any other runnable thread for this core is ready to run it should run. If not this is essentially a CPU burning endless loop. Such constructs are normally used in Spinlocks where ones takes at first not lock at all to prevent the costs of a context switch at the expense of burning CPU cycles. In this case it looks like mkl_core is cleaning up some thread static variables which requires synchronization with some lock object which is taken by some other thread. Since I do not have the code for this I can only speculate which thread it is.

Lets get the CPU time this thread to see how long it tries to enter the lock:

0:114> !runaway
 User Mode Time
  Thread       Time
  114:6710     0 days 2:14:28.671
   17:62a4     0 days 0:00:47.656
   92:88bc     0 days 0:00:08.218
   22:107c     0 days 0:00:02.953
   19:3a9c     0 days 0:00:02.765

Nearly all CPU of the process (2h 15 minutes) is spent by this thread (2h 14 minutes) which wastes CPU cycles for hours.

Since we have only one other mkl thread running which is our stuck thread in the LoadLibrary call it all makes sense now. The LoadLibrary call hangs because thread 0x6710 is exiting and cleaning up some thread static variables which takes a lock which our stuck LoadLibrary thread already possesses. But since our shutting down thread owns the loader lock we have two dependant locks owned by two different threads -> Deadlock.

You need to be careful on which threads you execute MKL code which are exiting at some point in time while new threads are initializing new mkl variables. At some points in time you can see funny deadlocks which look really strange. One thread burns CPU like hell which is coming from an exiting thread and your own code is stuck in a LoadLibrary call which waits for the Loader Lock to become free.

If you dig deep enough you can find the root cause even when you do not have the full source code. As added bonus I have learned some arcane Windbg commands along the way. These will be helpful the next time you need to dig into some stuck threads.

2 thoughts on “The Case Of A Stuck LoadLibrary Call

  1. Instead of ~*k you can use !uniqstack which coalesces identical stacks. Or even better, open the dump in Visual Studio, and use the Parallel Stacks view, that gets you a nice and handy prefix tree breakdown of all call stacks.

    Like

    • Yes I use these also. Visual Studio has become pretty good at showing nice stacks but it often annoys me that some threads are missing, or stacks are shortened. E.g. if you have a stuck finalizer it will not show the full finalizer thread call stack. Or if you have an exiting thread VS tends to forget about this one and does not even show that thread in the thread list. For a normal analysis VS is a good start.

      Like

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.