I frequently encounter seemingly impossible bugs which, after a full analysis, turn out to have a reasonable explanation, even though I have no source code for the problematic library. Today's fault is the following:
A fancy neural network library (AlgoLibrary!UsingCaffee) uses Berkeley AI's caffe.dll, which in turn uses the Intel Math Kernel Library (mkl_core.dll), which hangs forever in a call to LoadLibrary. The first thing to do when you are stuck is to take a memory dump (e.g. with Task Manager or procdump)
and load it into Windbg or Visual Studio. Since you are reading this, I assume you want to know more about the mysterious commands Windbg has to offer. You can load a memory dump into Windbg from the File menu.
To get an overview of where all threads of a process currently stand, you can run a command on every thread with the ~* prefix. The command that dumps a thread's stack in Windbg is k. To dump the stacks of all threads you therefore execute ~* k.
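In the Windbg command window the two pieces combine like this ($$ starts a comment in Windbg scripts):

```
$$ Dump the call stack of every thread in the process:
~* k
```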
107  Id: 67b0.8094 Suspend: 0 Teb: 00000049`9df88000 Unfrozen
Call Site
ntdll!NtWaitForSingleObject+0x14
ntdll!LdrpDrainWorkQueue+0xe5
ntdll!LdrpLoadDllInternal+0xc2
ntdll!LdrpLoadDll+0xf1
ntdll!LdrLoadDll+0x8c
KERNELBASE!LoadLibraryExW+0x16f
KERNELBASE!LoadLibraryExA+0x31
KERNELBASE!LoadLibraryA+0x39
mkl_core!mkl_serv_mpi_get_ppn+0x48
mkl_core!mkl_serv_mkl_malloc+0x7e7
caffe!caffe::CaffeMallocHost+0x16
caffe!caffe::SyncedMemory::to_cpu+0x116
caffe!caffe::SyncedMemory::mutable_cpu_data+0xe
caffe!caffe::Blob<float>::Reshape+0x175
caffe!caffe::Blob<float>::Reshape+0x138
caffe!caffe::InputLayer<float>::LayerSetUp+0x184
caffe!caffe::Layer<float>::SetUp+0x15
caffe!caffe::Net<float>::Init+0x898
caffe!caffe::Net<float>::Net<float>+0x473
AlgoLibrary!UsingCaffee
When you load a memory dump into Windbg, you can issue a full stack walk with the k command, which will hopefully output a nice stack trace once you have resolved your symbols from Microsoft, Intel and caffe.dll. We see that LoadLibrary is stuck while caffe tries to allocate memory, which internally loads a library. We would expect our call from AlgoLibrary!UsingCaffee to finish within a few ms, but this one hangs forever.
The allocation code comes from this:
*ptr = mkl_malloc(size ? size:1, 64);
When LoadLibrary hangs, it is usually because some other thread holds the OS loader lock. This loader lock effectively serializes dll loading as well as the initialization and release of global and thread-static variables. See https://docs.microsoft.com/en-us/windows/win32/dlls/dllmain for a more elaborate description.
With the Windbg !peb command, which dumps the Process Environment Block, we can dig further.
We can dump the loader lock structure by clicking the blue underlined link in Windbg, but that formatting has not survived the transfer to WordPress.
Ah, now we know that thread 0x6710 owns the lock. Switch to that thread with the command ~~[0x6710]s, which lets you specify the OS thread id directly. The usual ~Ns command (e.g. ~5s) switches by debugger-enumerated thread index, which always starts at 0. That is convenient, but sometimes you need to switch by OS thread id.
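The two ways of switching threads side by side (the index 107 is the debugger-enumerated number of the hanging thread from the stack dump above):

```
$$ Switch by OS thread id (what we need here):
~~[0x6710]s
$$ Switch by debugger-enumerated thread index:
~107s
```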
That stack trace is strange because it calls SwitchToThread, which basically tells the OS that if any other runnable thread is ready for this core, it should run now. If there is none, this is essentially a CPU-burning endless loop. Such constructs are normally used in spinlocks, where one initially takes no lock at all to avoid the cost of a context switch, at the expense of burning CPU cycles. In this case it looks like mkl_core is cleaning up some thread-static variables, which requires synchronization on some lock object that is held by another thread. Since I do not have the code for this, I can only speculate which thread that is.
Let's get the CPU time of this thread (e.g. with the .ttime command) to see how long it has been trying to enter the lock:
Nearly all CPU time of the process (2h 15 minutes) is spent by this single thread (2h 14 minutes), which has been wasting CPU cycles for hours.
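A quick way to spot such a CPU burner across all threads is Windbg's !runaway command:

```
$$ User-mode CPU time per thread, sorted descending:
!runaway
$$ Flag 7 adds kernel and elapsed times:
!runaway 7
```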
Since we have only one other mkl thread running, which is our stuck thread in the LoadLibrary call, it all makes sense now. The LoadLibrary call hangs because thread 0x6710 is exiting and cleaning up some thread-static variables, which requires a lock that our stuck LoadLibrary thread already holds. But since the shutting-down thread owns the loader lock, we have two dependent locks owned by two different threads -> deadlock.
You need to be careful on which threads you execute MKL code when those threads exit at some point in time while new threads are initializing new MKL variables. Then you can see funny deadlocks which look really strange: one thread, the exiting one, burns CPU like hell, while your own code is stuck in a LoadLibrary call waiting for the loader lock to become free.
If you dig deep enough, you can find the root cause even when you do not have the full source code. As an added bonus, I have learned some arcane Windbg commands along the way. These will be helpful the next time you need to dig into some stuck threads.