The LD_PRELOAD trick
If you ever needed to find out where a specific function from an outside library was being called, how would you do this?
For example – in the land of Python – you have this single line of code that imports TorchDynamo
(I’m using Python 3.11 and PyTorch 2.3.1):
import torch._dynamo
You run the code above, and you get this loud warning:
/usr/lib/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
This warning says that there are issues with the NVIDIA driver setup on your machine. This is correct – in fact, your machine doesn’t have any GPUs, so the warning is completely expected. But it really annoys you, and you want to find out where it’s coming from. Why is a very commonly used package spamming me with false-positive warnings, you wonder?
Python is really nice in the sense that you can easily override any function in any package from inside your application. Here’s a short snippet of how I would take advantage of that to find the stack trace of where that warning is coming from.
import torch
from torch.cuda import device_count as orig_device_count

def my_device_count() -> int:
    import traceback
    traceback.print_stack()
    return orig_device_count()

torch.cuda.device_count = my_device_count

import torch._dynamo
After some quick glances through the PyTorch source code, I realize that the torch.cuda.device_count function is where this warning comes from.
Then, I simply create my own version of that function that wraps around the original PyTorch version, and inject whatever I want around it! In this case I inject some code to print out the stack trace, and now I can see the entire import tree leading to the annoying NVML warning.
Here’s what the output (with some redundant stuff stripped) looks like. Mission accomplished!
File "dynamo.py", line 11, in <module>
import torch._dynamo
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
from . import convert_frame, eval_frame, resume_execution
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 40, in <module>
from . import config, exc, trace_rules
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 50, in <module>
from .variables import (
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/__init__.py", line 4, in <module>
from .builtin import BuiltinVariable
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 42, in <module>
from .ctx_manager import EventVariable, StreamVariable
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/ctx_manager.py", line 12, in <module>
from ..device_interface import get_interface_for_device
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/device_interface.py", line 198, in <module>
for i in range(torch.cuda.device_count()):
File "dynamo.py", line 6, in my_device_count
traceback.print_stack()
/usr/lib/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Now, what if you wanted to do this in a language like C or C++ where you can’t just override functions in libraries?
Enter LD_PRELOAD!
What exactly does LD_PRELOAD do? You can set it to the path of a shared library (.so file), and that shared library will be loaded before any other shared library that your application loads. This allows you to intercept functions in shared libraries and do your own thing in there.
This is similar to what we did above, but it is not restricted to Python packages and works with most languages on a standard Linux system! Python (like most programs on Linux) loads shared libraries too – you can see which shared libraries it loads using the ldd command:
ldd /usr/lib/python
linux-vdso.so.1 (0x00008000001b3000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000155554f4a000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000155554f45000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x0000155554f3e000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000155554e57000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000155554c2f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000155554c2a000)
/lib64/ld-linux-x86-64.so.2 (0x000015555551a000)
Here’s an example where I intercept calls to the mmap system call. Technically this doesn’t actually intercept the mmap system call (it’s not possible to intercept system calls with LD_PRELOAD); it intercepts the glibc wrapper for the mmap system call.
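To make that distinction concrete, here’s a minimal sketch (my own illustration, assuming an x86_64 Linux box – the file name raw_mmap.c is made up) of a program that issues the mmap system call directly via syscall(2). No library function named mmap is ever called, so an LD_PRELOAD hook on mmap would never see this mapping:
// raw_mmap.c – the raw system call bypasses the glibc wrapper entirely
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // Ask the kernel for an anonymous mapping without going through glibc's mmap()
    void *p = (void *)syscall(SYS_mmap, NULL, (size_t)4096,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, (off_t)0);
    printf("raw mmap syscall returned %p\n", p);
    return 0;
}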
First, let’s create a toy application that uses mmap. Here’s one that ChatGPT generated (with some small modifications):
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main() {
    size_t length = 4096; // Size of the memory to map

    // Create an anonymous memory mapping
    char *mapped_memory = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapped_memory == MAP_FAILED) {
        perror("Error creating memory map");
        exit(EXIT_FAILURE);
    }

    // Write some data to the mapped memory
    const char *message = "Hello, memory-mapped world!";
    strncpy(mapped_memory, message, strlen(message));

    // Read and print the data from the mapped memory
    printf("Mapped memory content: %s\n", mapped_memory);

    // Unmap the memory
    if (munmap(mapped_memory, length) == -1) {
        perror("Error unmapping memory");
        exit(EXIT_FAILURE);
    }

    return 0;
}
This is the output of the program by default:
$ gcc program.c -o program
$ ./program
Mapped memory content: Hello, memory-mapped world!
Now, let’s write a shared library to intercept the mmap call and print out a stack trace!
#define _GNU_SOURCE
#include <execinfo.h>
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

typedef void* (*mmap_fn)(void*, size_t, int, int, int, off_t);

static mmap_fn orig_mmap;

void print_stack_trace() {
    // Generated by ChatGPT
    void *array[10];
    size_t size;
    char **strings;
    size_t i;

    // Get the array of void* pointers for the stack
    size = backtrace(array, 10);

    // Get the strings for each stack frame
    strings = backtrace_symbols(array, size);

    printf("Stack trace:\n");
    for (i = 0; i < size; i++) {
        printf("%s\n", strings[i]);
    }

    free(strings);
}

void __attribute__((constructor)) init_hooks(void) {
    orig_mmap = dlsym(RTLD_NEXT, "mmap");
    if (!orig_mmap) {
        exit(-1);
    }
}

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off) {
    print_stack_trace();
    return orig_mmap(addr, len, prot, flags, fildes, off);
}
Although this code may look foreign at first, conceptually it’s very simple:
- I define a special function init_hooks with the __attribute__((constructor)) tag. This tag is a compiler extension (i.e. not part of the language), but most compilers should support it. With the __attribute__((constructor)) tag, init_hooks is called when the shared library is loaded!
- Inside the init_hooks function, I call dlsym(RTLD_NEXT, "mmap"). This call returns the address of the “next” occurrence of the mmap function it can find in any library. Since we’ll be LD_PRELOAD-ing the application with a shared library built from this file, the first occurrence of mmap will be the one we actually define right below! The next occurrence of mmap will then be the mmap implementation in glibc (see the small sketch after this list if you want to verify that yourself).
- We define our own mmap function. When we combine this code with LD_PRELOAD, all mmap calls will go to this definition. As a result, we simply print the stack trace and call the original mmap implementation from glibc, so stuff still executes normally.
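If you want to double-check that RTLD_NEXT really does skip our own definition and land in glibc, here’s a minimal sketch (my own addition, not part of the original example – the file name check_next.c is made up) of a tiny preloaded library that asks dladdr which shared object the “next” mmap lives in:
// check_next.c – compile with: gcc check_next.c -o check_next.so -shared -fPIC -ldl
// Run with: LD_PRELOAD=./check_next.so ./program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void __attribute__((constructor)) report_next_mmap(void) {
    void *next_mmap = dlsym(RTLD_NEXT, "mmap");
    Dl_info info;
    // dladdr tells us which shared object a resolved address belongs to
    if (next_mmap && dladdr(next_mmap, &info) && info.dli_fname) {
        // On a typical glibc system this prints a path ending in libc.so.6
        fprintf(stderr, "next mmap lives in: %s\n", info.dli_fname);
    }
}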
Now, let’s compile and run the intercept library.
$ gcc intercept.c -o intercept.so -ldl -fPIC -shared
$ LD_PRELOAD=./intercept.so ./program
Stack trace:
./intercept.so(print_stack_trace+0x2c) [0x155555514225]
./intercept.so(mmap+0x2c) [0x155555514316]
./toy_program(+0x1244) [0x555555555244]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x155555306d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x155555306e40]
./toy_program(+0x1145) [0x555555555145]
Mapped memory content: Hello, memory-mapped world!
Voilà! Our hook is now working, and the program still functions normally.
Here are some more practical use cases of stuff like this:
- Injecting test / debugging / profiling code into a real application, just like we did above (fun fact: I have done this before, which led to me writing this post!) – a small fopen-logging sketch follows this list.
- Replacing the standard glibc malloc implementation with a custom one (LD_PRELOAD=./my_optimized_malloc.so ./program)
- https://github.com/wolfcw/libfaketime
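As a flavor of the first use case, here’s a minimal sketch (my own illustration, not from any particular project – the file name fopen_hook.c is made up) of a debugging hook that logs every fopen call a program makes, using exactly the same constructor-plus-RTLD_NEXT pattern as above. Note that programs built with 64-bit file offsets may call fopen64 instead of fopen, so a real tool would probably hook that as well.
// fopen_hook.c – compile with: gcc fopen_hook.c -o fopen_hook.so -shared -fPIC -ldl
// Run with: LD_PRELOAD=./fopen_hook.so ./your_program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef FILE *(*fopen_fn)(const char *, const char *);

static fopen_fn orig_fopen;

void __attribute__((constructor)) init_fopen_hook(void) {
    // Same trick as before: grab the "next" fopen, which is the glibc one
    orig_fopen = (fopen_fn)dlsym(RTLD_NEXT, "fopen");
}

FILE *fopen(const char *path, const char *mode) {
    // Log the call, then forward it to the real implementation
    fprintf(stderr, "[hook] fopen(\"%s\", \"%s\")\n", path, mode);
    return orig_fopen(path, mode);
}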