The LD_PRELOAD trick
If you ever needed to find out where a specific function from an outside library was being called, how would you do this?
For example – in the land of Python – you have this single line of code that imports TorchDynamo
(I’m using Python 3.11 and PyTorch 2.3.1):
import torch._dynamo
You run the code above, and you get this loud warning:
/usr/lib/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
This warning says that there are issues with the NVIDIA driver setup on your machine. This is correct – in fact, your machine doesn’t have any GPUs, so the warning is completely expected. But it really annoys you, and you want to find out where it’s coming from. Why is a very commonly used package spamming me with false-positive warnings, you wonder?
Python is really nice in the sense that you can easily override any function in any package from inside your application. Here’s a short snippet of how I would take advantage of that to find the stack trace of where that warning is coming from.
import torch
from torch.cuda import device_count as orig_device_count

def my_device_count() -> int:
    import traceback
    traceback.print_stack()
    return orig_device_count()

torch.cuda.device_count = my_device_count

import torch._dynamo
After some quick glances through the PyTorch source code, I realize that the torch.cuda.device_count function is where this warning comes from.
Then, I simply create my own version of that function that wraps around the original PyTorch version, and inject whatever I want around it! In this case I inject some code to print out the stack trace, and now I can see the entire import tree leading to the annoying NVML warning.
Here’s what the output (with some redundant stuff stripped) looks like. Mission accomplished!
File "dynamo.py", line 11, in <module>
import torch._dynamo
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
from . import convert_frame, eval_frame, resume_execution
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 40, in <module>
from . import config, exc, trace_rules
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/trace_rules.py", line 50, in <module>
from .variables import (
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/__init__.py", line 4, in <module>
from .builtin import BuiltinVariable
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 42, in <module>
from .ctx_manager import EventVariable, StreamVariable
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/variables/ctx_manager.py", line 12, in <module>
from ..device_interface import get_interface_for_device
File "/usr/lib/conda/lib/python3.11/site-packages/torch/_dynamo/device_interface.py", line 198, in <module>
for i in range(torch.cuda.device_count()):
File "dynamo.py", line 6, in my_device_count
traceback.print_stack()
/usr/lib/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:619: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Now, what if you wanted to do this in a language like C or C++ where you can’t just override functions in libraries?
Enter LD_PRELOAD!
What exactly does LD_PRELOAD do? You can set it to the path of a shared library (.so file), and that shared library will be loaded before any other shared library that your application loads. This allows you to intercept functions in shared libraries and do your own thing in there.
This is similar to what we did above, but it is not restricted to Python packages and works with most languages on a standard Linux system! Python (like most programs on Linux) loads shared libraries too – you can see which shared libraries it loads using the ldd command:
ldd /usr/lib/python
linux-vdso.so.1 (0x00008000001b3000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000155554f4a000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000155554f45000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x0000155554f3e000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000155554e57000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000155554c2f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x0000155554c2a000)
/lib64/ld-linux-x86-64.so.2 (0x000015555551a000)
Here’s an example where I intercept calls to the mmap system call. Technically this doesn’t actually intercept the mmap system call (it’s not possible to intercept system calls with LD_PRELOAD); it intercepts the glibc wrapper for the mmap system call.
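To make that distinction concrete, here’s a minimal sketch (my own illustration, assuming an x86_64 Linux box – the file name raw_mmap.c is made up) of a program that issues the mmap system call directly via syscall(2). No library function named mmap is ever called, so an LD_PRELOAD hook on mmap would never see this mapping:
// raw_mmap.c – the raw system call bypasses the glibc wrapper entirely
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // Ask the kernel for an anonymous mapping without going through glibc's mmap()
    void *p = (void *)syscall(SYS_mmap, NULL, (size_t)4096,
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, (off_t)0);
    printf("raw mmap syscall returned %p\n", p);
    return 0;
}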
First, let’s create a toy application that uses mmap. Here’s one that ChatGPT generated (with some small modifications):
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

int main() {
    size_t length = 4096; // Size of the memory to map

    // Create an anonymous memory mapping
    char *mapped_memory = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mapped_memory == MAP_FAILED) {
        perror("Error creating memory map");
        exit(EXIT_FAILURE);
    }

    // Write some data to the mapped memory
    const char *message = "Hello, memory-mapped world!";
    strncpy(mapped_memory, message, strlen(message));

    // Read and print the data from the mapped memory
    printf("Mapped memory content: %s\n", mapped_memory);

    // Unmap the memory
    if (munmap(mapped_memory, length) == -1) {
        perror("Error unmapping memory");
        exit(EXIT_FAILURE);
    }

    return 0;
}
This is the output of the program by default:
$ gcc program.c -o program
$ ./program
Mapped memory content: Hello, memory-mapped world!
Now, let’s write a shared library to intercept the mmap call and print out a stack trace!
#define _GNU_SOURCE
#include <execinfo.h>
#include <stdlib.h>
#include <stdio.h>
#include <dlfcn.h>

typedef void* (*mmap_fn)(void*, size_t, int, int, int, off_t);

static mmap_fn orig_mmap;

void print_stack_trace() {
    // Generated by ChatGPT
    void *array[10];
    size_t size;
    char **strings;
    size_t i;

    // Get the array of void* pointers for the stack
    size = backtrace(array, 10);

    // Get the strings for each stack frame
    strings = backtrace_symbols(array, size);

    printf("Stack trace:\n");
    for (i = 0; i < size; i++) {
        printf("%s\n", strings[i]);
    }

    free(strings);
}

void __attribute__((constructor)) init_hooks(void) {
    orig_mmap = dlsym(RTLD_NEXT, "mmap");
    if (!orig_mmap) {
        exit(-1);
    }
}

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off) {
    print_stack_trace();
    return orig_mmap(addr, len, prot, flags, fildes, off);
}
Although this code may look foreign at first, conceptually it’s very simple:
- I define a special function init_hooks with the __attribute__((constructor)) tag. This tag is a compiler extension (i.e. not part of the language), but most compilers should support it. With the __attribute__((constructor)) tag, init_hooks is called when the shared library is loaded!
- Inside the init_hooks function, I call dlsym(RTLD_NEXT, "mmap"). This call returns the address of the “next” occurrence of the mmap function it can find in any library. Since we’ll be LD_PRELOAD-ing the application with a shared library built from this file, the first occurrence of mmap will be the one we actually define right below! The next occurrence of mmap will then be the mmap implementation in glibc (see the small sketch after this list if you want to verify that yourself).
- We define our own mmap function. When we combine this code with LD_PRELOAD, all mmap calls will go to this definition. As a result, we simply print the stack trace and call the original mmap implementation from glibc, so stuff still executes normally.
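If you want to double-check that RTLD_NEXT really does skip our own definition and land in glibc, here’s a minimal sketch (my own addition, not part of the original example – the file name check_next.c is made up) of a tiny preloaded library that asks dladdr which shared object the “next” mmap lives in:
// check_next.c – compile with: gcc check_next.c -o check_next.so -shared -fPIC -ldl
// Run with: LD_PRELOAD=./check_next.so ./program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

void __attribute__((constructor)) report_next_mmap(void) {
    void *next_mmap = dlsym(RTLD_NEXT, "mmap");
    Dl_info info;
    // dladdr tells us which shared object a resolved address belongs to
    if (next_mmap && dladdr(next_mmap, &info) && info.dli_fname) {
        // On a typical glibc system this prints a path ending in libc.so.6
        fprintf(stderr, "next mmap lives in: %s\n", info.dli_fname);
    }
}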
Now, let’s compile and run the intercept library.
$ gcc intercept.c -o intercept.so -ldl -fPIC -shared
$ LD_PRELOAD=./intercept.so ./program
Stack trace:
./intercept.so(print_stack_trace+0x2c) [0x155555514225]
./intercept.so(mmap+0x2c) [0x155555514316]
./toy_program(+0x1244) [0x555555555244]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x155555306d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x155555306e40]
./toy_program(+0x1145) [0x555555555145]
Mapped memory content: Hello, memory-mapped world!
Voilà! Our hook is now working, and the program still functions normally.
Here are some more practical use cases of stuff like this:
- Injecting test / debugging / profiling code into a real application, just like we did above (fun fact: I have done this before, which led to me writing this post!) – a small fopen-logging sketch follows this list.
- Replacing the standard glibc malloc implementation with a custom one (LD_PRELOAD=./my_optimized_malloc.so ./program)
- https://github.com/wolfcw/libfaketime
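As a flavor of the first use case, here’s a minimal sketch (my own illustration, not from any particular project – the file name fopen_hook.c is made up) of a debugging hook that logs every fopen call a program makes, using exactly the same constructor-plus-RTLD_NEXT pattern as above. Note that programs built with 64-bit file offsets may call fopen64 instead of fopen, so a real tool would probably hook that as well.
// fopen_hook.c – compile with: gcc fopen_hook.c -o fopen_hook.so -shared -fPIC -ldl
// Run with: LD_PRELOAD=./fopen_hook.so ./your_program
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef FILE *(*fopen_fn)(const char *, const char *);

static fopen_fn orig_fopen;

void __attribute__((constructor)) init_fopen_hook(void) {
    // Same trick as before: grab the "next" fopen, which is the glibc one
    orig_fopen = (fopen_fn)dlsym(RTLD_NEXT, "fopen");
}

FILE *fopen(const char *path, const char *mode) {
    // Log the call, then forward it to the real implementation
    fprintf(stderr, "[hook] fopen(\"%s\", \"%s\")\n", path, mode);
    return orig_fopen(path, mode);
}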