Long running C++ NIF segfaults

Hi, I built a NIF out of a well known C++ Dicom library called DCMTK. The purpouse of this NIF is to stay loaded for the whole life of my Elixir/Phoenix application, where it’s functions are called many times.

The module that loads this NIF is called from an Agent and I thought it stay there forever, but as it is segfaulting after using it for a while I’m starting to think it is removed from memory after some time…

Let me show my code:

defmodule Prueba3.ImageCache do                                                                                                                                                                                                              
  use Agent                                                                                                                                                                                                                                  
  def start_link(_opts) do                                                                                                                                                                                                                   
    Agent.start_link(fn -> %{} end, name: :imagecache)                                                                                                                                                                                       
  def get(key) do                                                                                                                                                                                                                            
    Agent.get(:imagecache, &Map.get(&1, key))                                                                                                                                                                                                
  def put(key, value) do                                                                                                                                                                                                                     
    Agent.update(:imagecache, &Map.put(&1, key, value))                                                                                                                                                                                      
defmodule Dcmtknif do                                                                                                                                                                                                                        
  @on_load :load_nifs                                                                                                                                                                                                                        
  def load_nifs do                                                                                                                                                                                                                           
    :erlang.load_nif('./ssr/libssr', 0)                                                                                                                                                                                                      
  def _loadDicomFile(_dicomfile) do                                                                                                                                                                                                          
    raise "NIF _loadDicomFile/1 not implemented"                                                                                                                                                                                             
  def _getTagValue(_dcmff, _group, _element) do                                                                                                                                                                                              
    raise "NIF getTagValue/1 not implemented"                                                                                                                                                                                                
  def _getHeader(_dcmff) do                                                                                                                                                                                                                  
    raise "NIF getHeader/1 not implemented"                                                                                                                                                                                                  
  def _getPNG(_dcmff) do                                                                                                                                                                                                                     
    raise "NIF getPNG/1 not implemented"                                                                                                                                                                                                     
  def _setMinMaxWindow(_dcmff) do                                                                                                                                                                                                            
    raise "NIF setMinMaxWindow/1 not implemented"                                                                                                                                                                                            
  def _createScaledImage(_dcmff, _width, _height, _interpolate, _aspect) do                                                                                                                                                                  
    raise "NIF createScaledImage/5 not implemented"                                                                                                                                                                                          
  def _initialize() do                                                                                                                                                                                                                       
    raise "NIF initialize/0 not implemented"                                                                                                                                                                                                 
  def _finalize() do                                                                                                                                                                                                                         
    raise "NIF finalize/0 not implemented"                                                                                                                                                                                                   
  ... more functions

Do you see something wrong?

1 Like

I’m not an expert in NIFs, but in general the VM yields execution to the NIF when you call it, so you’re really not supposed to have “long-running NIFs”. When you enter a NIF, the VM cannot do GC, context, switching, and so on anymore.

I think a common solution in a case like yours, where you have essentially an external program that needs to run continuously, is to use ports. You spawn the program you want to run and you let your Elixir application manage its lifetime. Then, you communicate with the spawned program through stdout.

Does that make sense?

1 Like

Hi @whatyouhide, I think I wrongly explained what this NIF does. With “long-running” I mean it has state (internal variables pointing to data structures that can be accessed from the parent Elixir process), please don’t confuse with long loops or heavy processes inside it.

Regarding your ports suggestion, I can do that, but it adds complexity on the C++ side and I’m trying to maintain that part as small as possible.

1 Like

This is where the program crashes, I call the image function ~100 times (not always the same number of times) then it crashes:

I can assure you that the NIF hasn’t been unloaded. You would need to do that manually, and even then the VM takes great care not to pull the rug while anything references it.

I think I can see why it crashes though. What’s preventing image/2 from being called concurrently?

1 Like

Hi @jhogberg, nothing prevents that (in fact, it will be called concurrently at production), but as far as I understand the function createScaledImage_nif is thread safe (it receives the data that should use to work with by param, no global variables).

I added debug messages (std::cout) everywhere in the createScaledImage_nif, and found the app crashes just before entering into that function.

Imagine that you get several concurrent calls to image/2 with the same sopinstanceuid.

Isn’t there then a pretty good chance that they all end up calling createScaledImage_nif, setMinMaxWindow_nif, and getPNG_nif with the same resource?

Also, would you care to share the code for getPNG_nif?

1 Like

Yes, the program should allow concurrently call createScaledImage_nif, it has a pointer to a resource with the original image, then this function returns a scaled version of it, never overwriting the original.

Here’s the getPNG_nif function:

ERL_NIF_TERM getPNG_nif(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])                                                                                                                                                                 
    dcmfileformat_t *ff_res;                                                                                                                                                                                                                 
    if(!enif_get_resource(env, argv[0], ff_res_t, (void **) &ff_res)) {                                                                                                                                                                      
        return enif_make_badarg(env);                                                                                                                                                                                                        
    DcmDataset *ds = ff_res->_ff->getDataset();                                                                                                                                                                                              
    std::vector<ui8> out;                                                                                                                                                                                                                    
    writePNGtoMemory(ff_res->_dicomimage, &out, 0);                                                                                                                                                                                          
    ErlNifBinary bin;                                                                                                                                                                                                                        
    enif_alloc_binary(out.size(), &bin);                                                                                                                                                                                                     
    memcpy(bin.data, out.data(), out.size());                                                                                                                                                                                                
    return enif_make_binary(env, &bin);                                                                                                                                                                                                      
    return enif_make_atom(env, "ok");                                                                                                                                                                                                        

I commented the block that uses the cached image (lines 27 to 38) and the program didn’t segfaults. It looks like something is filling the cachedDcmFF variable with garbage:

 22   def image(conn, params) do                                                                                                                                                                                                             
 23     sopinstanceuid = params["sopinstanceuid"]                                                                                                                                                                                            
 24     {width, ""} = Integer.parse(params["width"])                                                                                                                                                                                         
 25     {height, ""} = Integer.parse(params["height"])                                                                                                                                                                                       
 26     cachedDcmFF = ImageCache.get(sopinstanceuid)                                                                                                                                                                                         
 27     if cachedDcmFF do                                                                                                                                                                                                                    
 28       IO.puts("====> CACHED! <====")                                                                                                                                                                                                     
 29       Dcmtknif.createScaledImage(cachedDcmFF, width, height, 0, 0);                                                                                                                                                                      
 30       IO.puts("====> createScaledImage <====")                                                                                                                                                                                           
 31       Dcmtknif.setMinMaxWindow(cachedDcmFF)                                                                                                                                                                                              
 32       IO.puts("====> setMinMaxWindow <====")                                                                                                                                                                                             
 33       pngimage = Dcmtknif.getPNG(cachedDcmFF)                                                                                                                                                                                            
 34       IO.puts("====> getPNG <====")                                                                                                                                                                                                      
 35       conn                                                                                                                                                                                                                               
 36       |> put_resp_content_type("image/png")                                                                                                                                                                                              
 37       |> send_resp(200, pngimage)                                                                                                                                                                                                        
 38     else                                                                                                                                                                                                                                 
 39       im = ApiController.image(sopinstanceuid)                                                                                                                                                                                           
 40       IO.puts(im.imagepath)                                                                                                                                                                                                              
 41       if File.exists?(im.imagepath) do                                                                                                                                                                                                   
 42         IO.puts("====> NOT CACHED, LOADING FROM FILE <====")                                                                                                                                                                             
 43         dicomfile = im.imagepath                                                                                                                                                                                                         
 44         dcmff = Dcmtknif.load(dicomfile)                                                                                                                                                                                                 
 45         ImageCache.put(sopinstanceuid, dcmff)                                                                                                                                                                                            
 46         Dcmtknif.createScaledImage(dcmff, width, height, 0, 0);                                                                                                                                                                          
 47         Dcmtknif.setMinMaxWindow(dcmff)                                                                                                                                                                                                  
 48         pngimage = Dcmtknif.getPNG(dcmff)                                                                                                                                                                                                
 49         conn                                                                                                                                                                                                                             
 50         |> put_resp_content_type("image/png")                                                                                                                                                                                            
 51         |> send_resp(200, pngimage)                                                                                                                                                                                                      
 52       else                                                                                                                                                                                                                               
 53         conn                                                                                                                                                                                                                             
 54         |> send_resp(404, "File not found")                                                                                                                                                                                              
 55       end                                                                                                                                                                                                                                
 56     end                                                                                                                                                                                                                                  
 57   end

You are overwriting ff_res->_dicomimage however, so you’ll at the very least be leaking memory.

Please share what happens to ff_res->_dicomimage inside writePNGtoMemory, is there a delete ff_res->_dicomimage or similar in there?

Concurrent accesses would do that.

Maybe this is the culpit:

dcmff = Dcmtknif.load(dicomfile)


def load(dicomfile) do                                                                                                                                                                                                                     


DcmFileFormat * loadDicomFile(const char * dicomfile){                                                                                                                                                                                       
  DcmFileFormat *ff = new DcmFileFormat();                                                                                                                                                                                                   
  if (ff->loadFile(dicomfile).good())                                                                                                                                                                                                        
    return ff;                                                                                                                                                                                                                               
  return NULL;                                                                                                                                                                                                                               

ERL_NIF_TERM loadDicomFile_nif(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[])                                                                                                                                                          
  char dicomfile[MAXBUFLEN];                                                                                                                                                                                                                 
  enif_get_string(env, argv[0], dicomfile, sizeof(dicomfile), ERL_NIF_LATIN1);                                                                                                                                                               
  DcmFileFormat *ff = loadDicomFile(dicomfile);                                                                                                                                                                                              
  // Let's allocate the memory for a DcmFileFormat * pointer                                                                                                                                                                                 
  dcmfileformat_t *ff_res = (dcmfileformat_t *)enif_alloc_resource(ff_res_t, sizeof(dcmfileformat_t));                                                                                                                                       
  if (ff_res == NULL){                                                                                                                                                                                                                       
    std::cout << "ff_res = NULL" << std::endl;                                                                                                                                                                                               
  ff_res->_ff = ff;                                                                                                                                                                                                                          
  // We can now make the Erlang term that holds the resource                                                                                                                                                                                 
  ERL_NIF_TERM term = enif_make_resource(env, ff_res);                                                                                                                                                                                       
  // ...and release the resource so that it will be freed when Erlang garbage collects                                                                                                                                                       
  return term;                                                                                                                                                                                                                               

I created this Gist with the C++ side: ssr.cc · GitHub

Thanks, having read the NIF and DCMTK source it’s very likely that this is a thread-safety issue. None of the DCMTK functions you use are documented as thread-safe and I saw too many thread-unsafe things to count while reading their source.

That the problem went away after commenting-out lines 27 through 38 also suggests that this is the case.

Try caching results instead of resources if width/height are mostly static, or if they’re very dynamic, create a new process for each instance and store those in the cache. When a new request comes in, ask the cached process to scale the image.

1 Like

Thanks @jhogberg, yes, one process per instance could be the answer. Now how can I start learning about that?, any pointer?.

You’re welcome :slight_smile:

The gist of it would be:

  • When there’s a cache miss:
    1. Spawn a process that loads the file, then waits for requests to rescale the image (you might want to wait with a timeout to avoid having too many of these processes hanging around).
    2. Store that process’ pid in the cache.
    3. Ask said process to rescale the image.
  • When there’s a cache hit:
    1. Ask the cached process to rescale the image. If that fails (e.g. if we raced with the process dying due to timeout), treat it as a cache miss.

You’ll also want to fix all the memory leaks in your NIF module. At the very least, you need to free the _dicomimage after you’re done. This will be a lot easier if you fuse createScaledImage, setMinMaxWindow and getPNG into a single function.

1 Like

Hi again @jhogberg, before spawning new processes I’m fixing memory leaks and trying to understand when the segmentation fault happens and found it crashes when the same cached resource is requested more than once at the same time.

In one of your answers you said Try caching results instead of resources, by this do you mean using enif_make_binary instead of enif_make_resource?

Yes, that’s consistent with the problems I found in the source.

No, I meant putting finished PNG images into your ImageCache.

Yes I thought about that. The problem is when I need to resize the original image or apply window/level (similar to brightness & contrast) I will be forced to reload the image from disk, and that isn’t efficient.

I also was thinking about creating a binary of the image and pass back and forth between the NIF and the Elixir program, but that also adds overhead, because I must create in-memory copies of the whole image.

It must be a way to only share pointers to the data created by the NIF. The issue with that is that I must be very careful about memory leaks and race conditions.

To quote myself:

To word it differently, spawn a process whose only job is to rescale a certain image, then put the pid of that process into ImageCache. The process would load the image from disk once, and use the same resource to rescale images when requested to do so.

If several requests for the same image come in concurrently, they would all ask the same process to do this, but you wouldn’t need to worry about races since the process only handles one request at a time, and would finish them in the order they came in.

Makes sense?

Yes, of course that makes sense. Now I need to figure out how to do it.