Monday 4 May 2015

Handling Segfaults in Python that occur in custom C++ libraries

I am using a rather large C++ library to run some simulations. I have wrapped a function that calls this library and performs a large amount of processing. Very rarely, and somewhat randomly, the C++ library will segfault. Debugging the library is beside the point: the simulation process treats a segmentation fault like a failed test case, and the library's score will simply reflect the error.

Instead, what I want to do is gracefully handle the segfault. The default behaviour in Python is for the process to hang: Python calls into the library, which runs a bunch of processing, and Python waits until the function returns or an exception is thrown. In the case I describe, neither of these events occurs.

Let's start with some basic C++ code:


extern "C" {

    int raise_a_fault(int r)
    {
        std::cout << "r is "<< r << std::endl;
        if (r > 3)
        {
            volatile int *p = reinterpret_cast(0);
            *p = 0x1337D00D; // force a segfault
            // raise(SIGSEGV); // this was not effective enough
        }
        else
        {
            return r;
        }
    }
}  

This C++ library function will force a segmentation fault if r is greater than 3.
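
For reference, I build this into a shared object before loading it from Python. On Linux the build command is something like the following (the source file name foo.cpp is just an assumption for this example):


g++ -shared -fPIC foo.cpp -o libFoo.so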

I am using the ctypes module from Python to link to the C++ library:



from ctypes import cdll
lib = cdll.LoadLibrary('./libFoo.so')

def raise_a_fault(dummy):
    return lib.raise_a_fault(dummy)

items = [1, 2, 3, 4, 5]
results = [raise_a_fault(item) for item in items]

print results

This script will terminate with "Segmentation fault (core dumped)": raise_a_fault() will not return anything and the interpreter will exit. Now, what if I want to run this through a multiprocessing pool so the work can be spread over a number of processes?



from ctypes import cdll
lib = cdll.LoadLibrary('./libFoo.so')
from multiprocessing import Pool

def raise_a_fault(dummy):
    return lib.raise_a_fault(dummy)

p = Pool(1)
items = [1, 2, 3, 4, 5, 4, 3, 2, 1, 10, 2]
results = p.map(raise_a_fault, items)
print results

This will never terminate. The subprocess used in the pool will fault and exit without returning a value, and the map function will wait, patiently, for a return value. For me, the script can't even be ended with Ctrl+C; I have to use Ctrl+Z, which leaves an orphaned process behind.

One supposed solution is to add a signal handler that traps the segmentation fault signal (SIGSEGV):

import signal
from ctypes import cdll
lib = cdll.LoadLibrary('./libFoo.so')
from multiprocessing import Pool

def raise_a_fault(dummy):
    return lib.raise_a_fault(dummy)

def sig_handler(signum, frame):
    print "segfault"
    return None

signal.signal(signal.SIGSEGV, sig_handler)

p = Pool(1)
items = [1, 2, 3, 4, 5, 4, 3, 2, 1, 10, 2]
results = p.map(raise_a_fault, items)
print results


This still has some odd issues. There seems to be a race condition that leaves the processes consuming CPU without making any progress. It might have something to do with this bug. Generally, passing signals to child processes is tricky business in Python: child processes will ignore signals while they are busy in worker functions. I have also seen suggestions to add a timeout to the map call instead; map() itself takes no timeout parameter, so in practice this means the async variant with a timed get() (e.g. timeout=1). This had no effect for me.
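
For reference, the timeout suggestion amounts to something like the following sketch, reusing the raise_a_fault wrapper from above (the one-second timeout is arbitrary):


from multiprocessing import Pool

p = Pool(1)
items = [1, 2, 3, 4, 5, 4, 3, 2, 1, 10, 2]

# map() has no timeout parameter, so use map_async() and a timed get()
async_result = p.map_async(raise_a_fault, items)
try:
    results = async_result.get(timeout=1)  # raises TimeoutError if not done
    print results
except Exception as inst:
    print "map_async gave up: " + str(inst)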

A different solution that I have used before has more promise. It involves replacing map() with apply_async(). In this case none of the signal handling is needed.



import os
from ctypes import cdll
lib = cdll.LoadLibrary('./libFoo.so')
from multiprocessing import Pool

def print_results(result):
    print "callback result: ***************** " + str(result)

def raise_a_fault(dummy):
    print "raise_a_fault, pid " + str(os.getpid())
    try:
        return lib.raise_a_fault(dummy)
    except Exception as inst:
        print "The fault is " + str(inst)
        # ctypes.set_errno(-2)

processes_pool = Pool(2)
print "main, pid " + str(os.getpid())
items = [1, 2, 3, 4, 5, 4, 3, 2, 1, 10, 2]

try:
    for item in items:
        try:
            # waiting on each result keeps the output in the same order
            # as the items in the list
            result = processes_pool.apply_async(raise_a_fault, args=(item,), callback=print_results)
            # if the worker segfaulted, get() raises a TimeoutError whose
            # message is empty, hence the bare "The exception is" below
            result.get(timeout=1)
        except Exception as inst:
            print "The exception is " + str(inst)
            continue
    # join() without close() raises an AssertionError, which shows up as
    # the empty "Out exception" at the end of the output
    processes_pool.join()

except Exception as inst:
    print "The Out exception is " + str(inst)

print "All Done!"

The output:


main, pid 23466
raise_a_fault, pid 23467
r is 1
callback result: ***************** 1
raise_a_fault, pid 23468
r is 2
callback result: ***************** 2
raise_a_fault, pid 23467
r is 3
callback result: ***************** 3
raise_a_fault, pid 23468
r is 4
The exception is
raise_a_fault, pid 23467
r is 5
The exception is
raise_a_fault, pid 23475
r is 4
The exception is
raise_a_fault, pid 23479
r is 3
callback result: ***************** 3
raise_a_fault, pid 23483
r is 2
callback result: ***************** 2
raise_a_fault, pid 23479
r is 1
callback result: ***************** 1
raise_a_fault, pid 23483
r is 10
The exception is
raise_a_fault, pid 23479
r is 2
callback result: ***************** 2
The Out exception is
All Done!



This method is almost great. We get the output we want, but we have to use a timeout. Some people, myself included, don't want to specify a definitive deadline by which the processing must be complete. It is not easy to find a nice solution to this problem; there are some long-standing bugs that make catching segfaults difficult.
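
One way to avoid guessing a timeout is to skip the pool entirely and run each call in a bare multiprocessing.Process, then inspect its exit code; a child killed by a segfault exits with -signal.SIGSEGV. This is only a sketch of the idea (the safe_call wrapper and queue plumbing are mine; lib and items are reused from the listings above):


import signal
from multiprocessing import Process, Queue

def worker(queue, item):
    # if the library segfaults here, nothing is ever put on the queue
    queue.put(lib.raise_a_fault(item))

def safe_call(item):
    queue = Queue()
    child = Process(target=worker, args=(queue, item))
    child.start()
    child.join()  # returns when the child exits, cleanly or not -- no timeout needed
    if child.exitcode != 0:
        # a negative exit code means the child was killed by a signal,
        # e.g. -signal.SIGSEGV for a segmentation fault
        return None
    return queue.get()

results = [safe_call(item) for item in items]


The cost is a new process per call, so this trades throughput for robustness.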

Unfortunately, this story has a tragic end. There does not seem to be good support for trapping SIGSEGV, or many other signals, in Python. I also tried some options in the C++ code. You can change how signals are handled there, but all a handler can effectively do is raise another signal; useful exceptions cannot be thrown from a signal handler because it runs on what is essentially a different stack frame. The solution presented above seems to work the best. It has the added benefit that a crashing worker does not shrink the pool: if processes exited without being replaced, the pool could eventually have no processes available at all.
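
As an aside, the faulthandler module (standard library in Python 3.3+, with a backport on PyPI for Python 2) can at least dump the Python traceback when a segfault occurs; it does not let you recover, but it helps with diagnosis:


import faulthandler
faulthandler.enable()  # on SIGSEGV and friends, dump a traceback to stderr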

The code I used to play around with possible solutions can be found here.

References:
  1. https://docs.python.org/3/library/ctypes.html
  2. http://stackoverflow.com/questions/1717991/throwing-an-exception-from-within-a-signal-handler

Friday 1 May 2015

Knowledge vs Information

I have spent a fair bit of time in the education system, and I feel there is a great difference between knowledge and information. The internet is full of information: tons of small, independent pieces of data. At school you spend most of your time learning rules that you can reuse. Learning becomes the process of gluing your information together using the knowledge you have.


As a graduate student, I feel that I create so much information that I never have enough time to study it all.