Issue migrated from https://its.cern.ch/jira/browse/ROOT-4055 to GitHub because there is already a PR open to address it (#1053). Having the issue associated with the PR on GitHub as well helps keep the discussion around this already-started work in one place.
It is unsafe to allocate or free memory in a signal handler, or to call any of a long list of library functions, such as popen.
For example, consider the following stuck CMS process:
[root@node247 ~]# pstack 15619
#0 0x0000003199ae027e in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x0000003199a76c45 in _L_lock_57 () from /lib64/libc.so.6
#2 0x0000003199a6f4e3 in ptmalloc_lock_all () from /lib64/libc.so.6
#3 0x0000003199a9a5c6 in fork () from /lib64/libc.so.6
#4 0x0000003199a62e5d in _IO_proc_open@@GLIBC_2.2.5 () from /lib64/libc.so.6
#5 0x0000003199a630b9 in popen@@GLIBC_2.2.5 () from /lib64/libc.so.6
#6 0x00002b5f06845fa4 in TUnixSystem::StackTrace() () from /opt/osg/app/cmssoft/cms/slc5_amd64_gcc434/cms/cmssw/CMSSW_4_2_8/external/slc5_amd64_gcc434/lib/libCore.so
#7 0x00002b5f07c695bb in sig_dostack_then_abort () from /opt/osg/app/cmssoft/cms/slc5_amd64_gcc434/cms/cmssw/CMSSW_4_2_8/lib/slc5_amd64_gcc434/libFWCoreServices.so
#8 <signal handler called>
#9 0x0000003199a7051b in malloc_consolidate () from /lib64/libc.so.6
#10 0x0000003199a72bbc in _int_malloc () from /lib64/libc.so.6
#11 0x0000003199a74e2e in malloc () from /lib64/libc.so.6
#12 0x00002b5f07a5d9fd in operator new(unsigned long) () from /opt/osg/app/cmssoft/cms/slc5_amd64_gcc434/external/gcc/4.3.4-cms/lib64/libstdc++.so.6
#13 0x00002b5f07a5db19 in operator new[](unsigned long) () from /opt/osg/app/cmssoft/cms/slc5_amd64_gcc434/external/gcc/4.3.4-cms/lib64/libstdc++.so.6
Because the signal handler fired inside malloc while the allocator lock was held, the process has deadlocked instead of printing the stack and exiting: popen calls fork, which tries to take the same malloc lock that the interrupted thread already holds. This is an extremely painful situation for jobs running in batch systems - the job continues to idle the slot until the time limit is exhausted, earning the ire of the user and the sysadmin.
A list of async-signal-safe functions is found in "man 7 signal" (on newer systems, "man 7 signal-safety").
TUnixSystem::StackTrace really ought to be rewritten so ROOT can safely call it from a signal handler.
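To illustrate what "safe to call from a signal handler" could look like, here is a minimal sketch. It is not the approach taken in the PR and not ROOT code. It assumes glibc's `backtrace` and `backtrace_symbols_fd` from `<execinfo.h>`; the names `safe_strlen`, `safe_write`, `fatal_signal_handler`, `install_fatal_handlers`, and `kMaxFrames` are hypothetical. Note that `backtrace` is not on the official async-signal-safe list: glibc may load libgcc (and allocate) on the first call, so the sketch makes a warm-up call at startup as a commonly used mitigation.

```cpp
#include <csignal>
#include <cstddef>
#include <cstring>
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc extension)
#include <unistd.h>    // write, _exit, STDERR_FILENO

// Hypothetical frame limit; the buffer is static so the handler never allocates.
static const int kMaxFrames = 64;
static void *g_frames[kMaxFrames];

// Hand-rolled strlen, to avoid relying on any library call inside the handler.
static std::size_t safe_strlen(const char *s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// write(2) is async-signal-safe; stdio functions such as printf are not.
static void safe_write(int fd, const char *msg) {
    ssize_t ignored = write(fd, msg, safe_strlen(msg));
    (void)ignored;
}

// Handler: no malloc, no popen, no fork, no stdio.
extern "C" void fatal_signal_handler(int signum) {
    safe_write(STDERR_FILENO, "Fatal signal received; stack trace follows:\n");
    int n = backtrace(g_frames, kMaxFrames);
    // Writes symbol names directly to the descriptor without calling malloc.
    backtrace_symbols_fd(g_frames, n, STDERR_FILENO);
    // _exit skips atexit handlers and stdio flushing, both unsafe here.
    _exit(128 + signum);
}

// Call once from normal (non-signal) context, early at startup.
static bool install_fatal_handlers() {
    // Warm-up: lets glibc do any one-time loading/allocation now,
    // rather than inside the handler.
    backtrace(g_frames, kMaxFrames);

    struct sigaction sa;
    std::memset(&sa, 0, sizeof sa);
    sa.sa_handler = fatal_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // restore default action once the handler runs

    const int signals[] = {SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT};
    for (int s : signals) {
        if (sigaction(s, &sa, nullptr) != 0) return false;
    }
    return true;
}
```

A program would call `install_fatal_handlers()` near the top of `main`. The output is much cruder than a debugger-generated trace (raw addresses and exported symbol names only), but the handler cannot block on the malloc lock the way the popen-based path in the trace above does.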