mupuf.org // we are octopi

Suspend2RAM, the Init Process and SSHFS

Or how a system will refuse to suspend because of design flaws

I’ve had a very annoying problem for the last couple of months that I could never find the time to diagnose, until two days ago, when it finally became too much. My computer would sometimes, for no apparent reason, refuse to suspend. More precisely, it would begin suspending and then, after twenty seconds, abort the procedure, breaking all my network connections and making the CPU and fans overwork.

This has been going on for a while, and even though I thought it was linked to a VM I was working on in VirtualBox, I had no clue how to diagnose it. Actually, the problem came from defunct processes trying to read an SSHFS share that was a directory in the VM. When the VM rebooted or was shut down, the SSHFS share became invalid. Having a cp process (and probably others like Thunar and ls) trying to access it was enough to trigger the bug. The process would hang, and killing it would often leave it stuck as defunct (don’t ask me why, I don’t have the faintest idea).

So, how did this prevent Suspend2RAM from suspending my computer? Well, Suspend2RAM asks applications doing I/O to freeze, and it will not suspend if one of the tasks has not answered within 20 seconds. Instead, it writes this message to dmesg:

Freezing of tasks failed after 20.01 seconds (1 tasks refusing to freeze, wq_busy=0): 

This is because the defunct process, which is still considered to be doing I/O, is of course unable to respond to a signal. Now, this bug is annoying from a layman’s point of view, and it’s pretty hard to figure out why such behaviour cannot be avoided. As far as I’m concerned, two design mistakes made this possible:

Processes being attached to Init + zombie child processes

When a process dies in Linux, all of its children are re-attached to the Init process, which is the ancestor of all processes and which should never be killed. And when a child process dies, it becomes a zombie, waiting for its parent process to acknowledge the death (by invoking the wait() system call) before it really disappears.
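The zombie state described above can be observed directly. The following is only a minimal sketch, assuming a Linux system with /proc mounted; the function names `read_state` and `zombie_demo` are made up for this illustration. It forks a child that exits immediately, checks the child's state before the parent calls wait, then checks again afterwards.

```python
import os
import time


def read_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the process is gone."""
    try:
        with open(f"/proc/{pid}/stat", errors="replace") as f:
            data = f.read()
    except (FileNotFoundError, ProcessLookupError):
        return None
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last closing parenthesis.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo():
    """Fork a child that exits at once; return its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: die immediately

    # Parent: deliberately do NOT wait yet, so the dead child stays a zombie.
    deadline = time.monotonic() + 5
    before = read_state(pid)
    while before != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        before = read_state(pid)

    os.waitpid(pid, 0)  # acknowledge the death: the zombie disappears
    after = read_state(pid)
    return before, after


if __name__ == "__main__":
    print(zombie_demo())
```

On a typical Linux system the first value should be `Z` (zombie) and the second `None`, since the process entry vanishes once the parent has waited for it.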

The combination of the two means that when a process loses its parent and gets attached to Init (typically, any bash task still executing when you close the terminal window) and later dies, Init is sent a signal (SIGCHLD) to let it know. If Init then fails to wait for this process, it stays a zombie (which does happen, though normally extremely rarely, and which is not usually considered annoying). Now let’s see how this becomes a problem, and why it should be possible to terminate child processes manually, or to have processes that lose their parents attached to a process that can be killed, i.e. not Init.
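The re-parenting itself is easy to watch. Here is a rough sketch, assuming a Unix-like system; `orphan_demo` is a hypothetical name chosen for this example. A short-lived process starts a child and exits without waiting for it, and the orphan then reports who its new parent is. The new parent is usually PID 1, but that is an assumption about the setup: some systems designate another process to adopt orphans, so the sketch only checks that the parent has changed.

```python
import os
import time


def orphan_demo():
    """Return (pid of the short-lived parent, new parent pid seen by the orphan)."""
    r, w = os.pipe()
    middle = os.fork()
    if middle == 0:
        # Short-lived parent: start a child, then exit without waiting for it.
        os.close(r)
        my_pid = os.getpid()
        if os.fork() == 0:
            # Orphan-to-be: poll until the kernel has handed us to a new parent.
            deadline = time.monotonic() + 5
            while os.getppid() == my_pid and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(w, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)

    # Original process: reap the short-lived parent, then read the orphan's report.
    os.close(w)
    os.waitpid(middle, 0)
    with os.fdopen(r) as pipe:
        new_parent = int(pipe.read())  # read() returns once all writers have exited
    return middle, new_parent


if __name__ == "__main__":
    old, new = orphan_demo()
    print(f"original parent: {old}, parent after it exited: {new}")
```

Once the orphan itself exits, whichever process adopted it is the one that has to call wait() on it, which is exactly the dependency described above.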

Suspend2RAM

Please note: I do not know how Suspend2RAM works. All I’m writing in the next paragraphs is pure speculation.

When Suspend2RAM is triggered, it tries to determine which tasks should not be interrupted in order to prevent data loss or corruption, and it then lets these tasks know about the suspend procedure currently going on. Also, and I believe this is a good default choice, it will not try to force suspend if a task is not able to terminate (or suspend) properly within a given time frame. I am perfectly happy with this behaviour that I find pretty clever.

However, I fail to understand how a defunct process can be assumed to pose any significant risk to the system if interrupted… since it’s already dead. Zombies only walk in movies, and they most certainly do not in Linux systems. I believe my problem with the cp process is that it must have held some sort of lock on another file of the system, misleading Suspend2RAM into waiting for the defunct process. The only way to sort out the problem is to reboot to wipe off the defunct processes, which kind of defeats the purpose of suspending your computer…
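Before rebooting, it can at least help to see which processes are stuck. The sketch below is an assumption-laden diagnostic aid, not part of Suspend2RAM: it scans /proc on Linux and lists processes whose state is `Z` (zombie/defunct) or `D` (uninterruptible sleep, which is where a process blocked on dead I/O often ends up). The names `parse_stat` and `stuck_processes` are invented for this example.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, command name, state)."""
    # The command name sits between the first "(" and the last ")".
    head, tail = text.rsplit(")", 1)
    pid_str, comm = head.split("(", 1)
    return int(pid_str), comm, tail.split()[0]


def stuck_processes(states=("D", "Z"), proc_root="/proc"):
    """Return a sorted list of (pid, command name, state) for processes in the given states."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                pid, comm, state = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process vanished or is not readable
        if state in states:
            found.append((pid, comm, state))
    return sorted(found)


if __name__ == "__main__":
    for pid, comm, state in stuck_processes():
        print(pid, state, comm)
```

Processes pass through `D` briefly during normal I/O, so a single hit means little; an entry that stays in the list across several runs is the more likely culprit.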

So, why am I whining instead of filing bug reports?

Well, you see, I’m a pretty busy lazy guy, but I just wanted my case to be documented somewhere within reach of Google’s indexing robots, just in case somebody ever encounters the same situation. Either making it possible to really get rid of defunct processes (in the most proper way possible, i.e. by having the system release all resources the process was holding), or having Suspend2RAM ignore (or be allowed to ignore) defunct processes would most likely do the trick and fix this problem, though. Now, if you really can’t think of anything better to do this weekend, that would be a great way to contribute to the Linux project. ;)
