The scenario: your server is misbehaving and cannot start new processes. Even simple commands like "ls" fail:
yourname@yourserver:~$ ls
-bash: fork: retry: Resource temporarily unavailable
This is a symptom of the kernel reaching its maximum number of processes. By default the limit is 32768; on 64-bit systems it can be raised to as much as 4194304 (2^22). The current setting is available in /proc/sys/kernel/pid_max.
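Because cat(1) would itself need a process slot, the limit is best read with shell builtins. The following is a minimal sketch, assuming a Linux /proc file system and a POSIX-style shell; the variable names are illustrative only. It also counts the numeric entries in /proc, which covers processes but not their individual threads, so treat the count as a rough indication of how close the system is to the limit.

```shell
# Read the limit without forking: 'read' and 'echo' are shell builtins.
read -r pid_max < /proc/sys/kernel/pid_max
echo "pid_max: $pid_max"

# Count the numeric directories in /proc (one per process, threads not included).
count=0
for d in /proc/[0-9]*; do
    count=$((count + 1))
done
echo "processes: $count"
```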
On a normal system there will be up to a couple of thousand processes running. Reaching 32000 processes is extremely rare, and almost certainly a clear sign of a problem. (I would wager that it is the result of somebody's shell script going recursive, or something similar.)
So how do you get the system back to a working state? You will need to stop some processes, and to do that you need to list the processes first. But that will fail too!
yourname@yourserver:~$ ps -ef
-bash: fork: retry: Resource temporarily unavailable
To fix this, you will need to become root. But you cannot create a new ssh session to the system (yes, because it is out of process slots). And even sudo will fail:
yourname@yourserver:~$ sudo -i
-bash: fork: retry: Resource temporarily unavailable
But there is a way: although you cannot create new processes, you can replace your shell with sudo using exec:
yourname@yourserver:~$ exec sudo -i
[sudo] password for yourname: *******
root@yourserver:~#
NOTE: If you get your password wrong, your sudo attempt will be rejected, and sudo will exit. And you will be instantly logged out (because your shell no longer exists). So don't do that!
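The reason exec succeeds where a plain sudo fails is that exec replaces the running shell with the new program instead of forking a child, so no extra process slot is needed. A small demonstration, best tried on a healthy system: it runs in a throwaway child shell (so your own shell survives) and should print the same PID twice, once before and once after the exec.

```shell
# Run the demo in a child 'sh' so the exec does not replace the shell you are typing in.
# The inner 'exec' swaps that child shell for another 'sh' without creating a new process.
pids=$(sh -c 'echo $$; exec sh -c "echo \$\$"')
echo "$pids"
```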
If the misbehaving process is still running, increasing the maximum number of processes would be futile, as the rogue process would quickly spawn off more processes to reach the new limit. Nobody's typing would be quick enough here...
Killing any single one of the rogue processes will free up a process slot, but that slot will quickly be taken by a new rogue process. So this is futile too.
Even gracefully rebooting the system is unlikely to succeed: To reboot the system, you need to run the shutdown scripts. And the shutdown scripts cannot run since we are out of process slots!
One obvious option is to resort to a non-graceful reboot and then hope that the rogue processes are not caused by something that starts upon boot:
root@yourserver:~# reboot --force --force
But this is a rather crude option: rebooting a system is frowned upon, especially if people are using it and have not noticed anything wrong. And rebooting this way increases the risk of data corruption to uncomfortable levels.
It is possible to recover the system to a working state without resorting to desperate measures. But it is tricky, as we have to work within some rather severe constraints:
- We cannot execute any commands that require new processes. This limits us to using shell built-in commands.
- The rogue processes will re-spawn processes as soon as one is killed. So killing them one at a time is not an option.
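To check in advance which commands are safe to rely on, the shell's own type builtin reports whether a name is a builtin or an external program. A short sketch; the list of commands is simply the set used in the loops in this article:

```shell
# 'type' is itself a builtin, so this loop needs no new processes.
for cmd in cd read printf test kill echo; do
    type "$cmd"
done
```

Anything reported as a path, such as /bin/ls, needs a fork and will fail while the process table is full.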
The first problem is listing processes. This can be accomplished using only shell builtins and the /proc file system:
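cd /proc
for pid in [0-9]*; do
    # Skip entries whose executable cannot be resolved (e.g. kernel threads)
    # and processes that exit while we are reading them.
    if test -r "$pid/exe" &&
       read pid comm state ppid pgrp session tty_nr tpgid junk < "$pid/stat" &&
       read loginuid < "$pid/loginuid"
    then
        # Columns: PID PPID COMM LOGINUID PGRP SESSION TTY_NR TPGID
        printf '%5d %5d %-25s %-10s %5d %5d %5d %5d\n' "$pid" "$ppid" "$comm" "$loginuid" "$pgrp" "$session" "$tty_nr" "$tpgid"
    fi
done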
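This gives a somewhat basic equivalent of ps(1) and should allow us to spot the name of the rogue processes. (Process names that contain spaces will confuse the simple field splitting, so treat such rows with suspicion.)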
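With tens of thousands of lines scrolling past, it may help to confirm a suspect by counting how many processes carry a given name. The helper below is a sketch using only builtins; count_by_name is a made-up name, and it compares against the second field of /proc/PID/stat, so the name must be given with its parentheses and names containing spaces will not match.

```shell
# count_by_name NAME: print how many processes have NAME as the second
# field of /proc/PID/stat. NAME must include the parentheses, e.g. '(bash)'.
count_by_name() {
    matches=0
    for stat in /proc/[0-9]*/stat; do
        # A process may exit between the glob expansion and the read; skip it.
        { read -r pid name rest < "$stat"; } 2>/dev/null || continue
        [ "$name" = "$1" ] && matches=$((matches + 1))
    done
    echo "$matches"
}

# Example: count the processes that share the current shell's name.
read -r pid self rest < "/proc/$$/stat"
count_by_name "$self"
```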
The next step is to freeze the rogue processes by sending the STOP signal to them. This will prevent them from spawning new child processes once process slots become available. Replace "(rogue-process)" with the name of the process, keeping the parentheses, since /proc/PID/stat reports the name wrapped in them:
cd /proc
# Repeat the pass a few times to catch processes spawned while an earlier pass ran.
for x in 1 2 3 4 5; do
    for pid in [0-9]*; do
        test -r "$pid/exe" &&
        read pid name state ppid junk < "$pid/stat" &&
        [ "$name" = '(rogue-process)' ] &&
        kill -STOP "$pid"
    done
done
Repeat the above multiple times if you find multiple binaries being responsible: you need to make sure you get them all.
Make sure that you do not freeze the processes you are using - i.e. your shell, ssh daemon etc. If you do that, you are dead in the water...
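One way to guard against that mistake is to collect the PIDs of your own shell and all of its ancestors, and refuse to signal any of them. The following is only a sketch: collect_protected and is_protected are hypothetical helper names, and it assumes the Linux /proc/PID/stat layout in which the parent PID is the second field after the parenthesised name.

```shell
# Build a space-delimited list of PIDs that must never be signalled:
# PID 1, this shell, and every ancestor of this shell (e.g. sshd).
collect_protected() {
    protected=" 1 "
    p=$$
    while [ "$p" -gt 1 ]; do
        protected="$protected$p "
        read -r line < "/proc/$p/stat" || break
        rest=${line##*) }   # strip "PID (name) ", even if the name contains spaces
        set -- $rest        # now $1 = state, $2 = parent PID
        p=$2
    done
}

# is_protected PID: succeed if PID is in the protected list.
is_protected() {
    case "$protected" in
        *" $1 "*) return 0 ;;
    esac
    return 1
}

collect_protected
echo "protected PIDs:$protected"
```

In the freeze loop, adding a line reading `! is_protected "$pid" &&` just before the kill would then skip those PIDs.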
Once the rogue processes have been frozen, we can get away with killing them one at a time as they cannot respawn stuff (again: Replace "(rogue-process)" with the name of the rogue process):
cd /proc
for pid in [0-9]*; do
    test -r "$pid/exe" &&
    read pid name state ppid junk < "$pid/stat" &&
    [ "$name" = '(rogue-process)' ] &&
    kill -KILL "$pid"
done
And you should be back to normal!
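The freeze-then-kill sequence can be rehearsed on a healthy system with a harmless stand-in for the rogue process. This sketch uses sleep(1) as the stand-in, which is an arbitrary choice; it needs a free process slot, so it is for practice only, not for use during the incident.

```shell
# Start a harmless stand-in for a rogue process.
sleep 300 &
pid=$!

kill -STOP "$pid"      # freeze it: it can no longer run or spawn children
kill -KILL "$pid"      # SIGKILL is still delivered to a stopped process

# Reap the child; a process killed by signal 9 reports status 128 + 9.
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```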