What to do if a critical process dies?

What is a critical process?

A critical process is a program or service that the system relies on to keep operating stably. If a critical process unexpectedly terminates or “dies”, it can cause serious problems for the system and its users.

Some examples of critical processes include:

  • Init system (init)
  • Login manager (login)
  • Desktop environment (gnome-session, kdeinit, etc)
  • Window manager (metacity, kwin, etc)
  • Audio server (pulseaudio)
  • X11 server (Xorg)

On Linux and other Unix-like systems, critical processes are often daemon processes that start during boot and run in the background continuously. They may provide essential services to other programs and users.

Why do critical processes die?

There are several potential reasons why a critical process may unexpectedly terminate:

  • The process crashed due to a bug, memory issue, or other fault.
  • The process was forcibly killed, either by the system or manually.
  • There was a dependency issue, e.g. a library or file it relies on became unavailable.
  • The system ran out of some critical resource like memory or disk space.
  • There was a hardware failure or the system lost power.

Finding the exact cause typically requires checking the system log, such as /var/log/syslog or, on systemd-based systems, the journal (journalctl). The log often records the error or condition that caused the process to exit.

How to tell if a critical process died

There are a few main ways to detect that a critical process has unexpectedly terminated or is no longer running:

1. System/service failures

Often the first sign is failures, crashes, or odd behavior in dependent services and applications. For example:

  • The graphical desktop environment stops responding or crashes.
  • Programs can no longer connect to the network or Internet.
  • You are unable to log in or access your desktop after rebooting.
  • Audio fails or makes strange noises before cutting out entirely.

Issues like these suggest that an underlying process such as the window manager, network manager, or audio daemon may have failed.

2. Error messages

The system may display an error notification when a process exits unexpectedly. For example:

  • “kded4 has crashed” on KDE desktops
  • “Sorry, Ubuntu 18.04 has experienced an internal error” on Ubuntu (shown by the Apport crash reporter)
  • “Oh no! Something has gone wrong.” on GNOME and GNOME-derived sessions such as Elementary OS

These types of messages explicitly alert you to a process crashing.

3. Checking running processes

You can manually check which processes are running or not running. For example:

  • ps aux | grep processname – See if the process exists (the grep command itself may also appear in the output)
  • pgrep processname – Returns PID if running
  • pidof processname – Returns PID(s) if running
  • pstree – Display running processes in tree form

If a critical process is missing from the displayed list of running processes, that indicates it is no longer running.
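The manual checks above can be wrapped in a small helper. The following is a minimal sketch, not a definitive tool: the function names is_running and report_missing are made up for illustration, and the /proc fallback assumes Linux. Note that process names longer than 15 characters are truncated by the kernel, so an exact match on a long name may fail.

```shell
#!/bin/sh
# is_running NAME: succeed if a process whose command name is exactly NAME exists.
# Uses pgrep when available; otherwise falls back to scanning /proc (Linux only).
is_running() {
    name=$1
    if command -v pgrep >/dev/null 2>&1; then
        pgrep -x "$name" >/dev/null 2>&1
        return $?
    fi
    for comm_file in /proc/[0-9]*/comm; do
        # A process may exit while we scan, so skip unreadable entries.
        { read -r comm < "$comm_file"; } 2>/dev/null || continue
        [ "$comm" = "$name" ] && return 0
    done
    return 1
}

# report_missing NAME...: print each name that has no running process.
report_missing() {
    for name_to_check in "$@"; do
        is_running "$name_to_check" || echo "not running: $name_to_check"
    done
}
```

For example, report_missing Xorg pulseaudio NetworkManager prints one line per process that is not currently running and prints nothing if all of them are present.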

4. System log files

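Log files under /var/log record status and errors from system services. On Debian and Ubuntu the main log file is /var/log/syslog; Red Hat-family distributions use /var/log/messages, and systemd-based systems also keep a journal that you can read with journalctl. Look for error messages around the time the process failed; they often show why it crashed or was killed.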

Some other log files that may contain relevant error messages:

  • /var/log/kern.log – Kernel logs
  • /var/log/auth.log – Authentication logs
  • /var/log/daemon.log – System daemon logs
  • /var/log/messages – Global system logs
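Searching these files by hand gets tedious, so a small filter can help. This is a rough sketch under stated assumptions: find_failures is a hypothetical helper name, and the keyword list is only a guess at common failure wording, not an exhaustive set.

```shell
#!/bin/sh
# find_failures LOG_FILE PROCESS_NAME
# Print lines that mention PROCESS_NAME and contain a failure keyword.
# The keyword list is an assumption; adjust it for the service in question.
find_failures() {
    log_file=$1
    process_name=$2
    grep -F -- "$process_name" "$log_file" |
        grep -iE 'segfault|core dumped|killed process|out of memory|failed|error'
}
```

A call such as find_failures /var/log/syslog pulseaudio narrows the log to likely crash reports. On systems that log only to the journal, journalctl -u NetworkManager -b -p err gives a comparable view for a single unit.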

How to restart a crashed critical process

If you determine a critical process like dbus or NetworkManager has crashed or been killed, you will want to restart it to restore functionality. Here are some ways:

Use the init system

On modern Linux distributions using systemd, you can use systemctl to restart failed services. For example:


sudo systemctl restart NetworkManager

This restarts the NetworkManager service, which systemd manages.
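If you are not sure whether the service is actually down, you can check before restarting. The sketch below is one possible approach; restart_if_down and the DRY_RUN variable are illustrative names, not part of systemd. It relies only on systemctl is-active, systemctl restart, and systemctl status.

```shell
#!/bin/sh
# restart_if_down UNIT: restart a systemd unit only when it is not active.
# Set DRY_RUN=1 to print the command instead of running it.
# Restarting needs root privileges (for example via sudo).
restart_if_down() {
    unit=$1
    if systemctl is-active --quiet "$unit" 2>/dev/null; then
        echo "$unit is active; nothing to do"
        return 0
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: systemctl restart $unit"
        return 0
    fi
    # Restart, then show the resulting state so failures are visible.
    systemctl restart "$unit" && systemctl --no-pager status "$unit"
}
```

Running it first with DRY_RUN=1 shows what would happen without touching the service.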

Restart the service daemon directly

Many critical processes on Linux are daemons that can be started with an init script or executable. For example, to restart the pulseaudio audio server (run this as your own user, not with sudo):


pulseaudio --start

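Xorg is normally started by a display manager (GDM, LightDM, SDDM) rather than by its own init script, so restart the display manager instead. On most systemd distributions the display-manager alias points at whichever one is installed; otherwise name the service directly. Note that this ends any running graphical session:


sudo systemctl restart display-manager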

Knowing the correct command to restart a crashed process takes some system knowledge.

Reboot the system

Rebooting the system will restart all critical processes that run by default. This ensures you start with a clean slate. If you’re unsure how else to restart the specific failed process, rebooting can often resolve the issue quickly at the expense of some downtime.

How to prevent critical processes from crashing

To help avoid critical system process crashes in the future:

1. Update the system regularly

Installing system updates provides security fixes, bug patches, and stability improvements that reduce the chance of crashes.

2. Check for misbehaving programs

A program that starts utilizing too much CPU, memory, or disk I/O can sometimes interfere with critical processes like the desktop manager. Check for any high resource usage and stop misbehaving programs.

3. Don’t manually kill processes

Unless you know exactly what you’re doing, don’t forcibly kill processes with kill, pkill, or similar utilities. This can lead to instability if you terminate something critical.

4. Check log files for causes

Review log files like /var/log/syslog* when crashes happen to look for error messages indicating the root cause. Address any underlying system issues.

5. Add more memory/swap space

If processes are dying due to the system being out of memory, adding more RAM or swap space can help provide breathing room.
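Before buying RAM, it is worth confirming that the kernel's out-of-memory (OOM) killer was really responsible. The following sketch extracts OOM victims from kernel log text. The function name oom_victims is made up here, and the "Killed process PID (NAME)" wording it matches is an assumption based on common kernel messages; it may differ between kernel versions.

```shell
#!/bin/sh
# oom_victims: read kernel log text on stdin and print "PID NAME" for each
# process the OOM killer terminated. The matched wording is an assumption
# and may vary between kernel versions.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

Typical use is dmesg | oom_victims or journalctl -k | oom_victims. Empty output suggests the crashes have another cause.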

6. Clean up disk space

Lack of disk space can cause critical components to fail. Remove unneeded files and packages to free up room.
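To spot filesystems that are close to full, you can filter the output of df. This is a minimal sketch: full_filesystems is a hypothetical helper name, it expects the POSIX output format of df -P, and it assumes mount points contain no spaces.

```shell
#!/bin/sh
# full_filesystems LIMIT: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent.
# Assumes mount points contain no spaces.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

For example, df -P | full_filesystems 90 lists every filesystem that is at least 90% full.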

7. Test hardware for faults

Faulty memory, bad power supplies, overheating, and other hardware issues can potentially contribute to unstable processes. Test components if crashes seem truly random.

Advanced process management

For advanced users and special situations, there are some other ways to manage and restrict critical processes:

Priority control with nice and renice

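The nice and renice commands set process scheduling priority. A lower nice value means higher priority, and only root can assign negative values. Raising the priority of a critical process helps it stay responsive when the CPU is contended, although it does not prevent crashes.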

Control groups (cgroups)

Cgroups limit and partition resources for groups of processes, e.g. capping the CPU or memory a service may use. This keeps a runaway process from starving the rest of the system.

Systemd process control directives

For processes managed by systemd, directives like PrivateDevices, PrivateTmp, and ProtectSystem sandbox a service by restricting the devices, temporary files, and filesystem paths it can access.
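Such directives usually go into a drop-in file rather than the unit file itself. The sketch below prints an example drop-in. Besides the sandboxing directives, it adds Restart=on-failure, a systemd setting that restarts the service after an abnormal exit. The function name, the 5-second delay, and the choice of ProtectSystem=full are illustrative assumptions, and a service that must write under /usr or /etc would break with that last setting.

```shell
#!/bin/sh
# print_restart_dropin: write an example systemd drop-in to stdout.
# It restarts the service after abnormal exits and applies two sandboxing
# directives. The 5-second delay is an illustrative value only.
print_restart_dropin() {
    cat <<'EOF'
[Service]
Restart=on-failure
RestartSec=5
PrivateTmp=yes
ProtectSystem=full
EOF
}
```

One way to apply it, assuming a hypothetical unit called myservice, is to save the output as /etc/systemd/system/myservice.service.d/override.conf, then run sudo systemctl daemon-reload and restart the service.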

SELinux and AppArmor MAC

Mandatory access control systems like SELinux and AppArmor confine processes and resources, improving security and isolation.

Kernel tuning

Tuning kernel parameters related to processes, memory usage, scheduling, and resources can improve critical process stability.

Docker containers

Containerizing unstable processes like buggy server daemons into Docker can limit the impact of crashes and improve robustness.

Distributed process managers

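Cluster orchestrators such as Kubernetes, and historically CoreOS fleet (which scheduled systemd units across machines), manage critical distributed processes across server clusters for high reliability. They handle crashes by restarting processes automatically. On a single host, systemd can do the same with its Restart= setting.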

Example scenarios

Here are some examples of how to identify and handle different critical process failures:

Xorg server crash

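The graphical desktop environment stops responding. Checking pgrep Xorg shows Xorg is no longer running. /var/log/Xorg.0.log (or ~/.local/share/xorg/Xorg.0.log when Xorg runs without root privileges) contains a “Segmentation fault at address” error indicating Xorg crashed. Restart the display manager, which launches a new Xorg server:

sudo systemctl restart display-manager

This brings back the login screen; any applications from the crashed session are lost.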

PulseAudio failure

Audio suddenly stops working. pgrep and ps do not show the pulseaudio process running anymore. The /var/log/syslog file contains pulseaudio crash report errors. Try restarting the pulseaudio daemon:

pulseaudio --start

If that doesn’t help, reboot the machine to restart all audio services.

Kernel panic

The system suddenly freezes and displays a “Kernel panic” message. This indicates the Linux kernel encountered a fatal error it could not recover from. Reboot the system. Note that dmesg only shows messages from the current boot, and a panic often cannot be written to disk, so check /var/log/syslog or, if the journal is persistent, journalctl -k -b -1 for the last kernel messages from the previous boot. These may point to a cause such as bad hardware or a faulty driver.

Init process died

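Attempting to ssh into the system fails with a “connection refused” error or a timeout. Physically checking the machine shows either a “Kernel panic - not syncing: Attempted to kill init!” message or an emergency mode prompt with no login prompt. The first means that init, the first process launched by the kernel at boot, has exited, and the kernel cannot continue without it. The second means init is running but could not finish booting, often because a filesystem failed to mount. Reboot the machine and check /var/log/syslog or the journal for init errors. If the issue persists, it may require reinstalling core system files to restore init.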

Diagnosing from a frozen system

If the system fully freezes and becomes unresponsive, there are a few things you can try without rebooting:

  • Check if you can still SSH into the machine from another system. If so, inspect running processes remotely.
  • Switch to a different virtual terminal with Ctrl+Alt+F2 to see if you can log in from there.
  • Check the system monitor application (top, htop, gnome-system-monitor) for anomalies.
  • Do a remote SSH login and run uptime – a very high load average indicates resource starvation issues.
  • Do a remote SSH poweroff or reboot if unable to diagnose or recover the frozen system.

These steps may provide clues about misbehaving processes contributing to the freeze while the system's state is still intact.
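What counts as a "very high" load average depends on the number of CPUs. As a rough sketch, the helper below compares the 1-minute load average with the CPU count. The name load_exceeds_cpus is made up for illustration, and the load-greater-than-CPUs rule is only a rule of thumb, since heavy disk I/O also raises the load average.

```shell
#!/bin/sh
# load_exceeds_cpus LOAD CPUS: succeed if the 1-minute load average is
# greater than the number of CPUs, a rough sign of resource starvation.
load_exceeds_cpus() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load + 0 > cpus + 0) }'
}
```

On Linux, the two values can be read from the machine itself, for example load_exceeds_cpus "$(cut -d ' ' -f 1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)" && echo overloaded.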

Recovering data from sudden crashes

If the entire system crashes instantly due to a kernel panic, power loss, or hardware issue, there is a risk of data loss or filesystem corruption. Once you have restored the system to a bootable state, try to recover any lost or damaged data. Potential approaches include:

  • Unmount the affected filesystem (or boot from rescue media) and run fsck to check and repair corruption. Do not repair a filesystem that is mounted read-write.
  • Boot from a rescue CD/USB and investigate filesystem integrity.
  • Restore data from backups if available.
  • Use recovery tools like testdisk or photorec to recover lost files.
  • Repair a corrupted superblock from one of the filesystem's backup copies (for ext2/3/4, e2fsck -b with a backup superblock number).
  • Revert corrupted files from source control or revision history.

Sudden failures underscore the importance of regular, versioned backups to limit data loss. For increased fault tolerance, use networked storage and redundant disks (RAID).

Conclusion

Critical process crashes are a fact of computing life. By understanding which background daemons and services support normal system operation, you can quickly identify failures when they happen. With the right tools and system logs, diligent troubleshooting gets things restarted and limits downtime. Keeping your system updated, running clean, and managing resources effectively reduces instability in those crucial under-the-hood processes.