How do I troubleshoot a server not responding?

A non-responsive server can be very frustrating for administrators. When a server stops responding, it fails to communicate and interact as expected, which can disrupt services and access for users.

There are a few common causes of a server becoming unresponsive:

  • Network connectivity issues – If the server is unable to connect to the network, it will be unable to send or receive data.
  • Hardware failures – Faulty hardware like a bad motherboard, failed hard drive, or overheating CPU can lead to the server freezing or crashing.
  • Software errors – Bugs, incompatibilities, or infinite loops in code can cause the server OS to freeze.
  • Resource exhaustion – Heavy load, insufficient RAM, or disk space filling up can bring the server to a halt.

This guide will go through a systematic troubleshooting approach to identify the root cause and restore server responsiveness.

Check Network Connectivity

Before diving into the server itself, confirm that network connectivity to the server is working. Simple network issues can look like a “server not responding” situation. Use the ping command to verify basic connectivity to the server. As explained in How to Troubleshoot Network Connectivity Issues (https://obkio.com/blog/how-to-troubleshoot-network-connectivity-issues/), ping sends ICMP echo request packets to the target and listens for ICMP echo reply packets in return. If the requests time out, an intermediate network problem may be blocking connectivity, although some hosts and firewalls are configured to drop ICMP, so a failed ping alone is not conclusive.
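As a minimal sketch of the check described above, the following Python helpers wrap the system ping command and add a TCP port probe for hosts that drop ICMP. The function names are hypothetical, and treating a zero exit code from ping as "reachable" is an approximation that can vary by operating system.

```python
import platform
import socket
import subprocess


def build_ping_command(host, count=3, system=None):
    """Return the ping argument list for the given OS name."""
    system = system or platform.system()
    # Windows ping uses -n for the packet count; Linux and macOS use -c.
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def host_answers_ping(host, count=3):
    """Run ping and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            build_ping_command(host, count),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # ping is missing or did not finish in time
    return result.returncode == 0


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling host_answers_ping("server01") followed by tcp_port_open("server01", 3389) separates "no route to the host" from "host reachable but the service is down". The host name and port here are placeholders.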

In addition to testing with ping, physically inspect all network cabling to and from the server, as advised by Troubleshoot Network Connectivity: The Ultimate Guide for Sysadmins (https://www.comparitech.com/net-admin/troubleshoot-network-connectivity/). Make sure cables are fully plugged in, undamaged, and connected to the correct ports. Reseating cables and power-cycling switches may help reset connections. If network hardware such as a switch or router is faulty, the server can appear unresponsive even though it is still running.

Verify Server Power

One of the first things to check when troubleshooting a non-responsive server is the power supply. Here are some steps to verify the server is receiving power properly:

First, check that the power cable running from the power supply to the wall outlet is firmly connected at both ends. Loose connections can interrupt power flow to the server. Inspect the cable for any frays or damage which could be preventing a solid connection.

Next, check the power supply itself. On Dell PowerEdge servers like the R710, the power supply is a module that can be replaced. Try swapping in a known good power supply if you have one available to test if the issue is with the existing power module. Refer to the server documentation for instructions on hot swapping the PSU.

If the server has multiple power supplies, try unplugging all but one PSU to isolate the faulty module. Test each power supply individually to identify any not powering on the system properly. Replace any faulty power supplies.

You can also try disconnecting all peripherals and drives from the server to reduce the power load during testing. Press the power button and watch for LED indicators on the power supply modules to confirm they are operating normally.

For more detailed troubleshooting tips from Dell, see: https://www.dell.com/support/kbdoc/en-us/000127944/poweredge-psu-how-to-troubleshoot-a-server-power-supply-unit

Check Server Hardware

One of the most common causes of server issues is faulty hardware components. It’s important to check thoroughly for overheating and failing parts. Overheating can lead to unexpected shutdowns and crashes. Inspect fans and heatsinks for dust buildup, and clean them if needed. Also check CPU and GPU temperatures in the BIOS or with hardware monitoring software. Safe limits vary by component, so consult the manufacturer’s specifications; as a rough guide, sustained readings above about 60-70°C warrant a closer look.
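If the server still responds enough to run a script, temperatures can be read programmatically. The sketch below assumes a Linux host that exposes sensors under /sys/class/thermal, where the kernel reports values in millidegrees Celsius. The function names and the 70°C default are illustrative, not vendor guidance.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for temp_file in sorted(Path(base).glob("thermal_zone*/temp")):
        try:
            # The kernel reports these values in millidegrees Celsius.
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        readings[temp_file.parent.name] = millidegrees / 1000.0
    return readings


def zones_over_limit(readings, limit_c=70.0):
    """Return only the zones whose temperature exceeds limit_c."""
    return {zone: temp for zone, temp in readings.items() if temp > limit_c}
```

Passing the result of read_thermal_zones() to zones_over_limit() gives a quick list of sensors worth investigating. On Windows or on hardware without these files, the vendor's monitoring tools are the better source.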

Faulty RAM sticks are another frequent culprit behind server problems. Run a memory test utility such as MemTest86 to check for errors. If issues are found, try reseating or replacing the DIMM modules. Also check for issues with hard drives and storage controllers. Run diagnostic tests and check the SMART status of drives. Replace any defective disks or RAID controllers. Adapter cards can also malfunction – update drivers, reset the device, or swap out the card if trouble persists. Regularly monitoring hardware health stats and logs can help identify failing components early.

Cisco provides extensive troubleshooting guides for diagnosing hardware issues on UCS servers, including checking diagnostic LEDs, testing DIMM memory, and resolving problems with CPUs, drives, and adapters (Cisco UCS Manager Troubleshooting Reference Guide). Rigorously testing all server hardware is crucial before escalating to more complex troubleshooting steps.

Review System Logs

One of the first steps when troubleshooting a non-responsive server is to review the system event logs and error messages. The system logs record various events and errors that occur on the server, which can provide clues into what may be causing the issue.

On Windows servers, key logs to review include the System, Application, and Security event logs (https://www.loggly.com/ultimate-guide/troubleshooting-with-windows-logs/). Check for critical errors, authentication failures, crashes, hardware issues, or other anomalous events around the time the server became unresponsive. On Linux servers, important logs include auth, syslog, kernel, and service-specific logs. Scan for login failures, kernel errors, outages, and application crashes (https://www.loggly.com/ultimate-guide/troubleshooting-with-linux-logs/).

In addition to the logs, check any application or services running on the server for error messages. These may point to specific processes that are failing or resources that are unavailable. Note down any error codes or details to research further. Compare the events and timing against when the server became unresponsive to narrow down potential root causes.

Reviewing the system event logs and error messages provides an overview of the server’s status and where to focus troubleshooting efforts. Matching log entries to the outage time can reveal key events that preceded and likely caused the server to become unresponsive.
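Matching log entries to the outage time can be sketched in a few lines of Python. This example assumes each log line begins with an ISO 8601 timestamp without a timezone offset, followed by a space and the message; real syslog and Windows event exports often use other formats and would need their own parsing. The keyword list and function name are hypothetical.

```python
from datetime import datetime, timedelta

# Illustrative keywords only; adjust them to the messages your systems emit.
KEYWORDS = ("error", "fail", "critical", "panic")


def entries_near_outage(lines, outage_time, window_minutes=15, keywords=KEYWORDS):
    """Return lines logged in the window before the outage that contain a keyword."""
    start = outage_time - timedelta(minutes=window_minutes)
    matches = []
    for line in lines:
        stamp, _, message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # line does not start with an ISO 8601 timestamp
        in_window = start <= when <= outage_time
        if in_window and any(word in message.lower() for word in keywords):
            matches.append(line.rstrip("\n"))
    return matches
```

The lines argument can be an open file object, so a large log is scanned one line at a time instead of being loaded into memory at once.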

Try Restarting Services

One way to troubleshoot a non-responding server is to stop and restart key services. This can help reset any services that may have gotten stuck or failed to start properly.

First, identify the critical services for network connectivity and core server functionality. On Windows servers, these usually include the DNS Client, Network Location Awareness, Remote Procedure Call (RPC), and Windows Event Log services. Open the Services console and sort the services by status to see which ones are stopped or not running.

To restart a service, right-click it and select Restart. If it fails to start, check its startup type: a service set to Disabled cannot be started until it is changed to Manual or Automatic. Stopping and restarting the services in sequence can often get things running again.
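When several servers need the same restart, the manual step can be scripted. The sketch below builds the restart command and only runs it when asked. It assumes PowerShell's Restart-Service cmdlet on Windows and, as an assumption not covered in this guide, systemctl on other systems. The function names are hypothetical, and service names containing quotes are not handled.

```python
import platform
import subprocess


def restart_command(service, system=None):
    """Build the command that restarts `service` on the given OS."""
    system = system or platform.system()
    if system == "Windows":
        # Restart-Service stops and then starts the named service.
        return ["powershell", "-NoProfile", "-Command",
                f"Restart-Service -Name '{service}'"]
    # Assumption: non-Windows hosts are managed by systemd.
    return ["systemctl", "restart", service]


def restart_service(service, dry_run=True):
    """Return the restart command, executing it only when dry_run is False."""
    command = restart_command(service)
    if not dry_run:
        # check=True raises CalledProcessError if the restart fails.
        subprocess.run(command, check=True)
    return command
```

Keeping dry_run as the default lets you review the exact command before it touches a production service. Both commands normally require administrator or root privileges.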

Additionally, check for any disabled or problematic services related to the server roles and features installed. For example, restart the Print Spooler service for print issues, or the Windows Update service if you suspect a bad update. Restarting core services is an easy first troubleshooting step before diving deeper.

Update Drivers and Firmware

Updating drivers and firmware on the server can help resolve issues caused by outdated or buggy versions. Start by checking the manufacturer’s website for the latest versions of drivers and firmware for your specific server model. Major vendors like Dell and HP provide utilities or portals to scan your hardware and identify necessary updates.

For Dell servers, use the Dell Update utilities or Dell Lifecycle Controller to check for firmware and driver updates. HP offers the Service Pack for ProLiant (SPP) and iLO to manage updates. Windows Server also has built-in mechanisms like Windows Update and Device Manager to check for updated drivers.

Before installing updates, carefully review the release notes for fixes, compatibility requirements, and known issues. Test the updates first in a non-production environment when possible. Let firmware flashes complete fully before rebooting or powering off the server, since an interrupted flash can leave a component unusable. After installing, monitor the servers closely for any post-update problems, and roll back updates that introduce instability.

Check for Malware

One of the first troubleshooting steps when a server is unresponsive or acting strangely is to check for viruses and malware. Malicious software like worms or Trojans could be consuming system resources, corrupting files, or causing other issues that lead to poor server performance or downtime. There are a few ways to scan for potential infections on a server:

  • Use antivirus software to perform a full system scan. Many servers run endpoint protection software like Windows Defender or commercial options that can detect malware. Review recent scan reports or initiate a new full scan of all files and memory.
  • Check running processes and services for anything suspicious or unknown using Task Manager. Malware often starts unwanted background processes or services.
  • Review event logs for signs of malware detection or unusual errors that could stem from an infection.
  • Use specialty rootkit detection tools that dig deeper than typical antivirus scans, like RootkitRevealer or Malwarebytes Anti-Rootkit.
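The process review in the list above can be partly automated on Windows. The sketch below parses the output of tasklist /FO CSV /NH and flags image names that are not on an allowlist you maintain. The function names and the allowlist are hypothetical, and an unknown name is only a prompt for investigation, since malware can also run under a legitimate-looking name.

```python
import csv
import io


def parse_tasklist_csv(output):
    """Parse `tasklist /FO CSV /NH` output into (image_name, pid) tuples."""
    processes = []
    for row in csv.reader(io.StringIO(output)):
        # Expected columns: image name, PID, session name, session number, memory.
        if len(row) >= 2 and row[1].isdigit():
            processes.append((row[0], int(row[1])))
    return processes


def unexpected_processes(processes, known_names):
    """Return processes whose image name is not in the allowlist (case-insensitive)."""
    known = {name.lower() for name in known_names}
    return [(name, pid) for name, pid in processes if name.lower() not in known]
```

The text to parse could come from subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout on the affected server.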

Detecting and removing malware on servers requires specialized tools and methods to access protected areas of the system. Work carefully to avoid disruption to critical business services. You may need to boot the server from clean external media to fully scan all files before restoring from backup.

Restore From Backup

One troubleshooting step to try is restoring the server from a recent backup to a last known good state. This can help resolve issues caused by a recent system change or configuration problem. To restore from backup:

1. Identify the most recent viable backup created before the issues occurred. Verify the backup file is not corrupted.

2. Stop any services running on the server to avoid further issues during the restore process.

3. Follow the backup software’s procedures to restore the backup file to the server. Many backup tools like Veeam and Commvault have detailed documentation on the restore process.

4. Once the restore is complete, restart the server. Monitor system logs and functionality to confirm the restore was successful and issues are resolved.
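For step 1, one way to check that a backup file is not corrupted is to compare its checksum with a value recorded when the backup was made. This sketch assumes such a SHA-256 value exists. Many backup tools have their own verification features, which should be preferred where available. The function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large backups do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_hex):
    """Return True if the file's digest equals the recorded checksum."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A mismatch means the file changed after the checksum was recorded, so an earlier backup should be considered instead.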

If restoring from the latest backup does not resolve the problem, you may need to restore progressively earlier backups to pinpoint when the failure was introduced. For more details, refer to the backup software’s support documentation, such as Veeam’s restore guide.

Advanced Troubleshooting

If basic troubleshooting steps do not resolve the issue, more advanced diagnostic utilities and component replacement may be required to troubleshoot the unresponsive server.

Run hardware diagnostics to check for issues like faulty RAM or a bad hard drive. Most server manufacturers provide a bootable diagnostics CD or USB key to test components like the CPU, memory, and storage. For example, Dell provides the Dell Diagnostics tool, and HP offers HP Insight Diagnostics.[1]

Monitor system health metrics using utilities like Dell OpenManage or HP Systems Insight Manager. These can identify hardware issues like high temperatures, fan failures, and power supply problems.[2]

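If diagnostics reveal a faulty component like bad RAM or a failed hard drive, replacement parts may be needed. On enterprise servers, drives, power supplies, and fans are often hot swappable, allowing replacement while the server is still running. Memory and CPUs usually require powering the server down first, so check the server documentation before removing any part.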

For network connectivity issues, inspect the physical cabling and network infrastructure. Replace patch cables or Ethernet switches as needed. Packet analyzers like Wireshark can help examine network communication at the packet level.

In some cases, the operating system or critical system files may need to be repaired or reinstalled. Rebuilding a server from scratch can help eliminate software issues or OS corruption as a cause.