Why is database recovery pending?

There are a few key reasons why a database recovery may get stuck in a pending state and not complete successfully. Here are some of the most common causes of pending database recovery and how to troubleshoot them:

Unstable Hardware

Hardware instability is one of the most common causes of a stuck database recovery. Things like CPU, memory, storage, and network issues can all contribute to a recovery hanging indefinitely. Some signs of hardware problems include:

High CPU usage/throttling
Insufficient memory
Storage latency or disconnects
Network drops or high latency

Troubleshooting hardware stability issues involves checking for overutilization, errors, disconnects, and other signs of strain on your servers. Monitoring tools can help identify constrained resources. You may need to upgrade or replace failing hardware components.

Storage Corruption

If critical database files become corrupted or inaccessible, the recovery process can get stuck. Storage corruption is often tied to underlying hardware issues, but software bugs or crashes can also be a cause. Symptoms of storage-related recovery failures include:

I/O errors when accessing data or log files
Inability to access critical database files
Checksum failures on pages/blocks
Media errors reported at the storage level

Addressing storage corruption involves identifying the corrupted files and replacing them from backup. You may need to revert to an earlier backup if the corruption is widespread. Scrubbing storage devices to detect and correct errors may also help.
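
As a rough illustration of how page-level checksum failures are detected, the sketch below compares a CRC32 computed for each page against a previously recorded value. The 8192-byte page size, the `find_bad_pages` helper, and the separate list of recorded checksums are assumptions made for this example; real engines store checksums inside the page and use their own algorithms.

```python
import zlib

PAGE_SIZE = 8192  # assumed page size for this example; real engines vary


def find_bad_pages(data, expected_crcs, page_size=PAGE_SIZE):
    """Return the indexes of pages whose CRC32 differs from the recorded value."""
    bad = []
    for index, expected in enumerate(expected_crcs):
        page = data[index * page_size:(index + 1) * page_size]
        if zlib.crc32(page) != expected:
            bad.append(index)
    return bad


# Build a three-page file image and record its checksums.
pages = [bytes([n]) * PAGE_SIZE for n in range(3)]
checksums = [zlib.crc32(p) for p in pages]

# Flip one byte inside page 1 to simulate corruption.
damaged = bytearray(b"".join(pages))
damaged[PAGE_SIZE + 10] ^= 0xFF

print(find_bad_pages(bytes(damaged), checksums))  # [1]
```

Knowing which pages fail tells you whether a targeted restore is feasible or whether the damage is wide enough to justify reverting to an earlier backup.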

Conflicting or Incomplete Transactions

Database recovery rolls forward the committed transactions recorded in the transaction logs and rolls back any that were left incomplete. The process stalls when those incomplete or uncommitted transactions cannot be resolved, for example because the log records they depend on are missing or inconsistent. These problems usually stem from a non-graceful shutdown, such as a power loss or crash. Some typical symptoms include:

Repeated rollback attempts
Stuck rolling forward or rolling back transactions
Orphaned prepared transactions
Inconsistent transaction log records

Fixing transaction-related recovery problems requires careful examination of the logs to identify and resolve any incomplete transactions. Manual intervention may be needed to commit or roll back stuck transactions. Point-in-time recovery to before the issue may be necessary.
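
To make the log examination concrete, here is a minimal sketch that scans a simplified log and reports which transactions finished and which were left open. The `(transaction_id, action)` record format and the `classify_transactions` helper are hypothetical; real log formats are engine-specific and far richer.

```python
def classify_transactions(log_records):
    """Return (committed, incomplete) transaction IDs from a simplified log.

    A transaction is incomplete if the log holds neither a COMMIT nor an
    ABORT record for it, which is what a crash typically leaves behind.
    """
    seen, committed, aborted = [], set(), set()
    for txn_id, action in log_records:
        if txn_id not in seen:
            seen.append(txn_id)
        if action == "COMMIT":
            committed.add(txn_id)
        elif action == "ABORT":
            aborted.add(txn_id)
    incomplete = [t for t in seen if t not in committed and t not in aborted]
    return sorted(committed), incomplete


log = [
    ("T1", "BEGIN"), ("T1", "UPDATE"),
    ("T2", "BEGIN"),
    ("T1", "COMMIT"),
    ("T2", "UPDATE"),            # T2 never reaches COMMIT: crash victim
    ("T3", "BEGIN"), ("T3", "ABORT"),
]

print(classify_transactions(log))  # (['T1'], ['T2'])
```

The incomplete list is the set of transactions that recovery still has to roll back, and therefore the first place to look when rollback keeps repeating.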

System Resource Exhaustion

A shortage of memory, threads, disk space, or network bandwidth can cause database recovery processes to hang or fail. If the server is too constrained, recovery cannot complete. Watch for:

Consistently high CPU, memory, or I/O usage
Database or transaction log disk space filling up
Memory errors or swapping/paging
Thread deadlocks or resource waits

Add more resources like RAM, CPUs, storage capacity, or faster networks to resolve resource starvation. Tune configuration settings that limit resources used by recovery processes like memory caches or parallel threads.
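
A simple pre-flight check for one of these resources, disk space, might look like the sketch below. The 10% free-space threshold and the helper names are assumptions for illustration; pick a threshold that reflects how much log and data growth your recovery is expected to generate.

```python
import shutil


def free_space_ratio(path):
    """Fraction of the volume holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def has_headroom(path, min_free_ratio=0.10):
    """True if the volume has at least `min_free_ratio` of its space free."""
    return free_space_ratio(path) >= min_free_ratio


# "." stands in for the data or transaction log directory.
print(f"free: {free_space_ratio('.'):.1%}, enough headroom: {has_headroom('.')}")
```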

Blocking Locks and Latches

Locks and latches held by active sessions or queries can prevent a recovery in progress from making changes. Types of blocking include:

Locks on database objects
Log file latches
I/O requests waiting on busy files
Locks taken for crash recovery

Identify any long-running transactions, sessions, or queries that are holding needed locks. Kill blocking processes or roll back problematic transactions if possible. Disable extraneous applications and services during recovery to reduce contention.
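
When several sessions are queued behind one another, it helps to find the sessions at the head of each chain before killing anything. The sketch below assumes you have already extracted a mapping of waiting session to blocking session from your database's lock or session views; the mapping format, the session IDs, and the `root_blockers` helper are hypothetical.

```python
def root_blockers(waits_on):
    """Return sessions that block others but are not themselves waiting.

    `waits_on` maps a waiting session ID to the session ID it waits for.
    """
    blockers = set(waits_on.values())
    return sorted(b for b in blockers if b not in waits_on)


# 53 waits on 52, which waits on 51; 60 also waits on 51; 71 waits on 70.
waits = {52: 51, 53: 52, 60: 51, 71: 70}

print(root_blockers(waits))  # [51, 70]
```

Resolving the root blockers usually releases the whole chain. An empty result with a non-empty mapping means every blocker is itself waiting, which points to a deadlock rather than a single long-running session.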

Startup Errors

Problems that occur when initially starting database services after a failure can also lead to stuck recovery routines. For example:

Errors mounting database files
Crash recovery failures
Inaccessible transaction logs
System table corruption

Review logs for initialization errors and failed health checks. Resolve startup failures, for example by repairing files or restoring data, before attempting recovery again. The database may need reinstalling or recreating if system table errors occur.

Apply and Undo Problems

During recovery, changes recorded in the transaction logs are applied to database files, and any uncommitted transactions are rolled back or undone. Failures in this apply and undo phase can cause issues:

Errors applying logged changes
Crashing during rollback
Undo records not found
Timeout rolling back large transactions

Enable detailed recovery logging to diagnose apply/undo errors. Resolve any discrepancies found between the data files and the logs. Where the engine uses them, size rollback segments or the UNDO tablespace to accommodate longer-running transactions. Deal first with the transactions that are blocking progress.
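
The apply and undo phases can be illustrated with a toy model: replay every logged change in order, then reverse the changes made by transactions that never committed, newest first. This is a deliberately simplified sketch, not any vendor's algorithm; the record layout `(txn, "UPDATE", key, old_value, new_value)` and the in-memory `pages` dictionary are assumptions for the example.

```python
def recover(pages, log):
    """Redo all logged updates, then undo those of uncommitted transactions."""
    committed = {txn for txn, kind, *_ in log if kind == "COMMIT"}

    # Apply (redo) phase: repeat history in log order.
    for record in log:
        if record[1] == "UPDATE":
            _, _, key, _, new_value = record
            pages[key] = new_value

    # Undo phase: walk the log backwards, restoring old values for
    # updates that belong to transactions without a COMMIT record.
    for record in reversed(log):
        if record[1] == "UPDATE" and record[0] not in committed:
            _, _, key, old_value, _ = record
            pages[key] = old_value

    return pages


log = [
    ("T1", "UPDATE", "a", 1, 10),
    ("T2", "UPDATE", "b", 2, 20),
    ("T1", "COMMIT"),
    ("T2", "UPDATE", "a", 10, 99),  # T2 never commits
]

print(recover({"a": 1, "b": 2}, log))  # {'a': 10, 'b': 2}
```

The undo phase depends on every old value being present in the log, which is why missing undo records halt recovery, and its cost grows with the size of the uncommitted work, which is why large transactions can time out while rolling back.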

Media Recovery Failures

Media recovery is the process of restoring database files from backup so that logged changes can then be applied to them. Media recovery problems like the following will lead to recovery stalling:

Backup files unavailable or corrupted
Not enough space to restore files
Incomplete or inconsistent backups
Network errors or timeouts

Use reliable backup storage and test restores periodically. Monitor space for your backups. Maintain a valid backup strategy with frequent full and incremental backups. Resolve network issues between database and backup servers.
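
One inexpensive part of testing restores is confirming that every backup file is present and unaltered before the restore starts. The sketch below assumes a manifest of SHA-256 digests was recorded when the backup was taken; the manifest format, the file names, and the helper functions are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(manifest):
    """Return (path, problem) pairs for files that are missing or altered."""
    problems = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            problems.append((path, "missing"))
        elif sha256_of(path) != expected:
            problems.append((path, "checksum mismatch"))
    return problems


with tempfile.TemporaryDirectory() as backup_dir:
    good = str(Path(backup_dir) / "full_backup.bak")
    Path(good).write_bytes(b"backup contents")
    manifest = {
        good: sha256_of(good),
        str(Path(backup_dir) / "incremental_01.bak"): "0" * 64,  # never written
    }
    for path, problem in verify_backup(manifest):
        print(Path(path).name, problem)  # incremental_01.bak missing
```

A clean result only shows that the files match what was written; it does not replace a periodic full test restore.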

Upgrade Issues

Recovering a database after a failed or interrupted upgrade can also pose challenges. For example:

Rollback failures after upgrade
Incompatibilities between versions
Corruption from migration errors
Feature differences between releases

Thoroughly test upgrades and rollbacks in Dev/QA to avoid production issues. Consult release notes for version differences. Scrutinize logs for migration-related errors and revert from backup if needed.

Log Transport Problems

Databases using log transport to send transaction logs to a standby system can have recovery problems if the transport fails or has issues:

Network connectivity problems
Log sender/receiver errors
Standby missing logs
Log I/O latency

Monitor and resolve networking problems between primary and standby. Tune log transport settings for performance. Verify, for example with sequence numbers and checksums, that the standby has received every log from the primary. Fall back to native logging if needed.
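
A standby that is missing logs cannot apply anything past the first gap, so gap detection is worth automating. The sketch below assumes each shipped log carries a consecutive integer sequence number; the numbers and the `missing_sequences` helper are hypothetical, and most databases expose their own views or commands for this check.

```python
def missing_sequences(received, first, last):
    """Return the sequence numbers in [first, last] absent from `received`."""
    have = set(received)
    return [n for n in range(first, last + 1) if n not in have]


# The primary has generated logs 101 through 107.
received_on_standby = [101, 102, 104, 107]

print(missing_sequences(received_on_standby, 101, 107))  # [103, 105, 106]
```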

Security and Permission Issues

Recovery processes may fail to start or complete if security policies are misconfigured:

Service accounts lack proper privileges
File/folder permissions prevent access
Audit settings blocking actions
Firewall blocking network access

Review permissions and grant recovery processes the access they need. Disable unnecessary auditing during recovery. Whitelist IP addresses if firewall rules are blocking access.
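
File-level access can be checked before recovery is started. The following sketch reports paths that do not exist or that the current account cannot both read and write; the `access_problems` helper is hypothetical. Because `os.access` answers for the account running the script, the check is only meaningful when run as the database service account.

```python
import os
import tempfile


def access_problems(paths):
    """Map each problematic path to a short description of what is wrong."""
    problems = {}
    for path in paths:
        if not os.path.exists(path):
            problems[path] = "not found"
        elif not os.access(path, os.R_OK | os.W_OK):
            problems[path] = "no read/write access"
    return problems


# A temporary file stands in for a data file; the second path does not exist.
with tempfile.NamedTemporaryFile() as data_file:
    missing = data_file.name + ".missing"
    print(access_problems([data_file.name, missing]))  # reports only the missing path
```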

Manual Errors

Admin mistakes made during the recovery process can also lead to stalled or failed jobs:

Point-in-time recovery target is incorrect
Important steps skipped
Critical files or backups missed
Improper shutdown/startup sequence

Follow the database vendor's recovery instructions closely. Double-check the process and its inputs. Take your time and don’t skip steps. Ask for assistance if unsure of any recovery details.
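
An incorrect point-in-time target is one input that can be validated mechanically. As a minimal sketch, assuming you know when the base backup finished and the end time of the last available log, a target is only reachable if it falls between the two. The timestamps and the `target_is_recoverable` helper are illustrative.

```python
from datetime import datetime


def target_is_recoverable(target, backup_finished, last_log_end):
    """True if the target lies within the window covered by backup plus logs."""
    return backup_finished <= target <= last_log_end


backup_finished = datetime(2024, 3, 1, 2, 0)
last_log_end = datetime(2024, 3, 1, 14, 30)

print(target_is_recoverable(datetime(2024, 3, 1, 9, 15), backup_finished, last_log_end))  # True
print(target_is_recoverable(datetime(2024, 3, 1, 15, 0), backup_finished, last_log_end))  # False
```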

Unsupported Recovery Options

Attempting to use unsupported recovery methods or options can also cause issues:

Third-party tools not certified for the database
Features labeled deprecated or end-of-life
Unsupported crossover/downgrade versions
Recovering to different hardware

Stick to vendor-supported recovery tools and techniques. Favor in-place recoveries when possible. Don’t change server hardware or downgrade versions unless explicitly certified.

Unoptimized Configuration

Certain configuration settings can slow down recovery processes or lead to problems:

Undersized memory caches or buffers
Too few or oversubscribed CPU cores
Aggressive resource throttling
Log files on slow storage

Tune configurations to ensure adequate resources for recoveries. Size memory appropriately. Assign sufficient cores/sockets. Place transaction logs on fast storage. Stress test on non-production first.
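
To see why log placement matters, a back-of-the-envelope estimate is often enough: the time to replay logs is roughly the volume of log to apply divided by the sustained apply rate. The figures below are invented for illustration, and the model ignores rollback time and any parallelism.

```python
def estimated_redo_seconds(log_bytes, apply_bytes_per_second):
    """Rough replay time: log volume divided by sustained apply throughput."""
    if apply_bytes_per_second <= 0:
        raise ValueError("apply rate must be positive")
    return log_bytes / apply_bytes_per_second


GIB = 1024 ** 3
MIB = 1024 ** 2

# 20 GiB of log to replay, on slow versus fast log storage.
print(estimated_redo_seconds(20 * GIB, 50 * MIB))   # 409.6
print(estimated_redo_seconds(20 * GIB, 200 * MIB))  # 102.4
```

Measuring the real apply rate during a non-production stress test gives you a figure to plug in, and a way to tell a slow recovery from a stuck one.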

Service and Process Failures

Stopped, crashed, or misbehaving database services and processes will stall recovery:

Database engine won’t start up or initialize
Recovery manager not starting
Crashes or assertion failures
Stuck on loading files or redo logs

Inspect process status and logs to identify failures. Restart stopped or crashed services. Apply corrective actions like patches for any known issues. Enable debugging or tracing for more details on process problems.

Limitations of Recovery Manager

Design limitations in the database recovery manager can also sometimes lead to problems:

Limited concurrency and parallelism
Poor scaling at higher database sizes
Slow redo log application
Exhaustion of rollback segments

Engage vendor support for suspected recovery manager bottlenecks. Request guidance on optimizing for large databases. Upgrade recovery manager module if better scaling is available in later releases.

Conclusion

While database recovery issues can stem from many different causes, the common theme is identifying and resolving the underlying problem blocking the recovery process from completing. Careful diagnosis of symptoms along with a solid methodology for observing, gathering data, and testing theories is key.

Treating recovery problems like any other performance investigation and avoiding shortcuts or assumptions is critical. Proper monitoring, logging, and tools for deep visibility into the recovery process help troubleshoot the root cause.

Database vendors also provide extensive guidance on optimized recovery configurations, troubleshooting steps, and best practices that take much of the guesswork out of the process. Leveraging all available documentation and assistance to make the recovery process go smoother is highly recommended.

With robust backup and recovery procedures in place ahead of time, thorough issue analysis, and persistence in eliminating roadblocks, even stubborn database recovery failures can usually be overcome and resolved, preventing potential data loss scenarios.

Recovery Failure Type	Common Causes	Troubleshooting Steps
Hardware instability	High utilization, errors, component failures	Monitor usage, check logs, test components, upgrade hardware
Storage corruption	Hardware issues, software bugs, crashes	Scan and replace corrupted files, revert to backup
Transaction issues	Crashes, incomplete transactions	Examine logs, manually resolve transactions
Resource exhaustion	Memory, CPU, disk space depleted	Scale up resources, tune configurations
Blocking locks	Long queries, concurrency issues	Identify and resolve blocking queries
Startup failures	File access issues, crashes	Repair corruptions, restore files
Apply/undo failures	File discrepancies, long rollbacks	Prioritize transactions, extend rollback segments
Media recovery issues	Backup problems, incomplete data	Maintain backup systems, monitor closely
Upgrade problems	Compatibility issues, migration corruption	Test rigorously, check release notes