A database recovery can get stuck in a pending state and fail to complete for many reasons. Here are the most common causes of pending database recovery and how to troubleshoot each:
Unstable Hardware
Hardware instability is one of the most common causes of a stuck database recovery. Things like CPU, memory, storage, and network issues can all contribute to a recovery hanging indefinitely. Some signs of hardware problems include:
- High CPU usage/throttling
- Insufficient memory
- Storage latency or disconnects
- Network drops or high latency
Troubleshooting hardware stability issues involves checking for overutilization, errors, disconnects, and other signs of strain on your servers. Monitoring tools can help identify constrained resources. You may need to upgrade or replace failing hardware components.
Storage Corruption
If critical database files become corrupted or inaccessible, the recovery process can get stuck. Storage corruption is often tied to underlying hardware issues, but software bugs or crashes can also be a cause. Symptoms of storage-related recovery failures include:
- I/O errors when accessing data or log files
- Inability to access critical database files
- Checksum failures on pages/blocks
- Media errors reported at the storage level
Addressing storage corruption involves identifying and replacing any corrupted files from backup. You may need to revert to an earlier backup if the corruption is widespread. Scrubbing storage devices to detect and correct errors may also help.
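As a rough illustration of how a page-level checksum scan can locate damaged blocks, here is a minimal Python sketch. The page layout is an assumption made for the example (4096-byte pages with a CRC32 in the last four bytes), not any real engine's on-disk format, and `make_page` and `find_corrupt_pages` are hypothetical helper names.

```python
import zlib

PAGE_SIZE = 4096      # assumed page size; real engines vary
CHECKSUM_BYTES = 4    # assumed: CRC32 stored in the last 4 bytes of each page


def make_page(payload: bytes) -> bytes:
    """Build a page in the toy format: zero-padded payload followed by its CRC32."""
    body_size = PAGE_SIZE - CHECKSUM_BYTES
    if len(payload) > body_size:
        raise ValueError("payload too large for one page")
    body = payload.ljust(body_size, b"\x00")
    return body + zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big")


def find_corrupt_pages(data: bytes) -> list[int]:
    """Return indexes of pages whose stored CRC32 does not match their contents."""
    bad = []
    full_pages, remainder = divmod(len(data), PAGE_SIZE)
    for index in range(full_pages):
        page = data[index * PAGE_SIZE:(index + 1) * PAGE_SIZE]
        body, stored = page[:-CHECKSUM_BYTES], page[-CHECKSUM_BYTES:]
        if zlib.crc32(body).to_bytes(CHECKSUM_BYTES, "big") != stored:
            bad.append(index)
    if remainder:
        # A trailing partial page means the file was truncated.
        bad.append(full_pages)
    return bad
```

A scan like this only tells you which pages to restore from backup; it cannot repair them. Real databases ship their own verification utilities, which should be preferred over hand-rolled checks.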
Conflicting or Incomplete Transactions
Database recovery rolls forward the committed transactions found in the transaction logs and rolls back any that were incomplete or uncommitted at the time of failure. If the log records for those transactions are damaged, missing, or inconsistent, neither phase can finish. These problems usually stem from a non-graceful shutdown, such as a power loss or crash. Some typical symptoms include:
- Repeated rollback attempts
- Stuck rolling forward or rolling back transactions
- Orphaned prepared transactions
- Inconsistent transaction log records
Fixing transaction-related recovery problems requires careful examination of the logs to identify and resolve any incomplete transactions. Manual intervention may be needed to commit or roll back stuck transactions. Point-in-time recovery to before the issue may be necessary.
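To make "identify incomplete transactions" concrete, the sketch below scans a simplified log and separates committed transactions from those that never reached a commit or abort record. The record format (`LogRecord` with `BEGIN`, `UPDATE`, `COMMIT`, `ABORT` kinds) is a hypothetical one invented for this example; real transaction logs are engine-specific and are read with vendor tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    txn_id: int
    kind: str  # "BEGIN", "UPDATE", "COMMIT", or "ABORT" in this toy format


def classify_transactions(log):
    """Return (committed, incomplete) transaction ids, each sorted ascending."""
    state = {}
    for record in log:
        if record.kind in ("COMMIT", "ABORT"):
            # A terminal record settles the transaction's outcome.
            state[record.txn_id] = record.kind
        else:
            # First sighting without a terminal record: treat as incomplete for now.
            state.setdefault(record.txn_id, "INCOMPLETE")
    committed = sorted(t for t, s in state.items() if s == "COMMIT")
    incomplete = sorted(t for t, s in state.items() if s == "INCOMPLETE")
    return committed, incomplete
```

In this simplified model, the incomplete list is what an administrator would examine when deciding which transactions need a manual commit or rollback.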
System Resource Exhaustion
A shortage of memory, threads, disk space, or network bandwidth can cause database recovery processes to hang or fail. If the server is too constrained, recovery cannot complete. Watch for:
- Consistently high CPU, memory, or I/O usage
- Database or transaction log disk space filling up
- Memory errors or swapping/paging
- Thread deadlocks or resource waits
Add more resources like RAM, CPUs, storage capacity, or faster networks to resolve resource starvation. Tune configuration settings that limit resources used by recovery processes like memory caches or parallel threads.
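Disk space on the data and log volumes is one of the easier resources to check before retrying a recovery. The following Python sketch uses the standard library's `shutil.disk_usage`; the 80% and 95% thresholds and the function names are illustrative assumptions, not vendor recommendations.

```python
import shutil


def space_status(total_bytes, free_bytes, warn_pct=80.0, crit_pct=95.0):
    """Classify a volume as "ok", "warning", or "critical" by percent used."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def check_path(path):
    """Report the status of the volume that holds `path`."""
    usage = shutil.disk_usage(path)
    return space_status(usage.total, usage.free)
```

Calling `check_path` on the directories that hold the data files and the transaction logs gives a quick signal of whether recovery is likely to run out of room.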
Blocking Locks and Latches
Locks and latches held by active sessions or queries can prevent an in-progress recovery from making changes. Types of blocking include:
- Locks on database objects
- Log file latches
- IO requests waiting on busy files
- Locks taken for crash recovery
Identify any long-running transactions, sessions, or queries that are holding needed locks. Kill blocking processes or roll back problematic transactions if possible. Disable extraneous applications and services during recovery to reduce contention.
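When several sessions wait on one another, the useful question is which session sits at the head of each blocking chain. The sketch below assumes you have already exported a "waiter to blocker" mapping from your database's lock views; the session names and the `root_blockers` function are hypothetical.

```python
def root_blockers(waits_for: dict[str, str]) -> set[str]:
    """Find sessions that block others but are not themselves waiting.

    `waits_for` maps each waiting session to the session blocking it.
    Chains that loop back on themselves (deadlocks) have no root and are skipped.
    """
    roots = set()
    for waiter in waits_for:
        current, visited = waiter, set()
        # Walk up the chain until we reach a non-waiting session or a cycle.
        while current in waits_for and current not in visited:
            visited.add(current)
            current = waits_for[current]
        if current not in waits_for:
            roots.add(current)
    return roots
```

The sessions returned are the candidates to investigate first, since ending a session in the middle of a chain does not release the sessions queued behind the root.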
Startup Errors
Problems that occur when initially starting database services after a failure can also lead to stuck recovery routines. For example:
- Errors mounting database files
- Crash recovery failures
- Cannot access transaction logs
- System table corruption
Review logs for initialization errors and failed health checks. Resolve startup failures like repairing files or restoring data before attempting recovery again. The database may need reinstalling or recreating if system table errors occur.
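Reviewing a large startup log by hand is slow, so a small filter for severe entries can help. This sketch assumes a hypothetical line format of `TIMESTAMP LEVEL message` and an assumed set of severe level names; adjust both to match whatever your database actually writes.

```python
import re

# Assumed format: a timestamp token, an upper-case level, then the message.
LINE_PATTERN = re.compile(r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")
SEVERE_LEVELS = {"ERROR", "FATAL", "PANIC"}  # assumed level names


def severe_lines(lines):
    """Return (line_number, level, message) for each severe log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in SEVERE_LEVELS:
            hits.append((number, match.group("level"), match.group("message")))
    return hits
```

Lines that do not match the assumed format are ignored rather than reported, so an empty result does not prove the log is clean.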
Apply and Undo Problems
During recovery, changes recorded in the transaction logs are applied to database files, and any uncommitted transactions are rolled back or undone. Failures in this apply and undo phase can cause issues:
- Errors applying logged changes
- Crashing during rollback
- Undo records not found
- Timeout rolling back large transactions
Enable detailed recovery logging to diagnose apply/undo errors. Resolve any discrepancies found between data files and logs. Where the engine exposes them (for example, rollback segments or an UNDO tablespace), size undo storage to accommodate longer-running transactions. Deal with the largest or oldest problem transactions first.
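The apply and undo phases can be illustrated with a toy model: replay every logged change, then walk the log backwards and restore the before-image of any change made by a transaction that never committed. The tuple-based log format and the `recover` function are assumptions for this sketch; real engines add checkpoints, log sequence numbers, and compensation records.

```python
def recover(pages, log):
    """Apply a toy redo/undo pass and return the recovered key/value state.

    `log` holds tuples of two shapes (a hypothetical format):
      ("UPDATE", txn_id, key, before_value, after_value)
      ("COMMIT", txn_id)
    """
    state = dict(pages)
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}

    # Apply (redo) phase: replay every logged change in order.
    for rec in log:
        if rec[0] == "UPDATE":
            _, _txn, key, _before, after = rec
            state[key] = after

    # Undo phase: newest first, restore before-images of uncommitted changes.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _txn, key, before, _after = rec
            state[key] = before

    return state
```

If an undo record is missing from the log in this model, the before-image is unavailable and the uncommitted change cannot be reversed, which mirrors the "undo records not found" symptom listed above.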
Media Recovery Failures
Media recovery is the process of restoring database files from backup before recovery is run. Media recovery problems like the following will lead to recovery stalling:
- Backup files unavailable or corrupted
- Not enough space to restore files
- Incomplete or inconsistent backups
- Network errors or timeouts
Use reliable backup storage and test restores periodically. Monitor space for your backups. Maintain a valid backup strategy with frequent full and incremental backups. Resolve network issues between database and backup servers.
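One way to catch unavailable or corrupted backup files before a restore begins is to compare them against a manifest of expected hashes. The sketch below assumes a manifest that maps file names to SHA-256 hex digests; that manifest format and the function names are hypothetical, and many backup tools provide their own verify command that should be used where available.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(backup_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Return the names of manifest entries that are missing or fail their hash."""
    missing, corrupt = [], []
    for name, expected in sorted(manifest.items()):
        candidate = backup_dir / name
        if not candidate.is_file():
            missing.append(name)
        elif sha256_of(candidate) != expected:
            corrupt.append(name)
    return {"missing": missing, "corrupt": corrupt}
```

Running a check like this as part of periodic restore tests surfaces bad backups while there is still time to take a fresh one.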
Upgrade Issues
Recovering a database after a failed or interrupted upgrade can also pose challenges. For example:
- Rollback failures after upgrade
- Incompatibilities between versions
- Corruption from migration errors
- Feature differences between releases
Thoroughly test upgrades and rollbacks in Dev/QA to avoid production issues. Consult release notes for version differences. Scrutinize logs for migration-related errors and revert from backup if needed.
Log Transport Problems
Databases using log transport to send transaction logs to a standby system can have recovery problems if the transport fails or has issues:
- Network connectivity problems
- Log sender/receiver errors
- Standby missing logs
- Log IO latency
Monitor and resolve networking problems between primary and standby. Tune log transport settings for performance. Verify, for example with checksums and sequence numbers, that the standby has received every log from the primary. Fall back to native logging if needed.
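A standby that is missing logs can often be diagnosed by looking for gaps in the sequence numbers it has received. The following sketch assumes logs carry consecutive integer sequence numbers, which is a simplification; `missing_ranges` is a hypothetical helper, not part of any database's tooling.

```python
def missing_ranges(received, first, last):
    """Return inclusive (start, end) ranges of sequence numbers not in `received`.

    `first` and `last` bound the range the standby is expected to hold.
    """
    have = set(received)
    gaps, start = [], None
    for seq in range(first, last + 1):
        if seq not in have:
            if start is None:
                start = seq  # a new gap begins here
        elif start is not None:
            gaps.append((start, seq - 1))  # the gap ended just before seq
            start = None
    if start is not None:
        gaps.append((start, last))  # gap runs to the end of the expected range
    return gaps
```

Each range returned identifies logs that must be re-shipped from the primary or restored from an archive before the standby can continue applying changes.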
Security and Permission Issues
Recovery processes may fail to start or complete if security policies are misconfigured:
- Service accounts lack proper privileges
- File/folder permissions prevent access
- Audit settings blocking actions
- Firewall blocking network access
Review permissions and grant recovery processes the access they need. Disable unnecessary auditing during recovery. Whitelist IP addresses if firewall rules are blocking access.
Manual Errors
Admin mistakes made during the recovery process can also lead to stalled or failed jobs:
- Point-in-time recovery target is incorrect
- Important steps skipped
- Critical files or backups missed
- Improper shutdown/startup sequence
Follow database vendor recovery instructions closely. Double check process and inputs. Take your time and don’t skip steps. Ask for assistance if unsure of any recovery details.
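An incorrect point-in-time target is easy to catch with a sanity check before the recovery starts. This sketch assumes the target must fall between the end of the restored backup and the end of the last available log; the function name and the returned status strings are illustrative choices.

```python
from datetime import datetime


def validate_pitr_target(target: datetime, backup_end: datetime, last_log_end: datetime) -> str:
    """Check that a recovery target lies within the recoverable window."""
    if backup_end > last_log_end:
        raise ValueError("backup ends after the last available log")
    if target < backup_end:
        return "too_early"   # an older backup is needed to reach this time
    if target > last_log_end:
        return "too_late"    # logs do not extend this far
    return "ok"
```

Checking the inputs this way costs seconds and can save a full restore cycle that would otherwise stall partway through.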
Unsupported Recovery Options
Attempting to use unsupported recovery methods or options can also cause issues:
- Third party tools not certified for the database
- Features labeled deprecated or end-of-life
- Unsupported crossover/downgrade versions
- Recovering to different hardware
Stick to vendor supported recovery tools and techniques. Favor in-place recoveries when possible. Don’t change server hardware or downgrade versions unless explicitly certified.
Unoptimized Configuration
Certain configuration settings can slow down recovery processes or lead to problems:
- Undersized memory caches or buffers
- Too few or oversubscribed CPU cores
- Aggressive resource throttling
- Log files on slow storage
Tune configurations to ensure adequate resources for recoveries. Size memory appropriately. Assign sufficient cores/sockets. Place transaction logs on fast storage. Stress test on non-production first.
Service and Process Failures
Stopped, crashed, or misbehaving database services and processes will stall recovery:
- Database engine won’t start up or initialize
- Recovery manager not starting
- Crashes or assertion failures
- Stuck on loading files or redo logs
Inspect process status and logs to identify failures. Restart stopped or crashed services. Apply corrective actions like patches for any known issues. Enable debugging or tracing for more details on process problems.
Limitations of Recovery Manager
Design limitations in the database recovery manager can also sometimes lead to problems:
- Limited concurrency and parallelism
- Poor scaling at higher database sizes
- Slow redo log application
- Exhaustion of rollback segments
Engage vendor support for suspected recovery manager bottlenecks. Request guidance on optimizing for large databases. Upgrade recovery manager module if better scaling is available in later releases.
Conclusion
While database recovery issues can stem from many different causes, the common theme is identifying and resolving the underlying problem blocking the recovery process from completing. Careful diagnosis of symptoms along with a solid methodology for observing, gathering data, and testing theories is key.
Treating recovery problems like any other performance investigation, and avoiding shortcuts or assumptions, is critical. Proper monitoring, logging, and tools for deep visibility into the recovery process help troubleshoot the root cause.
Database vendors also provide extensive guidance on optimized recovery configurations, troubleshooting steps, and best practices that take much of the guesswork out of the process. Leveraging all available documentation and assistance to make the recovery process go smoother is highly recommended.
With robust backup and recovery procedures in place ahead of time, thorough issue analysis, and persistence in eliminating roadblocks, even stubborn database recovery failures can usually be resolved before they cause data loss.
Recovery Failure Type | Common Causes | Troubleshooting Steps |
---|---|---|
Hardware instability | High utilization, errors, component failures | Monitor usage, check logs, test components, upgrade hardware |
Storage corruption | Hardware issues, software bugs, crashes | Scan and replace corrupted files, revert to backup |
Transaction issues | Crashes, incomplete transactions | Examine logs, manually resolve transactions |
Resource exhaustion | Memory, CPU, disk space depleted | Scale up resources, tune configurations |
Blocking locks | Long queries, concurrency issues | Identify and resolve blocking queries |
Startup failures | File access issues, crashes | Repair corruptions, restore files |
Apply/undo failures | File discrepancies, long rollbacks | Prioritize transactions, extend rollback segments |
Media recovery issues | Backup problems, incomplete data | Maintain backup systems, monitor closely |
Upgrade problems | Compatibility issues, migration corruption | Test rigorously, check release notes |