What to do when a database is in recovery pending?

What does “recovery pending” mean for a database?

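When a database is in a “recovery pending” state, it was not shut down cleanly and must complete crash recovery before it can be used again. During crash recovery, the database system replays its transaction logs and rolls back uncommitted work to bring the database back to a consistent state. In SQL Server, where RECOVERY_PENDING is a formal database state, it specifically means that recovery is required but has not started or cannot proceed, usually because a file is missing or a resource is unavailable; a database that is actively replaying its log is reported as RECOVERING. Some key things to understand about recovery pending: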

  • User connections cannot be accepted during this state. Any attempt to connect to the database will be rejected.
  • No reads or writes can be performed against the database until recovery completes.
  • Recovery pending indicates the database is offline and unavailable for normal operations.
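  • The database normally leaves this state on its own once crash recovery completes. If recovery cannot start or fails, the database stays in this state until the underlying problem is fixed.
  • The time spent in recovery depends on the volume of logged changes that must be replayed or rolled back.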

So in summary, recovery pending is a temporary offline mode for the database to safely recover. User applications will be unable to connect during this process.

How can you monitor the recovery pending state?

There are a few ways to monitor the progress of a database undergoing recovery:

  • Check the database logs – The database logs will provide detailed information on the stages of recovery and any associated errors or warnings.
  • Use database console/utilities – Most database platforms provide a console or utilities that can query the status of a recovering database.
  • Check service/process status – The database process/service can be queried to see if it is still in a starting or recovering state.
  • Attempt connections – Trying to open connections periodically can indicate when recovery completes and connections are allowed again.
  • Monitor performance metrics – Disk IO and CPU usage are often elevated while the logs are being replayed, and drop back once recovery finishes.

Automated monitoring scripts can also poll the database and logs at a frequent interval to detect when recovery has completed and normal operations can resume.
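The polling approach described above can be sketched in Python. This is a minimal illustration rather than a platform-specific tool: wait_until_online and its try_connect parameter are hypothetical names, and the caller is assumed to supply a function that attempts a real connection with whatever driver the database uses.

```python
import time
from typing import Callable


def wait_until_online(
    try_connect: Callable[[], bool],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll try_connect until it returns True or timeout_s elapses.

    try_connect is a caller-supplied (hypothetical) function that returns
    True when a connection succeeds. Exceptions are treated as "not yet".
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if try_connect():
                return True
        except Exception:
            # A refused connection is expected while recovery is running.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Keeping the connection attempt behind a callable means the same loop works for any database driver, and the sleep and clock parameters make it easy to test without waiting in real time.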

What maintenance can be performed during recovery pending?

Because the database is offline during recovery pending, only a few maintenance tasks are possible, and several of them are preparation for the period right after recovery completes:

  • Review database logs – The recovery logs can provide useful insight into any recurring issues or errors.
  • Adjust database parameters – Some static configuration parameters may be changed temporarily for the next startup.
  • Take database backups – A backup can be taken of the database files/volumes after recovery completes.
  • Review capacity needs – Use the downtime to evaluate if capacity needs to be expanded.
  • Schedule maintenance window – Plan a maintenance window after recovery to apply any updates or patches.
  • Check for corruption – Once recovered, verify there is no data corruption or inconsistencies.
  • Upgrade hardware – If recovery took a long time due to hardware constraints, an upgrade may be warranted.

However, no direct changes or active administration of the recovering database itself can be done until it is back online.

What causes a database to go into recovery pending?

Some common triggers for a database entering a recovery pending state include:

  • Server crash or failure – An OS crash, power loss, or storage failure results in abrupt database shutdown.
  • Manual database shutdown – Administrators forcefully shut down the database without proper quiescing.
  • Memory errors – Memory leaks, segmentation faults, or heap corruption crash the database process.
  • Resource constraints – Lack of disk space, memory, or CPU can cause database failure.
  • Code defects – Bugs, flaws, or race conditions in database code crash the instance.
  • Hardware faults – CPU, memory, motherboard, storage faults can crash a database.
  • Recovery stalling – A prior recovery fails or stalls, forcing another recovery attempt.

Determining the root cause of the original failure can provide insight to prevent future unplanned outages requiring recovery. Detailed information is usually logged during the recovery process itself as well.

How can you avoid recovery pending states?

Some best practices to avoid unexpected database crashes and recovery pending states:

  • Use redundant, fault-tolerant server infrastructure.
  • Follow a regular backup schedule for disaster recovery.
  • Monitor system health metrics like disk, memory, CPU usage.
  • Keep the database software up to date with the latest fixes.
  • Tune the database server for optimal performance.
  • Set up alerting for key performance metrics.
  • Adhere to the database’s recommended maintenance schedule.
  • Ensure database connection limits are sized appropriately.
  • Isolate databases on separate infrastructure when possible.
  • Avoid forced or unplanned manual restarts and shutdowns in production.

Proper database configuration, stable well-managed infrastructure, and regular maintenance are key to maximizing uptime and avoiding unplanned downtime.
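The monitoring and alerting practices above can be illustrated with a small threshold check in Python. This is a sketch under stated assumptions: the metric names and threshold values are hypothetical, and a real deployment would normally rely on an existing monitoring system rather than a hand-written script.

```python
import shutil


def disk_used_pct(path):
    """Return the percentage of disk space in use on the volume holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def find_breaches(metrics, thresholds):
    """Return {metric_name: value} for every metric at or above its threshold.

    A metric that has a threshold but no reading is reported with value None,
    since missing data is itself worth an alert.
    """
    breaches = {}
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None or value >= limit:
            breaches[name] = value
    return breaches


if __name__ == "__main__":
    # Hypothetical thresholds; pick values that suit the actual system.
    thresholds = {"disk_used_pct": 90.0}
    metrics = {"disk_used_pct": disk_used_pct(".")}
    print(find_breaches(metrics, thresholds))
```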

What steps should be taken after a database recovers?

Once a database completes recovery and is back online, the following post-recovery steps should be taken:

  • Verify connectivity – Confirm applications can establish connections to the database.
  • Check data integrity – Spot check data and run integrity checks to ensure proper recovery.
  • Review logs – Look for errors, warnings, or any anomalous activity during recovery.
  • Determine recovery time – Calculate the downtime and recovery time for the incident.
  • Take fresh backups – Take new backups immediately in case they are needed.
  • Monitor system health – Keep an eye on system metrics like CPU, memory in case of issues.
  • Notify stakeholders – Update relevant teams/users that the database is recovered and online.
  • Identify cause – Dig deeper into the logs and events leading up to the failure to determine the root cause.
  • Assess impact – Document business impact, revenue loss, SLA breaches, and damage from the downtime.
  • Revert temporary settings – Undo any temporary parameter changes made for recovery.
  • Evaluate capacity needs – Review if capacity needs to be expanded to handle workload.
  • Update runbook – Improve recovery runbook with lessons learned from the incident.

Proactively assessing the recovery process and applying any lessons learned can better prepare for dealing with any future database outages.
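For the “determine recovery time” step, the arithmetic is simple once three timestamps have been pulled from the logs. The sketch below assumes those timestamps are already known; outage_summary is a hypothetical helper name, not part of any database tool.

```python
from datetime import datetime, timedelta


def outage_summary(
    failed_at: datetime,
    recovery_started_at: datetime,
    online_at: datetime,
) -> dict[str, timedelta]:
    """Return total downtime and the portion spent in recovery."""
    if not failed_at <= recovery_started_at <= online_at:
        raise ValueError("timestamps must be in chronological order")
    return {
        # From the failure until connections were accepted again.
        "downtime": online_at - failed_at,
        # From the start of recovery until connections were accepted again.
        "recovery_time": online_at - recovery_started_at,
    }
```

The gap between the two figures is the time that passed before recovery even began, which is often where detection and response can be improved.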

How to restart a database instance stuck in recovery pending?

If a database instance remains stuck in recovery pending for an excessive amount of time, a restart may be required to get out of the pending state. Some approaches include:

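  • Kill database process – As a last resort, forcibly terminate the database process/service to trigger a restart. Recovery generally starts over on the next startup, so this rarely saves time.
  • Bypass recovery – Some platforms offer an emergency mode that gives limited access without completing recovery (for example, SQL Server’s EMERGENCY state). Use it only to inspect or extract data, since the database may be transactionally inconsistent.
  • Adjust recovery parameters – Tune the settings that control recovery parallelism or memory before the next startup, if the platform exposes them (for example, Oracle’s FAST_START_PARALLEL_ROLLBACK).
  • Start mount-only – Start in mount mode to confirm that all datafiles and logs are present, then open the database. Opening still performs any required recovery; mounting does not skip it.
  • Restore backup – Restore a backup taken before the failure and roll it forward, accepting the loss of any changes that cannot be reapplied.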
  • Call support – If available, engage database vendor support for further troubleshooting.

However, repeatedly restarting a recovery pending database without fixing underlying issues can result in data loss or corruption. The root cause should be thoroughly investigated first.

Troubleshooting steps for prolonged recovery pending

If a database remains stuck in recovery pending beyond an acceptable timeframe, administrators should investigate potential reasons:

  • Review logs – Check recovery logs for errors and investigate any reported issues.
  • Check disk space – Verify that sufficient disk space exists for logs and undo (rollback) data.
  • Monitor system resources – Look for bottlenecks such as CPU, memory, or IO that slow recovery.
  • Validate database files – Make sure all datafiles and log files are present and accessible.
  • Adjust memory settings – Insufficient memory allocated to the recovery process will slow it down.
  • Tune rollback segments – Too few can become a bottleneck, while too many can also slow recovery.
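  • Rule out triggers/constraints – Crash recovery replays logged changes and does not re-fire triggers or re-check constraints, so disabling these objects is unlikely to help.
  • Start mount-only – Try starting the database in mount-only mode to check file status, then open it normally.
  • Open database read-only – Where the platform allows it, open the database read-only first to inspect its state.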
  • Engage support – Database vendor support can provide guidance if available.

Slow, stalled recovery scenarios are complex. Follow a methodical process of checking logs, system resources, and database objects to isolate the underlying problem.
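Two of the checks above, validating database files and checking disk space, are easy to script. The following Python sketch assumes the list of expected file paths is already known from the database configuration; check_files and free_space_ok are hypothetical helper names.

```python
import os
import shutil
from pathlib import Path


def check_files(paths):
    """Return {path: problem} for every expected file that fails a basic check."""
    problems = {}
    for p in map(Path, paths):
        if not p.is_file():
            problems[str(p)] = "missing"
        elif not os.access(p, os.R_OK):
            problems[str(p)] = "not readable"
        elif p.stat().st_size == 0:
            problems[str(p)] = "empty"
    return problems


def free_space_ok(directory, min_free_bytes):
    """Return True if the volume holding directory has at least min_free_bytes free."""
    return shutil.disk_usage(directory).free >= min_free_bytes
```

These checks only confirm that files exist, can be read, and are non-empty. They say nothing about whether the contents are intact, which still requires the database’s own validation tools.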

What are the most common errors when a database is recovering?

Some typical errors that can be encountered during database crash recovery include:

  • Missing datafiles – Recovery cannot start due to corrupt or absent datafiles.
  • Redo log corruption – Damaged redo logs prevent proper transaction replay.
  • Archived log IO issues – Failure to read required archived redo logs.
  • Space errors – Lack of disk space for rollback segments and temporary files.
  • Memory issues – Insufficient memory allocated for recovery processing.
  • Object contention – Locks or latches held on database objects block access.
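  • Replay failures – A logged change cannot be applied, for example because the target block is corrupt. Crash recovery replays logged changes rather than re-executing SQL statements.
  • Lock conflicts – Where the database opens before rollback has finished, new transactions can block on rows still locked by transactions awaiting rollback.
  • Restart bottlenecks – Databases recovering at the same time on one server compete for resources, slowing all of them.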
  • Rollback segment errors – Too few or too small rollback segments bottleneck.

Analyzing the specific recovery errors reported can provide troubleshooting clues. Isolating the first error shown often highlights the root cause.
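Finding the first error in a long recovery log can be automated. The Python sketch below assumes the log marks severity with the keywords ERROR, FATAL, or PANIC; that convention is an assumption, and real log formats differ between platforms, so the pattern would need adjusting.

```python
import re

# Assumed severity keywords (case-sensitive); adapt to the actual log format.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|PANIC)\b")


def first_error(lines):
    """Return (line_number, text) for the first matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            return number, line.rstrip("\n")
    return None
```

Because the function accepts any iterable of lines, it can be given an open file object directly, so a large log does not have to be loaded into memory.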

How to recover from catastrophic database failure?

Recovering from a catastrophic database failure where datafiles or redo logs are corrupted requires an extensive data recovery process:

  • Take database offline immediately – Prevent further writes to corrupted files.
  • Review storage failure alerts – Check for disk errors, out of space conditions.
  • Assess damage scope – Determine database files affected and timeframes.
  • Restore clean backups – Restore up-to-date uncorrupted full and incremental backups.
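  • Apply archived redo – Reapply archived redo logs up to the point just before the corruption occurred.
  • Open database – Open the database so that it completes any remaining recovery and becomes available.
  • Verify integrity – Run the platform’s consistency checker (for example, DBCC CHECKDB on SQL Server or RMAN VALIDATE on Oracle) to confirm there is no corruption after recovery.
  • Compare reports – Compare the output of known reports or queries against pre-failure results to verify the data.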
  • Update documentation – Document failure timeline, recovery process, and findings.
  • Notify users – Inform necessary parties of recovery process and any restored data timeframes.

A methodical approach of restoring backups combined with reapplying archived logs can effectively recover a database from catastrophic failures in most cases.
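The selection logic behind “restore a backup, then reapply archived logs up to just before the corruption” can be sketched in Python. This is an illustration only: the Backup and ArchivedLog types and the plan_restore function are hypothetical, incremental backups are left out for brevity, and in practice the database’s own restore tooling makes these choices and stops log application at the target time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    name: str
    taken_at: datetime


@dataclass(frozen=True)
class ArchivedLog:
    name: str
    first_change_at: datetime
    last_change_at: datetime


def plan_restore(backups, logs, corrupted_at):
    """Pick the newest backup before corrupted_at and the logs to apply after it.

    The final log in the returned list may extend past corrupted_at; it should
    be applied only up to that point. Gaps between logs are not detected here.
    """
    candidates = [b for b in backups if b.taken_at < corrupted_at]
    if not candidates:
        raise ValueError("no backup predates the corruption")
    base = max(candidates, key=lambda b: b.taken_at)
    needed = sorted(
        (
            log
            for log in logs
            if log.last_change_at > base.taken_at
            and log.first_change_at < corrupted_at
        ),
        key=lambda log: log.first_change_at,
    )
    return base, needed
```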

How to recover from a prolonged recovery pending?

Some potential solutions for resolving a database stuck in prolonged recovery pending include:

  • Review logs for errors – Address any reported errors or problems preventing completion.
  • Allocate more temp space – Increase temp tablespace size if running out of space.
  • Tune recovery parallelism – Adjust the number of parallel recovery or rollback processes to match the available CPU and IO capacity.
  • Disable non-essential features – Turn off resource-intensive features like auditing.
  • Start mount-only – Attempt starting up in mount mode then opening the database.
  • Restore backup – Restore backups from before the failure if restoring and rolling forward is faster than waiting for recovery.
  • Call support – Engage vendor database support for specific recommendations.
  • Upgrade hardware – Improve server resources if recovery is slow due to resource constraints.
  • Adjust tuning parameters – Tweak parameters around undo retention, processes, memory.
  • Open read-only – Where the platform allows it, keep the database read-only temporarily until its state has been verified.

Isolating the bottleneck is key, whether that is disk space, memory, processing resources or database configuration parameters.

What are performance considerations during recovery?

Database crash recovery places significant demands on system resources that can impact performance:

  • Disk IO – Reading redo logs and writing rolled-back blocks generate heavy disk load.
  • CPU usage – Log scanners and processes rolling back transactions consume CPU cycles.
  • Memory – Buffer cache usage increases as the blocks being recovered are read in.
  • Network IO – Restoring archived redo logs from remote storage can require high network bandwidth.
  • Concurrency – More rollback segments and processes may be needed to reduce contention.
  • Temp space – A large amount of temporary tablespace may be used during recovery.
  • Read consistency – Maintaining full read consistency during recovery incurs overhead.
  • Redo generation – The more redo generated before the failure, the more IO recovery needs to replay it.
  • Compression – Compressed archived logs must be decompressed before they are applied, which adds CPU cost during recovery.
  • Index maintenance – Rebuilding indexes and statistics post-recovery adds load.

Understanding these resource usage patterns allows properly sizing systems, tuning configurations and scheduling maintenance to support databases during crash recovery scenarios.
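For sizing purposes, a rough first-order estimate of replay time is the volume of log to replay divided by the replay throughput the system can sustain. The Python sketch below makes that simplification explicit: it assumes replay is limited by a single measured throughput figure and ignores rollback work, so it gives a ballpark figure rather than a prediction. The function name and inputs are hypothetical.

```python
def estimate_replay_seconds(log_bytes_to_replay, replay_bytes_per_second):
    """Estimate replay time as log volume divided by sustained replay throughput."""
    if log_bytes_to_replay < 0:
        raise ValueError("log volume cannot be negative")
    if replay_bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return log_bytes_to_replay / replay_bytes_per_second


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Illustrative inputs only: 8 GiB of log replayed at 64 MiB/s.
    print(estimate_replay_seconds(8 * GIB, 64 * MIB))  # prints 128.0
```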

Conclusion

Recovery pending is the state a database enters when it must replay and roll back logged transactions to restore consistency after a failure. Monitoring resources, tuning configurations, and analyzing logs help minimize recovery times during this period. Restoring from backups combined with archived redo can recover even from catastrophic data failures. Careful system sizing, regular maintenance, and defining recovery strategies are key to maintaining uptime and availability through outages requiring recovery.