What was the cause of the FAA glitch?

The recent nationwide outage of the Federal Aviation Administration’s (FAA) Notice to Air Missions (NOTAM) system that led to the grounding of all domestic departing flights on January 11, 2023, has left many wondering what exactly caused such a major failure. While the root cause is still under investigation, early analysis points to a database failure during a routine overnight system update as the trigger that started the cascading breakdown throughout the morning.

What is the NOTAM system?

NOTAMs are notices containing information essential to personnel concerned with flight operations, but not known far enough in advance to be publicized by other means. They contain information on unanticipated or temporary changes to components of the National Airspace System, such as:

  • Closed runways
  • Equipment outages
  • Taxiway closures
  • Inoperable navaids and lights
  • Bird hazard warnings

Pilots are required to check NOTAMs before each flight to see if there is any information that could affect their route. The NOTAM system provides this critical safety information that lets pilots know what operating conditions to expect during a flight.

When did the outage occur?

On January 11, 2023 at approximately 1:00 am PT, the FAA began experiencing issues with the NOTAM system during a regularly scheduled outage window used for maintenance. Updates to the database failed, setting off a chain reaction of problems.

As pilots and airlines began preparing for morning departures on the East Coast, it quickly became clear something was wrong. NOTAM information was incomplete, leading to uncertainty about operating restrictions and requirements. Just before 6:00 am ET, the FAA ordered all domestic departures grounded pending resolution of the NOTAM system issue.

How long did the failure last?

The NOTAM system remained in a failed state for approximately 10 hours. At 7:05 am ET, the FAA reported they were still working to validate the integrity of NOTAM data. With uncertainty about the accuracy of the information pilots rely on to operate safely, the ground stop continued.

Limited departure operations resumed around 8:30 am ET, but delays and cancellations continued to snowball. It wasn’t until just before 9:00 am ET that the FAA fully lifted the ground stop on all domestic departures and began allowing operations to fully resume.

How many flights were impacted?

During the outage, more than 10,000 flights within, into, or out of the United States were delayed or canceled. This included more than 9,000 delayed flights and nearly 1,500 canceled flights, according to the flight tracking website FlightAware.

The impact was felt across the country, but was most significant for the major airlines operating out of East Coast transit hubs like Atlanta, Washington D.C., New York, and Miami. Ripples spread globally as banks of delayed planes caused downstream disruptions at international departure points.

What was the economic impact?

While a full post-mortem on losses stemming from the outage has yet to be tallied, early estimates on direct costs to the industry reach hundreds of millions of dollars:

  • Lost operating revenue from canceled flights
  • Added costs to airlines for delayed crew, aircraft, and passengers
  • Refunded tickets and rebooked travel for impacted passengers
  • Lost economic opportunity from business travel and cargo transport delays

There is also the harder-to-quantify erosion of public confidence, along with the inconvenience caused by the disruption. For an industry still rebuilding operations after Covid-related drawdowns, the outage was an untimely setback.

What caused the initial system failure?

Investigators believe the root cause was an overloaded data file that crashed the application servers generating NOTAM updates. Here is what is known so far about the sequence of events:

  1. Routine overnight update initiated at approximately 1:00 am PT
  2. Database patch inserted large data file exceeding capacity limits
  3. Application servers crashed trying to process update
  4. Backup failover servers also overloaded and failed
  5. NOTAM system entered invalid state with incomplete, inaccurate data

So in essence, a preventable software error during a seemingly innocuous database update triggered a cascade of failures the system was not resilient enough to recover from.
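The kind of safeguard that step 2 appears to have lacked can be pictured as a size check that runs before an update is applied. The sketch below is illustrative only: `MAX_UPDATE_BYTES`, `validate_update`, and `apply_update` are hypothetical names, and the limit is an arbitrary value, not a detail of the FAA system.

```python
from dataclasses import dataclass

# Hypothetical capacity limit; the real system's limits are not public.
MAX_UPDATE_BYTES = 1_000_000


@dataclass
class ValidationResult:
    ok: bool
    reason: str


def validate_update(payload: bytes, max_bytes: int = MAX_UPDATE_BYTES) -> ValidationResult:
    """Reject an update before it reaches the database if it is empty or too large."""
    if not payload:
        return ValidationResult(False, "empty update file")
    if len(payload) > max_bytes:
        return ValidationResult(False, f"update is {len(payload)} bytes, limit is {max_bytes}")
    return ValidationResult(True, "ok")


def apply_update(payload: bytes, database: list) -> bool:
    """Apply the update only when validation passes; otherwise leave the database untouched."""
    result = validate_update(payload)
    if not result.ok:
        return False
    database.append(payload)
    return True
```

Here the database is just a list standing in for real storage. The point of the design is that a rejected update never touches live data, so a bad file produces a refused change rather than a crashed server.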

How could a basic error cause such an extensive outage?

While the specific failure triggering the NOTAM system crash was a simple mistake, it exposed weaknesses in the FAA’s IT infrastructure:

  • Fragile legacy system – The NOTAM database relies on outdated hardware and software vulnerable to failures.
  • Lack of redundancy – Backup systems did not have enough capacity to maintain operations when primary servers failed.
  • Inadequate safeguards – Validation controls and health monitoring lacked sophistication to catch critical errors.
  • Delayed testing – New emergency measures took too long to simulate and test before they could be deployed to production.

So while human error may have sparked the outage, observers fault FAA management for poor technology governance that left the NOTAM system susceptible to breakdowns.
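The redundancy gap described above can be illustrated with a small failover model in which a backup exists but is too small to absorb the primary's load. This is a simplified sketch under assumed names (`Server`, `route`, `NoCapacityError`) and made-up capacity figures; it does not describe the FAA's actual architecture.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    capacity: int          # maximum load units this server can carry
    load: int = 0
    healthy: bool = True

    def can_accept(self, units: int) -> bool:
        return self.healthy and self.load + units <= self.capacity


class NoCapacityError(RuntimeError):
    """Raised when neither the primary nor any backup can take the load."""


def route(servers: list[Server], units: int) -> str:
    """Send load to the first healthy server with spare capacity (primary first)."""
    for server in servers:
        if server.can_accept(units):
            server.load += units
            return server.name
    raise NoCapacityError(f"no server can accept {units} units")
```

With a pool such as `[Server("primary", 100), Server("backup", 40)]`, marking the primary unhealthy shifts traffic to the backup, but the backup refuses anything beyond its 40 units. Sizing backups for the full primary load is what turns nominal redundancy into real redundancy.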

What emergency measures were taken?

Facing a national flight crisis, the FAA needed to find a workaround fast. They pursued three stopgap solutions to restore NOTAM data access while engineers worked to recover the main system:

  1. Coming Soon NOTAM Viewer – Read-only view of NOTAM data made available to airlines.
  2. Teleconferences – Hourly phone calls between airlines, airports and the FAA to relay critical NOTAM details.
  3. Emergency Reduced Operating Picture (eROP) – Web interface allowing airlines to request specific NOTAM information.

Rolling out these contingency measures took several hours, causing extended uncertainties and delays. But once in place, operations began normalizing.

What will the post-mortem find?

Aviation experts expect the FAA’s final incident report will cite some key vulnerabilities:

  • Antiquated hardware and patchwork legacy systems
  • Inadequate network redundancy and resiliency measures
  • Poor boundary controls and testing procedures
  • Slow crisis response and contingency rollouts
  • Lack of coordination and clarity around operational decisions

Observers also noted communication breakdowns between airlines and the FAA caused additional delays during the crisis. Overall, the NOTAM failure reflects systemic weaknesses in technology management and emergency preparedness.

What improvements were recommended?

In the aftermath, aviation stakeholders proposed several measures to make NOTAM and other FAA systems more robust:

  • Upgrade aging hardware infrastructure to modernize technology
  • Build additional redundancy at both datacenters and hardware levels
  • Implement more sophisticated controls around testing and verification
  • Streamline contingency plans to bypass standard change processes
  • Establish emergency operational mandates to minimize disruption
  • Conduct more joint training between airlines, airports and the FAA

While costly, these steps would help turn outdated systems into the always-on, resilient operations required for modern aviation management.

What long-term actions did the FAA announce?

Facing scrutiny, the FAA announced several initial measures focused on both prevention and recovery:

  • Order a two-week pause in non-essential IT changes to review processes
  • Establish a joint team including DOT, airlines, and airports to oversee IT reviews
  • Expedite ongoing efforts to modernize aviation infrastructure and data sharing
  • Review contingency plans and mature future crisis response capabilities
  • Publish a report on findings and a plan for enhancing aviation system resiliency

While a good start, many still want to see thorough modernization initiatives funded and prioritized following years of information technology neglect. Lasting improvements will require major strategic changes.

What upgrades are planned for NOTAM?

Even before this outage, the FAA had initiated a NOTAM improvement program with several components:

  • NOTAM Search – New web-based search and filtering tools for more user-friendly access.
  • NOTAM to Digital – Transition from a 1990s-era mainframe to a modern digital information architecture.
  • SWIM Connectivity – API integration with the FAA’s System Wide Information Management data network.

These efforts were already overdue prior to this event. The NOTAM failure underscores the urgent need to modernize this critical system. Target timelines must also be accelerated.

NOTAM Search Upgrades

Some enhancements slated for the main NOTAM search application include:

  • Advanced filtering and favorites to ease access to relevant NOTAMs.
  • Geographic selection to highlight local operating conditions.
  • User personalization such as watchlists and custom layouts.
  • Upgraded alerting for changes of interest.
  • Ability to print and export results to improve information sharing.

These features will help reduce information overload and make critical NOTAM data more usable for pilots and operators once deployed. They were originally promised by 2025, and the FAA should assess whether delivery can be accelerated following the recent outage.
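To make the filtering and geographic selection ideas concrete, here is a minimal sketch of how a search tool might narrow a list of NOTAMs by airport, category, and keyword. The `Notam` record, its fields, and the sample entries are invented for illustration and do not reflect the FAA's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notam:
    notam_id: str
    location: str   # airport identifier, e.g. "KATL"
    category: str   # e.g. "runway", "navaid", "taxiway"
    text: str


# Made-up sample records, used only to exercise the filter.
SAMPLE_NOTAMS = [
    Notam("A0001/23", "KATL", "runway", "RWY 09L/27R CLSD"),
    Notam("A0002/23", "KATL", "navaid", "ILS RWY 08L U/S"),
    Notam("A0003/23", "KJFK", "runway", "RWY 04R/22L CLSD"),
]


def filter_notams(notams, locations=None, categories=None, keyword=None):
    """Return the NOTAMs that match every filter that was supplied."""
    wanted_locations = {loc.upper() for loc in locations} if locations else None
    wanted_categories = {cat.lower() for cat in categories} if categories else None
    needle = keyword.lower() if keyword else None

    matches = []
    for notam in notams:
        if wanted_locations is not None and notam.location.upper() not in wanted_locations:
            continue
        if wanted_categories is not None and notam.category.lower() not in wanted_categories:
            continue
        if needle is not None and needle not in notam.text.lower():
            continue
        matches.append(notam)
    return matches
```

For example, `filter_notams(SAMPLE_NOTAMS, locations=["KATL"], categories=["runway"])` keeps only the Atlanta runway entry. Omitted filters are simply skipped, so the same function serves both broad and narrow searches.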

NOTAM Data Modernization

The planned transition from mainframe to digital platforms includes initiatives like:

  • Migrating from a flat-file legacy system to a modern SQL database.
  • Building in API connectivity for distributing NOTAM data.
  • Redundant infrastructure and load balancing for higher availability.
  • Elastic computing capabilities to handle usage spikes.
  • Automated monitoring dashboards with advanced alerting.

Retiring outdated hardware will prevent crises like the recent one caused by overtaxed mainframes. This program is currently under procurement with completion eyed by 2025. The timeline should be revisited after the outage root cause analysis is complete.
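As a rough picture of what a flat-file-to-SQL migration involves, the sketch below loads pipe-delimited records into a SQLite table using Python's standard `sqlite3` module. The file layout, table schema, and sample rows are assumptions made for the example; the actual legacy format is not described in public reporting.

```python
import sqlite3

# Assumed layout: notam_id|location|category|body (one record per line).
SAMPLE_FLAT_FILE = """\
A0001/23|KATL|runway|RWY 09L/27R CLSD
A0002/23|KJFK|navaid|ILS RWY 04R U/S
"""


def migrate(flat_text: str, conn: sqlite3.Connection) -> int:
    """Validate every line first, then insert all rows in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notams ("
        "notam_id TEXT PRIMARY KEY, location TEXT NOT NULL, "
        "category TEXT NOT NULL, body TEXT NOT NULL)"
    )
    rows = []
    for line_no, line in enumerate(flat_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        rows.append(tuple(field.strip() for field in fields))
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO notams VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

Calling `migrate(SAMPLE_FLAT_FILE, sqlite3.connect(":memory:"))` loads both sample rows. Because malformed lines are rejected before any insert runs, a corrupt input file leaves the table unchanged, which is the kind of all-or-nothing behavior a flat file cannot easily provide.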

SWIM System Integration

SWIM, or System Wide Information Management, is the FAA’s next-generation aviation data network. NOTAM capability upgrades in the pipeline include:

  • SWIM publishing to make NOTAM data available across all SWIM endpoints.
  • Machine-to-machine API delivery replacing manual lookup process.
  • Shared airspace restrictions data via the Common Status and Structure data service.
  • Collaborative real-time editing with airports and airlines.

The benefits from API integration will be huge. Many observers believe SWIM should ultimately become the single authoritative source for critical operating conditions data. But the effort remains in scoping and design stages.
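The difference between manual lookup and machine-to-machine delivery can be sketched as a publish/subscribe pattern, where consumers register interest once and receive matching NOTAMs as they are published. The `NotamBus` class below is a toy in-memory model with hypothetical names; it is not SWIM's actual interface or message format.

```python
from collections import defaultdict
from typing import Callable


class NotamBus:
    """Toy in-memory publish/subscribe hub keyed by airport identifier."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, location: str, handler: Callable[[dict], None]) -> None:
        """Register a callback for NOTAMs affecting one location."""
        self._subscribers[location.upper()].append(handler)

    def publish(self, notam: dict) -> int:
        """Deliver a NOTAM to every subscriber of its location; return the delivery count."""
        handlers = self._subscribers.get(notam["location"].upper(), [])
        for handler in handlers:
            handler(notam)
        return len(handlers)
```

An airline system might call `bus.subscribe("KATL", handle_notam)` once at startup and then react to each published update, instead of polling a search page. A production network would add durable queues, authentication, and delivery guarantees that this sketch leaves out.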

Conclusion

In summary, the crippling of the FAA NOTAM system on January 11, 2023, lays bare the vulnerabilities and risks associated with relying upon outdated aviation infrastructure. While human error may have triggered this specific outage, it is clear that only modernization and proactive enhancement of these mission-critical systems will avoid similar failures in the future.

The FAA has committed to accelerating ongoing upgrades. But demonstrable urgency, adequate resourcing and strong leadership will be required to deliver meaningful improvements following years of technical debt accumulation. All aviation stakeholders must unite to help drive this transformation while holding the FAA accountable.

The safety of the traveling public and the stability of air travel demand that NOTAM and supporting capabilities meet the highest standards of availability and reliability. This outage makes it vividly clear that reaching those standards will require significant further investment, oversight and transparency around the FAA’s IT strategy and execution.