Our facility is a multispecialty group practice with fully integrated electronic health record (EHR), laboratory, radiology, practice management, administrative, and file servers hosted onsite.
On December 26, 2022, at about 4 a.m., the head of maintenance was alerted by the fire alarm monitoring company that the power to the building was out. Investigation revealed that water from a leaking sink one floor above the main electric breaker room had seeped into the breaker circuit board, sparking an electrical fire and causing an immediate loss of power. The fire triggered the smoke detectors, which in turn alerted the alarm monitoring company.
Within the hour, the information technology (IT) team was in the building to shut down the main and backup servers in an orderly fashion. The servers, however, had already gone down: the backup batteries had been exhausted, resulting in a catastrophic, uncontrolled shutdown of all EHR servers.
The power was restored later that day, and the EHR server was restarted. The all-clear was given for the clinical staff to start using the EHR. Over the next 48 hours, the EHR began generating error messages and eventually had to be shut down. It would not be fully operational for another 48 hours, causing a major disruption in operations. Eventually, it was determined that the EHR formulary database had been corrupted by the catastrophic shutdown of the server.
As part of the database checks related to the server disruption, the administrative file server was found to be corrupted, unrelated to the EHR formulary database issues. The data restoration process also resulted in data loss that crippled administrative operations for more than a week.
ROOT CAUSE ANALYSIS AND FINDINGS
The root cause analysis (RCA) examined the administrative protocols, physical security, and technical security measures in place at the time of the incident. The primary quality improvement (QI) tools used were the fishbone (cause-and-effect) diagram (Figure 1) and Failure Mode and Effects Analysis (FMEA) (Figure 2).
Figure 1. Cause and Effect Analysis for EHR Crash
Figure 2. FMEA for Server Security
Administrative policies and software for protecting the servers from unauthorized breaches were being followed at the time of the incident, and no lapses were noted during the RCA. A retracing of events, however, indicated that the physical safeguards had failed.
The main power supply was located one floor directly below a sink drain. A water leak from the drain dripped onto the electric breaker box, which in turn started an electrical fire. This chain of events led to a sudden power loss to the building and the server room. Smoke from the fire then set off an alarm, which automatically alerted the fire department. The fire department contacted the building maintenance department to gain access.
The server had adequate uninterruptible power supply (UPS) backup from two separate battery packs; one pack lasted 20 minutes and the other 45 minutes. The UPS was an older model whose only alert feature was an email notification of power failure; it could not send text messages or audible alerts. Although an email was sent at 4 a.m., it went unread until the fire service notified the maintenance team, which in turn notified IT of the power loss.
The IT team responded promptly once notified, but the delay in reaching the building meant that the backup power had already been exhausted by the time they arrived; the servers shut down in an uncontrolled manner, corrupting the database.
Additional alert features in the server room included two Sensaphone (sensaphone.com) devices that sense water, heat, and power loss. When triggered by any of these events, they alert an IT staff member. The devices had a battery backup for power loss, but unfortunately, the batteries had not been inspected or changed for several years, and on the day of the power loss, the systems failed because the batteries were dead.
The IT staff restarted the servers that morning and noted no problems at the time. Over the next 48 hours, clinical staff continued to document in the EHR, which still appeared to be working, but they began to receive error messages from the EHR module, and eventually most EHR functions ceased. Although the IT team investigated these early warnings of a corrupted database, they did not immediately connect them to the server crash.
The backup protocols for the server were also determined to be adequate. A secondary server mirrors the primary server, and hourly incremental backups are supplemented by daily, weekly, and monthly backups, all replicated to cloud storage. The threat of data loss was therefore small; the larger problem was loss of access to data while the EHR was unavailable.
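The layered rotation described here can be sketched as a small scheduling rule. This is a minimal illustration of the hourly/daily/weekly/monthly tiers; the tier names and cutoffs are hypothetical, not our actual backup tooling.

```python
from datetime import datetime

def backup_tier(now: datetime) -> str:
    """Pick the backup tier for a given time, mirroring an
    hourly/daily/weekly/monthly rotation (names are illustrative)."""
    if now.day == 1 and now.hour == 0:
        return "monthly-full"
    if now.weekday() == 6 and now.hour == 0:  # Sunday midnight
        return "weekly-full"
    if now.hour == 0:
        return "daily-full"
    return "hourly-incremental"

# Each tier's output would then be replicated to cloud storage,
# so a single-site failure cannot destroy every copy.
```

The key design point is that every tier is copied offsite; onsite backups alone would have been exposed to the same fire and water threats as the servers themselves.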
In the days after the file server crash, the administrative staff continued storing files on the file server as usual. When the server was first restarted, the IT team restored it from the last full backup but inadvertently failed to re-enable the daily incremental backups, so files saved after the crash were never backed up. A week later, a second restore was performed, and because no incremental backups existed for that week, all work done since the first restore was lost. It took more than a week for an outside firm to retrieve and restore the lost data; during that time, the file server was unavailable, and some critical functions in the organization were curtailed.
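The failure mode here is the ordering of a restore: state must be rebuilt from the last full backup plus every subsequent incremental, applied in chronological order. A minimal sketch, with hypothetical data shapes standing in for real backup files:

```python
def restore(full_backup: dict, incrementals: list) -> dict:
    """Rebuild state from the last full backup, then apply every
    incremental in chronological order. If the incrementals are
    missing (as in the second restore described above), the server
    silently reverts to the moment of the full backup and all
    later work is discarded."""
    state = dict(full_backup)
    for inc in sorted(incrementals, key=lambda i: i["time"]):
        state.update(inc["changes"])
    return state
```

Testing a restore end to end, including verifying that incremental backups are actually being produced afterward, would have caught the gap before the second restore made it permanent.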
This situation was attributable primarily to a failure of the physical security of the EHR database servers and file servers. Physical security protects the hardware from fire, water, natural disasters, electric power disruptions, loss of internet connectivity, and physical intrusions, whether intentional or unintentional. Technical security comprises firewalls, role-based access control, virtual private networks (VPNs), and threat detection software that detects intrusions and prevents loss of data at rest, in motion, or in use. Administrative protocols govern backup timing, downtime procedures, and regulatory compliance.
If the servers are hosted onsite, the organization is responsible for providing these protections. For organizations unable to provide them, or wishing to transfer the risk, a cloud-based provider is an option; it was the option chosen after this incident.
Our root cause analysis found that a proper Failure Mode and Effects Analysis (FMEA) of physical threats had not been performed when the server room was designed. FMEA is a quality improvement (QI) tool for risk analysis.(1) Additional information on QI tools is available on the CMS website.(2,7) Most of the planning relied on technical safeguards, and because the building had been designed before modern IT systems were installed, some physical threats were dismissed on the assumption that the early warning systems described above would suffice.
The FMEA predicted a higher-than-desired probability of recurrence and of negative outcomes under the existing arrangement (Tables 1–4). Based on this evidence, the decision was made to move the EHR to the cloud, although onsite servers were still needed for administrative functions and other information systems.
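The core arithmetic of an FMEA is the Risk Priority Number (RPN): severity × occurrence × detection, each scored 1–10, with the highest products prioritized for mitigation. The scores below are illustrative only, not the actual values from Tables 1–4:

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number used in FMEA: each factor is scored
    1-10; higher products are prioritized for mitigation."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("FMEA scores must be between 1 and 10")
    return severity * occurrence * detection

# Illustrative scores for failure modes from this incident:
failure_modes = {
    "water leak over breaker room": rpn(9, 4, 8),
    "dead Sensaphone batteries":    rpn(7, 6, 9),
    "UPS alert sent by email only": rpn(6, 5, 7),
}
worst = max(failure_modes, key=failure_modes.get)
```

Note that a poorly detectable failure (such as dead monitoring batteries) can outrank a more severe one, which is precisely why FMEA surfaces the "little things" discussed below.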
The FMEA also indicated that the remaining servers should be relocated, which has since been done with proper attention to both technical and physical threats. Plans were made to install a dedicated backup generator for the servers and to waterproof the main electric breaker room. The analysis further indicated the need for reliable internet connectivity; a single line running into the building was considered too risky, and a separate second line was recommended.
The old UPS battery backup sent emails only, not text messages. An email is unlikely to be read at 4 a.m., whereas an audible alert will get attention; newer units send both text and audible alerts. We needed to keep up with technology changes and perform more frequent process improvement checks on our physical safeguards, all of which are easily forgotten amid the rapid technological change of modern IT systems.
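The general pattern is alert escalation: a quiet channel first, then noisy channels if nobody acknowledges within a grace period. A minimal sketch, assuming a hypothetical 10-minute grace period and illustrative channel names:

```python
from datetime import datetime, timedelta

def escalate(sent_email_at: datetime, acknowledged: bool,
             now: datetime,
             grace: timedelta = timedelta(minutes=10)) -> str:
    """Decide the next alert channel. An email alone can sit
    unread at 4 a.m.; if it is not acknowledged within the grace
    period, the alert escalates to channels that make noise."""
    if acknowledged:
        return "none"
    if now - sent_email_at < grace:
        return "email-only"
    return "sms-and-audible"
```

Had the 4 a.m. UPS email escalated this way, the IT team could have reached the building before the backup batteries were exhausted.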
In our case, the problem did not originate in the server room; it began with an inconsequential drain above a main electric breaker. This underscores the importance of QI tools when planning mission-critical systems. Nor should the simple things be forgotten, like the battery in a Sensaphone; a rigorous FMEA can keep us alert to the small things we overlook.
We also learned that after a catastrophic shutdown, rather than restarting systems as soon as power returns, best practice is to run a database scan first to confirm that database integrity has not been compromised. The initial error messages should have been a red flag. Fortunately, the corruption affected only the formulary database; had it affected the electronic medical record itself, a reset and restoration from backup would have been required.
The clinical staff continued to document for two days after the incident. Had the medical record database been corrupted and a restore become necessary, the corrupted data would have been overwritten by the backup taken just before the catastrophic shutdown, and nothing documented after that point would have survived; the organization would have lost two days of valuable clinical documentation.
The same lessons applied to the file server. An unplanned power loss can damage files and hardware, and because servers are handling large volumes of data at the moment power is cut, the resulting corruption can be massive. Simply resuming server activity is therefore not prudent until a database scan has ruled out such issues; in the event of data loss or corruption, a backup restore is the remedy.
In the case of our institution, at the inception of our EHR, we already had an extensive network of servers and a well-established IT department. When we initially compared the cost of onsite hosting with cloud computing, onsite hosting was cheaper, so that is what we chose. What this EHR crash showed us is that losing IT services can be very costly: a large organization might lose significant revenue every day it is shut down, and there are risks to patient care when staff cannot access medical records. Our experience revealed the hidden cost of local hosting when an organization has not adequately prepared for the eventuality of an EHR crash. We decided that the cloud-based option was the more reliable one.
Organizations choosing to host onsite must bear in mind that physical servers need maintenance and eventual replacement, and the cost savings of hosting your own IT infrastructure may be modest once the potential losses from an event such as ours are factored in.
The advantages cloud computing offers over onsite hosting include low cost, flexible operational expense, speed, agility, flexibility, elasticity, mobility (access from anywhere with an internet connection), and vendor IT support that reduces in-house personnel needs.(3)
A disadvantage of cloud computing compared with onsite hosting is security. Because of the distributed architecture of cloud computing and the easy, essentially unlimited access granted to subscribers, malicious actors can gain seemingly legitimate access and attack the system. Digital forensics also becomes more complicated in cloud-based systems. Cloud computing raises concerns about the jurisdiction in which data are stored, and there can be a lack of transparency from the host about its operations and how data are kept and controlled.
Another concern is data deduplication to save storage space, which creates the potential for data loss. In addition, a loss of internet connectivity renders the cloud service unreachable.(4)
It is important to note that the U.S. Department of Health and Human Services (HHS) requires a business associate agreement with your cloud service provider.(5)
When planning for onsite hosting and even after installation, use approved QI methods to mitigate both physical and technological threats, incorporating methods such as FMEA and Perrow’s Normal Accident Theory.(6)
Backup and restoration should be tested rigorously. As this story shows, a simple power outage can lead to a domino effect of unintended consequences.
In the event of an uncontrolled shutdown of your servers, be deliberate about resuming normal operations. Test the integrity of the database, and take any warning messages seriously until more critical database faults have been excluded. Do not start saving data until you are sure it is not going into a corrupt database.
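An integrity gate of this kind can be a one-function pre-flight check. SQLite is used below purely to illustrate; an enterprise EHR database would use its own engine's equivalent scan (for example, `DBCC CHECKDB` in SQL Server), and the function name is hypothetical.

```python
import sqlite3

def safe_to_resume(db_path: str) -> bool:
    """Run the engine's integrity scan before allowing writes
    after an uncontrolled shutdown. SQLite's PRAGMA
    integrity_check returns a single row 'ok' when the file
    is intact; anything else means restore before resuming."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```

Gating the all-clear on a check like this, rather than on the server merely starting, would have caught the formulary corruption before 48 hours of documentation were layered on top of it.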
These are lessons from an outpatient facility but can certainly be generalized to any setting, and we hope that by sharing our experience, others can mitigate the threats to their IT infrastructure.
Liu H-C, Chen X-Q, Duan C-Y, Wang Y-M. Failure Mode and Effect Analysis Using Multi-Criteria Decision Making Methods: A Systematic Literature Review. Computers & Industrial Engineering. 2019;135:881–897.
Centers for Medicare and Medicaid Services. How to Use the Fishbone Tool for Root Cause Analysis. Accessed May 27, 2023.
Pichan A, Lazarescu M, Soh ST. Cloud Forensics: Technical Challenges, Solutions and Comparative Analysis. Digital Investigation. 2015;13:38–57.
Tabrizchi H, Kuchaki Rafsanjani M. A Survey on Security Challenges in Cloud Computing: Issues, Threats, and Solutions. The Journal of Supercomputing. 2020; 76(12):9493–9532.
U.S. Department of Health and Human Services. Guidance on HIPAA & Cloud Computing. HHS.gov. Published February 2, 2023. Accessed April 22, 2023.
Perrow C. Normal Accidents: Living with High Risk Technologies — Updated Edition. Princeton NJ: Princeton University Press; 2011.
Centers for Medicare and Medicaid Services. Guidance for Performing Failure Mode and Effects Analysis with Performance Improvement Projects. Accessed April 23, 2023.