Мой личный склад идей ("My Personal Warehouse of Ideas")

#133 · Published: 2026-01-22 04:20 UTC

Original post

The Most Expensive Mistake I Have Witnessed as a CTO

Once, I had to deal with the aftermath of a mistake that cost the business a lot of money; analyzing and recovering from its consequences took about a year. Today I will share the insights I gained from this situation, which happened over 10 years ago.

Murphy's Law: "Anything that can go wrong will go wrong"

Have you ever experienced a series of unrelated events, each of which has no critical consequences on its own and a relatively low probability, yet for some reason they start occurring simultaneously, as if in collusion, eventually turning into an avalanche of trouble? Such an avalanche of "coincidences" once led to a situation in a company with tens of thousands of users where, on one sunny day, the last 3 months of the database were lost. All new users, user data, payments, and other information from that period were gone irretrievably.

What happened

For obvious reasons, I cannot disclose all the details, but in general: the server hosting the service lost power, and after it was powered back on, the database installed on it failed to start. After unsuccessful attempts to repair the database, a decision was made to restore it from a recent backup. The database was restored, the service was back online, and... it turned out that the "current" backup did not contain data from the last 3 months.

How it happened

At the time, the backup system was configured as follows: the main database was mirrored in real time to a standby database, and daily backups were taken from the standby. Over exactly those 3 months, the mirroring mechanism to the standby had failed. Meanwhile, the monitoring system reported "everything is OK, the synchronization mechanism is working," but in reality new data had stopped arriving, and the standby "hung" in one state. Consequently, all daily backups taken from it no longer contained the latest data.
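The failure mode described here is that monitoring reported the synchronization process as alive while no new data was actually arriving. A check that asks "how old is the newest replicated data?" rather than "is the process running?" catches this. A minimal sketch, assuming a heartbeat timestamp that the primary writes regularly and the standby receives via replication (the threshold value is an illustrative assumption, not from the original post):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: instead of asking "is the sync process
# running?", compare the newest replicated timestamp against the clock.
# In practice, last_replicated_at would come from a heartbeat row that
# the primary updates every minute and the standby receives via replication.

MAX_LAG = timedelta(minutes=15)  # assumed alert threshold

def replication_is_fresh(last_replicated_at, now=None, max_lag=MAX_LAG):
    """Return True if the standby has received data recently enough."""
    now = now or datetime.now(timezone.utc)
    return (now - last_replicated_at) <= max_lag
```

With a check like this, a standby that "hangs" in one state starts failing its health check within minutes instead of silently going stale for months.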
Consequences

Naturally, this was a significant reputational loss for the business: many users lost all their data for that period. A great deal of effort and time went into communicating with users and trying to recover at least some of the data. It took about a year to recover from the incident: to restore user trust, stabilize the product, and bring the business back to normal operation.

Lessons Learned

When I took on the task, my main goal was to prevent such a situation from ever happening again. The backup system was completely redesigned: from verifying that backups are actually taken from current data, to ensuring that restored backups contain up-to-date information. Of course, we also considered scenarios in which the data center storing the backups is completely destroyed (here are examples of such incidents: https://habr.com/ru/news/546264/ and https://habr.com/ru/articles/954512/).

Key Takeaways

The importance of backups cannot be overstated; no wonder there is a saying that people fall into three categories: those who don't make backups yet, those who already do, and those who verify their backups. But this mistake was not purely technical; it was managerial. The problem was not the lack of backups, but that no one asked the question: what happens if one element of this system stops working unnoticed? If even one event in this chain had been anticipated, the consequences could have been minimized or avoided altogether.
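The "verify your backups" lesson can be made concrete: periodically restore a backup into a scratch environment and assert that the newest record in the restored copy is close to the time the backup was taken. A minimal sketch of that check, where the timestamps and the tolerance are illustrative assumptions (in reality the newest-record time would be queried from the restored database):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical restore verification: after restoring a backup into a
# throwaway environment, take the timestamp of the newest record and
# require it to be within a tolerance of when the backup was created.
# A backup built from a stale standby fails this check immediately.

TOLERANCE = timedelta(hours=26)  # assumed: daily backups, plus slack

def backup_is_current(newest_record_at, backup_created_at,
                      tolerance=TOLERANCE):
    """Return True if the restored data is fresh enough for its backup."""
    return (backup_created_at - newest_record_at) <= tolerance
```

Run against the incident described above, this check would have flagged the very first daily backup taken after the mirroring broke, rather than three months later.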

Summary

The article recounts a costly mistake experienced by a CTO over a decade ago, highlighting the importance of reliable backup systems and proactive management. The incident involved a company with tens of thousands of users, where a server power failure led to database corruption and data loss spanning three months. The backup system, designed to mirror data in real-time and perform daily backups, failed silently due to a synchronization issue, resulting in backups that did not contain the latest data. When the database failed to restart after a server outage, restoring from the flawed backup caused significant reputational damage and data loss for users. The incident underscored that technical backups alone are insufficient; managerial oversight and system verification are crucial. The CTO responded by redesigning the backup strategy, emphasizing verification of backup integrity and up-to-date data recovery. The story emphasizes the critical importance of comprehensive backup procedures, regular verification, and anticipating potential system failures to prevent catastrophic data loss. It also highlights that technical failures often have managerial roots, and proactive planning can mitigate severe consequences, ensuring business continuity and customer trust.

Keywords

data backup best practices, database recovery strategies, IT disaster prevention, backup system verification, preventing data loss, business continuity planning, database failover solutions, system redundancy and resilience, IT management lessons, backup failure prevention, real-time data mirroring, database backup mistakes
