GitLab data loss: a production database deleted by hand, six hours of data gone for good, and five backups that didn't work

What happened

The incident began during routine maintenance on GitLab's production databases. Engineers were dealing with replication lag caused by high load and spam-related traffic. To restore synchronization, an engineer needed to clear the secondary database directory before resynchronizing replication.

Working late at night and connected to the wrong terminal, the engineer accidentally ran the deletion command against the primary production database instead of the secondary. By the time the command was stopped, roughly 300GB of live production data had been deleted.

The deletion itself was serious but still recoverable. The real failure appeared during recovery.

GitLab believed the database was protected by several backup systems, including scheduled PostgreSQL dumps, Amazon S3 backups, Azure snapshots, replication systems, and LVM snapshots. During the recovery effort, engineers discovered that almost none of them worked.

The scheduled backups had been failing because the backup job ran an older version of pg_dump against a newer PostgreSQL server, which pg_dump refuses to do. Nobody noticed, and the job left behind files that were effectively empty.

The alert emails warning about those failures were never received because of a separate DMARC email configuration issue.

The Azure disk snapshots were not enabled on the affected servers. The database replica could not be used because replication logs had already been purged. In the end, GitLab recovered only because an engineer happened to have taken a manual LVM snapshot about six hours before the deletion for staging purposes. Everything written after that snapshot was gone.

The company avoided total data loss largely through luck.

The damage

Roughly 5,000 projects, 5,000 comments, and 700 user accounts created during a six-hour period were permanently lost.

GitLab.com experienced major downtime while engineers attempted recovery. Repositories themselves survived, but the database layer around them did not.

The deeper damage was loss of confidence. The company had a backup strategy that appeared robust from the outside, but when recovery mattered most, nearly every major safeguard failed at the same time.

The incident became one of the clearest modern examples of why backups are meaningless if nobody has verified they can actually be restored.

How QA would have prevented this

The deletion command itself was not the most important QA failure. The recovery path was.

First, backups needed restore testing. A backup system should never be considered healthy just because a job reports "success." A scheduled restore drill, restoring data into a clean environment and verifying integrity, would likely have exposed the broken backup chain long before the incident.
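
A restore drill of this kind can be sketched in a few lines. The example below is a minimal illustration, not GitLab's tooling: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and the names restore_drill, row_counts, and the projects table are hypothetical. The shape is what matters: restore into a clean database, then check the contents rather than trusting the job's exit status.

```python
import sqlite3


def row_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables}


def restore_drill(dump_sql: str, expected_min_rows: dict) -> list:
    """Restore a SQL dump into a clean database and list every problem found."""
    if not dump_sql.strip():
        return ["backup is empty"]
    clean = sqlite3.connect(":memory:")  # the "clean environment" (assumption: SQLite stand-in)
    try:
        clean.executescript(dump_sql)
        counts = row_counts(clean)
    except sqlite3.Error as exc:
        return [f"restore failed: {exc}"]
    finally:
        clean.close()
    problems = []
    for table, minimum in expected_min_rows.items():
        if table not in counts:
            problems.append(f"table {table} missing after restore")
        elif counts[table] < minimum:
            problems.append(
                f"table {table} has {counts[table]} rows, expected at least {minimum}")
    return problems


# Demo: dump a tiny source database, then prove the dump actually restores.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO projects (name) VALUES (?)", [("alpha",), ("beta",)])
source.commit()
dump = "\n".join(source.iterdump())

print(restore_drill(dump, {"projects": 2}))  # prints []
print(restore_drill("", {"projects": 2}))    # prints ['backup is empty']
```

An empty list means the drill passed. Anything else is a finding that should page someone, which is the check a near-empty dump file would have failed on day one.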

Second, alerting systems needed verification too. The backup warnings existed, but nobody received them because the email delivery path itself was broken. Monitoring systems are also systems under test. A deliberate failure should trigger an alert that a real human confirms receiving.
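
One way to test the alert path end to end is a canary: send a deliberate test alert carrying a unique token, and treat the path as broken until a human replies with that token. The sketch below is illustrative only. AlertCanary and its methods are hypothetical names, and the send callable stands in for whatever delivery channel is really in use (email, chat, pager).

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class AlertCanary:
    """Tracks test alerts until a human confirms receipt (hypothetical helper)."""
    max_wait: timedelta
    pending: dict = field(default_factory=dict)  # token -> time the alert was sent

    def fire(self, send, now: datetime) -> str:
        """Send a test alert through the real delivery path and remember it."""
        token = secrets.token_hex(4)
        send(f"TEST ALERT {token}: reply with this token to confirm delivery")
        self.pending[token] = now
        return token

    def acknowledge(self, token: str) -> bool:
        """Record a human confirmation. Returns False for unknown tokens."""
        return self.pending.pop(token, None) is not None

    def overdue(self, now: datetime) -> list:
        """Tokens nobody confirmed in time: evidence the alert path is broken."""
        return [t for t, sent in self.pending.items() if now - sent > self.max_wait]


# Demo with a list standing in for the mail system.
outbox = []
canary = AlertCanary(max_wait=timedelta(hours=1))
start = datetime(2024, 1, 1, 9, 0)
token = canary.fire(outbox.append, now=start)

still_waiting = canary.overdue(now=start + timedelta(hours=2))  # contains the token
canary.acknowledge(token)
after_ack = canary.overdue(now=start + timedelta(hours=2))      # now empty
```

The design choice is that silence counts as failure. An alert that was sent but never confirmed is reported as overdue, which is exactly the state the rejected backup emails were in.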

Third, destructive operations needed stronger production safeguards. A tired engineer on the wrong terminal is not a rare edge case. Production systems should have visible environment separation, confirmation steps before destructive commands, restricted permissions, and guard rails around irreversible actions.
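
A confirmation step can be as small as making the operator type the name of the host they believe they are on. The following is a minimal sketch under stated assumptions: check_destructive_action, UnsafeTargetError, and the db-primary.example and db-secondary.example hostnames are all hypothetical, and a real wrapper would read the current host from the machine itself (for instance with socket.gethostname()) and the confirmation from an interactive prompt.

```python
class UnsafeTargetError(RuntimeError):
    """Raised when a destructive action is about to run on the wrong host."""


def check_destructive_action(intended_host: str, current_host: str,
                             typed_confirmation: str) -> None:
    """Refuse to continue unless the host and the operator's intent both match."""
    if current_host != intended_host:
        raise UnsafeTargetError(
            f"connected to {current_host!r}, but the runbook targets {intended_host!r}")
    if typed_confirmation != current_host:
        raise UnsafeTargetError("typed confirmation does not match this host's name")


# The runbook says to wipe the secondary. The operator is on the primary.
try:
    check_destructive_action(
        intended_host="db-secondary.example",
        current_host="db-primary.example",
        typed_confirmation="db-secondary.example",
    )
    blocked = False
except UnsafeTargetError:
    blocked = True  # the deletion never starts
```

This does not replace restricted permissions. It only turns a silent mistake into a loud one at the moment it is cheapest to catch.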

Fourth, disaster recovery needed rehearsal. Recovery procedures that are only executed during a real emergency are effectively untested. Recovery drills expose weaknesses before they become company-level incidents.
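
Rehearsal only helps if it keeps happening, so it is worth checking mechanically. The sketch below assumes a team records the date and duration of each drill. The function name rehearsal_problems, the 90-day limit, and the 240-minute recovery target are illustrative assumptions, not figures from the incident.

```python
from datetime import date
from typing import Optional


def rehearsal_problems(last_drill: Optional[date], drill_minutes: Optional[float],
                       today: date, max_age_days: int = 90,
                       target_minutes: float = 240.0) -> list:
    """List reasons the recovery process should be treated as untested."""
    if last_drill is None:
        return ["no recovery drill on record"]
    problems = []
    age_days = (today - last_drill).days
    if age_days > max_age_days:
        problems.append(f"last drill was {age_days} days ago (limit {max_age_days})")
    if drill_minutes is not None and drill_minutes > target_minutes:
        problems.append(
            f"last drill took {drill_minutes:g} minutes (target {target_minutes:g})")
    return problems


print(rehearsal_problems(None, None, today=date(2024, 1, 1)))
# prints ['no recovery drill on record']
```

Run on a schedule, a check like this makes a stale or never-run drill show up as a finding.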

Most teams already have backups. The real question is whether anyone has restored from them recently.

The backup that fails you will usually fail silently. You discover it only at the exact moment you can least afford to.

Before you trust your recovery systems, the important QA questions are simple:

Has anyone restored from this backup successfully? Was the restored data complete? Do alerts actually reach someone? Can destructive commands reach production too easily? Has the recovery process been rehearsed under pressure?

A focused QA pass treats the recovery path as seriously as the feature path, because recovery is often the system you eventually bet the company on.

A QA Audit catches patterns like this before they ship.