GitLab.com, a web service for hosting and syncing source code similar to GitHub, went down last night at around 18:00 ET on January 31, and at the time of publishing, 11 hours later, the website is still down.
"We accidentally deleted production data and might have to restore from backup," the GitLab team tweeted an hour after the incident started.
Tired admin deleted the wrong folder
According to tech news site The Register, a tired GitLab admin working late in the Netherlands might have accidentally deleted the wrong folder during a planned maintenance operation. The company never confirmed this exact scenario, but admitted that someone had deleted something they shouldn't have.
It's believed that GitLab lost 295.5 GB of customer data during this snafu. The company immediately started recovery operations from backups.
"The incident affected the database (including issues and merge requests) but not the git repo's (repositories and wikis)," GitLab later tweeted. "Data transfer has been slow."
Recovery operations didn't go as planned, are very slow
In a Google Docs file that GitLab staff had been using to keep track of recovery operations, the company detailed a grim scenario.
"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place," the document reads.
GitLab staff also detailed some of the other problems they encountered while trying to recover the lost data from backups; the sketches after the list illustrate the kind of checks that could have caught several of these failures.
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage.
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost.
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented.
- SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.
- Our backups to S3 apparently don’t work either: the bucket is empty.
- We don’t have solid alerting/paging for when backups fail; we are seeing this in the dev host too now.
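Several of the failures in the list come down to backup jobs that died silently: pg_dump was run with 9.2 binaries against a 9.6 cluster because data/PG_VERSION was missing on the workers, the resulting dumps were only a few bytes in size, and nothing paged anyone about it. What follows is a minimal sketch of the kind of pre- and post-dump sanity check that would have surfaced both problems; the paths, the size threshold, and the alert() helper are hypothetical placeholders, not GitLab's actual setup.

#!/usr/bin/env python3
"""Sketch of a pre- and post-dump sanity check for a PostgreSQL backup job.
All paths, the size threshold, and the alert() helper are hypothetical
placeholders, not GitLab's actual configuration."""

import subprocess
import sys
from pathlib import Path

DATA_DIR = Path("/var/opt/gitlab/postgresql/data")  # hypothetical cluster data directory
DUMP_FILE = Path("/var/backups/gitlab/db.sql.gz")   # hypothetical dump target
MIN_DUMP_BYTES = 10 * 1024 * 1024                   # a dump of only a few bytes means it failed


def alert(message: str) -> None:
    """Placeholder: wire this up to paging (PagerDuty, Slack, email, ...)."""
    print(f"ALERT: {message}", file=sys.stderr)


def cluster_version() -> str:
    """Major version of the on-disk cluster, e.g. '9.6', read from PG_VERSION."""
    return (DATA_DIR / "PG_VERSION").read_text().strip()


def pg_dump_version() -> str:
    """Major version of the pg_dump binary on PATH, e.g. '9.6'."""
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True)
    # Output looks like: "pg_dump (PostgreSQL) 9.6.1"
    return ".".join(out.stdout.split()[-1].split(".")[:2])


def main() -> int:
    if not (DATA_DIR / "PG_VERSION").exists():
        alert(f"{DATA_DIR}/PG_VERSION is missing; refusing to guess the cluster version")
        return 1

    want, have = cluster_version(), pg_dump_version()
    if want != have:
        alert(f"pg_dump is {have} but the cluster is {want}; the dump would fail or be unusable")
        return 1

    # ... the actual pg_dump invocation would run here ...

    if not DUMP_FILE.exists() or DUMP_FILE.stat().st_size < MIN_DUMP_BYTES:
        size = DUMP_FILE.stat().st_size if DUMP_FILE.exists() else 0
        alert(f"dump file {DUMP_FILE} is only {size} bytes; the backup almost certainly failed")
        return 1

    return 0


if __name__ == "__main__":
    sys.exit(main())

Run from cron around the nightly dump, a non-zero exit or an alert from a script like this is exactly the paging the GitLab doc says was missing.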
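The doc also notes that the S3 bucket meant to hold backups was simply empty, and that nothing alerts when a backup fails. Below is a minimal sketch of a check that backups are actually landing in S3, are recent, and are of plausible size; the bucket name, prefix, thresholds, and alert() helper are again hypothetical placeholders.

#!/usr/bin/env python3
"""Sketch of a check that backups are actually landing in S3, are recent,
and are of plausible size. Bucket name, prefix, thresholds, and alert()
are hypothetical placeholders."""

from datetime import datetime, timedelta, timezone
from typing import Optional

import boto3

BUCKET = "example-db-backups"   # hypothetical bucket name
PREFIX = "postgres/"            # hypothetical key prefix for database dumps
MAX_AGE = timedelta(hours=26)   # a daily backup should never be much older than a day
MIN_SIZE = 10 * 1024 * 1024     # reject the "only a few bytes in size" failure mode


def alert(message: str) -> None:
    """Placeholder: hook into whatever paging system is in use."""
    print(f"ALERT: {message}")


def latest_backup(s3) -> Optional[dict]:
    """Return the most recently modified object under PREFIX, or None if none exist."""
    newest = None
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if newest is None or obj["LastModified"] > newest["LastModified"]:
                newest = obj
    return newest


def main() -> None:
    s3 = boto3.client("s3")
    newest = latest_backup(s3)

    if newest is None:
        alert(f"s3://{BUCKET}/{PREFIX} is empty -- no backups have been uploaded at all")
        return

    age = datetime.now(timezone.utc) - newest["LastModified"]
    if age > MAX_AGE:
        alert(f"newest backup {newest['Key']} is {age} old; the upload job appears to have stopped")
    elif newest["Size"] < MIN_SIZE:
        alert(f"newest backup {newest['Key']} is only {newest['Size']} bytes; it is almost certainly truncated")


if __name__ == "__main__":
    main()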
In a tweet posted two hours before this article, and eight hours after recovery operations began, GitLab said the database restore was only 42% complete.
At this point, the GitLab downtime looks set to extend well into the following day, February 1.
UPDATE: The GitLab outage has been resolved. A GitLab spokesperson has provided the following statement regarding the events of the past day.