GitLab downtime

GitLab.com, a web service for hosting and syncing source code, similar to GitHub, went down last night at around 18:00 ET on January 31, and 11 hours later, at the time of publishing, the website is still down.

"We accidentally deleted production data and might have to restore from backup," the GitLab team tweeted an hour after the incident started.

Tired admin deleted the wrong folder

According to tech news site The Register, a tired GitLab admin working late in the Netherlands might have accidentally deleted the wrong folder during a planned maintenance operation. The company has not confirmed this exact scenario, but has admitted that someone deleted something they shouldn't have.

It's believed that GitLab lost 295.5 GB of customer data during this snafu. The company immediately started recovery operations from backups.

"The incident affected the database (including issues and merge requests) but not the git repo's (repositories and wikis)," GitLab later tweeted. "Data transfer has been slow."

Recovery operations didn't go as planned, are very slow

In a Google Docs file that GitLab staff have been using to keep track of their recovery efforts, the company detailed a grim scenario.

"So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place," the document reads.

GitLab staff also detailed some of the other problems they have encountered while trying to recover the lost data from backups; a short sketch illustrating the silent pg_dump failure described in item 3 follows the list.

  1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
  2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
  3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
  4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
  5. The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
  6. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
  7. SH: We learned later that the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, pruning the replication configuration, and starting up a separate PostgreSQL server.
  8. Our backups to S3 apparently don’t work either: the bucket is empty
  9. We don't have solid alerting/paging for when backups fail; we are seeing this on the dev host too now.
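
Item 3 describes a particularly insidious failure mode: the backup job silently ran pg_dump from the wrong PostgreSQL version because binary selection keyed off a data/PG_VERSION file that did not exist on the worker hosts, and nothing verified that the resulting dump was usable. The sketch below illustrates the kind of sanity checks that would surface this; it is not GitLab's actual tooling, and the paths, database name, expected version, and size threshold are assumptions made for the example.

  #!/usr/bin/env python3
  # Hypothetical sanity checks around a pg_dump-based backup job.
  # This is a sketch, not GitLab's tooling: the paths, database name,
  # expected server version, and minimum dump size are assumptions.
  import os
  import subprocess
  import sys

  DATA_DIR = "/var/opt/gitlab/postgresql/data"  # assumed data directory
  EXPECTED_VERSION = "9.6"                      # version the dump must come from
  MIN_DUMP_BYTES = 10 * 1024 * 1024             # a real dump should not be "a few bytes"

  def check_pg_version():
      # Item 3: binaries silently defaulted to 9.2 when data/PG_VERSION was
      # missing, so refuse to run at all if the marker file is absent or wrong.
      version_file = os.path.join(DATA_DIR, "PG_VERSION")
      if not os.path.exists(version_file):
          sys.exit(f"ABORT: {version_file} is missing; cannot trust binary selection")
      with open(version_file) as fh:
          found = fh.read().strip()
      if found != EXPECTED_VERSION:
          sys.exit(f"ABORT: data dir is PostgreSQL {found}, expected {EXPECTED_VERSION}")

  def run_dump(out_path):
      # check=True makes a failing pg_dump raise an error instead of being ignored.
      subprocess.run(["pg_dump", "-Fc", "-f", out_path, "gitlabhq_production"], check=True)

  def check_dump_size(out_path):
      # Item 2: backup files were "only a few bytes in size"; treat that as a failure.
      size = os.path.getsize(out_path)
      if size < MIN_DUMP_BYTES:
          sys.exit(f"ABORT: dump is only {size} bytes; backup considered failed")

  if __name__ == "__main__":
      target = "/var/backups/gitlabhq_production.dump"
      check_pg_version()
      run_dump(target)
      check_dump_size(target)
      print("backup completed and passed basic sanity checks")

Exiting with a non-zero status on any of these checks also gives a scheduler or paging system something concrete to alert on, which is the gap described in item 9.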

In a tweet posted two hours before this article was published, and roughly eight hours after recovery operations began, GitLab said the database restore was only 42% complete.

At this point, the GitLab downtime looks set to extend well into the following day, February 1.

UPDATE: The GitLab outage has been resolved. A GitLab spokesperson provided the following statement regarding the events of the past day.

On Tuesday, GitLab experienced an outage for one of its products, the online service GitLab.com. This outage did not affect our Enterprise customers or the wide majority of our users. As part of our ongoing recovery efforts, we are actively investigating a potential data loss. If confirmed, this data loss would affect less than 1% of our user base, and specifically peripheral metadata that was written during a 6-hour window. We have been working around the clock to resume service on the affected product, and set up long-term measures to prevent this from happening again. We will continue to keep our community updated through Twitter, our blog and other channels.
